NSA's text search algorithm

17 Dec 2003

      "Ian Farquhar" <ianf@sydney.sgi.com>:
...
I always imagined that the development of [NSA's text scanning]
algorithm itself predated email, and started back with cable and 
telex traffic.
Statistical text scanning is ancient, but has probably not been used at the
scale and with the efficiency that the NSA would require for net traffic.
...
Earlier this year, the agency began soliciting collaborations from
business to develop commercial applications of their technique.
Has anyone got any further information about how this algorithm works?
It sounds like Rishab has somewhat better info than was publicly
available months ago when we last discussed this particular NSA
"technology transfer".
Actually my 'info' about NSA's thing was mainly deduction, put together
with some (limited) specs on Architext (http://www.atext.com,
graham@atext.com). If you read NSA's note carefully, you can easily rule out
NLP ("independent of...language") and sophisticated neural nets ("very fast").
The Economist story I mentioned in my last post (on the fact that I beat
them to the story!) goes into some detail on BT's and Cornell's programs that
summarize textual matter. These are apparently successful (included is a
pretty good example of a computer-generated summary of the article), but
also quite different from NSA's.

BT uses basic NLP to skip articles, conjunctions etc (making it
language-dependent), and stems words (removing -ing, -ed, -s etc, which is
obviously language-dependent too; NSA denies using stemming, dictionaries
etc), before creating a statistical table of word frequencies, which is
used to determine the subject of a sentence or the similarities between texts.
Cornell can search "gigabytes of data ... in a few seconds [for] a
subject" or similarity to an example text. It can figure out which
sentences are 'important' (by comparing frequency tables).
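
To make the frequency-table idea concrete, here is a rough sketch in Python.
It is only my guess at the general shape of such a program, not BT's or
Cornell's actual code: the stop list, the suffix list, the use of cosine
similarity and all the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop list and suffix list, for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in", "is", "it"}
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Crude suffix stripping; a real stemmer is far more careful."""
    for suffix in SUFFIXES:
        # Only strip if at least three letters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def frequency_table(text):
    """Count stemmed words, ignoring anything on the stop list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOP_WORDS)


def cosine_similarity(a, b):
    """Similarity of two frequency tables: 0.0 (disjoint) to 1.0 (same mix)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_important_sentence(text):
    """Pick the sentence whose table is closest to the whole text's table."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    whole = frequency_table(text)
    return max(sentences, key=lambda s: cosine_similarity(frequency_table(s), whole))
```

Comparing two texts is then cosine_similarity(frequency_table(x),
frequency_table(y)), and a crude summary is the few sentences that score
highest against the table for the whole article. Note that both the stop
list and the suffix list tie this to English, which is exactly what NSA
says its method avoids.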

I suspect NSA's is much more pattern-oriented, as its USP is document 
clustering; maybe it uses some NN at some level. Of course you don't really
need to know grammar to filter out articles and pronouns; you could do that
statistically too.
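
As a sketch of what "statistically" could mean here (my assumption, not
anything NSA has described): words that turn up in nearly every document,
whatever the topic, are almost certainly function words. The function name
and the 0.8 cutoff below are hypothetical.

```python
import re
from collections import Counter


def statistical_stop_words(documents, min_doc_fraction=0.8):
    """Return words that occur in at least min_doc_fraction of the documents.

    No grammar or dictionary is consulted; the cutoff is an assumed value.
    """
    doc_counts = Counter()
    for doc in documents:
        # Count each word once per document, however often it repeats there.
        doc_counts.update(set(re.findall(r"\w+", doc.lower())))
    threshold = min_doc_fraction * len(documents)
    return {word for word, n in doc_counts.items() if n >= threshold}


docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a bird saw the worm",
]
print(statistical_stop_words(docs))
```

On a real corpus the cutoff would need tuning, but nothing in this depends
on knowing which language the documents are written in.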

Rishab

-----------------------------------------------------------------------------
Rishab Aiyer Ghosh                                "In between the breaths is
rishab@dxm.ernet.in                                  the space where we live"
rishab@arbornet.org                                        - Lawrence Durrell
Voice/Fax/Data +91 11 6853410  
Voicemail +91 11 3760335                 H 34C Saket, New Delhi 110017, INDIA
