NSA's text search algorithm
"Ian Farquhar" <ianf@sydney.sgi.com>:
I always imagined that the development of [NSA's text scanning] algorithm itself predated email, and started back with cable and telex traffic.
Stat text scanning is ancient, but has probably not been used on the scale and efficiency that the NSA would require for net traffic.
Earlier this year, the agency began soliciting collaborations from business to develop commercial applications of their technique.
Has anyone got any further information about how this algorithm works? It sounds like Rishab has somewhat better info than was publicly available months ago when we last discussed this particular NSA "technology transfer".
Actually my 'info' about NSA's thing was mainly deduction put together with some (limited) specs on Architext (http://www.atext.com graham@atext.com). If you read NSA's note carefully, you easily rule out NLP ("independent of...language") and sophisticated neural nets ("very fast").

The Economist story I mentioned in my last post (on the fact that I beat them to the story!) goes into some detail on BT and Cornell's programs that summarize textual matter. These are apparently successful (included is a pretty good example of a computer-generated summary of the article), but also quite different from NSA's.

BT uses basic NLP to get past articles, conjunctions etc (making it language-dependent), and stems (removes -ing, -ed, -s etc, unlike NSA which denies stemming, dictionaries etc; obviously language-dependent), before creating a statistical table of word frequencies which is used to determine the subject of a sentence or the similarities between texts.

Cornell can search "gigabytes of data ... in a few seconds [for] a subject" or similarity to an example text. It can figure out which sentences are 'important' (by comparing frequency tables).

I suspect NSA's is much more pattern-oriented, as its USP is document clustering; maybe it uses some NN at some level. Of course you don't really need to know grammar to filter out articles and pronouns; you could do that statistically too.

Rishab

-----------------------------------------------------------------------------
Rishab Aiyer Ghosh                      "In between the breaths is
rishab@dxm.ernet.in                      the space where we live"
rishab@arbornet.org                        - Lawrence Durrell
Voice/Fax/Data +91 11 6853410
Voicemail +91 11 3760335                H 34C Saket, New Delhi 110017, INDIA
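
To make the BT-style approach described above concrete (drop function words, strip suffixes, build word-frequency tables, compare texts), here is a minimal Python sketch. It also illustrates the closing remark that function words could be filtered statistically rather than grammatically: a word is dropped simply because it occurs in nearly every document. Everything here is an assumption for illustration, not BT's, Cornell's or NSA's actual method: the function names, the three-suffix list (taken from the "-ing, -ed, -s" examples), the 0.8 document-fraction threshold, and the use of cosine similarity to compare tables are all hypothetical choices.

```python
import math
import re
from collections import Counter

# Hypothetical suffix list, mirroring the "-ing, -ed, -s" examples in the post.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip one listed suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def statistical_stopwords(docs, max_doc_fraction=0.8):
    """Return words that occur in more than max_doc_fraction of the documents.

    No grammar or dictionary is consulted: a word is treated as a function
    word purely because it shows up almost everywhere. The threshold is an
    assumed value, not one taken from any of the systems discussed.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    threshold = max_doc_fraction * len(docs)
    return {word for word, count in doc_freq.items() if count > threshold}


def frequency_table(text, stopwords):
    """Count stemmed words, skipping the stopwords."""
    return Counter(crude_stem(w) for w in tokenize(text) if w not in stopwords)


def cosine_similarity(a, b):
    """Cosine of the angle between two word-frequency tables."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus, invented for this example.
docs = [
    "the agency scanned the cable traffic",
    "the agency is scanning telex traffic",
    "the cook is baking the bread",
]
stop = statistical_stopwords(docs)
tables = [frequency_table(d, stop) for d in docs]
sim_related = cosine_similarity(tables[0], tables[1])
sim_unrelated = cosine_similarity(tables[0], tables[2])
print(stop, round(sim_related, 3), sim_unrelated)
```

On this toy corpus only "the" is filtered out, and the two texts about scanning traffic score higher than the unrelated pair. Note that the suffix stripping is exactly the language-dependent step that NSA's note says its technique avoids, so this is a sketch of the contrast case, not of NSA's algorithm.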
On Dec 20, 2:33am, rishab@dxm.ernet.in wrote:
Subject: NSA's text search algorithm
"Ian Farquhar" <ianf@sydney.sgi.com>:
I always imagined that the development of [NSA's text scanning] algorithm itself predated email, and started back with cable and telex traffic.
Stat text scanning is ancient, but has probably not been used on the scale and efficiency that the NSA would require for net traffic.
Earlier this year, the agency began soliciting collaborations from business to develop commercial applications of their technique.
Has anyone got any further information about how this algorithm works? It sounds like Rishab has somewhat better info than was publicly [...]
If you read NSA's note carefully, you easily rule out NLP ("independent of...language") and sophisticated neural nets ("very fast").
You can rule out both of them on the grounds that the original release claimed that it was amenable to hardware implementation. I speculated about some clever form of CAM plus a stats engine.

Ian.