[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

NSA's text search algorithm



"Ian Farquhar" <[email protected]>:
> I always imagined that the development of [NSA's text scanning]
> algorithm itself predated email, and started back with cable and 
> telex traffic.

Stat text scanning is ancient, but has probably not been used on the scale
and efficiency that the NSA would require for net traffic.

> > Earlier this year, the agency began soliciting collaborations from
> > business to develop commercial applications of their technique. 
>
> Has anyone got any further information about how this algorithm works?
> It sounds like Rishab has somewhat better info than was publicly
> available months ago when we last discussed this particular NSA
> "technology transfer".

Actually my 'info' about NSA's thing was mainly deduction put together 
with some (limited) specs on Architext (http://www.atext.com
[email protected]). If you read NSA's note carefully, you easily rule out
NLP ("independent of...language") and sophisticated neural nets ("very fast").
The Economist story I mentioned in my last post (on the fact that I beat 
them to the story!) goes into some detail on BT and Cornell's programs that
summarize textual matter. These are apparently successful (included is an
pretty good example of a computer-generated summary of the article), but 
also quite different from NSA's.

BT uses basic NLP to get past articles, conjunctions etc (making it
language-dependent), and stems (removes -ing, -ed, -s etc, unlike NSA
which denies stemming, dictionaries etc; obviously language-dependent),
before creating statistical table of word frequencies which are used to
determine the subject of a sentence or the similarities between texts.
Cornell can search "gigabytes of data ... in a few seconds [for] a
subject" or similarity to an example text. It can figure out which
sentences are 'important' (by comparing frequency tables).

I suspect NSA's is much more pattern-oriented, as its USP is document 
clustering; maybe it uses some NN at some level. Of course you don't really
need to know grammar to filter out articles and pronouns; you could do that
statistically too.

Rishab

-----------------------------------------------------------------------------
Rishab Aiyer Ghosh                                "In between the breaths is
[email protected]                                  the space where we live"
[email protected]                                        - Lawrence Durrell
Voice/Fax/Data +91 11 6853410  
Voicemail +91 11 3760335                 H 34C Saket, New Delhi 110017, INDIA