On Mon, 10 Dec 2001, Faustine wrote:
What are N-Grams?
N-Gram analysis is a method patented by the NSA for comparing the semantics of two texts, or of audio or video data files. The algorithm is pretty simple: take a sliding window of length N, move it over the text, and record how often each text fragment of length N occurred. This implementation of the N-Gram method is a pretty simple ANSI C program that I wrote to distract myself from my end-of-semester exams. It would be nice if you sent patches, comments, and so on to rhoehndo@imn.htwk-leipzig.de. I will do some more work on this code as soon as I have finished my exams.
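For anyone who wants to see the sliding window concretely, here is a minimal sketch in Python. It is not the ANSI C program mentioned above. The function name `ngram_counts` and the choice of character-level windows with a default of N = 3 are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(text: str, n: int = 3) -> Counter:
    """Count every length-n character window in text (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of length n across the text one character at a time.
    # A text shorter than n produces no windows and an empty Counter.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


counts = ngram_counts("abracadabra", 3)
print(counts["abr"], counts["bra"], sum(counts.values()))
```

An 11-character string yields 9 windows of length 3. Here "abr" and "bra" each occur twice and every other trigram occurs once.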
Here's a thought. Given a comprehensive collection of public Internet communication (Usenet, mailing list traffic, weblog entries, etc.) and an advanced semantic analysis algorithm, it should be fairly trivial to take the n-gram (or other semantic signature) of a remailed message and search a database of these signatures for possible matches. Google's got the raw material, and surely the NSA does as well. (If the FBI hadn't been collecting this information all along, would they even need to ask a court to get the NSA to share it with them? I'm talking about purely public info -- nothing that Google wouldn't have.)

A program could be written to run over time, generating n-grams which would then be stored in the database alongside the original text. The program should be smart enough to ignore mail headers, footers, etc., but none of this would be difficult given a good semantic analysis algorithm. (N-grams appear to be less effective on small documents, though if two documents were known to be authored by the same person, it should be possible to treat them as one.) When an anonymous text's n-gram is entered into the search engine, the database would return all documents with similar n-grams. This should reveal the likely identity of the author in a large number of cases. *Then* you could Magic Lantern him or whatever.

What's the current state of public research in this area? Does anything exist that would be useful for practical application at this point? (I'm not sure how reliable n-grams would be on this kind of scale, and I haven't been able to find much via Google that really answers that.)

I don't think I am saying anything new here. I'm bringing this up because it seems like the solution to defeating remailers that involves the least legal hassle, can be applied retroactively, does not involve an unreasonable amount of computing power or deployed equipment, and has a decent chance of success for a good number of messages.
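To make the lookup step concrete, here is a rough sketch in Python of ranking stored documents against an anonymous text. Everything in it is an assumption for illustration: the names `profile`, `cosine`, and `rank`, the use of character trigrams, and cosine similarity as the comparison measure. It measures surface overlap only and says nothing about how reliable the match would be at Usenet scale.

```python
import math
from collections import Counter


def profile(text: str, n: int = 3) -> Counter:
    """Character n-gram counts after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query: str, corpus: dict, n: int = 3) -> list:
    """Return (score, doc_id) pairs, most similar first."""
    q = profile(query, n)
    scored = [(cosine(q, profile(text, n)), doc_id) for doc_id, text in corpus.items()]
    return sorted(scored, reverse=True)


# Toy stand-in for the database of archived public posts.
corpus = {
    "post-a": "the quick brown fox jumps over the lazy dog",
    "post-b": "zzzz yyyy xxxx wwww",
}
results = rank("the quick brown fox", corpus)
print(results[0][1])
```

Since the profiles are `Counter` objects, two documents known to share an author can be treated as one by adding their profiles with `+`.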
(This approach won't work if the LEA doesn't have the plain text of the message that was sent through the remailer, or if the message was simply a binary file, news report, or something else not in the sender's own words, but it would work on messages exchanged discussing plots, drug deals, threats, kiddie-porn solicitation, naughty fantasies on alt.personals.bondage, etc.) And then there's the added bonus of still working even if the things Tim says are needed exist (a much greater number of remailers, more traffic, etc.), and of working on *any* form of anonymous communication, including missives deposited in postal drop boxes (assuming tomorrow's Unabombers post to Usenet).

There could also be a commercial or individual demand for such a system. Suppose I wanted to read everything that Eric Hughes has written and published publicly online over the past 20 years. How would I go about such a search? Searching by name or email address will miss quite a lot. If Google had n-gram searching, and a "submit your text sample for n-gramification" CGI, I'd have more luck. Anyone at Google want to take me up on this? I'm sure there are other, more practical uses that I'm missing as well.

-Arnold