Remailers, N-Grams, and Google

Wed Dec 12 16:29:08 PST 2001

On Mon, 10 Dec 2001, Faustine wrote:

> What are N-Grams?
>
> N-Gram Analysis is a method patented by the NSA to compare the
> semantic of two texts or audio or video data files. The algorithm is
> pretty simple, all you have to do, is take a sliding window of length
> N and move it over the text, and remember, how often which
> text-fragment of length N occurred in the text. This implementation of
> the N-Gram Method is a pretty simple ANSI-C-Program, I wrote to
> distract me from my end-of-semester exams. It would be nice, if you
> send me patches, comments or so to rhoehndo at imn.htwk-leipzig.de. I
> will do some more to this code as soon as I finished my exams.

Here's a thought.

Given a comprehensive collection of public Internet communication,
including Usenet, mailing list traffic, weblog entries, etc., and an
advanced semantic analysis algorithm, it should be fairly trivial to take
the n-gram (or other semantic signature) of a remailed message, and search
a database of these signatures for possible matches.
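
For concreteness, here is a minimal sketch of the sliding-window count
described in the quoted text. It is written in Python rather than the
ANSI C of the quoted program, and it is not that program's code. The
function name ngram_profile, the default N of 3, and the case and
whitespace normalisation are all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count how often each length-n character window occurs in text."""
    # Assumed normalisation: lowercase, runs of whitespace collapsed to
    # a single space. The quoted description does not specify any.
    cleaned = " ".join(text.lower().split())
    # Slide a window of length n across the text, one character at a time.
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


if __name__ == "__main__":
    # Print the five most frequent trigrams of a short sample.
    print(ngram_profile("the cat sat on the mat").most_common(5))
```

The resulting table of counts is the "signature" that would be stored
and compared. A text shorter than N characters yields an empty table.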

Google's got the raw material, and surely the NSA does as well. (If the
FBI hadn't been collecting this information all along, would they even
need to ask a court to get the NSA to share it with them? I'm talking
about purely public info -- nothing that Google wouldn't have.)

A program could be written to run over time, generating n-grams which
would then be stored in the database alongside the original text. The
program should be smart enough to ignore mail headers, footers, etc., but
none of this would be difficult given a good semantic analysis algo.
(N-grams appear to be less effective on small documents, though if two
documents were known to be authored by the same person, they could be
treated as one.) When an anonymous text's n-gram was entered into the
search engine, the database would return all documents
with similar n-grams. This should reveal the likely identity of the author
in a large number of cases. *Then* you could Magic Lantern him or
whatever.
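
A rough sketch of that lookup, under stated assumptions: character
trigram counts stand in for the "semantic signature", cosine similarity
stands in for "similar n-grams", and an in-memory dict stands in for the
database. None of these choices come from the NSA method or the quoted
program, and the names profile, cosine, and rank_authors are
hypothetical. Texts known to share an author are pooled into one
profile, as suggested above for small documents.

```python
import math
from collections import Counter


def profile(text, n=3):
    """Character n-gram counts, lowercased with whitespace collapsed."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity of two count tables (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_authors(anonymous_text, corpus, n=3):
    """Rank known authors by similarity to an anonymous text.

    corpus maps an author label to a list of that author's known texts
    (a stand-in for the database). Returns (score, author) pairs with
    the best match first.
    """
    target = profile(anonymous_text, n)
    scores = []
    for author, texts in corpus.items():
        pooled = Counter()
        for t in texts:
            # Treat all of one author's texts as a single document.
            pooled.update(profile(t, n))
        scores.append((cosine(target, pooled), author))
    return sorted(scores, reverse=True)
```

Calling rank_authors(message, corpus) returns every known author with a
score between 0 and 1. Whether the top score actually identifies the
writer is the open question raised below; the sketch only shows the
mechanics.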

What's the current state of public research in this area? Does anything
exist that would be useful for practical application at this point? (I'm
not sure how reliable n-grams would be on this kind of scale, and I
haven't been able to find much via Google that really answers that.)

I don't think I am saying anything new here. I'm bringing this up because it
seems like the solution to defeating remailers that involves the least
legal hassle, can be applied retroactively, does not involve an
unreasonable amount of computing power or deployed equipment, and has
a decent chance of success for a good number of messages. (It won't work
if the LEA doesn't have the plain text message that was sent through the
remailer, or if the message was simply a binary file, news report, or
something else not of the sender's own words, but it would work on
messages exchanged discussing plots, drug deals, threats, kiddie-porn
solicitation, naughty fantasies on alt.personals.bondage, etc.)

And then there's the added bonus of still working, even if the things Tim
says are needed exist (much greater number of remailers, more traffic,
etc.) and working on *any* form of anonymous communication, including
missives deposited in postal drop boxes (assuming tomorrow's Unabombers
post to Usenet.)

There could also be a commercial or individual demand for such a system.
Suppose I wanted to read everything that Eric Hughes has written and
published publicly online over the past 20 years. How would I go about
such a search? Searching by name or email address will miss quite a lot.
If Google had n-gram searching, and a "submit your text sample for
n-gramification" cgi, I'd have more luck. Anyone at Google want to take me
up on this?

I'm sure there are other, more practical uses that I'm missing as well.

-Arnold