Re: ID of anonymous posters via word analysis?
For the past few years I've looked at this issue (author identification through text content analysis) a bit from a psycholinguistic point of view. According to an occasional electronic digest coordinated by a woman from the UK named Blackwell (I apologize that I don't remember her name or have her email address handy), a technique that sums the probabilities of various word occurrences (CUSUM) has come under fire recently and, if I remember correctly, is not accepted in UK courts.

A 1983 paper (for which I also do not have the cite handy) by Dr. Murray Miron of Syracuse University gave his equations for analyzing two texts (of roughly similar lengths) and establishing a probability that the two writings were produced by the same individual. In his paper, Dr. Miron related the story of a trial where he was summoned as an expert witness and was not allowed to testify as to whether an extortion note was authored by the defendant, based on analysis of the note vis-a-vis a known letter from the defendant. Dr. Miron noted that the jury's decision agreed with the overall findings of the computer analysis; however, the jury returned its guilty verdict based on a single identical misspelling of a word in each message, a coincidence that could happen (with relatively high probability) in any two random messages.

The same idea applies here - for CUSUM or similar analysis to be valid, an analyst needs large volumes of messages where one of the authors is known (an anonymous ID counts) and the documents compared are of similar lengths. One note a while back indicated that matching anonymous IDs could be done through tracing misspellings and uncommon word usage. Definitely not true without a large base of known messages from both IDs and a high score on an evaluation function as described in the literature.

Curtis D. Frye
cfrye@ciis.mitre.org
"If you think I speak for MITRE, I'll tell you how much they pay me and make you feel foolish."
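To make the "compare two texts of similar length" idea concrete, here is a minimal sketch that scores two writings by how similarly they use a handful of common function words. It is not Dr. Miron's set of equations and it is not CUSUM; the word list, the function names, and the use of cosine similarity are all assumptions made for illustration. Note that the result is a similarity score between 0 and 1, not a probability of common authorship.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-picked function words; not taken from any cited paper.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "for", "with", "as", "but", "not", "which")


def profile(text):
    """Return the relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def style_similarity(text_a, text_b):
    """Similarity score (not a probability) of two texts' word-usage profiles."""
    return cosine_similarity(profile(text_a), profile(text_b))
```

With short inputs the score swings wildly on a single word, which is the same weakness as the single shared misspelling in the trial: the number only starts to mean something with a large base of text on both sides.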
Curtis Frye and many others have written about the ways anonymous or pseudonymous posts can be identified. Graham Toal's comments were especially cogent (even if he tweaked me for some of my characteristic writing patterns and whatnot; hint: I use "whatnot" more than most people here). I want to briefly mention another way of looking at this issue, and will use Curtis' comments to start:
For the past few years I've looked at this issue (author identification through text content analysis) a bit from a psycholinguistic point of view. ... A 1983 paper (which I also do not have the cite handy for) by Dr. Murray Miron of Syracuse University gave his equations for analyzing two texts (of roughly similar lengths) and establishing a probability that the two writings were produced by the same individual. In his paper, Dr. Miron ... The same idea applies here - for CUSUM or similar analysis to be valid, an analyst needs large volumes of messages where one of the authors is known
One can view this problem in terms of Shannon's theorem about the transmission of a message in the presence of noise:

* Signal -- the identity of the poster (true name, pseudonym, whatever)
  - characteristic usage of words, of punctuation, and whatnot (see)
  - even the ideologies expressed (which LD incorrectly used to conclude Jamie Dinkelacker and I "must" be the same person)

* Noise -- variations in spelling, usage, etc.
  - many people use similar constructions and whatnot (like this)

Now Shannon's theorem, which can be applied here if some care is taken (that is, don't apply it too simplistically or too mechanistically), says that no matter how much noise is present, one can extract the signal if one samples enough. (Caveats: for a stationary sequence, etc., whereas one's writings may change with time, with the topic at hand, etc.)

This means that one can "communicate" the "message"--which in this case is the message "I am Tim May" or "Jamie and Tim are distinct posters" and so forth--if enough messages are analyzed.

But to Shannon's basic view one must also add _interference_, whether deliberate (spoofing) or not. If I try to emulate the style of S. Boxx, for example, by writing in the form "I am becoming INCREASINGLY DISGUSTED by the blatant disregard for the Cypherpunks CAUSE and ...", then this "interference" could greatly complicate the signal extraction. In fact, more obscure correlations would have to be looked at, ones which might require many more messages to analyze...possibly more message samples than exist.

Text analysis tools have presumably gotten a lot more powerful than they were 30 years ago when the "Did Marlowe write Shakespeare's plays?" question was being computer-analyzed. Anyway, like others have said, there are several programs available which do this kind of analysis, and I don't think it's paranoid to say that the CIA and the NSA must have extremely sophisticated tools for such analysis.

An interesting area.
Anybody else interested in building a "nymalizer" which sorts posts into likely bins?

--Tim May

--
..........................................................................
Timothy C. May         | Crypto Anarchy: encryption, digital money,
tcmay@netcom.com       | anonymous networks, digital pseudonyms, zero
408-688-5409           | knowledge, reputations, information markets,
W.A.S.T.E.: Aptos, CA  | black markets, collapse of governments.
Higher Power: 2^756839 | Public Key: PGP and MailSafe available.
Note: I put time and money into writing this posting. I hope you enjoy it.
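For anyone who wants to tinker, a "nymalizer" along these lines could start as small as the sketch below. It pools the known posts for each id into one word-count vector (a "bin") and assigns a new post to the bin it most resembles, or to no bin if the best match is weak. Everything here is an assumption for illustration: the names `build_bins` and `nymalize`, the plain word-count features, the cosine score, and the 0.2 threshold are hypothetical choices, not a description of any existing tool.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercased word tokens; capitalization cues are deliberately dropped."""
    return re.findall(r"[a-z']+", text.lower())


def build_bins(labeled_posts):
    """Pool the word counts of all known posts for each id into one bin."""
    bins = defaultdict(Counter)
    for ident, text in labeled_posts:
        bins[ident].update(tokens(text))
    return bins


def cosine(c1, c2):
    """Cosine similarity between two word-count Counters."""
    dot = sum(n * c2[w] for w, n in c1.items())
    norm = (math.sqrt(sum(n * n for n in c1.values()))
            * math.sqrt(sum(n * n for n in c2.values())))
    return dot / norm if norm else 0.0


def nymalize(post, bins, threshold=0.2):
    """Return (best_id, score); best_id is None when no bin scores high enough."""
    vec = Counter(tokens(post))
    best_id, best = None, 0.0
    for ident, pooled in bins.items():
        score = cosine(vec, pooled)
        if score > best:
            best_id, best = ident, score
    return (best_id, best) if best >= threshold else (None, best)


# Usage with two made-up ids and tiny sample texts:
bins = build_bins([
    ("nym_a", "encryption and digital money and whatnot"),
    ("nym_b", "I am becoming INCREASINGLY DISGUSTED by the blatant disregard"),
])
guess, score = nymalize("anonymous networks, digital pseudonyms, and whatnot", bins)
```

A sketch this crude is exactly what deliberate interference would defeat: it keys on surface vocabulary, so borrowing someone's pet words shifts a post into their bin. It also needs far more text per id than the toy samples shown before the scores mean anything.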