For the past few years I've looked at this issue (author identification through text content analysis) a bit, from a psycholinguistic point of view. According to an occasional electronic digest coordinated by a woman from the UK named Blackwell (I apologize that I don't remember her full name or have her email address handy), a technique that sums the probabilities of various word occurrences (CUSUM) has come under fire recently and, if I remember correctly, is not accepted in UK courts.

A 1983 paper by Dr. Murray Miron of Syracuse University (I don't have the cite handy for that either) gave his equations for analyzing two texts of roughly similar lengths and establishing a probability that the two were written by the same individual. In that paper, Dr. Miron related the story of a trial where he was summoned as an expert witness. He was not allowed to testify as to whether the defendant had authored an extortion note, a conclusion based on analysis of the note vis-a-vis a known letter from the defendant. The jury nonetheless found the defendant guilty, based on an identical misspelling of a word in both messages. Dr. Miron noted that the verdict agreed with the overall findings of the computer analysis, but that the jury reached it on the strength of a single coincident misspelling that could occur, with relatively high probability, in any two random messages.

The same idea applies here: for CUSUM or a similar analysis to be valid, an analyst needs a large volume of messages where one of the authors is known (an anonymous ID counts), and the documents compared must be of similar lengths. One note a while back indicated that anonymous IDs could be matched by tracing misspellings and uncommon word usage. That is definitely not true without a large base of known messages from both IDs and a high score on an evaluation function as described in the literature.

Curtis D. Frye
cfrye@ciis.mitre.org
"If you think I speak for MITRE, I'll tell you how much they pay me and make you feel foolish."