Re: spam detector algorithm?
Greg Broiles writes:
I've been mulling over algorithmic/computational ways to spot spams for some time now. I think I might've come up with a way to represent messages (and compare representations) that would be useful to remailer operators who don't want to let spams (where "spam" == many messages with identical or very similar content) through their remailers. [many details elided...] Any thoughts about this? Interesting? Stupid? Like I said, my math is weak. My intention is to try to cobble up a 2d version of this to see how it runs but I thought I'd see if anyone can point out why it can't work, or if it's useful enough that someone with a better math background than I've got wants to take this idea somewhere better.
It sounds like you are liable to start reinventing parts of the field of information retrieval. The automatic construction and comparison of vectors of document parameters, as you suggested in the part I omitted, is one approach that has met with some success. (The common problem is, given a set of query attributes or a model document, to find relevant documents matching the query or similar to the model document. A variety of relevance measures has been considered.) I can't give you any specific pointers, but I advise you to check out existing implementations of these and other techniques for information retrieval before you spend too much time writing new code. FWIW, I _do_ think that such tactics would be very effective in combatting much of the spam served up these days.
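To make the "vectors of document parameters" idea concrete, here is a minimal sketch of one such comparison: word-count vectors compared by cosine similarity. This is a standard IR technique and not necessarily the representation Greg had in mind (those details were elided). The function names and the 0.9 threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a message body to a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 to 1.0)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def looks_like_repeat(message, recent_messages, threshold=0.9):
    """True if `message` is nearly identical to any recently seen message.

    The threshold is an assumed value and would need tuning on real traffic.
    """
    vec = term_vector(message)
    return any(cosine_similarity(vec, term_vector(old)) >= threshold
               for old in recent_messages)
```

A remailer would call `looks_like_repeat` with the new message and a window of recently forwarded bodies. Changes in case or punctuation do not affect the score, since only the lowercase word counts are compared.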
One side effect to the deployment of spam detectors may be that the remailer pinging services will need to move to using encrypted packets. [...] My impression is (speak up if I'm wrong) that requiring encryption for the ping packets wouldn't be an enormous burden on the pinging services because the new generation of software sends fewer pinging packets such that the CPU time required isn't an issue.
Last time I looked, Raph's software already encrypts ping messages to remailers that have PGP keys. I assume you intend to perform the spam check after removing the optional outer layer of encryption on each incoming message. Perhaps the ping messages would survive unscathed if you only applied the spam scan to messages larger than some minimum size. I haven't seen too many 1 or 2 line spams. -Futplex <futplex@pseudonym.com>
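The minimum-size idea can be sketched as a gate in front of whatever spam check is used. This assumes the outer encryption layer has already been removed. The cutoff of 5 non-blank lines and the names `should_scan` and `filter_message` are hypothetical, not taken from any remailer software.

```python
MIN_SCAN_LINES = 5  # assumed cutoff; not a value from the thread


def body_lines(message):
    """Return the non-blank lines of an already-decrypted message body."""
    return [line for line in message.splitlines() if line.strip()]


def should_scan(message, min_lines=MIN_SCAN_LINES):
    """Only messages with at least `min_lines` non-blank lines get scanned."""
    return len(body_lines(message)) >= min_lines


def filter_message(message, is_spam):
    """Return 'pass' or 'drop'; short messages bypass the spam check.

    `is_spam` is any callable taking the message text and returning a bool.
    """
    if not should_scan(message):
        return "pass"
    return "drop" if is_spam(message) else "pass"
```

The trade-off is that a spam shorter than the cutoff would pass unexamined, so the cutoff only makes sense while very short spams remain rare.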
futplex@pseudonym.com writes:
Subject: Re: spam detector algorithm?
Date: Tue, 10 Oct 1995 23:03:45 -0400 (EDT)
Greg Broiles writes: [many details elided...]
Any thoughts about this? Interesting? Stupid? Like I said, my math is weak. My intention is to try to cobble up a 2d version of this to see how it runs but I thought I'd see if anyone can point out why it can't work, or if it's useful enough that someone with a better math background than I've got wants to take this idea somewhere better.
It sounds like you are liable to start reinventing parts of the field of information retrieval. The automatic construction and comparison of vectors of document parameters, as you suggested in the part I omitted, is one approach that has met with some success. (The common problem is, given a set of query attributes or a model document, to find relevant documents matching the query or similar to the model document. A variety of relevance measures has been considered.)
I can't give you any specific pointers, but I advise you to check out existing implementations of these and other techniques for information retrieval before you spend too much time writing new code.
Check out SMART, which was originally developed by Gerard Salton at Cornell. (He is one of the pioneers of IR.) The current release is maintained by Chris Buckley (chrisb@balder.chrisb.com). Check out:

    ftp://ftp.cs.cornell.edu/pub/smart

If you don't feel like installing the whole thing but are interested in testing it out on some spam, then I could run some tests for you.

Here are some literary references for SMART:

@article{SB88-weight,
  author  = {Gerard Salton and Chris Buckley},
  journal = ipm,
  number  = {5},
  pages   = {513-523},
  title   = {Term-Weighting Approaches in Automatic Text Retrieval},
  volume  = {24},
  year    = {1988}
}

@inproceedings{BSA-trec1,
  author    = {Chris~Buckley and Gerard~Salton and James~Allan},
  title     = {Automatic Retrieval With Locality Information Using {SMART}},
  booktitle = {Proceedings of the First Text REtrieval Conference (TREC-1)},
  editor    = {D. K. Harman},
  publisher = {NIST Special Publication 500-207},
  month     = {March},
  year      = {1993},
  pages     = {59--72}
}

-jon
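For readers unfamiliar with term weighting, the following is a minimal sketch of one common scheme from that family: raw term frequency multiplied by log(N / document frequency). It is not SMART's code, and SMART supports other weighting variants. The function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per term
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
        for c in counts
    ]
```

Under this weighting a term that appears in every message gets weight zero, so text shared by all traffic contributes nothing to a similarity score, while rare terms dominate.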
participants (2)
- futplex@pseudonym.com
- Jonathan Litt