Random musing about words and spam

Wed Sep 3 01:44:53 PDT 2003

On Wed, 3 Sep 2003, John Kozubik wrote:
> Try to write the logic that distinguishes this:
>
> if_gre in the tree passes the mbuf to netisr_dispatch(), which in turn
> calls if_handoff(), which does something similar.
>
> (hackers at freebsd.org)
>
> from this:
>
> dyeiluykxoer dyeiluykcqkutknig dyeiluykkrpmhrku dyeiluykngeqx
> dyeiluykoybim dyeiluykbihlyrelg dyeiluyktwucinmdyeiluykwenmttwvm
>
> (actual spam)

Quality vs quantity. Look at the ratio of machine-generated words to
real-looking ones, counting a word that looks machine-generated as a
positive hit. The first sample has far more negative hits than positive
ones; the second is all positive. (However, this is easy to beat by using
randomly selected dictionary words instead. The next step is a syntactic
parser working at the sentence level. The countermove is borrowing random
paragraphs of otherwise meaningful text from random websites. The next
move is to employ semantic parsers, and then we're waist-deep in
artificial intelligence and natural language analysis. It will end there
anyway.) This won't work too reliably on its own, at least in the simple
version, but it could help a Bayesian filter make a decision.
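
To make the ratio idea concrete, here is a rough Python sketch of the
simple version. The vocabulary is a tiny hypothetical word list made up
for this example (a real filter would load a full dictionary plus domain
terms), and known_word_ratio is just an illustrative name, not part of any
existing filter.

```python
import re

# Hypothetical, deliberately tiny vocabulary for illustration only.
# A real filter would load a full word list (and probably domain jargon).
KNOWN_WORDS = {
    "if", "in", "the", "tree", "passes", "to", "which", "turn",
    "calls", "does", "something", "similar", "dispatch",
}


def known_word_ratio(text, vocabulary=KNOWN_WORDS):
    """Return the fraction of alphabetic tokens found in the vocabulary."""
    # Split on anything that is not a letter, so "netisr_dispatch()"
    # becomes the two tokens "netisr" and "dispatch".
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


legit = ("if_gre in the tree passes the mbuf to netisr_dispatch(), which "
         "in turn calls if_handoff(), which does something similar.")
spam = ("dyeiluykxoer dyeiluykcqkutknig dyeiluykkrpmhrku dyeiluykngeqx "
        "dyeiluykoybim dyeiluykbihlyrelg")

legit_score = known_word_ratio(legit)  # mostly real words, a few identifiers
spam_score = known_word_ratio(spam)    # no token is in the vocabulary
```

A score like this is only one weak signal. As noted above, random
dictionary words push it straight back to 1.0, so it makes more sense as
an extra input to a Bayesian filter than as a verdict by itself.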

> I must reiterate that, given the relentless efficiency of spam-spiders,
> merely publishing a shadow email address on all web documents that your
> real email address reside on, and deleting all email sent to both accounts
> is my current favorite anti-spam mechanism.  Simple to DIY, and requires
> no centralization.

This approach assumes you are able to detect duplicates (which may be
difficult if each spam sent out is different, e.g. using different sets
of pseudowords, which has already been done in some cases ever since
antispam systems based on hashes of known spams were introduced), and it
depends on the duplicates actually reaching both of your addresses within
a reasonable timeframe.
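
A minimal Python sketch of both weak points, assuming duplicates are
matched by an exact hash of the message body. The names body_digest and
is_duplicate, the (arrival time, body) tuples and the one-hour window are
all assumptions invented for the example, not anybody's actual setup.

```python
import hashlib


def body_digest(body):
    """Hash a message body after collapsing case and whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(real_msg, shadow_msgs, window_seconds=3600):
    """Check whether a message to the real address also hit the shadow one.

    real_msg is an (arrival_time_seconds, body) tuple and shadow_msgs is a
    list of such tuples. The one-hour default window is an arbitrary
    assumption.
    """
    real_time, real_body = real_msg
    real_hash = body_digest(real_body)
    return any(
        abs(real_time - shadow_time) <= window_seconds
        and body_digest(shadow_body) == real_hash
        for shadow_time, shadow_body in shadow_msgs
    )


shadow_box = [(1000, "Buy now dyeiluykxoer")]

# Same text, arrives shortly after the shadow copy: caught.
same_text = is_duplicate((1200, "buy  NOW dyeiluykxoer"), shadow_box)

# Same spam run, but with a different pseudoword: the hash no longer matches.
varied_text = is_duplicate((1200, "Buy now dyeiluykoybim"), shadow_box)

# Same text, but the copies arrive too far apart: outside the window.
too_late = is_duplicate((99999, "Buy now dyeiluykxoer"), shadow_box)
```

Only the first case is detected. A single changed pseudoword or a long
enough delay between the two deliveries is sufficient to slip past an
exact-match check like this one.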