Random musing about words and spam
Spammers have recently adopted the tactic of using randomly generated words, e.g. "wryqf", in both the subject and the body of the message. These "pseudowords" are random, which makes them different from real words, which are made of syllables. Could the pseudowords be easily detected by their characteristics, e.g. presence of syllables, vowel-consonant sequences or ratio, something like that? This could shift the balance of power in spam detection again, until the adversary is forced to generate the random words from syllables instead of characters. Presence of pseudowords could then be added as one of the spam characteristics.
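A minimal sketch of the kind of check asked about above, in Python. The vowel set, the thresholds (`min_vowel_ratio`, `max_consonant_run`, `min_len`) and all function names are assumptions picked for illustration, not tuned values. It is only a heuristic: a real word such as "rhythm" gets flagged as well.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def vowel_ratio(word):
    """Fraction of the letters in `word` that are vowels."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def longest_consonant_run(word):
    """Length of the longest run of consecutive consonant letters."""
    longest = run = 0
    for c in word.lower():
        if c.isalpha() and c not in VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def looks_like_pseudoword(word, min_vowel_ratio=0.2,
                          max_consonant_run=4, min_len=4):
    """Flag words with too few vowels or an overly long consonant run."""
    letters = [c for c in word if c.isalpha()]
    if len(letters) < min_len:
        return False  # too short to judge
    return (vowel_ratio(word) < min_vowel_ratio
            or longest_consonant_run(word) > max_consonant_run)


if __name__ == "__main__":
    for w in ["wryqf", "message", "syllables"]:
        print(w, looks_like_pseudoword(w))
```

A generator that builds its words from syllables would pass both tests, which is the countermove anticipated above.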
Hello, On Wed, 3 Sep 2003, Thomas Shaddack wrote:
Spammers have recently adopted the tactic of using randomly generated words, e.g. "wryqf", in both the subject and the body of the message. These "pseudowords" are random, which makes them different from real words, which are made of syllables.
Could the pseudowords be easily detected by their characteristics, e.g. presence of syllables, vowel-consonant sequences or ratio, something like that? This could shift the balance of power in spam detection again, until the adversary is forced to generate the random words from syllables instead of characters. Presence of pseudowords could then be added as one of the spam characteristics.
I have, for a year or so now, been wondering about all the odd character strings I am finding in the subjects and bodies of my spam, and I too thought about keying on these for detection. However, I immediately abandoned the idea, as a quick glance over the content of my legitimate email (to and from developers, technical mailing lists, etc.) revealed that almost all of my legitimate email also contains seemingly random bits of gibberish and pseudowords.
Try to write the logic that distinguishes this:
if_gre in the tree passes the mbuf to netisr_dispatch(), which in turn calls if_handoff(), which does something similar.
(hackers@freebsd.org)
from this:
dyeiluykxoer dyeiluykcqkutknig dyeiluykkrpmhrku dyeiluykngeqx dyeiluykoybim dyeiluykbihlyrelg dyeiluyktwucinmdyeiluykwenmttwvm
(actual spam)
I must reiterate that, given the relentless efficiency of spam-spiders, merely publishing a shadow email address on all web documents that your real email address resides on, and deleting all email sent to both accounts, is my current favorite anti-spam mechanism. Simple to DIY, and requires no centralization.
-----
John Kozubik - john@kozubik.com - http://www.kozubik.com
On Wed, 3 Sep 2003, John Kozubik wrote:
Try to write the logic that distinguishes this:
if_gre in the tree passes the mbuf to netisr_dispatch(), which in turn calls if_handoff(), which does something similar.
(hackers@freebsd.org)
from this:
dyeiluykxoer dyeiluykcqkutknig dyeiluykkrpmhrku dyeiluykngeqx dyeiluykoybim dyeiluykbihlyrelg dyeiluyktwucinmdyeiluykwenmttwvm
(actual spam)
Quality vs. quantity: the ratio of machine-generated words to real-looking ones. The first sample has far more negative hits than positive ones; the second has all positive. (However, this is easy to beat by using randomly selected dictionary words instead. The next step is a syntactic parser at the sentence level. The countermove is borrowing random paragraphs of otherwise meaningful text from random websites. The following move is employing semantic parsers, and then we're waist-deep in artificial intelligence and natural language analysis. It will end there anyway.) It won't work too reliably on its own, at least in the simple version, but it could help a Bayesian filter make a decision.
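The ratio idea can be sketched as a per-message score that a Bayesian filter could take as one more feature. This is an illustration under stated assumptions: the rule "five or more consecutive consonants marks a machine-generated token" and the name `pseudoword_ratio` are made up for this example. On the two samples quoted above, the FreeBSD sentence scores zero and the spam scores higher, although the rule still misses several of the spam tokens.

```python
import re

# Assumption: a run of five or more consonants (with 'y' counted as a
# consonant) marks a token as machine-generated.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]{5,}", re.IGNORECASE)


def pseudoword_ratio(text):
    """Fraction of whitespace-separated tokens that look machine-generated."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if CONSONANT_RUN.search(t))
    return hits / len(tokens)


# The two samples quoted in the thread.
HAM = ("if_gre in the tree passes the mbuf to netisr_dispatch(), which in "
       "turn calls if_handoff(), which does something similar.")
SPAM = ("dyeiluykxoer dyeiluykcqkutknig dyeiluykkrpmhrku dyeiluykngeqx "
        "dyeiluykoybim dyeiluykbihlyrelg dyeiluyktwucinmdyeiluykwenmttwvm")

if __name__ == "__main__":
    print("ham: ", pseudoword_ratio(HAM))
    print("spam:", pseudoword_ratio(SPAM))
```

As noted above, the score collapses as soon as the sender switches to dictionary words, so it is at best one weak signal among many.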
I must reiterate that, given the relentless efficiency of spam-spiders, merely publishing a shadow email address on all web documents that your real email address resides on, and deleting all email sent to both accounts, is my current favorite anti-spam mechanism. Simple to DIY, and requires no centralization.
This approach assumes you are able to detect duplicates (which may be difficult if each spam sent out is different, e.g. using different sets of pseudowords, which is already being done in some cases, ever since antispam systems based on hashes of known spams were introduced), and depends on the duplicates actually reaching both of your addresses within a reasonable timeframe.
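To make the duplicate-detection assumption concrete, here is a small sketch of exact-match fingerprinting across the two mailboxes. The normalization (lowercasing, collapsing whitespace), the choice of SHA-256 and the function names are assumptions for illustration. It shows the weakness described above: two copies that differ only in their pseudowords no longer match.

```python
import hashlib


def fingerprint(body):
    """Hash a message body after lowercasing and collapsing whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def arrived_at_both(shadow_bodies, real_bodies):
    """Return fingerprints of messages seen in both mailboxes."""
    shadow = {fingerprint(b) for b in shadow_bodies}
    return {fingerprint(b) for b in real_bodies} & shadow


if __name__ == "__main__":
    # Same text, different spacing and case: detected as a duplicate.
    same = arrived_at_both(["Buy now!  Cheap pills"],
                           ["buy now! cheap pills", "Lunch at noon?"])
    print(len(same))

    # Same text with a different random token per copy: no match.
    varied = arrived_at_both(["Buy now! wryqf"], ["Buy now! xkcdq"])
    print(len(varied))
```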
Thomas:
I must reiterate that, given the relentless efficiency of spam-spiders, merely publishing a shadow email address on all web documents that your real email address resides on, and deleting all email sent to both accounts, is my current favorite anti-spam mechanism. Simple to DIY, and requires no centralization.
This approach assumes you are able to detect duplicates (which may be difficult if each spam sent out is different, e.g. using different sets of pseudowords, which is already being done in some cases, ever since antispam systems based on hashes of known spams were introduced), and depends on the duplicates actually reaching both of your addresses within a reasonable timeframe.
If one of the addresses was never used for legitimate purposes, then blocking all addresses that sent to this address should be an effective filter. Also, with the low cost of storage today, storing message hashes of known spam wouldn't take much space (not to say that this would be a good way of identifying spam).
I was pondering recently the use of a "web of trust"-type system whereby one could use communal whitelists with decreasing trust going outward, as well as the opportunity to select trusted sources, perhaps using authentication authorities for PKs as authoritative whitelists, or not, as per one's choice. (Since PKs require identification for the issue of certs, this at least provides some chain of evidence. However, it negates the opportunity for anonymity.)
How feasible are implementations of such 'distributed' whitelists? (I'm assuming that senders not on the whitelist are permitted to send through on a challenge-response basis, and that once identified, users have the opportunity to add to such a whitelist.)
And, is it possible to identify a bit of information as coming from a trusted source, without identifying that trusted source and without resorting to the use of a TTP?
--
Andrew G. Thomas
Hobbs & Associates Chartered Accountants (SA)
(o) +27-(0)21-683-0500 (f) +27-(0)21-683-0577 (m) +27-(0)83-318-4070
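The trap-address filter from the first paragraph above might look roughly like this in Python. It is a sketch, not a recommendation: the function names are hypothetical, and it keys on the From header, which is trivially forged, so it blocks whatever address the message claimed to come from.

```python
from email.utils import parseaddr


def sender_address(from_header):
    """Extract the bare, lowercased address from a From header value."""
    return parseaddr(from_header)[1].lower()


def build_blocklist(trap_from_headers):
    """Collect every sender that wrote to the never-published-for-humans trap."""
    return {a for a in map(sender_address, trap_from_headers) if a}


def is_blocked(from_header, blocklist):
    """True if this sender has previously mailed the trap address."""
    return sender_address(from_header) in blocklist


if __name__ == "__main__":
    # From headers of mail delivered to the trap address (made-up examples).
    blocklist = build_blocklist(["Bulk Mailer <Offers@Example.com>"])
    print(is_blocked("offers@example.com", blocklist))
    print(is_blocked("Friend <friend@example.org>", blocklist))
```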
On Wed, 3 Sep 2003, John Kozubik wrote:
I must reiterate that, given the relentless efficiency of spam-spiders, merely publishing a shadow email address on all web documents that your real email address resides on, and deleting all email sent to both accounts, is my current favorite anti-spam mechanism. Simple to DIY, and requires no centralization.
There is a high potential to falsely block innocent addresses. The most common reason these days will be worm activity. To quote from spamNEWS 09/02/03:
ooooo
SOBIG.F OBSERVATION - Lockergnome 8/31/2003
http://click.wh5.com/redirect.php?c=17825&u=46r9niwjatrv4g6m
I observed back on Tuesday that my Symantec SMTP gateway was stopping SoBig.F subject lines coming from spammers (i.e., blocked via DNSBL) at over 3 times the rate that I was seeing them from Joe user types. Further, I noticed that they were sending even more SoBig.F emails than they were spam. So, why would spammers who make their living by generating emails allow their servers to be compromised? They didn't. They are doing this on purpose and I have a theory for this. I call it my echo theory.
Say that, as a spammer, you know one or more of the addresses in your database is a spam trap, but you don't know which one. You generate LOTS of SoBig.F emails on purpose, using your database for the forged-from addresses. Now, JoeUser has his server or client antivirus filter set up to send a reply when it encounters a virus (which is a very BAD thing, after Klez taught us about the virtues of forged addresses). Dutifully, JoeUser's email server or client automatically sends a helpful note off to "SpamTrap," informing them that they are infected. Often these replies even extol how much smarter they are than "SpamTrap" because they caught it, but "SpamTrap" did not. Heck, let's even send an email to the postmaster at SpamBait's ISP, telling him / her how much better the BrandX filter is that JoeUser is using... but I digress.
The email server at SpamBait's ISP sees an email to SpamTrap and says "Ah hah, JoeUser's ISP must obviously be a spammer, so load his IP address into our DNSBL servers." JoeUser now sends a legitimate email to me, SmartUser at IuseDNSBL.com, and it, of course, bounces. JoeUser now calls me and asks why he was blacklisted.
After some diligent effort on my part, I find that DNSBL.SpamBait.com is saying half of my customers and suppliers are spammers. I have a business to run, so I turn off DNSBL on my gateway and, lo and behold, all of the spammers' emails that were being blocked due to DNSBL are now allowed to come through. That is my echo theory. That is why spammers are using half their bandwidth to send SoBig.F.
[Thanks to reader Stephen Whitis for the tip - ed.]
On Tuesday 02 September 2003 19:00, Thomas Shaddack wrote:
Spammers recently adopted tactics of using randomly generated words, eg. "wryqf", in both the subject and the body of the message. ... Could the pseudowords be easily detected by their characteristics, ... Presence of pseudowords then could be added as one of spam characteristics.
Wouldn't work for me. For one thing, I'm a programmer; as John Kozubik noted, identifiers in code look a lot like random strings. For another, I routinely receive email in non-English languages. Not only European languages, which probably have characteristics close enough to English to do matching, but also in Chinese and Korean. And Lojban, too, which itself looks an awful lot like random strings.
(And getting legit mail from .cn and .kr prevents me from just blocking the entire TLDs of those national spam factories. My life sucks.)
--
Steve Furlong Computer Condottiere Have GNU, Will Travel
"If someone is so fearful that, that they're going to start using their weapons to protect their rights, makes me very nervous that these people have these weapons at all!" -- Rep. Henry Waxman
On Thu, Sep 04, 2003 at 09:02:30PM -0400, Steve Furlong wrote:
On Tuesday 02 September 2003 19:00, Thomas Shaddack wrote:
Spammers recently adopted tactics of using randomly generated words, eg. "wryqf", in both the subject and the body of the message. ... Could the pseudowords be easily detected by their characteristics, ... Presence of pseudowords then could be added as one of spam characteristics.
Many of them space the code words away from the rest of the subject text, i.e. "Subject: what if it were true? 5258pf2" I think this is to hide the code word, since many mail readers only show 40-60 characters of the Subject. I've been id'ing spam by looking for excess whitespace in the Subject line for a couple of years (it's one of about 200 checks my program makes). I'm sure other spam-recognition software does this as well.
Eric
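A check of the kind described above, for excess whitespace in the Subject line, could be as small as the following Python sketch. The threshold of four consecutive spaces or tabs (`min_run`) and the function name are assumptions; the actual check in the program mentioned above is not shown in the thread.

```python
import re


def has_excess_whitespace(subject, min_run=4):
    """True if the Subject contains an inner run of min_run+ spaces/tabs."""
    pattern = r"[ \t]{%d,}" % min_run
    # Leading and trailing whitespace is stripped first, so only padding
    # between visible words counts.
    return re.search(pattern, subject.strip()) is not None


if __name__ == "__main__":
    padded = "what if it were true?" + " " * 30 + "5258pf2"
    print(has_excess_whitespace(padded))
    print(has_excess_whitespace("Re: Random musing about words and spam"))
```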
participants (5)
- Andrew Thomas
- Eric Murray
- John Kozubik
- Steve Furlong
- Thomas Shaddack