
On 9 Aug 2001, Dr. Evil is alleged to have written:
I have a question for you c'punks. If you wanted to generate some bitmaps of text which would be difficult or impossible to OCR, but not too difficult for humans to read, how would you do that? Basically, I want to create GIFs of text which can't be OCRed in a reliable way. I've thought about some things: I can put in noise pixels, I can blur the text, I can rotate, shear, and otherwise distort it.
It depends a lot on your threat model. If the people who want a copy are determined enough, they'll just retype it :-) If you're trying to make signs that video cameras can't read, that's a different problem than trying to publish comic books that teenagers with too much time on their hands can't scan, or trying to publish source code on paper so that your customers can inspect the crypto without being able to scan/modify/compile it. (The latter may satisfy the GNU General Public License (:-), but isn't particularly useful for crypto, because people can't use it to produce a binary they can trust...)

If your problem is to make the OCR job require enough manual tweaking that the reader might as well just retype it, here's what I'd do: split up each letter into multiple pieces, using different colors for the different parts of the letter, and vary the color maps across the page. Also do this for the background space. And dither the pieces!

OCRs usually work by identifying features of the letter (vertical on the left, horizontal in the middle, vertical in the lower right, etc.), after deciding what parts are in the letter and what aren't. So instead of having to find the black stuff on the white background, or the yellow stuff on the blue background, it's having to find the green and cyan dither stuff and the aqua and turquoise dither stuff on the blue and indigo dither background and the indigo and purple background, and further down the page you've shuffled other colors in and out of the mix. So even if it's smart enough to edge-detect blobs of dithered stuff on top of other dithered stuff, the blobs don't add up to recognizable letters - they add up to fragments that only become a letter if you put them all together successfully.
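A rough sketch of the split-and-dither idea above, using only the Python standard library. The 5x7 glyph, the colour values, the single horizontal split, and the function names are all illustrative assumptions, not anything from a real OCR-resistance tool; a serious version would cut letters into more pieces and vary the palette across the page.

```python
import random

# Hypothetical 5x7 bitmap for the letter "A"; '#' marks a letter pixel.
GLYPH_A = [
    ".###.",
    "#...#",
    "#...#",
    "#####",
    "#...#",
    "#...#",
    "#...#",
]

# Illustrative colour pairs (RGB); each piece is dithered between its two colours.
LETTER_PAIRS = [((0, 160, 0), (0, 200, 200)),       # green / cyan-ish
                ((0, 255, 255), (64, 224, 208))]    # aqua / turquoise
BACKGROUND_PAIRS = [((0, 0, 255), (75, 0, 130)),    # blue / indigo
                    ((75, 0, 130), (128, 0, 128))]  # indigo / purple


def render(glyph, scale=4, seed=None):
    """Return (pixels, mask): rows of RGB tuples and rows of is-letter booleans."""
    rng = random.Random(seed)
    height, width = len(glyph) * scale, len(glyph[0]) * scale
    # Row where the upper and lower pieces meet; it cuts through the letter.
    split = rng.randrange(height // 3, 2 * height // 3)
    pixels, mask = [], []
    for y in range(height):
        piece = 0 if y < split else 1
        pix_row, mask_row = [], []
        for x in range(width):
            is_letter = glyph[y // scale][x // scale] == "#"
            pairs = LETTER_PAIRS if is_letter else BACKGROUND_PAIRS
            pix_row.append(pairs[piece][(x + y) % 2])  # checkerboard dither
            mask_row.append(is_letter)
        pixels.append(pix_row)
        mask.append(mask_row)
    return pixels, mask


def write_ppm(path, pixels):
    """Write the pixels as a binary PPM (P6) file for viewing."""
    with open(path, "wb") as f:
        f.write(b"P6\n%d %d\n255\n" % (len(pixels[0]), len(pixels)))
        for row in pixels:
            for rgb in row:
                f.write(bytes(rgb))
```

Calling `write_ppm("a.ppm", render(GLYPH_A, seed=1)[0])` gives a small image to look at. No single colour covers the whole letter or the whole background, which is the property the scheme relies on.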

It depends a lot on your threat model. If the people who want a copy are determined enough, they'll just retype it :-)
Which is exactly what I want! Basically, I need to create a web page which is "humans only beyond this point". One task that humans can do easily and reliably is read messy characters. Computers can't. But computers can generate messy characters. Therefore, computers can detect whether they are interacting with other computers, or humans.

There are all kinds of other threat models. I know that JYA often receives redacted stuff, and he puts it on a photocopier, and he is often able to enhance the contrast and read stuff that has been blacked out. Cool! But I'm working on a different problem. Basically, I have a web site that lets you reserve domain names before you pay for them. I want to make sure that no loser out there decides to be cool and write a script which reserves every word in the dictionary, or every sequence of eight characters, or some moronic thing like that. So I will have the page display three characters, somewhat blurry, and say, "type these characters here!" If they don't match, you're not human! (Why didn't they think of this simple method in Terminator and Blade Runner?) This same moron could sit there and type domain names all day long, but that's enough punishment in itself.

This use would apply to any kind of site that lets you register (or otherwise consume resources) for free, and people might have some motive for creating an auto-script. I'm also going to use the same system on a financial system I'm working on to prevent automated transactions, as part of my anti-money-laundering effort.
So instead of having to find the black stuff on the white background, or the yellow stuff on the blue background, it's having to find the green and cyan dither stuff and the aqua and turquoise dither stuff on the blue and indigo dither background and the indigo and purple background, and further down the page you've shuffled other colors in and out of the mix.
That's a good idea. If some moron decides to somehow come up with an OCR good enough to read stuff like this: http://www.sidereal.kz/~drevil/anti-ocr.png, then I'll have to move to more advanced methods.
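The "type these characters" check described in this message could be sketched on the server side roughly as follows. This is only an assumption about how such a check might be wired up: `SERVER_KEY`, `ALPHABET`, `new_challenge` and `check_answer` are hypothetical names, the image rendering is left out, and the HMAC token is just one way to avoid storing each challenge.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real site would load it from its config.
SERVER_KEY = secrets.token_bytes(32)
# Characters that are easy to confuse when blurred (0/O, 1/I) are left out.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def _tag(nonce, text):
    """Keyed digest binding the challenge text to its nonce."""
    return hmac.new(SERVER_KEY, (nonce + text).encode(), hashlib.sha256).hexdigest()


def new_challenge(length=3):
    """Return (text, token): text goes into the blurry image, token into the form."""
    text = "".join(secrets.choice(ALPHABET) for _ in range(length))
    nonce = secrets.token_hex(8)
    return text, nonce + ":" + _tag(nonce, text)


def check_answer(token, typed):
    """True if the typed characters match the challenge the token was issued for."""
    nonce, _, tag = token.partition(":")
    guess = typed.strip().upper()
    return hmac.compare_digest(_tag(nonce, guess), tag)
```

With three characters from this alphabet a blind guess succeeds about once in 32768 tries, and as written a solved token could be replayed, so a real deployment would presumably also want expiry or single-use tokens.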

On 10 Aug 2001, Dr. Evil wrote:
blacked out. Cool! But I'm working on a different problem. Basically, I have a web site that lets you reserve domain names before you pay for them. I want to make sure that no loser out there decides to be cool and write a script which reserves every word in the dictionary, or every sequence of eight characters, or some moronic thing like that. So I will have the page display three characters, somewhat blurry, and say, "type these characters here!" If they don't match, you're not human! (Why didn't they think of this simple method in Terminator and Blade Runner?) This same moron could sit there and type domain names all day long, but that's enough punishment in itself.
This is a case where I'd make them do some kind of computation before they could register a name. For example: "here's a number, and here's a downloadable utility that does squaring under a modulus. Tell me what this number is, squared N times, under modulus X, and I'll let you register a domain name."

So, your typical user has to wait thirty seconds, which is no big deal, but the guy who's trying to register every word in a million-word dictionary is going to have to harness truly massive computing resources in order to do so. You can even linearize the computation (meaning it won't do them any good to sic multiple CPUs on it) if you make them submit numbers in a sequence for multiple registrations. (i.e., first registration is number squared N times, second is number squared 2N times, third is number squared 6N times, etc....)

Or, if you are keeping track of who registers what, which of course you must be for "register" to have any meaning, why not just refuse the tenth and subsequent registrations for any particular address? Even if the addresses are masked, you can still compare hashes of them.

Bear
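Both suggestions above can be sketched in a few lines of Python. Everything named here is hypothetical. The fast server-side check is an added assumption, not something stated in the message: if the server builds the modulus X from two secret primes, Euler's theorem lets it reduce the exponent 2^N modulo phi(X) first, the same trick used in time-lock puzzles. The primes below are well-known Mersenne primes, chosen only so the example runs; they give no security.

```python
import hashlib
from math import gcd

# Toy parameters (assumptions): real use would need large, secret primes.
P, Q = 2**31 - 1, 2**61 - 1
X = P * Q                  # public modulus handed to the client
PHI = (P - 1) * (Q - 1)    # known only to the server


def client_work(start, n, modulus):
    """Square `start` n times in sequence; each step needs the previous result."""
    value = start % modulus
    for _ in range(n):
        value = value * value % modulus
    return value


def server_check(start, n, answer):
    """Verify cheaply: start^(2^n) == start^(2^n mod phi(X)) (mod X)."""
    if gcd(start, X) != 1:
        raise ValueError("start must be coprime to the modulus")
    return pow(start, pow(2, n, PHI), X) == answer


class RegistrationQuota:
    """Refuse the tenth and later registrations per address, keyed by hash."""

    def __init__(self, limit=9):
        self.limit = limit
        self.counts = {}

    def allow(self, address):
        key = hashlib.sha256(address.encode()).hexdigest()
        used = self.counts.get(key, 0)
        if used >= self.limit:
            return False
        self.counts[key] = used + 1
        return True
```

The client has no shortcut without the factors of X, so N sets the waiting time, while the server's check costs two modular exponentiations regardless of N.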
participants (3): Bill Stewart, Dr. Evil, Ray Dillinger