I have a question for you c'punks. If you wanted to generate some bitmaps of text which would be difficult or impossible to OCR, but not too difficult for humans to read, how would you do that? Basically, I want to create GIFs of text which can't be OCRed in a reliable way. I've thought about some things: I can put in noise pixels, I can blur the text, I can rotate, shear, and otherwise distort it. Anything else I should do? Will these tricks work? Thanks
Write by hand. Change your handwriting style part-way through.

The other option, which will take you *much* longer to set up, is to use contrasting colours of background and text, but make them change, like those cards for detecting colour blindness. So the background would be a smoothly changing mess of colours and the text would, in effect, be a mask into a different smoothly changing mess of colours. If you randomised it you could get stuff where the chance of suitable contrast was high enough that enough text had a sharp enough edge to read, but enough of it was blurred to cause confusion to OCR. Not to mention migraines to humans.

If the colours were chosen carefully, off-the-shelf OCR would also be confused because the contrast of the edges would be colour, not intensity. OTOH it would be circumventable by enhancing colour contrast then switching to monochrome & enhancing edges. Anything you can do in 10 minutes with Paint Shop Pro is probably doable by the Men In Black sooner or later.

To be honest, the whole thing is likely to be worth more as a tool for generating retro-Op Art or Pop Art images. Write the code & pray for a revival of free festivals. For obfuscation, stick to handwriting.

As an afterthought, the experts in this must be the people who print banknotes. Real ones, I mean, not your boring US green ones that are all the same size and colour so foreigners can't tell them apart and you have to employ millions of Secret Service agents to stop forgers. I bet the Bank of England go on about it on their website.

Ken Brown

"Dr. Evil" wrote:
I have a question for you c'punks. If you wanted to generate some bitmaps of text which would be difficult or impossible to OCR, but not too difficult for humans to read, how would you do that? Basically, I want to create GIFs of text which can't be OCRed in a reliable way.
pardon me for top-posting...
On Thu, 9 Aug 2001, Ken Brown wrote:
As an afterthought, the experts in this must be the people who print banknotes. Real ones, I mean, not your boring US green ones that are all the same size and colour so foreigners can't tell them apart and you have to employ millions of Secret Service agents to stop forgers. I bet the Bank of England go on about it on their website.
Try the Euro notes. The firms contracted to print them are rumoured to have experienced serious trouble with the first batches. But that is hardly the same problem. I don't think the same precautions apply, as we aren't trying to protect the information on top of the note but the identity of the physical note itself. Clearly you cannot do that when dealing with GIFs.

Sampo Syreeni, aka decoy, mailto:decoy@iki.fi, gsm: +358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
On 9 Aug 2001, Dr. Evil wrote:
I have a question for you c'punks. If you wanted to generate some bitmaps of text which would be difficult or impossible to OCR, but not too difficult for humans to read, how would you do that? Basically, I want to create GIFs of text which can't be OCRed in a reliable way. I've thought about some things: I can put in noise pixels, I can blur the text, I can rotate, shear, and otherwise distort it.
Some ideas: Start with a highly ornate script font. Anti-alias. Try a font with lots of gaps and other topology breaking features. Pluck out a decent perceptual model from one of the better image compressors and try doing maximum modifications beneath a given perceptual error bound. Low contrast, with information encoded in the hue channel. (Dead trees: Use a colorless, fluorescent ink, or a combination of such inks to throw off the scanner. Print your stuff on extremely heat and/or light sensitive paper.)

Sampo Syreeni, aka decoy, mailto:decoy@iki.fi, gsm: +358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
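For anyone who wants to play with this, the noise and shear from the question and the anti-aliasing suggested above can be sketched in a few lines of plain Python. This is only a rough illustration: the bitmap is a list of rows of 0/1 pixels, and the function names, the slope, and the noise probability are all made up for the example rather than taken from anyone's actual generator.

```python
import random

def shear(rows, slope=0.5):
    """Shift each row right in proportion to its height above the bottom row."""
    h = len(rows)
    extra = int(slope * (h - 1) + 0.5)           # widest shift, used as padding
    out = []
    for y, row in enumerate(rows):
        dx = int(slope * (h - 1 - y) + 0.5)      # top rows move furthest
        out.append([0] * dx + list(row) + [0] * (extra - dx))
    return out

def add_noise(rows, p=0.05, seed=0):
    """Flip each pixel with probability p (hypothetical default)."""
    rng = random.Random(seed)
    return [[1 - px if rng.random() < p else px for px in row] for row in rows]

def antialias(rows):
    """Downsample 2x by averaging 2x2 blocks, giving grey levels 0, 0.25, ... 1."""
    h, w = len(rows) // 2 * 2, len(rows[0]) // 2 * 2
    return [[(rows[y][x] + rows[y][x + 1] + rows[y + 1][x] + rows[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

# An 8x8 vertical bar standing in for a rasterised letter.
bar = [[1 if 3 <= x <= 4 else 0 for x in range(8)] for _ in range(8)]
sheared = shear(bar)
smooth = antialias(add_noise(sheared))
for row in smooth:
    print(" ".join(f"{v:.2f}" for v in row))
```

Rotation and swirl would follow the same pattern with a different coordinate mapping. Whether any of this defeats a given OCR package would have to be tested against that package.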
Decoy writes:
Some ideas: Start with a highly ornate script font. Anti-alias. Try a font with lots of gaps and other topology breaking features. Pluck out a decent perceptual model from one of the better image compressors and try doing maximum modifications beneath a given perceptual error bound. Low contrast, with information encoded in the hue channel.
Thank you, excellent suggestions. I put in the anti-aliasing thing and that is clearly something which will piss off an OCR device. Here's an example of my current attempt at something which would be nigh-impossible to OCR reliably:

http://www.sidereal.kz/~drevil/anti-ocr.png

What do you think? It's got some shear, some rotate, some spread, and some swirl. And anti-aliasing, which was a great suggestion. I'm going to experiment with adding some dotted lines, too.

Ken Brown writes:
Write by hand. Change your handwriting style part-way through.
As you may be aware, doctors have notoriously bad handwriting, but alas, the text must be machine-generated.
If the colours were chosen carefully, off-the-shelf OCR would also be confused because the contrast of the edges would be colour, not intensity. OTOH it would be circumventable by enhancing colour contrast then switching to monochrome & enhancing edges. Anything you can do in
Right, screwing around with color contrast may not be effective at all, because those things are very easy to take out by just editing the color map, and suddenly it's sharp and clear.
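To make the point concrete, here is a small sketch of why colour-only contrast is fragile. It assumes the common Rec. 601 luma weights as a stand-in for whatever greyscale conversion an OCR front end uses, and the colours and the helper `match_luma` are hypothetical. Text and background are chosen to have nearly the same brightness, so a greyscale conversion sees almost no edge, yet looking at the red channel alone brings the edge straight back.

```python
def luma(rgb):
    """Approximate perceived brightness (Rec. 601 weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def match_luma(target, r, b):
    """Pick a green value so that (r, g, b) has roughly the target luma."""
    g = (target - 0.299 * r - 0.114 * b) / 0.587
    return (r, max(0, min(255, round(g))), b)

text_colour = (200, 50, 50)                        # reddish text (assumed)
background = match_luma(luma(text_colour), 50, 50) # greenish, same brightness

grey_edge = abs(luma(text_colour) - luma(background))  # what a mono OCR sees
red_edge = abs(text_colour[0] - background[0])         # what one channel sees
print(background, round(grey_edge, 2), red_edge)
```

An attacker does not even need to know which channel carries the text; trying each channel in turn is cheap.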
10 minutes with Paint Shop Pro is probably doable by the Men In Black sooner or later.
Ah, but the Men in Black are not the threat in this case. Au contraire! This is part of an anti-money-laundering thing.
As an afterthought, the experts in this must be the people who print banknotes. Real ones, I mean, not your boring US green ones that are all the same size and colour so foreigners can't tell them apart and you have to employ millions of Secret Service agents to stop forgers. I bet the Bank of England go on about it on their website.
I'll check there. They are mostly concerned with a different problem, which is duplication. I want to make stuff that humans can read, and machines can't. It's all going to be PNGs, so machines can duplicate it no problem.
At 06:14 PM 8/9/01 -0000, Dr. Evil wrote:
If the letters were *overlapping* it would be *much* tougher to parse. The computation is just addition. Plus you don't waste all that kerning space :-) Jittering the baseline would also help. Eventually you can sell your barely-readable font to _Wired_
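A toy version of the overlap idea, as a sketch only: the 5x5 glyphs below are invented for the example (a real generator would rasterise a font), and `advance` and `jitter` are hypothetical parameters. Each glyph is added onto a shared canvas with an advance smaller than its width, so neighbours overlap, and each glyph's vertical position is jittered.

```python
import random

# Hypothetical 5x5 glyphs, "1" = ink.
GLYPHS = {
    "I": ["11111", "00100", "00100", "00100", "11111"],
    "L": ["10000", "10000", "10000", "10000", "11111"],
    "T": ["11111", "00100", "00100", "00100", "00100"],
}

def render(text, advance=3, jitter=1, seed=0):
    """Place 5-pixel-wide glyphs every `advance` pixels so neighbours overlap."""
    rng = random.Random(seed)
    glyph_w, glyph_h = 5, 5
    width = advance * (len(text) - 1) + glyph_w
    height = glyph_h + 2 * jitter
    canvas = [[0] * width for _ in range(height)]
    for i, ch in enumerate(text):
        x0 = i * advance
        y0 = jitter + rng.randint(-jitter, jitter)   # jittered baseline
        for gy, row in enumerate(GLYPHS[ch]):
            for gx, bit in enumerate(row):
                # "Just addition": sum the ink, saturating at 1.
                cell = canvas[y0 + gy][x0 + gx] + int(bit)
                canvas[y0 + gy][x0 + gx] = min(1, cell)
    return canvas

def show(canvas):
    return "\n".join("".join("#" if p else "." for p in row) for row in canvas)

print(show(render("TIL")))
```

With `advance=3` the three glyphs fit in 11 columns instead of 15, and a segmenter can no longer find blank columns between the letters.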
At 08:08 AM 8/9/01 -0000, Dr. Evil wrote:
I have a question for you c'punks. If you wanted to generate some bitmaps of text which would be difficult or impossible to OCR, but not too difficult for humans to read, how would you do that? Basically, I want to create GIFs of text which can't be OCRed in a reliable way. I've thought about some things: I can put in noise pixels, I can blur the text, I can rotate, shear, and otherwise distort it. Anything else I should do? Will these tricks work?
Ultimately if humans can read it, a machine can, unless you believe humans are supernatural. However, we're frequently ignorant of how to tell machines to perform as well as us. If you create letters by staggering stripes, the OCR will have a hell of a time. The letter I:

----__---
----__---
----__---
----__---
----__---
----__---

Also reversing the contrast (in stripes across the letter) will disrupt simpler OCR edge tracers, though this camouflage may impair human readability too.
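Both stripe tricks are easy to prototype on a character-cell bitmap like the one above. The sketch below is an assumption-laden toy: the band height and the sideways shift are made-up values, not ones from this post, and `-` and `_` simply stand for the two contrast levels.

```python
def stagger(rows, shift=2, band=2):
    """Shift alternate horizontal bands of a bitmap sideways by `shift` cells."""
    out = []
    for y, row in enumerate(rows):
        offset = shift if (y // band) % 2 else 0
        # Pad both sides so every row keeps the same width.
        out.append("-" * offset + row + "-" * (shift - offset))
    return out

def invert_bands(rows, band=2):
    """Swap the two contrast levels in alternate horizontal bands."""
    swap = str.maketrans("-_", "_-")
    return [row.translate(swap) if (y // band) % 2 else row
            for y, row in enumerate(rows)]

LETTER_I = ["----__---"] * 6

print("\n".join(stagger(LETTER_I)))
print()
print("\n".join(invert_bands(LETTER_I)))
```

A tracer that follows the vertical edge of the stroke loses it at each band boundary in the first case, and sees the edge polarity flip in the second.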
participants (5): David Honig, Dr. Evil, Ken Brown, Phillip H. Zakas, Sampo Syreeni