Re: OCR and Machine Readable Text

At 08:06 PM 1/2/97 -0800, Bill Stewart wrote:
It's really embarrassing to have to pay salaries of "public employees" who can't come up with better arguments than the paper/magnetic/OCR nonsense but don't have the guts to stop trying and admit they're wrong. Does the President still make $200K/year salary? You'd think he'd either read what he signs or tell his employees to only ask him to sign at least half-way credible stuff. The old regulations used to pretend that foreigners were too dumb to implement computer programs from algorithms; now they're pretending that foreigners are too dumb to type.* People used to say we have the best politicians money can buy, but you ought to be able to buy better politicians than that.
Or better excuses.
At 10:31 AM 12/30/96 -0800, Tim May wrote:
And not only is OCR able these days to handle general fonts easily enough, but almost all printed code is in fixed-width fonts, i.e., non-proportional fonts. This makes OCR easy.
The basic difference between "easily OCRed source code" and "not easily OCRed source code" is pretty much limited to two things:

1) Half-decent print quality (black on white in Courier at 300dpi should do). As Tim says, this stuff is child's play. Back when OCRs were $10,000 machines with cutting-edge 68010 processors, reading Courier was pretty easy, but it helped to put in checksums; these days you don't really need that. (It also didn't like wet-process 240-dpi laser printing or faxes, but modern OCR software can generally deal with good-quality faxes.)

2) Bound pages vs. loose pages (printing with perforated pages or selling the source code in loose-leaf might count as an "attractive nuisance" :-), but a band-saw can solve that problem for the OCR user unless it's printed on Tyvek or something silly :-)
Even an X-Acto knife would work. For proportional fonts it depends on how nasty the kerning gets and the shape of the characters. (A sans-serif font without too many kerning pairs should go through fine.) The technology for this has progressed quite a bit in the last few years. Next thing you know, OCR software will be export controlled as well. (Or they will require something silly, like having all code samples in calligraphy fonts.)
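The per-line checksums mentioned above are easy to mechanize. Here is a minimal sketch of the idea; the function names and the `;`-delimited checksum column are my own invention for illustration, not the format any actual printed source book used:

```python
import zlib

def line_checksum(line: str) -> str:
    # CRC-32 of the line text, printed as an 8-digit hex tag.
    return format(zlib.crc32(line.encode("ascii")) & 0xFFFFFFFF, "08x")

def annotate(source: str) -> str:
    # Append a checksum column to each printed line so a reader
    # (or an OCR pipeline) can detect mis-read lines.
    return "\n".join(f"{line:<60};{line_checksum(line)}"
                     for line in source.splitlines())

def verify(annotated: str) -> list:
    # Return the numbers of lines whose text no longer matches its tag.
    # Assumes the original source lines carry no trailing whitespace.
    bad = []
    for n, row in enumerate(annotated.splitlines(), 1):
        text, _, tag = row.rpartition(";")
        if line_checksum(text.rstrip()) != tag:
            bad.append(n)
    return bad
```

After OCRing the printed page, `verify` pinpoints exactly which lines need human proofreading instead of forcing a re-check of the whole listing.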
In the Karn case, the Feds made the silly argument that the floppy disk version had the files neatly separated, while the paper version split files between pages and had page numbers at the bottoms of the pages that weren't part of the source code. Even the $10K 68010 wonderbox could handle page headers/footers and margins, and modern software can do decent translations into different word processor formats.
And even if it didn't, just selecting and deleting the margin areas would not be all that difficult. (Ooohhh... A couple of extra hours is really going to slow someone down.)
For just the amount of money we've spent (in our consulting fees) on discussing just this issue of OCRing, the entire content of the MIT PGP source code book AND Schneier's AC could have been manually input by Barbadians or Botswanans, or probably even by Europeans.
I used to work for a company that would transfer entire archives of medical journals. Much of it we would just OCR. Some of it we would send offshore. The OCR software was about 95% reliable, and this was over 5 years ago. (And we were using 286 boxes for much of the OCR work. Not a heavy technological investment.) I am sure that things have improved a great deal since then. (My new scanner included OCR software. I will have to run a test and report the findings.)
There's one German university that OCRed the MIT PGP source code book. The PGP folks passed out copies of their new 3.0 Pre-Alpha and an update at a recent Cypherpunks meeting. See http://www.pgp.com/newsroom/sourcebook.cgi for ordering information. It's been donated to some local libraries, such as San Jose CA, and I hope they'll send it to the Library of Congress and various non-US university and other public libraries - the recent rules change clarifying that it's ok to export source code should make this much easier.
The page listed does not contain order information. Do you know costs and/or order info?
[* OK, it's not really possible to type or proofread Perl code accurately :-)
Yeah, just look at what happened with John Orwant's _Perl 5 Interactive Course_. The book is being recalled due to all the typographical errors from the publisher. Reading some Perl code is also quite impossible. (For the reasons behind this, I recommend Charlie Stross's article on the topic on page 36 of _The Perl Journal_ #4.)
More to the point, OCRs aren't always real good about `backquotes' and other little blotchy marks that some languages use, and even humans don't always get them right.]
Many character sets are not very good at displaying "little used" characters clearly. (Some of the cheaper fonts do not even include them.) Backticks are a special problem. The latest Camel book has all sorts of problems with hard-to-recognise backticks.

BTW, there is an article on Perl and randomness in The Perl Journal #4 by John Orwant. Pretty basic for most Cypherpunks, but good reading nonetheless...

---
| If you're not part of the solution, You're part of the precipitate. |
|"The moral PGP Diffie taught Zimmermann unites all| Disclaimer:          |
| mankind free in one-key-steganography-privacy!"  | Ignore the man       |
|`finger -l alano@teleport.com` for PGP 2.6.2 key  | behind the keyboard. |
| http://www.ctrl-alt-del.com/~alan/               | alan@ctrl-alt-del.com|

Alan Olsen wrote:
I used to work for a company that would transfer entire archives of medical journals. Much of it we would just OCR. Some of it we would send offshore. The OCR software was about 95% reliable, and this was over 5 years ago. (And we were using 286 boxes for much of the OCR work. Not a heavy technological investment.) I am sure that things have improved a great deal since then. (My new scanner included OCR software. I will have to run a test and report the findings.)
I'd like to know what OCR software you were using. All the tests we completed at my place of employment were very poor quality-wise. We saw a 65% accuracy rate. Not very good when you need to transfer a five-year backlog of medical and technical journals. This was using a high-resolution scanner with the package that was bundled along with it. About a year ago, my employer considered transferring data taken off of forms into a relational database using an OCR program. Again, we found the results to be too inaccurate for our needs. I may have just been using the wrong programs for the job, but the findings were depressing...

panther

Accuracy will depend on the quality of the original being scanned, as well as the capability of the OCR system; flat originals scan much better than the "bent open" pages of a book or magazine, heavy stock tends to let less "bleed" through from the reverse side, fonts with extreme kerning are more difficult, point size is a factor, etc.

I've seen 97%+ w/ Calera (about 2 years ago) when using flat, first-generation, high-quality photocopies w/ minimal skew and Courier or similar typeface. OTOH, the same system did not scan well at all w/ badly skewed photocopies (caused by the "bend" induced by the binding of the original).

If you are scanning medical journals, take a look at your originals and also at where the errors are occurring. You can also use a spell checker (after building up a suitable dictionary for your application) to cut out some of the error. I'd guess your results to be less satisfactory for other applications where extreme accuracy is a must. "3", "8", and "B", for example, are often confused; not a big problem w/ a medical journal, but plays havoc w/ code, accounting data, etc.

-r.w.

On Fri, 3 Jan 1997, /**\anonymous/**\ wrote:
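The 3/8/B confusion lends itself to a crude post-processing pass when you know a field must be numeric. A minimal sketch follows; the confusion table and function name are my own assumptions for illustration, not taken from any real OCR package:

```python
# Crude OCR post-processing: map letter shapes commonly confused
# with digits back to digits, for fields known to be numeric.
CONFUSABLES = {"O": "0", "o": "0", "l": "1", "I": "1",
               "B": "8", "S": "5", "Z": "2", "g": "9"}

def fix_numeric_field(field: str) -> str:
    # Only safe when the field is guaranteed numeric (e.g. an
    # accounting column); applied to prose it would mangle real letters.
    return "".join(CONFUSABLES.get(c, c) for c in field)
```

The point is that context recovers what the character recognizer alone cannot: a "B" inside a dollar amount is almost certainly an "8", while the same glyph in a journal abstract must be left alone.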
Alan Olsen wrote:
I used to work for a company that would transfer entire archives of medical journals. Much of it we would just OCR. Some of it we would send offshore. The OCR software was about 95% reliable, and this was over 5 years ago. (And we were using 286 boxes for much of the OCR work. Not a heavy technological investment.) I am sure that things have improved a great deal since then. (My new scanner included OCR software. I will have to run a test and report the findings.)
I'd like to know what OCR software you were using. All the tests we completed at my place of employment were very poor quality-wise. We saw a 65% accuracy rate. Not very good when you need to transfer a five-year backlog of medical and technical journals. This was using a high-resolution scanner with the package that was bundled along with it. About a year ago, my employer considered transferring data taken off of forms into a relational database using an OCR program. Again, we found the results to be too inaccurate for our needs. I may have just been using the wrong programs for the job, but the findings were depressing...
panther

/**\anonymous/**\ allegedly said:
Alan Olsen wrote:
I used to work for a company that would transfer entire archives of medical journals. Much of it we would just OCR. Some of it we would send offshore. The OCR software was about 95% reliable, and this was over 5 years ago. (And we were using 286 boxes for much of the OCR work. Not a heavy technological investment.) I am sure that things have improved a great deal since then. (My new scanner included OCR software. I will have to run a test and report the findings.)
I'd like to know what OCR software you were using. All the tests we completed at my place of employment were very poor quality-wise. We saw a 65% accuracy rate. Not very good when you need to transfer a five-year backlog of medical and technical journals. This was using a high-resolution scanner with the package that was bundled along with it. About a year ago, my employer considered transferring data taken off of forms into a relational database using an OCR program. Again, we found the results to be too inaccurate for our needs. I may have just been using the wrong programs for the job, but the findings were depressing...
My understanding is that the most efficient way of inputting text is "double typing," where two people type the same document and a mechanical comparison of the results is used to find errors.

--
Kent Crispin                               "No reason to get excited,"
kent@songbird.com, kc@llnl.gov             the thief he kindly spoke...
PGP fingerprint: 5A 16 DA 04 31 33 40 1E 87 DA 29 02 97 A3 46 2F
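The double-typing comparison described above is trivial to mechanize. A minimal sketch, assuming both typists preserve line breaks (the function name is mine, for illustration only):

```python
def double_key_check(pass_a: str, pass_b: str) -> list:
    # Compare two independently typed transcriptions line by line.
    # Returns (line_number, version_a, version_b) for each mismatch;
    # an error survives only if both typists make the identical slip.
    a_lines = pass_a.splitlines()
    b_lines = pass_b.splitlines()
    if len(a_lines) != len(b_lines):
        raise ValueError("transcriptions differ in line count; realign first")
    return [(n, a, b)
            for n, (a, b) in enumerate(zip(a_lines, b_lines), 1)
            if a != b]
```

Every flagged line goes back to a human with the original page; since independent typists rarely make the same mistake in the same place, the residual error rate is far below either typist's individual rate.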

I have used OCR a fair bit, and I agree with you; I think you're being generous in citing even a 65% accuracy rate. I think our OCR technology today is pathetic, and it would be quicker just to type the damn documents ourselves. I've used a bunch of different packages from guys like HP and others. I certainly don't know what Alan Olsen was using.

Then again, it obviously depends on the quality of the documents you are scanning. If you had perfect, crisply printed, beautiful documents, then maybe you'd get a good accuracy rate. But nice documents are usually ones generated recently, therefore probably already on the computer, and so they don't even need to be scanned. You see what I'm getting at: all the documents we don't have on the computer are usually older ones, and therefore of lesser quality, so that's why our OCR fails almost more often than not.

I guess I'm being a little harsh; I mean, this type of technology is quite revolutionary and actually quite amazing, but it's far, far from perfect. Just my 2 cents...

Steve

Steve Stewart wrote:
I have used OCR a fair bit, and I agree with you; I think you're being generous in citing even a 65% accuracy rate. I think our OCR technology today is pathetic, and it would be quicker just to type the damn documents ourselves. I've used a bunch of different packages from guys like HP and others. I certainly don't know what Alan Olsen was using.
[snip] I needed OCR to create indexed text databases of federal documents, particularly legislation. The amount of hand editing required is enormous. That alone would justify (in a sense) the use of offshore labor.
participants (6)
-
/**\anonymous/**\
-
Alan Olsen
-
Dale Thorn
-
Kent Crispin
-
Rabid Wombat
-
Steve Stewart