New subject: OCR and Machine Readable Text

17 Dec 2003

      At 08:06 PM 1/2/97 -0800, Bill Stewart wrote:
...
It's really embarassing to have to pay salaries of "public employees"
who can't come up with better arguments than the paper/magnetic/OCR nonsense
but don't have the guts to stop trying and admit they've wrong.
Does the President still make $200K/year salary?  You'd think he'd either
read what he signs or tell his employees to only ask him to sign
at least half-way credible stuff.  The old regulations used to pretend that
foreigners were too dumb to implement computer programs from algorithms;
now they're pretending that foreigners are too dumb to type.*
People used to say we have the best politicians money can buy, 
but you ought to be able to buy better politicians than that.
Or beter excuses.
...
At 10:31 AM 12/30/96 -0800, Tim May wrote:
...
And not only is OCR able these days to handle general fonts easily enough,
but almost all printed code is in fixed-width fonts, i.e., non-proportional
fonts. This makes OCR easy.
The basic difference between "easily OCRed source code" and "not easily
OCRed source code" is pretty much limited to two things
1) Half-decent print quality (black on white in Courier at 300dpi should
do....)
       As Tim says, this stuff is child's play.  Back when OCRs were
       $10,000 machines with cutting-edge 68010 processors,
       reading Courier was pretty easy but it helped to put in checksums;
       these days you don't really need that.  (It also didn't like
       wet-process 240-dpi laser printing or faxes, but modern OCR software
       can generally deal with good-quality faxes and 
2) Bound pages vs. loose pages (printing with perforated pages
       or selling the source code in loose-leaf might count as an
       "attractive nuisance" :-), but a band-saw can solve that problem
       for the OCR user unless it's printed on Tyvek or something silly :-)
Even an exacto-knife would work.  For proportional fonts it depends on how
nasty the kerning gets and the shape of the characters.  (And a san serif
font without too many kerning pairs should go though fine.)  The technology
for this has progressed quite a bit in the last few years.  

Next thing you know, OCR software will be export controlled as well.  (Or
they will require something silly, like having all code samples in
caligraphy fonts.)
...
In the Karn case, the Feds made the silly argument that the
floppy disk version had the files neatly separated, while the
paper version split files between pages and had page numbers at the
bottoms of the pages that weren't part of the source code.
Even the $10K 68010 wonderbox could handle page headers/footers and margins, 
and modern software can do decent translations into different
word processor formats.
And even if it didn't, just selecting and deleting the margin areas would
not be all that difficult.  (Ooohhh...  A couple of extra hours is really
going to slow someone down.)
...
...
For just the amount of money we've spent (in our consulting fees) on
discussing just this issue of OCRing, the entire content of the MIT PGP
source code book AND Schneier's AC could have been manually inputted by
Barbadans or Botswanas, or probably even by Europeans.
I used to work for a company that would transfer entire archives of medical
journals.  Much of it we would just OCR.  Some of it we would send off
shore.  The OCR software was about 95% reliable and this was over 5 years
ago.  (And we were using 286 boxes for much of the OCR work.  Not a heavy
technoligical investment.)  I am sure that things have improved a great
deal since then.  (My new scanner included OCR software.  I will have to
run a test and report the findings.
...
There's one German university that OCRed the MIT PGP source code book.
The PGP folks passed out copies of their new 3.0 Pre-Alpha and an update
at a recent Cypherpunks meeting.  See
http://www.pgp.com/newsroom/sourcebook.cgi 
for ordering informaiton.  It's been donated to some local libraries, 
such as San Jose CA, and I hope they'll send it to the Library of Congress
and various non-US university and other public libraries - the recent
rules change clarifying that it's ok to export source code should make
this much easier.
The page listed does not contain order information.  Do you know costs
and/or order info?
...
[* OK, it's not really possible to type or proofread perl code accurately :-)
Yeah, just look at what happened with John Orwant's _Perl 5 interactive
Course_.  The book is being recalled due to all the typographical errors
from the pubisher.  Reading some Perl code is also quite impossible.  (For
the reasons behind this, i recommend Charlie Stross's article on the tpoic
on page 36 of _The Perl Journal_ #4.)
...
More to the point, OCRs aren't always real good about `backquotes' and other
little blotchy marks that some languages use, and even humans don't always
get them right.]
Many character sets are not very good at displaying "little used"
characters clearly.  (Some of the cheaper fonts do not even include them.)
Backticks are a special problem.  The latest Camel book has all sorts of
problems with hard to recognise backticks.

BTW, there is an article on Perl and randomness in The Perl Journal #4 by
John Orwant. Pretty basic for most Cypherpunks, but good reading none the
less...

---
|   If you're not part of the solution, You're part of the precipitate.  |
|"The moral PGP Diffie taught Zimmermann unites all| Disclaimer:         |
| mankind free in one-key-steganography-privacy!"  | Ignore the man      |
|`finger -l alano@teleport.com` for PGP 2.6.2 key  | behind the keyboard.|
|         http://www.ctrl-alt-del.com/~alan/       |alan@ctrl-alt-del.com|

Re: OCR and Machine Readable Text

Alan Olsen

/\anonymous/\

Rabid Wombat

Kent Crispin

Steve Stewart

Dale Thorn

tags

participants (6)