Comments on PGP5.0 OCR (was Re: fyi, pgp source now available , internationally)
Charlie Root (root@cypherpunks.campsite.hip.nl) wrote:
http://cypherpunks.campsite.hip97.nl/pgp/ and http://www.pgpi.com/
(The former no longer seems to work, presumably because the machine is packed up and on its way home.)

I just wanted to make a few comments on the proofreading, in case anyone feels like releasing software in a similar manner in future:

The original printed and OCR-ed source gave a single checksum for each page, with four bits per line. It also ignored whitespace except in strings and comments. This meant that people could rapidly process the majority of the code to produce something which wasn't terribly pretty but functioned correctly. However, because there were only four bits per line, an incorrect line could pass the checksum; this would still be detected because the checksums were chained, but it could mean that when an error was detected you had to check several lines to find the invalid one.

Presumably because of this, the OCR-ed pages at HIP included a per-line checksum. This was good... but... it also checksummed the whitespace. This wasn't a problem in theory, because tabs were indicated by a special character. However, most lines had both tabs *and* spaces and there was no way to see where the spaces were because they were overridden by the tab (e.g. "mov<sp><tab>ax,23<sp><sp><tab><sp><tab>; Stuff").

As a consequence the proofreading went very slowly until some valiant folks (who may or may not wish to be identified, so I won't) worked overnight to put together a program to brute-force the checksum by trying all possible combinations of tabs and spaces until it found the right one.

So for a future effort could we please have the per-line checksums but ignore the whitespace unless it's important (e.g. comments and strings again)? Or if you want to ensure that the whitespace is identical between versions, please either strip out unnecessary spaces or use a special character for them so we can see precisely where they are.

If all we want is functioning code, then it doesn't have to look pretty; we can feed it through a code prettifier like indent when it's functionally correct.

Mark
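The brute-force approach described above can be sketched roughly as follows. This is only an illustration, not the program written at HIP: the checksum here is plain CRC-32 from Python's zlib (an assumption, since the format of the printed per-line checksum is not given in this thread), and the names line_checksum and whitespace_candidates are made up for the example. The proofreader knows the visible tokens and the checksum, and every run of spaces and tabs is tried in each gap.

```python
import itertools
import zlib


def line_checksum(line: str) -> int:
    # Stand-in checksum (an assumption): CRC-32 of the raw line bytes.
    return zlib.crc32(line.encode("ascii"))


def whitespace_candidates(tokens, target, max_run=5):
    """Yield every spacing of `tokens` whose checksum equals `target`.

    Each gap between tokens is tried as every space/tab run of
    length 1..max_run.
    """
    runs = [
        "".join(chars)
        for length in range(1, max_run + 1)
        for chars in itertools.product(" \t", repeat=length)
    ]
    for gaps in itertools.product(runs, repeat=len(tokens) - 1):
        pieces = [tokens[0]]
        for gap, token in zip(gaps, tokens[1:]):
            pieces.append(gap)
            pieces.append(token)
        candidate = "".join(pieces)
        if line_checksum(candidate) == target:
            yield candidate


# The example line quoted above: mov<sp><tab>ax,23<sp><sp><tab><sp><tab>; Stuff
original = "mov \tax,23  \t \t; Stuff"
tokens = ["mov", "ax,23", "; Stuff"]
matches = list(whitespace_candidates(tokens, line_checksum(original)))
print(original in matches)
```

The search grows exponentially with the number of gaps and the run length allowed, which is one more argument for leaving unimportant whitespace out of the checksum in the first place.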
Mark Grant <mark@unicorn.com> writes:
I just wanted to make a few comments on the proofreading, in case anyone feels like releasing software in a similar manner in future:
[...] the OCR-ed pages at HIP included a per-line checksum. This was good... but... it also checksummed the whitespace. This wasn't a problem in theory, because tabs were indicated by a special character. However, most lines had both tabs *and* spaces and there was no way to see where the spaces were because they were overridden by the tab (e.g. "mov<sp><tab>ax,23<sp><sp><tab><sp><tab>; Stuff").
How about a book full of 2D barcodes? As a plus perhaps the book would be more compact, as you could gzip it first -- the full source tree looks to be over a foot of doublesided paper!

Adam
--
Have *you* exported RSA today? --> http://www.dcs.ex.ac.uk/~aba/rsa/

print pack"C*",split/\D+/,`echo "16iII*o\U@{$/=$z;[(pop,pop,unpack"H*",<> )]}\EsMsKsN0[lN*1lK[d2%Sa2/d0<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<J]dsJxp"|dc`
On Mon, 11 Aug 1997, Adam Back wrote:
How about a book full of 2D barcodes?
As a plus perhaps the book would be more compact, as you could gzip it first -- the full source tree looks to be over a foot of doublesided paper!
Well, remember the reason we did this: to get the code out of the US in a way that the government couldn't screw with at all. Readable text is clearly a publication, and thus unrestrictable. There is a chance, however small, that gzipping (and tarring, I'd assume) the tree and then putting it in as text (or bar-coding it) would cloud the issue some. (Isn't part of this to do with human-readable as opposed to machine-readable?)

Besides, this way it's easier to spot the errors simply by comparing, with bar-codes and such you'd never ever be able to look at the errors yourself and find them.

-----------------------------------------------------------------------
Ryan Anderson - <Pug Majere> "Who knows, even the horse might sing" Wayne State University - CULMA "May you live in interesting times.." randerso@ece.eng.wayne.edu Ohio = VYI of the USA PGP Fingerprint - 7E 8E C6 54 96 AC D9 57 E4 F8 AE 9C 10 7E 78 C9
-----------------------------------------------------------------------
On Mon, 11 Aug 1997, Ryan Anderson wrote:
On Mon, 11 Aug 1997, Adam Back wrote:
How about a book full of 2D barcodes?
As a plus perhaps the book would be more compact, as you could gzip it first -- the full source tree looks to be over a foot of doublesided paper!
It is about that girth, although I only have the first 5 volumes. They should have hand-huffman-coded the source :).
Well, remember the reason we did this: to get the code out of the US in a way that the government couldn't screw with at all. Readable text is clearly a publication, and thus unrestrictable. There is a chance, however small, that gzipping (and tarring, I'd assume) the tree and then putting it in as text (or bar-coding it) would cloud the issue some. (Isn't part of this to do with human-readable as opposed to machine-readable?)
Not quite. If you read closely, the EAR says something about reserving judgment on OCR publications. You didn't use a specific OCR font, but you did put all kinds of other OCR helps in, which should by itself cloud the issue. It would be nice if it was resolved. Or if PGP came out with the "PGP crypto source quarterly", now that I have munge and unmunge :).
Besides, this way it's easier to spot the errors simply by comparing, with bar-codes and such you'd never ever be able to look at the errors yourself and find them.
You would normally bury a lot of ECC within the bar codes, so that unless the dog ate the page, the reader would be able to reconstruct the whole. Or you could even take the "munge" images and barcode those lines.

--- reply to tzeruch - at - ceddec - dot - com ---
On Mon, 11 Aug 1997 nospam-seesignature@ceddec.com wrote:
Not quite. If you read closely, the EAR says something about reserving judgment on OCR publications. You didn't use a specific OCR font, but you did put all kinds of other OCR helps in, which should by itself cloud the issue. It would be nice if it was resolved.
Um, how about a CRC for every character of every line published electronically? (hehehhe... Oh, and of course we'll use 32 bit CRC's of 8 bit characters, of course...)

Hidden text of this message not visible to feds for those without imagination: (yeah, right) all one would need is to build a table of 256 CRC's, take the 32 bit CRC code and reverse lookup the data. :)

=====================================Kaos=Keraunos=Kybernetos============== .+.^.+.| Ray Arachelian |Prying open my 3rd eye. So good to see |./|\. ..\|/..|sunder@sundernet.com|you once again. I thought you were |/\|/\ <--*-->| ------------------ |hiding, and you thought that I had run |\/|\/ ../|\..| "A toast to Odin, |away chasing the tail of dogma. I opened|.\|/. .+.v.+.|God of screwdrivers"|my eye and there we were.... |..... ======================= http://www.sundernet.com ==========================
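For what it's worth, the reverse lookup really is that small. A minimal sketch, assuming the CRC in question is the ordinary CRC-32 provided by Python's zlib module (the message does not say which polynomial it had in mind); the names REVERSE, encode and decode are made up for the example. An 8-bit character has 256 possible values, so the table has one entry for each.

```python
import zlib

# One CRC-32 per possible 8-bit character value (0..255).
REVERSE = {zlib.crc32(bytes([value])): value for value in range(256)}


def encode(text: bytes):
    """Publish a 32-bit CRC for every character instead of the character."""
    return [zlib.crc32(bytes([byte])) for byte in text]


def decode(crcs) -> bytes:
    """Recover the text by looking each CRC up in the reverse table."""
    return bytes(REVERSE[crc] for crc in crcs)


published = encode(b"mov ax,23 ; Stuff")
print(decode(published))
```

Every single-byte input has a distinct CRC-32, so the table has no collisions and the "encoding" hides nothing.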
On Mon, 11 Aug 1997, Ray Arachelian wrote:
On Mon, 11 Aug 1997 nospam-seesignature@ceddec.com wrote:
Not quite. If you read closely, the EAR says something about reserving judgment on OCR publications. You didn't use a specific OCR font, but you did put all kinds of other OCR helps in, which should by itself cloud the issue. It would be nice if it was resolved.
Um, how about a CRC for every character of every line published electronically? (hehehhe... Oh, and of course we'll use 32 bit CRC's of 8 bit characters, of course...)
Hidden text of this message not visible to feds for those without imagination: (yeah, right) all one would need is to build a table of 256 CRC's, take the 32 bit CRC code and reverse lookup the data. :)
This sounds absurd but similar things have happened.

The translation team for the Dead Sea Scrolls tried to keep the actual texts secret so they would be the only ones with the "Official" translation. They did, however, publish tables of what words were used and their location for use by researchers. A couple of them got the idea to use the lookup table to reconstruct the text. The results were a copy of the original text. (Needless to say, the "official" translation team was quite upset.) It did finally result in the publication of the scrolls, since the information had been "leaked".

[This sounds like something from RISKS...]

I wonder if it is legal to provide comprehensive cross-references of code. (Probably not, as the laws seem to be formulated under the legal principle of "I win, You lose".)

alan@ctrl-alt-del.com | Note to AOL users: for a quick shortcut to reply Alan Olsen | to my mail, just hit the ctrl, alt and del keys.
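The concordance trick carries over directly to source code. A rough sketch of the idea, with made-up helper names (build_concordance, reconstruct) and a toy token stream; it only assumes that the cross-reference lists every position at which each token occurs, as the published word tables are described as doing.

```python
from collections import defaultdict


def build_concordance(text: str) -> dict:
    """Map each whitespace-separated token to the positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(text.split()):
        index[token].append(position)
    return dict(index)


def reconstruct(concordance: dict) -> str:
    """Invert the cross-reference: put every token back at its positions."""
    placed = {
        position: token
        for token, positions in concordance.items()
        for position in positions
    }
    return " ".join(placed[i] for i in range(len(placed)))


source = "int main ( void ) { return 0 ; }"
table = build_concordance(source)
print(reconstruct(table) == source)
```

A complete cross-reference is therefore just the text in a different order, which is presumably why the question of its legality is not an idle one.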
At 12:07 PM 8/17/97 -0700, Bill Stewart wrote:
At 03:21 PM 8/11/97 -0400, tzeruch - at - ceddec - dot - com wrote:
Not quite. If you read closely, the EAR says something about reserving judgment on OCR publications. You didn't use a specific OCR font, but you did put all kinds of other OCR helps in, which should by itself cloud the issue. It would be nice if it was resolved.
Of course they say they "reserve judgement" - they'd really like to control it, but they know their chances of getting it past the First Amendment are extremely low, so it's just FUD.
It is pretty absurd any way you look at it. Most text scanning jobs for commercial use are sent off-shore. For the government to act like OCR scanning does not exist for those off the continental US is absurd.
I thought the PGP source code was printed in nice, friendly OCR-B font, but OCR equipment is good enough that Courier 10 or random popular fonts from Laserjets will do. (Proportional spaced is still a bit harder to recognize than constant-width, but not by much.) Reading text typed on an IBM Selectric was practical 10 years ago, when cheap ($10K) 68000-based OCR machines were starting to come out which weren't made by Kurzweil (who made great $30K machines.) If they want to block OCR-readable stuff, they're blocking just about everything printed today.
I used to work for a company that made CD-ROMs of medical journals. These were proportional fonts out of magazines. (Once in a while we would get the original article information, but that was not always assured.) Most of the text would be sent to somewhere in Asia to be scanned and proofread. Text scanning is big business in some parts of SE Asia. (And has been for many years.)

Goes to show you just how disconnected from the real world the White House and its fellow travelers have become.

---
| "That'll make it hot for them!" - Guy Grand |
|"The moral PGP Diffie taught Zimmermann unites all| Disclaimer: |
| mankind free in one-key-steganography-privacy!" | Ignore the man |
|`finger -l alano@teleport.com` for PGP 2.6.2 key | behind the keyboard.|
| http://www.ctrl-alt-del.com/~alan/ |alan@ctrl-alt-del.com|
-----BEGIN PGP SIGNED MESSAGE----- In <3.0.2.32.19970817123535.04265df0@ctrl-alt-del.com>, on 08/17/97 at 12:35 PM, Alan <alan@ctrl-alt-del.com> said:
Goes to show you just how disconnected from the real world the White House and its fellow travelers have become.
I doubt that you would find anyone outside of the beltway that didn't know that DC was out of touch with the rest of the world (and reality for that matter). They are all living in their own little fantasy world there; unfortunately people are dying in the real world because of it.

- --
- ---------------------------------------------------------------
William H. Geiger III http://www.amaranth.com/~whgiii
Geiger Consulting Cooking With Warp 4.0
Author of E-Secure - PGP Front End for MR/2 Ice
PGP & MR/2 the only way for secure e-mail.
OS/2 PGP 2.6.3a at: http://www.amaranth.com/~whgiii/pgpmr2.html
- ---------------------------------------------------------------
-----BEGIN PGP SIGNATURE-----
Version: 2.6.3a
Charset: cp850
Comment: Registered_User_E-Secure_v1.1b1_ES000000

iQCVAwUBM/dLF49Co1n+aLhhAQHw1gQAkaVr5bGpe4YURkOtdtpcTZp6Uw7sq2RO
tcRbMy0RC1O19RxfnJQM4yIzoZZAq26ggnHt9vVw07xrME/ywmzysoOaUdpbRF6U
V+1iVV5Q2bIfJ6ImvAlrLu+N9bNpDIWF1U0glyXcAhrVPOGZ9yyownpNmCHk3WWO
2zm7pphaXEQ=
=wvD8
-----END PGP SIGNATURE-----
At 03:21 PM 8/11/97 -0400, tzeruch - at - ceddec - dot - com wrote:
Not quite. If you read closely, the EAR says something about reserving judgment on OCR publications. You didn't use a specific OCR font, but you did put all kinds of other OCR helps in, which should by itself cloud the issue. It would be nice if it was resolved.
Of course they say they "reserve judgement" - they'd really like to control it, but they know their chances of getting it past the First Amendment are extremely low, so it's just FUD.

I thought the PGP source code was printed in nice, friendly OCR-B font, but OCR equipment is good enough that Courier 10 or random popular fonts from Laserjets will do. (Proportional spaced is still a bit harder to recognize than constant-width, but not by much.) Reading text typed on an IBM Selectric was practical 10 years ago, when cheap ($10K) 68000-based OCR machines were starting to come out which weren't made by Kurzweil (who made great $30K machines.) If they want to block OCR-readable stuff, they're blocking just about everything printed today.

# Thanks; Bill
# Bill Stewart, +1-415-442-2215 stewarts@ix.netcom.com
# You can get PGP outside the US at ftp.ox.ac.uk/pub/crypto/pgp
# (If this is a mailing list or news, please Cc: me on replies. Thanks.)
At 12:35 PM 8/17/97 -0700, Alan wrote:
I used to work for a company that made CD-ROMs of medical journals. These were proportional fonts out of magazines. (Once in a while we would get the original article information, but that was not always assured.) Most of the text would be sent to somewhere in Asia to be scanned and proofread. Text scanning is big business in some parts of SE Asia. (And has been for many years.)
Goes to show you just how disconnected from the real world the White House and its fellow travelers have become.
I watched the attorney for the USG claim in federal court during the recent Bernstein hearing that foreigners were incapable of retyping or scanning in crypto source code. He stated that even retyping the source for DES was too difficult to be done successfully. I couldn't help but groan. Luckily, the judge wasn't nearly as stupid as I had feared. She knew that he was trying to snow her.

--Lucky Green <shamrock@netcom.com>
PGP encrypted mail preferred.
DES is dead! Please join in breaking RC5-56. http://rc5.distributed.net/
On Mon, 11 Aug 1997, Adam Back wrote:
Mark Grant <mark@unicorn.com> writes:
I just wanted to make a few comments on the proofreading, in case anyone feels like releasing software in a similar manner in future:
[...] the OCR-ed pages at HIP included a per-line checksum. This was good... but... it also checksummed the whitespace. This wasn't a problem in theory, because tabs were indicated by a special character. However, most lines had both tabs *and* spaces and there was no way to see where the spaces were because they were overridden by the tab (e.g. "mov<sp><tab>ax,23<sp><sp><tab><sp><tab>; Stuff").
How about a book full of 2D barcodes?
Or just put everything through GNU indent and publish the .indent.pro file, so that after whatever is scanned in, all the .c and .h files will automagically be fixed. --- reply to tzeruch - at - ceddec - dot - com ---
-----BEGIN PGP SIGNED MESSAGE-----
At 17:46 11.08.97 +0200, you wrote:
At 15:32 11.08.97 +0100, Adam Back wrote:
Mark Grant <mark@unicorn.com> writes:
I just wanted to make a few comments on the proofreading, in case anyone feels like releasing software in a similar manner in future:
[...] the OCR-ed pages at HIP included a per-line checksum. This was good... but... it also checksummed the whitespace. This wasn't a problem in theory, because tabs were indicated by a special character. However, most lines had both tabs *and* spaces and there was no way to see where the spaces were because they were overridden by the tab (e.g. "mov<sp><tab>ax,23<sp><sp><tab><sp><tab>; Stuff").
How about a book full of 2D barcodes?
As a plus perhaps the book would be more compact, as you could gzip it first -- the full source tree looks to be over a foot of doublesided paper!
How about importing the scanned in source (in electronic form) back into the States and doing a 'diff' there. This could produce an electronic patchfile to repair the mistakes in the scanned in code, meaning that the whole of the code could be cleaned up in one go. This patchfile could then be exported as it holds no crypto source code. (Somehow this seems *too* simple. Would this perhaps get up the US gubmint's nose? Have I missed some nuance or implicit limitation?)
How far could this be pushed? In the extreme case we could supply a file full of junk (random bytes) and then apply a patch to it to turn it into source code.
Adam -- Have *you* exported RSA today? --> http://www.dcs.ex.ac.uk/~aba/rsa/
print pack"C*",split/\D+/,`echo "16iII*o\U@{$/=$z;[(pop,pop,unpack"H*",<> )]}\EsMsKsN0[lN*1lK[d2%Sa2/d0<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<J]dsJxp"|dc`
-----BEGIN PGP SIGNATURE-----
Version: PGP for Personal Privacy 5.0
Charset: noconv

iQCVAwUBM/BtnbgTZRKKFcAJAQFNlwP/ZhB2NZZv0qAuytMf2VLfLGV6mtY9vq/H
J4Z5q3wBzhoLPNaXJ3exdQ1+z+5CdHYFS9hvmeDCEi0wKLNzMZMZPIRVAsgCUgbo
I7lMvrRmV6Ajl/vuw7dLerv7oWDjI+G9kOpWLGrMdySUrYrVZlqm4o+hGb7/NPxE
uWVqFBBI9CU=
=wkOi
-----END PGP SIGNATURE-----
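To see what such a patchfile would actually carry, here is a small sketch using Python's standard difflib module. The four-line "source", the single OCR slip in "scanned" and the two lines of "junk" are invented for illustration, and added_lines is a made-up helper; nothing here comes from the PGP tree.

```python
import difflib

source = ["int main(void)\n", "{\n", "    return 0;\n", "}\n"]
scanned = ["int main(void)\n", "{\n", "    retum 0;\n", "}\n"]  # one OCR slip
junk = ["q8#z\n", "0000\n"]  # stands in for a file of random bytes


def added_lines(patch):
    """Return the text of every '+' line in a unified diff, minus the marker."""
    # patch[0] and patch[1] are the '---' and '+++' file headers.
    return [line[1:] for line in patch[2:] if line.startswith("+")]


fix_patch = list(difflib.unified_diff(scanned, source, "scanned", "source"))
junk_patch = list(difflib.unified_diff(junk, source, "junk", "source"))

print(added_lines(fix_patch))             # only the corrected line
print(added_lines(junk_patch) == source)  # the whole source, line for line
```

In the first case the patch holds just the repaired line (plus unchanged context lines in unified format); in the extreme case every line of the source appears in the patch as an addition, so the junk file contributes nothing.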
At 15:32 11.08.97 +0100, Adam Back wrote:
Mark Grant <mark@unicorn.com> writes:
I just wanted to make a few comments on the proofreading, in case anyone feels like releasing software in a similar manner in future:
[...] the OCR-ed pages at HIP included a per-line checksum. This was good... but... it also checksummed the whitespace. This wasn't a problem in theory, because tabs were indicated by a special character. However, most lines had both tabs *and* spaces and there was no way to see where the spaces were because they were overridden by the tab (e.g. "mov<sp><tab>ax,23<sp><sp><tab><sp><tab>; Stuff").
How about a book full of 2D barcodes?
As a plus perhaps the book would be more compact, as you could gzip it first -- the full source tree looks to be over a foot of doublesided paper!
How about importing the scanned in source (in electronic form) back into the States and doing a 'diff' there. This could produce an electronic patchfile to repair the mistakes in the scanned in code, meaning that the whole of the code could be cleaned up in one go. This patchfile could then be exported as it holds no crypto source code. (Somehow this seems *too* simple. Would this perhaps get up the US gubmint's nose? Have I missed some nuance or implicit limitation?)
Adam -- Have *you* exported RSA today? --> http://www.dcs.ex.ac.uk/~aba/rsa/
print pack"C*",split/\D+/,`echo "16iII*o\U@{$/=$z;[(pop,pop,unpack"H*",<> )]}\EsMsKsN0[lN*1lK[d2%Sa2/d0<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<J]dsJxp"|dc`
participants (10)
- Adam Back
- Alan
- Bill Stewart
- Ian Sparkes
- Lucky Green
- Mark Grant
- nospam-seesignature@ceddec.com
- Ray Arachelian
- Ryan Anderson
- William H. Geiger III