
Re: OCR and Machine Readable Text



It's really embarrassing to have to pay salaries of "public employees"
who can't come up with better arguments than the paper/magnetic/OCR nonsense
but don't have the guts to stop trying and admit they're wrong.
Does the President still make $200K/year salary?  You'd think he'd either
read what he signs or tell his employees to only ask him to sign
at least half-way credible stuff.  The old regulations used to pretend that
foreigners were too dumb to implement computer programs from algorithms;
now they're pretending that foreigners are too dumb to type.*
People used to say we have the best politicians money can buy, 
but you ought to be able to buy better politicians than that.

At 10:31 AM 12/30/96 -0800, Tim May wrote:
>And not only is OCR able these days to handle general fonts easily enough,
>but almost all printed code is in fixed-width fonts, i.e., non-proportional
>fonts. This makes OCR easy. 

The basic difference between "easily OCRed source code" and "not easily
OCRed source code" is pretty much limited to two things:
1) Half-decent print quality (black on white in Courier at 300dpi should do....)
        As Tim says, this stuff is child's play.  Back when OCRs were
        $10,000 machines with cutting-edge 68010 processors,
        reading Courier was pretty easy but it helped to put in checksums;
        these days you don't really need that.  (It also didn't like
        wet-process 240-dpi laser printing or faxes, but modern OCR software
        can generally deal with good-quality faxes and 
2) Bound pages vs. loose pages (printing with perforated pages
        or selling the source code in loose-leaf might count as an
        "attractive nuisance" :-), but a band-saw can solve that problem
        for the OCR user unless it's printed on Tyvek or something silly :-)
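
The checksum trick mentioned under item 1 can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the
function names add_checksums and verify, the one-checksum-per-line layout,
and the choice of CRC-32 are all made up for the example, not taken from
any particular OCR product or printed source book.

```python
import zlib


def add_checksums(source):
    """Prefix every source line with its CRC-32 as 8 hex digits.

    Hypothetical print layout: "<crc> <original line>".
    """
    out = []
    for line in source.splitlines():
        crc = zlib.crc32(line.encode("utf-8"))
        out.append(f"{crc:08x} {line}")
    return "\n".join(out)


def verify(printed):
    """Return the 1-based numbers of lines whose checksum does not match."""
    bad = []
    for number, row in enumerate(printed.splitlines(), 1):
        # Split at the first space only, so leading indentation survives.
        tag, _, line = row.partition(" ")
        try:
            expected = int(tag, 16)
        except ValueError:
            # The checksum column itself was misread.
            bad.append(number)
            continue
        if zlib.crc32(line.encode("utf-8")) != expected:
            bad.append(number)
    return bad
```

Running verify over the OCR output points the proofreader at the lines
that need a second look, for instance one where a zero was read as a
capital O. Note that this sketch checksums whitespace too, so a scanner
that mangles indentation would flag nearly every line; a real scheme
would probably normalize spaces first.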

In the Karn case, the Feds made the silly argument that the
floppy disk version had the files neatly separated, while the
paper version split files between pages and had page numbers at the
bottoms of the pages that weren't part of the source code.
Even the $10K 68010 wonderbox could handle page headers/footers and margins, 
and modern software can do decent translations into different
word processor formats.
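
As a rough sketch of how little work the page-number "problem" is,
here is some Python that drops a footer from each scanned page and glues
the pages back together. The details are assumptions for the sake of the
example: pages are taken to be separated by form feeds, the footer is
taken to be a final line holding only a page number (forms like "12",
"- 12 -" or "Page 12"), and the names FOOTER, strip_footer and join_pages
are invented here.

```python
import re

# Assumed footer shapes: "12", "- 12 -", "Page 12" (hypothetical layout).
FOOTER = re.compile(r"^\s*(?:-\s*)?(?:[Pp]age\s+)?\d+(?:\s*-)?\s*$")


def strip_footer(page):
    """Return the page's lines without trailing blanks or a page-number footer."""
    lines = page.splitlines()
    while lines and not lines[-1].strip():
        lines.pop()
    if lines and FOOTER.match(lines[-1]):
        lines.pop()
        # Drop the blank padding that sat between the code and the footer.
        while lines and not lines[-1].strip():
            lines.pop()
    return lines


def join_pages(scanned):
    """Concatenate form-feed-separated pages into one continuous source text."""
    body = []
    for page in scanned.split("\f"):
        body.extend(strip_footer(page))
    return "\n".join(body) + "\n"
```

The obvious caveats: a page whose last real source line is a bare number
would lose that line, and blank lines that happened to fall at a page
break are dropped. Neither affects whether the code compiles, and
splitting the result back into separate files would only need some
per-file marker in the printed text.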

>For just the amount of money we've spent (in our consulting fees) on
>discussing just this issue of OCRing, the entire content of the MIT PGP
>source code book AND Schneier's AC could have been manually inputted by
>Barbadans or Botswanas, or probably even by Europeans.

There's one German university that OCRed the MIT PGP source code book.
The PGP folks passed out copies of their new 3.0 Pre-Alpha and an update
at a recent Cypherpunks meeting.  See
http://www.pgp.com/newsroom/sourcebook.cgi 
for ordering information.  It's been donated to some local libraries, 
such as San Jose CA, and I hope they'll send it to the Library of Congress
and various non-US university and other public libraries - the recent
rules change clarifying that it's ok to export source code should make
this much easier.   

[* OK, it's not really possible to type or proofread perl code accurately :-)
More to the point, OCRs aren't always real good about `backquotes' and other
little blotchy marks that some languages use, and even humans don't always
get them right.]

#			Thanks;  Bill
# Bill Stewart, +1-415-442-2215 [email protected]
# You can get PGP outside the US at ftp.ox.ac.uk/pub/crypto/pgp
#     (If this is a mailing list, please Cc: me on replies.  Thanks.)