www.nsa-observer.net

Sat Jan 31 23:16:28 PST 2015

On 1/31/15, Jason McVetta <jason.mcvetta@gmail.com> wrote:
> ...
> For Ubuntu users:
>
> sudo apt-get install libimage-exiftool-perl
> exiftool -a -G1 adobe-acrobat-xi-scan-paper-to-pdf-and-apply-ocr-tutorial-ue.pdf | less -S

likewise the python PDF tools (with varied options), or the reduced
option command line pdf2txt, or pdftotext, or also:

strings --bytes=$varlength ... with varying --encoding= ... , for, as
John mentioned, all the metadata and annotations that typically go
unseen.

consider the specific "configuration and input parsing" as a "profile"
for a given "input document", identified by a "self certifying
identifier". applying this across all of the above results in
collaborative, simplified text paragraphs as a working base.

so sha256(generated corpora) == sha256(sha256(doc)  ^ sha256(config of
parse opts) ^ sha256(parse-product) )

if i use a convenient generated slang, ...
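
a minimal python sketch of that identifier, under loud assumptions: the
"^" in the slang above is read here as a byte-wise XOR of the three
digests (it could just as well mean concatenation), and the function
names and the canonical-JSON encoding of the parse options are
hypothetical, chosen only for illustration.

```python
import hashlib
import json


def sha256_bytes(data: bytes) -> bytes:
    """Raw 32-byte SHA-256 digest of data."""
    return hashlib.sha256(data).digest()


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def corpus_id(doc: bytes, parse_opts: dict, parse_product: bytes) -> str:
    """Hypothetical self certifying identifier for one parse of one document.

    Assumption: '^' means XOR of the three digests, and the parse options
    are serialized as canonical JSON (sorted keys) so that the same
    options always hash to the same value.
    """
    config = json.dumps(parse_opts, sort_keys=True, separators=(",", ":"))
    mixed = xor_bytes(
        xor_bytes(sha256_bytes(doc), sha256_bytes(config.encode("utf-8"))),
        sha256_bytes(parse_product),
    )
    return hashlib.sha256(mixed).hexdigest()
```

one caveat with the XOR reading: it is order independent, and two equal
inputs cancel each other out, so hashing the concatenation of the three
digests may be the safer interpretation in practice.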

this means at least a dozen "to text" engines, each with its
configuration (parse opts and parse products), per input document as a
working state.

and ten to twenty times the input pages as simplified output text
paragraphs (common base) collected from the useful parts of the best
transformations, used for subsequent text based natural language
processing.
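
one way the "common base" could be collected, again as a sketch only:
the helper names and the engine labels below are hypothetical, and
"simplified" is assumed to mean nothing more than whitespace-normalized
paragraphs, de-duplicated by hash across the outputs of several
engines. picking the "useful parts of the best transformations" is left
out, since that is a judgment call this sketch does not attempt.

```python
import hashlib
import re


def paragraphs(text: str) -> list:
    """Split text on blank lines and collapse runs of whitespace."""
    parts = re.split(r"\n\s*\n", text)
    cleaned = (" ".join(p.split()) for p in parts)
    return [p for p in cleaned if p]


def common_base(outputs: dict) -> dict:
    """Merge paragraphs from several engines, keyed by SHA-256 of the text.

    outputs maps a (hypothetical) engine label to its extracted text.
    Each entry records the paragraph and which engines produced it.
    """
    base = {}
    for engine, text in outputs.items():
        for para in paragraphs(text):
            key = hashlib.sha256(para.encode("utf-8")).hexdigest()
            entry = base.setdefault(key, {"text": para, "engines": []})
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
    return base
```

paragraphs that several engines agree on show up once, with every
contributing engine listed, which gives a crude signal of which
transformations are worth keeping.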

in a sense, this is devops come to document processing, where the
process itself is embodied in version controlled and complete archives
with self certifying integrity. this means it is boring, and was also
done decades ago, more or less, in varying contexts. everything old is
new again ;P

there is a whole field of custom parsers, data sets, and scrapers all
dedicated to variations on this theme, although sadly they don't live
public lives, for the most part.

best regards,