Forensics of paper docs by, say, the FBI, examines paper constituents (ecology of trees, soil, tree-cutters, haulers, pulp mill, paper mill, coating mill, shipping containers, distributors, sellers, buyers, lenders), inks and their constituents, the detritus of human and machine handling, use and transmission, and attempts to camouflage, divert and hoax. Forensics of digital docs does all this and much more, from creation to transceiving, forging, hoaxing, tracking, calling home, and so forth.
Couple that with the Internet and commodiously ID'd digital processing devices, from manufacturers of programs and devices down to the poor user blocked from seeing the galore of peeping toms, diverted by promises of privacy and comsec promoted, sad to say, by orgs receiving funds from the manufacturers to play that very user-narcosis role. What can be done? If a bio-hazard suit promises protection from ecological hazards, what digital-hazard suit is available that is not contaminated with data siphoning of the wearer, like products tagged for sale to end-of-world believers?
Crypto is trap- and back-doored and corrupt, so it is warned by those offering an NSA career in the womb of Rosemary's Baby; privacy is delusionary, so it is preached by those inviting into OTR communities filled with Google-SM-informants and XXers; openness is a worse deception than official secrecy, so blind-justice visionaries reveal, and beckon to get off the grid and underground, deep and dark, far away from the electromagnetic spectrum -- quanta-land, teleportation nirvana, across the rivernet of Styx Stux.
Remember when cpunk seers cautioned commodiously of sinister authorities and their villainous contractors, and encouraged heroically to assassinate them anonymously?
Remember the gradual hiring of those seers to remain in place while aiding and abetting the authorities as contractors, to invent and promise comsec and privacy and anonymity, generously trap- and back-doored and trojaned and Call Homed, tracing the arc of Snowden and gobs of others, requiring forensics to counter, and counter-counter, forensics of fora like this? Like post-Snowden journalism, enthralled with adopting secure drop boxes, leak sites, secure comms, PK swaps and signings, to camouflage long-standing lunches and briefings with officials to agree on what can be slipped into public perception as acceptable corruption to hide the unacceptable.
Adobe brags PDFs can simulate paper docs exactly. Indeed, and much more forensically easy.
At 02:16 AM 2/1/2015, you wrote:
On 1/31/15, Jason McVetta <jason.mcvetta@gmail.com> wrote:
... For Ubuntu users:
sudo apt-get install libimage-exiftool-perl
exiftool -a -G1 adobe-acrobat-xi-scan-paper-to-pdf-and-apply-ocr-tutorial-ue.pdf | less -S
per the python PDF tools, (with varied options), or reduced option command line pdf2txt, or pdftotext, or also:
strings --bytes=$varlength ... with varying --encoding= ... , for as John mentioned, all the metadatas and annotations typically unseen,
consider that the specific "configuration and input parsing" as a "profile" for a given "input document" identified by "self certifying identifier" for all of the above results in collaborative simplified text paragraphs as a working base.
so sha256(generated corpora) == sha256(sha256(doc) ^ sha256(config of parse opts) ^ sha256(parse-product) )
if i use a convenient generated slang, ...
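a rough sketch of that identifier in Python, under loud assumptions: "^" in the slang above is read here as byte-wise XOR of the three digests (it could equally mean concatenation), the parse options are serialized as canonical JSON, and the names corpus_id, parse_opts and parse_product are made up for illustration, not taken from any existing tool.

```python
import hashlib
import json


def sha256_bytes(data: bytes) -> bytes:
    """Return the raw 32-byte SHA-256 digest of data."""
    return hashlib.sha256(data).digest()


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (assumed reading of '^')."""
    return bytes(x ^ y for x, y in zip(a, b))


def corpus_id(doc: bytes, parse_opts: dict, parse_product: str) -> str:
    """Hypothetical self-certifying id for one (doc, config, product) triple."""
    # Canonical JSON so that key order in the options does not change the id.
    config = json.dumps(parse_opts, sort_keys=True, separators=(",", ":")).encode("utf-8")
    mixed = xor_bytes(
        xor_bytes(sha256_bytes(doc), sha256_bytes(config)),
        sha256_bytes(parse_product.encode("utf-8")),
    )
    return hashlib.sha256(mixed).hexdigest()
```

note that XOR is commutative, so this reading cannot tell which digest played which role; concatenating the three digests in a fixed order before the outer hash would avoid that, at the cost of departing from the slang as written.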
this means at least a dozen "to text" engines with configuration, (parse opts and parse products) per input document as a working state.
and ten to twenty times the input pages as simplified output text paragraphs (common base) collected from the useful parts of the best transformations, used for subsequent text based natural language processing.
in a sense, this is devops come to document processing, where the process itself is embodied in version controlled and complete archives with self certifying integrity. this means boring, and also done decades ago, more or less, in varying contexts. everything old is new again ;P
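one way to picture "complete archives with self certifying integrity" is a manifest of per-file digests plus a single root digest over the manifest itself. this is a minimal sketch, assuming the archive is held in memory as a name-to-bytes mapping; build_manifest and verify_manifest are hypothetical names, and a real setup would more likely lean on a version control system's own content addressing.

```python
import hashlib
import json
from typing import Dict


def build_manifest(files: Dict[str, bytes]) -> dict:
    """Hash every file, then hash the canonical listing to get one root digest."""
    entries = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {"entries": entries, "root": hashlib.sha256(canonical).hexdigest()}


def verify_manifest(files: Dict[str, bytes], manifest: dict) -> bool:
    """Recompute the manifest from the files and compare it with the stored one."""
    return build_manifest(files) == manifest
```

any change to a parse config, a parse product, or the input document then shows up as a different root digest, which is the property the archive needs.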
there is a whole field of custom parsers and data sets and scrapers all dedicated to variations on this theme, although sadly they don't live public lives, for the most part.
best regards,