On 1/31/15, Jason McVetta <jason.mcvetta@gmail.com> wrote:
... For Ubuntu users:
sudo apt-get install libimage-exiftool-perl
exiftool -a -G1 adobe-acrobat-xi-scan-paper-to-pdf-and-apply-ocr-tutorial-ue.pdf | less -S
the same goes for the python PDF tools (with varied options), the reduced-option command line pdf2txt or pdftotext, or also: strings --bytes=$varlength ... with varying --encoding= ... ; as John mentioned, these expose all the metadata and annotations that typically go unseen.

consider the specific "configuration and input parsing" as a "profile" for a given "input document", identified by a "self certifying identifier". applying all of the above results in collaborative, simplified text paragraphs as a working base. so, if i use a convenient generated slang:

  sha256(generated corpora) == sha256(sha256(doc) ^ sha256(config of parse opts) ^ sha256(parse-product))

this means at least a dozen "to text" engines, each with its configuration (parse opts and parse products), per input document as a working state, and ten to twenty times the input pages as simplified output text paragraphs (common base), collected from the useful parts of the best transformations and used for subsequent text-based natural language processing.

in a sense, this is devops come to document processing, where the process itself is embodied in version-controlled, complete archives with self-certifying integrity. this means boring, and also done decades ago, more or less, in varying contexts. everything old is new again ;P

there is a whole field of custom parsers, data sets, and scrapers all dedicated to variations on this theme, although sadly they don't live public lives, for the most part.

best regards,
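
p.s. a rough python sketch of the identifier slang above, in case it helps. the assumptions are mine: "^" is read as a byte-wise XOR of the raw sha256 digests, the three inputs are plain byte strings, and the helper names (sha256_bytes, xor_bytes, corpus_id) are hypothetical, not from any existing tool.

```python
import hashlib


def sha256_bytes(data: bytes) -> bytes:
    """Return the raw 32-byte SHA-256 digest of data."""
    return hashlib.sha256(data).digest()


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (assumed meaning of '^')."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def corpus_id(doc: bytes, parse_opts: bytes, parse_product: bytes) -> str:
    """Hypothetical helper: hex identifier for one (doc, config, output) triple."""
    mixed = xor_bytes(
        xor_bytes(sha256_bytes(doc), sha256_bytes(parse_opts)),
        sha256_bytes(parse_product),
    )
    # Hash the combined digest once more, as in the outer sha256(...) above.
    return hashlib.sha256(mixed).hexdigest()


if __name__ == "__main__":
    # Placeholder inputs; real use would read the PDF bytes, the serialized
    # parser options, and the extracted text from disk.
    print(corpus_id(b"%PDF-1.4 ...", b"pdftotext -layout", b"extracted text"))
```

one caveat with the XOR reading: it is order-insensitive, and two identical inputs cancel out, so hashing a length-prefixed concatenation of the three digests may be the safer choice if the roles must stay distinct.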