A sample 1-page PDF by Adobe with over 500 metadata entries visible under "Properties" > "Additional Metadata" > "Advanced": http://cryptome.org/adobe-acrobat-xi-scan-paper-to-pdf-and-apply-ocr-tutoria... Acrobat is needed to see the Additional Metadata; Reader does not show it. Curious: 464 metadata entries under "xmpMM:History (bag container)". And many more are hidden but perusable with text scrutiny of the PDF.
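The "text scrutiny" mentioned above can be sketched in Python, as a rough illustration only. It assumes the XMP packet is stored uncompressed and unencrypted in the file (common for PDF metadata streams, but not guaranteed), and the function names find_xmp_packets and count_history_entries are hypothetical helpers, not part of any library.

```python
import re
from typing import List

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>
XMP_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)
HISTORY = re.compile(rb"<xmpMM:History>(.*?)</xmpMM:History>", re.DOTALL)


def find_xmp_packets(pdf_bytes: bytes) -> List[bytes]:
    """Return the body of every plain-text XMP packet found in the raw bytes."""
    return [m.group(1) for m in XMP_PACKET.finditer(pdf_bytes)]


def count_history_entries(packet: bytes) -> int:
    """Count rdf:li items inside xmpMM:History (0 if the property is absent)."""
    match = HISTORY.search(packet)
    if match is None:
        return 0
    return len(re.findall(rb"<rdf:li[\s>/]", match.group(1)))


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        data = fh.read()
    for i, packet in enumerate(find_xmp_packets(data)):
        print(f"packet {i}: {count_history_entries(packet)} history entries")
```

Run it with the path to a PDF as the only argument. A compressed or encrypted metadata stream would show no packets here even though Acrobat displays the entries.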
On Sat, Jan 31, 2015 at 4:58 PM, John Young <jya@pipeline.com> wrote:
Acrobat needed to see the Additional Metadata, Reader does not show.
For Ubuntu users:
sudo apt-get install libimage-exiftool-perl
exiftool -a -G1 adobe-acrobat-xi-scan-paper-to-pdf-and-apply-ocr-tutorial-ue.pdf | less -S
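For scripted use rather than paging through less, a minimal Python sketch can call the same exiftool options and add -j for JSON output. It assumes exiftool is installed and on the PATH; build_exiftool_cmd, group_tags and read_metadata are hypothetical helper names made up for this example.

```python
import json
import subprocess
from collections import defaultdict
from typing import Dict, List


def build_exiftool_cmd(pdf_path: str) -> List[str]:
    """Same options as the command line above, plus -j for JSON output."""
    return ["exiftool", "-j", "-a", "-G1", pdf_path]


def group_tags(exiftool_json: str) -> Dict[str, Dict[str, object]]:
    """Split 'Group:Tag' keys from exiftool's JSON into nested dicts per group."""
    records = json.loads(exiftool_json)  # a list with one object per input file
    grouped: Dict[str, Dict[str, object]] = defaultdict(dict)
    for key, value in records[0].items():
        group, sep, tag = key.partition(":")
        if not sep:  # e.g. SourceFile carries no group prefix
            group, tag = "(none)", key
        grouped[group][tag] = value
    return dict(grouped)


def read_metadata(pdf_path: str) -> Dict[str, Dict[str, object]]:
    """Run exiftool on one file and return its tags grouped by family-1 group."""
    result = subprocess.run(
        build_exiftool_cmd(pdf_path), check=True, capture_output=True, text=True
    )
    return group_tags(result.stdout)


if __name__ == "__main__":
    import sys

    for group, tags in read_metadata(sys.argv[1]).items():
        print(f"{group}: {len(tags)} tags")
```

The per-group counts give a quick view of where the bulk of the metadata sits before reading individual tags.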
On 1/31/15, Jason McVetta <jason.mcvetta@gmail.com> wrote:
... For Ubuntu users:
sudo apt-get install libimage-exiftool-perl
exiftool -a -G1 adobe-acrobat-xi-scan-paper-to-pdf-and-apply-ocr-tutorial-ue.pdf | less -S
per the python PDF tools (with varied options), or the reduced-option command line pdf2txt, or pdftotext, or also:
strings --bytes=$varlength ... with varying --encoding= ... ,
for, as John mentioned, all the metadata and annotations typically unseen,

consider the specific "configuration and input parsing" as a "profile" for a given "input document", identified by a "self certifying identifier". all of the above results in collaborative simplified text paragraphs as a working base. so

sha256(generated corpora) == sha256( sha256(doc) ^ sha256(config of parse opts) ^ sha256(parse-product) )

if i use a convenient generated slang, ...

this means at least a dozen "to text" engines with configuration (parse opts and parse products) per input document as a working state, and ten to twenty times the input pages as simplified output text paragraphs (common base) collected from the useful parts of the best transformations, used for subsequent text based natural language processing.

in a sense, this is devops come to document processing, where the process itself is embodied in version controlled and complete archives with self certifying integrity.

this means boring, and also done decades ago, more or less, in varying contexts. everything old is new again ;P

there is a whole field of custom parsers and data sets and scrapers all dedicated to variations on this theme, although sadly they don't live public lives, for the most part.

best regards,
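One possible reading of the identifier slang above, sketched in Python. The "^" is taken here to mean a byte-wise XOR of the three digests, which is an assumption (it could equally mean concatenation), and the canonical JSON encoding of the parse options is also an assumption made for this example; corpus_id is a hypothetical name.

```python
import hashlib
import json


def sha256_bytes(data: bytes) -> bytes:
    """Raw 32-byte SHA-256 digest of the input."""
    return hashlib.sha256(data).digest()


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("digests must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def corpus_id(doc: bytes, parse_opts: dict, parse_product: bytes) -> str:
    """sha256( sha256(doc) ^ sha256(config) ^ sha256(product) ), '^' read as XOR."""
    # Assumption: options are serialized as sorted, compact JSON so that the
    # same settings always hash to the same value regardless of key order.
    config = json.dumps(parse_opts, sort_keys=True, separators=(",", ":")).encode("utf-8")
    mixed = xor_bytes(
        xor_bytes(sha256_bytes(doc), sha256_bytes(config)),
        sha256_bytes(parse_product),
    )
    return hashlib.sha256(mixed).hexdigest()


if __name__ == "__main__":
    opts = {"tool": "pdftotext", "layout": True}  # illustrative options only
    print(corpus_id(b"raw pdf bytes", opts, b"extracted text"))
```

Note that XOR is order-independent and two identical digests cancel each other out, so hashing the concatenation of the three digests may be the safer choice if the roles of the inputs must stay distinct.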
participants (3): coderman, Jason McVetta, John Young