Re: [DIYbio] Removing watermarks from pdfs (pdfparanoia)

On Wed, Feb 6, 2013 at 12:12 PM, Cathal Garvey wrote:
For example, to remove a front page, you might need to "explode" the PDF into images, discard the first image, and recompress into a new PDF.
I don't recommend this method, because converting most PDFs into images will lose the text. You can delete entire pages within the PDF format itself by removing the page's entry from the page tree, deleting its "stream" objects, and updating the xref table.
To remove text/images embedded at the bottom of each PDF page, you could do the same, except use ImageMagick on each image before recompression.
Most text in a PDF document is "semantic", surrounded by PDF markup that can be directly deleted. I can imagine there might be one or two cases where publishers are adding an image to a PDF with your IP address, in which case you can delete that single image. However, if the page content is itself an image (no selectable text), then they might have chosen to put the watermark into the page image, in which case the only way to remove it would be to use ImageMagick as you say, and draw over the offending region. I haven't seen this in any of the documents I have read over the years.
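To illustrate what "directly deleted" can look like, here is a minimal sketch, not a description of how pdfparanoia works. It assumes you already have a page's content stream as decompressed bytes, and that the watermark is a text object (the span between the BT and ET operators) containing an IPv4 address. The names `IP_PATTERN`, `TEXT_OBJECT` and `strip_watermark_text` are made up for this example, and the regex approach would be fooled by a literal "BT" or "ET" inside a string, so a real tool should use a proper PDF parser.

```python
import re

# Assumed watermark signature: a dotted-quad IPv4 address.
IP_PATTERN = re.compile(rb"\b\d{1,3}(?:\.\d{1,3}){3}\b")

# A text object in a content stream runs from the BT operator to the ET operator.
TEXT_OBJECT = re.compile(rb"\bBT\b.*?\bET\b\s*", re.DOTALL)


def strip_watermark_text(content: bytes) -> bytes:
    """Drop every BT ... ET text object whose body contains an IP address."""
    def keep_or_drop(match):
        block = match.group(0)
        return b"" if IP_PATTERN.search(block) else block
    return TEXT_OBJECT.sub(keep_or_drop, content)


# Hypothetical, already-decompressed content stream for one page.
stream = (
    b"BT /F1 12 Tf 72 720 Td (Results and discussion) Tj ET\n"
    b"BT /F1 8 Tf 72 20 Td (Downloaded by 192.168.10.5 on 2013-02-06) Tj ET\n"
    b"q 100 0 0 100 72 400 cm /Im1 Do Q\n"
)
cleaned = strip_watermark_text(stream)
print(cleaned.decode("ascii"))
```

The body text and the image-drawing operators are left untouched; only the text object carrying the address is removed, so the rest of the page stays selectable.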
The major disadvantage of this route is that it would convert a text + images PDF (high compression ratio, easy to extract text for re-use) into an images-only PDF (large file size, poor compression, impossible to extract text without OCR).
right..
Of course, if you can extract text, you could try extracting text + images and perhaps script the creation of an entirely new PDF file. This is the opposite approach; instead of blacklisting content ("This bit contains IP address info"), you're whitelisting content ("These bits are the text and images that form the actual paper").
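The two approaches can be contrasted with a small sketch. Everything here is hypothetical: the `items` list stands in for whatever an extraction step would yield, the `(kind, payload)` shape is an assumption, and the "allowed kinds" rule is only one possible way to define a whitelist.

```python
import re

# Assumed watermark signature for the blacklist: an IPv4 address.
IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

# Hypothetical extracted page items as (kind, payload) pairs.
items = [
    ("text", "Abstract: we describe a cloning protocol."),
    ("image", "figure1.png"),
    ("text", "Downloaded by 10.0.0.7"),
    ("annotation", "tracking-id-placeholder"),
]


def blacklist(items):
    """Keep everything except items matching a known watermark pattern."""
    return [(kind, payload) for kind, payload in items
            if not IP_PATTERN.search(payload)]


def whitelist(items, allowed_kinds=("text", "image")):
    """Keep only items of explicitly allowed kinds; drop everything else."""
    return [(kind, payload) for kind, payload in items
            if kind in allowed_kinds]


print(blacklist(items))  # drops the IP line, but the unknown annotation survives
print(whitelist(items))  # drops the annotation, but the IP line survives as "text"
```

Under these assumptions each filter misses something: the blacklist passes anything it has no pattern for, and a whitelist keyed only on kind cannot tell watermark text from paper text.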
How would you whitelist content you've never seen before?

- Bryan
http://heybryan.org/
1 512 203 0507