[DIYbio] Removing watermarks from pdfs (pdfparanoia)

Bryan Bishop kanzure at gmail.com
Thu Feb 7 00:26:09 PST 2013


On Wed, Feb 6, 2013 at 12:12 PM, Cathal Garvey wrote:
> For example, to remove a frontpage, you might need to "explode" the PDF
> into images, discard the first image, and recompress into a new PDF.

I don't recommend this method, because converting a PDF into images
throws away the text layer in most cases. You can delete entire pages
within the PDF format itself by removing the page's objects (including
its "stream" objects) and updating the xref table.
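
A rough sketch of the idea, not pdfparanoia's actual code: it takes the
bytes of a /Pages dictionary, drops the first entry of /Kids, decrements
/Count, and pads the result with spaces to the original length so that
every byte offset in the xref table stays valid. The function name
drop_first_page is made up for illustration, and the sketch assumes a
flat, uncompressed page tree. The orphaned page object and its streams
stay in the file; actually deleting them means rewriting the xref table.

```python
import re


def drop_first_page(pages_obj: bytes) -> bytes:
    """Unlink the first page from a /Pages dictionary (hypothetical helper).

    Assumes a flat, uncompressed page tree whose /Kids array holds only
    indirect references such as "3 0 R".
    """
    kids = re.search(rb"/Kids\s*\[(.*?)\]", pages_obj, re.S)
    count = re.search(rb"/Count\s+(\d+)", pages_obj)
    if not kids or not count:
        raise ValueError("not a /Pages dictionary this sketch understands")

    refs = re.findall(rb"\d+\s+\d+\s+R", kids.group(1))
    if len(refs) < 2:
        raise ValueError("refusing to remove the only page")

    new_kids = b"/Kids [" + b" ".join(refs[1:]) + b"]"
    new_count = b"/Count " + str(int(count.group(1)) - 1).encode()

    # Apply the edit that sits later in the buffer first, so the span of
    # the earlier one is still correct when we get to it.
    edits = sorted([(kids.span(), new_kids), (count.span(), new_count)],
                   reverse=True)
    out = pages_obj
    for (start, end), replacement in edits:
        out = out[:start] + replacement + out[end:]

    # Pad with spaces before the closing ">>" so the object keeps its
    # original byte length and the xref offsets do not move.
    pad = len(pages_obj) - len(out)
    if pad < 0:
        raise ValueError("edited dictionary grew; xref would need rebuilding")
    end_of_dict = out.rindex(b">>")
    return out[:end_of_dict] + b" " * pad + out[end_of_dict:]
```

Splicing the returned bytes back over the original dictionary leaves the
rest of the file untouched, which is why the padding matters.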

> To remove text/images embedded on the bottom of each PDF page, you could
> do the same except use imagemagick on each image before recompression.

Most text in a PDF document is "semantic": it is surrounded by PDF
markup that can be directly deleted. I can imagine there might be one
or two cases where publishers are adding an image to a PDF with your IP
address, in which case you can delete that single image. However, if
the page content is itself an image (no selectable text), then they
might have chosen to merge the watermark into the page image, in which
case the only way to remove it would be to use imagemagick as you say,
and draw over the offending region. I haven't seen this yet in any of
the documents I have read over the years.
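
To make the "directly deleted" case concrete, here is a minimal sketch
that blanks out any BT ... ET text block containing a given marker
string in a page content stream. It is an illustration, not how
pdfparanoia is implemented. The name strip_text_blocks and the sample
marker are assumptions, and the stream is assumed to be already
decompressed (a FlateDecode stream would first need zlib.decompress).
Overwriting with spaces keeps the stream length the same, so /Length
and the xref table stay valid for an uncompressed stream.

```python
import re

# A text object in a content stream is bracketed by the BT and ET operators.
# This naive pattern can be fooled by "ET" appearing inside a string literal.
TEXT_BLOCK = re.compile(rb"\bBT\b.*?\bET\b", re.S)


def strip_text_blocks(content: bytes, needle: bytes) -> bytes:
    """Blank every BT..ET block that contains `needle` (hypothetical helper)."""

    def blank(match: "re.Match[bytes]") -> bytes:
        block = match.group(0)
        # Spaces are plain whitespace in a content stream, so the page
        # still parses and the byte length is unchanged.
        return b" " * len(block) if needle in block else block

    return TEXT_BLOCK.sub(blank, content)
```

For example, calling strip_text_blocks(stream, b"Downloaded by") leaves
the article's own text blocks alone and erases only the block that
carries the marker.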

> Major disadvantage to this route is that it would convert a text +
> images PDF (high compression ratio, easy to extract text for re-use)
> into an images-only PDF (large file size, poor compression, impossible
> to extract text without OCR).

Right.

> If you can extract text of course, you could try extracting text +
> images and perhaps script the creation of an entirely new PDF file. This
> is the opposite approach; instead of blacklisting content ("This bit
> contains IP address info"), you're whitelisting content ("These bits are
> the text and images that form the actual paper").

How would you whitelist content you've never seen before?

- Bryan
http://heybryan.org/
1 512 203 0507

-- 
-- You received this message because you are subscribed to the Google Groups DIYbio group. To post to this group, send email to diybio at googlegroups.com. To unsubscribe from this group, send email to diybio+unsubscribe at googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/diybio?hl=en
Learn more at www.diybio.org



----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE




