Removing watermarks from pdfs

6 Jul 2018

      How about getting rid of those pesky watermarks in pdfs?

As far as I can tell, the watermarks in use are all visible ones.
Invisible watermarks could be detected by comparing copies of the same
pdf retrieved through two different gateways (like two different
libraries). I have checked Nature Publishing Group and Elsevier
(specifically ScienceDirect) this way and found no checksum
differences.
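
The comparison described above can be sketched in a few lines of
Python. This is only a minimal sketch: the function names are made up
for illustration, and it assumes you already have two copies of the
same article saved locally from two different gateways.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large pdfs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b):
    """Return the byte offset where two files first differ, or None if identical."""
    data_a = Path(path_a).read_bytes()
    data_b = Path(path_b).read_bytes()
    for offset, (byte_a, byte_b) in enumerate(zip(data_a, data_b)):
        if byte_a != byte_b:
            return offset
    if len(data_a) != len(data_b):
        # One file is a strict prefix of the other.
        return min(len(data_a), len(data_b))
    return None
```

Matching digests mean the two copies are byte-for-byte identical. If
they differ, the offset from first_difference gives a place to start
looking in a hex editor, although a difference may be nothing more
than a timestamp in the pdf metadata.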

But there are some culprits out there that do some nasty things to documents:

* lines of text added to the document containing an ip address,
timestamp, university name, etc. (IEEE Xplore)

* entire pages added to documents with tracking information (Wiley? I
can't remember exactly.)

* some might be using CVE-2010-0188 to phone home to publishers. PDF
supports javascript and flash and other terrible things, so it would
be interesting to check whether any publishers have attempted to use
these vulnerabilities to their advantage.

* there might be "hidden" information inside a pdf that changes when
you download a document, but so far no evidence of this has been found
(so I don't believe it's likely, but it's worth keeping in mind).
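
One rough way to check a pdf for the scripting features mentioned in
the list above is to count the PDF name tokens commonly associated
with scripts and automatic actions. This is a sketch, not a real
parser: the list of names is my own assumption and not exhaustive, and
names can be hex-escaped or hidden inside compressed object streams,
so a count of zero proves nothing.

```python
import re

# PDF name tokens that tend to signal scripts or automatic actions.
# Assumed list for illustration; not exhaustive.
SUSPICIOUS_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/AA",
    b"/Launch",
    b"/RichMedia",
)


def count_suspicious_names(pdf_bytes):
    """Count occurrences of each suspicious name in the raw bytes of a pdf."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # The lookahead stops /JS from matching inside a longer name.
        pattern = re.compile(re.escape(name) + rb"(?![A-Za-z0-9])")
        counts[name.decode("ascii")] = len(pattern.findall(pdf_bytes))
    return counts
```

To use it, read a file in binary mode and pass the bytes in, for
example count_suspicious_names(open("paper.pdf", "rb").read()), where
paper.pdf is a placeholder filename.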

I think it would be useful to work on some ways to remove watermarks
from pdfs. I am aware of two main types of pdfs that publishers
distribute. One is the feared "collection of images", which may or may
not have extra images slapped on with ip address information. The
second is a pdf with actual selectable text. The first type, with just
images everywhere, can be de-watermarked by drawing images over the
offending text. The second type requires more creative thinking,
maybe just a collection of regular expressions.

For instance, here's a line that IEEE Xplore once added to a paper
that I was reading:

"Authorized licensed use limited to: University of Getting Schooled.
Downloaded on July 39, 2009 at 15:10 from IEEE Xplore. Restrictions
apply."

In fact, you can see this line appearing in about 4,000 other papers
that other people have been reading:

http://scholar.google.com/scholar?q=%22Authorized+licensed+use+limited+to%22
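
A collection of regular expressions for spotting lines like this in
text extracted from a pdf might start out as below. This is a sketch
under assumptions: the labels and function name are made up, the
patterns are written against only the two sample lines quoted in this
message (the IEEE Xplore line above and the AAAS/Science margin text
quoted further down), and real watermarks probably vary more than
these patterns allow.

```python
import re

# One pattern per known watermark. Labels are hypothetical.
WATERMARK_PATTERNS = {
    "ieee_xplore": re.compile(
        r"Authorized\s+licensed\s+use\s+limited\s+to:\s*.+?\.\s*"
        r"Downloaded\s+on\s+.+?\s+from\s+IEEE\s+Xplore\.\s*"
        r"Restrictions\s+apply\.",
        re.DOTALL,  # the watermark may wrap across lines
    ),
    "sciencemag": re.compile(
        r"Downloaded\s+from\s+www\.sciencemag\.org\s+on\s+"
        r"\w+\s+\d{1,2},\s+\d{4}"
    ),
}


def find_watermarks(text):
    """Return (label, matched_text) pairs for every known watermark in text."""
    hits = []
    for label, pattern in WATERMARK_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits
```

New samples can be added as further dictionary entries. Because the
function only reports matches, it can be run over a pile of extracted
text to see which patterns fire before anything is changed.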

Here's another example. AAAS/Science is of particular interest. They
attach an entire front page and add text in the margins everywhere:

"Downloaded from www.sciencemag.org on November 30, 1912"

So I think a good first step would be to collect examples of text
added to documents that needs to be detected by any eraser we write. In
fact, maybe all identifying information for an article should be
removed and replaced with an easy-to-copy-down text code (like
"blue-apple-oranges" to refer to a specific document in an index).

Does anyone else have some samples to share of nasty watermarks worth
removing? Also, any favorite ways to manipulate pdfs?
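
For the easy-to-copy-down text code idea mentioned above, one possible
sketch is to hash whatever identifier the index uses and map a few
bytes of the hash to words. Everything here is an assumption for
illustration: the word list, the function name, and the idea of keying
on an identifier string such as a DOI.

```python
import hashlib

# Hypothetical word list. With 16 words and 3 parts there are only
# 4096 codes, so a real index would need a much larger list.
WORDS = [
    "apple", "blue", "cedar", "delta", "ember", "fjord", "green", "harbor",
    "iris", "jade", "kiwi", "lemon", "maple", "north", "orange", "pearl",
]


def code_name(identifier, parts=3):
    """Map an identifier string to a stable hyphenated word code."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Each of the first `parts` bytes of the digest picks one word.
    return "-".join(WORDS[byte % len(WORDS)] for byte in digest[:parts])
```

The same identifier always yields the same code, so the index only
needs to store the mapping once. Collisions are possible and would
have to be handled by the index itself.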

- Bryan
http://heybryan.org/
1 512 203 0507

Bryan Bishop
