[hackerspaces] Academic scraping

Bryan Bishop kanzure at gmail.com
Mon Jan 14 09:59:38 PST 2013


On Mon, Jan 14, 2013 at 11:51 AM, Lokkju Brennr wrote:
> see:
> http://scraperwiki.org
> http://scrapy.org/
>
> Once you have the raw data in a central location, it becomes much easier for
> someone specialized in data processing to convert it to usable form - even
> if it is hard to parse.  It does help to keep the metadata though...

One of my favorite scraping methods at the moment is phantomjs, a
headless wrapper around webkit.

http://phantomjs.org/
https://github.com/ariya/phantomjs
https://github.com/kanzure/pyphantomjs

But for academic projects, I highly recommend zotero's translators.

https://github.com/zotero/translators

Here's why. There's already a huge userbase of zotero users actively
updating these scrapers. When a scraper breaks, they fix it immediately.
The translators are all written in javascript, and they extract not only
the link to the pdf but also as much metadata as they can. With the help
of the zotero/translation-server project, they can be used headlessly.

https://github.com/zotero/translation-server
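
As a rough illustration of what "headlessly" can look like, here is a
minimal Python sketch that posts a page URL to a translation-server
instance and pulls PDF links out of the reply. Several things here are
assumptions, not taken from the project itself: that a server is running
locally on port 1969, that it accepts a plain-text URL on a /web
endpoint, and that the returned items carry the "attachments" list
(with "url" and "mimeType" fields) that translators populate. The
helper names are made up for this example; check your server version
before relying on any of it.

```python
import json
import urllib.request

# Assumption: a translation-server instance is reachable here.
SERVER = "http://127.0.0.1:1969"


def build_web_request(page_url, server=SERVER):
    """Build a POST to the (assumed) /web endpoint with the URL as plain text."""
    return urllib.request.Request(
        server + "/web",
        data=page_url.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def pdf_links(items):
    """Collect PDF attachment URLs from a list of item dicts.

    Assumes each item may have an "attachments" list whose entries use
    "url" and "mimeType" keys; items without one are skipped.
    """
    links = []
    for item in items:
        for attachment in item.get("attachments", []):
            if attachment.get("mimeType") == "application/pdf" and attachment.get("url"):
                links.append(attachment["url"])
    return links


def translate(page_url, server=SERVER):
    """Send one page URL to the server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_web_request(page_url, server)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, pdf_links(translate(some_article_url)) would give
the candidate pdf links for that page. Error handling, and pages that
match several items at once, are left out to keep the sketch short.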

I have a demo of this working on irc.freenode.net in ##hplusroadmap
(paperbot). The bot just grabs links from our conversation and posts the
pdfs so that we don't have to ask each other for copies.
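
The "grabs links from our conversation" step can be sketched in a few
lines of Python. This is not paperbot's actual code, only a guess at
the general idea: the regular expression, the set of trailing
punctuation to strip, and the function name extract_links are all
assumptions made for this example.

```python
import re

# Deliberately loose pattern: anything starting with http:// or https://
# up to the next whitespace, angle bracket, or quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

# Punctuation that often trails a URL in chat but is not part of it.
TRAILING = ".,;:!?)]}"


def extract_links(message):
    """Return the URLs found in one chat message, in order of appearance."""
    return [match.rstrip(TRAILING) for match in URL_PATTERN.findall(message)]
```

Each extracted link could then be handed to whatever does the actual
lookup, such as a translation-server query.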

- Bryan
http://heybryan.org/
1 512 203 0507

----- End forwarded message -----
-- 
Eugen* Leitl http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE




