Re: [hackerspaces] Academic scraping
On Mon, Jan 14, 2013 at 11:51 AM, Lokkju Brennr wrote:
see: http://scraperwiki.org http://scrapy.org/
Once you have the raw data in a central location, it becomes much easier for someone specialized in data processing to convert it into a usable form, even if it is hard to parse. It does help to keep the metadata, though.
One of my favorite scraping methods at the moment is PhantomJS, a headless wrapper around WebKit:

http://phantomjs.org/
https://github.com/ariya/phantomjs
https://github.com/kanzure/pyphantomjs

But for academic projects, I highly recommend Zotero's translators:

https://github.com/zotero/translators

Here's why: there is already a huge user base of Zotero users actively maintaining these scrapers, so when one breaks it gets fixed quickly. They are all written in JavaScript, and they extract not only the link to the PDF but also as much metadata as possible. With the help of the zotero/translation-server project, they can be used headlessly:

https://github.com/zotero/translation-server

I have a demo of this running in irc.freenode.net ##hplusroadmap (paperbot): it grabs paper links from our conversation and posts the PDFs so that we don't have to ask each other for copies.

- Bryan
http://heybryan.org/
1 512 203 0507
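To give a concrete idea of the PhantomJS side, here is a minimal sketch (not from the thread, just illustrative) of the kind of script it runs: load a URL, let the page's own JavaScript execute, then dump the rendered DOM so a separate parser can pull out the PDF link and metadata.

    // save-page.js -- dump the rendered HTML of a page with PhantomJS.
    // Run as: phantomjs save-page.js http://example.com/some-article
    var system = require('system');
    var page = require('webpage').create();

    if (system.args.length < 2) {
        console.log('usage: phantomjs save-page.js <url>');
        phantom.exit(1);
    }

    page.open(system.args[1], function (status) {
        if (status === 'success') {
            // page.content is the DOM after scripts have run, which is the
            // whole point of using a headless browser instead of a plain GET.
            console.log(page.content);
        } else {
            console.log('failed to load ' + system.args[1]);
        }
        phantom.exit(status === 'success' ? 0 : 1);
    });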
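And for the translation-server side, a rough sketch of what headless use looks like from Node. The specifics here are assumptions on my part: it presumes a translation-server instance running locally on its default port 1969, with a /web endpoint that takes a plain-text URL in the POST body; check the project's README for the exact interface of the version you run.

    // lookup.js -- ask a local zotero/translation-server instance for the
    // metadata of a paper URL. Assumes POST /web with a text/plain body on
    // localhost:1969 (verify against the translation-server README).
    var http = require('http');

    var url = process.argv[2];
    if (!url) {
        console.log('usage: node lookup.js <url>');
        process.exit(1);
    }

    var req = http.request({
        host: 'localhost',
        port: 1969,
        path: '/web',
        method: 'POST',
        headers: { 'Content-Type': 'text/plain' }
    }, function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
            // The response is whatever item metadata the matching
            // translator extracted (title, authors, DOI, attachments, ...).
            console.log(body);
        });
    });

    req.write(url);
    req.end();

This is roughly what paperbot does behind the scenes: feed a URL to the translators, get structured metadata plus the attachment link back.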