JavaScript Capable Webpage Download Tool?
Any new command line webpage downloaders capable of handling javascript, json, etc? Old CLI web tools won't render the text that GUI browsers automatically fetch, decode, and display.
------- Original Message -------
On Friday, September 3rd, 2021 at 12:50 AM, grarpamp <grarpamp@gmail.com> wrote:
Any new command line webpage downloaders capable of handling javascript, json, etc?
try selenium webdriver:

    WebDriver driver = new ChromeDriver();
    driver.get("https://en.wikipedia.org/wiki/Main_Page");
    String str = driver.findElement(By.xpath("//*[@id='some-article-element-id']/p")).getText();
    System.out.println(str);

note that you may just want the root document for the entire page: By.xpath("/html"). you can do this in other language bindings for selenium webdriver as well.

good luck!

best regards,
On 9/3/21, coderman <coderman@protonmail.com> wrote:
try selenium webdriver:
Wasn't really wanting to write some new tool from a library when someone out there has probably already done it better. Wget / wget2 / curl / fetch / elinks / *: the old CLI tools don't interpret javascript / json, so the pages they save to disk are 'source', not the text and links one would see in a GUI browser after the page's embedded js pulls the data out of the webserver. Then there are those pages that do crazy js-based variable mathfuscation, bloatfeaturing, etc. One of the old tools used to compile against an ecma/js library, but it was disclaimed as broken and capable of only basic js. Maybe one can run ./firefox -<options> <url> > <file>, but that's inefficient, has no spidering options, etc. In return, people should know that 'gron' and similar tools exist.
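For readers who haven't met gron: it flattens JSON into one greppable assignment per leaf. A minimal sketch of that idea in stdlib Python (the function name and exact output format here are illustrative, not gron's full grammar):

```python
import json

def gron(value, path="json"):
    """Flatten a parsed JSON value into gron-style assignment lines."""
    lines = []
    if isinstance(value, dict):
        lines.append(f"{path} = {{}};")
        for key, item in value.items():
            lines.extend(gron(item, f"{path}.{key}"))
    elif isinstance(value, list):
        lines.append(f"{path} = [];")
        for i, item in enumerate(value):
            lines.extend(gron(item, f"{path}[{i}]"))
    else:
        # json.dumps renders scalars in JSON syntax (true, null, "str", 1)
        lines.append(f"{path} = {json.dumps(value)};")
    return lines

doc = '{"title": "Main Page", "links": ["a", "b"]}'
for line in gron(json.loads(doc)):
    print(line)
```

Piping a page's embedded JSON blob through something like this makes it grep- and sed-friendly, which is most of what a CLI workflow needs.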
Thinking on different tools. Content scraping is also missing, in addition to spidering. What kind of tool would be helpful here, assuming we need to build it? Maybe a local http server that provides access to a selenium session as if it were non-js html? Elements with mouse events could be converted to links. My interest in this missing area relates to ecommerce scraping. I'd like for people to be able to work around the product search engines that use marketing algorithms for all their result orderings. I'd like for it to be easy for resellers to seed the inventory of decentralized marketplaces from existing platforms. The larger corps have been winning the scraping tech race recently, I believe. PhantomJS, which itself was another cultural polyfill for an ongoing situation, had a number of libraries and tools, but Selenium Webdriver has now replaced it. A web search for "phantomjs OR casperjs spider" turns up some hits; "selenium spider" appears to as well. For example: https://scalpel.readthedocs.io/en/latest/selenium-spider/
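The "elements with mouse events could be converted to links" idea can be sketched with stdlib parsing alone. Everything here is invented for illustration: the /click?n= endpoint and the numbering scheme are assumptions, and a real proxy would map each index back to a live element in the selenium session and replay the event there:

```python
from html.parser import HTMLParser

class ClickableToLinks(HTMLParser):
    """Collect elements carrying JS mouse handlers and re-expose them as
    plain links a non-JS client could follow (hypothetical proxy endpoint)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "onclick" in attrs or "onmousedown" in attrs:
            n = len(self.links)
            # a real proxy would remember which live element n refers to
            self.links.append(f'<a href="/click?n={n}">[{tag} #{n}]</a>')

page = ('<div onclick="load()">more</div>'
        '<span>text</span>'
        '<button onmousedown="go()">ok</button>')
parser = ClickableToLinks()
parser.feed(page)
print("\n".join(parser.links))
```

The non-js client then sees two ordinary hrefs; following one tells the proxy to fire the corresponding event in the browser session and re-render the page.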
Maybe a local http server that provides access to a selenium session as if it is non-js html?
I found https://github.com/alexandernst/headless_browser

    Headless browser based on WebKit
    ================================
    This tool will help you make your AJAX applications crawlable.
    Webpages based on JavaScript MVC libraries can't be positioned by
    default because search engines can't run (yet) all the JavaScript
    code that your page needs to execute in order to *show* anything.
    That's why you need a headless browser that will fetch the page,
    run the JavaScript and output the resulting HTML to the crawler,
    which will then be able to index your page.

from https://github.com/dhamaniasad/HeadlessBrowsers , a list which attempts to enumerate all headless browsers and apis. I'm on mobile and haven't tried it. I'm interested if anyone finds a free-beer ecommerce scraping solution or other tools for connecting the cli to the web.
[sorry, I misremembered the question and thought grarpamp was asking about spidering. if you just want to archive a page, it may help to know that interfaces like selenium usually have a single function to take a screenshot, and a comparable one to dump the rendered page body. so a downloader would be like 3-7 lines of code. that's probably why coderman's reply was so short.]
------- Original Message -------
On Saturday, September 4th, 2021 at 10:39 PM, grarpamp <grarpamp@gmail.com> wrote: ...
Wasn't really wanting to write some new tool,
headless chrome webdriver scripts are not "writing a new tool", but if you really want something simple, how about exporting your browser sessions as an alternative? e.g. an archival system like: https://archivebox.io/ i've also done this in python with https://code.google.com/archive/p/pyv8/ as well, but this is probably still too much "writing a new tool" for you :P v8 support in curl or wget would be a good public bounty contender, if such a bounty service existed...

best regards,
This one was easy enough to write: fetch, parse, then fetch the real page data. GUI browsers do all that behind the scenes. Wget '-r' style spidering would be needed for users to copy sites that use js/json frameworks sitewide. I'll look into the pointers in this thread to handle more complex sites, and to keep from writing and maintaining site-specific parsers. Endless scroll, pages with 1MB of var-encoded obfuscated js spread across multiple script sources just to display 100 lines of text... such wow. Whatever happened to plain old html, some href links, and form fields to submit?

Ecom marketing cluttering search... :( Search engines... There might be use for a spider that streams everything it scrapes in realtime into one formatted output pipe, instead of first to disk and then into the indexer. Something like continuous WARC to stdout/pipe. Parallel threads working different trees of the namespace. Auto nudge in case one gets stuck in the weeds.

Seeding inventory requires the destination market to provide an api for loading, or the seeder to write the shim loader, which you could then sell.

Connect the CLI to the darknets, one giant API :)
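The "continuous WARC to stdout/pipe" idea is mostly just serializing records as you go. A minimal sketch in stdlib Python, assuming only a small subset of the WARC 1.0 record layout (real tools would also emit WARC-Record-ID, digests, and proper response records rather than this bare "resource" type):

```python
import sys
from datetime import datetime, timezone

def warc_record(uri, body):
    """Serialize one minimal WARC 1.0 'resource' record as bytes."""
    payload = body.encode()
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    # record block is followed by two CRLFs per the WARC record layout
    return headers.encode() + payload + b"\r\n\r\n"

# Each spider thread would write whole records to the shared pipe (under
# a lock, so records don't interleave); the indexer reads one continuous
# WARC stream instead of waiting for files to land on disk.
sys.stdout.buffer.write(warc_record("http://example.com/", "<html>hi</html>"))
```

Because each record carries its own Content-Length, the downstream indexer can split the stream back into pages without any out-of-band framing.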
Connect the CLI to the darknets, one giant API :)
http://web.archive.org/web/20140328220045/http://sourceforge.net/apps/mediaw...
participants (3)
-
coderman
-
grarpamp
-
Karl Semich