Nov 8, 2015

Node.js - Using PhantomJS as a headless browser

Node.js makes JavaScript a natural choice for programming web applications. PhantomJS, a headless WebKit browser scripted in JavaScript (it is a standalone program, not a Node.js module), is a good fit for crawling, parsing, and interacting with web pages automatically through PhantomJS scripts.
http://phantomjs.org/
To crawl web pages with Node.js, searching npm for "crawl" is trivial. However, I found that many of the Node.js packages on npm are not mature enough for developing stable and flexible apps. In my case, only PhantomJS qualified.
https://www.npmjs.com/package/crawler
Compared with those npm packages, PhantomJS's documentation is complete and clear, so I soon downloaded it for testing.
http://phantomjs.org/download.html
The good news is that PhantomJS is all-in-one and ships as a prebuilt binary: everything it needs is integrated into a single executable, "phantomjs.exe".
PhantomJS environment: one binary file and many example scripts
PhantomJS is also an open source project on GitHub; the build process for several OS platforms is described at http://phantomjs.org/build.html. Reading the release notes, I learned that PhantomJS is built on Qt 5. Qt is a cross-platform development framework widely used in embedded systems and apps. Qt also supports WebKit (Qt WebKit), which is what makes web page automation possible.
http://phantomjs.org/releases.html
Following the "Quick Start" guide, writing a first program is quick and easy.
  • LINE 1: Load PhantomJS's "webpage" and "system" modules (both are built into PhantomJS, not Node.js).
  • LINE 2-5: Verify the command-line argument list.
  • LINE 10-22: page.open(url, callback) defines a callback function that processes the page fetched by the webpage module (assigned to page). The callback measures the loading time of the URL and calls evaluate(callback) to run a function inside the page and return the document title.
  • LINE 21: phantom.exit() must be called; otherwise phantomjs.exe will not terminate and will keep consuming system memory.
Module packages: webpage and system
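The screenshot of the script is not reproduced here, so the following is only a minimal sketch of what such a script might look like, modeled on the load-time and page-title examples in the PhantomJS Quick Start. It is not the exact listing described in the list, so the line numbers there will not match. The helper describeLoad is a hypothetical name introduced for illustration, and the typeof phantom guard is only there so that the file does nothing when loaded outside PhantomJS.

```javascript
// url.js: report the load time and title of a page (sketch, not the original listing).

// Hypothetical helper: builds the message printed for one page load.
function describeLoad(status, url, elapsedMs, title) {
  if (status !== 'success') {
    return 'FAIL to load the address ' + url;
  }
  return 'Loaded ' + url + ' in ' + elapsedMs + ' msec, title: ' + title;
}

// Everything below needs the PhantomJS runtime (the global "phantom" object).
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();   // PhantomJS module
  var system = require('system');           // PhantomJS module

  if (system.args.length < 2) {
    console.log('Usage: phantomjs url.js <URL>');
    phantom.exit(1);
  } else {
    var url = system.args[1];
    var start = Date.now();

    page.open(url, function (status) {
      var elapsed = Date.now() - start;
      // evaluate() runs the function inside the page and returns its result.
      var title = status === 'success'
        ? page.evaluate(function () { return document.title; })
        : '';
      console.log(describeLoad(status, url, elapsed, title));
      phantom.exit();   // without this, the phantomjs process keeps running
    });
  }
}
```

Keeping the message formatting in a plain function separates it from the PhantomJS-specific calls, which makes that part easy to check on its own.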
Save this code as "url.js" and run it with phantomjs.exe from the command line. The first run failed because "http://" was not given. The second run successfully fetched the web page, whose title is "Google".
Command line of PhantomJS
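The failed first run suggests a small guard: add a scheme when the argument lacks one. The following is a minimal sketch of that idea. ensureScheme is a hypothetical helper, not part of PhantomJS, and defaulting to "http://" is an assumption; www.example.com is a placeholder address.

```javascript
// Hypothetical helper: prepend "http://" when the argument has no scheme.
function ensureScheme(address) {
  // A scheme is a letter followed by letters, digits, "+", "-" or ".", then "://".
  var hasScheme = /^[a-z][a-z0-9+.\-]*:\/\//i.test(address);
  return hasScheme ? address : 'http://' + address;
}

console.log(ensureScheme('www.example.com'));         // http://www.example.com
console.log(ensureScheme('https://www.example.com')); // https://www.example.com
```

In a PhantomJS script, this would be applied to system.args[1] before passing it to page.open().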
PhantomJS is a powerful headless browser (see Headless Testing) that can be used to develop web robots that simulate human behavior. Related tools include:
  • CasperJS is useful to build scripted navigation and testing
  • Lotte adds jQuery-like methods, chaining, and more assertion logic
  • WebSpecter is a BDD-style acceptance test framework for web applications
Screen capture is also very easy with page.render(). Image formats such as .jpg, .gif, and .png are supported, and pages can also be rendered as PDF files.
  • LINE 21: page.render('google.jpg') renders the web page and saves the page image as the file 'google.jpg' in the current working folder.

Render web pages into images
Rendered pages saved as an image: google.jpg
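A rendering script along these lines might look like the sketch below. The original listing is not reproduced here, so this is an illustration rather than the author's code. canRender is a hypothetical helper that checks the file extension against the formats named in the text (.jpeg is accepted as an alias of .jpg), and the viewport size is an arbitrary assumption.

```javascript
// render.js: save a page as an image or PDF (sketch under the assumptions above).

// Hypothetical helper: true if the file name ends in a format mentioned in the text.
function canRender(fileName) {
  return /\.(jpe?g|gif|png|pdf)$/i.test(fileName);
}

// Everything below needs the PhantomJS runtime (the global "phantom" object).
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var system = require('system');
  var url = system.args[1];
  var output = system.args[2] || 'google.jpg';

  if (!url || !canRender(output)) {
    console.log('Usage: phantomjs render.js <URL> [output.jpg|.gif|.png|.pdf]');
    phantom.exit(1);
  } else {
    page.viewportSize = { width: 1024, height: 768 };  // assumed size
    page.open(url, function (status) {
      if (status === 'success') {
        page.render(output);   // format is inferred from the file extension
        console.log('Saved ' + output);
      } else {
        console.log('FAIL to load the address');
      }
      phantom.exit();
    });
  }
}
```

Checking the extension before opening the page avoids loading a URL only to discover that the output name is unusable.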
The "examples" directory of PhantomJS contains a script, rasterize.js (30 lines), that demonstrates PhantomJS's rendering features more completely.
In conclusion, PhantomJS is a nice tool for developing automated web robots that crawl web pages the way a human would.


