Apr 26, 2015

QNAP/Linux - Python Programming 04 (Learn by Doing: Crawl & Parse Pages)

In the first post, "Python Programming 01 (Setup and the First Program)", I used geturl.py as an example of a first Python program. However, if you search for "url" or "http" on PyPI - the Python Package Index, you will find hundreds of relevant packages. There is no single right answer to the question of which package suits a given job.
https://pypi.python.org/pypi/urllib3/1.10.1
For example, I found the urllib3 package, which seems to be better than urllib. Let me try to import it in the interactive interpreter.
urllib3 was not installed
Download the urllib3 package, extract it, and install it with setuptools (ipkg install py26-setuptools).
Install urllib3 from its .py source code
Then try the import again; this time it succeeds.
import urllib3
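The same check can also be done from a script instead of the interactive prompt. This is only a minimal sketch; the helper name is_installed is made up for illustration and is not part of any package.

```python
def is_installed(module_name):
    """Return True if module_name can be imported, False otherwise."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        # Raised when the package is missing from the Python path.
        return False


for name in ('urllib', 'urllib3'):
    print('%s installed: %s' % (name, is_installed(name)))
```

If urllib3 is reported as not installed, the installation step above has to be done first.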
The urllib3 download page also shows sample code for using the package, such as the following example.
import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://google.com/')
print r.status, r.data

Now, learning by doing, the task is: crawl the blog post "http://ilearnblogger.blogspot.tw/2015/04/qnaplinux-python-programming-01-setup.html" and extract all the labels (tags) I assigned to it. Select that part of the page and press F12 to open Chrome's Web Inspector.
Go to the HTML elements that render the selected (highlighted) area
Press F12: study the DOM tree to find the path (XPath) that leads to the desired information
First, use urllib3 to get the page source.
Use sample code to get page easily
My blog page was successfully obtained.
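The fetch step can be wrapped in a small function that also checks the HTTP status. This is a sketch under assumptions, not the code from the screenshot: the name fetch_page and its optional http argument are hypothetical, and only PoolManager() and request('GET', ...) come from the urllib3 sample above.

```python
def fetch_page(url, http=None):
    """Return the body of url as bytes; raise IOError on a non-200 status."""
    if http is None:
        # Imported here so a stand-in object can be passed in when
        # urllib3 is not installed (an assumption made for this sketch).
        import urllib3
        http = urllib3.PoolManager()
    r = http.request('GET', url)
    if r.status != 200:
        raise IOError('GET %s returned HTTP %d' % (url, r.status))
    return r.data


# Example call (needs network access and urllib3):
# page = fetch_page('http://ilearnblogger.blogspot.tw/2015/04/qnaplinux-python-programming-01-setup.html')
```

Checking r.status before using r.data avoids parsing an error page by mistake.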
Next, how do I parse the HTML DOM tree? Searching the Package Index for 'parse html' shows that whatever you need, many packages already exist. But how do I choose the most suitable one? Google's ranking tends to reflect what most people find useful, and I like to get answers from Stack Overflow because of its large user base and high answer quality. So I searched for 'python parse html "2.6" "stackoverflow"' and got the following result.
https://www.google.com.tw/search?es_sm=93&q=python+parse+html+%222.6%22+%22stackoverflow%22
All that remains is to be patient and try those packages sensibly. That is, learn by doing and build up your programming experience. I read the following pages to decide which package to use.

http://stackoverflow.com/users/257111/rhino
Beautiful Soup 4.3.2 can be downloaded and installed with the following script:
# Beautiful Soup 4.3.2 (October 2, 2013). For Python 2 (2.6+) and Python 3
wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/beautifulsoup4-4.3.2.tar.gz
tar -xf beautifulsoup4-4.3.2.tar.gz
cd beautifulsoup4-4.3.2
python setup.py install
cd ..

Beautiful Soup 4.3.2 (October 2, 2013) supports Python 2 (2.6+) and Python 3, and it works. I tried it on both Python 2.6 (ipkg install py26) and the Python 2.7 installed from the QNAP App Center.
Python 2.7 on QNAP
After looking at some sample code, I wrote the following code to achieve the goal.
Code for extracting the "tags" defined in my blog article.
  • Line 7, 8: Get HTML source from web.
  • Line 10, 11: Parse HTML with BeautifulSoup and select DOM elements (in a list, i.e. array) containing CSS class "post-labels".
  • Line 13 - 15: The first for-loop takes each element from the array, then the second for-loop extracts the text of each link (links.string) and appends it to the list (tags).
Append string into Python list.
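The parsing steps in the list above can be sketched as follows. This is not the exact code from the screenshot: the HTML snippet is a made-up stand-in for the real page source, and the 'html.parser' argument is my own choice so that no extra parser library is needed.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page source; only the "post-labels" class
# name is taken from the article.
html = u"""
<div class="post-footer">
  <span class="post-labels">
    Labels:
    <a href="/search/label/Python" rel="tag">
Python
</a>,
    <a href="/search/label/QNAP" rel="tag">
QNAP
</a>
  </span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
labels = soup.select('.post-labels')   # list of elements with that CSS class

tags = []
for label in labels:                   # each "post-labels" element
    for links in label.find_all('a'):  # each link inside it
        tags.append(links.string)      # link text, newlines still included

print(tags)
```

With the real page, the html string would be replaced by the data fetched with urllib3.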
The extracted tags are Unicode strings, and each one is wrapped in '\n' characters. Although I am a Python beginner, I simply searched Google for 'python string trim' to see how to trim spaces or newlines from a string.
search 'python string trim' from Google
Sure enough, the first result pointed me to the strip() method. I revised the code and got the result.
update code: tags.append(links.string.strip()) .... add code: for s in tags: print s
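A small sketch of what strip() does to such values. The sample strings are made up to resemble the extracted tags and are not the real output.

```python
# Made-up values shaped like the tags extracted from the page.
tags = [u'\nPython\n', u'  QNAP \n']

cleaned = []
for t in tags:
    # strip() removes leading and trailing whitespace, including '\n'.
    cleaned.append(t.strip())

for s in cleaned:
    print(s)
```

Whitespace inside a tag is left alone; only the two ends are trimmed.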
Because Beautiful Soup returns the text as Unicode strings, the extracted Chinese tags are displayed correctly.
Extract tags from: http://ilearnblogger.blogspot.tw/2014/01/office-onenote.html
Learn by doing. Set a problem and a goal before you try to learn a programming language.

Source code: https://github.com/ilearnblogger/pyp/blob/master/extractDOM.py
