iLearnBlogger: QNAP/Linux - Python Programming 04 (Learn by Doing: Crawl & Parse Pages)

In the first post, "Python Programming 01 (Setup and the First Program)", I used geturl.py as an example of a first Python program. However, if you search for "url" or "http" on PyPI, the Python Package Index, you will find hundreds of relevant packages. There is no deterministic answer to the question of which package is adequate for a given job.

https://pypi.python.org/pypi/urllib3/1.10.1

For example, I found the urllib3 package, which seems to be better than urllib. Let me try to import it in the interactive interpreter.

urllib3 was not installed

Download the urllib3 package, extract it, and install it with setuptools (ipkg install py26-setuptools).

Install urllib3 from its .py source code

Then try the import again; this time it succeeds.

import urllib3
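The same "try to import it" check can also be done from a script. The following is only a small sketch; the helper name is_installed is made up for illustration, and it is written so that it should run on both Python 2.6+ and Python 3.

```python
def is_installed(module_name):
    """Return True if `module_name` can be imported, False otherwise."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


# Report which of the two HTTP packages are available on this machine.
for name in ("urllib", "urllib3"):
    status = "installed" if is_installed(name) else "missing"
    print("%s: %s" % (name, status))
```

If urllib3 is reported as missing, install it as described above and run the check again.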

The urllib3 download page also shows sample code for using the package, such as the following example.

import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://google.com/')
print r.status, r.data

Now, learning by doing, the task is: crawl the blog post "http://ilearnblogger.blogspot.tw/2015/04/qnaplinux-python-programming-01-setup.html" and extract all the labels (tags) that I assigned to that post. In Chrome, select that part of the page and press F12 to open the Web Inspector and examine it.

Go to the HTML elements that render the selected highlight area

Press F12: observe the XPath pattern in the DOM tree that leads to the desired information

First, use urllib3 to get the page source.

Use the sample code to get the page easily

My blog page was successfully obtained.
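The sample code prints whatever comes back, even an error page. As a hedged sketch, the fetch step could be wrapped in a function that checks the status code first. The name fetch_html and its optional http parameter are assumptions made for illustration; only PoolManager() and request('GET', url) come from the urllib3 sample above.

```python
def fetch_html(url, http=None):
    """Return the page body as bytes, or raise IOError on a non-200 reply.

    `http` is any object with a urllib3-style request(method, url) method.
    When it is omitted, a urllib3.PoolManager is created.
    """
    if http is None:
        import urllib3  # imported lazily, so a stub can be passed in for testing
        http = urllib3.PoolManager()
    r = http.request('GET', url)
    if r.status != 200:
        raise IOError("GET %s returned HTTP %d" % (url, r.status))
    return r.data


# Usage (needs network access, so it is left commented out):
# html = fetch_html("http://ilearnblogger.blogspot.tw/2015/04/"
#                   "qnaplinux-python-programming-01-setup.html")
```

Raising an exception on a bad status keeps the parsing step from silently working on an error page.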

Next, how do we parse the HTML DOM tree? Search for 'parse html' on the Package Index and you will find that many packages already cover almost any need or idea. But how do I choose the most adequate one? Google's ranking tends to reflect what most people find useful, and I like to get answers from Stack Overflow because of its large user base and high quality. So I searched for 'python parse html "2.6" "stackoverflow"' and got the following results.

https://www.google.com.tw/search?es_sm=93&q=python+parse+html+%222.6%22+%22stackoverflow%22

The only thing you have to do is be patient and try those packages smartly. That is, learn by doing and build up your programming experience. I looked at the following pages to decide which package to use.

Rank 1 page (Parsing HTML in Python - Stack Overflow): the native HTML parser is too tedious (it may be good for advanced parsing jobs) and is not adequate for a beginner. µTidylib was NOT FOUND. html5lib has moved! htql is good at handling malformed HTML, but it consists of library modules that need to be installed on the QNAP. I'll keep it in mind and try it in the future.
Rank 2 page (Python code to remove HTML tags from a string): its purpose is manipulating DOM tree elements, not extracting them.
Rank 3 page (parse xhtml in python 2.6 - Stack Overflow): I was attracted by the sentence "Have you tried BeautifulSoup? It handles documents that aren't well formed and I've found it pretty good." The author, Rhino, is ranked in the top 2%, so I consider his advice the most valuable.

http://stackoverflow.com/users/257111/rhino
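To give an idea of why the native parser feels tedious, here is a rough sketch that collects the link text inside a class="post-labels" element using only the standard library. The class name LabelParser and the SAMPLE markup are made up for illustration, and the depth counter assumes well-nested markup with no void elements (such as br or img) inside the label block.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser     # Python 2


class LabelParser(HTMLParser):
    """Collect the text of <a> tags nested inside class="post-labels"."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0         # > 0 while inside a post-labels element
        self.in_link = False   # True while inside an <a> in that element
        self.tags = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "post-labels" in classes:
            self.depth += 1
        if self.depth and tag == "a":
            self.in_link = True
            self.tags.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_link:
            self.tags[-1] += data


# Hypothetical stand-in for a downloaded page.
SAMPLE = ('<div><span class="post-labels">Labels: '
          '<a href="#">\nPython\n</a>, <a href="#">\nQNAP\n</a></span>'
          '<a href="#">Home</a></div>')

parser = LabelParser()
parser.feed(SAMPLE)
parser.close()
print(parser.tags)
```

All of the state tracking (depth, in_link) has to be written by hand, which is the bookkeeping a library such as BeautifulSoup takes care of.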

BeautifulSoup 4.3.2 is downloaded and installed with the following script:

# Beautiful Soup 4.3.2 (October 2, 2013). For Python 2 (2.6+) and Python 3

wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/beautifulsoup4-4.3.2.tar.gz

tar -xf beautifulsoup4-4.3.2.tar.gz

cd beautifulsoup4-4.3.2

python setup.py install

cd ..

Beautiful Soup 4.3.2 (October 2, 2013) supports Python 2 (2.6+) and Python 3. It works: I've tried it on Python 2.6 (ipkg install py26) and on Python 2.7 installed from the QNAP App Center.

Python 2.7 on QNAP

After looking at some sample code, I wrote the following code to achieve the goal.

Code for extracting the "tags" defined in my blog article.

Line 7, 8: Get HTML source from web.
Line 10, 11: Parse HTML with BeautifulSoup and select DOM elements (in a list, i.e. array) containing CSS class "post-labels".
Line 13 - 15: The first for-loop takes each element from the array; the second for-loop then extracts the text (links.string) and appends it to the list (tags).

Append strings to a Python list.
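The steps above can be sketched roughly as follows. This is a minimal sketch, not a copy of extractDOM.py: the function name extract_tags, the SAMPLE markup, and the explicit "html.parser" argument are assumptions made for illustration, and an inline string stands in for the page fetched with urllib3 so that the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
SAMPLE = u"""
<div class="post-footer">
  <span class="post-labels">Labels:
    <a href="/search/label/Python" rel="tag">
Python
</a>,
    <a href="/search/label/QNAP" rel="tag">
QNAP
</a>
  </span>
</div>
"""


def extract_tags(html):
    """Return the text of every link inside a class="post-labels" element."""
    soup = BeautifulSoup(html, "html.parser")
    tags = []
    # First loop: each element carrying the CSS class "post-labels".
    for block in soup.find_all(class_="post-labels"):
        # Second loop: each link in that element; keep its text.
        for links in block.find_all("a"):
            tags.append(links.string)
    return tags


tags = extract_tags(SAMPLE)
print(tags)
```

With this sample, each extracted string still carries the surrounding newlines from the markup.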

The extracted tags are Unicode strings wrapped in '\n' characters. Although I am a Python beginner, I simply searched Google for 'python string trim' to see how to trim spaces or newlines from a string.

Search 'python string trim' on Google

Sure enough, I found the strip() function in the first result. I revised the code and got the result.

update code: tags.append(links.string.strip()) .... add code: for s in tags: print s
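As a small illustration of what strip() does to such strings, here is a sketch with made-up tag values. With no argument, strip() removes leading and trailing whitespace (spaces, tabs, newlines) and leaves whitespace inside the string alone.

```python
# Made-up values shaped like the raw extracted tags.
raw_tags = [u"\nPython\n", u"\n  QNAP NAS \n", u"\tLinux\n"]

tags = []
for raw in raw_tags:
    tags.append(raw.strip())   # trim both ends, keep inner spaces

for s in tags:
    print(s)
```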

Because BeautifulSoup returns the extracted strings as Unicode, the Chinese tags are shown correctly.

Extract tags from: http://ilearnblogger.blogspot.tw/2014/01/office-onenote.html
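To make the Unicode point concrete, the sketch below uses a made-up tag of two CJK characters. It shows that the Unicode string counts characters, while its UTF-8 encoding is a longer byte sequence that decodes back to the same text.

```python
tag = u"\u7b46\u8a18"            # made-up tag: two CJK characters

encoded = tag.encode("utf-8")    # bytes, e.g. for a file or a socket
decoded = encoded.decode("utf-8")

print(len(tag))       # number of characters
print(len(encoded))   # number of bytes in UTF-8
print(decoded == tag)
```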

Learn by doing: set the problem and the goal before you try to learn a programming language.

Source code: https://github.com/ilearnblogger/pyp/blob/master/extractDOM.py


Apr 26, 2015
