In the first blog post "Python Programming 01 (Setup and the First Program)", I used geturl.py as an example to write a first Python program. However, if you search for "url" or "http" on PyPI - the Python Package Index, you will find hundreds of relevant packages in the results. There is no deterministic answer to which package is the right one for a job. For example, I found the urllib3 package, which seems to be better than urllib. Let me try to import it in the interactive interpreter.
[Screenshot: urllib3 was not installed]
Download the urllib3 package, extract it, and install it with setuptools (ipkg install py26-setuptools).
[Screenshot: install urllib3 from the .py source code]
Then try to import it again; this time it succeeds.
[Screenshot: import urllib3]
The urllib3 download page also shows some sample code for using this package, such as the following example.
import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://google.com/')
print r.status, r.data
Now, learning by doing, the exercise is: crawl the blog post "http://ilearnblogger.blogspot.tw/2015/04/qnaplinux-python-programming-01-setup.html" and extract all the tag Labels I defined in that post. Just select that part of the page and press F12 to open Chrome's Web Inspector.
[Screenshot: go to the HTML elements that render the selected, highlighted area]
[Screenshot: press F12 and observe the DOM tree's XPath to find the desired information]
First, use urllib3 to get the page source.
[Screenshot: use the sample code to get the page easily]
[Screenshot: my blog page was successfully obtained]
Then, how do we parse the HTML DOM tree? Search 'parse html' on the Package Index; whatever you need or can think of, there are probably already many packages for it. But how can I choose the most suitable one? Google search tends to surface the results most people appreciate, and I like to get answers from Stack Overflow because of its large user base and high-quality content. So I searched 'python parse html "2.6" "stackoverflow"' and looked at the results.
The only job you have to do is to be patient and try those packages wisely. That is, learn by doing and build up your programming experience. I read the following pages to decide which package to use.
# Beautiful Soup 4.3.2 (October 2, 2013). For Python 2 (2.6+) and Python 3
wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/beautifulsoup4-4.3.2.tar.gz
tar -xf beautifulsoup4-4.3.2.tar.gz
cd beautifulsoup4-4.3.2
python setup.py install
cd ..
Beautiful Soup 4.3.2 (October 2, 2013) is for Python 2 (2.6+) and Python 3. It works; I have tried it on Python 2.6 (ipkg install py26) and on Python 2.7 installed from the QNAP App Center.
[Screenshot: Python 2.7 on QNAP]
After looking at some sample code, I wrote the following code to achieve the goal.
[Screenshot: code for extracting the "tags" defined in my blog article]
- Lines 7-8: Get the HTML source from the web.
- Lines 10-11: Parse the HTML with BeautifulSoup and select the DOM elements (as a list, i.e. an array) that carry the CSS class "post-labels".
- Lines 13-15: The outer for-loop iterates over the elements in that list, and the inner for-loop extracts the text (links.string) and appends it to the list (tags).
[Screenshot: append strings to a Python list]
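The author's actual code is linked at the end of the post; as a rough sketch of the same logic, here is a self-contained version that uses only the standard library's html.parser instead of BeautifulSoup, and an inline Blogger-style HTML fragment instead of the fetched page. Both substitutions, and the exact shape of the markup, are assumptions for illustration, not the original code.

```python
# Sketch of the extraction logic described above, with two assumptions:
# html.parser stands in for BeautifulSoup, and SAMPLE stands in for the
# fetched page. Blogger renders post labels roughly like this fragment.
from html.parser import HTMLParser

SAMPLE = """
<span class="post-labels">
  Labels:
  <a href="/search/label/Python" rel="tag">Python</a>
  <a href="/search/label/QNAP" rel="tag">QNAP</a>
</span>
"""

class LabelExtractor(HTMLParser):
    """Collect the text of <a> links inside elements with class
    "post-labels". Simplified: assumes those elements contain no
    nested <span> tags."""

    def __init__(self):
        super().__init__()
        self.in_labels = False
        self.in_link = False
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "post-labels" in attrs.get("class", "").split():
            self.in_labels = True
        elif self.in_labels and tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "span":
            self.in_labels = False

    def handle_data(self, data):
        if self.in_labels and self.in_link:
            text = data.strip()   # drop surrounding whitespace/newlines
            if text:
                self.tags.append(text)

parser = LabelExtractor()
parser.feed(SAMPLE)
print(parser.tags)   # -> ['Python', 'QNAP']
```

The two nested loops in the blog's code correspond to the parser walking each "post-labels" element and each link inside it.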
The extracted tags are Unicode strings wrapped in '\n' characters. Being a Python beginner, I searched 'python string trim' on Google to learn how to trim spaces and newlines from a string.
[Screenshot: search 'python string trim' on Google]
No doubt, I found the strip() function in the first result. Revise the code and get the result.
[Screenshot: updated code: tags.append(links.string.strip()) ... added code: for s in tags: print s]
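With no arguments, strip() removes leading and trailing whitespace, including the '\n' characters wrapped around the extracted labels. A quick check:

```python
# strip() with no arguments removes leading and trailing whitespace,
# including newlines -- exactly what the extracted label text needs.
raw = u"\nPython\n"
print(repr(raw.strip()))   # -> 'Python'
```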
Since BeautifulSoup returns Unicode strings, the extracted Chinese tags can be shown correctly.
Learn by doing: set a problem and a goal before you start learning a programming language.
Source code:
https://github.com/ilearnblogger/pyp/blob/master/extractDOM.py