Mar 18, 2014

QNAP/Linux Tool - wget (2) Backup the whole website and page content

In the previous post, QNAP/Linux Tool - wget (1) Mirror Websites, the HTML files of a whole website were archived on your QNAP NAS. That is a simple way to back up your blog articles; however, the media content embedded in the posts is not fetched or archived.

To back up/archive a whole website together with its page content, first check your QNAP Linux file system and find a mounted file system with enough disk space. For example, the following "df" command shows the file systems on my QNAP. The first one is the root (/, i.e. /dev/ram0), a ramdisk with only 32.9 MB of space. If you run wget there, you will soon get the system message "insufficient ramdisk space!". Obviously, wget should be run under the directory "/share/HDA_DATA".
[/] # df
Filesystem                Size      Used Available Use% Mounted on
/dev/ram0                32.9M     15.2M     17.7M  46% /
tmpfs                    32.0M    636.0k     31.4M   2% /tmp
/dev/sda4               364.2M    306.1M     58.1M  84% /mnt/ext
/dev/md9                509.5M    172.6M    336.9M  34% /mnt/HDA_ROOT
/dev/sda3               233.3G     95.0G    137.8G  41% /share/HDA_DATA
tmpfs                    32.0M    352.0k     31.7M   1% /.eaccelerator.tmp
[/] #
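Assuming your layout is similar, create a working directory on the data volume and switch to it before running wget (the path below is simply the one used in the rest of this post):
[/] # mkdir -p /share/HDA_DATA/wgtest/page
[/] # cd /share/HDA_DATA/wgtest/page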

First, let's try to archive the whole page content (including images, icons, music, etc.) under the directory that I created.
[/share/HDA_DATA/wgtest/page] # wget -E -H -k -K -p -U Mozilla http://ilearnblogger.blogspot.tw/2014/03/qnaplinux-tool-wget-1.html

The result contains the target directory "ilearnblogger.blogspot.tw/", in which the same directory structure specified in the URL is maintained. Media content presented on the page is also archived under the source <hostname>/<path> structure. After the whole page content is fetched, links within the original HTML file are converted into relative paths on the local file system; the original HTML is therefore kept with a .orig suffix (qnaplinux-tool-wget-1.html.orig).
[/share/HDA_DATA/wgtest/page] # ls -la
drwxr-xr-x   12 admin    administ     4096 Mar 18 16:40 ./
drwxr-xr-x    8 admin    administ     4096 Mar 18 16:12 ../
drwxr-xr-x    3 admin    administ     4096 Mar 18 16:40 1.bp.blogspot.com/
drwxr-xr-x    3 admin    administ     4096 Mar 18 16:40 2.bp.blogspot.com/
drwxr-xr-x    5 admin    administ     4096 Mar 18 16:40 3.bp.blogspot.com/
drwxr-xr-x    3 admin    administ     4096 Mar 18 16:40 ilearnblogger.blogspot.tw/
drwxr-xr-x    3 admin    administ     4096 Mar 18 16:40 img1.blogblog.com/
drwxr-xr-x    3 admin    administ     4096 Mar 18 16:40 img2.blogblog.com/
drwxr-xr-x    3 admin    administ     4096 Mar 18 16:40 lh5.googleusercontent.com/
drwxr-xr-x    2 admin    administ     4096 Mar 18 16:40 pagead2.googlesyndication.com/
drwxr-xr-x    3 admin    administ     4096 Mar 18 16:40 upload.wikimedia.org/
drwxr-xr-x    2 admin    administ     4096 Mar 18 16:40 www.google.com/
[/share/HDA_DATA/wgtest/page] # ls ilearnblogger.blogspot.tw/2014/03/
qnaplinux-tool-wget-1.html      qnaplinux-tool-wget-1.html.orig
[/share/HDA_DATA/wgtest/page] #

To see what each option means, use grep to filter the desired options out of "wget -h". Note that "\|" turns "|" into an OR operator in grep's basic regular expressions, so several patterns can be extracted at once, and the leading "\" keeps the pattern, which starts with "-", from being mistaken for a command-line option.
[/share/HDA_DATA/wgtest] # wget -h | grep "\-E,\|-H,\|-k,\|-K,\|-p,\|-U,"
  -E,  --html-extension      save HTML documents with `.html' extension.
  -U,  --user-agent=AGENT    identify as AGENT instead of Wget/VERSION.
  -k,  --convert-links       make links in downloaded HTML point to local files.
  -K,  --backup-converted    before converting file X, back up as X.orig.
  -p,  --page-requisites     get all images, etc. needed to display HTML page.
  -H,  --span-hosts          go to foreign hosts when recursive.
[/share/HDA_DATA/wgtest] #

Note: both the short option (-) and the long option (--) work, but they are case-sensitive. The short option (e.g. -E) is for brevity, and the long option (e.g. --html-extension) is for readability. A long-option version of the page-archiving command is shown after the list below.
-E
--html-extension save HTML documents with `.html' extension.
-U
--user-agent=AGENT identify as AGENT instead of Wget/VERSION. Some web servers may reject robot-like clients (wget is one of them), so you should mimic a widely used browser (e.g. Chrome, IE) by specifying a browser string (http://www.useragentstring.com/pages/Chrome/) to make the website treat wget as a normal browser.
-k
--convert-links make links in downloaded HTML point to local files. The URLs of links, images, media content, etc. within the original HTML file are converted into relative paths under the local storage path.
-K
--backup-converted before converting file X, back up the original HTML file X as X.orig.
-p
--page-requisites get all images, etc. needed to display HTML page, i.e. get all the content presented on the page.
-H
--span-hosts go to foreign hosts when recursive.
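
As a sketch of the long-option form, the page-archiving command from above can be rewritten entirely with the human-readable names listed here (behaviour should be identical to the short-option version; --user-agent=Mozilla matches the -U Mozilla used earlier):
[/share/HDA_DATA/wgtest/page] # wget --html-extension --span-hosts --convert-links --backup-converted --page-requisites --user-agent=Mozilla http://ilearnblogger.blogspot.tw/2014/03/qnaplinux-tool-wget-1.html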


By creating a symbolic link, my blog article archived on my QNAP becomes available from the Web: http://192.168.0.110/page/<original URL>
[/share/HDA_DATA/wgtest/page] # ln -sf /share/HDA_DATA/wgtest/page/ /share/HDA_DATA/Web/page
All blog links are relative to the local storage.
The image URL is also relative to the local storage.
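
A quick way to confirm the archive is really being served (assuming, as above, that 192.168.0.110 is your NAS address and the Web share is published over HTTP) is to list the linked directory and fetch the archived page back through the web server:
[/share/HDA_DATA/wgtest/page] # ls -l /share/HDA_DATA/Web/page
[/share/HDA_DATA/wgtest/page] # wget -qO- http://192.168.0.110/page/ilearnblogger.blogspot.tw/2014/03/qnaplinux-tool-wget-1.html | head -n 5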

In this way, it is easy to back up/archive personal blogs and content published on cloud and social-network services; the wget options used to cache the whole page content are the ones explained above.


As for mirroring the whole site of my blog, it's easy: just add the mirror option (-m or --mirror).
[/share/HDA_DATA/wgtest/p1] # wget -m -E -H -k -K -p -U "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36" http://ilearnblogger.blogspot.tw/

In this case, I disguise the wget robot as the newest Chrome browser. Consequently, my blog's total page-view counter increased, since the command reads all articles of my blog at very high speed. However, my wget requests seem to have been banned by the Google Blogger site soon after, and the counter stopped increasing.
Before running the Chrome-mimicking wget, the counter was 18,268.
After running the Chrome-mimicking wget for several seconds, the counter really did increase.
No problem: not only simulate a browser (-U "Mozilla ..."), but also read pages like a normal human who reads one page every n seconds (-w 30 --random-wait), where n is a random number between 0 and 2*30. Since I want to go to sleep, wget is executed in the background (-b).
[/share/HDA_DATA/wgtest/p1] # wget -m -b -w 30 --random-wait -E -H -k -K -p -U "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36" http://ilearnblogger.blogspot.tw/
Continuing in background, pid 27352.
Output will be written to `wget-log'.
[/share/HDA_DATA/wgtest/p1] #
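
To peek at the background job before going to sleep, you can follow the log file that wget writes and check that the process is still alive:
[/share/HDA_DATA/wgtest/p1] # tail -f wget-log
[/share/HDA_DATA/wgtest/p1] # ps | grep wget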



Let's check the result tomorrow!
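
When the job is done, a rough way to check the result is to look at the size of the mirror and at the summary wget appends to the end of its log (the exact log wording may vary between wget versions):
[/share/HDA_DATA/wgtest/p1] # du -sh .
[/share/HDA_DATA/wgtest/p1] # tail -n 3 wget-log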
