Mar 13, 2014

QNAP/Linux Tool - wget (1) Mirror Websites

GNU Wget was developed to fetch files from the Internet (the Web); it supports HTTP, HTTPS, FTP, and other widely used protocols. Wget was born in 1996 out of the need to crawl the Web. I first studied the wget source code in 1998 while developing crawler programs for a search engine.
[Screenshot: GNU Wget 1.10 running under KDE 3.4.2 (image: http://upload.wikimedia.org/wikipedia/commons/a/a7/Wget-1.10-kde-3.4.2-de.png)]
GNU wget has many attractive features for fetching large files or mirroring entire Web or FTP sites:

  • download files named with wildcards and recursively mirror directories (see the FTP example after this list)
  • supports Unicode, so filenames written in many different languages are handled correctly
  • converts absolute links in downloaded pages to relative links for local mirror sites
  • available on most UNIX-like and Microsoft Windows systems
  • supports HTTP/HTTPS proxies, cookies, persistent HTTP connections, and resumable transfers
  • non-interactive operation, so it can be used in scripts and as a background service
  • uses server and local timestamps to efficiently check for modifications
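
For instance, the wildcard feature can fetch a whole set of files from an FTP server in one command. A minimal sketch (the host and path below are only placeholders for illustration); the URL is quoted so the shell does not expand the * itself, and wget performs the globbing against the FTP listing:
[/tmp/wget] # wget 'ftp://ftp.example.com/pub/releases/*.tar.gz'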

GNU Wget is distributed under the GNU General Public License.
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
Therefore, I think wget is one of the most useful tools on lightweight Linux-based or embedded systems, especially for fetching source code and packages to install. The manual on the official GNU Wget site (currently documenting Wget 1.13.4) clearly covers the widespread use cases of wget, or you can run the command with --help to check its arguments and options.
[/tmp/wget] # wget --help
GNU Wget 1.11.4, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
...
Logging and input file:
...
Download:
  -t,  --tries=NUMBER            set number of retries to NUMBER (0 unlimits).
       --retry-connrefused       retry even if connection is refused.
  -O,  --output-document=FILE    write documents to FILE.
  -nc, --no-clobber              skip downloads that would download to existing files.
  -c,  --continue                resume getting a partially-downloaded file.
       --progress=TYPE           select progress gauge type.
  -N,  --timestamping            don't re-retrieve files unless newer than local.
...

First, try to simply fetch a URL and write the output (-O) to a user-defined filename. Without -O, the filename is derived from the URL (index.html when the URL points at a directory).
[/tmp/wget] # wget -O youtube.html http://www.youtube.com/
--2014-03-12 23:42:53--  http://www.youtube.com/
Resolving www.youtube.com... 74.125.31.93, 74.125.31.91, 74.125.31.190, ...
Connecting to www.youtube.com|74.125.31.93|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `youtube.html'

    [ <=> ] 199,194      281K/s   in 0.7s

2014-03-12 23:42:55 (281 KB/s) - `youtube.html' saved [199194]

[/tmp/wget] # ls
youtube.html
[/tmp/wget] # ls -l
-rw-r--r--    1 admin    administ    199194 Mar 12 23:42 youtube.html
[/tmp/wget] #
[/tmp/wget] # wget http://www.youtube.com/
--2014-03-12 23:44:55--  http://www.youtube.com/
Resolving www.youtube.com... 74.125.31.190, 74.125.31.136, 74.125.31.93, ...
Connecting to www.youtube.com|74.125.31.190|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'

    [ <=> ] 198,955      219K/s   in 0.9s

2014-03-12 23:44:57 (219 KB/s) - `index.html' saved [198955]

[/tmp/wget] # ls -la
drwxr-xr-x    2 admin    administ        80 Mar 12 23:44 ./
drwxrwxrwx   15 admin    administ      1860 Mar 12 23:44 ../
-rw-r--r--    1 admin    administ    198955 Mar 12 23:44 index.html
-rw-r--r--    1 admin    administ    199194 Mar 12 23:42 youtube.html

To check the modification status of a URL periodically, use -N (timestamping: compare the server's timestamp with the local file and re-download only if the server copy is newer). Use -c to resume a previously unfinished download instead of starting over, and -t0 to allow an unlimited number of retries.
[/tmp/wget] # wget -t0 -c -N http://www.youtube.com/
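
Since wget is non-interactive, this check can be scheduled as a cron job so it really runs periodically. A minimal sketch, assuming an hourly schedule; the wget path and the target directory (-P) below are only examples, not values taken from this NAS:
0 * * * *  /usr/bin/wget -q -t0 -c -N -P /share/Download/mirror http://www.youtube.com/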

To make wget (as a crawler) behave more like a real person browsing a website, set the period between two requests to a random number of seconds in the range [0, 2*30] (between 0 and 2 * the --wait value). This tip helps keep your crawler from being banned for overly frequent accesses.
[/tmp/wget] # wget -c -N --wait=30 --random-wait http://www.youtube.com/
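
Along the same lines, the --limit-rate option caps the download bandwidth so that a long crawl does not saturate the server or your own link; the 100 KB/s value here is only an example:
[/tmp/wget] # wget -c -N --wait=30 --random-wait --limit-rate=100k http://www.youtube.com/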

It's very easy to crawl a whole site with wget; in this case I back up my blog. -r fetches the pages of the site (URL) recursively, and -p also downloads the page requisites (images, CSS, and so on) needed to display them. Then du (disk usage) reports the size (in KB) of each directory under ilearnblogger.blogspot.tw/. Finally, create a symbolic link from that directory into the Web share so that we can browse the mirrored blog article pages.
[/tmp/wget] # wget -rp http://ilearnblogger.blogspot.tw/
.... many normal message texts ...
[/tmp/wget] # du ilearnblogger.blogspot.tw/
632     ilearnblogger.blogspot.tw/2012/12
632     ilearnblogger.blogspot.tw/2012
1268    ilearnblogger.blogspot.tw/2013/01
348     ilearnblogger.blogspot.tw/2013/03
100     ilearnblogger.blogspot.tw/2013/04
256     ilearnblogger.blogspot.tw/2013/02
96      ilearnblogger.blogspot.tw/2013/07
92      ilearnblogger.blogspot.tw/2013/05
260     ilearnblogger.blogspot.tw/2013/09
88      ilearnblogger.blogspot.tw/2013/12
1128    ilearnblogger.blogspot.tw/2013/06
276     ilearnblogger.blogspot.tw/2013/08
3912    ilearnblogger.blogspot.tw/2013
1580    ilearnblogger.blogspot.tw/2014/02
1336    ilearnblogger.blogspot.tw/2014/01
1636    ilearnblogger.blogspot.tw/2014/03
4552    ilearnblogger.blogspot.tw/2014
11016   ilearnblogger.blogspot.tw/
[/tmp/wget] # cd ilearnblogger.blogspot.tw/2014/03
[/tmp/wget/ilearnblogger.blogspot.tw/2014/03] # ls
qnap-nas-find-useful-packages-in-ipkg.html   qnap-nas-my-first-app-helloworld.html    qnap-nas.html                             ubuntu-lesson-11-run-sh-shell.html
qnap-nas-install-qnap-app-with-qdk.html      qnap-nas-qdk-metadata-information.html   ssh-on-chrome-browser-secure-shell.html
qnap-nas-ipkg-and-optware-ipkg-app.html      qnap-nas-qdk.html                        ssh-putty.html
[/tmp/wget/ilearnblogger.blogspot.tw/2014/03] # cd ..
[/tmp/wget/ilearnblogger.blogspot.tw/2014] # cd ..
[/tmp/wget/ilearnblogger.blogspot.tw] # cd ..
[/tmp/wget] # ln -sf /tmp/wget/ilearnblogger.blogspot.tw /share/MD0_DATA/Web/iblog
[/tmp/wget] #
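
Note that -rp leaves the links in the downloaded pages pointing at the original site. To get a mirror that is fully browsable offline, the link-conversion feature listed earlier can be added; this is a sketch of the options I would try, not the exact command used for this backup:
[/tmp/wget] # wget --mirror --page-requisites --convert-links http://ilearnblogger.blogspot.tw/
Here --mirror implies -r -N with infinite recursion depth, and --convert-links (-k) rewrites the links after the download finishes so the downloaded pages reference the local copies.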

Mirroring my blog site with a QNAP NAS and wget.
QNAP preloads "wget" in its OS, QTS. Of course, the powerful wget cannot be introduced in only one article; I'll cover some more useful case studies soon.
