Mar 13, 2014

QNAP/Linux Tool - wget (1) Mirror Websites

GNU Wget was developed to fetch files from the Internet (the Web); it supports HTTP, HTTPS, FTP, and other widely used protocols. Wget was born in 1996 out of the need to crawl the Web. I first studied the wget source code in 1998 while developing crawler programs for a search engine.
[Screenshot: GNU Wget 1.10 running under KDE 3.4.2 (image: http://upload.wikimedia.org/wikipedia/commons/a/a7/Wget-1.10-kde-3.4.2-de.png)]
GNU wget has many attractive features for fetching large files or mirroring entire Web or FTP sites:

  • download files named with wildcards and recursively mirror directories (see the FTP example after this list)
  • supports Unicode, so filenames written in many different languages are handled correctly
  • converts absolute links in downloaded pages to relative links for local mirror sites
  • available on most UNIX-like and Microsoft Windows systems
  • supports HTTP/HTTPS proxies, cookies, persistent HTTP connections, and resumable transfers
  • non-interactive operation, so it can be used in scripts and as a background service
  • uses server and local timestamps to efficiently check for modifications
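
For instance, the wildcard feature can fetch a whole set of files from an FTP server in one command. A minimal sketch (the host and path below are only placeholders for illustration); the URL is quoted so the shell does not expand the * itself, and wget performs the globbing against the FTP listing:
[/tmp/wget] # wget 'ftp://ftp.example.com/pub/releases/*.tar.gz'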

GNU Wget is distributed under the GNU General Public License.
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
Therefore, I think wget is one of the most useful tools on lightweight Linux-based or embedded systems, especially for fetching source code and packages to install. The manual on the official GNU Wget site (currently documenting Wget 1.13.4) clearly covers the widespread use cases of wget, or you can run the command with --help to check its arguments and options.
[/tmp/wget] # wget --help
GNU Wget 1.11.4, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
...
Logging and input file:
...
Download:
  -t,  --tries=NUMBER            set number of retries to NUMBER (0 unlimits).
       --retry-connrefused       retry even if connection is refused.
  -O,  --output-document=FILE    write documents to FILE.
  -nc, --no-clobber              skip downloads that would download to existing files.
  -c,  --continue                resume getting a partially-downloaded file.
       --progress=TYPE           select progress gauge type.
  -N,  --timestamping            don't re-retrieve files unless newer than local.
...

First, try to simply fetch a URL and write the output (-O) to a user-defined filename. Without -O, the filename is derived from the URL (index.html when the URL points at a directory).
[/tmp/wget] # wget -O youtube.html http://www.youtube.com/
--2014-03-12 23:42:53--  http://www.youtube.com/
Resolving www.youtube.com... 74.125.31.93, 74.125.31.91, 74.125.31.190, ...
Connecting to www.youtube.com|74.125.31.93|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `youtube.html'

    [ <=> ] 199,194      281K/s   in 0.7s

2014-03-12 23:42:55 (281 KB/s) - `youtube.html' saved [199194]

[/tmp/wget] # ls
youtube.html
[/tmp/wget] # ls -l
-rw-r--r--    1 admin    administ    199194 Mar 12 23:42 youtube.html
[/tmp/wget] #
[/tmp/wget] # wget http://www.youtube.com/
--2014-03-12 23:44:55--  http://www.youtube.com/
Resolving www.youtube.com... 74.125.31.190, 74.125.31.136, 74.125.31.93, ...
Connecting to www.youtube.com|74.125.31.190|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'

    [ <=> ] 198,955      219K/s   in 0.9s

2014-03-12 23:44:57 (219 KB/s) - `index.html' saved [198955]

[/tmp/wget] # ls -la
drwxr-xr-x    2 admin    administ        80 Mar 12 23:44 ./
drwxrwxrwx   15 admin    administ      1860 Mar 12 23:44 ../
-rw-r--r--    1 admin    administ    198955 Mar 12 23:44 index.html
-rw-r--r--    1 admin    administ    199194 Mar 12 23:42 youtube.html

To check the modification status of a URL periodically, use -N (timestamping: compare the server's timestamp with the local file and re-download only if the server copy is newer). Use -c to resume a previously unfinished download instead of starting over, and -t0 to allow an unlimited number of retries.
[/tmp/wget] # wget -t0 -c -N http://www.youtube.com/
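
Since wget is non-interactive, this check can be scheduled as a cron job so it really runs periodically. A minimal sketch, assuming an hourly schedule; the wget path and the target directory (-P) below are only examples, not values taken from this NAS:
0 * * * *  /usr/bin/wget -q -t0 -c -N -P /share/Download/mirror http://www.youtube.com/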

To make wget (as a crawler) behave more like a real person browsing a website, set the period between two requests to a random number of seconds in the range [0, 2*30] (between 0 and 2 * the --wait value). This tip helps keep your crawler from being banned for overly frequent accesses.
[/tmp/wget] # wget -c -N --wait=30 --random-wait http://www.youtube.com/
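
Along the same lines, the --limit-rate option caps the download bandwidth so that a long crawl does not saturate the server or your own link; the 100 KB/s value here is only an example:
[/tmp/wget] # wget -c -N --wait=30 --random-wait --limit-rate=100k http://www.youtube.com/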

It's very easy to crawl a whole site with wget; in this case I back up my blog. -r fetches the pages of the site (URL) recursively, and -p also downloads the page requisites (images, CSS, and so on) needed to display them. Then du (disk usage) reports the size (in KB) of each directory under ilearnblogger.blogspot.tw/. Finally, create a symbolic link from that directory into the Web share so that we can browse the mirrored blog article pages.
[/tmp/wget] # wget -rp http://ilearnblogger.blogspot.tw/
.... many normal message texts ...
[/tmp/wget] # du ilearnblogger.blogspot.tw/
632     ilearnblogger.blogspot.tw/2012/12
632     ilearnblogger.blogspot.tw/2012
1268    ilearnblogger.blogspot.tw/2013/01
348     ilearnblogger.blogspot.tw/2013/03
100     ilearnblogger.blogspot.tw/2013/04
256     ilearnblogger.blogspot.tw/2013/02
96      ilearnblogger.blogspot.tw/2013/07
92      ilearnblogger.blogspot.tw/2013/05
260     ilearnblogger.blogspot.tw/2013/09
88      ilearnblogger.blogspot.tw/2013/12
1128    ilearnblogger.blogspot.tw/2013/06
276     ilearnblogger.blogspot.tw/2013/08
3912    ilearnblogger.blogspot.tw/2013
1580    ilearnblogger.blogspot.tw/2014/02
1336    ilearnblogger.blogspot.tw/2014/01
1636    ilearnblogger.blogspot.tw/2014/03
4552    ilearnblogger.blogspot.tw/2014
11016   ilearnblogger.blogspot.tw/
[/tmp/wget] # cd ilearnblogger.blogspot.tw/2014/03
[/tmp/wget/ilearnblogger.blogspot.tw/2014/03] # ls
qnap-nas-find-useful-packages-in-ipkg.html   qnap-nas-my-first-app-helloworld.html    qnap-nas.html                             ubuntu-lesson-11-run-sh-shell.html
qnap-nas-install-qnap-app-with-qdk.html      qnap-nas-qdk-metadata-information.html   ssh-on-chrome-browser-secure-shell.html
qnap-nas-ipkg-and-optware-ipkg-app.html      qnap-nas-qdk.html                        ssh-putty.html
[/tmp/wget/ilearnblogger.blogspot.tw/2014/03] # cd ..
[/tmp/wget/ilearnblogger.blogspot.tw/2014] # cd ..
[/tmp/wget/ilearnblogger.blogspot.tw] # cd ..
[/tmp/wget] # ln -sf /tmp/wget/ilearnblogger.blogspot.tw /share/MD0_DATA/Web/iblog
[/tmp/wget] #
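
Note that -rp leaves the links in the downloaded pages pointing at the original site. To get a mirror that is fully browsable offline, the link-conversion feature listed earlier can be added; this is a sketch of the options I would try, not the exact command used for this backup:
[/tmp/wget] # wget --mirror --page-requisites --convert-links http://ilearnblogger.blogspot.tw/
Here --mirror implies -r -N with infinite recursion depth, and --convert-links (-k) rewrites the links after the download finishes so the downloaded pages reference the local copies.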

Mirroring my blog site with a QNAP NAS and wget.
QNAP preloads "wget" in its OS, QTS. Of course, the powerful wget cannot be introduced in only one article; I'll cover some more useful case studies soon.
