Recursive HTTP download with wget
Downloading a lot of files from an HTTP server with many subdirectories can be quite annoying. Anyone who has clicked through several
folders in a browser to download a couple of files knows what I’m talking about, especially when the subdirectories are nested several
levels deep. Of course there are browser extensions that help, but if you want to solve the problem in a shell, wget is your tool of choice.
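As a rough sketch of what such a call can look like: the URL, the target directory, the depth of 3 and the *.mp3 pattern below are placeholders made up for illustration, and fetch_tree is only a hypothetical wrapper name, not something wget provides.

```shell
#!/bin/sh
# fetch_tree URL DEST: hypothetical helper, not part of wget.
# Recursively fetches files matching *.mp3 below URL and stores them in DEST.
fetch_tree() {
    url=$1
    dest=$2
    # -r -l 3  recurse, but at most three levels deep (assumed depth)
    # -np      never ascend above the start directory
    # -nd      store all files flat instead of recreating the remote tree
    # -nv      print only errors and basic information
    # -A       keep only files matching the quoted pattern
    # -P       save everything below DEST
    wget -r -l 3 -np -nd -nv -A '*.mp3' -P "$dest" "$url"
}

# Example call with a placeholder URL:
# fetch_tree 'http://example.com/music/' ./downloads
```

Adjust or drop individual options to fit your server; only -r is strictly needed for a recursive download.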
The parameters explained below are taken from the wget manual page; some of them may be optional in your case:
- -r, --recursive: Turn on recursive retrieving. The default maximum depth is 5.
- -l depth, --level=depth: Specify recursion maximum depth level depth.
- -A acclist, --accept acclist: Specify comma-separated lists of file name suffixes or patterns to accept. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist, it will be treated as a pattern, rather than a suffix. In this case, you have to enclose the pattern into quotes to prevent your shell from expanding it, like in -A "*.mp3" or -A '*.mp3'.
- -p, --page-requisites: This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
- -nd, --no-directories: Do not create a hierarchy of directories when retrieving recursively.
- -np, --no-parent: Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
- -nv, --no-verbose: Turn off verbose without being completely quiet (use -q for that), which means that error messages and basic information still get printed.
- -k, --convert-links: After the download is complete, convert the links in the document to make them suitable for local viewing.
- -P prefix, --directory-prefix=prefix: Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).
- --http-user=user, --http-password=password: Specify the username user and password password on an HTTP server. According to the type of the challenge, Wget will encode them using either the “basic” (insecure), the “digest”, or the Windows “NTLM” authentication scheme.
- --no-check-certificate: Don’t check the server certificate against the available certificate authorities. Also don’t require the URL host name to match the common name presented by the certificate.
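
The remaining options from the list can be combined in a similar way. The following is only a sketch for fetching a password-protected directory for offline viewing; mirror_for_offline is a hypothetical helper name, and the URL, user name and password in the example call are placeholders.

```shell
#!/bin/sh
# mirror_for_offline URL DEST USER PASSWORD: hypothetical helper, not part of wget.
# Fetches the pages below URL, including what is needed to display them locally.
mirror_for_offline() {
    url=$1
    dest=$2
    user=$3
    password=$4
    # -r -np                  recurse, but stay below the start directory
    # -p -k                   fetch page requisites and rewrite links for local viewing
    # -nv                     print only errors and basic information
    # -P                      save everything below DEST
    # --http-user/-password   credentials; note that a password given on the
    #                         command line is visible to other users via ps
    # --no-check-certificate  skips certificate validation; only sensible for
    #                         hosts you trust, e.g. with a self-signed certificate
    wget -r -np -p -k -nv -P "$dest" \
        --http-user="$user" --http-password="$password" \
        --no-check-certificate "$url"
}

# Example call with placeholder values:
# mirror_for_offline 'https://example.com/docs/' ./docs alice secret
```

If the server presents a certificate your system already trusts, leave out --no-check-certificate.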