How to Download Files With Wget

Wget is a great tool for automating the download of entire websites, individual files, or anything that needs to mimic a traditional web browser. This article covers many of the things you can use wget for.

If wget isn’t installed, you can use apt or yum to install it:

Installing Wget on Debian, Ubuntu
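
On Debian and Ubuntu the package is simply called wget. Here is a minimal sketch, wrapped in a function so it can be dropped into a provisioning script. The function name install_wget_debian is only illustrative, and the sketch assumes your account has sudo rights:

```shell
install_wget_debian() {
  # Refresh the package index, then install wget without prompting.
  sudo apt-get update && sudo apt-get install -y wget
}
```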

Installing Wget on RHEL, CentOS
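
On RHEL and CentOS the equivalent goes through yum. Again, this is a sketch that assumes sudo access, and install_wget_rhel is a made-up name:

```shell
install_wget_rhel() {
  # -y answers "yes" to the confirmation prompt.
  sudo yum install -y wget
}
```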

Installing Wget on Windows

There is a Windows binary for wget, but we’ve found that Cygwin works much better and provides other useful tools as well.

Basic Download with Wget

For the most part you should be able to just download a file, but if it’s served over HTTPS you might run into certificate problems. In that case, use the --no-check-certificate flag.
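
Both variants could look roughly like this. The function names are hypothetical helpers, not part of wget, and the URL you pass in is up to you:

```shell
# Plain download into the current directory.
fetch() {
  wget "$1"
}

# Same download, but without validating the server's TLS certificate.
# This disables a security check, so only use it for hosts you trust.
fetch_insecure() {
  wget --no-check-certificate "$1"
}
```

For example, fetch_insecure https://example.com/file.tar.gz would download the file even if the certificate is self-signed.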

Download File into Different Name and Location

Maybe you want to download a file into a different name (-O) or location (-P)? By default wget will download the file to the current working directory and use the original file name.
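
A small sketch of both flags, using hypothetical helper names. The first argument is the URL and the second is the target file name or directory:

```shell
# -O writes the download to the given file name.
fetch_as() {
  wget -O "$2" "$1"
}

# -P saves the download under the given directory, keeping the original name.
fetch_into() {
  wget -P "$2" "$1"
}
```

So fetch_as https://example.com/archive.tar.gz backup.tar.gz saves the file as backup.tar.gz, while fetch_into https://example.com/archive.tar.gz /tmp/downloads keeps the name archive.tar.gz but places it in /tmp/downloads.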

Bulk Download List of Files in wget

If you need to download several files at once using wget, you can use the -i flag combined with a text file that lists one URL per line:
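
As a sketch, assuming a file such as urls.txt (a made-up name) that holds one URL per line:

```shell
fetch_list() {
  # -i tells wget to read the URLs to download from the given file.
  wget -i "$1"
}
```

Calling fetch_list urls.txt then works through the list from top to bottom.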

Change User Agent in wget

If a site’s administrators do not like wget hammering their website, you can change the user agent so that your requests no longer identify themselves as wget:
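
One way this might look is below. The user agent string is just an arbitrary browser-like example, and fetch_as_browser is a hypothetical helper:

```shell
fetch_as_browser() {
  # --user-agent replaces the default "Wget/<version>" identification string.
  wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0" "$1"
}
```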

Download Entire Website

Though you might need to fiddle with cookies, host spanning, recursion depth, domains and the other more advanced flags, you should start with a basic download of an entire website, using the “mirror” and “local browsing” flags:
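
A starting point could be the sketch below. We are assuming that “local browsing” refers to --convert-links, and --page-requisites is our own addition so that stylesheets and images come along too:

```shell
mirror_site() {
  # --mirror           turns on recursion and timestamping with unlimited depth
  # --convert-links    rewrites links in the saved pages so they work offline
  # --page-requisites  also fetches the CSS, images, etc. each page needs
  wget --mirror --convert-links --page-requisites "$1"
}
```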

Tip: You might also need to gunzip the files if they are compressed.

Rate Limit Wget Downloads

It is rude to blindly torch a server’s resources. It is polite (and won’t set off as many alarms) to request resources at a more respectable rate. Many site administrators block wget because, by default, people do not behave nicely. Here is how to be more polite when using wget:
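
The numbers in this sketch are arbitrary examples rather than recommendations. Tune them to what the site can reasonably handle:

```shell
polite_fetch() {
  # --limit-rate=200k  caps the download speed at roughly 200 KB/s
  # --wait=1           pauses one second between retrievals
  # --random-wait      varies that pause so requests look less mechanical
  wget --limit-rate=200k --wait=1 --random-wait "$@"
}
```

Because the function passes "$@" through, you can add further flags, for example polite_fetch --recursive https://example.com/docs/.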

Use Passwords with wget

This only works with HTTP authentication (such as basic auth), not with form-based logins, but here are the flags for supplying a user and password:
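
A minimal sketch with a hypothetical helper that takes the user, the password and the URL, in that order:

```shell
fetch_with_auth() {
  # $1 = user name, $2 = password, $3 = URL
  # Note: a password given on the command line is visible in the process list.
  wget --user="$1" --password="$2" "$3"
}
```

If you would rather not put the password on the command line, wget also has an --ask-password flag that prompts for it instead.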

Use wget to Check for Broken Links

If you are scanning a site, it’s polite to wait 1 second between grabs. The following will spider a site and look for broken links, dumping the information to a wget.log file.
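
Roughly, this could be done as follows. The helper name check_links is made up, and you may want to add a --level limit on larger sites:

```shell
check_links() {
  # --spider       checks that pages exist without saving them
  # --recursive    follows links found in the pages
  # --wait=1       waits one second between requests
  # --output-file  writes wget's messages to wget.log instead of the terminal
  wget --spider --recursive --wait=1 --output-file=wget.log "$1"
}
```

Afterwards you can search wget.log for 404 responses, for example with grep.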

Download MP3 files from Directory

It may be useful to limit your downloads to a specific directory and its subdirectories. The --no-parent flag will help with this. Here is an example to download mp3 files from a directory:
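
A sketch of that idea, where the directory URL is whatever you pass in and fetch_mp3s is a hypothetical name:

```shell
fetch_mp3s() {
  # --recursive   follows links below the starting URL
  # --no-parent   never climbs above the starting directory
  # --accept=mp3  keeps only files whose names end in mp3
  wget --recursive --no-parent --accept=mp3 "$1"
}
```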

Download all Pictures from Website using Wget

This example will put all of the jpg, gif, png, and jpeg files from site.com/images into the /tmp/pictures folder:
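
One possible combination of flags for this is sketched below. The --no-directories flag is our assumption about the desired layout: it flattens everything into a single folder.

```shell
fetch_pictures() {
  # --no-directories    saves all files flat, without recreating the site's folders
  # --directory-prefix  sets the folder the files are saved into
  # --accept            lists the file name suffixes to keep
  wget --recursive --no-parent --no-directories \
       --directory-prefix=/tmp/pictures \
       --accept=jpg,jpeg,gif,png "$1"
}
```

You would call it as fetch_pictures http://site.com/images/.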

Scan list of sites for New PDFs

Sometimes, there are particular files you are interested in and ONLY those files. Wouldn’t it be nice to monitor multiple websites for these files all at once and keep a local copy for easy browsing at your leisure? You can surely do this, though it might not provide the site owner with the ad revenue or metrics that they desire:
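
Here is a sketch under a few assumptions: the sites are listed one URL per line in a text file, the PDFs are linked directly from those pages (hence --level=1), and all names are illustrative.

```shell
fetch_new_pdfs() {
  # $1 = file listing one site URL per line, $2 = local folder for the PDFs
  # --timestamping skips files that are not newer than the local copy.
  # --no-directories flattens everything, so same-named PDFs from
  # different sites would collide.
  wget --recursive --level=1 --no-parent --no-directories \
       --timestamping --accept=pdf \
       --input-file="$1" --directory-prefix="$2"
}
```

Run from cron, something like fetch_new_pdfs sites.txt ~/pdfs would only pull down PDFs that have appeared or changed since the last run.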

Using wget with login cookies

You can have wget fetch the cookies itself, or you can log in with a browser and use the cookie file after you create it manually. I recently used this to get past a WordPress password-protected area of a membership site.
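
The first approach might look like the sketch below. The login URL and the form field names depend entirely on the site, so they are passed in as arguments, and cookies.txt is just an example file name:

```shell
# Step 1: submit the login form and keep the cookies the site sends back.
login_and_save_cookies() {
  # $1 = login URL, $2 = form data such as 'log=USER&pwd=PASS' (site-specific)
  wget --save-cookies=cookies.txt --keep-session-cookies \
       --post-data="$2" --output-document=/dev/null "$1"
}

# Step 2: reuse those cookies for pages behind the login.
fetch_with_cookies() {
  wget --load-cookies=cookies.txt "$1"
}
```

If you logged in with a browser instead, export its cookies to a cookies.txt file in the Netscape format and skip straight to the second function.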

Populate Cache Using Wget

WordPress has plugins that cache. There are also squid proxies and a plethora of other caching mechanisms. If you want to preload your caches (whatever they are), you can do it with wget:
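
A minimal sketch: request every page so the cache fills up, then throw the downloads away. The depth of 2 is an arbitrary choice and warm_cache is a hypothetical name:

```shell
warm_cache() {
  # --delete-after    removes each file once it has been downloaded
  # --no-directories  avoids creating a local folder tree in the meantime
  # --quiet           suppresses wget's progress output
  wget --recursive --level=2 --no-directories --delete-after --quiet "$1"
}
```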

Use wget Through Proxy

We use a SOCKS proxy quite a bit, tunnelled over ssh to a remote server, to bypass firewalls (ssh user@remote -D 7070). After the proxy is set up, we point Firefox and its SOCKS proxy config at 127.0.0.1:7070. You could use wget through a proxy like this:
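
As far as we know, wget itself only talks to HTTP-style proxies, which it reads from the http_proxy and https_proxy environment variables. This sketch therefore assumes an HTTP proxy is listening at the address you pass in; a raw SOCKS port from ssh -D may need a wrapper tool in front of wget. The helper name is made up:

```shell
wget_via_proxy() {
  # $1 = proxy URL, e.g. http://127.0.0.1:3128 (hypothetical address)
  # Remaining arguments are handed to wget unchanged.
  proxy_url="$1"
  shift
  http_proxy="$proxy_url" https_proxy="$proxy_url" wget "$@"
}
```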

If you use a different proxy, then just export it appropriately and your wget will pick it up from the environment.

Download with wget Using Timestamps

This isn’t so much a feature of wget as it is of the shell, but working hand in hand you can take a dynamic site and get periodic data from it, loading it into sequential snapshots.
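
For instance, the shell’s date command can stamp each download with the time it was taken. The file name pattern and the helper name here are only suggestions:

```shell
snapshot_page() {
  # Build a sortable timestamp such as 20240131-235959.
  stamp=$(date +%Y%m%d-%H%M%S)
  # Save this run's copy under its own name so older snapshots are kept.
  wget --output-document="snapshot-$stamp.html" "$1"
}
```

Scheduled from cron, each run leaves a new snapshot file next to the earlier ones.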
