Cache warmup with Varnish Cache

Advanced methods to warm up your cache after a restart or deployment

If you deploy your application within Docker, including Varnish, you most likely have a blue/green deployment up and running, so your visitors do not notice any downtime while your application switches to the new version. In that case you most probably want or need to warm up your Varnish cache before it gets hit by traffic.

Or maybe you have a simple deployment where Varnish keeps running the whole time, and you just want to refresh all caches without restarting Varnish (which would effectively kill all caches).

Anyhow, warming up is rather simple: access those pages once and you are good.

There are two easy ways to do that, which can also be combined:

  1. Using "wget" to crawl the whole site
  2. Using "curl" to crawl based on your sitemap.xml file

Let's automate it with "wget"

The command is simple:

wget -p -S --spider --max-redirect 0 --recursive --reject-regex "(customer|auth)" -U "custom warmup bot" -D yourdomain.com https://yourdomain.com

Depending on your application you may adjust some parameters or leave them out altogether:

  • -p: Get all images, CSS, JS, etc. referenced on each page. wget requests each URL only once, so there is no need to worry about excess unnecessary requests
  • -S: Include server response in output, could be helpful for any debugging later on
  • --spider: Don't keep any downloaded files
  • --max-redirect 0: How many redirects to follow, in this case: none. With Varnish even your redirects will be cached so this should be sufficient in most cases
  • --recursive: Follow all links, however deep they go
  • -D yourdomain.com: Limit wget to this domain only
  • --reject-regex "(customer|auth)": Ignore URLs matching this regex so that only cacheable content gets warmed up. A good starting point is to list all the pages you defined as "pass" in your Varnish config anyway
  • -U "custom warmup bot": User agent to use for request

Additional recommended params (a full example combining everything follows below):

  • --no-check-certificate: Ignore SSL certificate errors in case you are running all your requests locally with just a self-signed certificate
  • -e robots=off: Ignore your robots.txt
  • -l5: Limit how deep down the rabbit hole of your website wget should go, in this case 5 levels deep
  • -o wget.log: Write all output into a log file and keep the shell quiet until the crawl is finished
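
Putting it all together, a full warmup call could look like this (a sketch using the placeholder domain from above; adjust the regex, depth and log file to your site):

wget -p -S --spider --max-redirect 0 --recursive -l5 -e robots=off --no-check-certificate --reject-regex "(customer|auth)" -U "custom warmup bot" -D yourdomain.com -o wget.log https://yourdomain.com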

Running multiple wget crawlers in parallel for multiple sites

If you have multiple sites you want to warm up in parallel, you can run each crawl as a background job and wait until they are all finished:

wget ...params... https://your-first-domain.com 2>&1 &
wget ...params... https://your-second-domain.com 2>&1 &
wget ...params... https://your-third-domain.com 2>&1 &
jobs # show all running background jobs
wait # wait until all background running jobs are finished
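
If the list of domains grows, a small shell loop does the same thing and gives each crawl its own log file (assuming you use the -o option from above, so the outputs do not overwrite each other); the domain names are placeholders:

for domain in your-first-domain.com your-second-domain.com your-third-domain.com; do
  wget ...params... -o "wget-$domain.log" "https://$domain" &
done
wait # wait until all background jobs are finished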

wget vs wget2

wget is a battle-tested old piece of software which gets the job done, plain and simple. But one thing is missing in wget, and that is multi-threaded requests: every URL is requested one after the other, which can take quite a lot of time. To mitigate that, you can use wget2, which by default uses 5 threads to request URLs in parallel.

If you want to adjust it, you can use this option: --max-threads=5
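
For example, the wget command from above could look like this with wget2, assuming your options carry over (wget2 aims to be largely option-compatible with wget, but double-check the ones you rely on):

wget2 --max-threads=10 -p -S --spider --max-redirect 0 --recursive --reject-regex "(customer|auth)" -U "custom warmup bot" -D yourdomain.com https://yourdomain.com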

Stay local

You have to check whether the wget/curl requests to your website are handled locally or routed out over the public network and back in. If you can, put your domain into your /etc/hosts file pointing at your local IP (most probably 127.0.0.1) to ensure that your warmup script is performant and does not leave your server unnecessarily.
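
For example, a single line in /etc/hosts is usually enough to keep the warmup traffic on the machine (a sketch, assuming Varnish answers for this domain on 127.0.0.1):

127.0.0.1 yourdomain.com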

Let's automate it with "curl" and your sitemap.xml file

If you have a single sitemap, this command will extract all https URLs out of it and fetch them with 10 curl processes in parallel:

egrep -o "https://[^<]+" /path/to/your/sitemap.xml | xargs -P 10 -n 1 curl -A "custom warmup bot" -k -o /dev/null

Let's break it down:

  • egrep: This command searches for a text pattern, using extended regular expressions to perform the match. Running egrep is equivalent to running grep with the -E option
  • -o "https://[^<]+": The actual regex pattern to identify https urls
  • xargs -P 10 -n 1: Use 10 parallel processes, passing only 1 URL to each curl invocation
  • curl: Now we are using curl to request a url
  • -A "custom warmup bot": User agent to use for request
  • -k: Ignore SSL errors in case of a self-signed certificate
  • -o /dev/null: Discard the response body, otherwise you will see the whole HTML code of those pages in your shell

If you have multiple sitemaps, you would need to tweak your command a little:

cat /path/to/your/sitemaps/*.xml | egrep -o "https://[^<]+" | xargs -P 10 -n 1 curl -A "custom warmup bot" -k -o /dev/null
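
If your CMS publishes a sitemap index instead (a sitemap that only lists other sitemaps), you can resolve it first and then warm up the URLs from each child sitemap. This is a sketch built from the same tools as above; the index path is a placeholder:

curl -sk https://yourdomain.com/sitemap_index.xml | egrep -o "https://[^<]+\.xml" | xargs -n 1 curl -sk | egrep -o "https://[^<]+" | xargs -P 10 -n 1 curl -A "custom warmup bot" -k -o /dev/null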

How to do a forced refresh of cached content

There is a nice option in Varnish you can set to always force a cache miss, which in turn refreshes any existing cache entry. You can modify your Varnish config in such a way that it checks for a specific user agent and/or IP and then sets this refresh option:

vcl 4.1;

acl purge {
  "127.0.0.1";
}

sub vcl_recv {
  if (req.http.User-Agent == "custom warmup bot" && client.ip ~ purge) {
    set req.hash_always_miss = true;
  }
}

Afterwards, run your warmup commands, preferably "wget", to avoid requesting the same page multiple times (wget keeps track of pages it has already requested).
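
With that VCL in place, a quick way to check that the forced refresh works is to request a page from the server itself (so the client IP matches the acl) with the matching user agent and look at the Age response header, which Varnish sets by default and which should drop back to 0 after a fresh fetch. A sketch:

curl -sk -A "custom warmup bot" -o /dev/null -D - https://yourdomain.com/ | grep -i "^age:"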

Experiment with your warmup script

Depending on your server resources, you may be advised to use fewer or maybe more parallel processes, which greatly affects the total warmup time. You may also combine the two methods in case not every URL that is referenced in your sitemap.xml is also linked on your website directly. Since you should tweak your warmup script in such a way that it only requests cacheable content, it is totally fine to run both methods one after the other.
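
A combined run could be as simple as reusing the commands from above back to back (a sketch, assuming a single sitemap):

wget -p --spider --max-redirect 0 --recursive --reject-regex "(customer|auth)" -U "custom warmup bot" -D yourdomain.com -o wget.log https://yourdomain.com
egrep -o "https://[^<]+" /path/to/your/sitemap.xml | xargs -P 10 -n 1 curl -A "custom warmup bot" -k -o /dev/null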