How to avoid ugly peak traffic by blocking bots

There are good bots, and then there are the others. And they can cause some trouble with aggressive peaks in traffic.

Depending on the estimate, bot traffic accounts for somewhere between 30-50% of all web traffic. I've seen websites where the real number is even higher. So there can be a real incentive to figure out a way to reduce traffic from malicious bots. Most of the time you want to let the good bots in and keep the bad guys out.

To be clear: the main objective is reducing traffic to our application to reduce load. But it's also beneficial to reduce outbound traffic to lower costs, if that matters to you.

What's the real issue here?

If a dumb or malicious bot takes aim at your website, you typically get around one to two dozen requests per second, either for a couple of minutes or, in bad cases, even for a couple of hours. Depending on your application and your customers' behaviour, this can be the equivalent of a few hundred visitors/customers at the same time.
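
A quick back-of-the-envelope calculation with assumed numbers: at 20 requests per second, a bot produces 1,200 requests per minute. If a typical real visitor triggers roughly 4-6 dynamic requests per minute while browsing, that single bot puts about the same load on your application as 200-300 concurrent visitors.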

If you have auto scaling in place, this could be a no-brainer for you in terms of application uptime. But cost-wise it could be really bad. If you only have one server, such traffic could bring it down quickly.

What are the solutions for it? Proper blocking with 403s.

Well, there's specialized bot management software, a rate limiter, better caching, scaling horizontally, just to name a few. Remember, we are talking about the bad bots here. There are recommendations for a proper robots.txt setup, but no bot is obligated to follow those rules, and the bad bots certainly don't.
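
As a side note on the rate limiter option: nginx ships with the limit_req module, and a minimal sketch could look like this (the zone name perip, the 10 requests per second and the burst of 20 are assumptions you would tune to your own traffic):

http {
    ...
    # one shared 10 MB zone, keyed by client IP, allowing about 10 requests per second per IP
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        ...
        location / {
            # allow short bursts of up to 20 extra requests, reject the rest
            limit_req zone=perip burst=20 nodelay;
            # answer rejected requests with 429 instead of the default 503
            limit_req_status 429;
            ...
        }
    }
}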

For this article I want to look at a combination of a blacklist and a whitelist, based on user agents, since this covers a lot of ground already.

Blacklist

There are some bots that are obviously bad. One example is the "Mysterious Mozlila User Agent bot". Another example is PetalBot, which can be aggressive and dumb at the same time, even continuing its requests for years, ignoring 403s for thousands of requests per day, no joke. Another example is every user agent with MSIE 2-9 or old Firefox versions (<60). Even an IE11 user agent (keyword "Trident") is mostly used by bad bots, though I still see a few real visitors with such old browsers every day. Still not joking.

Then there are the obvious scrapers, which I consider bad bots too, since they are oftentimes very aggressive as well. If it's a "python" or "ruby" script, a "Go-http-client" or "curl", in my experience it's most often bad.

You could create a simple rule in nginx to block requests based on specific words in their user agent string:

server {
    ...
    # block requests whose user agent contains one of these known bad patterns (case-insensitive)
    if ($http_user_agent ~* "^.*(MSIE\s*[2-9]|scanner|petalbot|ruby|wordpress|go-http-client|python|java|curl|spider|trident).*$") {
        return 403;
    }
    ...
}
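
To verify that the rule works, you can fire a request with a matching user agent and check for the 403. Here example.com is a placeholder for your own domain, and the exact version string doesn't matter, only that it contains "python":

curl -I -A "python-requests/2.28.1" https://example.com
# expect: HTTP/1.1 403 Forbidden (or HTTP/2 403)

A request with a normal browser user agent should still come back with a 200.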

If you really want to go deep, you can check out the nginx-ultimate-bad-bot-blocker project by Mitchell Krog.

These simple steps can go a long way toward protecting your server from unwanted access and potentially big spikes in traffic.

Whitelist

In certain cases you could additionally work with a whitelist, meaning: if the user agent contains "bot" or "http", then it had better be google|bingbot|twitter|whatsapp|facebook|duckduck|linkedin|instagram. Or if it's not "Mozilla/5.0", then it had better be some "Safari" or some flavor of "Android". This would eliminate a whole bunch of bad actors. But it also means that you really have to stay up to date with the latest user agents and regularly keep an eye on your access logs (you should anyway, but hey, I know we all have work to do).

You could create a simple rule in nginx like this:

server {
    ...
    set $isbot "";

    # the user agent looks like a bot or a plain http client
    if ($http_user_agent ~* (bot|http)) {
        set $isbot "${isbot}YES";
    }

    # ... and it is not one of the whitelisted services
    if ($http_user_agent !~* (google|bingbot|twitter|whatsapp|facebook|duckduck|linkedin|instagram)) {
        set $isbot "${isbot}NO";
    }

    # both conditions met: block it
    if ($isbot = 'YESNO') {
        return 403;
    }
    ...
}
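
A quick sanity check with two user agents (example.com again being a stand-in for your own domain):

curl -I -A "SomeRandomBot/1.0" https://example.com
# contains "bot" but no whitelisted name -> expect 403

curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com
# contains "bot" and "google" -> passes this rule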

Here we are, trying to do a simple if (this === true && that !== true) statement in nginx. Yeah, there are better ways to do that. But that's for another article.

Where are you?

Depending on whether you are a local, a national or even an international player, you need a different approach. Figuring that out can mean one server less if you have so much traffic that you need to scale horizontally. Maybe one day you could even shut down the server(s) dedicated to bot traffic only. But until then, there is even more optimization to do, so stay tuned.