Beating bots and saving cents

How to configure Apache on Debian to block specific user agents from accessing files with certain extensions for all virtual hosts on a server

This morning I received an alert from Linode that my server ‘exceeded the notification threshold for outbound traffic rate’. The only other time this has happened was two weeks ago and I’m pretty confident that neither this site, nor any of the other sites I run for various non-profits, was suddenly very popular. I didn’t have time to investigate the previous instance but a few days after that, the monthly bill included an extra $0.90 overage fee. Not exactly breaking the bank but I was curious to know what was going on and if there was a security issue or more expensive problems to come.

Checking the Apache logs for the last 24 hours with GoAccess quickly identified the culprits:

 Hits     h% Vis.     v%  Tx. Amount Data
 ---- ------ ---- ------ ----------- ----
 8718 44.31%  817 39.47%   20.60 GiB Crawlers
 3449 17.53%  122  5.89%   14.84 GiB  ├─ GPTBot/1.2     
  800  4.07%  211 10.19%    5.66 GiB  ├─ facebookexternalhit/1.1

Everything else was in the mega- and kilobyte range. Grepping over the log files made it clear that the crawlers were downloading all the MP3s from the various church sermon podcasts going back over a decade. While more biblical teaching might be good to have in the next GPT model, I would prefer it if OpenAI didn’t punish my server every fortnight.
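
For anyone wanting to reproduce that kind of report, the commands look something like this – the log path and the combined log format are assumptions for a stock Debian install:

# Interactive GoAccess report over the current and rotated logs
zcat -f /var/log/apache2/access.log* | goaccess --log-format=COMBINED -

# Rough byte count of MP3s served to GPTBot (field 10 is the response
# size in the combined log format)
zgrep -h 'GPTBot' /var/log/apache2/access.log* | grep '\.mp3' \
  | awk '{sum += $10} END {printf "%.2f GiB\n", sum/1024/1024/1024}'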

The next step was to decide how to stop them. Is robots.txt enough? Some initial investigation indicated that the Facebook crawler ignores robots.txt, insisting that the link sharing is user-initiated. Besides, I still want links to these sites to be shareable – that shouldn’t require downloading gigabytes of podcasts. Fail2ban has been quite effective at blocking unauthorised login attempts, but rate-limiting the crawlers might only spread out the traffic. Instead, I configured Apache to forbid crawlers with the above user agents from accessing .mp3 files.

To configure all virtual hosts at once on Debian, create /etc/apache2/conf-available/block.conf:

# Default robots.txt
Alias /robots.txt /var/www/robots.txt

# Block bots from scraping MP3s.
# InheritDown pushes these rules into every virtual host
# (each vhost still needs RewriteEngine On for them to run).
RewriteOptions InheritDown
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit|GPTBot
RewriteCond %{REQUEST_URI} \.(mp3)$
# Matching requests get a 403 Forbidden instead of the file
RewriteRule . - [R=403,L]
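
The Alias points /robots.txt on every site at a single shared file. Its contents are a separate decision; as a minimal sketch, assuming the podcasts live under a path like /podcasts/, /var/www/robots.txt could be as simple as:

# Shared robots.txt for all virtual hosts (served via the Alias above);
# /podcasts/ is a placeholder for wherever the MP3s actually live
User-agent: GPTBot
Disallow: /podcasts/

OpenAI says GPTBot honours robots.txt, but as noted above the Facebook crawler does not, so the rewrite rules do the actual enforcement.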

Enable the config with sudo a2enconf block, ensure each virtual host includes RewriteEngine On, and reload the server with sudo systemctl reload apache2. This also configures a server-wide robots.txt file. If I need to update the list of user agents, I can do it for all the sites at once. It worked well when I tested it by temporarily adding ‘Safari’ to the list, but I guess I’ll see how it goes in two weeks.
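
A quick way to confirm the whole chain works is to fake the user agent with curl – the hostname and file path here are placeholders:

# Check the config parses, then reload
sudo apache2ctl configtest
sudo systemctl reload apache2

# Pretend to be GPTBot: expect 403
curl -I -A 'GPTBot/1.2' https://example.org/podcasts/sermon.mp3

# Normal client: expect 200
curl -I https://example.org/podcasts/sermon.mp3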