Beating bots and saving cents

How to configure Apache on Debian to block specific user agents from accessing files with certain extensions for all virtual hosts on a server

This morning I received an alert from Linode that my server ‘exceeded the notification threshold for outbound traffic rate’. The only other time this has happened was two weeks ago and I’m pretty confident that neither this site, nor any of the other sites I run for various non-profits, was suddenly very popular. I didn’t have time to investigate the previous instance but a few days after that, the monthly bill included an extra $0.90 overage fee. Not exactly breaking the bank but I was curious to know what was going on and if there was a security issue or more expensive problems to come.

Checking the Apache logs for the last 24 hours with GoAccess quickly identified the culprits:

 Hits     h% Vis.     v%  Tx. Amount Data
 ---- ------ ---- ------ ----------- ----
 8718 44.31%  817 39.47%   20.60 GiB Crawlers
 3449 17.53%  122  5.89%   14.84 GiB  ├─ GPTBot/1.2     
  800  4.07%  211 10.19%    5.66 GiB  ├─ facebookexternalhit/1.1

Everything else was in the mega- and kilobyte range. Grepping over the log files made it clear that the crawlers were downloading all the MP3s from the various church sermon podcasts going back over a decade. While more biblical teaching might be good to have in the next GPT model, I would prefer if OpenAI didn’t punish my server every fortnight.
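The same question ("who is downloading the MP3s, and how much?") can also be answered with a few lines of Python instead of grep. This is only a rough sketch: it assumes the logs use Apache's combined log format, and the function name `mp3_bytes_by_agent` is made up for illustration.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# host ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def mp3_bytes_by_agent(lines):
    """Total the response bytes of .mp3 requests, grouped by user agent."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        path = match.group("path").split("?", 1)[0]  # ignore any query string
        if not path.lower().endswith(".mp3"):
            continue
        size = match.group("size")
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals
```

Passing it an open log file (for example `mp3_bytes_by_agent(open(path))`, where `path` points at one of the Apache access logs) and calling `.most_common(5)` on the result lists the heaviest downloaders first.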

The next step was to decide how to stop them. Is robots.txt enough? Some initial investigation indicated that the Facebook crawler ignores robots.txt, insisting that the link sharing is user-initiated. Besides, I still want links to these sites to be shareable – the crawlers just don’t need to download gigabytes of podcasts to do so. Fail2ban has been quite effective at blocking unauthorised login attempts, but rate-limiting the crawlers might only spread out the traffic. Instead, I configured Apache to forbid crawlers with the above user agents from accessing .mp3 files.

To configure all virtual hosts at once on Debian, create /etc/apache2/conf-available/block.conf:

# Default robots.txt
Alias /robots.txt /var/www/robots.txt

# Block bots from scraping MP3s
RewriteOptions InheritDown
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit|GPTBot
RewriteCond %{REQUEST_URI} \.(mp3)$
RewriteRule . - [F]

Enable the config with sudo a2enconf block, ensure each virtual host includes RewriteEngine On, and reload the server with sudo systemctl reload apache2. This also configures a server-wide robots.txt file. If I need to update the list of user agents, I can do it for all the sites at once. It worked when I tested it by temporarily adding ‘Safari’ to the list of user agents, but I guess I’ll see how it goes in two weeks.
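To sanity-check which requests the two RewriteCond lines catch before touching the live server, the logic can be mirrored in a few lines of Python. This is an illustrative sketch, not part of the Apache setup: the function name `would_be_blocked` is made up, and it assumes Python's `re` behaves like Apache's regex engine for these simple patterns.

```python
import re

# Same patterns as the two RewriteCond lines in block.conf.
# Both are case-sensitive, as in the Apache config (no [NC] flag).
BLOCKED_AGENTS = re.compile(r"facebookexternalhit|GPTBot")
BLOCKED_PATHS = re.compile(r"\.(mp3)$")


def would_be_blocked(user_agent: str, request_uri: str) -> bool:
    """Return True when both conditions match, i.e. the rule would answer 403.

    request_uri is the URL path without the query string, which is what
    Apache's %{REQUEST_URI} contains.
    """
    agent_matches = BLOCKED_AGENTS.search(user_agent) is not None
    path_matches = BLOCKED_PATHS.search(request_uri) is not None
    return agent_matches and path_matches
```

One thing this makes obvious: because the patterns are case-sensitive, a crawler identifying itself as ‘gptbot’, or a file named ‘.MP3’, would slip through.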

Teaching from home and Internet points

After months of lockdown due to COVID-19, I’m back to teaching on campus this week. I really enjoyed teaching from home and having more time with family.

It was a funny experience being a video on my students’ screens and vice versa: visits from siblings and pets, kitchen science pracs and conversations with parents that ranged from heart-warming to, “What are you watching?” “It’s my maths class!”

However, it’s clear a lot of students have struggled with remote learning. I think it worked well for students with initiative, but for those who didn’t turn up to video chats or email with questions, it would have been pretty hard to learn on their own.

Like many others, I found the constraints inspired a lot of creativity. I started using an iPad and Apple Pencil a lot for teaching this year and making more videos for students too. I hope to keep that going back in the classroom.

For fun, I’ve been teaching myself how to play the drums.

Also, just making stuff for the lols:

(and maybe an intro to circle geometry)
The title letters (Pewdieπ) are cut from one round slice of melon but don’t quite match the length of the four melon diameters. That’s right, the circumference is pi (~3.14) diameters! BIGBRAIN! I made this for my maths students who convinced me to sub a few years ago (how do you do fellow 19 year olds?).

I’m not aiming to be an Internet sensation à la Wootube or Pewdiepie but it was kinda cool when something I made went big:

As seen by millions of people. Enjoy your Internet points and 10 seconds of fame, Mr Kerr.

Blogging again…

I have been enjoying making things recently so I figured I should share them somewhere. I might write about my process or just post some photos.