Block Crawlers and Bots
This article describes my current state of affairs for analyzing Apache web server logs in order to find and block
- Web Crawlers, AI Crawlers, SEO Crawlers, Bots
The main offenders were huge corporations, likely trying to gain free access to relevant training data for their AI bot businesses:
- Meta / Facebook, Microsoft / OpenAI, Amazon, Alibaba, Huawei
These companies are responsible for well over 70% of the traffic hitting my small web server.
There is a nice video commentary from GamersNexus on YouTube about this: Piracy is for Trillion Dollar Companies | Fair Use, Copyright Law & Meta AI
This article is a work in progress.
Introduction
Lately I noticed unusually high CPU load on this very wiki server and wondered why that was.
The CPU load regularly reaches 50%, sometimes even 80 or 100%.
htop shows that a lot of CPU is consumed by the mariadbd process, which belongs to the MariaDB database.
After doing some checking on my Apache logs I found a lot of accesses from different crawlers and frequently recurring IPs and IP ranges.
The logs showed that there are hundreds or even thousands of daily log entries for different resources.
Ranking them over the last 60 days based on my Apache logs, the worst offenders, judging by IP address or IP address range according to various IP databases, seemed to be:
- Microsoft: https://ipinfo.io/ips/20.171.207.0/24
- Meta / Facebook: https://ipinfo.io/ips/57.141.0.0/24
- OpenAI
and a couple of individual IPs:
- 65.109.100.155
- 185.177.72.54
among others.
After googling I found the following entry on Hacker News https://news.ycombinator.com/item?id=44971487 referring to an article posted on 21 August 2025 on The Register titled:
"AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders" https://www.theregister.com/2025/08/21/ai_crawler_traffic/
For related discussions, search Hacker News for "AI crawlers".
How to Analyze Apache Logs
Manually with cat and awk
Print the top 50 IP addresses and their number of hits, either from the current log file or from the gz-compressed rotated logs:
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -50
zcat access.log.*.gz | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Further reading: https://www.tecmint.com/find-top-ip-address-accessing-apache-web-server/
Combine uncompressed and compressed log files, i.e. the last 10 days of Apache access logs, and print the top 50 IP addresses and their number of hits:
{ cat access.log ; cat access.log.1 ; zcat access.log.?.gz ; } | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Hint: the question mark in access.log.?.gz matches only the single-digit rotated logs, which effectively yields the last 10 days of logs.
Print top 50 user agents and number of hits of a given set of gz-compressed logfiles:
zcat access.log.?.gz | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -nr | head -50
Or again print the top 50 user agents and their number of hits from a combination of uncompressed and compressed log files, i.e. the last 10 days:
{ cat access.log ; cat access.log.1 ; zcat access.log.?.gz ; } | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -nr | head -50
Hint (again): the question mark in access.log.?.gz matches only the single-digit rotated logs, which effectively yields the last 10 days of logs.
Print the top 50 IP addresses for one of the user agents found above, like GPTBot:
{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep GPTBot | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Result as of 21 Sep. 2025 (excerpt):
46298 20.171.207.124
39515 20.171.207.52
8689 20.171.207.225
4236 20.171.207.160
2198 20.171.207.119
1705 20.171.207.210
Or Facebook's crawler called meta-externalagent:
{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep meta-externalagent | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Result as of 21 Sep. 2025:
7256 57.141.0.19
6910 57.141.0.32
6767 57.141.0.62
6625 57.141.0.43
6570 57.141.0.55
Based on these results one can assume that the hits from 20.171.207.0/24 and from 57.141.0.0/24 each come from a single party. Ownership can be looked up, for example, here:
- Microsoft: https://ipinfo.io/ips/20.171.207.0/24
- Meta / Facebook: https://ipinfo.io/ips/57.141.0.0/24
Goaccess
Install Goaccess for easier analysis.
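On Debian-based systems this is typically just one package away. A minimal sketch, assuming the default Apache combined log format and that the rotated logs should end up in a static HTML report:
sudo apt install goaccess
# interactive terminal report for the current access log
goaccess /var/log/apache2/access.log --log-format=COMBINED
# include the rotated, gz-compressed logs and write a static HTML report
zcat /var/log/apache2/access.log.*.gz | goaccess /var/log/apache2/access.log - --log-format=COMBINED -o report.html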
Who owns an IP Address or Address Range?
IP Addresses and Address Ranges Geolocation
Geographic location and ownership of IP address ranges: https://ipinfo.io/ips/20.171.207.0/24
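Besides such web databases, a quick whois on the command line usually reveals the owning organization as well (the exact field names differ between registries such as ARIN and RIPE):
whois 20.171.207.124 | grep -iE 'orgname|org-name|netname|descr|cidr|route'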
IP Address Fraud Detection
Fraud detection and ownership of IP addresses: https://www.ipqualityscore.com/
IP abuse database with Fail2ban integration: https://www.abuseipdb.com/
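AbuseIPDB also offers an HTTP API, so suspicious addresses from the log analysis above can be checked from a script. An untested sketch, assuming a free API key, where YOUR_API_KEY is a placeholder:
curl -G https://api.abuseipdb.com/api/v2/check \
  --data-urlencode "ipAddress=20.171.207.124" \
  -d maxAgeInDays=90 \
  -H "Key: YOUR_API_KEY" \
  -H "Accept: application/json"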
Blocking Strategies
Which blocking strategy to use depends on the use case at hand:
- robots.txt: General first step of asking visitors nicely not to do what you don't want to see
- fail2ban: Try to generalize rules from the situation at hand to automate blocking
- iptables
  - Block individual IP addresses or IP address ranges, if the above does not work
  - Block countries by IP address range, if there are too many individual IP addresses
robots.txt
Place a file called robots.txt in the document root of your Apache web server.
Example (excerpt):
# Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
User-agent: AhrefsBot
Disallow: /
# Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36
User-agent: Amazonbot
Disallow: /
# Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
User-agent: Barkrowler
Disallow: /
# Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
User-agent: ChatGPT-User
Disallow: /
# facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
User-agent: facebookexternalhit
Disallow: /
# Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
User-agent: GPTBot
Disallow: /
# meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
User-agent: meta-externalagent
Disallow: /
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
User-agent: OAI-SearchBot
Disallow: /
This seems to be mostly respected by:
- Amazon
- Microsoft (at least partially)
This seems to be completely ignored by:
- Meta / Facebook
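Whether a crawler actually respects robots.txt can be checked a few days after the change by counting its hits in the recent logs, reusing the pattern from above:
{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep -c meta-externalagent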
fail2ban
Fail2ban allows customized filter rules for the use case at hand, e.g. based on the user agent, access behaviour or similar criteria. This is, however, somewhat tedious.
Here are some hints:
It probably makes sense to set up Fail2ban to block accordingly, as described here:
https://denshub.com/de/fail2ban-server-protection/
https://dev.to/armiedema/detect-and-stop-404-attacks-with-fail2ban-1coo
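As a rough, untested sketch of what such a setup could look like: a custom filter matching the user agents found above, plus a jail that bans on the first hit. File names, the jail name and the regex are illustrative, not a stock Fail2ban filter.
# /etc/fail2ban/filter.d/apache-ai-bots.conf
[Definition]
failregex = ^<HOST> .*(GPTBot|ChatGPT-User|meta-externalagent|facebookexternalhit|Amazonbot|AhrefsBot)
ignoreregex =
# Note: lines in other_vhosts_access.log start with the vhost, so the
# anchor would need to be ^\S+ <HOST> there.

# /etc/fail2ban/jail.d/apache-ai-bots.conf
[apache-ai-bots]
enabled  = true
port     = http,https
filter   = apache-ai-bots
logpath  = /var/log/apache2/*access.log
maxretry = 1
bantime  = 86400
The filter can be tested against a real log before enabling the jail with fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/apache-ai-bots.conf, followed by a fail2ban-client reload.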
Another, simpler idea is to outright block ranges of IPs with fail2ban or iptables, something like this (untested yet):
fail2ban-client set jailnamehere banip 1.2.3.0/24
iptables
Or create rules directly with iptables in order to block individual IP addresses or IP address ranges:
sudo iptables -A INPUT -s 1.2.3.4 -j DROP
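The same command also accepts a whole range in CIDR notation, e.g. one of the ranges identified above (use -D instead of -A to remove the rule again):
sudo iptables -A INPUT -s 20.171.207.0/24 -j DROP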
See also Iptables (Debian)#Create Blocking Rules Manually
If this is not feasible, then as a last resort whole countries can be blocked by IP address range.
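A rough, untested sketch of how this could be done with ipset and a per-country CIDR list (e.g. the zone files from https://www.ipdeny.com/, here cn.zone as an example; the set name countryblock is arbitrary):
# create a set for the country's networks and fill it from the zone file
sudo ipset create countryblock hash:net
wget -qO- https://www.ipdeny.com/ipblocks/data/countries/cn.zone | while read net ; do sudo ipset add countryblock "$net" ; done
# drop all traffic coming from networks in that set
sudo iptables -I INPUT -m set --match-set countryblock src -j DROP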
Further Ideas for Blocking Strategies
Further reading: https://github.com/fail2ban/fail2ban/issues/2261
