Block Crawlers and Bots
This article describes my current state of affairs for analyzing Apache web server logs in order to find and block
- Web Crawlers, AI Crawlers, SEO Crawlers, Bots
The main offenders were huge corporations, likely trying to gain free access to relevant training data for their AI bot businesses:
- Meta / Facebook, Microsoft / OpenAI, Amazon, Alibaba, Huawei
These companies are responsible for well over 70% of the traffic hitting my small web server.
There is a nice video commentary from GamersNexus on YouTube about this: Piracy is for Trillion Dollar Companies | Fair Use, Copyright Law & Meta AI
This article is a work in progress.
Introduction
Lately I noticed unusually high CPU load on this very wiki server and wondered why that was.
The CPU load regularly reaches 50%, sometimes even 80 or 100%.
htop shows that a lot of CPU is consumed by the mariadbd process, which belongs to the MariaDB database.
After doing some checking on my Apache logs I found a lot of accesses from different crawlers and frequently recurring IPs and IP ranges.
The logs showed that there are hundreds or even thousands of daily log entries for different resources.
Ranking them over the last 60 days based on my Apache logs, the worst offenders, judging by IP address or IP address range according to various IP databases, seemed to be:
- Microsoft: https://ipinfo.io/ips/20.171.207.0/24
- Meta / Facebook: https://ipinfo.io/ips/57.141.0.0/24
- OpenAI
and a couple of individual IPs:
- 65.109.100.155
- 185.177.72.54
among others.
After googling I found the following entry on Hacker News https://news.ycombinator.com/item?id=44971487 referring to an article posted on 21 August 2025 on The Register titled:
"AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders" https://www.theregister.com/2025/08/21/ai_crawler_traffic/
For related discussions, search Hacker News for "AI crawlers".
How to Analyze Apache Logs
Manually with cat and awk
Print the top 50 IP addresses and their number of hits, either from the current log file or from the gz-compressed rotated logs:
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -50
zcat access.log.*.gz | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Further reading: https://www.tecmint.com/find-top-ip-address-accessing-apache-web-server/
Combine uncompressed and compressed log files, i.e. the last 10 days of Apache access logs, and print the top 50 IP addresses and their number of hits:
{ cat access.log ; cat access.log.1 ; zcat access.log.?.gz ; } | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Hint: the question mark in access.log.?.gz matches only the single-digit rotated logs, which effectively yields the last 10 days of logs.
Print top 50 user agents and number of hits of a given set of gz-compressed logfiles:
zcat access.log.?.gz | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -nr | head -50
Or again print the top 50 user agents and their number of hits from a combination of uncompressed and compressed log files, i.e. the last 10 days:
{ cat access.log ; cat access.log.1 ; zcat access.log.?.gz ; } | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -nr | head -50
Hint (again): the question mark in access.log.?.gz matches only the single-digit rotated logs, which effectively yields the last 10 days of logs.
Print the top 50 IP addresses for one of the user agents found above, like GPTBot:
{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep GPTBot | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Result as of 21 Sep. 2025 (excerpt):
46298 20.171.207.124
39515 20.171.207.52
8689 20.171.207.225
4236 20.171.207.160
2198 20.171.207.119
1705 20.171.207.210
Or Facebook's crawler called meta-externalagent:
{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep meta-externalagent | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Result as of 21 Sep. 2025:
7256 57.141.0.19
6910 57.141.0.32
6767 57.141.0.62
6625 57.141.0.43
6570 57.141.0.55
Based on these results one can assume that the hits from 20.171.207.0/24 and from 57.141.0.0/24 each come from a single party. Ownership can be looked up, for example, here:
- Microsoft: https://ipinfo.io/ips/20.171.207.0/24
- Meta / Facebook: https://ipinfo.io/ips/57.141.0.0/24
Goaccess
Install Goaccess for easier analysis.
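On Debian-based systems this is typically just one package away. A minimal sketch, assuming the default Apache combined log format and that the rotated logs should end up in a static HTML report:
sudo apt install goaccess
# interactive terminal report for the current access log
goaccess /var/log/apache2/access.log --log-format=COMBINED
# include the rotated, gz-compressed logs and write a static HTML report
zcat /var/log/apache2/access.log.*.gz | goaccess /var/log/apache2/access.log - --log-format=COMBINED -o report.html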
Who owns an IP Address or Address Range?
IP Addresses and Address Ranges Geolocation
Geographic location and ownership of IP address ranges: https://ipinfo.io/ips/20.171.207.0/24
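Besides such web databases, a quick whois on the command line usually reveals the owning organization as well (the exact field names differ between registries such as ARIN and RIPE):
whois 20.171.207.124 | grep -iE 'orgname|org-name|netname|descr|cidr|route'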
IP Address Fraud Detection
Fraud detection and ownership of IP addresses: https://www.ipqualityscore.com/
IP abuse database with Fail2ban integration: https://www.abuseipdb.com/
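AbuseIPDB also offers an HTTP API, so suspicious addresses from the log analysis above can be checked from a script. An untested sketch, assuming a free API key, where YOUR_API_KEY is a placeholder:
curl -G https://api.abuseipdb.com/api/v2/check \
  --data-urlencode "ipAddress=20.171.207.124" \
  -d maxAgeInDays=90 \
  -H "Key: YOUR_API_KEY" \
  -H "Accept: application/json"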
Blocking Strategies
Which blocking strategy to use depends on the use case at hand:
- robots.txt: General first step of asking visitors nicely not to do what you don't want to see
- fail2ban: Try to generalize rules from the situation at hand to automate blocking
- iptables
  - Block individual IP addresses or IP address ranges, if the above does not work
  - Block countries by IP address range, if there are too many individual IP addresses
robots.txt
Place a file called robots.txt in the document root of your Apache web server.
Example (excerpt):
# Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
User-agent: AhrefsBot
Disallow: /
# Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36
User-agent: Amazonbot
Disallow: /
# Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
User-agent: Barkrowler
Disallow: /
# Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
User-agent: ChatGPT-User
Disallow: /
# facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
User-agent: facebookexternalhit
Disallow: /
# Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
User-agent: GPTBot
Disallow: /
# meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
User-agent: meta-externalagent
Disallow: /
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
User-agent: OAI-SearchBot
Disallow: /
This seems to be mostly respected by:
- Amazon
- Microsoft (at least partially)
This seems to be completely ignored by:
- Meta / Facebook
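Whether a crawler actually respects robots.txt can be checked a few days after the change by counting its hits in the recent logs, reusing the pattern from above:
{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep -c meta-externalagent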
fail2ban
Fail2ban allows customized filter rules for the use case at hand, e.g. based on the user agent, access behaviour or similar criteria. This is, however, somewhat tedious.
Here are some hints:
It probably makes sense to set up Fail2ban to block accordingly, as described here:
https://denshub.com/de/fail2ban-server-protection/
https://dev.to/armiedema/detect-and-stop-404-attacks-with-fail2ban-1coo
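As a rough, untested sketch of what such a setup could look like: a custom filter matching the user agents found above, plus a jail that bans on the first hit. File names, the jail name and the regex are illustrative, not a stock Fail2ban filter.
# /etc/fail2ban/filter.d/apache-ai-bots.conf
[Definition]
failregex = ^<HOST> .*(GPTBot|ChatGPT-User|meta-externalagent|facebookexternalhit|Amazonbot|AhrefsBot)
ignoreregex =
# Note: lines in other_vhosts_access.log start with the vhost, so the
# anchor would need to be ^\S+ <HOST> there.

# /etc/fail2ban/jail.d/apache-ai-bots.conf
[apache-ai-bots]
enabled  = true
port     = http,https
filter   = apache-ai-bots
logpath  = /var/log/apache2/*access.log
maxretry = 1
bantime  = 86400
The filter can be tested against a real log before enabling the jail with fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/apache-ai-bots.conf, followed by a fail2ban-client reload.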
Another, simpler idea is to outright block ranges of IPs with fail2ban or iptables, something like this (untested yet):
fail2ban-client set jailnamehere banip 1.2.3.0/24
iptables
Or create rules directly with iptables in order to block individual IP addresses or IP address ranges:
sudo iptables -A INPUT -s 1.2.3.4 -j DROP
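The same command also accepts a whole range in CIDR notation, e.g. one of the ranges identified above (use -D instead of -A to remove the rule again):
sudo iptables -A INPUT -s 20.171.207.0/24 -j DROP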
See also Iptables (Debian)#Create Blocking Rules Manually
If this is not feasible, then as a last resort whole countries can be blocked by IP address range.
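A rough, untested sketch of how this could be done with ipset and a per-country CIDR list (e.g. the zone files from https://www.ipdeny.com/, here cn.zone as an example; the set name countryblock is arbitrary):
# create a set for the country's networks and fill it from the zone file
sudo ipset create countryblock hash:net
wget -qO- https://www.ipdeny.com/ipblocks/data/countries/cn.zone | while read net ; do sudo ipset add countryblock "$net" ; done
# drop all traffic coming from networks in that set
sudo iptables -I INPUT -m set --match-set countryblock src -j DROP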
Further Ideas for Blocking Strategies
Further reading: https://github.com/fail2ban/fail2ban/issues/2261
