Block Crawlers and Bots: Difference between revisions

From Matts Wiki
m Matt moved page Monitoring Apache Logs to Prevent Abusive Practices to Block Crawlers and Bots without leaving a redirect
No edit summary
Revision as of 29 September 2025, 02:09

This article documents my current approach to analyzing Apache web server logs in order to find and block:

  • Web Crawlers
  • AI Crawlers
  • SEO Crawlers
  • Bots

The main offenders were huge corporations, likely trying to gain free training data for AI:

  • Meta / Facebook
  • Microsoft / OpenAI
  • Amazon
  • Alibaba
  • Huawei

These companies are responsible for well over 70% of the traffic on my small web server.

This article is a work in progress.

Introduction

Lately I noticed rather high CPU load on this very wiki server and wondered why.

CPU load regularly sits at 50% and goes up to 80% or even 100%.

htop shows that much of the CPU utilization comes from the mariadbd process, the MariaDB database server.

After checking my Apache logs I found a lot of requests from different crawlers, often from recurring IPs and IP ranges.

The logs showed hundreds or even thousands of requests per day for different resources.

Ranked over the last 60 days of Apache logs, the worst offenders by IP address or IP address range, according to various IP databases, seemed to be:

and a couple of individual IPs:

  • 65.109.100.155
  • 185.177.72.54

among others.

After googling I found the following entry on Hacker News, https://news.ycombinator.com/item?id=44971487, referring to an article published on 21 August 2025 in The Register titled:

"AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders" https://www.theregister.com/2025/08/21/ai_crawler_traffic/

For related discussions, search Hacker News for AI crawlers.

How to Analyze Apache Logs

Manually with cat and awk

Print top 50 IP addresses and number of hits in a given log file:

awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -50
zcat access.log.*.gz | awk '{print $1}' | sort | uniq -c | sort -nr | head -50

Further reading: https://www.tecmint.com/find-top-ip-address-accessing-apache-web-server/

Combine uncompressed and compressed log files, i.e. the last 10 days of Apache access logs, and print the top 50 IP addresses and their number of hits:

{ cat access.log ; cat access.log.1 ; zcat access.log.?.gz ; } | awk '{print $1}' | sort | uniq -c | sort -nr | head -50

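Hint: The question mark in access.log.?.gz matches exactly one character, so only the single-digit logs (access.log.2.gz to access.log.9.gz) are read. Together with access.log and access.log.1 this covers the last 10 days, assuming daily log rotation.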
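The cat/zcat grouping above can be wrapped in a small helper. This is only a sketch: the function name top_ips is made up for illustration, and it assumes GNU gzip, where gzip -cdf decompresses .gz files and passes uncompressed files through unchanged, so current and rotated logs can be mixed in one call.

```shell
# top_ips N FILE...: print the N most frequent client IPs across the given
# log files (hypothetical helper, not part of Apache or any package).
# Assumes the client IP is the first field of each log line, as in the
# commands above, and GNU gzip so that plain files are copied through by -f.
top_ips() {
    n="$1"
    shift
    gzip -cdf "$@" | awk '{print $1}' | sort | uniq -c | sort -nr | head -n "$n"
}
```

Usage, with the same file set as above: top_ips 50 access.log access.log.1 access.log.?.gz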

Print top 50 user agents and number of hits of a given set of gz-compressed logfiles:

zcat access.log.?.gz | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -nr | head -50

Or again print top 50 user agents and number of hits from a combination of uncompressed and compressed log files, i.e. last 10 days:

{ cat access.log ; cat access.log.1 ; zcat access.log.?.gz ; } | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -nr | head -50

Hint (again): The question mark in access.log.?.gz matches only the single-digit logs, which limits the result to the last 10 days.
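To put a number on how much of the traffic a crawler causes, the same user-agent field can be matched against a pattern and compared with the total. The following is a rough sketch: agent_share is a hypothetical name, the pattern is an awk regular expression chosen by the caller, and the user agent is assumed to be the last quoted field of each line, as in the commands above.

```shell
# agent_share PATTERN FILE...: print how many requests have a user agent
# matching PATTERN (awk extended regex), the total, and the percentage.
# Hypothetical helper; assumes GNU gzip so -f passes plain files through.
agent_share() {
    pattern="$1"
    shift
    gzip -cdf "$@" | awk -F'"' -v pat="$pattern" '
        { total++ }                      # every log line is one request
        $(NF-1) ~ pat { hits++ }         # last quoted field is the user agent
        END {
            if (total > 0)
                printf "%d of %d requests (%.1f%%)\n", hits, total, 100 * hits / total
        }'
}
```

Usage: agent_share 'GPTBot|meta-externalagent' access.log access.log.1 access.log.?.gz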

Print the top 50 IP addresses for one of the user agents found above, such as GPTBot:

{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep GPTBot | awk '{print $1}' | sort | uniq -c | sort -nr | head -50

Result as of 21 Sep. 2025 (excerpt):

  46298 20.171.207.124
  39515 20.171.207.52
   8689 20.171.207.225
   4236 20.171.207.160
   2198 20.171.207.119
   1705 20.171.207.210

Or Facebook's crawler, called meta-externalagent:

{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep meta-externalagent | awk '{print $1}' | sort | uniq -c | sort -nr | head -50

Result as of 21 Sep. 2025:

   7256 57.141.0.19
   6910 57.141.0.32
   6767 57.141.0.62
   6625 57.141.0.43
   6570 57.141.0.55
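Since the hits in both results cluster in one network each, it can help to count per /24 range instead of per address. A minimal sketch, assuming IPv4 addresses in the first field (IPv6 addresses would not be grouped meaningfully); top_ranges is a made-up name.

```shell
# top_ranges AGENT N FILE...: count hits per /24 network for log lines that
# contain the fixed string AGENT, and print the N busiest networks.
# Hypothetical helper; assumes IPv4 client addresses in field 1 and GNU gzip.
top_ranges() {
    agent="$1"
    n="$2"
    shift 2
    gzip -cdf "$@" | grep -F -- "$agent" |
        awk '{ split($1, o, "."); print o[1] "." o[2] "." o[3] ".0/24" }' |
        sort | uniq -c | sort -nr | head -n "$n"
}
```

Usage: top_ranges GPTBot 10 other_vhosts_access.log other_vhosts_access.log.1 other_vhosts_access.log.?.gz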

From these results one can assume that the ranges 20.171.207.0/24 and 57.141.0.0/24 each belong to a single owner. Ownership can be looked up, for example, here:

  • Meta / Facebook: https://ipinfo.io/ips/57.141.0.0/24

Goaccess

Install GoAccess for easier analysis.

Who owns an IP Address or Address Range?

IP Addresses and Address Ranges Geolocation

Geographic location and ownership of IP address ranges: https://ipinfo.io/ips/20.171.207.0/24

IP Address Fraud Detection

Fraud detection and ownership of IP addresses: https://www.ipqualityscore.com/

Fail2ban supported IP database: https://www.abuseipdb.com/

Blocking Strategies

robots.txt

Place a file called robots.txt in the document root of your Apache web server.

Example (excerpt):

# Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
User-agent: AhrefsBot
Disallow: /

# Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36
User-agent: Amazonbot
Disallow: /

# Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
User-agent: Barkrowler
Disallow: /

# Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
User-agent: ChatGPT-User
Disallow: /

# facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
User-agent: facebookexternalhit
Disallow: /

# Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
User-agent: GPTBot
Disallow: /

# meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
User-agent: meta-externalagent
Disallow: /

# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
User-agent: OAI-SearchBot
Disallow: /

This seems to be mostly respected by:

  • Google
  • Amazon
  • Microsoft (at least partially)

This seems to be completely ignored by:

  • Meta / Facebook
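Because every entry in the robots.txt excerpt follows the same two-line pattern, the file can be generated from a list of user-agent tokens taken from the log analysis. This is just a sketch; robots_disallow is a hypothetical helper name, and it leaves out the comment lines with the full user-agent strings.

```shell
# robots_disallow TOKEN...: print one robots.txt group per user-agent token,
# each disallowing the whole site (hypothetical helper for illustration).
robots_disallow() {
    for token in "$@"; do
        printf 'User-agent: %s\nDisallow: /\n\n' "$token"
    done
}
```

Usage: robots_disallow AhrefsBot Amazonbot GPTBot meta-externalagent > robots.txt, then copy the result to the document root.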

fail2ban

https://www.digitalocean.com/community/tutorials/how-fail2ban-works-to-protect-services-on-a-linux-server

It probably makes sense to set up Fail2Ban to block accordingly, as described here:

https://denshub.com/de/fail2ban-server-protection/

https://dev.to/armiedema/detect-and-stop-404-attacks-with-fail2ban-1coo

Another, simpler idea is to block whole IP ranges outright with fail2ban or iptables, something like this (untested so far):

fail2ban-client set jailnamehere banip 1.2.3.0/24

iptables

Or by directly creating rules with iptables:

sudo iptables -A INPUT -s 1.2.3.4 -j DROP

See also Iptables (Debian)#Create Blocking Rules Manually
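When several addresses or ranges are to be blocked, the iptables rule can be generated in a loop. The sketch below is untested against a live firewall. block_ranges is a made-up name, and it only prints the commands unless --apply is given as the first argument, so the rules can be reviewed first. iptables accepts both single addresses and CIDR ranges after -s.

```shell
# block_ranges [--apply] CIDR...: print one iptables DROP rule per address
# or range. With --apply as the first argument the rules are executed
# instead of printed (needs root). Hypothetical helper for illustration.
block_ranges() {
    run=echo                 # default: dry run, only print the commands
    if [ "$1" = "--apply" ]; then
        run=
        shift
    fi
    for cidr in "$@"; do
        $run iptables -A INPUT -s "$cidr" -j DROP
    done
}
```

Usage: block_ranges 20.171.207.0/24 57.141.0.0/24 prints the two rules; after review, run the same call with --apply as root to install them.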

Further Ideas for Blocking Strategies

Further reading: https://github.com/fail2ban/fail2ban/issues/2261