Monitoring Apache Logs to Prevent Abusive Practices
Current version as of 21 September 2025, 12:15
This article documents my current efforts to analyze Apache logs and, perhaps in the future, to prevent the abusive practices and outright content theft by huge corporations, which is likely done to gain free training data for AI bots.
Introduction
Lately I noticed rather high CPU load on this very wiki server and wondered why.
CPU load regularly sits at 50 % and sometimes climbs to 80 or even 100 %.
htop shows that much of the CPU utilization comes from the mariadbd process, which belongs to the MariaDB database.
After checking my Apache logs I found a lot of requests from different crawlers and from frequently recurring IPs and IP ranges.
The logs showed hundreds or even thousands of requests per day for different resources.
Ranking them over the last 60 days of my Apache logs, the worst offenders, judged by IP address or IP address range as looked up in various IP databases, seemed to be:
- Microsoft: https://ipinfo.io/ips/20.171.207.0/24
- Meta / Facebook: https://ipinfo.io/ips/57.141.0.0/24
- OpenAI
and a couple of individual IPs:
- 65.109.100.155
- 185.177.72.54
among others.
After googling I found the following entry on Hacker News, https://news.ycombinator.com/item?id=44971487, which refers to an article published on 21 August 2025 on The Register titled:
"AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders" https://www.theregister.com/2025/08/21/ai_crawler_traffic/
For related discussions, search Hacker News for "AI crawlers".
How to Analyze Apache Logs
Print top 100 IP addresses and number of hits in a given log file:
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -100
zcat access.log.*.gz | awk '{print $1}' | sort | uniq -c | sort -nr | head -100
Further reading: https://www.tecmint.com/find-top-ip-address-accessing-apache-web-server/
Combine uncompressed and compressed log files, e.g. the last 10 days of Apache access logs, and print the top 50 IP addresses and their number of hits across these files:
{ cat access.log ; cat access.log.1 ; zcat access.log.?.gz ; } | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
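Hint: The question mark in access.log.?.gz matches exactly one character, so only logs with a single-digit number are read. Together with access.log and access.log.1 this gives roughly the last 10 days of logs, assuming the logs are rotated daily.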
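The combination of cat and zcat above can be wrapped in a small helper. The following is only a sketch: the function name top_ips is made up for illustration, and it relies on GNU gzip passing uncompressed input through unchanged when called with -c, -d and -f together, so plain and gzipped logs can be listed in one call.

```shell
# top_ips N FILE... : print the N most frequent client IPs across the given logs.
# Hypothetical helper; assumes the client IP is the first field of each log line.
top_ips() {
    local n="$1"
    shift
    # gzip -cdf decompresses .gz files and copies plain files through unchanged.
    gzip -cdf -- "$@" | awk '{print $1}' | sort | uniq -c | sort -nr | head -n "$n"
}
```

A call matching the one-liner above would be: top_ips 50 access.log access.log.1 access.log.?.gz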
Print the top 50 user agents and their number of hits for a given set of gz-compressed log files:
zcat access.log.?.gz | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -nr | head -50
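To see why $(NF-1) picks out the user agent, it can help to run the awk part on a single line. The sample line below is invented for illustration (it is not from the real logs) and assumes Apache's "combined" log format, where the user agent is the last quoted field. The function name extract_agent is likewise made up.

```shell
# Hypothetical sample line in "combined" format, not taken from the real logs.
sample='203.0.113.7 - - [21/Sep/2025:12:00:00 +0200] "GET /index.php HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"'

# Splitting on double quotes gives seven fields: the prefix, the request,
# status and size, the referer, a single space, the user agent, and an
# empty field after the closing quote. NF-1 is therefore the user agent.
extract_agent() {
    awk -F'"' '{print $(NF-1)}'
}

printf '%s\n' "$sample" | extract_agent
```

Note that a user agent containing an escaped double quote would shift the fields, so a few odd entries in the ranking are to be expected.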
Print the top 50 IP addresses for one of the user agents found above, e.g. GPTBot:
{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep GPTBot | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Result as of 21 Sep. 2025 (excerpt):
46298 20.171.207.124
39515 20.171.207.52
8689 20.171.207.225
4236 20.171.207.160
2198 20.171.207.119
1705 20.171.207.210
Or Facebook's crawler, called meta-externalagent:
{ cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep meta-externalagent | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Result as of 21 Sep. 2025:
7256 57.141.0.19
6910 57.141.0.32
6767 57.141.0.62
6625 57.141.0.43
6570 57.141.0.55
From these results one can assume that each of the ranges 20.171.207.0/24 and 57.141.0.0/24 belongs to a single owner. Ownership can be looked up, for example, here:
- Microsoft: https://ipinfo.io/ips/20.171.207.0/24
- Meta / Facebook: https://ipinfo.io/ips/57.141.0.0/24
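Instead of spotting a shared range by eye, the hits can also be counted per /24 range directly. This is a minimal sketch with a made-up function name (top_ranges); it assumes IPv4 addresses in the first field and simply skips anything else, such as IPv6 addresses.

```shell
# top_ranges N : read log lines on stdin and print the N busiest /24 ranges.
# Hypothetical helper; only handles IPv4 addresses in the first field.
top_ranges() {
    awk '{
        # Keep the first three octets of an IPv4 address and append .0/24.
        if (split($1, o, ".") == 4) print o[1] "." o[2] "." o[3] ".0/24"
    }' | sort | uniq -c | sort -nr | head -n "$1"
}
```

It can be put at the end of the pipelines above, for example: { cat other_vhosts_access.log ; cat other_vhosts_access.log.1 ; zcat other_vhosts_access.log.?.gz ; } | grep GPTBot | top_ranges 20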
Databases for IP Addresses
Fraud detection and ownership of IP addresses: https://www.ipqualityscore.com/
Fail2ban supported IP database: https://www.abuseipdb.com/
Geographic location and ownership of IP address ranges: https://ipinfo.io/ips/20.171.207.0/24
Ideas for Blocking Abusive Behavior
It probably makes sense to set up Fail2Ban to block accordingly, as described here:
https://denshub.com/de/fail2ban-server-protection/
https://dev.to/armiedema/detect-and-stop-404-attacks-with-fail2ban-1coo
Another, simpler idea is to block entire IP ranges outright with fail2ban or iptables, for example like this (not yet tested):
fail2ban-client set jailnamehere banip 1.2.3.0/24
Further reading: https://github.com/fail2ban/fail2ban/issues/2261
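For the iptables variant, one cautious approach is to generate the rules first and only run them after reviewing them. The sketch below is just as untested on a live server as the fail2ban command above. The function name print_block_rules is made up, and inserting a DROP rule into the INPUT chain is an assumption about how the firewall is laid out. Note that such a rule blocks all traffic from the range, not only the crawler.

```shell
# print_block_rules RANGE... : print (do not run) one iptables DROP rule per CIDR range.
# Hypothetical helper; assumes blocking in the INPUT chain is what is wanted.
print_block_rules() {
    local range
    for range in "$@"; do
        printf 'iptables -I INPUT -s %s -j DROP\n' "$range"
    done
}
```

Calling print_block_rules 20.171.207.0/24 57.141.0.0/24 prints two rules, which can then be checked and executed as root.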