Monitoring Apache Logs to Prevent Abusive Practices
Revision as of 20 September 2025, 13:32, by Matt
This article documents my current efforts to analyze Apache logs and, perhaps in the future, to prevent offending and abusive practices affecting private servers.
Introduction
Lately I noticed fairly high CPU load on this very wiki server and wondered why.
CPU load regularly reaches 50 %, and sometimes 80 % or even 100 %.
htop shows that much of the CPU utilization comes from the mariadbd process, which is the MariaDB database server daemon.
After checking my Apache logs I found a lot of requests from various crawlers and from frequently recurring IPs and IP ranges.
The logs showed hundreds or even thousands of daily requests for different resources.
Ranked over the last 60 days of my Apache logs, the worst offenders, judging by IP address or IP address range as attributed by various IP databases, seemed to be:
- Microsoft
- Meta / Facebook
- OpenAI
and a couple of individual IPs:
- 65.109.100.155
- 185.177.72.54
among others.
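Because several of the offenders above show up as ranges rather than single addresses, it can help to aggregate requests per network before looking anything up in an IP database. The following is a minimal sketch, not the exact method used for the ranking above. It assumes the client address is the first field of each log line (as in Apache's common and combined formats), it only handles IPv4, and it uses /24 as an arbitrary illustrative prefix length; the real allocations of the providers listed will differ. The function name top_networks is hypothetical.

```shell
# Hypothetical helper: count requests per IPv4 /24 network.
# Assumes the client address is the first whitespace-separated field.
# Lines whose first field is not a dotted IPv4 address (e.g. IPv6) are skipped.
top_networks() {
  awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
         split($1, o, ".")                      # o[1]..o[4] hold the four octets
         print o[1] "." o[2] "." o[3] ".0/24"   # replace the last octet with 0
       }' "$@" |
    sort | uniq -c | sort -nr | head -20
}
```

It can be called like the one-liners elsewhere in this article, for example top_networks access.log, and prints the 20 busiest /24 networks with their request counts.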
After googling I found the following entry on Hacker News https://news.ycombinator.com/item?id=44971487 referring to an article published on 21 August 2025 on The Register titled:
"AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders" https://www.theregister.com/2025/08/21/ai_crawler_traffic/
For related discussions, search Hacker News for AI crawlers.
How to Analyze Apache Logs
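# Top 20 client IPs by request count in the current log
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -20
# The same ranking for the rotated, gzip-compressed logs
zcat access.log.*.gz | awk '{print $1}' | sort | uniq -c | sort -nr | head -20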
Further reading: https://www.tecmint.com/find-top-ip-address-accessing-apache-web-server/
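The two commands above rank the current log and the rotated logs separately. To get a single ranking over a longer period, both can be fed into one pipeline. This is a rough sketch under a few assumptions: rotated files are named like access.log.*.gz, uncompressed rotated files such as access.log.1 are not included, and the client address is the first field. The function name top_ips_all is hypothetical.

```shell
# Hypothetical helper: rank client IPs across a log file and its
# gzip-rotated copies in a single pass.
# Usage: top_ips_all /path/to/access.log
top_ips_all() {
  {
    cat "$1"
    # gzip -dc does the same as zcat; the check skips the loop body
    # when no rotated files match the pattern.
    for f in "$1".*.gz; do
      if [ -e "$f" ]; then
        gzip -dc "$f"
      fi
    done
  } | awk '{print $1}' | sort | uniq -c | sort -nr | head -20
}
```

How many days this covers depends on the log rotation settings of the server, so the result is only comparable to a 60-day ranking if that many rotated files are kept.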
Databases for IP Addresses
https://www.ipqualityscore.com/