Monitoring Apache Logs to Prevent Abusive Practices
This article documents my current efforts to analyze Apache logs and, perhaps in the future, to prevent offending and abusive crawling of private servers.
Introduction
Lately I noticed unusually high CPU load on this very wiki server and wondered why.
The CPU load regularly sits around 50 % and sometimes climbs to 80 or even 100 %.
htop shows that a lot of CPU time is consumed by the mariadbd process, i.e. the MariaDB database server.
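A quick non-interactive way to confirm this from the shell is to list the top CPU consumers with ps (a minimal sketch; the options assume the procps ps found on typical Linux systems):
# show the five processes with the highest CPU usage, sorted descending
ps -eo pid,comm,%cpu --sort=-%cpu | head -6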
After checking my Apache logs I found a lot of accesses from different crawlers, often from recurring IPs and IP ranges.
The logs showed hundreds or even thousands of requests per day for different resources.
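To put numbers on that, the requests per day can be counted directly from the log. A minimal sketch, assuming the default common/combined log format where the bracketed timestamp is the fourth field:
# extract the date part (e.g. 21/Sep/2025) from the timestamp and count lines per day
awk '{print substr($4, 2, 11)}' access.log | sort | uniq -c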
Ranking the accesses of the last 60 days in my Apache logs, and attributing the IP addresses and ranges via various IP databases, the worst offenders seemed to be (see the counting example after this list):
- Microsoft
- Meta / Facebook
- OpenAI
and a couple of individual IPs:
- 65.109.100.155
- 185.177.72.54
among others.
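To quantify a single suspicious address, its requests can simply be counted per log file. A minimal sketch using the first IP from the list above:
# count lines starting with the given IP (dots escaped, trailing space avoids prefix matches)
grep -c '^65\.109\.100\.155 ' access.log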
After googling I found the following entry on Hacker News: https://news.ycombinator.com/item?id=44971487
It refers to an article published on 21 August 2025 on The Register, titled "AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders": https://www.theregister.com/2025/08/21/ai_crawler_traffic/
For related discussions, search Hacker News for "AI crawlers".
How to Analyze Apache Logs
Print the top 20 IP addresses of a given log file:
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -20
The same for a set of gz-compressed log files:
zcat access.log.*.gz | awk '{print $1}' | sort | uniq -c | sort -nr | head -20
Combine uncompressed and compressed log files, e.g. the last 10 days of Apache access logs, and print the top 50 IP addresses from these log files:
{ cat access.log ; cat access.log.1 ; zcat access.log.?.gz ; } | awk '{print $1}' | sort | uniq -c | sort -nr | head -50
Hint: The question mark in access.log.?.gz
matches only single-digit rotation suffixes, so together with access.log and access.log.1 this covers roughly the last 10 days of logs (assuming daily rotation).
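If the rotation scheme differs, the relevant logs can also be selected by modification time instead of by suffix. A sketch, assuming the Debian/Ubuntu default log directory /var/log/apache2 (zcat -f passes uncompressed files through unchanged):
# pick all access logs modified within the last 10 days and rank the IPs
find /var/log/apache2 -name 'access.log*' -mtime -10 -print0 | xargs -0 zcat -f | awk '{print $1}' | sort | uniq -c | sort -nr | head -50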
Print the top 20 user agents of a given set of gz-compressed log files:
zcat access.log.?.gz | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -nr | head -20
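To focus on the AI crawlers suspected above, the same pipeline can be filtered for their publicly documented bot names (GPTBot for OpenAI, meta-externalagent for Meta, bingbot for Microsoft; treat this list as an assumption to extend with your own findings):
# keep only user agents of known crawler bots before counting
zcat access.log.?.gz | awk -F'"' '{print $(NF-1)}' | grep -Ei 'GPTBot|meta-externalagent|bingbot' | sort | uniq -c | sort -nr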
Further reading: https://www.tecmint.com/find-top-ip-address-accessing-apache-web-server/
Databases for IP Addresses
https://www.ipqualityscore.com/
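For a quick lookup without a web service, whois usually reveals the owning organization and network of an address. A minimal sketch using one of the IPs from above (field names vary between registries):
# show owner, network name and country from the whois record
whois 65.109.100.155 | grep -Ei 'orgname|org-name|netname|country'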
It probably makes sense to set up Fail2Ban to block the worst offenders accordingly, as described here:
https://denshub.com/de/fail2ban-server-protection/
https://dev.to/armiedema/detect-and-stop-404-attacks-with-fail2ban-1coo
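A minimal sketch of what such a Fail2Ban setup could look like: a custom filter matching AI crawler user agents in the Apache access log, plus a jail banning matching hosts. All file names, paths, and thresholds below are assumptions and need to be adapted and tested:
# /etc/fail2ban/filter.d/apache-ai-crawlers.conf (filter name is an assumption)
[Definition]
failregex = ^<HOST> .*(GPTBot|meta-externalagent).*$

# /etc/fail2ban/jail.d/apache-ai-crawlers.local
[apache-ai-crawlers]
enabled  = true
port     = http,https
filter   = apache-ai-crawlers
logpath  = /var/log/apache2/access.log
maxretry = 10
findtime = 600
bantime  = 86400
Note that matching on user agents only catches crawlers that identify themselves; the individual IPs listed above would need additional rate-based rules, such as the 404 approach from the second link.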