Monitoring Apache Logs to Prevent Abusive Practices

From MattWiki
Revision as of 16:40, 20 September 2025

This article documents my current efforts to analyze Apache logs and, perhaps in the future, to prevent abusive practices targeting private servers.

Introduction

Lately I noticed fairly high CPU load on this very wiki server and wondered what was causing it.

CPU load regularly reaches 50 %, at times even 80 % or 100 %.

htop shows that much of the CPU utilization comes from the mariadbd process, the MariaDB database server.

After checking my Apache logs I found a lot of requests from various crawlers and from frequently recurring IPs and IP ranges.

The logs showed hundreds or even thousands of requests per day for different resources.

Ranking them over the last 60 days based on my Apache logs, the worst offenders, judging by IP address or IP address range according to various IP databases, seemed to be:

  • Microsoft
  • Meta / Facebook
  • OpenAI

and a couple of individual IPs:

  • 65.109.100.155
  • 185.177.72.54

among others.

After googling, I found the following entry on Hacker News, https://news.ycombinator.com/item?id=44971487, referring to an article posted on 21 August 2025 on The Register titled:

"AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders" https://www.theregister.com/2025/08/21/ai_crawler_traffic/

For related discussions, search Hacker News for "AI crawlers".

How to Analyze Apache Logs

Print the top 20 IP addresses of a given log file (first line) or of a set of gzip-compressed log files (second line):

awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -20
zcat access.log.*.gz | awk '{print $1}' | sort | uniq -c | sort -nr | head -20
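
Since crawlers often spread their requests over many addresses in the same range, grouping by range can be more telling than grouping by single IP. The following is only a sketch: top_ranges is a hypothetical helper name, it assumes the client address is the first field as in the commands above, it groups IPv4 addresses by /24 (a simplification, real allocations may be larger or smaller), and it skips IPv6 addresses.

```shell
# Hypothetical helper: count requests per IPv4 /24 range.
# Assumes the client IP is the first whitespace-separated field.
# Reads the given log files, or stdin if none are given.
top_ranges() {
  awk '{
    n = split($1, o, ".")            # split the address into octets
    if (n == 4)                      # only plain IPv4 addresses
      print o[1] "." o[2] "." o[3] ".0/24"
  }' "$@" | sort | uniq -c | sort -nr | head -20
}
```

Possible usage: top_ranges access.log, or zcat access.log.*.gz | top_ranges for compressed logs.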


Print the top 20 user agents from a set of gzip-compressed log files:

zcat access.log.?.gz | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -nr | head -20
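
The two views can be combined to see which addresses a particular crawler uses. The sketch below assumes the combined log format, where the user agent is the last quoted field (as in the command above). top_ips_for_agent is a hypothetical helper name, and the match is a case-insensitive substring test, not a regular expression.

```shell
# Hypothetical helper: top 20 client IPs whose user agent contains a substring.
# Usage: top_ips_for_agent PATTERN [FILE...]   (reads stdin if no files given)
top_ips_for_agent() {
  pat=$1
  shift
  awk -F'"' -v pat="$pat" '
    # NF > 2 skips empty or malformed lines; $(NF-1) is the user agent
    NF > 2 && index(tolower($(NF-1)), tolower(pat)) {
      split($1, a, " ")              # $1 is "IP - - [date] ", a[1] is the IP
      print a[1]
    }' "$@" | sort | uniq -c | sort -nr | head -20
}
```

Possible usage, with GPTBot as an example substring: zcat access.log.?.gz | top_ips_for_agent GPTBot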


Further reading: https://www.tecmint.com/find-top-ip-address-accessing-apache-web-server/
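
To check whether a single address really produces hundreds or thousands of requests per day, the requests can be counted per day. This is a sketch under assumptions: requests_per_day is a hypothetical helper name, and it expects the default Apache timestamp in the fourth field, for example [20/Sep/2025:16:40:00.

```shell
# Hypothetical helper: requests per day for one client IP.
# Usage: requests_per_day IP [FILE...]   (reads stdin if no files given)
requests_per_day() {
  ip=$1
  shift
  awk -v ip="$ip" '
    $1 == ip {
      day = substr($4, 2, 11)        # "[20/Sep/2025:..." -> "20/Sep/2025"
      if (!(day in n)) order[++k] = day
      n[day]++
    }
    END {                            # print days in order of first appearance
      for (i = 1; i <= k; i++) print n[order[i]], order[i]
    }' "$@"
}
```

Possible usage: requests_per_day 65.109.100.155 access.log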

Databases for IP Addresses

https://www.ipqualityscore.com/

https://www.abuseipdb.com/

It probably makes sense to set up Fail2Ban to block offenders accordingly, as described here:

https://denshub.com/de/fail2ban-server-protection/

https://dev.to/armiedema/detect-and-stop-404-attacks-with-fail2ban-1coo
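
Before writing a Fail2Ban rule against 404 floods, it may help to see which addresses actually cause the most 404 responses. The sketch below assumes the default common or combined log format, where the status code is the ninth whitespace-separated field. top_404_ips is a hypothetical helper name, and the result is only a starting point for picking ban thresholds.

```shell
# Hypothetical helper: top 20 client IPs by number of 404 responses.
# Assumes field 9 is the HTTP status code (default Apache log formats).
# Usage: top_404_ips [FILE...]   (reads stdin if no files given)
top_404_ips() {
  awk '$9 == 404 { print $1 }' "$@" | sort | uniq -c | sort -nr | head -20
}
```

Possible usage: top_404_ips access.log, or zcat access.log.*.gz | top_404_ips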