Filter out crawlers that lag Forgejo #17
So Forgejo (and potentially other services later down the line) has been getting hit by crawler(s) using random IP addresses and User-Agents.
This has caused large amounts of lag for both the Forgejo instance and (disk operations on) the underlying machine, server0.local, multiple times before.
Note that these crawlers are often said to be AI-related, which I suppose would explain why they appeared at this time.
The current workaround in place can be seen here.
This workaround always required the user to have cookies enabled, however, which can cause issues for some users. It also broke now that Forgejo no longer serves session cookies to all visitors, meaning you must sign in to view matched pages. It also always required the user to visit an unmatched page once.
Note that only select pages are prone to taxing the server's resources so destructively. As of writing this, the regular expression
^/[0-9a-zA-Z\-_\.]+/[0-9a-zA-Z\-_\.]+/(commit|blame|compare|commits|(src|raw|blame)/(commit|tag))/
can be used to match all of their URL paths (as shown in the linked file).
Notably all of these pages relate to viewing data from old commits. Large amounts of requests to these often lead to lots of git cat-file processes being spawned.
Anyway, some potential heuristics:
Note that I haven't bothered to inspect these much, however I hear these might be those used by extremely outdated browsers.
It's worth noting, however, that chipmunk.land aims to be very backwards compatible, deliberately supporting raw HTTP and some old TLS versions and ciphers. User-Agent filters could disrupt this.
If they ever re-use IP addresses, this could be very useful.
The same can sometimes be said for some legitimate clients (curl, lynx, specially configured browsers, etc) though.
As of writing this, if you visit view-source:https://codeberg.org/heathercat123/apiclone/commits/branch/main in a fresh private/incognito window you will see a <a href="?trans-rights=human-rights"> element clickable by the user.
On that same page, I also see <meta content="1; url=?trans-rights=human-rights" http-equiv=refresh>.
The idea is to prevent these crawlers from connecting, to avoid downtime of course, while also not changing anything about the results of requests from, well, real users or literally any other clients not taking the website down.
Of course, I do not want to require bloated JS challenges; in fact I don't want to present any sort of challenge pages. Kinda sorta like how I don't serve ads or other annoyances either.
Also this is very well a case where, so to speak, you do not need to outrun the bear, but only the person who is next to you; Anubis quite literally only presents challenges to Mozilla UAs (trivial to bypass) and is still effective!!!
Multiple forges running GitLab and Forgejo are still running normally in this epidemic, and I intend for mine to be one of them.
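As a quick sanity check of the path regex quoted earlier, here is a minimal Python sketch that tries it against a few paths. The regex is copied verbatim from this issue. The function name `is_heavy_path` and the `/owner/repo/...` sample paths are made up for illustration. The actual workaround is an nginx config, so this only shows what the pattern does and does not catch.

```python
import re

# The regex quoted in this issue, copied verbatim (split over two lines).
HEAVY_PATH = re.compile(
    r"^/[0-9a-zA-Z\-_\.]+/[0-9a-zA-Z\-_\.]+/"
    r"(commit|blame|compare|commits|(src|raw|blame)/(commit|tag))/"
)


def is_heavy_path(path: str) -> bool:
    """Return True if the URL path is one of the expensive old-commit views."""
    return HEAVY_PATH.match(path) is not None


if __name__ == "__main__":
    # Hypothetical sample paths, not taken from real logs.
    samples = [
        "/heathercat123/apiclone/commits/branch/main",
        "/owner/repo/src/commit/abc123/README.md",
        "/owner/repo/src/branch/main/README.md",
        "/owner/repo",
    ]
    for path in samples:
        print(path, is_heavy_path(path))
```

Note that the pattern requires a trailing slash after the matched segment, so browsing the current branch under `/src/branch/` is left alone while anything under `/src/commit/` is matched.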
As for other potentially affected services:
Has maxed out server0.local's CPU usage (and its CPU fan) before, but fortunately it's not being hit AFAIK.
Note that requests to a service like APIClone shouldn't be too taxing though, and the most risky operations (project/asset loading) are behind JS and/or Flash.
These should be safe:
I also feel like this shows just how vulnerable some of the web is to effective DDoS, and the fact I can't handle it quickly makes me doubt how many deliberate attacks I would be competent enough to take on right now were they happening... oh well.
Note that I have nothing against crawlers or web scraping (which is pretty cool imho) in general, I'm not even that mad at AI itself, I just don't want my goddamn sites taken down!!!
As for AI and its haters, here's a final bit of yap. I've heard of and seen people and organizations take extreme measures against AI scraping, such as presenting challenges for personal/static sites or even blocking archival (oof), despite these not preserving a meaningful amount of computing power for them.
I believe that since forever if info is public on the web it will, you know, be processed and fucked with by whoever and whatever the hell wants, and that this is an attempt to artificially defy this principle.
Of course, there is a business argument, particularly that AI training will let entities other than them make money from their work out of serendipity. Of course I feel as if this comes from a rather proprietary philosophy, but many AI-related things are proprietary too so whatever. Still, I think impacting things as important as archival over petty business affairs is plain wrong.
I suppose this also shows that consensual archival tends to fail, though unfortunately more forceful services like https://archive.ph/ can have trouble staying alive...
And now that I make this issue, and look at the requests being made for Forgejo, I notice they are being made much more slowly now... sort of peculiar how that happened.
Edit: oh also multiple crawlers (w/ non-random UAs mostly) are crawling a wiki on the server more quickly, not sure if that's a coincidence.
I've removed the hacky mitigation, let's see if the site doesn't get DoS'd again. If this isn't the end, of course, I'll maybe stop being lazy and actually try filtering out the crawlers.
Edit: right when that happened the requests became more frequent again, damn. I may have to add back the hacky workaround for now.
Edit 2: the site wasn't lagged or anything but I don't feel like risking it anymore.
Edit 3: I'm stupid lmao. I looked at the requests being logged by Forgejo, which doesn't include ones blocked by the hacky nginx config.
@chipmunkmc wrote in #17 (comment):
@chipmunkmc there is more bots just use git.gay's instance's robots.txt and add scratch account stuff and allow the scratch auth ua
@mysticgiggle wrote in #17 (comment):
@chipmunkmc restrict the bot accounts and delete them and give me admin
@mysticgiggle wrote in #17 (comment):
"give me admin" yeah no lmfao
of course, a user who sends passwords in minecraft chat and the likes and is unwanted on basically every issue tracker on the site is totally good admin material lmfao
lmfao
I could try to script something up. The reason this is hard is that it's much easier to prove a client's innocence (i.e. no Mozilla UA, Cookies present, etc) than a client's guilt, so to speak.
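A minimal sketch of the "prove innocence" side, assuming only the two signals mentioned above (no Mozilla UA, cookies present). The function name and parameters are hypothetical, and a `False` result does not mean the client is a crawler, only that nothing cleared it yet.

```python
def looks_innocent(user_agent: str, has_cookies: bool) -> bool:
    """Return True if a request carries a signal that clears it outright.

    False means "not cleared", NOT "guilty": other heuristics would
    still have to decide what to do with the request.
    """
    # Clients that don't claim to be a browser (curl, lynx, git, ...).
    if not user_agent.startswith("Mozilla"):
        return True
    # A client that sends cookies back has kept state from an earlier visit.
    if has_cookies:
        return True
    return False


if __name__ == "__main__":
    print(looks_innocent("curl/8.5.0", False))    # cleared by UA
    print(looks_innocent("Mozilla/5.0", True))    # cleared by cookies
    print(looks_innocent("Mozilla/5.0", False))   # not cleared
```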
While looking through some old logs I noticed something quite peculiar: I saw multiple IPs start with the same 2 octets! This could help with filtering a lot >:D
Edit: I also see IPs with matching first octets, and sometimes but rarely close (.82. and .83.) second octets. This honestly reminds me of when a deliberate DDoS hit me with a bunch of POSTs from 80. and 81. IPs.
Edit 2: many of the IPs are from the same ISP too; again, this reminds me of the intentional DDoS from earlier.
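Counting how many logged IPs share their first two octets can be sketched like this, using Python's standard `ipaddress` module. The function name `prefix_counts` is hypothetical, the sample addresses are made-up private-range placeholders rather than real crawler IPs, and the sketch assumes IPv4 only.

```python
from collections import Counter
import ipaddress


def prefix_counts(ips, prefix_len=16):
    """Count how many addresses fall into each /prefix_len network (IPv4)."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us pass a host address and get its network back.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts


if __name__ == "__main__":
    # Placeholder addresses, NOT taken from the real logs.
    sample = ["10.82.1.5", "10.82.7.9", "10.83.2.2"]
    for net, n in prefix_counts(sample).most_common():
        print(net, n)
```

Passing `prefix_len=8` gives the "matching first octet" view instead.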
Also a little backstory: so this whole crisis really began in 2025... sorta.
server0.local's disk; I rm -rf'd the repo archive folder, in turn screwing up Forgejo's queues without knowing what I did. Oops!
rm -rf'd the queues too, but the disk got filled again. Afterwards, I tuned the repo archiver via app.ini, and the issue stopped.
So how did this snowball into the site dying? Well let's go on...
So now what?
server0.local when they hit.
I knew a better workaround was needed since the incident that led to the limiting, however I've been neglecting to truly look into implementing it until now.
This and the DDoS incidents of 2025 (one leading to long IPv4 downtime until I could procure a VPS) occurred somewhat close together, I feel. It's as if chipmunk.land is suffering from growing pains... but of course, it's still alive and kicking.
real.chipmunk.land but previously via my home IP in mid-to-late 2022.
I've hosted services on that laptop before though, particularly chipmunkbot.
Even in late 2021 I hosted a bot called SandCatBot (though it was unstable), and earlier that year I temporarily ran an AFK bot for some Aternos server!
Of course, the website stuff only came in late 2022, this Forgejo being set up closer to 2023.
Alright, enough procrastination.
i have been around for most of chipmunk.land's life lol
So blocking whole subnets (or ASNs) could cause a lot of false positives. While I could make a challenge in pure HTTP, challenges are still a very ugly solution to DoS; ideally the page should handle HTTP requests the same way as before.
However, according to a log file, the crawlers actually re-used IPs 3 times, so we could exploit this! Supposedly someone could even make a list of all crawler IPs from server logs once. I've also seen some filters (perhaps) allow the first request to a site through sometimes.
The IP reuse instances occurred seconds apart though, and seemed to be across the same /24. Also, the log I've been looking at only spans less than 3 minutes in total. I'll have to look at more logs.
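Spotting that kind of /24 reuse in a log could look roughly like the sketch below. It assumes the log has already been parsed into `(unix_timestamp, ip)` pairs sorted by time. That input format, the function name `reused_subnets`, and the 60 second window are all assumptions, and the sample addresses come from the reserved documentation ranges, not from real logs.

```python
import ipaddress


def reused_subnets(entries, prefix_len=24, window=60.0):
    """Return the networks hit more than once within `window` seconds.

    entries: iterable of (unix_timestamp, ip_string), sorted by timestamp.
    """
    last_seen = {}  # network -> timestamp of its most recent request
    reused = set()
    for ts, ip in entries:
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        prev = last_seen.get(net)
        if prev is not None and ts - prev <= window:
            reused.add(str(net))
        last_seen[net] = ts
    return reused


if __name__ == "__main__":
    # Placeholder entries using documentation-only address ranges.
    log = [
        (0, "198.51.100.7"),
        (3, "198.51.100.200"),  # same /24, 3 seconds later
        (5, "203.0.113.1"),
        (100, "203.0.113.9"),   # same /24, but 95 seconds later
    ]
    print(reused_subnets(log))
```

Running this over more logs with different `window` values would show whether the reuse is consistent enough to filter on.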
Sorry I've been too lazy to get the filter written! I just procrastinate too much...
So after grepping one log (long story) for /src/commit access, I saw a couple of Alibaba subnets were the only ones seen (only two /16s, and even distinct /24s albeit a lot); maybe blocking them will significantly throttle the crawlers?
Back to writing a filter to automatically detect these (as they use other IP ranges of course), I noticed the first log I checked has a bunch of instances of /24 ranges being reused, however many were multiple (maybe about 30) seconds apart.
However, back to the IP list from that newer log, individual IPs were duplicated very often too (removing duplicates made the list over 4x shorter). I could maybe still filter by IP, or even combine strategies.
I don't think that stopped them at all, but the site hasn't died yet so whatever, I'll write the script soon. This made the crawlers reveal their IPv6 addresses (for some reason) just now anyway.
As for filtering, here's an idea. I can first of all flag larger subnets (or ASNs potentially) as suspicious without blocking them, and then use that to more zealously judge smaller subnets or individual IPs. Might be a good compromise.
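A rough sketch of that two-tier idea, assuming IPv4 only, /16 as the "larger subnet" and /24 as the "smaller subnet". The function names and all three thresholds are hypothetical placeholders that would need tuning against real logs, and the sample addresses are made up.

```python
from collections import Counter
import ipaddress


def net_of(ip, prefix_len):
    """Return the network of the given prefix length containing `ip`."""
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


def judge(ips, wide_threshold=4, strict_limit=2, lenient_limit=5):
    """Return the /24s to act on, judged more strictly inside busy /16s."""
    wide = Counter(net_of(ip, 16) for ip in ips)
    narrow = Counter(net_of(ip, 24) for ip in ips)

    # Step 1: mark busy /16s as suspicious, without blocking them.
    suspicious = {net for net, n in wide.items() if n >= wide_threshold}

    # Step 2: a /24 inside a suspicious /16 gets the lower limit.
    flagged = set()
    for net, n in narrow.items():
        parent = net.supernet(new_prefix=16)
        limit = strict_limit if parent in suspicious else lenient_limit
        if n >= limit:
            flagged.add(str(net))
    return flagged


if __name__ == "__main__":
    # Placeholder request IPs, not from real logs.
    requests = [
        "10.82.1.1", "10.82.1.2", "10.82.2.1", "10.82.3.1",  # busy /16
        "10.90.1.1", "10.90.1.2",                            # quiet /16
    ]
    print(judge(requests))
```

Here 10.82.1.0/24 and 10.90.1.0/24 both made two requests, but only the former is flagged because its /16 was already suspicious.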