Filter out crawlers that lag Forgejo #17

Open
opened 2026-05-16 02:12:17 -04:00 by chipmunkmc · 21 comments
Owner

So Forgejo (and potentially other services later down the line) has been getting hit by crawler(s) using random IP addresses and User-Agents.
This has caused large amounts of lag for both the Forgejo instance and (disk operations on) the underlying machine, server0.local, multiple times before.
Note that these crawlers are often said to be AI-related, which I suppose would explain why they appeared at this time.

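The current workaround in place can be seen here: https://code.chipmunk.land/chipmunk.land/misc/src/commit/9471cd39c83fa73870f217f5ae0e3c4398b814fd/server0.local/etc/nginx/http.d/forgejo.conf#L17-L31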
This workaround always required the user to have cookies enabled, however, which can cause issues for some users. It also broke now that Forgejo no longer serves session cookies to all visitors, meaning you must sign in to view matched pages. It also always required the user to visit an unmatched page once.

Note that only select pages are prone to taxing the server's resources so destructively. As of writing this, the regular expression ^/[0-9a-zA-Z\-_\.]+/[0-9a-zA-Z\-_\.]+/(commit|blame|compare|commits|(src|raw|blame)/(commit|tag))/ can be used to match all of their URL paths (as shown in the linked file).
Notably all of these pages relate to viewing data from old commits. Large amounts of requests to these often lead to lots of git cat-file processes being spawned.
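
For illustration, here is a rough Python sketch of the gating logic described above: a request is turned away only if its path matches the regular expression and it carries no session cookie. This is not the actual nginx config, just the same idea restated; the cookie name `i_like_gitea` (Gitea's default, which I'm assuming this Forgejo also uses) and the function names are assumptions.

```python
import re
from http.cookies import SimpleCookie

# Pattern copied from the issue text; it targets repository pages that
# serve data from old commits (commit, blame, compare, ...).
EXPENSIVE_PATH = re.compile(
    r"^/[0-9a-zA-Z\-_\.]+/[0-9a-zA-Z\-_\.]+/"
    r"(commit|blame|compare|commits|(src|raw|blame)/(commit|tag))/"
)

# Assumption: the session cookie uses Gitea's default name.
SESSION_COOKIE = "i_like_gitea"


def is_expensive(path):
    """Return True if the URL path is one of the costly page types."""
    return EXPENSIVE_PATH.match(path) is not None


def should_block(path, cookie_header=""):
    """Block expensive pages only for clients without a session cookie."""
    if not is_expensive(path):
        return False
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    return SESSION_COOKIE not in cookies
```

This also makes the drawbacks visible: a client with cookies disabled is always blocked on matched pages, and a fresh visitor has to load an unmatched page first to receive the cookie.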

Anyway, some potential heuristics:

  • Random IPs, potentially from specific (maybe residential? unsure) ASNs are used.
  • Random User-Agent strings are used.
    Note that I haven't bothered to inspect these much, however I hear these might be those used by extremely outdated browsers.
    It's worth noting, however, that chipmunk.land aims to be very backwards compatible, deliberately supporting raw HTTP and some old TLS versions and ciphers. User-Agent filters could disrupt this.
  • All User-Agent strings tend to contain the string Mozilla (I mean c'mon, even Anubis uses this one!)
  • These clients make very frequent requests to random URL paths picked up from links, even if the site is down.
    If they ever re-use IP addresses, this could be very useful.
  • As far as I know, these clients never bother to request images/stylesheets/scripts, and also neglect to preserve cookies (as used in the current hack) too.
    The same can sometimes be said for some legitimate clients (curl, lynx, specially configured browsers, etc) though.
  • About that, the clients aren't really capable of executing scripts in the first place, though nor is literally any browser with JavaScript disabled or outright unsupported.
  • Perhaps they all use one protocol too (haven't checked, HTTPS is likely as modern websites tend to force it).
  • They might even all use the same TLS versions, though this might be harder to filter.
  • These clients could also use a consistent set and ordering of headers but I haven't checked.
  • Their ability to follow URL fragments might be limited.
    As of writing this, if you visit view-source:https://codeberg.org/heathercat123/apiclone/commits/branch/main in a fresh private/incognito window you will see a <a href="?trans-rights=human-rights"> element clickable by the user.
  • The crawlers cannot handle meta redirects properly, not that this helps us much here.
    On that same page, I also see <meta content="1; url=?trans-rights=human-rights" http-equiv=refresh>.
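
As a sketch of the "never requests images/stylesheets/scripts" heuristic from the list above: given (client IP, URL path) pairs pulled from a log, the snippet below lists clients that fetched several pages but nothing under `/assets/`. The function name and the `min_pages` threshold are made up for illustration, and as noted above, curl, lynx and similar legitimate clients would show up here too, so this is only a signal, not a verdict.

```python
from collections import defaultdict


def clients_without_assets(requests, min_pages=5):
    """Return client IPs that fetched >= min_pages pages but no /assets/ URLs.

    `requests` is an iterable of (ip, path) pairs.
    """
    pages = defaultdict(int)
    fetched_assets = set()
    for ip, path in requests:
        if path.startswith("/assets/"):
            fetched_assets.add(ip)
        else:
            pages[ip] += 1
    return {
        ip for ip, count in pages.items()
        if count >= min_pages and ip not in fetched_assets
    }
```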

The idea is to prevent these crawlers from connecting, to avoid downtime of course, while also not changing anything about the results of requests from, well, real users or literally any other clients not taking the website down.
Of course, I do not want to require bloated JS challenges, in fact I don't want to present any sort of challenge pages. Kinda sorta like how I don't serve ADs or other annoyances either.
Also this is very well a case where, so to speak, you do not need to outrun the bear, but only the person who is next to you; Anubis quite literally only presents challenges to Mozilla UAs (trivial to bypass) and is still effective!!!

Multiple forges running GitLab and Forgejo are still running normally in this epidemic, and I intend for mine to be one of them.

As for other potentially affected services:

  • MediaWiki - I hear some endpoints it provides can be particularly taxing, perhaps in particular those for diffing pages.
    Has maxed out server0.local's CPU usage (and its CPU fan) before, but fortunately it's not being hit AFAIK.
  • APIClone - No idea how well this one handles stress; there have been instances of it slowing down but judging by server logs these are likely unrelated bugs.
    Note that requests to a service like APIClone shouldn't be too taxing though, and the most risky operations (project/asset loading) are behind JS and/or Flash.

These should be safe:

  • Static sites - very low overhead.
  • linx-server - this one doesn't even have many outlinks for crawlers to follow, and it mainly serves files.

I also feel like this shows just how vulnerable some of the web is to effective DDoS, and the fact I can't handle it quickly makes me doubt how many deliberate attacks I would be competent enough to take on right now were they happening... oh well.
Note that I have nothing against crawlers or web scraping (which is pretty cool imho) in general, I'm not even that mad at AI itself, I just don't want my goddamn sites taken down!!!

As for AI and its haters, here's a final bit of yap. I've heard of and seen people and organizations take extreme measures against AI scraping, such as presenting challenges for personal/static sites or even blocking archival (oof), despite these not preserving a meaningful amount of computing power for them.
I believe that since forever if info is public on the web it will, you know, be processed and fucked with by whoever and whatever the hell wants, and that this is an attempt to artificially defy this principle.
Of course, there is a business argument, particularly that AI training will let entities other than them make money from their work out of serendipity. Of course I feel as if this comes from a rather proprietary philosophy, but many AI-related things are proprietary too so whatever. Still, I think impacting things as important as archival over petty business affairs is plain wrong.
I suppose this also shows that consensual archival tends to fail, though unfortunately more forceful services like https://archive.ph/ can have trouble staying alive...

Author
Owner

And now that I make this issue, and look at the requests being made for Forgejo, I notice they are being made much more slowly now... sort of peculiar how that happened.
Edit: oh also multiple crawlers (w/ non-random UAs mostly) are crawling a wiki on the server more quickly, not sure if that's a coincidence.

Author
Owner

I've removed the hacky mitigation, let's see if the site doesn't get DoS'd again. If this isn't the end, of course, I'll maybe stop being lazy and actually try filtering out the crawlers.
Edit: right when that happened the requests became more frequent again, damn. I may have to add back the hacky workaround for now.
Edit 2: the site wasn't lagged or anything but I don't feel like risking it anymore.
Edit 3: I'm stupid lmao. I looked at the requests being logged by Forgejo, which doesn't include ones blocked by the hacky nginx config.


@chipmunkmc wrote in #17 (comment):

I've removed the hacky mitigation, let's see if the site doesn't get DoS'd again. If this isn't the end, of course, I'll maybe stop being lazy and actually try filtering out the crawlers. Edit: right when that happened the requests became more frequent again, damn. I may have to add back the hacky workaround for now. Edit 2: the site wasn't lagged or anything but I don't feel like risking it anymore. Edit 3: I'm stupid lmao. I looked at the requests being logged by Forgejo, which doesn't include ones blocked by the hacky nginx config.

@chipmunkmc there is more bots just use git.gay's instance's robots.txt and add scratch account stuff and allow the scratch auth ua


@mysticgiggle wrote in #17 (comment):

@chipmunkmc wrote in #17 (comment):

I've removed the hacky mitigation, let's see if the site doesn't get DoS'd again. If this isn't the end, of course, I'll maybe stop being lazy and actually try filtering out the crawlers. Edit: right when that happened the requests became more frequent again, damn. I may have to add back the hacky workaround for now. Edit 2: the site wasn't lagged or anything but I don't feel like risking it anymore. Edit 3: I'm stupid lmao. I looked at the requests being logged by Forgejo, which doesn't include ones blocked by the hacky nginx config.

@chipmunkmc there is more bots just use git.gay's instance's robots.txt and add scratch account stuff and allow the scratch auth ua

@chipmunkmc restrict the bot accounts and delete them and give me admin


@mysticgiggle wrote in #17 (comment):

@mysticgiggle wrote in #17 (comment):

@chipmunkmc wrote in #17 (comment):

I've removed the hacky mitigation, let's see if the site doesn't get DoS'd again. If this isn't the end, of course, I'll maybe stop being lazy and actually try filtering out the crawlers. Edit: right when that happened the requests became more frequent again, damn. I may have to add back the hacky workaround for now. Edit 2: the site wasn't lagged or anything but I don't feel like risking it anymore. Edit 3: I'm stupid lmao. I looked at the requests being logged by Forgejo, which doesn't include ones blocked by the hacky nginx config.

@chipmunkmc there is more bots just use git.gay's instance's robots.txt and add scratch account stuff and allow the scratch auth ua

@chipmunkmc restrict the bot accounts and delete them and give me admin

"give me admin" yeah no lmfao

Author
Owner

of course, a user who sends passwords in minecraft chat and the likes and is unwanted on basically every issue tracker on the site is totally good admin material lmfao


lmfao

Author
Owner

I could try to script something up. The reason this is hard is that it's much easier to prove a client's innocence (i.e. no Mozilla UA, Cookies present, etc) than a client's guilt, so to speak.

Author
Owner

While looking through some old logs I noticed something quite peculiar: I saw multiple IPs start with the same 2 octets! This could help with filtering a lot >:D
Edit: I also see IPs with matching first octets, and sometimes but rarely close (.82. and .83.) second octets. This honestly reminds me of when a deliberate DDoS hit me with a bunch of POSTs from 80. and 81. IPs.
Edit 2: many of the IPs are from the same ISP too; again, this reminds me of the intentional DDoS from earlier.
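
A quick way to check the "same first 2 octets" observation against a list of addresses is to count them by prefix. This is a minimal sketch; the function name is hypothetical and it only handles dotted-quad IPv4.

```python
from collections import Counter


def count_by_prefix(ips, octets=2):
    """Count IPv4 addresses by their first `octets` dotted octets."""
    counts = Counter()
    for ip in ips:
        parts = ip.split(".")
        if len(parts) != 4:
            continue  # skip anything that is not dotted-quad IPv4
        counts[".".join(parts[:octets])] += 1
    return counts
```

Calling `count_by_prefix(ips).most_common(10)` would then show the ten busiest two-octet prefixes, and `octets=1` gives the first-octet view.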

Author
Owner

Also a little backstory: so this whole crisis really began in 2025... sorta.

  • Early on, I noticed some sites were behind an apparent CF-like challenge; I realized this was Anubis later on.
  • Later incoming requests to this Forgejo caused a significant slowdown, so I got all paranoid about it and put it behind Anubis, but was not satisfied with the result and reverted it.
    • I sort of regret this, but it led to something... see below.
    • I later put the registration endpoint behind Anubis, and if I remember correctly spambots continued to get in, hinting at them using headless browsers. (related: #9)
    • This is also when I realized the site was more vulnerable to DoS than I had assumed. Silly me!
  • Later that year crawlers started causing Forgejo's repository archive cache to fill up server0.local's disk; I rm -rf'd the repo archive folder, in turn screwing up Forgejo's queues without knowing what I did. Oops!
    • Months later (I'm so stupid) I found the issue and rm -rf'd the queues too, but the disk got filled again. Afterwards, I tuned the repo archiver via app.ini, and the issue stopped.

So how did this snowball into the site dying? Well let's go on...

  • Also in later 2025 (or earlier 2026?), the site suffered from another slowdown, the second of this sort I can remember. I didn't check logs much, but users later concluded this was due to repository mirroring, leading to a mirror getting deleted.
  • In earlier 2026, a bunch of files from a specific commit on a mirrored repository got requested en masse, leading to another slowdown. I ended up making the mirror organization limited (login-only) as a lazy "fix."
    • I mean, it makes sense that mirrored repositories would be the largest.
  • There were multiple other incidents I think, but eventually it happened to a user's repository (not mine) again. This is the straw that broke the camel's back.

So now what?

  • Using a simple regex location block, I blocked access to files from specific commits; this lazily but successfully stopped the attacks for the time being.
  • Eventually I asked a kaboom.pw user for help, and got an nginx snippet that blocked access to certain paths also via regex location, but allowed access for all users with session cookies.
    • This, for a time, was actually a very clean solution; Forgejo (like its parent, Gitea) always provided session cookies for users at around this time.
    • Of course this was never truly ideal, as it required the user to have cookies enabled, could not be applied to sites without similar cookie behavior (hint hint), and required at least one page visit. Still, it was largely unnoticeable to many.
  • Now crawlers kept discovering new paths that slowed down the site (such as ones using tags instead of commits), so it became a cat-and-mouse game to block those. Around this time the crawlers also started slowing down HDD operations on server0.local when they hit.
  • After more cat-and-mouse the site reached a state of stability, however...
  • A Forgejo update changed the session cookie behavior, breaking this workaround.

I've known a better workaround was needed ever since the incident that led to the limiting, however I've been neglecting to truly look into implementing it until now.

This and the DDoS incidents of 2025 (one leading to long IPv4 downtime until I could procure a VPS) occurred somewhat close together, I feel. It's as if chipmunk.land is suffering from growing pains... but of course, it's still alive and kicking.

  • Oh, and we've been here since November of 2022! Well actually the lines are blurry, I hosted a Minecraft server on my laptop that was once accessible on real.chipmunk.land but previously via my home IP in mid-to-late 2022.
    I've hosted services on that laptop before though, particularly chipmunkbot.
    Even in late 2021 I hosted a bot called SandCatBot (though it was unstable), and earlier that year I temporarily ran an AFK bot for some Aternos server!
    Of course, the website stuff only came in late 2022, this Forgejo being set up closer to 2023.

Alright, enough procrastination.


i have been around for most of chipmunk.land's life lol

Author
Owner

So blocking whole subnets (or ASNs) could cause a lot of false positives. While I could make a challenge in pure HTTP, challenges are still a very ugly solution to DoS; ideally the page should handle HTTP requests the same way as before.
However, according to a log file, the crawlers actually re-used IPs 3 times, so we could exploit this! Supposedly someone could even make a list of all crawler IPs from server logs once. I've also seen some filters (perhaps) allow the first request to a site through sometimes.
The IP reuse instances occurred seconds apart though, and seemed to be across the same /24. Also, the log I've been looking at only spans less than 3 minutes in total. I'll have to look at more logs.
Sorry I've been too lazy to get the filter written! I just procrastinate too much...
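
For reference, the reuse check described above could look roughly like this: walk through (timestamp, IP) pairs in time order and report every /24 that shows up again within some window. The 60-second window and the function name are assumptions, not measured values, and the prefix is meant for IPv4.

```python
import ipaddress


def reused_networks(events, prefix=24, window=60.0):
    """Return networks (as strings) seen at least twice within `window` seconds.

    `events` is an iterable of (timestamp_seconds, ip_string) pairs.
    """
    last_seen = {}
    reused = set()
    for ts, ip in sorted(events):
        # strict=False lets a host address stand in for its network.
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        prev = last_seen.get(net)
        if prev is not None and ts - prev <= window:
            reused.add(str(net))
        last_seen[net] = ts
    return reused
```

With `prefix=32` the same function reports individual IPs that were re-used instead of whole /24s.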

Author
Owner

So after grepping one log (long story) for /src/commit access, I saw a couple of Alibaba subnets were the only ones seen (only two /16s, and even distinct /24s albeit a lot); maybe blocking them will significantly throttle the crawlers?
Back to writing a filter to automatically detect these (as they use other IP ranges of course), I noticed the first log I checked has a bunch of instances of /24 ranges being reused, however many were multiple (maybe about 30) seconds apart.
However, back to the IP list from that newer log, individual IPs were duplicated very often too (removing duplicates made the list over 4x shorter). I could maybe still filter by IP, or even combine strategies.
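
The grep-and-deduplicate step above can be sketched like so, assuming the log is in nginx's default "combined" format (an assumption; the real log format isn't shown here). It pulls the client IP out of every `/src/commit/` request and reports how much shorter the list gets once duplicates are removed. Function names are made up.

```python
import re

# Assumption: nginx "combined" log format, i.e.
# IP - user [time] "METHOD path HTTP/x" status bytes "referer" "UA"
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)')


def commit_view_ips(lines):
    """Collect client IPs of requests whose path contains /src/commit/."""
    ips = []
    for line in lines:
        m = LINE.match(line)
        if m and "/src/commit/" in m.group(2):
            ips.append(m.group(1))
    return ips


def dedup_ratio(ips):
    """Return how many times longer the raw list is than the unique list."""
    unique = set(ips)
    return len(ips) / len(unique) if unique else 0.0
```

A ratio above 4 would correspond to the "over 4x shorter" observation, meaning individual IPs are re-used often enough that per-IP filtering is still worth a look.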

Author
Owner

I don't think that stopped them at all, but the site hasn't died yet so whatever, I'll write the script soon. This made the crawlers reveal their IPv6 addresses (for some reason) just now anyway.
As for filtering, here's an idea. I can first of all flag larger subnets (or ASNs potentially) as suspicious without blocking them, and then use that to more zealously judge smaller subnets or individual IPs. Might be a good compromise.
Edit: or do the more basic detection method except I only blacklist /24s or /32s involved after checking... that would simplify it I guess.
Edit 2: on second thought the simplified method might just lead to blacklisting upon using an abused subnet anyway :/
Edit 3: the time gaps between /24 or /32 (IPv4!) requests are so large these might be harder to detect... I'm sure it can be done though (low frequency, perhaps similar paths, lack of /assets/ requests, etc, perhaps).
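
A minimal sketch of the two-tier idea from this comment: larger subnets are only flagged, and the flag tightens the limit applied to the /24s inside them. The watch list, both limits, and the function names are hypothetical placeholders, and the hit counter for each /24 is assumed to be maintained elsewhere. IPv4 only.

```javascript
// Hypothetical watch list of /16 prefixes flagged as suspicious (not blocked).
const suspicious16 = new Set(['198.51']);

// Hypothetical limits: hits allowed per /24 within some fixed time window.
const LENIENT_LIMIT = 60;
const STRICT_LIMIT = 10;

// Pick the limit based on whether the surrounding /16 has been flagged.
function limitFor(ip) {
  const prefix16 = ip.split('.').slice(0, 2).join('.');
  return suspicious16.has(prefix16) ? STRICT_LIMIT : LENIENT_LIMIT;
}

// Decide whether the /24 this IP belongs to has exceeded its limit.
function shouldBlock(ip, hitsFromSame24) {
  return hitsFromSame24 > limitFor(ip);
}

console.log(shouldBlock('198.51.100.7', 25)); // flagged /16, strict limit applies
console.log(shouldBlock('203.0.113.7', 25));  // unflagged /16, lenient limit applies
```

The concern from Edit 2 still applies: a legitimate user inside a flagged /16 gets the strict limit too, so the limits would need tuning against real traffic.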

Author
Owner

So I decided to look at the request headers these scrapers use, and...

GET /Impostor/Impostor/src/commit/c1c6311c87a1fddaf0d2d1987b551b8e090d2f5a/src/Impostor.Plugins.Debugger HTTP/1.0
Host: code.chipmunk.land
X-Real-IP: 47.79.10.13
X-Forwarded-For: 47.79.10.13
X-Forwarded-Proto: https
Pragma: no-cache
Cache-Control: no-cache
Sec-Ch-Ua: "Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip
Accept-Language: zh-CN,zh;q=0.9
Cookie: redirect_to=%2Fqcomlt%2Fabl%2Fblame%2Fcommit%2F9d76963bc15c5044a80378748eca1b843c4740f2%2FOptionRomPkg%2FCirrusLogic5430Dxe%2FCirrusLogic5430UgaDraw.c; session=214bf591d10e9625

Some things are immediately interesting:

  • X-Forwarded-Proto: https - guess they don't want to hit http as often! (edit: I saw one http protocol request later on, from a different crawler)
  • Accept-Language: zh-CN,zh;q=0.9
  • Cookie: redirect_to=%2Fqcomlt%2Fabl%2Fblame%2Fcommit%2F9d76963bc15c5044a80378748eca1b843c4740f2%2FOptionRomPkg%2FCirrusLogic5430Dxe%2FCirrusLogic5430UgaDraw.c; session=214bf591d10e9625
    • I didn't know whether these supported cookies at all (filtering based on session cookie presence throttled them a bunch)
  • Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7 - are images usually included?!

Now this is a huge fingerprinting vector I greatly regret overlooking. This could change how I look at things completely.
If there's anything I should learn from #9, it's that I should never overlook metadata, yet I somehow forgot this lesson :/
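
One inconsistency is visible in the dump itself: Sec-Ch-Ua claims Google Chrome 120 while the User-Agent claims Chrome/139. Below is a sketch of how a header fingerprint might test for that. It is an illustration only, not the logic of any existing filter here; the function names and the lower-cased `headers` object are assumptions.

```javascript
// Extract the Chrome major version from a User-Agent string, or null.
function chromeMajorFromUa(ua) {
  const m = /Chrome\/(\d+)\./.exec(ua || '');
  return m ? Number(m[1]) : null;
}

// Extract the "Google Chrome" major version from a Sec-Ch-Ua value, or null.
function chromeMajorFromClientHint(hint) {
  const m = /"Google Chrome";v="(\d+)"/.exec(hint || '');
  return m ? Number(m[1]) : null;
}

// True only when both headers name a Chrome version and the versions differ.
// `headers` is assumed to be a plain object with lower-cased header names.
function hasChromeVersionMismatch(headers) {
  const fromUa = chromeMajorFromUa(headers['user-agent']);
  const fromHint = chromeMajorFromClientHint(headers['sec-ch-ua']);
  return fromUa !== null && fromHint !== null && fromUa !== fromHint;
}

// Header values taken from the request shown above.
const dumped = {
  'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
  'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
};
console.log(hasChromeVersionMismatch(dumped));
```

Requests without Sec-Ch-Ua (old browsers, non-Chromium browsers, plain HTTP clients) are deliberately not flagged, which matters given the backwards-compatibility goal.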

Author
Owner

So I see this:

47.82.13.84 - - [23/May/2026:01:48:07 -0400] "GET /user/login?redirect_to=%2Fchipmunkmc%2FShadow%2Fblame%2Fcommit%2F30d62464bf657b58ad149a81020217da69595ecf%2Fsrc%2Fmain%2Fjava%2Fme%2Fx150%2Fcoffee%2Fhelper%2Fevent%2Fevents%2Fbase%2FNonCancellableEvent.java HTTP/1.1" 403 548 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36" "-"

Sounds like a win, however checking the Forgejo log reveals:

2026/05/23 01:49:23 ...eb/routing/logger.go:102:func1() [I] router: completed GET /user/login?redirect_to=%2fdinhero21%2fMinecraftDeobfuscated-Mojang%2fcommits%2fcommit%2fc7169b5c0272fd92dc4e63073816bd99bd71881b%2fminecraft%2fsrc%2fnet%2fminecraft%2fworld%2fentity%2fai%2fgoal%2fLandOnOwnersShoulderGoal.java for [2a03:2880:f814:14::]:0, 200 OK in 2.4ms @ auth/auth.go:144(auth.SignIn)

But it turns out this isn't really a stealthy bot, as older logs reveal it has an actual UA:

User-Agent: meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)

So I wasn't seeing an IPv6 extension of the previous crawlers, maybe they're IPv4-only...
I also see this though:

187.190.238.151 - - [23/May/2026:01:52:32 -0400] "GET /kaboomstandardsorganization/guide/issues?q=&type=all&sort=relevance&state=open&labels=80&milestone=0&project=0&assignee=0&poster=0 HTTP/1.1" 200 63555 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:136.0) Gecko/20100101 Firefox/136.0" "-"

So I looked for the first two octets (i.e. /16) in the older HTTP log and got:

GET /Parker2991/MCProtocolLib/archive/1.21-1.bundle HTTP/1.0
Connection: keep-alive
Host: code.chipmunk.land
X-Real-IP: 187.190.28.188
X-Forwarded-For: 187.190.28.188
X-Forwarded-Proto: https
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36
Accept-Encoding: gzip, deflate
Accept: */*

Notably this lacks some headers, however it shows that either

  • there are many of these scrapers (prolly)
  • they deliberately mix up their headers yet cannot bypass something as trivial as Anubis (how)

Anyway, if I want to fingerprint these with anything other than a UA Mozilla check, I'm going to have to dig through some logs, I suppose.
Well it's getting late, so I'll sleep now.

Edit: so it seems we have, at the very least

  • That scraper that uses random UAs but always 47.79.0.0/16 and 47.82.0.0/16 and has distinct headers
  • This one, which uses random UAs and likely (residential?) proxies (except it keeps reusing the same /16 subnets lmfao)
  • The scrapers using proper UAs and often distinct subnets (Meta/OpenAI/etc); rumors say these evade blocking w/ spoofed UAs but I'm not sure if that's a misattribution to the previous two

Note that an earlier DoS incident largely involved requests from the one using random proxies, though I guess these could all pose a risk.
Also there haven't been any incidents lately (i.e. since I loosened up the Cookie block) but I guess previous incidents show I can't rely on this for too long.
Still, there was a point when the site would repeatedly die and cause disk I/O issues on server0 as mentioned above :/.
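
To make the three groups above concrete, here is a rough labelling sketch. The /16 prefixes and the `meta-externalagent` token come from the logs quoted in this comment; the label names, the `classify` function, and the idea of labelling rather than blocking are assumptions, and the token list is certainly incomplete.

```javascript
// Prefixes observed in the logs above; lists are illustrative, not exhaustive.
const FIXED_RANGE_16 = ['47.79', '47.82']; // scraper that always uses these /16s
const REUSED_PROXY_16 = ['187.190'];       // /16 the proxy-based scraper kept reusing
const DECLARED_BOT = /meta-externalagent/i; // crawlers that send an honest UA

function prefix16(ip) {
  return ip.split('.').slice(0, 2).join('.');
}

// Return a label only; whether to block is a separate decision.
function classify(ip, userAgent) {
  if (DECLARED_BOT.test(userAgent || '')) return 'declared-crawler';
  if (ip.includes(':')) return 'unknown'; // IPv6 is not covered by the lists
  const p = prefix16(ip);
  if (FIXED_RANGE_16.includes(p)) return 'fixed-range-scraper';
  if (REUSED_PROXY_16.includes(p)) return 'proxy-scraper-candidate';
  return 'unknown';
}

console.log(classify('47.79.10.13', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)'));
console.log(classify('2a03:2880:f814:14::', 'meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)'));
console.log(classify('187.190.28.188', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)'));
```

The proxy label is only a hint, since a residential /16 will also contain ordinary visitors.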

Author
Owner

Oh by the way help is appreciated but I doubt I'm going to get much here.
(okay for real this time, good night!)

Author
Owner

Well I added a new experimental workaround, except this one fully relies on request headers! /var/log/forgejo/http.log is much quieter with it, albeit some requests still leak through occasionally.
Anyway, if this does the trick I may close the issue.

Author
Owner

Judging by logs, I'd say this is a victory.

Author
Owner

I might want to strengthen soul_snatcher.js later (there are more variants of crawlers than I expected), however the current implementation has already brought the number of requests reaching Forgejo down to less than one per second, quite in contrast to how things were before.
Also I haven't seen any false positives yet!!!

Author
Owner

In case header fingerprinting is no longer feasible, I thought of a new possible heuristic: the rate at which different User-Agents or IP addresses (especially in the same ASN or subnet) are seen in otherwise identical requests.
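
A sketch of that heuristic, assuming requests are grouped by some key (for example a /24 or ASN plus a signature of the remaining headers) and that a group showing too many distinct User-Agents inside a time window is suspicious. The class name, the key format, the window, and the threshold are all hypothetical.

```javascript
// Tracks how many distinct User-Agents were seen per key within a window.
class DistinctUaTracker {
  constructor(windowMs, maxDistinct) {
    this.windowMs = windowMs;
    this.maxDistinct = maxDistinct;
    this.seen = new Map(); // key -> Map(userAgent -> time last seen)
  }

  // Record one request; returns true when the key currently shows more
  // than `maxDistinct` different User-Agents inside the window.
  observe(key, userAgent, now) {
    let uas = this.seen.get(key);
    if (!uas) {
      uas = new Map();
      this.seen.set(key, uas);
    }
    uas.set(userAgent, now);
    // Forget User-Agents that have not been seen within the window.
    for (const [ua, lastTime] of uas) {
      if (now - lastTime > this.windowMs) uas.delete(ua);
    }
    return uas.size > this.maxDistinct;
  }
}

// Made-up example: allow at most 2 distinct UAs per key within 10 seconds.
const tracker = new DistinctUaTracker(10000, 2);
console.log(tracker.observe('192.0.2.0/24', 'UA-A', 0));
console.log(tracker.observe('192.0.2.0/24', 'UA-B', 1000));
console.log(tracker.observe('192.0.2.0/24', 'UA-C', 2000));
```

The same structure works with the roles swapped (distinct IPs per header signature). A real deployment would also need to evict idle keys so the map does not grow without bound.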

3 participants
Reference
chipmunk.land/misc#17