Filter out crawlers that lag Forgejo #17

Open
opened 2026-05-16 02:12:17 -04:00 by chipmunkmc · 14 comments
Owner

So Forgejo (and potentially other services later down the line) has been getting hit by crawler(s) using random IP addresses and User-Agents.
This has caused large amounts of lag for both the Forgejo instance and (disk operations on) the underlying machine, server0.local, multiple times before.
Note that these crawlers are often said to be AI-related, which I suppose would explain why they appeared at this time.

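The current workaround in place can be seen here: https://code.chipmunk.land/chipmunk.land/misc/src/commit/9471cd39c83fa73870f217f5ae0e3c4398b814fd/server0.local/etc/nginx/http.d/forgejo.conf#L17-L31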
This workaround always required the user to have cookies enabled, however, which can cause issues for some users. It also broke now that Forgejo no longer serves session cookies to all visitors, meaning you must sign in to view matched pages. It also always required the user to visit an unmatched page once.

Note that only select pages are prone to taxing the server's resources so destructively. As of writing this, the regular expression ^/[0-9a-zA-Z\-_\.]+/[0-9a-zA-Z\-_\.]+/(commit|blame|compare|commits|(src|raw|blame)/(commit|tag))/ can be used to match all of their URL paths (as shown in the linked file).
Notably all of these pages relate to viewing data from old commits. Large amounts of requests to these often lead to lots of git cat-file processes being spawned.
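
To make it concrete which paths the expression above catches, here is a minimal Python sketch. The pattern is copied from the text; the helper name `is_heavy_path` and the example paths are made up for illustration and are not taken from real logs or from the linked nginx config.

```python
import re

# Pattern copied from the issue text; it matches the expensive Forgejo paths.
HEAVY_PATH = re.compile(
    r"^/[0-9a-zA-Z\-_\.]+/[0-9a-zA-Z\-_\.]+/"
    r"(commit|blame|compare|commits|(src|raw|blame)/(commit|tag))/"
)


def is_heavy_path(path: str) -> bool:
    """Return True if the URL path is one of the pages that tax the server."""
    return HEAVY_PATH.match(path) is not None


# Hypothetical example paths (owner/repo names are placeholders).
samples = [
    "/someuser/somerepo/src/commit/0123abcd/README.md",
    "/someuser/somerepo/commits/branch/main",
    "/someuser/somerepo/src/branch/main/README.md",
    "/someuser/somerepo/issues/17",
]

for path in samples:
    print(path, is_heavy_path(path))
```

Only the first two sample paths match: browsing a file on a branch (`src/branch/...`) or opening an issue is not covered by the expression.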

Anyway, some potential heuristics:

  • Random IPs, potentially from specific (maybe residential? unsure) ASNs are used.
  • Random User-Agent strings are used.
    Note that I haven't bothered to inspect these much, however I hear these might be those used by extremely outdated browsers.
    It's worth noting, however, that chipmunk.land aims to be very backwards compatible, deliberately supporting raw HTTP and some old TLS versions and ciphers. User-Agent filters could disrupt this.
  • All User-Agent strings tend to contain the string Mozilla (I mean c'mon, even Anubis uses this one!)
  • These clients make very frequent requests to random URL paths picked up from links, even if the site is down.
    If they ever re-use IP addresses, this could be very useful.
  • As far as I know, these clients never bother to request images/stylesheets/scripts, and also neglect to preserve cookies (as used in the current hack) too.
    The same can sometimes be said for some legitimate clients (curl, lynx, specially configured browsers, etc) though.
  • About that, the clients aren't really capable of executing scripts in the first place, though nor is literally any browser with JavaScript disabled or outright unsupported.
  • Perhaps they all use one protocol too (haven't checked, HTTPS is likely as modern websites tend to force it).
  • They might even all use the same TLS versions, though this might be harder to filter.
  • These clients could also use a consistent set and ordering of headers but I haven't checked.
  • Their ability to follow URL fragments might be limited.
    As of writing this, if you visit view-source:https://codeberg.org/heathercat123/apiclone/commits/branch/main in a fresh private/incognito window you will see a <a href="?trans-rights=human-rights"> element clickable by the user.
  • The crawlers cannot handle meta redirects properly, not that this helps us much here.
    On that same page, I also see <meta content="1; url=?trans-rights=human-rights" http-equiv=refresh>.

The idea is to prevent these crawlers from connecting, to avoid downtime of course, while also not changing anything about the results of requests from, well, real users or literally any other clients not taking the website down.
Of course, I do not want to require bloated JS challenges, in fact I don't want to present any sort of challenge pages. Kinda sorta like how I don't serve ADs or other annoyances either.
Also this is very well a case where, so to speak, you do not need to outrun the bear, but only the person who is next to you; Anubis quite literally only presents challenges to Mozilla UAs (trivial to bypass) and is still effective!!!

Multiple forges running GitLab and Forgejo are still running normally in this epidemic, and I intend for mine to be one of them.

As for other potentially affected services:

  • MediaWiki - I hear some endpoints it provides can be particularly taxing, perhaps in particular those for diffing pages.
    Has maxed out server0.local's CPU usage (and its CPU fan) before, but fortunately it's not being hit AFAIK.
  • APIClone - No idea how well this one handles stress; there have been instances of it slowing down but judging by server logs these are likely unrelated bugs.
    Note that requests to a service like APIClone shouldn't be too taxing though, and the most risky operations (project/asset loading) are behind JS and/or Flash.

These should be safe:

  • Static sites - very low overhead.
  • linx-server - this one doesn't even have many outlinks for crawlers to follow, and it mainly serves files.

I also feel like this shows just how vulnerable some of the web is to effective DDoS, and the fact I can't handle it quickly makes me doubt how many deliberate attacks I would be competent enough to take on right now were they happening... oh well.
Note that I have nothing against crawlers or web scraping (which is pretty cool imho) in general, I'm not even that mad at AI itself, I just don't want my goddamn sites taken down!!!

As for AI and its haters, here's a final bit of yap. I've heard of and seen people and organizations take extreme measures against AI scraping, such as presenting challenges for personal/static sites or even blocking archival (oof), despite these not preserving a meaningful amount of computing power for them.
I believe that since forever if info is public on the web it will, you know, be processed and fucked with by whoever and whatever the hell wants, and that this is an attempt to artificially defy this principle.
Of course, there is a business argument, particularly that AI training will let entities other than them make money from their work out of serendipity. Of course I feel as if this comes from a rather proprietary philosophy, but many AI-related things are proprietary too so whatever. Still, I think impacting things as important as archival over petty business affairs is plain wrong.
I suppose this also shows that consensual archival tends to fail, though unfortunately more forceful services like https://archive.ph/ can have trouble staying alive...

Author
Owner

And now that I make this issue, and look at the requests being made for Forgejo, I notice they are being made much more slowly now... sort of peculiar how that happened.
Edit: oh also multiple crawlers (w/ non-random UAs mostly) are crawling a wiki on the server more quickly, not sure if that's a coincidence.

Author
Owner

I've removed the hacky mitigation, let's see if the site doesn't get DoS'd again. If this isn't the end, of course, I'll maybe stop being lazy and actually try filtering out the crawlers.
Edit: right when that happened the requests became more frequent again, damn. I may have to add back the hacky workaround for now.
Edit 2: the site wasn't lagged or anything but I don't feel like risking it anymore.
Edit 3: I'm stupid lmao. I looked at the requests being logged by Forgejo, which doesn't include ones blocked by the hacky nginx config.


@chipmunkmc wrote in #17 (comment):

I've removed the hacky mitigation, let's see if the site doesn't get DoS'd again. If this isn't the end, of course, I'll maybe stop being lazy and actually try filtering out the crawlers. Edit: right when that happened the requests became more frequent again, damn. I may have to add back the hacky workaround for now. Edit 2: the site wasn't lagged or anything but I don't feel like risking it anymore. Edit 3: I'm stupid lmao. I looked at the requests being logged by Forgejo, which doesn't include ones blocked by the hacky nginx config.

@chipmunkmc there is more bots just use git.gay's instance's robots.txt and add scratch account stuff and allow the scratch auth ua


@mysticgiggle wrote in #17 (comment):

@chipmunkmc wrote in #17 (comment):

I've removed the hacky mitigation, let's see if the site doesn't get DoS'd again. If this isn't the end, of course, I'll maybe stop being lazy and actually try filtering out the crawlers. Edit: right when that happened the requests became more frequent again, damn. I may have to add back the hacky workaround for now. Edit 2: the site wasn't lagged or anything but I don't feel like risking it anymore. Edit 3: I'm stupid lmao. I looked at the requests being logged by Forgejo, which doesn't include ones blocked by the hacky nginx config.

@chipmunkmc there is more bots just use git.gay's instance's robots.txt and add scratch account stuff and allow the scratch auth ua

@chipmunkmc restrict the bot accounts and delete them and give me admin


@mysticgiggle wrote in #17 (comment):

@mysticgiggle wrote in #17 (comment):

@chipmunkmc wrote in #17 (comment):

I've removed the hacky mitigation, let's see if the site doesn't get DoS'd again. If this isn't the end, of course, I'll maybe stop being lazy and actually try filtering out the crawlers. Edit: right when that happened the requests became more frequent again, damn. I may have to add back the hacky workaround for now. Edit 2: the site wasn't lagged or anything but I don't feel like risking it anymore. Edit 3: I'm stupid lmao. I looked at the requests being logged by Forgejo, which doesn't include ones blocked by the hacky nginx config.

@chipmunkmc there is more bots just use git.gay's instance's robots.txt and add scratch account stuff and allow the scratch auth ua

@chipmunkmc restrict the bot accounts and delete them and give me admin

"give me admin" yeah no lmfao

Author
Owner

of course, a user who sends passwords in minecraft chat and the likes and is unwanted on basically every issue tracker on the site is totally good admin material lmfao


lmfao

Author
Owner

I could try to script something up. The reason this is hard is that it's much easier to prove a client's innocence (i.e. no Mozilla UA, Cookies present, etc) than a client's guilt, so to speak.
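
A minimal sketch of the "prove innocence" idea, assuming the only two signals are the ones named above (no Mozilla in the User-Agent, or cookies present). The function name `looks_innocent` and the plain-dict header representation are hypothetical, and a `False` result only means "not proven innocent", not "guilty".

```python
def looks_innocent(headers: dict[str, str]) -> bool:
    """Return True when a request carries a signal the crawlers were not seen to send.

    `headers` is assumed to use canonical names ("User-Agent", "Cookie").
    """
    user_agent = headers.get("User-Agent", "")
    has_cookies = bool(headers.get("Cookie", "").strip())

    # Signal 1: the crawler User-Agents all tended to contain "Mozilla".
    non_mozilla_ua = "Mozilla" not in user_agent

    # Signal 2: the crawlers were not seen to preserve cookies.
    return non_mozilla_ua or has_cookies


# Hypothetical requests for illustration.
print(looks_innocent({"User-Agent": "curl/8.5.0"}))
print(looks_innocent({"User-Agent": "Mozilla/5.0", "Cookie": "session=abc"}))
print(looks_innocent({"User-Agent": "Mozilla/5.0"}))
```

Anything that fails this check (for example a regular browser on its first visit) still needs some other evidence before it can be treated as a crawler, which is the hard part.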

Author
Owner

While looking through some old logs I noticed something quite peculiar: I saw multiple IPs start with the same 2 octets! This could help with filtering a lot >:D
Edit: I also see IPs with matching first octets, and sometimes but rarely close (.82. and .83.) second octets. This honestly reminds me of when a deliberate DDoS hit me with a bunch of POSTs from 80. and 81. IPs.
Edit 2: many of the IPs are from the same ISP too; again, this reminds me of the intentional DDoS from earlier.
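
One way to check how strongly the addresses cluster is to count them per shared prefix. The sketch below uses Python's standard `ipaddress` module; the function name `count_by_prefix` and the sample addresses (private 10.x ranges, not real crawler IPs) are made up for illustration, and it only handles IPv4.

```python
import ipaddress
from collections import Counter


def count_by_prefix(ips, prefix_len=16):
    """Count IPv4 addresses per enclosing network of the given prefix length."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us pass a host address and get its network back.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


# Hypothetical addresses: two share the first two octets, one is a neighbour.
sample_ips = ["10.82.1.5", "10.82.200.7", "10.83.4.4"]

for network, hits in count_by_prefix(sample_ips).most_common():
    print(network, hits)
```

Passing `prefix_len=8` groups by the first octet only, which matches the looser pattern mentioned in the first edit.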

Author
Owner

Also a little backstory: so this whole crisis really began in 2025... sorta.

  • Early on, I noticed some sites were behind an apparent CF-like challenge; I realized this was Anubis later on.
  • Later incoming requests to this Forgejo caused a significant slowdown, so I got all paranoid about it and put it behind Anubis, but was not satisfied with the result and reverted it.
    • I sort of regret this, but it led to something... see below.
    • I later put the registration endpoint behind Anubis, and if I remember correctly spambots continued to get in, hinting at them using headless browsers. (related: #9)
    • This is also when I realized the site was more vulnerable to DoS than I had assumed. Silly me!
  • Later that year crawlers started causing Forgejo's repository archive cache to fill up server0.local's disk; I rm -rf'd the repo archive folder, in turn screwing up Forgejo's queues without knowing what I did. Oops!
    • Months later (I'm so stupid) I found the issue and rm -rf'd the queues too, but the disk got filled again. Afterwards, I tuned the repo archiver via app.ini, and the issue stopped.

So how did this snowball into the site dying? Well let's go on...

  • Also in later 2025 (or earlier 2026?), the site suffered from another slowdown, the second of this sort I can remember. I didn't check logs much, but users later concluded this was due to repository mirroring, leading to a mirror getting deleted.
  • In earlier 2026, a bunch of files from a specific commit on a mirrored repository got requested en masse, leading to another slowdown. I ended up making the mirror organization limited (login-only) as a lazy "fix."
    • I mean, it makes sense that mirrored repositories would be the largest.
  • There were multiple other incidents I think, but eventually it happened to a user's repository (not mine) again. This is the straw that broke the camel's back.

So now what?

  • Using a simple regex location block, I blocked access to files from specific commits; this lazily but successfully stopped the attacks for the time being.
  • Eventually I asked a kaboom.pw user for help, and got an nginx snippet that blocked access to certain paths also via regex location, but allowed access for all users with session cookies.
    • This, for a time, was actually a very clean solution; Forgejo (like its parent, Gitea) always provided session cookies for users at around this time.
    • Of course this was never truly ideal, as it requires the user to have cookies enabled, cannot be applied to sites without similar cookie behavior (hint hint), and required at least one page visit. Still, it was largely unnoticeable to many.
  • Now crawlers kept discovering new paths that slowed down the site (such as ones using tags instead of commits), so it became a cat-and-mouse game to block those. Around this time the crawlers also started slowing down HDD operations on server0.local when they hit.
  • After more cat-and-mouse the site reached a state of stability, however...
  • A Forgejo update changed the session cookie behavior, breaking this workaround.

I knew a better workaround was needed since the incident that led to the limiting, however I've been neglecting to truly look into implementing it until now.

This, combined with the DDoS incidents of 2025 (one leading to long IPv4 downtime until I could procure a VPS) occurred somewhat close together, I feel. It's as if chipmunk.land is suffering from growing pains... but of course, it's still alive and kicking.

  • Oh, and we've been here since November of 2022! Well actually the lines are blurry, I hosted a Minecraft server on my laptop that was once accessible on real.chipmunk.land but previously via my home IP in mid-to-late 2022.
    I've hosted services on that laptop before though, particularly chipmunkbot (https://code.chipmunk.land/chipmunkmc/chipmunkbot-archive).
    Even in late 2021 I hosted a bot called SandCatBot (though it was unstable), and earlier that year I temporarily ran an AFK bot for some Aternos server!
    Of course, the website stuff only came in late 2022, this Forgejo being set up closer to 2023.

Alright, enough procrastination.


i have been around for most of chipmunk.land's life lol

Author
Owner

So blocking whole subnets (or ASNs) could cause a lot of false positives. While I could make a challenge in pure HTTP, challenges are still a very ugly solution to DoS; ideally the page should handle HTTP requests the same way as before.
However, according to a log file, the crawlers actually re-used IPs 3 times, so we could exploit this! Supposedly someone could even make a list of all crawler IPs from server logs once. I've also seen some filters (perhaps) allow the first request to a site through sometimes.
The IP reuse instances occurred seconds apart though, and seemed to be across the same /24. Also, the log I've been looking at only spans less than 3 minutes in total. I'll have to look at more logs.
Sorry I've been too lazy to get the filter written! I just procrastinate too much...
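
As a rough sketch of how the reuse could be exploited: keep a sliding window of request times per /24 and flag a subnet once it shows up often enough within that window. The class name `SubnetReuseDetector`, the 60-second window and the threshold of 3 are all assumptions picked for illustration (the log only suggests reuse happens "seconds apart"), timestamps are assumed to arrive in non-decreasing order, and only IPv4 is handled.

```python
import ipaddress
from collections import defaultdict, deque


class SubnetReuseDetector:
    """Flag subnets that are seen repeatedly within a short time window."""

    def __init__(self, window_seconds=60.0, threshold=3, prefix_len=24):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.prefix_len = prefix_len
        self._hits = defaultdict(deque)  # subnet -> recent request timestamps

    def observe(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if its subnet has reached the threshold."""
        subnet = ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)
        hits = self._hits[subnet]
        hits.append(timestamp)
        # Drop requests that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) >= self.threshold


# Hypothetical traffic: three requests from one /24 within ten seconds.
detector = SubnetReuseDetector()
for ip, ts in [("10.0.0.1", 0.0), ("10.0.0.2", 5.0), ("10.0.0.3", 10.0)]:
    print(ip, ts, detector.observe(ip, ts))
```

The first request from any subnet always passes, which fits the observation that some filters let the first request to a site through.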

Author
Owner

So after grepping one log (long story) for /src/commit access, I saw a couple of Alibaba subnets were the only ones seen (only two /16s, and even distinct /24s albeit a lot); maybe blocking them will significantly throttle the crawlers?
Back to writing a filter to automatically detect these (as they use other IP ranges of course), I noticed the first log I checked has a bunch of instances of /24 ranges being reused, however many were multiple (maybe about 30) seconds apart.
However, back to the IP list from that newer log, individual IPs were duplicated very often too (removing duplicates made the list over 4x shorter). I could maybe still filter by IP, or even combine strategies.
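
The grep-and-deduplicate step could look something like the sketch below. It assumes the log uses nginx's default "combined" format (client address first, request line in quotes); the function name `ips_requesting` and the sample lines are invented for illustration and are not real log entries.

```python
import re
from collections import Counter

# Assumed layout: nginx "combined" format, e.g.
# 10.82.1.5 - - [16/May/2026:02:00:00 -0400] "GET /path HTTP/1.1" 200 123 "-" "UA"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)'
)


def ips_requesting(lines, needle="/src/commit"):
    """Count requests per client IP for paths containing `needle`."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("path"):
            counts[match.group("ip")] += 1
    return counts


# Hypothetical log lines.
sample_log = [
    '10.82.1.5 - - [16/May/2026:02:00:00 -0400] "GET /u/r/src/commit/abc/a.txt HTTP/1.1" 200 123 "-" "Mozilla/5.0"',
    '10.82.1.5 - - [16/May/2026:02:00:03 -0400] "GET /u/r/src/commit/abc/b.txt HTTP/1.1" 200 456 "-" "Mozilla/5.0"',
    '10.83.4.4 - - [16/May/2026:02:00:04 -0400] "GET /u/r/issues/17 HTTP/1.1" 200 789 "-" "Mozilla/5.0"',
]

counts = ips_requesting(sample_log)
print("requests:", sum(counts.values()), "unique IPs:", len(counts))
```

The ratio of total requests to unique IPs is the "how much shorter did the list get" figure; a high ratio suggests per-IP filtering still has some value.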

Author
Owner

I don't think that stopped them at all, but the site hasn't died yet so whatever, I'll write the script soon. This made the crawlers reveal their IPv6 addresses (for some reason) just now anyway.
As for filtering, here's an idea. I can first of all flag larger subnets (or ASNs potentially) as suspicious without blocking them, and then use that to more zealously judge smaller subnets or individual IPs. Might be a good compromise.
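
A minimal sketch of that two-tier idea, assuming /16 as the "larger" subnet and /24 as the "smaller" one. The function name `find_blockable_subnets` and every threshold are placeholders chosen for illustration, not tuned values, and only IPv4 is handled.

```python
import ipaddress
from collections import Counter


def find_blockable_subnets(ips, wide_prefix=16, narrow_prefix=24,
                           suspicious_after=100, strict_limit=5,
                           lenient_limit=50):
    """Return narrow subnets to block, judged more strictly inside suspicious wide ones."""
    wide = Counter()
    narrow = Counter()
    for ip in ips:
        wide[ipaddress.ip_network(f"{ip}/{wide_prefix}", strict=False)] += 1
        narrow[ipaddress.ip_network(f"{ip}/{narrow_prefix}", strict=False)] += 1

    # Step 1: wide subnets are only flagged as suspicious, never blocked outright.
    suspicious = {net for net, hits in wide.items() if hits >= suspicious_after}

    # Step 2: narrow subnets get a lower limit when their parent is suspicious.
    blockable = []
    for net, hits in narrow.items():
        parent = net.supernet(new_prefix=wide_prefix)
        limit = strict_limit if parent in suspicious else lenient_limit
        if hits >= limit:
            blockable.append(net)
    return sorted(blockable)


# Hypothetical traffic: a busy /16 with one noisy /24, plus an unrelated /24.
ips = [f"10.82.{i}.1" for i in range(100)]
ips += [f"10.82.7.{i}" for i in range(2, 7)]
ips += [f"10.99.1.{i}" for i in range(10)]
print([str(net) for net in find_blockable_subnets(ips)])
```

Since a flagged /16 only lowers the bar for the /24s inside it, a legitimate user elsewhere in that /16 is unaffected unless their own /24 also gets noisy.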
