*> and use robots.txt as a guide of what to crawl rather than what not to crawl*...

matrss · 2026-03-26T16:33:49 1774542829

So, basically iocaine (https://iocaine.madhouse-project.org/). It has indeed been very useful to get the AI scraper load on a server I maintain down to a reasonable level, even with its not so strict default configuration.

willx86 · 2026-03-26T16:56:58 1774544218

https://blog.cloudflare.com/ai-labyrinth/

A bit like this? (iocaine is newer)

johnisgood · 2026-03-26T17:15:24 1774545324

If I think about it, I find it awful. The fact that we need to put junk in our own stuff just for crawlers does not sit well with me.

account42 · 2026-03-31T09:30:54 1774949454

Yup, it's a clown world.

Any functioning society would have dealt with the offenders directly and stopped this before it became an issue for most sites.

matrss · 2026-03-26T18:41:55 1774550515

First time seeing that, but yes, seems similar in concept. Iocaine can be self-hosted and put in as a "middleware" in your reverse proxy with a few lines of config, while Cloudflare's seems tied to their services. Cloudflare's also generates its garbage with generative models, while iocaine uses much simpler (and surely more "crude") methods of generating its garbage. Using LLMs to feed junk to LLMs just makes me cry, so much wasted compute.

Is iocaine actually newer though? Its first commit dates to 2025-01, while the blog post is from 2025-03. I couldn't find info on when Cloudflare started theirs. There's also Nepenthes, which had its first release in 2025-01 too.

dspillett · 2026-03-27T08:22:50 1774599770

Yes, except with the content being based on the real content rather than completely random. My intuition says that this will be more effective, specifically poisoning the model wrt tokens relating to that content rather than just increasing the overall noise level a bit (the damage there being smoothed out over the wider model).
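To make the "based on the real content" idea concrete, here is a minimal sketch of one crude way to do it: a word-level Markov chain built from a page's own text, so the junk reuses the same vocabulary as the real content. This is only an illustration under my own assumptions, not the actual implementation of any tool mentioned in this thread; `build_chain` and `babble` are hypothetical names.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the source text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def babble(chain, length, seed=0):
    """Generate `length` words of plausible-looking junk from the chain."""
    if not chain or length <= 0:
        return ""
    rng = random.Random(seed)  # seeded so the output is reproducible
    starts = sorted(chain)     # every word that has at least one follower
    word = rng.choice(starts)
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (word only appeared last in the text): restart somewhere.
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)


real_page = (
    "the crawler fetches the page and the page links to the archive "
    "and the archive links back to the crawler trap"
)
print(babble(build_chain(real_page), 20))
```

Every word in the output comes from the real page, but the sentences no longer mean anything, which is the property being argued for above.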

freedomben · 2026-03-26T17:22:22 1774545742

Hot damn, this is a great idea! Reminds me fondly of an old project a friend and I built that looks like an SSH prompt or optionally an unauthed telnet listener, which looks and feels enough like a real shell that we would capture some pretty fascinating sessions of people trying to explore our system or load us with malware. Eventually somebody figured it out and then DDoSed the hell out of our stuff and would not stop hassling us. It was a good reminder that yanking people's chains sometimes really pisses them off and can attract attention and grudges that you really don't want. My friend ended up retiring his domain because he got tired of dealing with the special attention. It did allow us to capture some pretty fascinating data though that actually improved our security while it lasted.

Ferret7446 · 2026-03-26T20:31:50 1774557110

This is one reason why most crawlers ignore robots.txt now. The other reason is that bandwidth/bots are cheap enough now that they don't need web admins to help them optimize their crawlers.

dspillett · 2026-03-27T16:03:35 1774627415

> This is one reason why most crawlers ignore robots.txt now.

I don't buy that for a second. Those not obeying robots.txt were doing so either because they were malicious (they wanted everything and wouldn't be told “please don't plough through these bits”) or stupid (not knowing any better) or both.

Anyone who was obeying robots.txt isn't going to start ignoring it because we've put honeypots there. Why would they think “well, now there are honeypots there I'm going to go scan those… honeypots, yeah, that's a good idea”.
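A rough sketch of why that holds, using Python's standard `urllib.robotparser`: a crawler that checks robots.txt before fetching never requests the disallowed honeypot path, so only the ones ignoring it end up there. The `/trap/` path, the `ExampleBot` agent name, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the honeypot lives under a disallowed path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def allowed(robots_txt, agent, url):
    """Return True if a polite crawler would be permitted to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def is_honeypot_hit(path, trap_prefix="/trap/"):
    """Server side: any request under the trap prefix ignored robots.txt."""
    return path.startswith(trap_prefix)


for url in ("https://example.com/articles/1", "https://example.com/trap/page1"):
    verdict = "fetch" if allowed(ROBOTS_TXT, "ExampleBot", url) else "skip"
    print(url, "->", verdict)
```

The polite crawler skips `/trap/page1`; anything that shows up in the server logs under that prefix has, by construction, disregarded the file.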

> The other reason is that bandwidth/bots are cheap enough now that they don't need web admins to help them optimize their crawlers

Web admins are not trying to optimize their crawlers, they are trying to stop their crawlers breaking sites.

account42 · 2026-03-31T09:36:46 1774949806

> Web admins are not trying to optimize their crawlers, they are trying to stop their crawlers breaking sites.

Actually they often are, and that's one of the original purposes of robots.txt: to get search engines to stop wasting time on indexing worthless crap like endless dynamically generated pages. It's only relatively recently that most crawlers have had a hostile relationship with website operators.