The Open-Source Software Saving the Internet From AI Bot Scrapers

fattyfoods@feddit.nl · 23 days ago

unexposedhazard@discuss.tchncs.de · edit-2 23 days ago

Non-paywalled link: https://archive.is/VcoE1

It basically boils down to making the browser do some CPU-heavy calculations before allowing access. This is no problem for a single user, but for a bot farm this would increase the amount of compute power they need 100x or more.

Mubelotix@jlai.lu · 23 days ago

Exactly. It’s called proof-of-work. It was originally invented to reduce spam email and was later used by Bitcoin to control its growth speed.
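
For anyone curious what that looks like in practice, here is a minimal hashcash-style sketch in Python. It is not Anubis's actual code: the challenge string, the "leading zero bits" rule and all function names are assumptions made up for illustration. The point is the asymmetry: the client has to try many hashes, the server checks the answer with one.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # For a non-zero byte, 8 - bit_length() is its number of leading zero bits.
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce (about 2**difficulty_bits hashes on average)."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


if __name__ == "__main__":
    # "example-challenge" and 12 bits are arbitrary illustrative values.
    nonce = solve("example-challenge", 12)
    print(nonce, verify("example-challenge", nonce, 12))
```

One visitor pays this cost once per site; a scraper hitting millions of pages pays it over and over, which is where the extra compute comes from.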

JackbyDev@programming.dev · 22 days ago

It’s funny that older captchas could be viewed as proof of work algorithms now because image recognition is so good. (From using captchas.)

Mubelotix@jlai.lu · edit-2 22 days ago

Interesting stance. I have bought many tens of thousands of captchas for legitimate reasons, and I have now completely lost faith in them.

Ferk@lemmy.ml · edit-2 21 days ago

That’s actually a good idea. A very simple “click the frog” captcha might be solvable by an AI but it would work as a way to make it more expensive for crawlers without wasting compute resources (energy!) on the user or slowing down old devices to a crawl. So in some ways it could be a better alternative to Anubis.

RandomTester@lemmybefree.net · 22 days ago

it wasn’t made for bitcoin originally? didn’t know that!

0xebfe@lemmy.today · 22 days ago

Originally called hashcash: http://hashcash.org/

RandomTester@lemmybefree.net · 22 days ago

you know it’s old when it doesn’t have ssl

daw@feddit.org · 22 days ago

exu@feditown.com · 22 days ago

It inherently blocks a lot of the simpler bots by requiring JavaScript as well.

lazynooblet@lazysoci.al · 23 days ago

Thank you for the link. Good read

fuzzy_tinker@lemmy.world · 23 days ago

This is fantastic and I appreciate that it scales well on the server side.

AI scraping is a scourge and I would love to know the collective amount of power wasted due to the necessity of countermeasures like this and add this to the total wasted by AI.

interdimensionalmeme@lemmy.ml · 22 days ago

All this could be avoided by making people submit photo ID to log in to an account.

adr1an@programming.dev · 22 days ago

That’s awful: it means I would get my photo ID stolen hundreds of times per day. There’s also thisfacedoesntexists… and it won’t work, for many reasons. Not all websites require an account. And even those that do, when they ask for “personal verification” (like dating apps), have a hard time implementing just that. Most “serious” cases use human review of the photo and a video that has your face and you move in and out of an oval shape…

interdimensionalmeme@lemmy.ml · 22 days ago

Also you must drink a verification can!

HaraldvonBlauzahn@feddit.org · 22 days ago

I don’t think this would help:

https://thispersondoesnotexist.com/

interdimensionalmeme@lemmy.ml · 22 days ago

By photo ID, I don’t mean just any photo, I mean “photo id” cryptographically signed by the state, certificates checked, database pinged, identity validated, the whole enchilada

Russ@bitforged.space · 21 days ago

That would have the same effect as just taking the site offline…

No one is giving a random site their photo ID.

interdimensionalmeme@lemmy.ml · 21 days ago

You’d be surprised, many humans simply have no backbone, common sense or self-respect, so I think they very probably still would, in large numbers. Proof is Facebook and Palantir.

grysbok@lemmy.sdf.org · 23 days ago

My archive’s server uses Anubis and after initial configuration it’s been pain-free. Also, I’m no longer getting multiple automated emails a day about how the server’s timing out. It’s great.

We went from about 3000 unique “pinky swear I’m not a bot” visitors per (iirc) half a day to 20 such visitors. Twenty is much more in line with expectations.

Jankatarch@lemmy.world · 22 days ago

Every time I see Anubis I get happy because I know the website has some quality information.

Panda@lemmy.today · 22 days ago

I’ve seen this pop up on websites a lot lately. Usually it takes a few seconds to load the website but there have been occasions where it seemed to hang as it was stuck on that screen for minutes and I ended up closing my browser tab because the website just wouldn’t load.

Is this a (known) issue or is it intended to be like this?

lime!@feddit.nu · 22 days ago

anubis is basically a bitcoin miner, with the difficulty turned way down (and obviously not resulting in any coins), so it’s inherently random. if it takes minutes it does seem like something is wrong though. maybe a network error?

isolatedscotch@discuss.tchncs.de · edit-2 22 days ago

adding to this, some sites set the difficulty way higher than others, nerdvpn’s invidious and redlib instances take about 5 seconds and some ~20k hashes, while privacyredirect’s instances are almost instant with less than 50 hashes each time
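
a rough way to see both the spread between sites and the occasional "stuck for ages" case, assuming the challenge is of the form "find a hash with k leading zero bits" (that form and the function names are my assumption for illustration, not something taken from the Anubis source):

```python
import math


def expected_hashes(difficulty_bits: int) -> int:
    """Average number of attempts when each hash succeeds with probability 2**-k."""
    return 2 ** difficulty_bits


def chance_unsolved_after(attempts: int, difficulty_bits: int) -> float:
    """Probability that none of the first `attempts` hashes met the target."""
    p = 2.0 ** -difficulty_bits
    return (1.0 - p) ** attempts


if __name__ == "__main__":
    for bits in (5, 14, 20):
        average = expected_hashes(bits)
        # Chance of still being stuck after three times the average amount of work.
        unlucky = chance_unsolved_after(3 * average, bits)
        print(f"{bits} bits: ~{average} hashes on average, "
              f"{unlucky:.1%} chance of needing more than 3x that")
```

every extra bit doubles the average work, so a small config change is the difference between "instant" and "several seconds", and because each attempt is an independent coin flip, a few percent of visitors will always get an unlucky run that takes several times longer than average.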

RepleteLocum@lemmy.blahaj.zone · 22 days ago

So they make the internet worse for poor people? I could get through 20k in a second, but someone with just an old laptop would take a few minutes, no?

isolatedscotch@discuss.tchncs.de · 22 days ago

So they make the internet worse for poor people? I could get through 20k in a second, but someone with just an old laptop would take a few minutes, no?

i mean, kinda? you are absolutely right that someone with an old pc might need to wait a few extra seconds, but the speed is ultimately throttled by the browser

JackbyDev@programming.dev · 22 days ago

Well, it’s the scrapers that are causing the problem.

MadPsyentist@lemmy.nz · 21 days ago

Just wait till they hit my homepage with a 200 MB React frontend, 9 separate tracking / analytics scripts and generic Shopify scripts on it :P

Random Dent@lemmy.ml · 22 days ago

Isn’t that just the way things work in general though? If you have a worse computer, everything is going to be slower, broadly speaking.

Cricket [he/him]@lemmy.zip · 21 days ago

I have had a similar experience. Most sites with Anubis take only a few seconds to go through, but I ran into I think it was some small blog where it took at least 5 minutes. Like someone mentioned, it may have been how they set it up with number of hashes required. The site that took forever for me seemed to have some exorbitant number like 5k or 50k (I don’t recall exactly).

bdonvr@thelemmy.club · 23 days ago

Ooh can this work with Lemmy without affecting federation?

Captain Beyond@linkage.ds8.zone · 23 days ago

Yes.

Source: I use it on my instance and federation works fine

bdonvr@thelemmy.club · 23 days ago

Thanks. Anything special configuring it?

Captain Beyond@linkage.ds8.zone · edit-2 23 days ago

I keep my server config in a public git repo, but I don’t think you have to do anything really special to make it work with lemmy. Since I use Traefik I followed the guide for setting up Anubis with Traefik.

I don’t expect to run into issues, as Anubis specifically looks for user-agent strings that appear like human users (i.e. they contain the word “Mozilla”, as most graphical web browsers do). Any request clearly coming from a bot that identifies itself is left alone, and Lemmy identifies itself as “Lemmy/{version} +{hostname}” in requests.
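
As a rough sketch of that rule in Python (the function name and the example user-agent strings are made up for illustration, and the real rule set is presumably more detailed than a single substring check):

```python
def needs_challenge(user_agent: str) -> bool:
    """Hypothetical rule: only clients that look like a graphical browser are challenged."""
    return "Mozilla" in user_agent


if __name__ == "__main__":
    # Both strings are illustrative; the version and hostname are placeholders.
    browser = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
    federation = "Lemmy/0.19.3 +lemmy.example"
    print(needs_challenge(browser))     # browser-like client gets the challenge
    print(needs_challenge(federation))  # self-identified software is left alone
```

So federation traffic never sees the challenge page, while a scraper pretending to be a browser does.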

deadcade@lemmy.deadca.de · 22 days ago

“Yes”, for any bits the user sees. The frontend UI can be behind Anubis without issues. The API, including both user and federation, cannot. We expect “bots” to use an API, so you can’t put human verification in front of it. These “bots” also include applications that aren’t aware of Anubis, or unable to pass it, like all third party Lemmy apps.

That does stop almost all generic AI scraping, though it does not prevent targeted abuse.

Captain Beyond@linkage.ds8.zone · 22 days ago

The API, including both user and federation, cannot.

This is theoretically an issue; however, in practice Anubis only weighs requests that appear to come from a browser: https://anubis.techaro.lol/docs/design/how-anubis-works

I just tested my instance with Jerboa and it seems to work just fine.

interdimensionalmeme@lemmy.ml · 22 days ago

Yes, it would make Lemmy as unsearchable as Discord, instead of as unsearchable as Pinterest.

bdonvr@thelemmy.club · 22 days ago

That’s not true, search indexer bots should be allowed through from what I read here.

interdimensionalmeme@lemmy.ml · 22 days ago

If you allow my SearXNG search scraper, then an AI scraper is indistinguishable from it.

If you mean, “google and duckduckgo are whitelisted” then lemmy will only be searchable there, those specific whitelisted hosts. And google search index is also an AI scraper bot.

infinitesunrise@slrpnk.net · 23 days ago

Yeah, it’s already deployed on slrpnk.net. I see it momentarily every time I load the site.

seang96@spgrn.com · 23 days ago

As long as it’s not configured improperly. When the Forgejo devs added it, it broke downloading images with Kubernetes for a moment. Basically you would need to make sure the user agent header for federation is allowed.

Resonosity@lemmy.dbzer0.com · 22 days ago

To be honest, I need to ask my admin about that!

fxomt@lemmy.dbzer0.com · edit-2 22 days ago

We don’t use anubis but we use iocaine (?), see /0 for the announcement post

medem@lemmy.wtf · 23 days ago

What advantage does this software provide over simply banning bots via robots.txt?

</Stupidquestion>

kcweller@feddit.nl · 23 days ago

Robots.txt expects that the client is respecting the rules, for instance, marking that they are a scraper.

AI scrapers don’t respect this trust, and thus robots.txt is meaningless.
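
To make that concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name and the example.org URLs are invented for the example. The check runs entirely on the client's side, so a scraper that never calls it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; a real one would be fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_fetch_allowed(user_agent: str, url: str) -> bool:
    """What a well-behaved crawler asks itself before requesting a URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(polite_fetch_allowed("ExampleBot", "https://example.org/private/data"))
    print(polite_fetch_allowed("ExampleBot", "https://example.org/blog/post"))
    # A rude crawler simply skips this function and requests both URLs anyway.
```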

medem@lemmy.wtf · 23 days ago

Well, now that y’all put it that way, I think it was pretty naive of me to think that these companies, whose business model is basically theft, would honour a lousy robots.txt file…

PlantPowerPhysicist@discuss.tchncs.de · 23 days ago

the scrapers ignore robots.txt. It doesn’t really ban them - it just asks them not to access things, but they are programmed by assholes.

irotsoma@lemmy.blahaj.zone · 23 days ago

TL;DR: You should have both due to the explicit breaking of the robots.txt contract by AI companies.

AI generally doesn’t obey robots.txt. That file is just notifying scrapers what they shouldn’t scrape, but relies on good faith of the scrapers. Many AI companies have explicitly chosen not to comply with robots.txt, thus breaking the contract, so this is a system that causes those scrapers that are not willing to comply to get stuck in a black hole of junk and waste their time. This is a countermeasure, but not a solution. It’s just way less complex than other options that just block these connections, but then make you get pounded with retries. This way the scraper bot gets stuck for a while and doesn’t waste as many of your resources blocking them over and over again.

thingsiplay@beehaw.org · 23 days ago

The difference is:

robots.txt is a promise without a door
Anubis is a physical closed door, that opens up after some time

Mwa@thelemmy.club · 23 days ago

The problem is AI doesn’t follow robots.txt, so Cloudflare and Anubis developed a solution.

oong3Eepa1ae1tahJozoosuu@lemmy.world · 23 days ago

I mean, you could have read the article before asking, it’s literally in there…

refalo@programming.dev · edit-2 22 days ago

I don’t understand how/why this got so popular out of nowhere… the same solution has already existed for years in the form of haproxy-protection and a couple others… but nobody seems to care about those.

Flipper@feddit.org · 22 days ago

Probably because the creator had a blog post that got shared around at a point in time where this exact problem was resonating with users.

It’s not always about being first but about marketing.

JohnEdwa@sopuli.xyz · edit-2 22 days ago

It’s not always about being first but about marketing.

And one has a cute catgirl mascot, the other a website that looks like a blockchain techbro startup.
I’m even willing to bet the number of people that set up Anubis just to get the cute splash screen isn’t insignificant.

JackbyDev@programming.dev · 22 days ago

Compare and contrast.

High-performance traffic management and next-gen security with multi-cloud management and observability. Built for the enterprise — open source at heart.

Sounds like some over priced, vacuous, do-everything solution. Looks and sounds like every other tech website. Looks like it is meant to appeal to the people who still say “cyber”. Looks and sounds like fauxpen source.

Weigh the soul of incoming HTTP requests to protect your website!

Cute. Adorable. Baby girl. Protect my website. Looks fun. Has one clear goal.

LePoisson@lemmy.world · 21 days ago

Probably a similar reason as to why we don’t hear about the other potential hundreds of competing products or solutions to the same problem (in general).

Luck.

It’s just not fair in our world.

Kazumara@discuss.tchncs.de · 23 days ago

Just recently there was a guy on the NANOG List ranting about Anubis being the wrong approach and people should just cache properly then their servers would handle thousands of users and the bots wouldn’t matter. Anyone who puts git online has no-one to blame but themselves, e-commerce should just be made cacheable etc. Seemed a bit idealistic, a bit detached from the current reality.

Ah found it, here

deadcade@lemmy.deadca.de · 22 days ago

Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very little resources. With AI companies actively profiting off stealing everything, a robots.txt doesn’t mean anything. Now, even a relatively small git web host takes an insane amount of resources. I’d know - I host a Forgejo instance. Caching doesn’t matter, because diffs between two random commits are likely unique. Ratelimiting doesn’t matter, they will use different IP (ranges) and user agents. It would also heavily impact actual users “because the site is busy”.

A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.
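
To put a rough number on the caching point, here is a back-of-the-envelope count of how many distinct compare pages a repository can expose. The commit count is an arbitrary example and the function name is made up; it only counts unordered pairs of commits.

```python
import math


def distinct_diff_pages(commit_count: int) -> int:
    """Number of unordered commit pairs, each a potentially unique diff to render."""
    return math.comb(commit_count, 2)


if __name__ == "__main__":
    # 10,000 commits is an illustrative repository size.
    print(distinct_diff_pages(10_000))
```

A crawler that follows every compare link will almost never request the same page twice, so a cache has nothing to reuse.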

interdimensionalmeme@lemmy.ml · 22 days ago

This would not be a problem if one bot scraped once, and the result was then mirrored to all on Big Tech’s dime (cloudflare, tailscale) but since they are all competing now, they think their edge is going to be their own more better scraper setup and they won’t share.

Maybe there should just be a web to torrent bridge so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.

deadcade@lemmy.deadca.de · 22 days ago

No, it’d still be a problem; every diff between commits is expensive to render to web, even if “only one company” is scraping it, “only one time”. Many of these applications are designed for humans, not scrapers.

interdimensionalmeme@lemmy.ml · 22 days ago

If rendering the data for scrapers was really the problem, then the solution is simple: just have downloadable dumps of the publicly available information. That would be extremely efficient and cost fractions of pennies in monthly bandwidth. Plus the data would be far more usable for whatever they are using it for.

The problem is trying to have freely available data, but for the host to maintain the ability to leverage this data later.

I don’t think we can have both of these.

RedSnt 👓♂️🖥️@feddit.dk · 23 days ago

Brodie interviewed the creator of Anubis a little while back, it’s pretty good.

interdimensionalmeme@lemmy.ml · 22 days ago

Open source is also the AI scraper bots AND the internet itself, it is every character in the story.

thedeadwalking4242@lemmy.world · 22 days ago

I know people love anime, myself included, but this popping up on my work PC can be frustrating.

ILikeBoobies@lemmy.ca · 22 days ago

Contact the administrator to ask them to change the landing page

not_amm@lemmy.ml · 23 days ago

I had seen that prompt, but never searched about it. I found it a little annoying, mostly because I didn’t know what it was for, but now I won’t mind. I hope more solutions are developed :D

inbeesee@lemmy.world · 22 days ago

Fantastic article! Makes me less afraid to host a website with this potential solution
