If you ever see a cute anime girl asking you for a short break when coming to our forum

That I recognized immediately :rofl:. Do the known crawlers also cause some traffic, or are they blocked right away? It would be interesting to see such a diagram but with traffic (in GBytes, or whatever the correct order of magnitude would be) on the y-axis …

And “other traffic”, is this downloads of raw files and things like that?

I have seen this sort of thing happen by human hands to my open source projects as well. I’ve had people from big companies show up and demand my attention out of pure entitlement. And I’ve had garbage pile up in my queue because of some bogus bounty that rewarded generic GitHub interactions.

Thankfully I have learned in calmer times that “nope” is a perfectly valid way of handling these sorts of cases.

1 Like

Yes, it’s a page view; the traffic is the size of the page. It’s also the cost from our CDN or S3 egress.

We aren’t blocking anything. And blocking doesn’t really work.

I assume it’s stuff without a user agent string, but I’m not sure.

1 Like

My weirdest experience in that category was someone (re)submitting someone else’s PR during Hacktoberfest and trying to claim it as a valid contribution because “it is open source” :wink:

1 Like

I couldn’t resist :smiley:

1 Like

For anyone using Cloudflare, there’s now an AI crawler maze:

Unlike Nepenthes or Iocaine, however, it doesn’t try to poison the datasets.

3 Likes

I’m going to make this quick as I paused my horse girl gacha game to come here, but sometime last year I posted a thread asking what people were doing with respect to posting data online in the age of AI scrapers:

Since then the problem has only become worse. Hosting things without Cloudflare is practically impossible, and that can get expensive.

I’m uninterested in becoming mulch for Big Tech myself, and about four months ago I found out about the Anubis project, which apparently started life as a joke but has since proven to be very useful. I deployed it across my sites and set up a new Piwigo instance to test with and without it running. The graph speaks for itself:

These are from Piwigo’s internal stats, and despite having a robots.txt file set to disallow AI company bots, they have zero respect for it. They would just keep sending requests until the server ran out of resources. The second drop-off is after I installed and set up Anubis. I went from tens of thousands of hits a day to 5-10.
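For reference, the kind of robots.txt directives involved look like this; the user-agent tokens below are the published ones for a few well-known AI crawlers, though as noted they are purely advisory:

```
# robots.txt: asks AI crawlers to stay away; compliance is voluntary
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Well-behaved crawlers honor this; the ones flooding the server above simply never read it, which is exactly why a proof-of-work gate ends up being necessary.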

How does it work? It simply gets your browser to do a quick proof-of-work calculation, proving you’re a real human. The bots either cannot do this, or it would cost them so much compute time on EC2/Azure/Google Cloud that it’s not worth their while. I find the latter part funny in that it turns the problem around from me having to spend money to stop them to potentially costing them money.
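This isn’t Anubis’s actual code, but the core hash-based proof-of-work idea can be sketched in a few lines; the SHA-256 scheme and the difficulty value here are illustrative assumptions:

```python
import hashlib
import secrets

def make_challenge() -> str:
    # Server side: issue a random per-visitor challenge string.
    return secrets.token_hex(16)

def solve(challenge: str, difficulty: int) -> int:
    # Client side (JavaScript in the browser, in Anubis's case):
    # brute-force a nonce until the hash starts with `difficulty` zero hex digits.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    # Server side: verification costs a single hash, so the asymmetry
    # is cheap for the site and expensive for a fleet of scrapers.
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

challenge = make_challenge()
nonce = solve(challenge, difficulty=4)  # a moment's work for one real visitor
assert verify(challenge, nonce, difficulty=4)
```

One visitor solving one challenge is negligible; a scraper fleet replaying millions of requests has to pay that cost millions of times over.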

It’s also a pretty solid WAF: it stops people door-knocking looking for exploits, and they’ve added a feature where it can monitor your load and dial up the challenge difficulty to throttle connections. Neat.
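That load-based throttling could be as simple as scaling the difficulty with the server’s load average; again, just a sketch of the idea, not Anubis’s implementation:

```python
def difficulty_for_load(load_avg: float, base: int = 4, extra_max: int = 3) -> int:
    # Each extra leading zero hex digit multiplies the expected client
    # work by 16, so even a small bump in difficulty throttles hard.
    if load_avg < 1.0:
        return base
    return base + min(extra_max, int(load_avg))
```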

The downside is it’s a bit aggressive. At first it blocked search engine crawlers and the Internet Archive bot as well. I think most of that has been fixed, but if visibility is your first priority it might not be the best solution. However, since I find Google and search driving less and less traffic these days anyway (they mostly expect people to take their AI answer at the top), it’s something I’m OK with personally. It identified the Piwigo API on its own, since it was speaking JSON, and darktable is able to connect through Anubis just fine without any custom rules.

There are RPM and DEB packages available if you don’t want to compile. I’ve deployed it with Podman, natively, and I’m testing it on Kubernetes. If you use it and like it, I’d recommend joining the dev’s Patreon. There’s a commercial version available as well if you’re a business entity. $5 is a lot cheaper than Cloudflare, and I’m still 85% sure Cloudflare is a honeypot. “Hey, let’s man-in-the-middle the whole internet, Dave!” “Great idea, except no one’s going to fall for that.” “…”

I’m getting distracted. You can see it in action on my sites here:

https://www.leanderhutton.com/

Anyway, hopefully this helps some open source projects. FFmpeg has started using it, as has GNOME’s GitLab. I’ve been running it since March or April on my main sites and it’s been great: stable and low resource use. I use Anubis to terminate the SSL connection and then pass traffic back to nginx.
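In that arrangement the flow is client → Anubis (TLS termination plus challenge) → nginx on loopback. The nginx side only needs to listen on a local port; the port, hostname, and paths below are made-up examples, not my actual config:

```nginx
# nginx listens only on loopback; Anubis sits in front and terminates TLS.
server {
    listen 127.0.0.1:8080;
    server_name example.org;

    root /var/www/html;

    location / {
        try_files $uri $uri/ =404;
    }
}
```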

2 Likes

Yeah, for anyone who likes HAProxy there is a WIP patch for Anubis. In the meantime you can also use berghain.

Can you guess when we turned on berghain here?

https://progress.opensuse.org/news/125

2 Likes

Wow, we have Cloudflare Enterprise at work, but it’s not as effective as I would like. Our Confluence still gets nailed semi-regularly.

Working at a university I get to see other side effects of LLMs, and let me tell you, I’m not letting any doctor who graduated after 2023 work on me. These things have all but imploded our education system.

1 Like

The other day I saw a fun story … a teacher gave the class this homework:

“Write an essay with ChatGPT, and then check whether everything is correct.”

I think that is a very good approach to teaching people about its limitations.

And this was an interesting software developer perspective:

I would argue that similar things apply when writing a paper: you move from an active writer role to a constant reviewer role.

Tell me you work at Microsoft without telling me you work at Microsoft. Maybe Amazon.

I have similar issues to the post’s author in that I largely find them disruptive, and generally I don’t like interacting with a computer in that manner. But for educators it’s just going to be a constant game of cat and mouse. Plus, professors have started using them for writing letters and other admin tasks. I’ve got one who composes email replies with Copilot or GPT, and it’s a bit grating to try to interact with them. I feel like those need a “press 0 to speak to an operator” function. But students have a good argument for using them in that case.

Plus, in our system college is little more than a ticket to a middle-class life, or at least it used to be, not necessarily about pushing yourself or learning. So most students are very motivated to take shortcuts. The language professors I talk to have moved 90% of graded work to in-person. Take-home work counts less than it used to.

1 Like

The worst thing is the way they auto-complete code comments. By definition, a comment should tell you something that is not obvious from the code. But the LLM knows only the code, and is therefore incapable of writing meaningful comments.

I’ve learned to look somewhere else whenever I start writing a comment, so as to not have my train of thought derailed by some LLM slop.

The other day, I was writing a blog post, and immediately the LLM tried to extrapolate it entirely from the title. Enough was enough; I turned it off. Useful as it can be for code, these sorts of interruptions to my thinking are a net negative to my productivity.

4 Likes

I agree, though this only holds if you subscribe to the idea that good code documents itself. All of those automatic comments seem to stem from people putting useless comments in their code, and this is mostly what the LLM keeps suggesting. The problem is that for it to suggest that, a lot of its training material must have contained it, which is worrying.

Something like this (extremely simplified):

// parses configuration
var config = ParseConfig(configString);

For me, comments should mostly be about design decisions, business logic and so on, not a description of what the code is doing; for that you just read the goddamned code :smiley:
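To make the contrast concrete, here’s a hypothetical side-by-side; the caching rationale in the second comment is invented purely for illustration:

```python
# BAD: restates what the code already says.
# parses configuration
config = parse_config(config_string)

# GOOD: records a design decision the code alone cannot convey.
# Re-parse on every request: ops edits this file live, and caching it
# once caused stale settings during an incident.
config = parse_config(config_string)
```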

3 Likes

I think you are right, but at the same time I think that most coding LLMs are biased towards adding comments, because the code may be read by someone who is not familiar with either the language or the libraries (otherwise they would not be using LLMs :wink:)

1 Like

My company provided us with a Copilot subscription and I’ve been using Claude Sonnet 4 in agent mode, and it is really good, even if you know the language. I don’t use it to write new features or business logic, but for refactoring and translating code between languages it’s really powerful.

I’ve also had to finish up some frontend refactoring that our design/frontend team couldn’t finish, and it was all Copilot. Since the pages follow the same structure, I just told it: “Look at these files and X commit, check the page structure and implement this new paradigm in X, Y and Z pages.” It did it perfectly on the first try and saved me a few hours of laborious work.

2 Likes

and they aren’t the only ones.

1 Like

AI datacenters are being powered by fossil fuels. Even if LLMs are “good” and you’re able to ignore the myriad of ethical concerns, they’re contributing to global warming, which we absolutely don’t need.

1 Like

Data centers in general are also the main pushers for on-site safe nuclear reactors and clean energy, since it looks good on their “carbon neutral” reports and does away with transmission losses, leading to a quicker future of energy independence and actual mass clean energy.

IMO the ethical sourcing of training data is a much worse problem than their energy expenditure. If we attack them on the energy front, there is no clear place to draw the line: slop video streaming takes up 60% of the world’s internet bandwidth, all those switches and routers require immense amounts of energy too, and so do the devices that decode and display the video.

We are fucked because of decades of oil lobbying, which continues to this day, and mindless consumerism mostly masterminded by American companies. AI energy expenditure is a pure diversion by those who will continue to pollute and screw up the planet.

1 Like

We won’t have a livable planet if AI is powered by fossil fuels.

1 Like

Those site-local nuclear reactors are a few years away. They’ve even brought decommissioned reactors like Three Mile Island back online.

But most of the consumption increase is actually covered by fossil fuel plants, which have to be turned back on to stabilize the grid. See Wendover: How AI is Ruining the Electric Grid.

And we still haven’t solved the reuse of nuclear fuel or final storage. Did you know that something like half the budget of the German environment ministry is being used just for nuclear waste storage?

And then you have people like the Google founder who said “99% of the global energy production will need to be used to power AI at some point” (paraphrased).

So yes, your fancy Markov chains are doing great.

P.S.: Also check how much of the “we are green” messaging comes from buying emissions offsets, and how effective those really are.

3 Likes