X I recognized immediately. Do the known crawlers also cause some traffic, or are they blocked right away? It would be interesting to see such a diagram but with traffic (in GBytes or whatever the correct order of magnitude would be) on the y-axis …
And "other traffic", is this downloads of raw files and things like that?
I have seen this sort of thing happen by human hands to my open source projects as well. I've had people from big companies show up and demand my attention out of pure entitlement. And I've had garbage pile up in my queue because of some bogus bounty that rewarded generic GitHub interactions.
Thankfully I have learned in calmer times that "nope" is a perfectly valid way of handling these sorts of cases.
My weirdest experience in that category was someone (re)submitting someone else's PR during Hacktoberfest and trying to claim it as a valid contribution because "it is open source".
I'm going to make this quick, as I paused my horse girl gacha game to come here, but last year sometime I posted a thread asking what people were doing about posting data online in the age of AI scrapers:
Since then the problem has only become worse. Hosting things without CloudFlare is practically impossible, and that can get expensive.
I'm uninterested in becoming mulch for Big Tech myself, and about four months ago I found the Anubis project, which apparently started off life as a joke but has since proven to be very useful. I deployed it across my sites and set up a new Piwigo instance to test with and without it running. The graph speaks for itself:
These are from Piwigo's internal stats, and despite my robots.txt disallowing the AI companies' bots, they have zero respect for it. They would just keep sending requests until the server ran out of resources. The second drop-off is after I installed and set up Anubis. I went from tens of thousands of hits a day to 5-10.
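For reference, this is roughly what the disallow rules look like (user agent strings taken from the crawler docs these companies publish; adjust to taste, and note they ignore it anyway):

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /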
How does it work? It simply gets your browser to do a quick proof-of-work calculation, proving you're a real human. The bots either cannot do this, or it would cost them so much compute time on EC2/Azure/Google Cloud that it's not worth their time. I find the latter part funny in that it flips the problem around: instead of me having to spend money to stop them, it potentially costs them money.
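To illustrate the idea, here is a sketch of a leading-zeros SHA-256 proof of work (not Anubis's actual code; in practice the challenge is solved by JavaScript running in the visitor's browser):

import { createHash } from "node:crypto";

// Find a nonce whose SHA-256(challenge + nonce) starts with `difficulty`
// zero hex digits. Brute-forcing this costs the client real CPU time,
// but the server only has to verify a single hash, which is cheap.
function solve(challenge: string, difficulty: number): number {
  const target = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const digest = createHash("sha256").update(`${challenge}${nonce}`).digest("hex");
    if (digest.startsWith(target)) return nonce;
  }
}

console.log(solve("example-challenge", 4));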
It's also a pretty solid WAF: it stops people door-knocking looking for exploits, and they've added a feature where it can monitor your load and dial up the challenge difficulty to throttle connections. Neat.
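Purely to illustrate the throttling idea (this is not the project's real logic), the difficulty could scale with load something like this:

import { loadavg, cpus } from "node:os";

// Raise the proof-of-work difficulty as the 1-minute load average climbs,
// so connection floods pay a progressively higher compute cost.
function difficultyForLoad(base = 4, maxExtra = 4): number {
  const [oneMinute] = loadavg();
  const pressure = Math.min(oneMinute / cpus().length, 1);
  return base + Math.round(pressure * maxExtra);
}

console.log(difficultyForLoad());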
The downside is that it's a bit aggressive. At first it blocked search engine crawlers and the Internet Archive bot as well. I think most of that has been fixed, but if visibility is your first priority it might not be the best solution. However, since I find Google and search driving less and less traffic these days (they mostly expect people to take the AI answer at the top), it's something I'm OK with personally. It identified the Piwigo API on its own (since it speaks JSON), and darktable is able to connect through Anubis just fine without any custom rules.
There are RPM and DEB packages available if you don't want to compile it. I've deployed it with Podman, natively, and I'm testing it on Kubernetes. If you use it and like it, I'd recommend joining the dev's Patreon. There's a commercial version available as well if you're a business entity. $5 is a lot cheaper than CloudFlare, and I'm still 85% sure CloudFlare is a honeypot. "Hey, let's man-in-the-middle the whole internet, Dave!" "Great idea, except no one's going to fall for that." "…"
I'm getting distracted. You can see it in action on my sites here:
Anyway, hopefully this helps some open source projects. FFmpeg has started using it, as has GNOME's GitLab. I've been running it since March or April on my main sites and it's been great: stable and low resource use. I use Anubis to terminate the SSL connection and then pass traffic back to nginx.
Wow, we have CloudFlare Enterprise at work but it's not as effective as I would like. Our Confluence still gets nailed semi-regularly.
Working at a university, I get to see other side effects of LLMs, and let me tell you, I'm not letting any doctor who graduated after 2023 work on me. These things have all but imploded our education system.
Tell me you work at Microsoft without telling me you work at Microsoft. Maybe Amazon.
I have similar issues to the post's author in that I largely find them disruptive, and generally I don't like interacting with a computer in that manner. But for educators it's just going to be a constant game of cat and mouse. Plus, professors have started using them for writing letters and other admin tasks. I've got one who composes email replies with Copilot or GPT, and it's a bit grating to try to interact with them. I feel like those need a "press 0 to speak to an operator" function. But students have a good argument for using them in that case.
Plus, in our system college is little more than a ticket to a middle-class life (or at least it used to be), not necessarily about pushing yourself or learning, so most students are very motivated to take shortcuts. The language professors I talk to have moved 90% of graded work to in-person. Take-home work counts for less than it used to.
The worst thing is the way they auto-complete code comments. By definition, a comment should tell you something that is not obvious from the code. But the LLM knows only the code, and is therefore incapable of writing meaningful comments.
I've learned to look somewhere else whenever I start writing a comment, so as not to have my train of thought derailed by some LLM slop.
The other day I was writing a blog post, and the LLM immediately tried to extrapolate the whole thing from the title. Enough is enough; I turned it off. Useful as it can be for code, these sorts of interruptions to my thinking are a net negative for my productivity.
I agree, though this is only true if you subscribe to the idea that good code documents itself. All of those automatic comments seem to stem from people putting useless comments in their code, which is mostly what the LLM keeps suggesting. The problem is that for it to suggest that, a lot of its training material must have contained it, which is worrying.
Something like this (extremely simplified):
// parses configuration
var config = ParseConfig(configString);
For me, comments should be mostly about design decisions, business logic and so on, not a description of what the code is doing; for that you just read the goddamned code.
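A made-up counterexample of the kind of comment that earns its place (the retry count and the upstream's rate-limiting behaviour are hypothetical):

// The upstream API sheds load with 429s for a second or two under bursts,
// so three attempts with a short backoff is enough; more just adds latency.
async function fetchWithRetry(url: string): Promise<Response> {
  for (let attempt = 0; attempt < 3; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;
    await new Promise((resolve) => setTimeout(resolve, 500 * (attempt + 1)));
  }
  return fetch(url);
}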
I think you are right, but at the same time I think that most coding LLMs are biased towards adding comments because the code may be read by someone who is not familiar with either the language or the libraries (otherwise they would not be using LLMs).
My company provided us with a Copilot subscription, and I've been using Claude Sonnet 4 in agent mode; it is really good, even if you know the language. I don't use it to write new features or business logic, but for refactoring and translating code between languages it's really powerful.
I've also had to finish up some frontend refactoring that our design/frontend team couldn't finish, and it was all Copilot. Since the pages follow the same structure, I just told it: "Look at these files and X commit, check the page structure and implement this new paradigm in X, Y and Z pages." It did it perfectly on the first try and saved me a few hours of laborious work.
AI datacenters are being powered by fossil fuels. Even if LLMs are "good" and you're able to ignore the myriad of ethical concerns, they're contributing to global warming, which we absolutely don't need.
Data centers in general are also the main pushers for on-site safe nuclear reactors and clean energy, since it looks good on their "carbon neutral" reports and does away with transmission losses, leading to a quicker future of energy independence and actual mass clean energy.
IMO the ethical sourcing of training data is a much worse problem than their energy expenditure. If we attack them on the energy front, there is no clear place to draw the line: slop video streaming takes up 60% of the world's internet bandwidth (and all those switches and routers require immense amounts of energy too), plus the devices to decode and display the video.
We are fucked because of decades of oil lobbying, which continues to this day, and mindless consumerism mostly masterminded by American companies. AI energy expenditure is a pure diversion by those who will continue to pollute and screw up the planet.
And we still haven't solved the reuse of nuclear fuel or final storage. Did you know that something like half the budget of the German environment ministry is spent just on nuclear waste storage?
And then you have people like the Google founder who said "99% of the global energy production will need to be used to power AI at some point" (paraphrased).
So yes, your fancy Markov chains are doing great.
P.S.: Also check how much of the "we are green" claims come from buying carbon offsets, and how effective those really are.