Posting things online in the age of the AI race

I don’t normally do any analytics or tracking on my sites beyond simple checks for bad actors in the logs. IMO it’s creepy, and frankly it’s none of my business who looks. I’m not looking to get sponsorship, so I don’t need to prove numbers. However, in the last few months, with companies like OpenAI scraping the internet to train their LLMs and image generators, I started running goaccess on my combined nginx logs.

It’s about 80-90% crawlers now, depending on the vhost. In the past when I’ve done some log spelunking, automated bots have always been a fairly large portion, but that’s really getting up there. I suspect a fair amount of the remaining 10-20% are also scrapers trying to fly under the radar with legit agent strings. I’m not terribly surprised considering what we see out of Cloudflare at work.
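For anyone who wants a rough number without setting up goaccess, here is a minimal Python sketch of the same idea: read nginx "combined" log lines and count how many requests come from self-identified crawlers. The `BOT_HINTS` list and the function names are my own assumptions for illustration, not anything goaccess uses, and the result is a lower bound because scrapers with spoofed browser agent strings land in the "other" bucket.

```python
import re
from collections import Counter

# Tail of an nginx "combined" log line:
#   "request" status bytes "referer" "user-agent"
COMBINED_TAIL = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')

# Assumption: a short, incomplete list of substrings that
# self-identified crawlers and scripts tend to put in their agent string.
BOT_HINTS = ("bot", "crawl", "spider", "slurp", "scrapy", "python-requests", "curl")


def is_crawler(agent: str) -> bool:
    """Guess whether a user-agent string belongs to an automated client."""
    agent = agent.lower()
    # Assumption: an empty or "-" agent is treated as automated traffic.
    return agent in ("", "-") or any(hint in agent for hint in BOT_HINTS)


def crawler_share(lines) -> float:
    """Return the fraction of parseable requests made by likely crawlers."""
    counts = Counter()
    for line in lines:
        match = COMBINED_TAIL.search(line)
        if match is None:
            continue  # skip lines that are not in combined format
        bucket = "crawler" if is_crawler(match.group("agent")) else "other"
        counts[bucket] += 1
    total = sum(counts.values())
    return counts["crawler"] / total if total else 0.0
```

To try it, pass an open log file, e.g. `crawler_share(open("access.log"))`, where the path is a placeholder for wherever your vhost logs live.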

In the past this simply meant a search engine was cataloging your site to deliver it to their users, or maybe the Internet Archive was preserving your site for posterity. Both are things I’m more than OK with. But with Google sending less and less traffic to sites not owned by big tech or media companies and just delivering often hilariously incorrect answers with Gemini, along with everyone and their dog trying to suck up data for their ML model, I’m starting to wonder if it’s really worth it anymore.

I used to be a “self-host with your own URL/domain or it didn’t happen” kind of a guy and I still kind of have that line of thought. I don’t really post much to the big social media sites these days for a lot of the same reasons: they’re just building training data. But in general it looks like actual humans rarely leave Facebook, Instagram, Discord, TikTok, YouTube or Twitter on the internet these days. So if the goal is to have real biological eyes on it, maybe that’s just where one has to go? It seems having a dot com or dot org is just handing OpenAI, Google and Microsoft your hard work to have their ML model regurgitate it for others. OpenAI recently appointed a former head of the NSA to its board too. They clearly seem to be heading down a surveillance or weaponization path.

I guess my point here is this: what are your thoughts on sharing and posting your work online? Obviously given the nature of this board I’m thinking photos and videos, but code and writing are also OK to talk about. Are you concerned about feeding the beast as it were? What about attracting clients, models or customers? Where’s the balance these days? What kind of deal with the devil are you willing to sign? I’m not sure yet in my case and have been thinking it over, considering everything from completely pulling down my site, to only posting in private Discord or similar communities, to trying some WAFs to cut down on bots.

4 Likes

I know some of these words :raised_hand:t5:

Yeah, I’ve been thinking about this a lot. More so, for me, thinking about it in the creative spaces. Music and video production seem ripe for the pickings (or stronger words than pickings) for AI regurgitation.

There was a discussion on here a day or so ago about Discord and Radu’s server. Honestly, I would love to join in, but I’ve stayed away from Discord and all social media from the beginning, barring the early days of newsgroups and photography forums. I don’t really participate in street photography, but chatting privately with those people seems like it would be fun. But I don’t, because I don’t trust the medium of Discord.

Back to AI in general: personally, I have an almost violent reaction to AI video. There is a recent film on Netflix, a cheesy Hollywood film, called Ford vs Ferrari or something like that. Halfway through the film there is a character waving at the camera, someone watching a motorsport race, but it’s an AI-generated model. It has completely ruined the movie. It wasn’t a great movie, but I can’t think about it without thinking about the alien aspect of that generated character. It sticks in my craw. I cannot describe my dislike for this kind of animation.

So, I wonder what kind of defense apart from privacy in communication and secure media distribution can be used to defend against it? I guess I am saying I don’t have any answers but would like to hear others’ opinions.

I know there are people who are going to make a ‘this is the future, suck it up’ kind of comment, but obviously, I hope it’s obvious, I don’t believe that’s the way forward. Same as how I don’t believe Facebook, or whatever iteration of the walled-garden, you-are-the-product model, is an answer to the problem of in-person communication. Saying this as a shy person who doesn’t like crowds at all.

So, for me, AI in general is a solution in search of a problem. It’s going to decimate the workforce and our primate brains are using it to create another massive power imbalance with the few having dominion over the many. Just like all of these digital megacorps have done for the past twenty years or more. This is all I can see happening.

2 Likes

I regard anything I put anywhere on-line as being in the Public Domain … available to any entity that wants to scrape, steal, edit, na-ni-na.

Ah this is a cool tool and I think I’ll start using it as well. Thanks for sharing.

I feel much the same way as you, so I don’t know that there is too much to discuss there, but I also feel like I should post my work to my own website (and maybe the fediverse too).

So if you still want to post to your own website, you can try to keep AI bots from crawling your site with a robots.txt file, which asks crawlers to stay out (it only works for bots that choose to honor it). If you’re feeling more… uh… offensive, you can check out Glaze and Nightshade, which claim to render your images useless for training image-generation models without changing the look of your images all that much. That’s pretty cool.
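As a rough illustration of the robots.txt idea, here is a small Python sketch that builds a robots.txt refusing a few AI crawlers and then checks it with the standard library’s `urllib.robotparser`. The agent tokens listed are ones I believe are published by their operators (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google’s AI training opt-out), but treat the list as an assumption and check each vendor’s current documentation. The helper names are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a partial list of AI-related crawler tokens; verify before use.
AI_CRAWLER_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(blocked_agents) -> str:
    """Build robots.txt text that disallows the given agents site-wide."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # Everyone else (search engines, archives, browsers) stays allowed.
    blocks.append("User-agent: *\nAllow: /")
    return "\n\n".join(blocks) + "\n"


def would_allow(robots_txt: str, agent: str, url: str = "/") -> bool:
    """Check how a well-behaved crawler would interpret the file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    text = build_robots_txt(AI_CRAWLER_AGENTS)
    print(text)
    print("GPTBot allowed:", would_allow(text, "GPTBot"))
```

Keep in mind this only expresses a request. A scraper that ignores robots.txt or sends a browser-like agent string is unaffected, so anything stricter has to happen at the web server or WAF.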

4 Likes

That’s an interesting observation. I notice myself appending “forum” or “reddit” to my searches quite regularly, to fight the SEO blogspam and get some human perspective.

A large part of YouTube’s appeal is, similarly, the genuine humanity of many presenters. So much of current entertainment media is so… overproduced, slick, corporate, anonymous. It’s weirdly disgusting.

As for your actual question, I don’t particularly mind feeding the beast. It’s unavoidable anyway. I go on vacation to feed the tourism industry. I buy a new computer from a multinational conglomerate. I work for a company doing their part in some global industry. And I write a blog that’s no doubt stolen and regurgitated elsewhere. For me, it’s more about creative writing anyway, than about “building an audience” or any semi-commercial activity like that. Not everything needs to have an economic goal.

My personal blog, on the other hand, is password-protected, and hopefully inaccessible to search crawlers. It’s for sharing personal stories and pictures with family and friends, and has no relevance to the internet at large. The alternative would have been a Facebook group or group chat, but I much prefer to have my own platform that is not owned by anyone else.

3 Likes

Good or bad they are ultimately correct. I’ve had to take up using AI at work to keep my output competitive with our younger engineers who’ve adopted it. The managerial and capital class love it. Doesn’t matter if their output is buggier or causes problems down the road. Jira story point go up. Brrrr. This is at a public university so I can’t imagine how the competitive and bean counter world of private industry is drooling over these things. Two people with ChatGPT can now do the work of ten, again I’d argue that quality is probably not there as I’ve seen some awful mistakes made by devs/engineers overly reliant on it but most folks’ bar for “good enough” is far lower than we think. See: smart phones killing the camera industry.

The younger end of Gen Z is also adopting it in everything. I’m not sure why our CS, English or Foreign Language departments even bother giving homework anymore. They’re all running these things through an LLM. I minored in Japanese and keep up with my former professors. They’re seeing it everywhere, and one professor asked me how to combat it in take-home writing assignments. It’s kind of blatantly obvious when a student is turning in take-home essays at a much higher level than they’re able to write or speak in class, but the professors cannot prove it beyond a doubt.

It’s going to be like smart phones and object storage: those entering the work force have only ever known these tools. Getting the under 25 year olds to use a desktop OS or an old school directory hierarchy is next to impossible, so us old folks are just going to have to adapt.

But that doesn’t mean I have to submit my life to be made into the AI hot dog we’re about to get jammed in our throat. I really want to keep my photography side job going and keep sharing it with people who enjoy it as well.

3 Likes

I’ve seen Glaze and Nightshade. To me the noise is sort of obvious and I wonder about the longevity of it. If my understanding and experience of how these models are trained is correct, eventually they will be ineffective as the models are exposed to it more often. Especially if the trainers get a set of untainted and tainted work submitted to train on. I’ve trained some ML bots to recognize bees entering and leaving a hive before, and it’s quite remarkable how adaptable neural networks can be. Glaze and Nightshade might be an OK solution for now but I don’t expect them to keep up. Just got to remember to not use it on files you’re delivering to clients!

As for robots.txt, considering OpenAI almost assuredly used an actress’s voice after she told them no, I don’t expect them to respect a text file at all. Not to mention anyone else simply changing a user agent string. They seem to have the “this is happening whether you like it or not” attitude.

3 Likes

I kid :smiley: When Bob wrote this, societal and technological advancements were nowhere near happening at the speed we see them today.

I know it’s not for everyone and he did commit some crimes but Ted Kaczynski made a point in his manifesto that people who started becoming reliant on AI would soon forget how they did things without it, and I’m afraid we’re starting to go down that road.

It’s not that forgetting old ways is bad, but we are replacing them with things which are abstracted at a level never seen before. You still use your brain when operating a computer, much less so a smartphone I’d say (see the chimp using Instagram video), and LLMs and other AIs are just the bottom level.

It’s all good when we use them to replace menial tasks, but what happens when it starts doing research and things that we assume only humans can do? It’s not difficult to imagine a future where everyone has no goals for creation or knowledge, they just consume endlessly what their personal model feeds them, and afterwards, what benefit do the elite that control these systems have for keeping people around?

Back in the mid-90s, my programming manager had a saying: “Good, fast, cheap - pick any two.” When I first heard him say it, I resented it, but on reflection, I realized it was true.

I have noticed over my decades in technology that the majority of people will prefer quantity over quality. Quantity of code is what AI brings to the table (more applications, faster), and that is what management will take.

4 Likes

I’m sure a colleague of mine once put part of our teams chat into ChatGPT so he didn’t have to think of a reply to send me. His replies suddenly had the distinct “default” ChatGPT way of speaking after months of his “regular” speech :smiley:

Just going by my stats and comparing with others, I don’t think the general web is trodden by humans much anymore. I forgot to add Reddit to my list earlier, but much like Twitter it seems to have a fair share of bots. A number of non-Reddit or non-Discord forums have closed up shop over the last couple of years too, so I’m not sure how popular those are anymore. From interacting with the current 18-22 year old generation fairly often, it seems Discord, TikTok and surprisingly Snapchat are their go-to. Twitch and YouTube are their consumption sites for everything. Instagram and Reddit are far greyer than they let on, and Facebook is really very grey outside of Marketplace.

As someone who has done photography for their supper in the past and would kind of like to start back up, it’s a bit of a problem if actual humans don’t see my work. ML models don’t cut checks. Over the years I’ve also generally gone with a CC BY-NC-SA license for my work, as I never really cared if some individuals or non-profit orgs used it, but if Pepsi or Ford wanted it for their ads, well, pay me. Unfortunately, after talking to IP lawyers, it seems they regard most open source or copyleft licenses as de facto public domain unless you’ve got deep pockets to pay for the lawsuit. Automotive companies specifically are pretty notorious GPL violators, but unless your project is big enough to get FSF or SFC umbrella’ed no one is going to enforce it. I’ve had to explain to project leads before that we can’t just “borrow” some code for an internal closed source tool because it’s GPLv3, you’ve got to publish changes back. It happened anyway. For most corporate or state entities out there open source == free labor.

It’s not just about economics IMO either; there’s a social aspect too. Much like the guilds or unions of old. Yeah, work sucks, but it’s a little better when someone has your back or you feel like you’re contributing to a larger endeavor. Right now we’re just lining pockets.

All that to say, if you’re not one of the top 25 to 30 apps or sites you’re mostly serving content to bots these days and that’s my problem. :slight_smile: Dead Internet Theory anyone?

@hatsnp

I’ve read Kaczynski’s manifesto and TedPosting has long been a meme. I see him as a “broken clock is right twice a day” kind of person, like most other terrorists. Likewise, Bin Laden’s writings on America did have some salient critiques of our foreign policy. But these people seem to prey on vulnerable people who feel left behind or done wrong, by slipping a radical and violent ideology in behind valid observations or grievances. It gives them a boogieman to be angry with and someone to pin their problems on. Ted was right about complex technology and over-reliance on a state or a corporate entity, but he strays off into some really awful stuff like racist eugenics as well. At some point it gets hard to accept the message because the messenger is so far unhinged. I wouldn’t really rely on him as a source just because he’s so far gone. Plus, some online seem to have taken the fact that he had some points on technology to mean that some of his ideas WRT eugenics and race are also correct. People are quick to fall into a cult-like pattern with a figure like him.

The latter point has already happened with GPS and Google Maps. How many kids under 25 do you know who can use a paper map or atlas these days? Heck, most of them can’t browse a directory structure in Windows or macOS because they’re used to how smart phones and Google Drive present their data in one big pile. The analog world representation of a file cabinet with folders has no meaning to them because they’ve hardly seen it. Metadata and search are all they’ve known.

I’m not sure what the fix is or where to draw that line. I have no idea how to saddle or bridle a horse but that’s hardly relevant for getting around in contemporary life anymore so is it important for me to know? I agree that at some point using AI for so many general life tasks will become a problem as people lose the ability to do those tasks but I also want to avoid being that “old man yells at cloud” meme. Where is that line? I’m not sure, it’s definitely somewhere though.

Corporate or state control of these resources is a huge problem, but despite the decentralized nature of the early internet and web folks seem to have decided they prefer a few big centralized resources. I’m not sure there’s going to be a lot of changing that behavior.

3 Likes

I’m not sure there’s a line, or if it’s there, it’s moving too fast for us to perceive it. The problem is that if you do need to start saddling and taking care of a horse, you could probably learn it in a year or less, and we can’t say the same for all these new technologies coming out. Maybe learning is the wrong word; I’d say building/owning/maintaining it is more correct.

The last time I looked at Glaze and Nightshade they didn’t have a Linux version, nor were they open source. This still seems to be the case:

I’ve got access to an ARM Mac but that might be a no-go for a lot of people.

2 Likes

I am aware that everything I put online will be scraped and fed into a training database for LLMs.

But I have decided to ignore this.

One of the most insidious effects of the current AI hype is stifling creative output from people. I enjoy writing blog posts and code, and taking photos, and ML is not going to prevent me from doing it. My 2 cents. :wink:

4 Likes

For me personally I only upload low-res-high-compression wherever possible. Website portfolio, social media, everything that is more or less publicly accessible.

“Full Quality” is only seen by very select few people (or printer drivers). Everyone else gets low(er)-res-high-comp.

If they train on my data, they only train on preview-quality images.
Full Quality is the only differentiator I can withhold easily.
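For what it’s worth, the low-res-high-compression step is easy to script. Below is a minimal Python sketch, assuming the third-party Pillow library is installed; the 1600 px long edge and JPEG quality of 60 are arbitrary example values I picked, not recommendations from this thread, and both function names are hypothetical.

```python
def preview_size(width: int, height: int, max_edge: int = 1600):
    """Scale dimensions so the longest edge is at most max_edge pixels."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough, never upscale
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def save_preview(src_path, dst_path, max_edge: int = 1600, quality: int = 60):
    """Write a downscaled, heavily compressed JPEG copy for public upload."""
    # Imported here so preview_size() works without Pillow installed.
    from PIL import Image  # third-party: pip install Pillow

    with Image.open(src_path) as img:
        img = img.convert("RGB")  # JPEG has no alpha channel
        size = preview_size(img.width, img.height, max_edge)
        img = img.resize(size, Image.LANCZOS)
        img.save(dst_path, "JPEG", quality=quality, optimize=True)
```

Usage would be something like `save_preview("full/IMG_0001.tif", "web/IMG_0001.jpg")`, with both paths being placeholders. The full-quality file never leaves your disk.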

Apart from that: AI is the new crypto. It’s a hype that a lot of people are betting on, but… that bubble will burst.

1 Like

Yes, but by the time that happens, we will be knee-deep in the next one. :wink:

Unfortunately, I’m not so sure. The average Joe/Jane was not affected by crypto: only a small portion of the population bought or mined any.

All the AI stuff is now everywhere. Teenagers and adults are worried about careers and jobs, others are all too happy to use it every day to ‘help with’ school assignments, to create images for fun, and so on. I do hope the billions that have gone, and will go into, AI development will turn out not to give the expected result.

However, I fear there will be more than enough of a result. In that case, pretty much everyone who is not doing physical work (I mean the sort that is not worth automating), or work that requires the ‘human touch’ (care workers, doctors, teachers, hairdressers etc.), is at risk of losing their job. While ‘universal basic income’ (perpetual unemployment benefit) may solve one aspect of the problem (after all, corporations will need consumers), the effects on society (no purpose in life, no motivation to learn etc.) will be profound, if that happens.

Have you read Player Piano by Vonnegut (published a little over 70 years ago)?

2 Likes

Over 2+ decades in IT that was one of our biggest problems: We really didn’t know much in reality, we just had to guess with as much assumed accuracy and contextual relevancy as possible. “Training” was a PDF and a video, both of which were sales-heavy. The inmates were running the asylum.

1 Like

If one thing is for sure it’s that this is not a fad. Unlike blockchain, which failed to find any applications beyond financial speculation, LLMs and image generators are already being used in education and business. There’s also the realization that right now these models are as bad as they’ll ever be and they’ll only get better from here. Progress might become more incremental but they will continue to improve.

A year and some change ago the best video AI could make was that fever dream of Will Smith eating spaghetti. Now we’ve got Sora that while far from perfect is approaching believable outside of pretty intense scrutiny.

Anyway, robotics have so far proven to be more expensive and more difficult than ML models doing white collar work. Might want to pick up a trade or two. That’s always been my value proposition: I’m marginally smarter than a monkey and cheaper than a robot.

@Tamas_Papp I still create output but I’m reconsidering how and where I show it. Paper portfolio and prints, I actually send people physical letters sometimes, private online communities and so on. On one hand it is what it is and ignoring it is about all most of us can really do. On the other I don’t like where this is headed and I am spending my time, the world’s energy resources (servers need power) and financial resources (hosting isn’t free) just to mostly feed bots at this point. That rubs me the wrong way.

5 Likes

Well, if it is any consolation, the business model may be unsustainable. Scraping data and building an LLM on it without any compensation is leading to class-action lawsuits which are now trickling through the system. At the same time, serving LLM queries is expensive.

Also, interestingly, training LLM on LLM-generated data apparently leads to garbage in a few generations:

So the phenomenon may be self-limiting as ML-generated content floods the web.

I think it is a nice piece of technology that will turn out to be useful once well-understood, but it will not lead to either utopia or the apocalypse. All giant tech companies of course invested in it as they could not afford to miss out, but from a business perspective, it remains to be seen if that was a sensible move.

IMO we should just practice what doctors call “expectant management”, which is a fancy way of saying that you keep an eye on stuff but do nothing :wink:

2 Likes