AI code contributions

Brodie Robertson had an interesting YouTube upload today about the use of LLMs to contribute code to projects. Should it be allowed or not, and are there licence considerations? I was wondering about the policy with some of the projects here like darktable, RT, Siril etc.

Here’s the video mentioned

I’m watching now and so far it addresses many of the questions I would have about it.

All hail the slop machine :robot:

Based on personal experience, AI coding tools can produce output that looks like it was copied from open source software. I’m talking about code that is nearly identical to snippets from open source code. When I say it looks copied, I mean it in the same way that a teacher or professor would look at a student’s work and suspect plagiarism. There’s nothing in the surrounding context to explain why the AI tool chose to write the code in that manner or why it decided the code needs to do what it does. The only reasonable explanation is that it copied the code rather than coincidentally coming up with the same code.

This is not official RawTherapee policy, but I would say be VERY careful when using AI to contribute. No one should submit anything AI generated unless it is obvious that it could not have been copied from somewhere. The risk for violating licenses is too great.

2 Likes

Where’s the issue with using LLMs to speed up implementing new functionality? Especially when it comes to generating user interfaces, there’s no benefit in copying stuff manually instead of giving code from an already existing feature to an LLM and having it generate the GUI for the new one in a similar way.
In the end the math behind those functions isn’t affected by licenses … and as long as LLMs are trained on infective GNU-licensed stuff, the result must be licensed the same way :wink:

So not a big issue for GNU licensed software …

You should be careful with a statement like that. The whole question of “what if the LLM spits out exactly my code again, does my license still apply?” isn’t definitively settled in the courts yet.

This can have very, very funny side effects … if you “copy” GPL code into your MIT/BSD licensed code by using an LLM, suddenly your code might not be cleanly MIT/BSD licensed anymore.

With a nice chain effect.

And LLM coding vendors are aware of the issue. That’s why they offer two forms of their LLM: one “safe” copy and one which isn’t so strict about the licenses/copyright of the ingested training set.

This works only for round 1, though. As soon as open-source developers start using the not-so-strict LLM for their code, they might introduce the problem above, and suddenly the training set for the “safe” LLM is poisoned.

3 Likes

The initial post mentioned darktable, RT and Siril - so no issues to be expected …

I presume it’d be possible to run a search on the generated code to see if it matches code that’s already online; maybe this could be automated (it may not be practical to do manually).
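For what it’s worth, a very rough local version of that idea is easy to sketch. This is purely hypothetical (none of the projects here use anything like it): compare the generated snippet against a directory of open source checkouts and flag any long run of identical lines, using Python’s difflib.

```python
# Hypothetical sketch: flag AI-generated code that shares a long run of
# identical lines with files in a local corpus of open source checkouts.
# Not a real plagiarism detector -- just a crude similarity check.
import difflib
from pathlib import Path

def longest_shared_run(generated: str, corpus_dir: str, min_lines: int = 8):
    """Return (file, start line, length) for files sharing a long identical run."""
    gen_lines = generated.splitlines()
    hits = []
    for path in Path(corpus_dir).rglob("*.cc"):
        ref_lines = path.read_text(errors="ignore").splitlines()
        matcher = difflib.SequenceMatcher(None, gen_lines, ref_lines, autojunk=False)
        match = matcher.find_longest_match(0, len(gen_lines), 0, len(ref_lines))
        if match.size >= min_lines:
            hits.append((str(path), match.b + 1, match.size))
    return hits

if __name__ == "__main__":
    snippet = Path("generated_patch.cc").read_text()  # the AI-generated code under review
    for path, start_line, size in longest_shared_run(snippet, "corpus/"):
        print(f"{path}: {size} identical lines starting at line {start_line}")
```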

Are you sure about that?

So there’s no way code under an incompatible license can slip into those projects? Even within open source you can have a lot of fun with compatibility.

I think it should be allowed.

I believe that most code that could “slip” from other projects would already be something generic enough that anyone could write it. I’m sure that there are guardrails against highly specific code, especially as these tools develop.

If the LLM is used by a professional who knows what it is outputting, I don’t see why he couldn’t use those tools to speed up his development.

How likely is an LLM to straight out copy code? I know we saw this when these tools were starting to get used, but I wonder if it could even happen nowadays with how much development has happened.

How likely is it that the LLM spits out a pattern that it has seen already to solve a certain problem?

Very likely. That is its purpose.

3 Likes

But a pattern is not licensed code. As an example, I am sure a lot of Wine developers end up implementing some APIs in exactly the same way as MS does in Windows, but that does not mean the code is “copied”. It can’t even be copied, because anyone who has worked for MS is forbidden from contributing, so no legal challenges arise.

I am sure that if you cross-check thousands of FOSS projects you will find repeated code everywhere without it having been directly copied. Are licenses infringed? I think the same applies to this LLM context.

1 Like

We will have a lot of fun in the courts over the next few years, not just about code copyright and LLMs but also about every other output flavor.

And I guess also about the behavior of the AI companies on the ingest side.

e.g. Ars Technica: "Copyright Office head fired after reporting AI tr…" - Mastodon :slight_smile:

1 Like

And last but not least … most of the code is often shit anyway. I would rather debug stuff where I know what my intentions were than have to debug AI slop where I don’t know why it picked a certain construct, or what the intentions were of the original authors from whom it copied the code.

Programming can already be a dance of “the code looks correct but has a subtle bug”.
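For the sake of illustration, here is the kind of thing I mean, a generic Python example rather than anything from the projects discussed here: the code reads as obviously correct, yet the mutable default argument quietly shares one list across every call.

```python
# Looks correct, has a subtle bug: the default list is created once,
# at function definition time, and is shared between all calls.
def append_tag(tag, tags=[]):
    tags.append(tag)
    return tags

print(append_tag("raw"))   # ['raw']
print(append_tag("jpeg"))  # ['raw', 'jpeg']  -- surprising carry-over

# The usual fix: use None as the sentinel and create a fresh list per call.
def append_tag_fixed(tag, tags=None):
    if tags is None:
        tags = []
    tags.append(tag)
    return tags
```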

Here is some real-life data: Testing sourcery.ai and GitHub Copilot for cockpit PR reviews · Martin Pitt

2 Likes

I don’t doubt these companies break the law all the time and train their models illegally :smiley:

That is another question entirely. I’m only interested in what is being output; that’s why I mentioned highly specific code having guardrails in place so that no direct copy happens, if it can even happen nowadays.

Fringe programming problems should always be implemented by human developers, in my opinion. Using LLMs for a lot of boilerplate code that anyone could write shouldn’t pose big copyright risks, since it’s stuff people write similarly all the time and millions of examples were used to train those exact cases. The chance of a direct copy in those problems is likely zero.

I agree with you. This is why it should be used by people who know what they are doing. Junior developers’ code also suffers from the exact same problems as AI-generated code :smiley:

1 Like

I’ve recently had an open source contribution to one of my projects that raised a bunch of red flags. Overly polite text, a bit too wordy. Somewhat mechanical code change in nature.

It is my policy anyway to always ask for clarification before taking anything seriously. Too often I have received a drive-by collateral-damage report that was neither actionable nor important to the reporter.

And again, the response was polite, wordy, and triggered my AI detector. And yet, the GitHub account seemed legit, the contribution was valid, and it didn’t appear to be part of a spam attack.

So I gave them the benefit of the doubt, continued the code review, and it all resulted in a useful contribution. I expect that an LLM was used to write part of their responses, but probably merely to help them express themselves where their English skills lacked. No harm done.

In my own work, I have somewhat reluctantly rolled out the Copilot LLM to my team. It is genuinely useful, if used in moderation. But crucially, we use LLMs for brainstorming and autocompleting snippets of code that we know exactly how to write already, never to write code outright or to implement ideas I haven’t yet understood myself. Because at the end of the day, I will not sacrifice my long-term development and learning for a short-term gain from automation.

7 Likes

Yes, this is exactly what I was trying to express myself, albeit with difficulty.

2 Likes

It can’t be prevented, and probably never will be. As mentioned in the video, you could put in the prompt that you only want output “inspired” by MIT or BSD licensed code, but it will only ever be taken as a suggestion and if it has been trained on some matching GPL code, that may well be what you get back.

2 Likes

Perhaps an example would be helpful to understand the situation. The code I mentioned in my first comment is licensed under Apache 2.0. It is a very permissive license. RawTherapee is licensed under GPLv3. It’s a copyleft license that is more restrictive than Apache 2.0 because it includes conditions that seek to preserve the open-source nature of the protected code and derivatives. Apache 2.0 is compatible with GPLv3 in that Apache 2.0 code can be included in GPLv3 code.

This makes it fine for AI to copy Apache 2.0 code into RawTherapee, right? Yes and no. The Apache 2.0 license requires you to distribute the text of the license along with the code. A quick search of the RawTherapee codebase shows that there is a copy of the Apache 2.0 license, but in reference to the Droid Sans Mono Dotted font. Is this enough? I’m not sure, but suppose the license was not in the codebase or there’s a different license involved. You have to be careful.

Isn’t another big issue the copyright notice? All open source licenses I know require you to keep the copyright notice if you copy the code. So even if the licenses are compatible in general, this aspect is something LLMs do not (and probably cannot) respect.
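To make that concrete, here is a tiny, hypothetical audit sketch (not something RawTherapee or the other projects actually ship): it walks a source tree and lists files whose first lines carry neither a copyright line nor an SPDX-License-Identifier tag, which is exactly the attribution an LLM-pasted snippet would silently drop.

```python
# Hypothetical audit sketch: list source files whose first lines contain
# neither a copyright line nor an SPDX-License-Identifier tag.
import re
import sys
from pathlib import Path

NOTICE_RE = re.compile(r"Copyright|SPDX-License-Identifier:")
SOURCE_EXTS = {".c", ".cc", ".cpp", ".h", ".hpp", ".py"}

def files_missing_notice(root: str, head_lines: int = 30):
    missing = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in SOURCE_EXTS:
            continue
        head = "\n".join(path.read_text(errors="ignore").splitlines()[:head_lines])
        if not NOTICE_RE.search(head):
            missing.append(path)
    return missing

if __name__ == "__main__":
    for path in files_missing_notice(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```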

When I program, my code is influenced by all the things I’ve seen in my life. No doubt there is inspiration from open and closed source code in there.

LLMs are not much different. The problem is that they sometimes unwittingly quote code outright, which my human brain is not capable of.

AFAIK, there’s no harm in building a filmic-like tone mapper into Lightroom, so long as no algorithm or data is copied exactly.

1 Like