I am concerned about LLM code in Python

I think there are actually good legal arguments why LLMs should be allowed to train on copyrighted material, and why LLM output is not copyrightable. (So it’s not just that the courts have been bought.)

Furthermore, even though the law could be changed to make it illegal to train on copyrighted material (and hence make it possible to copyright the output of LLMs), I think this would be extremely bad for us (and good for the companies that control the large AI models).

The legal argument in short is that mathematically processing copyrighted information is allowed. For example, scraping the net to create a search-engine index is allowed. Further, publishing such an index would also be allowed. There is no legal basis that I’m aware of that can distinguish this process from the process of training and publishing an LLM.

The social argument is that

  1. Granting copyright holders permission to prohibit LLM training doesn’t get us workers anything long term, because the middle-men business owners will just force us to “sell” them that right.
  2. Granting copyright holders permission to prohibit LLM training will torpedo future projects to create lightweight, useful free-software LLMs. I think those light LLMs hold a lot of potential to do good in our society, if in no other way than by reducing the power of OpenAI and Anthropic.
  3. As long as it is impossible to get copyright on something that is almost entirely produced by AI, investors will have to continue paying workers.
  4. If it becomes possible to get copyright on LLM output, soon literally everything you could ever create will be copyrighted before you create it, because if there is one thing LLMs are good at, it is flooding the zone.

As for those small scripts that are output by LLMs quite literally[1], I would argue that those are below the boundary of what should be copyrightable. Copyright has gone too far. By the letter of the law, all of us are guilty of breaking copyright law daily. But including a function of 30 lines in your software that someone else wrote first shouldn’t get you in trouble with copyright law, any more than including a sequence of 10 notes in a song should.

If you want to read the words of someone who is able to phrase things more elegantly and more completely than me, I do rather like this blog: Pluralistic: Supreme Court saves artists from AI (03 Mar 2026) – Pluralistic: Daily links from Cory Doctorow

Not sure if this is the right place to be discussing LLM copyright theory, but here we are ^^


  1. I did read the OP ↩︎

While I don’t care :wink: I don’t have an axe to grind here - I only care about “is it useful for a given purpose or not?” Copilot volunteered that info (or lie), to clarify its answer - I didn’t ask for it. It makes no difference to me whether it was an accurate claim or not.

So you don’t trust the legal system to deliver good results either. Great! “Question everything”, regardless of source. I’d agree “especially so” with bots for now.

Meta-observation: seems to me that people most negative about bots are judging them by whether they deliver complete solutions with scant effort on their part. Then there are people like me, who never expected that. We value them as partners to iteratively (back & forth) explore solution spaces, as sources of approaches we may not have thought of, and sometimes to work out the tedious parts.

Using a bot as part of early PR triage is in line with the latter view, doing some tedious work that may or may not pay off. It would never make a final decision.

If the purpose is identifying license-violating copied code then I don’t think that commercial LLM-based tools are good for that. Steven already pointed out their general unreliability, but I have also seen references that they are biased to avoid linking to e.g. GPL-licensed code when asked, as if that is perhaps even part of their RL training.

My main problem with these bots is that other people are using them and then imposing that on me through things like LLM-generated pre-PEPs or PRs to fix hallucinated issues. I wanted the other human to put the effort in rather than showing me the LLM output from their own possibly misguided prompts.

You dumping the output of Copilot above is the thin end of that wedge but it gets much worse at the other end.

This is a bit unrelated to just AI pull requests and contributions, but I have noticed a concerning trend of “AI re-engineering” of GPL code. There’s even a service for it (yes, it’s presented as a joke, but it does technically do what it says on the tin).

Anthropic also recently attempted to AI-generate a C compiler. This attempt failed, but they used the GCC test suite to test the output. That suite is basically a spec in and of itself.

Is this sort of “clean room engineering” of existing GPL code good? What is the end goal? I know there’s been a big push lately to adopt the MIT license over GPL (especially with Rust projects for some reason). I see this as a massive coordinated push by the big tech companies to find a way to finally free themselves from the burden of the GPL and enclose the commons for good.

All of this other discussion about the exact minutiae of allowable plagiarism in AI-assisted code feels like a smokescreen for the more nefarious end goal of killing open source by making the existing licenses unenforceable. They may have stolen GPL code in their private codebases as is, but this gives them the legal framework to finally just take everything and claim it as their own.

Let’s suppose I train an LLM on just one codebase, then use it to generate text. Is it still not copyrightable?

What if it’s not an LLM but some more simplistic form of autocomplete?

The US Copyright Office has been publishing reports on how existing law may apply in these cases. I found Part 2 (is LLM output copyrightable?) informative.

The basic argument (as I understand it) is that copyright requires “human authorship” (for example, photographs a monkey captured with a camera are not copyrightable), and just writing a prompt does not offer sufficient control to warrant a protectable contribution. However, not being able to copyright an LLM’s output is not to say the LLM is not infringing on rights (e.g. if the LLM regurgitates some other protected work word-for-word). Part 3, which focuses on training on copyrightable works, briefly touches on this topic in III.D, and it sounds like another part will focus on it.

I don’t do that. I never try to pass off bot-created content as my own work, and always credit bots for ideas I get from them. When I quote a bot, I quote it verbatim, and clearly attributed to a bot. But the topic here isn’t at all about accepting bot-generated work, but about whether a bot can help in identifying bot-created work.

Toward that end, not being blessed with ideological certainty in the absence of evidence :wink: , I tried a related thing. A while back this strange post showed up in the Help category:

It, and the discussion that followed, pretty much baffled everyone. So I asked Copilot to look at it and see whether it could decide whether it was at least partly the product of AI. I know you seem to be annoyed by seeing anything produced by a bot, but I’m going to quote its reply now too. Its analysis was better than anything I saw humans give about that topic, and it caught a bunch of clues I also missed:

Copilot's analysis

I think it’s spot on, but too cautious in giving it only a 90–95% chance of being mostly AI-inspired word salad. And it’s strong real-life evidence that Copilot’s claim “This is exactly the kind of forensic-stylistic analysis I’m good at” was no hallucinated boast :wink:

And it wrote all that in mere seconds.

In other words, stuff that an LLM generates for you isn’t copyrightable by the LLM. That’s fair, but that isn’t what’s in question.

This is the part that’s in question, and also the related question of “if the LLM infringes, is its output also infringing”.

There is also this website that I sometimes refer to when people just force me to read too much AI content (again, I am not that anti-AI, I am the Jakarta/Indonesia city lead for buildclub.ai myself)

I see the Python Developer Guide already has something about Generative AI, hm…

If you are going to paste the whole thing, it is better to hide it by default like you’d do for long logs generated by a program rather than quote the whole thing as if quoting a person.

ChatGPT thinks the same :wink:

On forums like Python Software Foundation’s discuss.python.org, the norm is closer to how you’d handle code or logs: prioritize readability and avoid overwhelming the thread.

Here’s a good rule of thumb:

:white_check_mark: When to quote normally

  • Short excerpts (a few lines to a paragraph)
  • Specific parts you want to respond to or critique
  • When context is needed inline

:backhand_index_pointing_right: Use blockquotes (>) and trim to only what’s relevant.


:white_check_mark: When to hide (collapse) it

  • Long ChatGPT responses
  • Full transcripts or multi-paragraph outputs
  • Anything that would clutter the thread

:backhand_index_pointing_right: Use a collapsible section like:

<details>
<summary>ChatGPT response</summary>

(paste here)

</details>

This is very common on technical forums and keeps discussions clean.


:balance_scale: Extra tips

  • Always summarize in your own words first, then include the full response if needed.
  • Make clear what you’re asking (don’t just dump the AI output).
  • Trim irrelevant parts—people won’t read walls of text.

Bottom line

Treat it more like program logs than normal quotes if it’s long: hide it by default, and surface only the key bits.

If you want, I can help you format a specific post so it fits community expectations.

Point taken! And thank you. It’s hidden now :slight_smile:

It certainly is “interesting times” when access to and use of copyrighted material is thought legal now :smile: The law seems made by the powerful but cloaked as if it is for the meek.

I think Python should at least look at what other projects do and not be an outlier - swim in the middle of the shoal.

These sources suggest you’re wrong, as far as I can tell:

Lawyer demoing it: https://github.com/mastodon/mastodon/issues/38072#issuecomment-4105681567

Microsoft inadvertently demoing it: https://www.pcgamer.com/software/ai/microsoft-uses-plagiarized-ai-slop-flowchart-to-explain-how-github-works-removes-it-after-original-creator-calls-it-out-careless-blatantly-amateuristic-and-lacking-any-ambition-to-put-it-gently/

Field study saying the rate they managed to pin down seems to be 2-5% plagiarism at minimum: https://dl.acm.org/doi/10.1145/3543507.3583199

Study saying higher model performance apparently is tied to more plagiarism: https://www.sciencedirect.com/science/article/pii/S2949719123000213#sec6 “We found that the models that consistently output the highest-quality text are also the ones that have the highest memorization rate.”

You added that. That is not what the article says. You are equating in your mind memorization with plagiarism. They didn’t.

That is why I dismissed your entire post and wouldn’t waste much time discussing it. Your conclusions are built into your assumptions.

Still doing it. To copy is factual, plagiarism is a judgment.

Source code plagiarism—otherwise known as programming plagiarism—is, simply put, using (aka copying or adapting) another person’s source code and claiming it as your own without attribution.

Source: Source code plagiarism: What it is and its integrity impact

Call it whatever you prefer. I hope in context, my replies were clear enough.

Edit: and many of the sources I linked seem to include judgment. The lawyer from the first link: “This is a copyright infringement.” PC Gamer, the second link: “Microsoft uses plagiarized AI slop”. The field study, third link: “three types of plagiarism widely exist in LMs beyond memorization”. But whatever, the wording shouldn’t be the point.

You may find this interesting (source):

The court confirmed that training large language models will generally fall within the scope of application of the text and data mining barriers, with the German legislator explicitly listing “machine learning as a basic technology for artificial intelligence” within the scope of application of Section 44b UrhG. However, the court found that the reproduction of the disputed song lyrics in the models does not constitute text and data mining, as text and data mining aims at the evaluation of information such as abstract syntactic regulations, common terms and semantic relationships, whereas the memorisation of the song lyrics at issue exceeds such an evaluation and is therefore not mere text and data mining.

Quite the contrary, they reinforce my point:

Follow enough links, and you get to a half-hour video. A comment in the intro:

Turned out: every question from the audience was about the Chardet case - a piece of LGPL-licensed software that had been rewritten and translated using Claude Code, then relabelled under an MIT licence.

Not literal text duplication, but plagiarism.

Likewise.

Rather than using his source files, it’s obviously been run through an AI image generator of some kind, which recreated the general form with a slide of slop. Arrows no longer cleanly point to where they should, some bits of the image that were intentionally light grey to not complicate the geometry are now stark black, and the words “continuously merged” have been transformed into “continvuocly morged.” The word “feature” also morged its way into “featue” in one bubble, and the chart’s vertical axis is now “Tim” rather than Time.

That kind of slop requires some level of “intelligence” to detect. Literal comparison can’t catch it. See my recent post for how Copilot identified “AI slop” via language and structural analysis, with no “copied text” in play.
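To make the "literal comparison can’t catch it" point concrete for code rather than images, here is a toy sketch. It is only an illustration under stated assumptions: `normalized_tokens` is a hypothetical helper made up for this example, not something any real plagiarism detector is known to use, and the two snippets are invented. It uses only the standard-library `tokenize` and `keyword` modules.

```python
import io
import keyword
import tokenize

# Token types that carry layout or comments only; ignored in this sketch.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalized_tokens(source: str) -> list[str]:
    """Hypothetical helper: token stream with every identifier replaced by 'ID'."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")  # names are erased, keywords are kept
        else:
            result.append(tok.string)
    return result


# Invented example snippets: same structure, different names.
original = "def total(values):\n    return sum(values)\n"
renamed = "def add_all(items):\n    return sum(items)\n"

literal_match = original == renamed
structural_match = normalized_tokens(original) == normalized_tokens(renamed)

print("literal match:   ", literal_match)
print("structural match:", structural_match)
```

A plain string comparison reports no match, while the identifier-blind comparison does. Even this only handles renaming: a rewrite into another language or a restructured algorithm would slip past it too, which is why some kind of judgment is still needed.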

Some people (including me) were certainly suspicious of the work that analysis was aimed at, but Copilot made an evidence-based case clearly and comprehensively. It even found huge clues people missed. Most damningly, Copilot’s “The Arduino Code Is a Dead Giveaway” appears to be exactly on target. I had no idea what Arduino was, and just assumed it was one of dozens of niche development platforms I was unaware of. But Copilot knew better: it’s a niche platform for developing HW microcontrollers, absurdly unsuited to the topic at hand (garbage collection in Python). And Copilot explained too why an LLM (but not a human) may have hallucinated it was a plausible approach.

This is Python, and the PSF will never be a billion-dollar company. I’m not really worried about bad actors seeking to compromise our code base. We have review and testing processes in place to guard against malicious code. But, as open source, we’re as vulnerable as any other project to being overwhelmed by quite possibly well-intentioned would-be contributors trying to make up for their lack of relevant skills by trying to pass off “AI slop” as their own work.

I’ll personally use AI to try to detect such cases far faster than I could on my own. Everyone else can suit themselves. But “a ban” will never work. That’s just words. The bot is out of the bottle now, and can’t be put back in.

It’s likely that this thread, or others like it, will continue. But I still don’t see a productive way to engage with it.

The thread is getting dragged all over the place regarding the facts of these machines, both contested and established. It’s largely a waste of time to argue these facts without establishing the higher order reason to even have a conversation.

I’d like to change the nature of the conversation by focusing on something actionable in terms of learning @ell1e’s views – even if we don’t agree, we can at least get somewhere towards common understanding.

Taking two points from the OP, the title and the last line:

I am concerned about LLM code in Python

Python should consider banning LLM code submissions.

CPython has established a policy.

OP doesn’t agree with the policy and would prefer a ban.
That’s fine. If you think the policy is wrong, you should argue – loudly if necessary – for changing it.

But that argument has to confront at least the following facts:

  1. Establishing a ban would require a vote from CPython core developers
  2. Given the composition of the core dev group, this would likely be a highly divisive / schismatic vote, even if it did pass
  3. Establishing a ban guarantees that contributors will lie, with all the knock-on effects that entails

(1) is relevant in that convincing me (some random schmo on the Internet) or many other thread participants only helps you in that it strengthens your ability to argue your case. And you might be able to convince people to work alongside you.

(2) is major and possibly unsolvable. Even core devs who agree with all of your priors may believe that the harm of even holding a vote outweighs the benefits of adopting such a policy. How do you intend to overcome the hurdle that the core dev pool includes folks at the far other end of the spectrum on this topic? Is your solution that CPython should fork?

(3) is something I am quite uncomfortable with, as stated earlier in the thread.


@ell1e, is there any policy short of a total ban on the use of these technologies which partially addresses your concerns? What would that policy look like?

Even if I fully agreed with you, I don’t think that a ban is going to happen, as a practical fact. This is a case in which making an all-or-nothing argument probably results in getting nothing (other than, perhaps, a moral victory).

If that’s an outcome that you’re okay with, then okay, I guess we just have to part ways here, since there’s not much room for further discussion.

Otherwise, I’d like to know what we could actively work on.

For clarity: I was suggesting an LLM code contribution ban, which would e.g. include AI auto completion, AI vibe coding, AI code rewriting, and AI code reviews that suggest code.

However, it wouldn’t include AI code reviews describing problems in natural language. (Those can be seen as problematic for other reasons, but I’m a pragmatist.)

Beyond this ban suggestion, I have nothing to add. I hope my input was informative, even if a ban doesn’t happen.