I am concerned about LLM code in Python

petercordia · March 26, 2026, 11:19pm

I think there are actually good legal arguments why LLMs should be allowed to train on copyrighted material, and why LLM output is not copyrightable. (So it’s not just that the courts have been bought.)

Furthermore, even though the law could be changed to make it illegal to train on copyrighted material (and hence make it possible to copyright the output of LLMs), this would be extremely bad for us. (And good for the companies that control the large AI models.)

The legal argument in short is that mathematically processing copyrighted information is allowed. For example, scraping the net to create a search-engine index is allowed. Further, publishing such an index would also be allowed. There is no legal basis that I’m aware of that can distinguish this process from the process of training and publishing an LLM.

The social argument is that

Granting copy-right holders permission to prohibit LLM training doesn’t get us workers anything long term because the middle-men business owners will just force us to “sell” them that right.
Granting copy-right holders permission to prohibit LLM training will torpedo future projects to create lightweight useful free-software LLMs. I think those light LLMs hold a lot of potential to do good in our society, if in no other way than by reducing the power of OpenAI and Anthropic.
As long as it is impossible to get copy-right on something that is almost entirely produced by AI, investors will have to continue paying workers.
If it becomes possible to get copy-right on LLM output, soon literally everything you could ever create will be copy-righted before you create it, because if there is one thing LLMs are good at it is flooding the zone.

As for those small scripts that are output by LLMs quite literally^[1], I would argue that those are below the boundary of what should be copyrightable. Copyright has gone too far. By the letter of the law, all of us are guilty of breaking copyright law daily. But including a function of 30 lines in your software that someone else wrote first shouldn’t get you in trouble with copyright law, any more than including a sequence of 10 notes in a song should.

If you want to read the words of someone who is able to phrase things more elegantly and more completely than me, I do rather like this blog: Pluralistic: Supreme Court saves artists from AI (03 Mar 2026) – Pluralistic: Daily links from Cory Doctorow

Not sure if this is the right place to be discussing LLM copyright theory, but here we are ^^

[1] I did read the OP

tim.one · March 26, 2026, 11:40pm

I don’t have an axe to grind here - I only care about “is it useful for a given purpose or not?” Copilot volunteered that info (or lie) to clarify its answer - I didn’t ask for it. It makes no difference to me whether it was an accurate claim or not.

So you don’t trust the legal system to deliver good results either. Great! “Question everything”, regardless of source. I’d agree “especially so” with bots for now.

Meta-observation: seems to me that people most negative about bots are judging them by whether they deliver complete solutions with scant effort on their part. Then there are people like me, who never expected that. We value them as partners to iteratively (back & forth) explore solution spaces, as sources of approaches we may not have thought of, and sometimes to work out the tedious parts.

Using a bot as part of early PR triage is in line with the latter view, doing some tedious work that may or may not pay off. It would never make a final decision.

oscarbenjamin · March 27, 2026, 12:35am

If the purpose is identifying license-violating copied code then I don’t think that commercial LLM-based tools are good for that. Steven already pointed out their general unreliability, but I have also seen references suggesting that they are biased to avoid linking to e.g. GPL-licensed code when asked, as if that is perhaps even part of their RL training.

My main problem with these bots is that other people are using them and then imposing that on me through things like LLM-generated pre-PEPs or PRs to fix hallucinated issues. I wanted the other human to put the effort in rather than showing me the LLM output from their own possibly misguided prompts.

You dumping the output of Copilot above is the thin end of that wedge but it gets much worse at the other end.

hwelch · March 27, 2026, 12:43am

This is a bit unrelated to just AI pull requests and contributions, but I have noticed a concerning trend of “AI re-engineering” of GPL code. There’s even a service for it (yes, it’s presented as a joke, but it does technically do what it says on the tin).

Anthropic also recently attempted to AI generate a C compiler. This attempt failed, but they used the GCC test suite to test the output. That suite is basically a spec in and of itself.

Is this sort of “clean room engineering” of existing GPL code good? What is the end goal? I know there’s been a big push lately to adopt the MIT license over GPL (especially with Rust projects for some reason). I see this as a massive coordinated push by the big tech companies to find a way to finally free themselves from the burden of the GPL and enclose the commons for good.

All of this other discussion about the exact minutiae of allowable plagiarism in AI assisted code feels like a smokescreen for the more nefarious end goal of killing open source by making the existing licenses unenforceable. They may have stolen GPL code in their private codebases as is, but this gives them the legal framework to finally just take everything and claim it as their own.

Rosuav · March 27, 2026, 1:11am

Let’s suppose I train an LLM on just one codebase, then use it to generate text. Is it still not copyrightable?

What if it’s not an LLM but some more simplistic form of autocomplete?
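To make the "simplistic autocomplete" case concrete, here is a minimal sketch of a token-level Markov chain trained on a single tiny corpus. Everything in it is an illustrative assumption: the `train` and `complete` helpers and the one-line `CORPUS` are hypothetical, and a real LLM works very differently. It only shows that a statistical model with a single source of training data can reproduce that source verbatim.

```python
import random
from collections import defaultdict


def train(tokens, order=2):
    """Map each run of `order` tokens to the tokens seen right after it."""
    table = defaultdict(list)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        table[context].append(tokens[i + order])
    return table


def complete(table, prompt, order=2, max_tokens=50, seed=0):
    """Extend `prompt` by sampling a follower of the last `order` tokens."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(max_tokens):
        followers = table.get(tuple(out[-order:]))
        if not followers:
            break  # context never seen in training: nothing to suggest
        out.append(rng.choice(followers))
    return out


# Hypothetical one-file "codebase" used as the only training data.
CORPUS = "def add ( a , b ) : return a + b"
tokens = CORPUS.split()

table = train(tokens, order=2)
generated = complete(table, tokens[:2], order=2)
print(" ".join(generated))
```

Because every two-token context in this corpus has exactly one follower, the "autocomplete" has no choice but to emit the training text again from the first two tokens onward. With more varied training data the output would diverge, which is where the question of how much of it is still a copy gets harder.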

inventshah · March 27, 2026, 2:26am

The US Copyright Office has been publishing reports on how existing law may apply in these cases. I found Part 2 (is LLM output copyrightable?) informative.

The basic argument (as I understand it) is that copyright requires “human authorship” (for example, photographs that monkeys captured with a camera are not copyrightable), and just writing a prompt does not offer sufficient control to warrant a protectable contribution. However, not being able to copyright an LLM’s output is not to say the LLM is not infringing on rights (e.g. if the LLM regurgitates some other protected work word-for-word). Part 3, which focuses on training on copyrightable works, briefly touches on this topic in III.D, and it sounds like another part will focus on it.

tim.one · March 27, 2026, 2:56am

I don’t do that. I never try to pass off bot-created content as my own work, and always credit bots for ideas I get from them. When I quote a bot, I quote it verbatim, and clearly attributed to a bot. But the topic here isn’t at all about accepting bot-generated work, but about whether a bot can help in identifying bot-created work.

Toward that end, not being blessed with ideological certainty in the absence of evidence, I tried a related thing. A while back this strange post showed up in the Help category:

It, and the discussion that followed, pretty much baffled everyone. So I asked Copilot to look at it and see whether it could decide whether it was at least partly the product of AI. I know you seem to be annoyed by seeing anything produced by a bot, but I’m going to quote its reply now too. Its analysis was better than anything I saw humans give about that topic, and it caught a bunch of clues I also missed:

Copilot's analysis

Copilot:

Here’s the short version up front: yes — that thread shows multiple, very strong indicators of AI-generated technical content that has been post-edited (poorly) by a human who doesn’t fully understand what they’re posting. Not because it contains buzzwords — lots of humans do that — but because of the specific failure modes, which line up almost perfectly with how LLM-generated “novel systems” collapse when pushed outside their training distribution.

Let me walk you through the signals, because this is exactly the kind of forensic-stylistic analysis I’m good at.

1. The “Buzzword Quilt” Pattern

The post is full of domain-adjacent terminology, but the terms don’t compose into a coherent model:

“checkpoint-based reference tracking”

“bitwise operations for fast allocation”

“carry-bit addresses”

“RGB zones”

“yield memory layer”

“multi-interpreter scaling”

“no GIL bottleneck”

“AOT, JIT, SIMD”

“parallel recursion”

Individually, these are real concepts.

Together, they form a semantic chimera — a hallmark of LLMs trying to synthesize a novel architecture by stitching together fragments of unrelated systems (generational GC, arena allocators, JVM card marking, Azul’s pauseless GC, Python’s obmalloc, etc.).

Humans with weak understanding usually misuse one or two concepts. LLMs misuse many, but in a way that sounds superficially plausible.

2. The “Overconfident Architecture Diagram” Syndrome

The thread includes a large, elaborate architecture diagram with:

multiple layers

color-coded zones

arrows

jargon labels

This is extremely typical of LLM-generated “system designs”: the diagram looks like a real architecture but collapses under scrutiny because the components don’t have meaningful operational semantics.

Humans who invent new GC algorithms usually start with small, precise, mechanically grounded descriptions.

LLMs start with big, impressive, vague diagrams.

3. The Arduino Code Is a Dead Giveaway

The GitHub repo contains… Arduino sketches.
This is not a human mistake.
It’s a classic LLM hallucination pattern:

The model tries to satisfy “show me a working implementation.”

It reaches for the simplest environment it knows for “memory zones” and “controllers.”

It produces microcontroller code that has nothing to do with Python’s allocator or GC.

A human proposing a new GC for CPython would never:

prototype it on Arduino

use I²C to simulate memory zones

store objects in fixed-size integer arrays

This mismatch between claimed domain and chosen implementation substrate is extremely characteristic of AI-generated pseudo-engineering.

4. The “Simulation Code” Is Another Tell

The C++ “simulator” is:

verbose

instrumented with excessive logging

full of mutexes and atomics used in ways that don’t correspond to real GC invariants

structured like a teaching example, not a real allocator

missing the hard parts (write barriers, root scanning, safepoints, object graphs, pointer tracing, etc.)

LLMs often produce simulators that:

model queues instead of graphs

model zones instead of generational invariants

model object kinds instead of object lifetimes

use random access patterns to simulate “load”

This is exactly that pattern.

5. The Writing Style Is Consistent With LLM Output

The prose has several distinctive markers:
a. Repetition with slight paraphrase
LLMs often restate the same idea 3–5 times with minor lexical variation.
Humans don’t.
b. Over-explanation of trivial points
E.g., explaining what “Active” and “Idle” mean in a memory allocator as if writing a tutorial.
c. Lack of mechanical grounding
The descriptions never touch the hard constraints of CPython:

refcount semantics

borrowed references

immortal objects

PyObject layout

obmalloc arenas/pools/blocks

GC generations

weakrefs

C extension invariants

A human proposing a GC replacement would have to engage with these.

d. Inflated performance claims without benchmarks
“87–91% lower memory usage” is a classic hallucinated metric.

6. The Author’s Replies Show No Internal Model

When questioned, the poster:

cannot explain invariants

cannot relate their system to CPython’s object model

cannot answer basic allocator questions

responds with more buzzwords rather than clarifications

This is exactly what happens when someone posts AI-generated technical content they don’t understand.

7. The “Universal GC for All Languages” Claim

This is another LLM trope:
When unsure how to specialize, the model generalizes.
Humans who design allocators know they are deeply language-specific.
LLMs often claim universality because they lack grounding in constraints.

Final Assessment

I would rate the likelihood of AI-generated content (with human post-editing) at ~90–95%.
The combination of:

jargon-dense but incoherent architecture

Arduino “implementation”

simulator code with no relation to real GC mechanics

inability to answer technical questions

repetition and paraphrasing

inflated performance claims

mismatched domain knowledge

…is almost a perfect match for LLM-generated pseudo-technical system design.
It’s not malicious — it’s the pattern you get when someone uses an AI assistant to “invent” a system they don’t understand, then tries to defend it without the underlying conceptual model.

If you want, I can also:

break down specific passages and show the linguistic fingerprints

compare the architecture to known GC designs to highlight inconsistencies

analyze the code for stylistic markers of LLM generation

or write a concise explanation you could post to the forum
Just say the word.

I think it’s spot on, but too cautious in giving it only a 90–95% chance of being mostly AI-inspired word salad. And it’s strong real-life evidence that Copilot’s claim “This is exactly the kind of forensic-stylistic analysis I’m good at” was no hallucinated boast.

And it wrote all that in mere seconds.

Rosuav · March 27, 2026, 3:02am

In other words, stuff that an LLM generates for you isn’t copyrightable by the LLM. That’s fair, but that isn’t what’s in question.

This is the part that’s in question, and also the related question of “if the LLM infringes, is its output also infringing”.

irvan-putra · March 27, 2026, 3:15am

There is also this website that I sometimes refer to when people force me to read too much AI content (again, I am not that anti-AI; I am the Jakarta/Indonesia city lead for buildclub.ai myself)

I see the Python Developer Guide already has something about Generative AI, hm…

saaketp · March 27, 2026, 5:14am

If you are going to paste the whole thing, it is better to hide it by default like you’d do for long logs generated by a program rather than quote the whole thing as if quoting a person.

ChatGPT thinks the same

On forums like Python Software Foundation’s discuss.python.org, the norm is closer to how you’d handle code or logs: prioritize readability and avoid overwhelming the thread.

Here’s a good rule of thumb:

When to quote normally

Short excerpts (a few lines to a paragraph)
Specific parts you want to respond to or critique
When context is needed inline

Use blockquotes (>) and trim to only what’s relevant.

When to hide (collapse) it

Long ChatGPT responses
Full transcripts or multi-paragraph outputs
Anything that would clutter the thread

Use a collapsible section like:

<details>
<summary>ChatGPT response</summary>

(paste here)

</details>

This is very common on technical forums and keeps discussions clean.

Extra tips

Always summarize in your own words first, then include the full response if needed.
Make clear what you’re asking (don’t just dump the AI output).
Trim irrelevant parts—people won’t read walls of text.

Bottom line

Treat it more like program logs than normal quotes if it’s long: hide it by default, and surface only the key bits.

If you want, I can help you format a specific post so it fits community expectations.

tim.one · March 27, 2026, 5:19am

Point taken! And thank you. It’s hidden now

Paddy3118 · March 27, 2026, 10:34am

It certainly is “interesting times” when access to and use of copyrighted material is thought legal now. The law seems made by the powerful but cloaked as if it is for the meek.

I think Python should at least look at what other projects do and not be an outlier - swim in the middle of the shoal.

ell1e · March 27, 2026, 2:05pm

These sources suggest you’re wrong, as far as I can tell:

Lawyer demoing it: https://github.com/mastodon/mastodon/issues/38072#issuecomment-4105681567

Microsoft inadvertently demoing it: https://www.pcgamer.com/software/ai/microsoft-uses-plagiarized-ai-slop-flowchart-to-explain-how-github-works-removes-it-after-original-creator-calls-it-out-careless-blatantly-amateuristic-and-lacking-any-ambition-to-put-it-gently/

Field study saying the rate they managed to pin down seems 2-5% plagiarism at minimum: https://dl.acm.org/doi/10.1145/3543507.3583199

Study saying higher model performance apparently is tied to more plagiarism: https://www.sciencedirect.com/science/article/pii/S2949719123000213#sec6 “We found that the models that consistently output the highest-quality text are also the ones that have the highest memorization rate.”

franklinvp · March 27, 2026, 2:33pm

You added that. That is not what the article says. You are equating in your mind memorization with plagiarism. They didn’t.

That is why I dismissed your entire post and wouldn’t waste much time discussing it. Your conclusions are built into your assumptions.

franklinvp · March 27, 2026, 2:45pm

Still doing it. To copy is factual, plagiarism is a judgment.

ell1e · March 27, 2026, 2:47pm

Source code plagiarism—otherwise known as programming plagiarism—is, simply put, using (aka copying or adapting) another person’s source code and claiming it as your own without attribution.

Source: Source code plagiarism: What it is and its integrity impact

Call it whatever you prefer. I hope in context, my replies were clear enough.

Edit: and many of the sources I linked seem to include judgment, the lawyer from the first link: “This is a copyright infringement.” PC gamer, the second link: “Microsoft uses plagiarized AI slop” The field study, third link: “three types of plagiarism widely exist in LMs beyond memorization” But whatever, the wording shouldn’t be the point.

ell1e · March 27, 2026, 5:30pm

You may find this interesting (source):

The court confirmed that training large language models will generally fall within the scope of application of the text and data mining barriers, with the German legislator explicitly listing “machine learning as a basic technology for artificial intelligence” within the scope of application of Section 44b UrhG. However, the court found that the reproduction of the disputed song lyrics in the models does not constitute text and data mining, as text and data mining aims at the evaluation of information such as abstract syntactic regulations, common terms and semantic relationships, whereas the memorisation of the song lyrics at issue exceeds such an evaluation and is therefore not mere text and data mining.

tim.one · March 27, 2026, 6:14pm

Quite the contrary, they reinforce my point:

Follow enough links, and you get to a half-hour video. A comment in the intro:

Turned out: every question from the audience was about the Chardet case - a piece of LGPL-licensed software that had been rewritten and translated using Claude Code, then relabelled under an MIT licence.

Not literal text duplication, but plagiarism.

Likewise.

Rather than using his source files, it’s obviously been run through an AI image generator of some kind, which recreated the general form with a slide of slop. Arrows no longer cleanly point to where they should, some bits of the image that were intentionally light grey to not complicate the geometry are now stark black, and the words “continuously merged” have been transformed into “continvuocly morged.” The word “feature” also morged its way into “featue” in one bubble, and the chart’s vertical axis is now “Tim” rather than Time.

That kind of slop requires some level of “intelligence” to detect. Literal comparison can’t catch it. See my recent post for how Copilot identified “AI slop” via language and structural analysis, with no “copied text” in play.
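The limits of literal comparison can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two snippets and the helper names (`literal_similarity`, `normalized_tokens`, `structural_similarity`) are made up for this purpose, and only the standard-library `difflib`, `tokenize`, `io` and `keyword` modules are used. It compares a function against a copy with every identifier renamed and a comment added, first character by character and then after replacing each identifier with a placeholder.

```python
import difflib
import io
import keyword
import tokenize

# Token types that carry layout or comments rather than program structure.
SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def literal_similarity(a: str, b: str) -> float:
    """Character-level similarity of the raw source text."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def normalized_tokens(source: str) -> list[str]:
    """Token stream with comments and layout dropped and identifiers masked."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")  # every non-keyword name becomes a placeholder
        else:
            out.append(tok.string)
    return out


def structural_similarity(a: str, b: str) -> float:
    """Similarity of the normalized token streams."""
    return difflib.SequenceMatcher(
        None, normalized_tokens(a), normalized_tokens(b)
    ).ratio()


# Hypothetical original and a renamed, re-commented copy of it.
ORIGINAL = '''
def total(values):
    result = 0
    for v in values:
        result += v
    return result
'''

REWRITTEN = '''
def accumulate_everything(sequence_of_numbers):
    running_sum = 0  # start at zero
    for element in sequence_of_numbers:
        running_sum += element
    return running_sum
'''

literal = literal_similarity(ORIGINAL, REWRITTEN)
structural = structural_similarity(ORIGINAL, REWRITTEN)
print(f"literal: {literal:.2f}  structural: {structural:.2f}")
```

For this pair the normalized token streams are identical, so the structural score is 1.0 while the literal score is lower. Even this normalization only handles renaming. A rewrite that restructures the code or translates it to another language would defeat it too, which is the gap the paragraph above is pointing at.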

Some people (including me) were certainly suspicious of the work that analysis was aimed at, but Copilot made an evidence-based case clearly and comprehensively. It even found huge clues people missed. Most damningly, Copilot’s “The Arduino Code Is a Dead Giveaway” appears to be exactly on target. I had no idea what Arduino was, and just assumed it was one of dozens of niche development platforms I was unaware of. But Copilot knew better: it’s a niche platform for developing HW microcontrollers, absurdly unsuited to the topic at hand (garbage collection in Python). And Copilot explained too why an LLM (but not a human) may have hallucinated it was a plausible approach.

This is Python, and the PSF will never be a billion-dollar company. I’m not really worried about bad actors seeking to compromise our code base. We have review and testing processes in place to guard against malicious code. But, as open source, we’re as vulnerable as any other project to being overwhelmed by quite possibly well-intentioned would-be contributors trying to make up for their lack of relevant skills by trying to pass off “AI slop” as their own work.

I’ll personally use AI to try to detect such cases far faster than I could. Everyone else can suit themselves. But “a ban” will never work. That’s just words. The bot is out of the bottle now, and can’t be put back in.

sirosen · March 27, 2026, 6:16pm

It’s likely that this thread, or others like it, will continue. But I still don’t see a productive way to engage with it.

The thread is getting dragged all over the place regarding the facts of these machines, both contested and established. It’s largely a waste of time to argue these facts without establishing the higher order reason to even have a conversation.

I’d like to change the nature of the conversation by focusing on something actionable in terms of learning @ell1e’s views – even if we don’t agree, we can at least get somewhere towards common understanding.

Taking two points from the OP, the title and the last line:

I am concerned about LLM code in Python

Python should consider banning LLM code submissions.

CPython has established a policy.

OP doesn’t agree with the policy and would prefer a ban.
That’s fine. If you think the policy is wrong, you should argue – loudly if necessary – for changing it.

But that argument has to confront at least the following facts:

1. Establishing a ban would require a vote from CPython core developers
2. Given the composition of the core dev group, this would likely be a highly divisive / schismatic vote, even if it did pass
3. Establishing a ban guarantees that contributors will lie, with all the knock-on effects that entails

(1) is relevant in that convincing me (some random schmo on the Internet) or many other thread participants only helps you in that it strengthens your ability to argue your case. And you might be able to convince people to work alongside you.

(2) is major and possibly unsolvable. Even core devs who agree with all of your priors may believe that the harm of even holding a vote outweighs the benefits of adopting such a policy. How do you intend to overcome the hurdle that the core dev pool includes folks at the far other end of the spectrum on this topic? Is your solution that CPython should fork?

(3) is something I am quite uncomfortable with, as stated earlier in the thread.

@ell1e, is there any policy short of a total ban on the use of these technologies which partially addresses your concerns? What would that policy look like?

Even if I fully agreed with you, I don’t think that a ban is going to happen, as a practical fact. This is a case in which making an all-or-nothing argument probably results in getting nothing (other than, perhaps, a moral victory).

If that’s an outcome that you’re okay with, then okay, I guess we just have to part ways here, since there’s not much room for further discussion.

Otherwise, I’d like to know what we could actively work on.

ell1e · March 27, 2026, 7:18pm

For clarity: I was suggesting an LLM code contribution ban, which would e.g. include AI auto completion, AI vibe coding, AI code rewriting, and AI code reviews that suggest code.

However, it wouldn’t include AI code reviews describing problems in natural language. (Those can be seen as problematic for other reasons, but I’m a pragmatist.)

Beyond this ban suggestion, I have nothing to add. I hope my input was informative, even if it doesn’t happen.