Coding has changed forever. I don’t remember the last time I built something entirely from scratch without libraries, tools, or AI assistance.
Libraries and tools other than LLM AI assistance don’t have a plagiarism problem. Reading StackOverflow, articles, and code at least has a very low likelihood of producing verbatim plagiarism. (I’m excluding intentional or uninformed copy-and-paste in both cases, since it seems agreed that policy documents aren’t good at impacting intentional rule-breaking if the rules are hard to enforce, but can impact how well informed contributors are.)
In large part because I infer no ill intent on the part of those who violate, say, any online service’s Terms of Service. Very few people read these things to begin with. Some who do rationalize in pragmatic ways, like, in this case, “Ah. They’re worried about licensing violations sneaking in. But I triple-check carefully, and judge the risk of my failing to catch a violation is too close to 0 to matter. Let’s get on with it!”
It might be different if the PSF Contributor Agreement they explicitly agree to said more, but it doesn’t. It’s very brief and friendly:
That was drafted by a capable OSS lawyer (Van Lindberg), and was as brief as he could make it. Ironically, to fill it out, you first have to click to accept Adobe’s Terms of Use just to access the form. I seriously doubt anyone (except perhaps other lawyers) has ever read all of that.
As is, our Contributor Agreement doesn’t spell out a contributor’s obligations, and doesn’t make explicit that the contributor has to have the legal right to license their contributions. That requirement is hinted at telegraphically via a reference to the contributor’s “valid copyright notice”. It’s all hiding in “valid”.
At a higher level, I have no reason to imagine that anyone using bots now has “ill intent”, and our current policy says nothing against it. If they haven’t changed, in what sense would “ill intent” spring into existence just because words in the policy changed? Assuming they even noticed the change. You could say they showed disdain for the rules then, but that’s a different kind of “ill intent” (if you’re inclined to view it that way).
Plagiarism has been a major ongoing problem for years, long before AI assistants. Even for peer-reviewed scientific journals. Here’s one example from a decade ago (2016):
Plagiarism is common and threatens the integrity of the scientific literature.
…
In 400 consecutively submitted manuscripts [to a major American specialty medical journal], 17% of submissions contained unacceptable levels of plagiarized material, with 82% of plagiarized manuscripts submitted from countries where English was not an official language.
What does “unacceptable” mean? Read the paper for clues. Some level of literal duplication is expected; for example, in citations and conflict-of-interest statements, which are all much alike across papers.
My apologies, but I couldn’t quite figure out how that’s relevant for what I said. Perhaps this is a misunderstanding:
I wasn’t arguing for you taking the risks away. My point instead was “I don’t think the current proposal reflects the risks well”.
In my opinion, a fix would involve a clear statement that LLMs seem to significantly raise the risk of accidental, undetected plagiarism, and should not suggest that any amount of reviewing is likely to substantially fix that.
The proposed text seems not to reflect that, while putting the contributor on the hook for the fallout. I think that’s worrisome.
I would argue LLMs clearly sometimes cause plagiarism of the amount shown in this clip or in this incident. If this were a human contributor, the Python project would probably ban them.
Trusting someone else regarding the authenticity of their work does not protect you from plagiarism either.
You claim that reviewing alone cannot determine whether a work is plagiarized, yet your argument relies on cases where plagiarism was actually detected through reviewing. I’m not sure how that makes sense.
Tips:
Ask LLMs to check whether a work is similar to existing ones.
I don’t understand where the conflict is, to be honest. Feel free to elaborate.
Obviously, only plagiarism cases that were found can be pointed to as examples. This is, however, independent of:
1. how much plagiarism review in pull requests via related services is sensible (I would argue not much, but then why encourage LLM commits?),
2. whether you would detect plagiarism in all cases where somebody else may later find it and sue you,
3. the argument that, for well-meaning contributors, I think the proposal is worrisome.
Neither expertise nor asking an LLM is a reliable plagiarism detector, as far as I know. Neither would guarantee somebody can’t find it later and get a well-meaning contributor into trouble.
AI assistants can be helpful when drafting code or documentation, but they also introduce specific risks that contributors must consider. These tools generate output based on patterns learned from large training datasets, and may produce material that is derived from copyrighted or licensed sources. Such derivation can occur even when the resulting text or code is extensively rewritten or does not resemble the original in form or wording.
Because the PSF can only accept contributions that the contributor has the legal right to license, all submitted material must be free of copyright or licensing conflicts, including material produced with the assistance of AI tools. Contributors are responsible for ensuring that their submissions meet this requirement.
The above part seems reasonable. (This isn’t legal advice.) But then:
Why not state next that you doubt that, with current LLMs, any contributor would be able to ensure this? It seemed, based on the posts above, that you share those doubts. Wouldn’t a risk-informing policy mention that?
I also wonder if it’s fair to expect reviewers to spot LLM plagiarism. Therefore the follow-up section, which seemingly implies they should try, worries me too.
That leaves the question of what other process could be suggested that isn’t a ban.
Anyway, I’m not a lawyer, I don’t want to suggest actual policy text. But I hope this explains at what exact point my concerns begin.
Is this a reasonable summary? (Apologies that some further posts have come in since I started this!)
@Tim.one and others are suggesting that a “best-effort” basis of both trying not to submit plagiarised code and detecting it is sufficient, along with warnings that LLMs (and humans!) can sometimes (I don’t think we know if it is “often”) produce such problematic code. Moreover, a ban would likely be ignored or circumvented by some (many?), and would be counterproductive.
On the other hand, my understanding of @ell1e’s position is that neither the warning nor the underlying policy could be rigorous enough to ensure no LLM-generated plagiarised code made it through. I also take it to mean that this suggestion would involve needing to police and ban transgressions not just of plagiarism but of any discovered LLM submissions (which would perhaps be just as difficult as plagiarism detection itself?).
I think the “best effort” point is crucial. There is no way to ensure that no plagiarised code is submitted (whether or not there is an LLM ban), *and* there is no way to ensure that no LLM-generated code at all is submitted (in the case of a ban). So in either case there will have to be some amount of trust, and crossed fingers regarding legal responses.
It should, in principle, be easy to find verbatim copies of code if one knows the dataset those copies would be drawn from. It is just a search problem. The size of the dataset (trillions of bytes?) seems to rule out using linear searches, but if the dataset were indexed first, it might be feasible. Near-verbatim matches might not be much more difficult.
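The indexing idea can be sketched with line shingles: index every window of consecutive normalized lines once, then check a candidate contribution in time proportional to its own size rather than the corpus size. A toy illustration, not a production detector; the shingle width, normalization, and corpus layout are all assumptions:

```python
from collections import defaultdict

SHINGLE = 4  # consecutive normalized lines per shingle (an assumed window size)

def normalize(line: str) -> str:
    # Collapse whitespace so formatting differences don't hide verbatim copies.
    return " ".join(line.split())

def shingles(code: str):
    lines = [normalize(l) for l in code.splitlines() if normalize(l)]
    for i in range(len(lines) - SHINGLE + 1):
        yield "\n".join(lines[i:i + SHINGLE])

def build_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    # Map each shingle to the source files containing it (a real index
    # would store hashes instead of the raw strings to save space).
    index = defaultdict(set)
    for name, code in corpus.items():
        for sh in shingles(code):
            index[sh].add(name)
    return index

def find_matches(index, candidate: str) -> dict[str, int]:
    # Count how many shingles of the candidate appear in each indexed source.
    hits = defaultdict(int)
    for sh in shingles(candidate):
        for name in index.get(sh, ()):
            hits[name] += 1
    return dict(hits)
```

Near-verbatim matching could reuse the same structure with a more aggressive normalizer (e.g. renaming identifiers to placeholders), though that also multiplies the false positives.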
I think the Overton Window on this discussion is too centered on the need for a ban being a given, but pragmatic concerns making it hard.
I think that we should not ban LLM-generated code, even if such a ban were feasibly enforceable. To me, the risks don’t outweigh the benefits.
CPython development has a lot to gain from using LLMs to diagnose, triage, and fix issues IMO. The core team suffers from various bottlenecks and lack of developer time, and LLM usage by the team can greatly alleviate some of that if they’re used well. Responsible and skillful usage of LLMs by contributors can also help, but it’s less of a given.
That leaves the uphill battle against slop, which I don’t think a ban addresses. Just as plagiarism was a concern before LLMs, worthless contributions eating developer time were also a thing (think Hacktoberfest). LLMs greatly augment drive-by contributors’ ability to harm development with slop, and we do need mechanisms (other than a ban, IMO) to deal with that.
All in all I appreciate that this discussion is taking place and just wanted to voice a position I feel hadn’t been represented.
I’d like to stress my last post wasn’t meant to say a ban is obviously the only solution.
However, the only other proposal so far seemed to me harder to do well than a ban on LLM code submissions.* If you have an idea how to make it work, now might be a good point to jump in.
*This doesn’t include analysis without code suggestions, for triage and diagnosis.
This overlooks that we already have a policy. “Keep the status quo exactly as is” is a third proposal, and wins by default.
To judge from results, there is so far no known case of AI-enabled license violations in the CPython code base, or of any other kind. I expect there are some anyway, possibly dating back years.
I don’t seek to change the intents of the current policy in any way, but to be more explicit about what they are, and especially for the benefit of less experienced developers than the core dev team.
There is one extremely visible case of AI-enabled claimed copyright violation (a 100% valid claim to my eyes) in the larger Python ecosystem: the chardet case. Which has nothing to do with verbatim text copying, and in which the putative violator was open about using Claude AI to entirely rewrite a code base under LGPL, and then slap the MIT license on Claude’s output.
But that kind of thing, blatantly in-your-face, can’t be stopped, and no proposal pretends to address it. (Your proposal doesn’t, because no core dev would merge clearly disclosed rewrites of an LGPL code base, whether or not AI was involved, regardless of any stated policy.)
There are various paid and free services that already try to work like that. I have no personal experience with them, though. My understanding is that they still require major human effort to use effectively: they can be useful for initial triage, but not for making final decisions. “False positives” are far too common.
For example, like many other OSS projects, Python does not ask contributors to assign their copyrights to the PSF, just to give the PSF permission to _re_license their work under other OSS licenses the Board unanimously approves. The contributor retains their copyright. Nothing at all “wrong” about them contributing the identical code to any other number of other projects too. Or to sell it. Or anything else they want to do. It’s still their work, to do with as they please. So not even literal duplication of strings of thousands of characters necessarily implies wrongdoing.
Sorting that out requires a level of contextual analysis beyond mere string searching (whether literal or fuzzy).
You mentioned several times that you are not a lawyer, so let it stay that way. Proposing policies should require at least a solid knowledge of the law, not gut feeling, guilt by association, vibes, or similar logical fallacies.
I don’t think “typically” is a credible assessment regardless of source. Best I can tell, AI assistants now produce billions of lines of code every month. We’re not seeing billions of claims of licensing/copyright violations. In the case of CPython, we’ve seen 0 so far. A researcher cherry-picking the 3 most obvious examples they’ve discovered is some cause for concern, but in context very far from “typical”.
They are qualitatively comparable: all “creative work” builds on memories the creator has accumulated over a lifetime. Humans can easily unwittingly copy too (copyright violations do not require literal copying, but the laws are open to interpretation; there is no “bright line”). “Hand-write code from scratch on their own” is just the end of a process, and, frankly, I know of almost no developer doing substantial work who doesn’t pause to consult colleagues, papers, and web sources numerous times along the way. Increasingly they also consult AI assistants. All can be sources of unintended copyright violations, although I expect AI is more likely to become one than lower-bandwidth sources.
It’s up to them to weigh cost/benefit by their own lights under the current policy, and I only seek to give them a little more explicit light to better inform their decisions.
I’m not claiming to “solve” anything. And I don’t think it can be “solved” - just mitigated by fostering a culture that encourages thought about responsible use of tools that simply are not going to go away.
If someone decides “oh, it’s all so complicated I’ll just never use AI at all!”, that’s fine too.
You’ve seen zero reported for now; studies suggest a steady rate.
Anyway, I guess I can merely sum up my thoughts:
It is my opinion that the following claims don’t seem backed up by the data:
1. LLMs work like humans,
2. humans plagiarize (without noticing) at a level remotely similar to LLMs,
3. automated plagiarism detectors are reliable enough to make up for a substantial change,
4. no reported plagiarism yet means you’ll be safe in the future,
5. there isn’t a steady stream of studies indicating plagiarism is recurring rather than cherry-picked incidents.
If you agree, then in my opinion a policy shouldn’t suggest that reviewer and contributor responsibility for tentatively encouraged LLM use is fair or will fix this.
If you disagree, which clearly you do, then I guess it is what it is.
So not even literal duplication of strings of thousands of characters necessarily implies wrongdoing.
Thanks. That’s a great observation. Are there other causes for false positives? If not, then maybe all that’s left is to have the tool provide the source of each duplication, and keep a list of sources we already have permission to use for comparison.
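That comparison step could be as simple as filtering a tool’s findings against a pre-approved allowlist. A hypothetical sketch; the report format, field names, and allowlist entries are all invented for illustration:

```python
# Sources the project already has permission to use (hypothetical entries).
ALLOWED_SOURCES = {
    "python/cpython",
    "psf/approved-snippets",
}

def flag_duplications(report: list[dict]) -> list[dict]:
    """Keep only duplication findings whose source is not pre-approved.

    `report` is an assumed tool output: one dict per finding, with a
    'source' key naming where the duplicated text was found and a
    'lines' key giving the length of the match.
    """
    return [hit for hit in report if hit["source"] not in ALLOWED_SOURCES]
```

Findings that survive the filter would still need human judgment, for the contextual reasons discussed above.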
What rate, specifically? And are they directly relevant to the OSS context? I, for example, don’t care about the rate of copying when bots are asked for lemon meringue pie recipes.
None of which I recall people making. Please “play fair”.
Very alike in some ways, very unalike in others.
I have no credible idea of rate in either case. “Not zero” is certainly true, though.
Nobody has even remotely suggested that.
The case to me is much clearer for widespread plagiarism in scientific journals, backed up by decades of studies published in highly regarded, peer-reviewed journals, with good replication rates by independent research efforts.
That context is very different, though. For example, “publish or perish!” isn’t yet a trope in OSS-world, although some people act like it is.
Not banning something is not the same as encouraging the thing.
I disagree most with mischaracterizations of what “the other side” is actually saying. Nuance matters.
I have no disagreement with the “high-order bit”: use of AI tools significantly increases some kinds of real dangers.
But it also offers benefits, and I give real value to that side of the tradeoffs too. Nuance.
There’s a ton of info about this already available on the web, and asking an AI assistant for pointers would be far more efficient for everyone than asking me.
I’m not a subject expert here. Tools already carve out exceptions for cases they know about, but specialized to context. For example, already noted that detection tools for scientific publications expect a lot of literal duplication in citations and conflict-of-interest disclosures. Perhaps curiously so, also in abstracts (summarizing “state of the art” is hard to do wholly creatively - and novelty would be counterproductive in that context).
But there are still a ton of false positives for humans to slog through. “Academic language” has its own rhythms. So does “programmer speak”.
I’ve noted before that Copilot is better than I am at detecting signs of AI origination, and by its own account of its work does almost nothing in the way of looking for text duplication. More “style analysis”. But I don’t know details.
My apologies, I was intending to sum up general arguments about LLMs contrasted with my own conclusion. I felt at least some of them reflected notions raised here.
Anyway, I’ll bow out here.
It continues to be my opinion that a ban is likely better for project health than both the current policy and the proposal, despite enforcement questions, etc. I don’t see the same clarity about LLM copying that you seem to.