I am concerned about LLM code in Python

This is beginning to resemble an association fallacy.

It seems that I’m either miscommunicating, or else we disagree strongly.

I’m not saying “we shouldn’t discuss amending the policy”; I’d just prefer that we figure out what we’re amending it to do before we start trying to make changes.

I’m aiming to:

  1. Proactively educate contributors on the special dangers the use of AI tools poses to the project. This isn’t aimed at experienced contributors, but at the many well-intentioned contributors we get who may simply not know any better yet. The seeming supreme certainty AI tools speak with can be seductive to the less experienced, who don’t yet have caution baked into their bones.

  2. Encourage a culture of AI-use disclosure.

Not ambitious, not Draconian, not claiming to solve anything “once and for all”. Incremental improvements to a policy I already approve of.

What, e.g., are you aiming at? If you don’t have anything specific in mind, that’s “meta” too :wink:

Because I don’t think this thread is on course to result in policy change[1], I’m interested in making sure that the conversation actually has some outcome other than just burning out the folks who read it and participate.

Ideally, anyone who reads it will have a better idea of what issues people who participated here consider important. If a policy change is analogous to a diff, I’m trying to get at the issue description which motivates that diff.


  1. I don’t ask anyone to agree on this; only please understand that’s where I’m coming from. ↩︎

I think it was self-evident what the OP wanted to accomplish (“ban AI use”), and nearly as evident what I wanted, which you prompted me to flesh out explicitly in a later post (my brief “two aims” follow-up); I agree that was worth writing up. There’s also the issue report I opened against the devguide repo, which I don’t think I mentioned here before.

These are all perfectly ordinary steps for anyone who wants to make any kind of change. The “meta” stuff is just a distraction from the process and the underlying issues. Yes, most proposals of all kinds fail in the end. That’s no reason, though, to “spare” people from hearing them. If you don’t like one, that’s fine, but please oppose it for what it says (or fails to say that you think must be said).


Coming at this from the angle of plagiarism, and doubting that there’s any “safe” length for copied code, the natural result is a strict policy suggestion. Sorry if it seems intense.

I’m happy some people seem to think the discussion has been helpful :slightly_smiling_face: despite the differing opinions.

FWIW, this seems unfulfillable to me for most contributors, while “don’t use LLMs” seems fulfillable.

Just to explain why my suggestion differs, even though I share the same concerns.

FWIW, this seems unfulfillable to me for most contributors, while “don’t use LLMs” seems fulfillable.

It’s currently unfulfillable but forward-looking. In essence it states why the current generation of LLMs presents a legal risk to the project, without excluding the possibility that some LLMs in the future may address their predecessors’ shortcomings. Further, it demonstrates to LLM providers the concerns projects have with adoption of their technologies, and indicates what features would make them acceptable.

Other projects I participate in have taken this same approach. Instead of saying contributors can’t use LLMs, they explain the risks without assuming that those risks will persist indefinitely.

I feel like “forward-looking” would mean a ban until a real, existing LLM with better licensing can be pointed to. (Apologies if this seems redundant.) But perhaps it’s really just me.

I feel like “forward-looking” would mean a ban until a real, existing LLM with better licensing can be pointed to.

It’s effectively a ban, yes, but a ban which lays out a roadmap for ending the ban. Rather than proscribing a particular tool, it proscribes detrimental behaviors which those tools happen to encourage at the moment (but perhaps won’t always).

Contributing to Open Source has always been scary if you listen “too much” to lawyers :wink: Contributors have always been responsible (morally and legally) for ensuring their contributions are free of copyright and license violations.

Which is unfulfillable for just about everyone, and always has been. I do not, e.g., pay an IP firm to check my contributions for potential violations, and the PSF doesn’t either. Never did, almost certainly never will.

My suggested wording reminds people of that responsibility, emphasizing “including material produced with the assistance of AI tools”. Newer contributors in particular may well not be aware that using bots actually increases the chances of unintended violations.

Contributors bear primary responsibility regardless, with or without bots. It’s up to them to judge the tradeoffs. It’s also on the PSF to perform due diligence in review. “Don’t use bots, period” is also partly fantasy: saying something is banned doesn’t make it stop.

I would much rather people be open about their use than try to hide it. These tools can be wonderfully helpful, but they take extra care to use responsibly.

Sorry, I don’t get what you mean.

Are you saying if people hand-write code from scratch on their own, you think that typically causes plagiarism issues? (Or in any way comparable to LLMs?)

I talked about this here.

Yes, trivially. Every contributor is able to claim that they wrote the code from scratch (whether they did or not) and on their own (which doesn’t really exist). So this universe is a superset of content created by an LLM and acknowledged as such.

I still don’t get it, sorry! What do you mean with “claim” here?

To declare it to those reviewing.

I don’t recall any policies or concerns when centralized code repositories, Stack Overflow, digital books, and similar resources became widely available. What is so different about an algorithm generating a stream of text in this context?

If people can plagiarize even when writing code from scratch, then LLMs aren’t really the issue here; the problem is how people use any tool.

I was arguing about a well-intentioned contributor.

Of course people can always plagiarize on purpose, but shouldn’t somebody following the policy be safe from being lured into a legal trap? (Not that I would know for sure, IANAL.)

With manual code, one wouldn’t typically unknowingly steal. Or am I missing something?

I think you’re assuming that a person working “on their own” is fundamentally different from an LLM.

  • I go read a book to learn algorithm A;
  • implement it “from scratch”, without looking at the book;
  • don’t know that algorithm A really refers to a meta-algorithm, and that what the book presented contained multiple specific choices within that meta-algorithm that could have been made differently;
  • in the resulting from-scratch implementation, I repeat those same choices.

I don’t know whether that’s plagiarism or not; the reviewer asked me to make some changes because it looked too much like it came from that book.

Well, you keep using words that carry a judgment, but learning (both human and machine) fundamentally involves imitation, and some things allow room for modification while still accomplishing the same goal while others don’t. The degree of similarity that would be judged “stolen” is not the same for, say, a function doing binary search as for the design of a UI.
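To make the binary-search point concrete, here’s a minimal sketch (the function names and the two variants are my own, purely illustrative): two independently written implementations differ only in one “specific choice”, inclusive versus half-open bounds, yet read almost identically, because the algorithm leaves so little room for variation.

```python
# Hypothetical illustration: two binary searches written independently.

# Variant 1: inclusive upper bound.
def bsearch_inclusive(items, target):
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Variant 2: half-open interval -- a different "specific choice"
# within the same meta-algorithm, yet near-identical code.
def bsearch_half_open(items, target):
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if items[mid] < target:
            lo = mid + 1
        elif items[mid] > target:
            hi = mid
        else:
            return mid
    return -1

data = [2, 3, 5, 7, 11, 13]
print(bsearch_inclusive(data, 7), bsearch_half_open(data, 7))  # both return index 3
```

Neither author “copied” the other, yet a reviewer comparing the two would see essentially the same code; for tightly constrained algorithms like this, similarity alone tells you little about provenance.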

I’m not a lawyer, but I thought I’d heard that algorithms cannot be copyrighted; ask somebody smarter than me and assume I’m wrong. Anyway, I’m therefore assuming this example is probably not relevant, until somebody smarter tells me otherwise (feel free to).

I’m saying plagiarism because that’s what many sources claim LLMs do, letter by letter. This isn’t the same as just replicating an algorithm.

With manual code, one wouldn’t typically unknowingly steal. Or am I missing something?

I have absolutely reviewed patches in the past from well-meaning contributors who thought that “manually writing code” meant copying and pasting examples from online forums, then maybe slightly tweaking them and gluing them together. It’s an education problem. Even when they agree to the DCO or a binding legal contract like a CLA, they don’t necessarily realize that the way they “learned to write code” was inherently problematic, copyright-wise.