Should CPython intern all constant strings automatically?

Issue: Use interned versions of string constants if they're already present (Issue #140328, python/cpython)
PR: gh-140328: Use interned versions of string constants if they're already present, by albertedwardson (Pull Request #140688, python/cpython)

Hello everyone! I just wanted to discuss these changes here, since the issue itself is somewhat stale and has not received much attention.


Right now, CPython’s behaviour around automatic string interning is heuristic and selective. In addition, sys.intern() does not truly “fully” intern strings in all cases (see the comment referenced in the discussion of issue #140328). In particular, interning is tied to specific interpreter state and pools, and does not necessarily guarantee a single canonical instance across all contexts.

In the linked PR, constant strings that are already interned get reused across the interpreter’s intern pools when loading code objects. The idea is to avoid creating duplicate string objects when an interned version already exists. This reduces duplication and makes the behaviour more consistent with explicit sys.intern() calls.
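
To make the idea concrete, here is a rough Python-level model of what reusing already-interned constants at load time could look like. It is only a sketch under assumptions: the PR does its work in C inside the interpreter, and the `pool` dict and the helpers `reuse_interned_constants` and `string_constant` are hypothetical stand-ins, not CPython APIs.

```python
import types


def reuse_interned_constants(code, pool):
    """Return a copy of `code` whose string constants are taken from `pool`
    whenever an equal string is already present there.

    `pool` is a plain dict standing in for the interpreter's intern table
    (an assumption made for illustration only).
    """
    new_consts = []
    for const in code.co_consts:
        if type(const) is str:
            # Reuse the canonical object if one exists; never add new entries.
            const = pool.get(const, const)
        elif isinstance(const, types.CodeType):
            # Nested functions and classes carry their own constants.
            const = reuse_interned_constants(const, pool)
        new_consts.append(const)
    return code.replace(co_consts=tuple(new_consts))


def string_constant(code, value):
    """Fetch the string constant equal to `value` from a code object."""
    return next(c for c in code.co_consts if type(c) is str and c == value)


# Build the "already interned" string at runtime so it is a distinct object.
canonical = "".join(["hello, ", "world!"])
pool = {canonical: canonical}

first = reuse_interned_constants(
    compile("a = 'hello, world!'", "<first>", "exec"), pool
)
second = reuse_interned_constants(
    compile("b = 'hello, world!'", "<second>", "exec"), pool
)

# Both code objects should now refer to one string object.
shared = string_constant(first, "hello, world!") is string_constant(
    second, "hello, world!"
)
```

Strings that have no equal entry in the pool are left untouched, which is the difference between this kind of deduplication and interning every constant.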

It is also important to mention that in the free-threaded build, all constant strings are already interned. In other words, we already have a configuration where the behaviour is effectively “intern all compile-time constants,” while the default build still uses identifier-like interning. This difference raises the question of whether the two builds should continue to diverge in this respect.
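
For anyone who wants to check this on their own build, a small probe along these lines might help. It assumes CPython 3.13 or newer for `sys._is_interned()`, which is a private, implementation-specific function, so the sketch falls back to `None` when it is missing. The helper name `describe_constant_interning` is made up for this example.

```python
import sys
import sysconfig


def describe_constant_interning(source, value):
    """Compile `source` and report whether the constant equal to `value`
    ended up interned, together with the build flavour."""
    code = compile(source, "<example>", "exec")
    const = next(c for c in code.co_consts if type(c) is str and c == value)
    # sys._is_interned() is a CPython implementation detail; it may be absent.
    is_interned = getattr(sys, "_is_interned", None)
    return {
        # Py_GIL_DISABLED is set to 1 only on free-threaded builds.
        "free_threaded": bool(sysconfig.get_config_var("Py_GIL_DISABLED")),
        "interned": is_interned(const) if is_interned is not None else None,
    }


# An identifier-like constant and one that does not look like an identifier.
report_identifier = describe_constant_interning("x = 'spam'", "spam")
report_sentence = describe_constant_interning(
    "x = 'not an identifier!'", "not an identifier!"
)
print(report_identifier)
print(report_sentence)
```

Comparing the two reports on a default build and on a free-threaded build should show the divergence described above, although the exact result depends on the CPython version.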

And so my questions for discussion are:

  • Should Python’s automatic interning be expanded to all compile-time constants (as is effectively the case in the free-threaded build), or should it remain limited to identifier-like strings?

  • If not, is it still a valid and worthwhile idea to deduplicate already-interned strings at code object load time? In my opinion, it feels somewhat inconsistent to keep multiple “interned” instances of the same string value when a canonical version already exists.

I would be very interested in feedback on any trade-offs here, as well as whether there are semantic or architectural reasons to preserve the current divergence between builds.

CC @efimov-mikhail @sergey-miryanov @colesbury

As we discussed before, it is not obvious to me what benefits we get from interning all constant strings. If there are any, how does it change the performance of the interpreter?

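Interning helps when you do a lookup in a dict (for instance and class attributes, or global variables), because an interned key can often be matched by identity without comparing the string contents. There are no great benefits of interning error messages or docstrings.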
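
A small illustration of the dict case, as a sketch that relies on CPython implementation details rather than language guarantees: attribute names in compiled code are interned, so the key stored in an instance dict is the canonical object that `sys.intern()` returns for an equal string. The `Point` class and the variable names are invented for this example.

```python
import sys


class Point:
    def __init__(self):
        self.x_coord = 1


p = Point()

# The key as it is actually stored in the instance dict.
key = next(iter(vars(p)))

# An equal string built at runtime, which starts out as a separate object.
runtime_name = "".join(["x_", "coord"])

same_value = runtime_name == key
# On CPython, interning the runtime string should hand back the stored key,
# so a later dict lookup can match it by identity.
reused = sys.intern(runtime_name) is key

print(same_value, reused)
```

An error message or a docstring is never used as a dict key in this way, so it gains nothing from the identity shortcut.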

Interning all constant strings will blow up memory consumption for generated code. We already have this issue even with ASCII identifier-like strings, and it will be worse with all strings.

There are no great benefits of interning error messages or docstrings.

Right, but in the free-threaded build we already intern all constant strings.

Is that behaviour considered temporary or experimental, or is it expected to remain long-term?

Even without expanding interning to all constants, reusing an already-interned string instead of creating another copy seems like a nice improvement in terms of deduplication.