Issue: Use interned versions of string constants if they're already present · Issue #140328 · python/cpython · GitHub
PR: gh-140328: Use interned versions of string constants if they're already present by albertedwardson · Pull Request #140688 · python/cpython · GitHub
Hello everyone! I just wanted to discuss these changes here, since the issue itself is somewhat stale and has not received much attention.
Right now, CPython’s behaviour around automatic string interning is heuristic and selective. In addition, sys.intern() does not “fully” intern strings in all cases (see the comment referenced in the discussion on issue #140328). In particular, interning is tied to specific interpreter state and pools, and does not necessarily guarantee a single canonical instance across all contexts.
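For readers less familiar with explicit interning, here is a minimal sketch of what sys.intern() does for strings built at runtime. The variable names are only illustrative, and the claim that the two joined strings start out as separate objects is a CPython implementation detail, not a language guarantee.

```python
import sys

# Build two equal strings at runtime so that neither one comes from a
# compile-time constant (and so neither is interned automatically).
a = "".join(["hello", " ", "world"])
b = "".join(["hello", " ", "world"])

# Equal by value; in CPython these are normally two separate objects,
# but that is an implementation detail, so it is printed, not asserted.
print("equal:", a == b)
print("same object before interning:", a is b)

# sys.intern() returns the canonical object for a given string value,
# so both calls hand back the same object.
canonical_a = sys.intern(a)
canonical_b = sys.intern(b)
print("same object after interning:", canonical_a is canonical_b)
```

This only covers the explicit sys.intern() path within one interpreter; it says nothing about how constants stored in code objects are treated, which is the part under discussion here.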
In the linked PR, constant strings that are already interned get reused across the interpreter’s intern pools when loading code objects. The idea is to avoid creating duplicate string objects when an interned version already exists. This reduces duplication and makes the behaviour more consistent with explicit sys.intern() calls.
It is also important to mention that in the free-threaded build, all constant strings are already interned. In other words, we already have a configuration where the behaviour is effectively “intern all compile-time constants,” while the default build still interns only identifier-like strings. This difference raises the question of whether the two builds should continue to diverge in this respect.
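To see which behaviour a given interpreter has, a small probe along the following lines may help. It is a rough sketch, not part of the PR: load_constant is a hypothetical helper introduced here, the sample string is arbitrary, and the printed result is expected to differ between builds and versions, so no output is shown.

```python
import sys
import sysconfig


def load_constant(value):
    """Compile a string literal and return the object stored in co_consts.

    Hypothetical helper for illustration only.
    """
    code = compile(repr(value), "<example>", "eval")
    return next(c for c in code.co_consts if isinstance(c, str) and c == value)


# Py_GIL_DISABLED is reported as 1 on free-threaded builds; on older
# versions the variable is missing and get_config_var() returns None.
free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Intern a non-identifier-like string first, building it at runtime so
# that the interned object does not come from a compile-time constant.
interned = sys.intern("".join(["not an ", "identifier!"]))

# Then load a code object that contains an equal string constant and
# check whether it reuses the already-interned object.
constant = load_constant("not an identifier!")
shares_object = constant is interned

print("free-threaded build:", free_threaded)
print("constant reuses the interned object:", shares_object)
```

Running this on a default build and on a free-threaded build should show whether the two diverge for a constant that does not look like an identifier.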
And so my questions for discussion are:
- Should Python’s automatic interning be expanded to all compile-time constants (as is effectively the case in the free-threaded build), or should it remain limited to identifier-like strings?
- If not, is it still a valid and worthwhile idea to deduplicate already-interned strings at code object load time? In my opinion, it feels somewhat inconsistent to keep multiple “interned” instances of the same string value when a canonical version already exists.
I would be very interested in feedback on any trade-offs here, as well as whether there are semantic or architectural reasons to preserve the current divergence between builds.