Inconsistency in the output of id()

First have a look at the code:

comment = "Hello"
id(comment)
id("Hello")

The output (the reported memory location) is the same for both id(comment) and id("Hello"). But look at the second scenario:

remark = "Hello World"
id(remark)
id("Hello World")

Here, the output (the memory location) differs between id(remark) and id("Hello World").

In other words, when a single-word string is assigned to a variable, the memory location is the same whether we use the variable name or the literal value. But when a string of multiple words is assigned to a variable, the memory location is different for the variable name and for the literal value. Can anyone please explain why this happens?

Thanks in advance

Python source code gets compiled to a code object that contains bytecode, variable names, constants, and so on. It happens that string constants are automatically interned if they contain only ASCII alphanumeric and underscore characters. An interned string is referenced by an internal mapping in the interpreter state, which supports reusing a single string object instead of creating multiple objects for the same value. See the following functions in the CPython source: intern_string_constants(), all_name_chars(), and PyUnicode_InternInPlace().
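For illustration, here is a minimal sketch of that behaviour (CPython-specific; it relies on implementation details, not language guarantees). It compiles the two kinds of constants in separate code objects and checks whether the resulting string objects are shared:

ns1, ns2 = {}, {}

# 'Hello' contains only ASCII alphanumerics, so CPython interns the constant:
exec(compile("a = 'Hello'", '<first>', 'exec'), ns1)
exec(compile("b = 'Hello'", '<second>', 'exec'), ns2)
print(ns1['a'] is ns2['b'])   # True on CPython: one interned object is shared

# 'Hello World' contains a space, so it is not automatically interned:
exec(compile("a = 'Hello World'", '<first>', 'exec'), ns1)
exec(compile("b = 'Hello World'", '<second>', 'exec'), ns2)
print(ns1['a'] is ns2['b'])   # False on CPython: two separate string objects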

A string that contains a space will not be automatically interned. Here’s an example that manually interns the “Hello World” string.

>>> import sys
>>> s1 = sys.intern('Hello World')
>>> s2 = sys.intern('Hello World')
>>> id(s1) == id(s2)
True

That some strings are automatically interned is just an implementation detail. Python code generally should not compare strings by ID.

The ID number of an object is an implementation detail of the
interpreter. There is almost no good reason to look at or care about the
ID number of objects.

You should remember that the id() function does not return the memory
location of the object except by accident. It is an accident of the
implementation that CPython happens to use memory locations for ID
numbers, but that is not a language guarantee and if CPython ever
changes to a compacting garbage collector, that absolutely will change.

For example, the Jython and IronPython interpreters, which already have
compacting garbage collectors, use ID numbers that are simply sequential
integers 1, 2, 3, … And in PyPy, the interpreter does a lot of work to
preserve ID numbers that look like memory addresses even when the object
may have been unboxed into a machine value.

Another accident of implementation is that the interpreter might,
sometimes, cache small strings and reuse the same object multiple times.
In CPython, they may be reused if they look like identifiers. So when
you have two strings which look like identifiers, if the interpreter
caches them, you will get a single object and the ID number will be the
same:

s = "mynum"
t = "mynum"
id(s) == id(t)  # returns True

But another interpreter, say, Jython or IronPython, may have different
rules for caching strings, or no cache at all. For example, in Jython
2.7.1 the same code returns False.

So if you are writing portable code that will run under any version of
Python, any interpreter, you cannot rely on accidents of implementation
like string caching.

There is almost no good reason to care about the id() of objects, and
there is no reason to treat it as a memory address.

Back to CPython. Here’s another example where the interpreter doesn’t
cache the strings even though the string looks like an identifier (in
this case, a really long identifier):

s = "abcdef"*10000
t = "abcdef"*10000
id(s) == id(t)  # returns False in CPython 3.7

Now look at this:

s = "Hello World!!!"; t = "Hello World!!!"

If you run that line of code in the CPython interactive interpreter,
then id(s) == id(t) will return True. But it must be in the
interactive interpreter, and the two assignments must be on the same
line separated by a semicolon. If you put them on different lines, it
won’t work. And it doesn’t work in a script.
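This is because each line typed at the interactive prompt is compiled to its own code object with its own constants table, and constants are only reused within a single compilation (as explained further below). Here is a rough sketch of that; the exact contents of co_consts are an implementation detail:

c1 = compile("s = 'Hello World!!!'", '<line 1>', 'exec')
c2 = compile("t = 'Hello World!!!'", '<line 2>', 'exec')
print(c1.co_consts)   # typically ('Hello World!!!', None)
print(c2.co_consts)   # typically ('Hello World!!!', None)
# Two code objects, two separate (non-interned) string constants:
print(c1.co_consts[0] is c2.co_consts[0])   # False on CPython

# Compiled as one unit, the compiler reuses a single constant entry,
# so both names end up bound to the same string object:
both = compile("s = 'Hello World!!!'; t = 'Hello World!!!'", '<one line>', 'exec')
print(both.co_consts)   # typically ('Hello World!!!', None)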

Strings are immutable, so the Python interpreter, whether it is CPython,
MicroPython, IronPython, Jython, Stackless, PyPy, RustPython or some
other interpreter, is free to cache whatever strings it likes, whenever
it likes, for whatever reason it likes (saving memory, or speeding up
code, or both, or some other reason).

As a Python programmer, you cannot rely on strings being cached. The
rules for when they will be cached vary from version to version, and
from interpreter to interpreter, from platform to platform, and they can
change at any time with no warning.

Do not rely on strings having the same ID if they are equal.
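If what you actually care about is whether two strings have the same value, compare them with ==, which the language does guarantee. For example:

s = "abcdef" * 10000
t = "abcdef" * 10000
print(s == t)   # True on every interpreter: value equality is guaranteed
print(s is t)   # implementation-dependent: may be True or False, don't rely on it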

See Python: 3. Data model.

Note the result of running this in IDLE with Python 3.10:

f = "fizz"
b = "buzz"
fb = f + b
fzbz = "fizzbuzz"

print(fb)
print(fzbz)
print("equivalence: ", fb == fzbz)
print("identity: ", fb is fzbz)
print("id equivalence: ", id(fb) == id(fzbz))
print("id identity: ", id(fb) is id(fzbz))

Output:

fizzbuzz
fizzbuzz
equivalence:  True
identity:  False
id equivalence:  False
id identity:  False

The “abcdef” constant is interned, but the result of "abcdef"*10000 is too big for the compiler’s constant folding in CPython. Thus the expression has to be evaluated at runtime, separately for each assignment. Here’s the disassembled code:

>>> import dis
>>> code = compile(r'''
... s = 'abcdef' * 10000
... t = 'abcdef' * 10000
... ''', '', 'exec')
>>> dis.dis(code)
  2           0 LOAD_CONST               0 ('abcdef')
              2 LOAD_CONST               1 (10000)
              4 BINARY_MULTIPLY
              6 STORE_NAME               0 (s)

  3           8 LOAD_CONST               0 ('abcdef')
             10 LOAD_CONST               1 (10000)
             12 BINARY_MULTIPLY
             14 STORE_NAME               1 (t)
             16 LOAD_CONST               2 (None)
             18 RETURN_VALUE

If it were within the size upper bound for constant folding, both assignments would use the same pre-computed constant. For example:

>>> code = compile(r'''
... s = 'abcdef' * 3
... t = 'abcdef' * 3
... ''', '', 'exec')
>>> dis.dis(code)
  2           0 LOAD_CONST               0 ('abcdefabcdefabcdef')
              2 STORE_NAME               0 (s)

  3           4 LOAD_CONST               0 ('abcdefabcdefabcdef')
              6 STORE_NAME               1 (t)
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE
>>> code.co_consts
('abcdefabcdefabcdef', None)

For this case, id(s) == id(t) should be true in a script that’s executed in CPython. This case is unrelated to automatic interning (which acts at the interpreter level across all code objects). Since the string has a space in it, the string object doesn’t get interned automatically. Instead the behavior in this case is simply due to constant reuse within a single code object. For example:

>>> code = compile(r'''
... s = 'Hello World!!!'
... t = 'Hello World!!!'
... ''', '', 'exec')
>>> dis.dis(code)
  2           0 LOAD_CONST               0 ('Hello World!!!')
              2 STORE_NAME               0 (s)

  3           4 LOAD_CONST               0 ('Hello World!!!')
              6 STORE_NAME               1 (t)
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE
>>> code.co_consts
('Hello World!!!', None)

Exactly.

The interpreter can make whatever decisions it likes about which strings get interned. In this case, the decision is that the string is too big for constant folding, and so it is evaluated at runtime, not interned.

But it isn’t within the size bound now. Next version of Python, who knows what will happen?

Maybe the peephole optimizer will be removed and no strings at all will be interned. Maybe even huge strings of a million characters will be interned. Maybe only words containing the letter “X” will be interned. (Probably not any of those things, but you never know…)

The point is that all of these things are implementation details which can and will change from one interpreter to another, and from one version to another. They are not language features, and we must not rely on them.

(I know that Eryk knows these things, I’m just repeating it for the benefit of anyone else reading this thread.)

That’s interesting, because I have run examples where that has failed, but now I can’t replicate it (except in Jython).

In any case, all of these things (constant reuse, the peephole optimizer, interning) are features of an implementation, not of the language.

This is a very interesting matter for experimentation, but of course in a production environment we all know to depend only upon what is guaranteed, which in this case is equivalence (==).

Thank you, guys. I think I now have a pretty good idea of the concept.
As @Quercus mentioned, it’s an interesting matter for experimentation,
so I will try some more twists and tricks. Thanks once again for all your explanations.