Why integers are referring to same object when instantiated in a list/tuple?

Calling id(50000) in individual line returns different object - which is expected since they are mutable immutable.

id(50000) # 4769402608
id(50000) # 4769407472
id(50000) # 4769402672

But when called a tuple/list we do not get same objects,

id(50000), id(50000), id(50000) # 4769413456, 4769413456, 4769413456

Why?


Edit: I meant to say immutable.

1 Like

Integers are not mutable.

The short answer to every question like this is that there are not any useful guarantees, and that code that cares about the id of objects is usually wrong-headed. If you have mutable objects, then you will need to care about whether they are getting “shared” between two different places. Sometimes it is a good idea to share them deliberately. Sometimes there must be separate objects, so that they can update separately. But in a large amount of cases, it won’t matter - especially if you won’t actually mutate the objects.

The exact results will depend on implementation details. In CPython (it has been like this as far as I can recall, but as far as I know there is no guarantee of it staying this way): Since 50000 is not one of the “small” integers that get cached ahead of time, each separate line that says id(50000) would create an object for the 50000 integer, figure out its id (i.e., the address of the object in memory - keep in mind that this is “virtual” from the operating system), and then throw both things away afterwards. If you do this several times in a row, normally the results will be the same. The reason is that after each line is done and the memory is cleaned up, the overall structure of the memory - which addresses are in use - is the same. Therefore, when Python tries to allocate memory for the next integer, the operating system will probably tell it to use the same part of memory that it used last time. (It may do this, because the previous integer was “freed” - it doesn’t exist any more).

The same thing happens in a line like id(50000), id(50000), id(50000) in the interpreter: each 50000 doesn’t have to exist any more, after id was called on it - only the objects for the id results need to stay around. But those results can be separate integer objects, even though they have the same value. For example, I just got:

>>> xy = [id(50000), id(50000)]
>>> xy # the id() results have the same value....
[140443434156976, 140443434156976]
>>> id(xy[0]) # but they are separate objects:
140443434488528
>>> id(xy[1])
140443434156944

If we directly put the same integer into the list, though, Python can tell ahead of time that it’s the same integer, and reuse it:

>>> xy = [50000, 50000]
>>> id(x[0])
140443434156976
>>> id(x[1])
140443434156976

It couldn’t do this in the other version, because when Python figures out the code for building a list, it doesn’t consider what the id function will do. It has to assume that id could give something different every time, even for the same input. (After all, that’s exactly what happens if you use e.g. random.randint.)

1 Like

See also: oop - In Python, when are two objects the same? - Stack Overflow

What you’re seeing in this particular example is actually one of compilation units. Compare these very similar blocks of code:

>>> id(50000), id(50000), id(50000)
(140277446787856, 140277446787856, 140277446787856)
>>> id(50000)
140277446787856
>>> id(50000)
140277446787856
>>> id(50000)
140277446787856
>>> print(id(50000))
140277445526160
>>> print(id(50000))
140277445526224
>>> print(id(50000))
140277445525584
>>> for _ in range(3):
...     print(id(50000))
... 
140277446787856
140277446787856
140277446787856

Now let’s try those same examples but saved as a script file.

# ids.py
print(id(50000), id(50000), id(50000))
print(id(50000))
print(id(50000))
print(id(50000))
for _ in range(3):
	print(id(50000))

The result is:

$ python3 ids.py 
139824601347568 139824601347568 139824601347568
139824601347568
139824601347568
139824601347568
139824601347568
139824601347568
139824601347568

Why the difference??

Well, in the first place, Python makes absolutely no guarantees here. It’s perfectly acceptable for the IDs of indistinguishable integers to be the same, and it’s perfectly acceptable for them to be different. But CPython is trying its best to be efficient, so it follows the classic waste-reduction pattern:

  • Reduce: Use only the objects it actually needs - there won’t be an object representing 50000 unless you actually refer to it in your program or have it as the result of a calculation
  • Reuse: Whenever it’s reasonable and not too much effort, use the same object in multiple places
  • Recycle: When you’re done with an object, dismantle it into spare parts, notably memory. CPython does this intelligently, keeping track of “ready-to-go integer memory” (called a “free list”).

What you’re seeing here is a difference in the “Reuse” step. As it’s going through your source code and compiling it, CPython takes note of things like integers, tuples, and strings. If it sees the same one again, it’ll reuse the object, since there’s no way for it to make a difference, and it’s more efficient to reuse it. But in the interactive interpreter (the REPL), each line you type in is compiled separately and immediately run, so this feature can’t really kick in - it’s not worth the effort to go searching for other uses of a number. That’s why, when you do three of them in a single tuple, or compile a single loop that checks three times, they all have the same ID; the “Reuse” optimization is able to find those.

Incidentally, I mentioned tuples and strings just now. You can actually do the exact same trick with strings, and as long as they contain nothing but literals, tuples. Try this, both in a script and at the interactive prompt, and let me know what you see!

print(id((1,2,3)))
print(id((1,2,3)))
print(id((1,2,3)))
print(id("spam"))
print(id("spam"))
print(id("spam"))

I expect that, in a script, you would see the same ID for each tuple, the same ID for each string, but the tuple and string have different IDs. At the REPL, it’s theoretically possible to have one, two, or six different IDs shown, without violating Python’s rules (since none of the objects exist at the same time).

1 Like

I get three different value for above three cases on interpreter.

I get what you’ve predicted. Can I safely conclude interpreter execution is different than script execution? Does it also mean I am effectively changing control flow when dropping to python interpreter by calling pdb.set_trace()?

This is so strange. I am from Java background, and each new object invocation returns an unique references.

I wanted to investigate how far Python goes in the reuse step. First section with variable behaves as expected, but in the second part, why id comparison of object() passes by is comparison fails?

a, b = object(), object()
print(a is b) # False
print(id(a) == id(b)) # False
print(id(a)) # 4306304960
print(id(b)) # 4306304976

print(object() is object()) # False 
print(id(object()) == id(object())) # True
print(id(object())) # 4306304944
print(id(object())) # 4306304944

Yes, but more in the compilation phase than the execution phase (other than the part where values get printed out automatically). You can sometimes see a difference between running three separate lines in the REPL and running those same three lines grouped with an if True: or a loop or something (which forces them all to be compiled together).

Probably not significantly, though again, things compiled separately may not take notice of optimizations across the compiation unit.

Yep, it’s less efficient though, so Python tries to save memory!

This part’s separate so I’ll split it into a separate response.

What you’re seeing here is kinda subtle, but there’s an important part of the Python rules about IDs: objects are allowed to reuse the IDs of deleted objects. (Not all Python interpreters do, but it’s legal to do so.) IDs are unique among concurrently-existing objects. Here’s how each of those is executed:

print(object() is object())`
  1. Look up the name print. Hang onto that.
  2. Look up the name object and construct one. Hang onto that.
  3. Look up the name object (again) and construct one. Hang onto that.
  4. Compare the two things from step 2 and 3 and see if they’re the same.
  5. Print that out.

Note that, at step 4, both objects must exist.

print(id(object()) == id(object()))
  1. As above, look up print
  2. Look up id, we’ll need that.
  3. Look up object and construct one.
  4. Call the id function and hang onto its result.
  5. Okay, we’re done with that object, can throw it away.
  6. Look up id again (Python doesn’t assume that it’s still the same)
  7. Look up object and construct one
  8. Call id to get the ID of this object
  9. Cool, we’re done with that object now.
  10. Hey, those two integers, are they equal?
  11. Print out the result.

Note how, in this version, the two objects actually don’t exist concurrently. We’re completely done with the first one before the second one gets constructed.

Incidentally, if you want to find out what Python does (which I did as part of writing up those sequences, to make sure I didn’t bloop), you can use the dis module. It’s quite detaily but highly informative.

3 Likes

This is not completely true. For example, while == is unreliable for Java’s strings, there is some attempt to reuse string objects that represent the same value, and the ability to require that explicitly (the intern method).

Python is a little more powerful here. When the class is called, there are two different hooks: __new__, which actually figures out which object to use, and __init__, which sets the attributes. Normally, __new__ will fall back to the object.__new__, which has some internal implementation that actually allocates the memory (eventually, you reach a level that you can’t describe in pure Python). But some classes can override __new__ so that it just returns an already existing object. For example, the type of the special None object can make sure that it’s impossible to create more instances - None is as true of a singleton as is possible in Python.

In order to evaluate either is or ==, the left-hand side and right-hand side must both still exist. For object() is object(), there are two calls to object - Python does not reuse these, but makes a new instance for each call.

But for id(object()) == id(object()), the left-hand side and right-hand side of == are the results from id. Python is free to compute the entire left-hand side before it computes the right-hand side, or vice-versa. Suppose it computes the left-hand side; it creates an object, then creates the int to represent the id that this object has. But now it doesn’t need the object any more; it only needs the int. So the object can get garbage-collected. And when the right-hand side is evaluated, the new object for the right-hand side could be in the same place in memory that the first one was.

1 Like

Technically, Python is required to compute the entire LHS before computing the RHS, but relevant to this discussion, Python is free to then dispose of the temporary object and reuse its ID.

1 Like

I understand that in a script compiler can introspect and reuse objects. But then why we are getting different id values in the final three lines on this snippet? Is it because we are on a different scope and optimizer is not kicking in?

Which line in the disassembly indicates that? Here is a link showing disassembly for this particular code snippet :Dis This: Online Python Disassembler

Continuing here,

x = [(), (), ()]
print(id(x[0]), id(x[1]), id(x[2]))

x = ['', '', '']
print(id(x[0]), id(x[1]), id(x[2]))

x = [object(), object(), object()]
print(id(x[0]), id(x[1]), id(x[2]))

x = [[], [], []]
print(id(x[0]), id(x[1]), id(x[2]))

4400647152 4400647152 4400647152
4400610552 4400610552 4400610552
4377214912 4377214928 4377214896
4421491264 4421702976 4421703232

I understand the first two groups where all id values are the same - happening due to the Reuse step.

But why the same is not happening in the last two groups?

It isn’t a line in the disassembly. To fully understand when an object is disposed of, we need to track two things: the current stack, and the reference count of the object in question. (There are other considerations like reference cycles, but they won’t affect us here.) I’m going to use this function for the example:

def f():
    print(id(object()) == id(object()))

The disassembly in the version of Python I’m using (main branch as of a couple months ago - probably should update at some point but it’s not going to affect this) looks like this:

  1. RESUME (irrelevant to this discussion)
  2. LOAD_GLOBAL print - stack contains: [print]
  3. LOAD_GLOBAL id - stack contains: [print, id]
  4. LOAD_GLOBAL object - stack contains: [print, id, object]
    (Note that, at this point, these are simply the functions themselves; they haven’t been called yet.)
  5. CALL with 0 arguments. This removes the top entry from the stack, calls it, and pushes the result back onto the stack. Stack now contains: [print, id, object #12345]
    The object is created with just one reference. It’s never assigned to a name or attached to any other object; the only reference to it is the one right here on the stack.
  6. CALL with 1 argument. This removes the top TWO entries, they being the function and its argument, and again, pushes the result. Stack now contains: [print, 12345]
    The stack no longer carries a reference to the object, only to the print function and the integer 12345. Thus, there are no references remaining, and the object can be disposed of. This reference disposal is an inherent part of the call; during the function’s execution, the function itself needs a reference to the object, but after that, there aren’t any more.
  7. LOAD_GLOBAL id. Stack: [print, 12345, id]
  8. LOAD_GLOBAL object. Stack: [print, 12345, id, object]
  9. CALL, no arguments. Stack: [print, 12345, id, object #23456]
  10. CALL, 1 argument. Stack: [print, 12345, 23456]
  11. COMPARE_OP 72 (==). As a binary operator, this removes the top two objects from the stack, compares them, and pushes the result back. Since I’ve given them distinct IDs for convenience, they’re not equal. Stack now has [print, False]
  12. CALL, 1 argument. The word “False” is printed out, and the return value from print is pushed. Stack: [None]
  13. POP_TOP. Dispose of the top element of the stack. Stack: []
  14. RETURN_CONST 0 (None). Terminate the function, returning None.

So you can see here that after steo 6 and before step 9, neither object exists. The CPython interpreter will dispose of that first object the moment its reference count htis zero (once the call to id() is done with it), and that spot in memory can be reused. A different interpreter might store all objects in a massive sequential list, and again might reuse the slot; or it might assign IDs to objects only on an as-needed basis (if you ask for an object’s ID, it picks the next number and stores that on the object), and might never reuse IDs. By the rules of IDs, both of these are valid.

Note that if you change the function slightly, this is no longer the case.

def f():
    print(id(first := object()) == id(object()))

This function MUST return False. Its execution is extremely similar, but with one very important distinction: between steps 5 and 6 we have an additional bit of work.

  1. COPY. This duplicates the top stack element. Stack: [print, id, object #12345, object #12345]
  2. STORE_FAST into name first. This removes the top stack element and stores it in the given local variable. Stack: [print, id, object #12345].

The local variable is now its own reference to the object. Even though it’s not on the stack any more, it has to continue to exist. It won’t be disposed of until the function ends, or the variable is reassigned or del’d. Thus both objects exist concurrently, and their IDs have to be different.

3 Likes

The notion of reusing objects only works if they’re indistinguishable and immutable. This is true of tuples, and it is true of strings, and also a few other types, but it isn’t true of lists (if you have two empty lists, they’re still quite distinct, and appending to one is not the same as appending to another), and vanilla object()s exist solely for their identities, so they are also not reused. This kind of reuse requires deliberate interning; you can achieve this for your own objects by doing something like this:

class ReusableObject:
    _cache = {}
    def __new__(cls, value):
        if value in cls._cache: return cls._cache[value]
        obj = cls._cache[value] = super().__new__(cls)
        obj.value = value
        return obj

Given equal values, this will return identical objects (from cache). Example:

>>> x = 12345 / 2
>>> y = 12345 / 2
>>> x is y
False
>>> ReusableObject(x) is ReusableObject(y)
True

This is called “interning”, and is extremely useful when the objects in question might be costly to construct. For example, imagine constructing a DatabaseConnection object, where you identify the DB to connect to, but it guarantees to always return the same connection every time. All of the work goes into __init__, but it only gets called the first time you construct any particular connection.

2 Likes

I would say that integers are a better example than tuples here, since the elements of a tuple could themselves be mutable.

1 Like

Yep, but it’s equally true of all of them (with that caveat).