A shared object for identical lists

This is more of a general language design query.

I want to understand what the challenges are to ensure if we have:

x = [1, 2, 3]
y = [1, 2, 3]

Then ;

x is y == True just like strings.

I noticed Ruby and Javascript V8 like Python do not support this which makes me think there must be some challenges that made this feature to not be supported that I want to understand.

I have some ideas that I want to verify on this topic.

Why is using “is” important to you? The general suggestion is to not use it, and use == instead.

Am caring for object equality i.e id(x) should equal id(y) hence why “is” is important.

In your example, they aren’t the same object. For example, you can mutate one without it affecting the other. If you want them to be the same, just use “y=x”.

Yes, I know the semantics, I just want to understand if any one considered ensuring two identical lists referenced the same object in memory as a feature, if not what are the challenges from a general language design perspective?

To keep the current semantics you’d need to do copy on write. It doesn’t seem worth the hassle. What’s the benefit?

We would possibly have the same benefits as Strings, memory efficiency and other things I may not know about.

It seems like a lot of complications for a feature that wouldn’t provide any benefit to the average program. If you could demonstrate real software that would benefit, then maybe it’s worth considering.

1 Like

I don’t think this:

a = []
b = []
assert a is b            # fails today
assert id(a) == id(b)    # fails today
a.append(1)
assert b == []
assert a != b
assert a is not b
assert id(a) != id(b)

could ever actually work. It might be possible to share the C-level arrays of object pointers where they happen to be identical, but a and b are always going to have to be separate objects (a is b must always be False) so that a.append(1) doesn’t result in adding 1 to b. Any workaround for that would have to result in muddying what is and id() actually mean.

We don’t have this problem with str because str is immutable: any change to the string results in a new str object (though IIRC we cheat a bit there; if there is exactly 1 reference to the string in s = 'abc'; s += 'd', we will directly edit the object instead of creating a new one to assign to s).

1 Like

Good point, @zware. I agree that it can’t work with mutable objects, because then the id would have to change.

It occurs to me that this could be reasonable for tuple, even if it isn’t for list: I don’t see any semantic reason that the current

a = ()
b = ()
assert a is b   # succeeds today

couldn’t be extended to allow for

a = 1,2,3
b = 1,2,3
assert a is b    # fails today

for basically any hashable (maybe even unhashable?) tuple.

Whether that would actually result in any tangible benefit (presumably memory savings in the best case, but that might be at the cost of a memory penalty in the common case, and likely a speed penalty in all cases) is left as an exercise for the reader :slight_smile:

But to go further off the deep end into fevered-dreaming territory, if you were searching for every possible memory savings and this were implemented for tuple, list could essentially be reimplemented as a wrapper around a tuple with copy-on-write, such that in

t1 = (1, 2, 3)
t2 = (1, 2, 3)
l1 = [1, 2, 3]
l2 = [1, 2, 3]

t1 is t2 and t2 is not l1 and l1 is not l2, but there’s still only one underlying array holding references to 1, 2, and 3 shared between the 3 objects, whereas today there would be 4.

This really doesn’t seem like it would ever be worth the effort required to make it happen, but it might be fun to play with :slight_smile:.

Maybe even get dict_keys in on the action?

I’ll stop now :slight_smile:

1 Like

So if this were to implement COW, then an object’s identity would suddenly change based on an action later, while right now an object’s identity is guaranteed to be stable for the object’s lifetime. This can be important if I did something like compare to objects by identity and start ignoring one of them since the objects are the same. But if COW allowed me to change one that would break that identity equality. IOW identity becomes conditional on execution state and side-effects.

1 Like

I think that you may be working from a false assumption. There is no
guarantee that if x == y then x is y is also true for strings.

The behaviour here is implementation-dependent, so it will vary from
version to version. This is CPython 3.9:

>>> x = 'spam'
>>> y = 'spam'
>>> x is y
True

But:

>>> x is x.lower()
False
>>> y = 's' + x[1:]
>>> x is y
False

In addition:

>>> x = 'hello world'
>>> y = 'hello world'
>>> x is y
False

Identity of equal objects is not a language feature, it is an
implementation optimization that may or may not apply in any specific
circumstance.

Python the language only requires object identity for a handful of
built-in objects, such as None, Ellipsis, True, False. CPython the
interpreter currently may reuse certain objects, such as small ints,
some strings, and the empty tuple.

Extending that to mutable objects would be challenging. Either one has
surprising side-effects:

x = [1, 2, 3]
# ... later, in a different part of your code
y = [1, 2, 3]  # same object as x
y.append(4)
assert x == [1, 2, 3]  # Fails!!!

which is clearly unacceptable, or one would need copy-on-write, and that
brings its own challenges to do with object ids.

I believe that PyPy does something similar with lists. I’m not an expert
on PyPy, so I probably have the details wrong, but I understand that
there are times where a loop over a list of ints or floats can be
changed to operate over a low-level array of machine values, and that
the PyPy JIT compiler may actually destroy the Python list and recreate
it only when it becomes visible to the interpreter and other objects
again. To do so, it has to ensure that the list keeps the original ID.

I am probably explaining that very badly, maybe a PyPy expert can chime
in and word it better.

2 Likes

I agree its more of an optimization. Copy-on-write is the only option here. PyPy uses storage strategies, which allows for the layout of homogenous collections, unboxed in memory. I don’t know if this even helps.

It does not fail today

>>> a = 1,2,3; b = 1,2,3; assert a is b

if tuples are defined in the same unit of compilation.

In REPL every user input is compiled separately.

3 Likes

Operator is tests that both operands are the same object. Not different objects which happent to be equal, but the same object. And if you mutate that object, you will see changes via all references to that object.

The list display creates a new list object. You can mutate it without affecting all other existing list objects. Breaking this rule would break all Python code. It would mean that you have only one list object over all program, and any list operation works with that list object. Well, it would be interesting experience to work with such particular implementation of Turing machine, but it is not Python.

2 Likes

@nanjekyejoannah My intuition is that it’s very rare for two lists not only to have the same contents but to remain identical during the course of execution. Under that hypothesis, this optimization would probably be bear most costs than real-world benefits.

OTOH, it does make sense for immutable types such as strings.

1 Like

As a datapoint, some C++ library implementations used to have a copy-on-write std::string type (note: std::string is mutable in C++), but C++11 made such an optimization illegal.

Presumably, the C++ standards committee gathered existing data about the costs and benefits of a copy-on-write mutable datatype, and decided that copy-on-write wasn’t worth the hassle. Yet, C++ is clearly a performance-minded language.

3 Likes