Demoting the `is` operator to avoid an identity crisis

gpshead · September 29, 2018, 6:54pm

After noticing that the `is` and `is not` operators have inadvertently been used thousands of times over the decades throughout our code base at work in lieu of `==` or `!=` comparisons to string literals and numbers… could we address this wart in the language itself or within the CPython VM?

I call it a wart because the code reads and writes wonderfully: `if value is 'thing':` just sounds logical. That it does not do what a reader blissfully unaware of identity vs. equality may expect is unfortunate, because it is so easy to read and write as English without realizing there is something to think about.

Object identity is an important concept, but it is not normally something someone needs to use. The common valid use cases in Python are `is None` and `is not None`, where it can be important to avoid triggering an object's `__eq__` or `__ne__` methods, which can (and often do) do the “wrong” thing. The other, less frequent use case for an identity check is comparing against custom singletons, usually a module’s named instance of `object()` or a dummy type or similar.
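To make the distinction concrete, here is a minimal sketch. The class `AlwaysEqual` and the sentinel `MISSING` are hypothetical names invented for this illustration; the point is only that `==` runs user code while `is` does not.

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


# A module-level sentinel of the kind described above.
MISSING = object()

thing = AlwaysEqual()

# Equality dispatches to AlwaysEqual.__eq__, so both comparisons "succeed".
equal_to_none = (thing == None)
equal_to_missing = (thing == MISSING)

# Identity never calls __eq__, so neither check is fooled.
identical_to_none = (thing is None)
identical_to_missing = (thing is MISSING)

print(equal_to_none, equal_to_missing)          # True True
print(identical_to_none, identical_to_missing)  # False False
```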

I don’t want a language breaking change! We can’t remove the is operator or have it blindly start behaving like ==.

An interesting approach pointed out by a colleague: PyPy gave up on `is` being only for identity in all situations, because so much existing code failed. That code used `is` to compare against immutable basic types and only “worked” despite itself, by virtue of CPython’s implementation detail of having singletons for widely used values. PyPy’s choice effectively normalized the CPython implementation-detail practice of `is` working for a known subset of comparisons. (TODO: dig into their code and see what logic they chose for this situation)

So has the ship sailed? I’m not convinced. We could alter is to behave differently when both sides are known basic immutable types, triggering an actual equality check. This could break some code but that should be rare - within reason for a normal feature release. What I think would be bad is ever triggering dunder method calls. I believe we’d only want to do this for our own known built-in basic immutable types, not offer it to user defined types or C extensions. The goal would be to work around the “identity crisis” whenever a VM happens to decide to use singletons some or all of the time for some set of values as we do for bytes, str, and int.
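As a rough model of that idea, written in pure Python rather than inside the VM: the function name `demoted_is` and the exact set of types are assumptions for illustration only. The exact-type check is what keeps subclasses, and therefore user-defined dunder methods, out of the equality path.

```python
# Assumed set of "known basic immutable types"; the real list would be
# a design decision, this one just mirrors the types named above.
_VALUE_TYPES = (int, str, bytes)


def demoted_is(lhs, rhs):
    """Sketch of an `is` that compares by value for exact built-in types."""
    # type(x) is T (not isinstance) excludes subclasses such as bool or
    # user-defined str subclasses, so no user __eq__ is ever called.
    if type(lhs) is type(rhs) and type(lhs) in _VALUE_TYPES:
        return lhs == rhs
    # Everything else keeps plain identity semantics.
    return lhs is rhs


print(demoted_is(int("100000"), int("100000")))  # True: equal ints
print(demoted_is([], []))                        # False: distinct lists
print(demoted_is(None, None))                    # True: same singleton
```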

`id(LHS) == id(RHS)` is effectively a slow replacement for `is`. It could be used in the rare case where someone actually wants to know whether "foo" was returned from a C API string-building function, and thus is a different object, or whether it is the same instance of the literal "foo" because it was generated by Python code within their module (today’s CPython VM implementation detail).
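A small sketch of that equivalence follows. Whether the two strings end up as the same object is an implementation detail, so the example prints the result instead of promising one; the only thing it relies on is that `is` and the `id()` comparison agree while both objects are alive.

```python
literal = "foo"
# Built at runtime; an implementation is free to return a new object
# or to reuse an existing one.
built = "".join(["f", "oo"])

same_value = (literal == built)
same_object = (literal is built)
same_id = (id(literal) == id(built))

print(same_value)            # True: the values are equal
print(same_object, same_id)  # implementation-dependent, but always equal to each other
```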

guido · September 29, 2018, 9:31pm

I’m guessing this is only a problem for int and str – and one could argue float, but float comparisons are known to be iffy – writing if x == 4.2 is not much saner than if x is 4.2.

I’m reluctant to drop the connection between is and id – this is old DNA and would likely break old code. (IIRC Jython also struggled with this.) Though it may be the only recourse we have other than the status quo.

Perhaps there’s a way that we can change the implementations of int and str to intern all values? This seems unlikely though – we don’t want to have to maintain a (weak!) table of potentially millions of integers just to solve this problem.
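For strings, explicit interning already exists as `sys.intern()`, which shows what "intern all values" would buy; there is no equivalent for `int`, as far as I know. A minimal sketch:

```python
import sys

# Two equal strings built at runtime; without interning they may or may
# not be the same object.
first = "".join(["hello ", "world"])
second = "".join(["hello", " world"])

# sys.intern returns the canonical copy from the interned-strings table,
# so equal interned strings are guaranteed to be identical.
first_interned = sys.intern(first)
second_interned = sys.intern(second)

print(first == second)                    # True
print(first_interned is second_interned)  # True
```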

Maybe static analysis is the way to go? Presumably that’s how you found this in your code base…
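As an illustration of how little such a check needs, here is a sketch built on the standard `ast` module. The function name `find_identity_literal_comparisons` and the set of literal types are assumptions of this example, not how pylint or any other tool implements it. It assumes Python 3.8+, where literals parse as `ast.Constant`.

```python
import ast

# Literal types for which `is` is almost certainly a mistake.
# None, True, False and Ellipsis are deliberately left out.
_LITERAL_TYPES = (str, bytes, int, float, complex)


def _is_value_literal(node):
    # type(...) in ... (not isinstance) keeps bool out of the int bucket.
    return isinstance(node, ast.Constant) and type(node.value) in _LITERAL_TYPES


def find_identity_literal_comparisons(source):
    """Return sorted (line, column) pairs of `is`/`is not` against a literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        # A chained comparison a OP b OP c has operands [a, b, c].
        operands = [node.left, *node.comparators]
        for index, op in enumerate(node.ops):
            if isinstance(op, (ast.Is, ast.IsNot)) and (
                _is_value_literal(operands[index])
                or _is_value_literal(operands[index + 1])
            ):
                hits.append((node.lineno, node.col_offset))
    return sorted(hits)


sample = (
    "if value is 'thing':\n"
    "    pass\n"
    "if value is None:\n"
    "    pass\n"
    "if count is not 3:\n"
    "    pass\n"
)
print(find_identity_literal_comparisons(sample))  # [(1, 3), (5, 3)]
```

Negative literals such as `-1` parse as a unary operation and are not caught by this simplified version.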

I’m sorry I don’t have anything more encouraging… I agree that is represents an attractive nuisance due to it being one of the shortest and most common English words.

storchaka · September 29, 2018, 9:53pm

If the problem is with `value is 'thing'`, it is easy to make the compiler emit a syntax warning if one of the arguments is a literal.
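To see what such a warning looks like from the outside, the sketch below compiles two snippets and records any `SyntaxWarning`. The helper name `syntax_warnings_for` is made up for this example. To my knowledge CPython 3.8 and later do emit a warning for the literal case; on older versions the list would simply be empty.

```python
import warnings


def syntax_warnings_for(source):
    """Compile `source` and return the text of any SyntaxWarning raised."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones already shown once.
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [str(w.message) for w in caught if issubclass(w.category, SyntaxWarning)]


literal_case = syntax_warnings_for("value = 'thing'\nresult = value is 'thing'\n")
none_case = syntax_warnings_for("value = None\nresult = value is None\n")

print(literal_case)  # non-empty on CPython 3.8+
print(none_case)     # []
```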

barry · September 29, 2018, 10:28pm

Yes, you do see it, but more common IME is `thing == None`, i.e. using `==` where `is` is more appropriate. I think this ship has sailed, and static analyzers and code reviews are the best way to educate inexperienced users. Most seem to get it once you explain it to them.

gpshead · September 30, 2018, 3:36am

But == None actually works in virtually all cases until you get into odd numpy types with strange comparators. So I don’t mind that as much despite is None being more technically correct.
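A sketch of the kind of "strange comparator" that makes `== None` misbehave. The class `Elementwise` is a hypothetical stand-in written for this example; it is not numpy and does not reproduce numpy's exact behavior, only the general idea of a comparison that returns something other than a single bool.

```python
class Elementwise:
    """Hypothetical container whose == compares element by element."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Returns a list of per-element results, not a single bool.
        return [value == other for value in self.values]


data = Elementwise([1, None])

comparison = (data == None)
print(comparison)        # [False, True]
print(bool(comparison))  # True: a non-empty list is truthy
print(data is None)      # False: the only answer that was wanted
```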

pylint flags all of these situations fwiw.

A SyntaxWarning (do we even have such a thing?) is an interesting idea but would probably be as universally hated as other import time warnings - showing up to code users rather than developers.

storchaka · September 30, 2018, 8:33am

https://bugs.python.org/issue34850

njs · September 30, 2018, 10:14pm

I asked about this in #pypy, and IIUC their system is:

They preserve the rule that x is y is equivalent to id(x) == id(y).
To preserve compatibility with CPython’s de facto behavior, they also make the rule that x is y is equivalent to x == y, in the following special cases: “ints, longs, floats, complexes, unbound methods, empty string/unicode/tuple/frozenset, single-char strings, single-char unicodes”
They do this by giving up on id(x) returning a pointer to the object itself. So for example, id(int) is (int << 4) | 1, and id(b"x") is (ord(b"x") << 4) | 11.

CPython has the additional constraint that people do assume that id(obj) returns an actual PyObject* (e.g. this is regularly abused in ctypes code). I guess PyPy’s approach mostly preserves this, basically as a variant of the tagged pointer trick: either id(obj) returns a PyObject* or it returns an odd value. I guess not too many people depend on id(int) or id(b"") returning a pointer to the actual object. (I certainly hope not.)
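The arithmetic in the two formulas quoted above can be modeled in a few lines. This is only a model of the scheme as described in this post, not PyPy's code; the function and constant names are invented here, and the tag values 1 and 11 and the shift of 4 are taken directly from the formulas.

```python
TAG_SHIFT = 4      # from "(int << 4)"
INT_TAG = 1        # from "| 1"
SPECIAL_TAG = 11   # from "| 11"


def modeled_int_id(number):
    """Model of the described id() for an int: derived from the value."""
    return (number << TAG_SHIFT) | INT_TAG


def modeled_single_byte_id(data):
    """Model of the described id() for a one-byte bytes object."""
    if len(data) != 1:
        raise ValueError("only single-byte values are modeled here")
    return (ord(data) << TAG_SHIFT) | SPECIAL_TAG


print(modeled_int_id(1))               # 17
print(modeled_single_byte_id(b"x"))    # 1931
# Equal values always get equal ids, so id(x) == id(y) matches x == y,
# and every modeled id is odd.
print(modeled_int_id(1000) == modeled_int_id(int("1000")))  # True
```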

guido · September 30, 2018, 10:29pm

That’s not bad!

storchaka · October 1, 2018, 5:16am

01.10.18 01:24, Nathaniel J. Smith wrote:

To preserve compatibility with CPython’s de facto behavior, they also make the rule that x is y is equivalent to x == y, in the following special cases: “ints, longs, floats, complexes, unbound methods, empty string/unicode/tuple/frozenset, single-char strings, single-char unicodes”

This is needed because these types don’t preserve identity in lists.


    x = 1
    a = [x]
    assert a[0] is x

PyPy uses a compact storage mode for lists of these types, so indexing such a list boxes the value. `a[0]` is then not the same object that was added to the list; it is a new int object. But Python requires that `a[0] is x` be true.

steve.dower · October 2, 2018, 1:50pm

Many of those who do are using it to spray the heap in the hopes of finding exploitable memory. Arbitrary IDs (like IronPython) are a good thing for security reasons. (IOW, I would gladly break the assumption that id() leaks memory addresses.)

njs · October 2, 2018, 8:18pm

Getting better security here would require breaking the id-to-pointer correspondence in general though, right, not just for int and single-character strings? And that seems hard, because I think there’s a lot of code that exploits the id-is-pointer trick for complex objects – it’s pretty much the standard way to peek under the covers with ctypes. E.g., here’s jinja2 using it to manipulate traceback objects: https://github.com/pallets/jinja/blob/7a6704db55fcd9bbe5a90c3868e03f5a5fcf176a/jinja2/debug.py#L350 (this particular example is avoidable in 3.7+, but you get the idea)
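For readers who have not seen the trick, this is its usual shape. It is a CPython implementation detail, which is why the sketch guards on the interpreter name; the helper name `object_from_id` is invented for this example.

```python
import ctypes
import platform


def object_from_id(address):
    """Recover a live object from its id(), relying on id() being a pointer.

    Only meaningful on CPython, and only while the object is still alive.
    """
    return ctypes.cast(address, ctypes.py_object).value


payload = {"key": "value"}

if platform.python_implementation() == "CPython":
    recovered = object_from_id(id(payload))
    print(recovered is payload)  # True
else:
    recovered = None
    print("id() is not assumed to be an address on this implementation")
```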

An opt-in lockdown mode that scrambles id return values and disables ctypes seems more viable.

guido · October 2, 2018, 8:44pm

IIRC in Google App Engine we scrambled the return value of id() for security reasons. But we only supported a small set of vetted extensions, so that kept it manageable.

nas · October 4, 2018, 6:14pm

FWIW, my tagged pointer experiment obviously breaks this. In my current implementation, id(1) is 0x3 if the value is tagged, and could be a real pointer address if the int is heap allocated. You can’t use id() to compare ints.

vstinner · October 22, 2018, 2:43pm

The id() function and the “a is b” operator are confusing many users, just as I was very confused by the “a === b” operator in PHP.

Maybe we can keep the id() function but make it less visible? For example, move the id() function from builtins to sys?

As for the `is` operator… changing it would be highly backward incompatible: “is None” is a common and recommended expression.

vstinner · October 22, 2018, 2:45pm

The “is” operator is also a common way to distinguish a func() call from func(arg=None), using a “sentinel” singleton object.
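A minimal sketch of that sentinel pattern; the names `_MISSING` and `describe` are placeholders chosen for this example.

```python
# A unique object that no caller can pass by accident.
_MISSING = object()


def describe(arg=_MISSING):
    """Tell an omitted argument apart from an explicit None."""
    # Identity is the right check here: nothing else can be this object,
    # and no __eq__ method of the argument gets a say.
    if arg is _MISSING:
        return "no argument"
    return f"argument: {arg!r}"


print(describe())          # no argument
print(describe(arg=None))  # argument: None
```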

barry · October 23, 2018, 12:20am

The id() function and the “a is b” operator are confusing many users

Is it? I don’t see a ton of confusion, although I do occasionally see people doing things like foo == None or assert bar is True. I guess the latter reads nicely, which is probably why people use it more than just assert bar.
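For reference, the two spellings are not interchangeable: `assert bar` accepts any truthy value, while `assert bar is True` accepts only the `True` singleton. A tiny sketch with a made-up value:

```python
bar = 1  # truthy, but not the True singleton

truthy = bool(bar)
identical_to_true = (bar is True)

print(truthy)             # True: `assert bar` would pass
print(identical_to_true)  # False: `assert bar is True` would fail
```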

That’s one reason why I like is and think using it to compare singletons makes sense. It just reads better.

uranusjr · November 13, 2018, 8:59am

I am not sure whether it qualifies as confusing, but from my personal experience teaching Python to first-timers, almost all people instinctively parse is as “equal”, and need to explicitly memorise its actual meaning in Python. The same goes for id(), but it is generally dismissed as “you don’t need to know it” (so making it non-global would make a lot of sense to me as well).