Adding the `is in` and `is not in` boolean operators: membership by `id()`

I just had a simple idea that I think would nicely fit in with the existing language, but wanted to get some community feedback before potentially formally pursuing this as my first proper PEP. So, here’s my PEPP (Python Enhancement Proposal Proposal) :stuck_out_tongue:

Thoughts/feedback/support appreciated!


Abstract:

For container types, we already have the membership test operators `in` and `not in`, which test for membership using equality or identity with the container’s members: `x in y` is equivalent to `any(x is e or x == e for e in y)`.

However, there may be situations in which one wants to test for membership strictly by identity. We propose new membership test operators `is in` and `is not in`, which check strictly by identity for container types: `x is in y` would be equivalent to `any(x is e for e in y)`. There would also be a new dunder method `__contains_id__()`, which allows the `is in` operator to be implemented in user-defined types.
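
To make the difference between the two membership tests concrete, here is a small sketch. Since `is in` is not valid syntax today, the helper `contains_by_identity` is a hypothetical stand-in for it, built from the `any(...)` expression above.

```python
def contains_by_identity(container, x):
    """Hypothetical stand-in for the proposed `x is in container`."""
    return any(x is e for e in container)


a = [1, 2]
b = [1, 2]      # equal to a, but a distinct object
items = [a]

print(a in items)                        # True: a is the stored object
print(b in items)                        # True: b == a, so equality matches
print(contains_by_identity(items, a))    # True: same object
print(contains_by_identity(items, b))    # False: equal, but not the same object

# The identity shortcut in `in` is why NaN is found even though nan != nan.
nan = float("nan")
print(nan == nan)                        # False
print(nan in [nan])                      # True
```

The NaN case shows why the existing operator is specified as "identity or equality" rather than equality alone.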

Motivation:

The combination of the `is` operator (as an identity test) and the `in`/`not in` operators (as membership tests) into the `is in`/`is not in` operators (as membership tests by identity) seems to be a natural merger of both the operator syntax and the abstract notion behind it, making this a near-seamless extension to the language from the programmer’s perspective.

Admittedly, however, the intended behavior of `x is in y` for builtin containers is already easily achieved with the short expression `any(x is e for e in y)` mentioned above. If syntactic neatness is insufficient cause to add a new feature, I should add that a new dunder method `__contains_id__()` would allow user-implemented types to override both `__contains__()` and `__contains_id__()` as they see fit. Having two separate definitions of “membership” assigned to keyword operators as needed may prove quite useful.

Backwards Compatibility:

No significant concern. The expressions use only existing hard keywords in new combinations; in current versions of the language the token sequences `is in` and `is not in` always raise syntax errors. All currently valid Python code would carry exactly the same meaning with the new operators implemented, with the sole exception of code making improper use of the currently undefined dunder method `__contains_id__()`.

Specification:

I’m just going to mirror / extend the existing section in the language reference on the `in` and `not in` operators for now; obviously a full PEP would need much more thorough specifications (including additional modifications to the sections describing dunder methods and the full language specification, as well as other things I’m sure I’m missing).

Additionally, all builtin container types would need a `__contains_id__()` method implemented in CPython, as would the abstract types in `collections.abc`. (Implementing `Container.__contains_id__(self, x)` as `return any(x is e for e in self)` may be sufficient.)
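
As a rough illustration of that default, here is a sketch of what such a fallback could look like as a mixin. `ContainsIdMixin` and `Bag` are hypothetical names for this example only; nothing like them exists in `collections.abc`, and the method has to be called explicitly because the operator does not exist.

```python
class ContainsIdMixin:
    """Hypothetical mixin giving any iterable a default __contains_id__()."""

    def __contains_id__(self, x):
        # Linear scan by identity, as suggested for the abstract types.
        return any(x is e for e in self)


class Bag(ContainsIdMixin):
    """Toy container used only to exercise the mixin."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)


key = object()
bag = Bag([key, [1]])

print(bag.__contains_id__(key))   # True: the very same object is stored
print(bag.__contains_id__([1]))   # False: an equal list, but a new object
```

A container with a smarter internal layout could override `__contains_id__()` to avoid the linear scan.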

6.10.2. Membership test operations

(Description of the `in` and `not in` operators)

6.10.2.1. Membership identity test operations

The operators `is in` and `is not in` test for membership strictly by object identity. `x is in s` evaluates to `True` if `x` is the same object as a member of `s`, and `False` otherwise. `x is not in s` returns the negation of `x is in s`. All built-in sequences and set types support this, as do dictionaries, for which `is in` tests whether the dictionary has a given key. For container types such as `list`, `tuple`, `set`, `frozenset`, `dict`, or `collections.deque`, the expression `x is in y` is equivalent to `any(x is e for e in y)`.

+++ Implementation details regarding string and bytes types are up for discussion.

For user-defined classes which define the `__contains_id__()` method, `x is in y` returns `True` if `y.__contains_id__(x)` returns a true value, and `False` otherwise.

For user-defined classes which do not define `__contains_id__()` but do define `__iter__()`, `x is in y` is `True` if some value `z`, for which the expression `x is z` is true, is produced while iterating over `y`. If an exception is raised during the iteration, it is as if `is in` raised that exception.

Lastly, the old-style iteration protocol is tried: if a class defines `__getitem__()`, `x is in y` is `True` if and only if there is a non-negative integer index `i` such that `x is y[i]`, and no lower integer index raises the `IndexError` exception. (If any other exception is raised, it is as if `is in` raised that exception.)

The operator `is not in` is defined to have the inverse truth value of `is in`.
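
The lookup order described above can be approximated in today's Python with plain functions. This is only a sketch under stated assumptions: `is_in` and `is_not_in` are hypothetical helpers standing in for the proposed operators, and the sketch relies on the built-in `iter()` to cover both the `__iter__()` case and the old-style `__getitem__()` case.

```python
def is_in(x, y):
    """Hypothetical emulation of the proposed `x is in y`."""
    # 1. Prefer __contains_id__(), looked up on the type like other dunders.
    method = getattr(type(y), "__contains_id__", None)
    if method is not None:
        return bool(method(y, x))
    # 2./3. iter() uses __iter__() if defined, otherwise the old-style
    #       __getitem__() protocol (indexes 0, 1, 2, ... until IndexError).
    #       It raises TypeError if y supports neither.
    return any(x is z for z in iter(y))


def is_not_in(x, y):
    """Hypothetical emulation of the proposed `x is not in y`."""
    return not is_in(x, y)


class OldStyle:
    """Defines only __getitem__(), so iteration uses the old protocol."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, i):
        return self._items[i]


key = object()
print(is_in(key, [key]))             # True
print(is_in([], [[]]))               # False: equal, but different objects
print(is_in(key, OldStyle([key])))   # True, found via __getitem__()
print(is_not_in(key, {key: 1}))      # False: key is one of the dict's keys
```

Exceptions raised by `__contains_id__()` or during iteration simply propagate, which matches the "as if `is in` raised that exception" wording.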

Open Issues:

  • `in` and `not in` are defined for string and bytes objects and search for substrings. However, any substring obtained by slicing, etc. would be a new object and could not have the same id as the left-hand operand, meaning `is in` and `is not in` could not perform a meaningful substring search. Potential options include:
    • Calling `x is in string` always returns `False`, and `x is not in string` always returns `True`, mirroring the fact that `x is string[slice]` will (to my understanding) always be false.
    • Attempting to call `x is [not] in string` raises `TypeError`.
    • Calling `x is [not] in string` returns the same value as `x [not] in string`.
  • The code `if x is in y: ...` mirrors natural English more closely than `if x in y: ...`; however, the latter will produce results which feel more “natural” to a beginner, whereas the former may produce unexpected results for a less experienced programmer.
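
On the string question, a short sketch of why identity is a shaky basis for substring tests. The value-based results below are guaranteed by the language; the identity results are deliberately left without expected output, because they depend on the implementation (in CPython, as far as I know, a full slice of a plain `str` can return the original object, and short strings may be cached).

```python
text = "hello world"
piece = text[0:5]

# Value-based checks behave the same on every implementation.
print(piece == "hello")    # True
print(piece in text)       # True: substring search

# Identity-based checks are implementation details, so no output is
# promised here; run them to see what your interpreter does.
print(piece is text[0:5])  # two separate slices of the same range
print(text[:] is text)     # full slice of the original string
print(text[0] is text[0])  # single characters
```

If any of the identity checks print `True`, the "always `False`" option would need to be specified as a rule for strings rather than derived from slicing behaviour.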

How often do we need to test for membership strictly by identity? How many occurrences of this idiom are there in the stdlib and other large projects?


Ehh, the PEP format is a whole lot of hassle for an initial proposal, but hey, whatever :slight_smile:

This is, in fact, my biggest concern here. The vast majority of situations do not require identity matching. In most programs, `is`/`is not` are only ever used when testing against sentinels (e.g. `x is None`), and those sentinels will generally only compare equal with themselves anyway; using `is` when `==` would have been more appropriate leads to weird data-dependent bugs such as `x is 5` appearing to work, but failing with `x is 500`. Having a comfortable and very English-like syntax for behaviour that is often going to be undesirable is, IMO, not a good tradeoff.
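
For anyone who hasn't run into the `x is 5` trap, a quick sketch. The integers are built with `int(...)` at runtime so that the compiler cannot merge equal literals; the identity results in the comments describe CPython's small-integer cache and are an implementation detail, not a language guarantee.

```python
a, b = int("5"), int("5")
c, d = int("500"), int("500")

# Equality is what the program almost always means, and it always works.
print(a == b)   # True
print(c == d)   # True

# Identity depends on whether the interpreter happens to reuse objects.
print(a is b)   # typically True on CPython (small integers are cached)
print(c is d)   # typically False on CPython (500 is outside that cache)
```

The same data-dependence would carry over to an identity-based membership test on a list of numbers.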

Rather than an operator, this might be better served by a utility function:

def contains(haystack, needle): # or switch the params if you prefer
    return any(x is needle for x in haystack)

which could be added to your personal library, tested out, and then potentially proposed for stdlib inclusion. I would love to see the use-cases for it, as they would make it a lot easier to discuss the benefits.


I am not convinced by the proposal either, but the disadvantage of such a `contains` helper function would be that it would need to iterate through the container, while some containers (like sets) could do better.

However, the `is in` operator would be of dubious help in the case of sets anyway.

>>> def contains(haystack, needle):
...     return any(x is needle for x in haystack)
...
>>> s = set()
>>> s.add(3.14)
>>> s.add(x:=3.14)
>>> s
{3.14}
>>> x in s
True
>>> contains(s, x)
False

Here `x` was added to the set, yet the set does not contain the object `x`; it contains the other, equal instance of `3.14` that was added first.

`any(x is e for e in y)` is occasionally useful for tree traversal, where child nodes or their containers are mutable sequences or mappings. Use of `in` and `==` would be incorrect in many of these cases (e.g. `traverse_list(child) if child in pruned_tree` has a significant chance of traversing the wrong child).
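
A minimal sketch of the kind of mix-up I mean, with nested lists standing in for tree nodes. `find_by_identity` is a hypothetical helper written for this example.

```python
def find_by_identity(children, node):
    """Return the index of `node` itself in `children`, or -1 if absent."""
    for i, child in enumerate(children):
        if child is node:
            return i
    return -1


left = ["leaf"]
right = ["leaf"]       # equal to left, but a separate mutable node
tree = [left, right]

print(tree.index(right))              # 0: equality finds `left` first
print(find_by_identity(tree, right))  # 1: identity finds the real node

# Mutating the node found by equality changes the wrong child.
tree[tree.index(right)].append("oops")
print(left)    # ['leaf', 'oops']
print(right)   # ['leaf']
```

Once the two children diverge, equality-based lookups start working again, which is what makes this class of bug so data-dependent.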

However, I don’t see the need for special syntax to do this. The dunder suggestion `__contains_id__()` does not appear to be motivating; at least, I don’t see the use case for overriding such a dunder, because `is` itself cannot (and should not) be overridden for good reason.


I think the suggested feature has the following natural (or necessary) counterpart: it then should also be possible to get/extract an element from a set and a key from a dictionary looking it up by equality. See this SO question.

I don’t feel like answering on SO and getting downvoted to oblivion for no reason, so I’ll just say it here: this sort of thing is best done with a dictionary that maps things to themselves.

_INTERN_CACHE = {}
def intern(obj):
    return _INTERN_CACHE.setdefault(obj, obj)

This will enforce that only one object with a given value exists. This implementation doesn’t differentiate by type (so `intern(12345) is intern(12345.0)`), but you could add that if you wanted.

However, it’s usually better to avoid needing to care about identity.


I do rarely mistype `is in` for `in`. I would absolutely hate for that to be a subtle bug instead of an instant `SyntaxError`.