Dictionaries should have key-based set operations

Perhaps it makes sense to move forward more slowly – similar to how the | operator was implemented for dictionaries.

While I feel that set operations are beautiful and elegant, I acknowledge that most are not that useful. Third party libraries like ubelt can be the home for opinionated implementations like that.

However, for core stdlib Python, I would love a & operator between dictionaries and sets that echoes the aforementioned mathematical elegance. I think it would be generally useful.

Like PEP 584, let’s move slowly. I’d like to make a new, actionable proposal: introduce something like:

from typing import Iterable


def __and__(self: dict, other: Iterable):
    """
    Example:
        self = {'a': 1, 'b': 2}
        other = {'b': 1, 'c': 3}
        __and__(self, other)  # -> {'b': 2}
    """
    common = self.keys() & other
    return self.__class__((k, self[k]) for k in common)

to the stdlib dictionary.
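For comparison, the same behavior can be sketched today as a free function (`dict_and` is a hypothetical helper name, not an existing API):

```python
def dict_and(d, other):
    # Hypothetical helper mirroring the proposed dict.__and__:
    # keep only the entries of d whose keys also appear in other.
    common = d.keys() & other
    return type(d)((k, d[k]) for k in common)

print(dict_and({'a': 1, 'b': 2}, {'b': 1, 'c': 3}))  # {'b': 2}
```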

Would there be support for that limited proposal?


That implementation would not work with defaultdict, as its first parameter is the factory function and the mapping is the second parameter. So it would need a specific override of the dict.__and__ method.
Although that’s possibly what already happened when | was added, and subclasses will likely need to port/copy some information about themselves to newly created instances.

Otherwise I’m all for the latter proposition and implementation, including treating right-hand dicts as mere iterables. It effectively ports to dicts what slicing does to lists, and that’s something I’ve wanted for quite some time.

Since I mentioned defaultdict, I see Counter hasn’t been mentioned in this thread: it’s a subclass of dict and it also implements & as an operator, and as you might expect, it works differently from what you propose. But its | already works differently from the dict version, so I think we can afford for & to also have separate meanings in dict and Counter.

Please read the thread above. Your code does not solve problems which are common for any other implementation.

  1. It produces a mix of keys from one dictionary and values from other dictionary.
  2. What about the order?

That implementation would not work with defaultdict, as its first parameter is the factory function and the mapping is the second parameter. So it would need a specific override of the dict.__and__ method.
Although that’s possibly what already happened when | was added, and subclasses will likely need to port/copy some information about themselves to newly created instances.

You’re right, my implementation neglects defaultdict. And it does look like it already has special handling for __or__, so I imagine the same thing would happen for any proposed implementation of __and__.
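For reference, defaultdict’s | operator (CPython 3.9+) does keep the subclass and its factory, which is presumably what an __and__ override would need to do as well:

```python
from collections import defaultdict

d = defaultdict(list, {'a': [1]})
merged = d | {'b': [2]}

# The union result keeps the subclass and its default_factory.
print(type(merged).__name__, merged.default_factory)
```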

Your code does not solve problems which are common for any other implementation.

Not quite sure what you mean here. I think we established that having __and__ between dictionaries and iterables would be a generally useful operation for getting a subset of a dictionary. Likewise, I think we should keep __sub__ between dictionaries and iterables in the discussion, because I’ve also found it useful to be able to remove a list of keys from a dictionary or ensure that some keys are not present in a dictionary.
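A minimal sketch of that __sub__, again as a hypothetical helper rather than a real dict method:

```python
def dict_sub(d, keys):
    # Hypothetical helper: drop an iterable of keys from a dict.
    # Keys that are absent from d are simply ignored.
    drop = set(keys)
    return type(d)((k, v) for k, v in d.items() if k not in drop)

config = {'foo': 1, 'bar': 2, 'baz': 3}
print(dict_sub(config, ['bar', 'missing']))  # {'foo': 1, 'baz': 3}
```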

  1. It produces a mix of keys from one dictionary and values from other dictionary.

I don’t think it really matters if the resulting keys aren’t identical objects, as long as they are equal to one another. In my ideal world - which I’m not proposing - the LHS dictionary should get priority for which key objects are taken, as that is where the values will have to come from. But if there is a performance benefit to mixing keys from the arguments, then implementing it that way is reasonable and will not be noticed in 99.9% of real-world cases, where dictionaries generally have string keys and users only compare them in the context of __eq__. Yes, it would be a new corner-case gotcha, but I think it would be limited to the realm of trivia rather than an issue that makes people’s lives harder.

  1. What about the order?

I would prefer that the order of the LHS dictionary was kept, but I really don’t think it matters in most cases where someone would use this. The easiest thing for the Python language to do is leave it undefined for a while (similar to what it is for sets) - go with the implementation that is fastest - and then consider specifying an order in the language spec after the operation has been in use for a while.


I’ve written a script that analyzes the properties of candidate implementations of the __and__ and __sub__ operations: https://github.com/Erotemic/misc/blob/main/tests/python/dict_set_discussion.py It also includes average times for each candidate implementation. I expect the times to not be that informative, given that a C implementation will likely be able to do better, but they’re probably correlated with what sort of properties we think these operations should maintain.

As people have pointed out, the above implementation of __and__ is not guaranteed to preserve LHS identity or order. However, for a minor performance penalty (e.g. 1300ns vs 900ns), we can preserve LHS order.
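The order-preserving variant iterates the LHS items rather than the key-set intersection; a sketch (function name is mine, for illustration):

```python
def dict_and_ordered(d, other):
    # Membership is tested against a set of other's keys, so the
    # iteration order (and key identity) comes entirely from the LHS.
    keys = set(other)
    return type(d)((k, v) for k, v in d.items() if k in keys)

print(dict_and_ordered({'b': 2, 'a': 1}, {'a': 0, 'b': 0}))  # {'b': 2, 'a': 1}
```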

The script currently produces the following table (pasted as an image to display coloring):

For each candidate implementation (denoted as op), it shows in how many random cases it preserved RHS order, LHS order, RHS identity, and LHS identity. It also gives an idea of the average runtime for each implementation at the top of each table.

EDIT: original table had a bug. It is fixed now.

see Counter hasn’t been mentioned in this thread: it’s a subclass of dict and it also implements & as an operator, and as you can expect it works in a different way than what you propose

I didn’t know Counter had & and -, but I went through some test cases and it works how I would expect it to. I think it makes sense to look at these examples to get an insight into previous design decisions where there could have been alternative decisions made.

    from collections import Counter

    # Union
    print(Counter({True: 2}) | Counter({1: 1}))
    print(Counter({True: 1}) | Counter({1: 2}))
    print(Counter({1: 2}) | Counter({True: 1}))
    print(Counter({1: 1}) | Counter({True: 2}))
    # Counter({True: 2})
    # Counter({True: 2})
    # Counter({1: 2})
    # Counter({1: 2})

    # Intersection
    print(Counter({True: 2}) & Counter({1: 1}))
    print(Counter({True: 1}) & Counter({1: 2}))
    print(Counter({1: 2}) & Counter({True: 1}))
    print(Counter({1: 1}) & Counter({True: 2}))
    # Counter({True: 1})
    # Counter({True: 1})
    # Counter({1: 1})
    # Counter({1: 1})

    # Difference
    print(Counter({True: 2}) - Counter({1: 1}))
    print(Counter({True: 1}) - Counter({1: 2}))
    print(Counter({1: 2}) - Counter({True: 1}))
    print(Counter({1: 1}) - Counter({True: 2}))
    # Counter({True: 1})
    # Counter()
    # Counter({1: 1})
    # Counter()

It looks like in the above cases, Counter prefers the LHS keys over the RHS keys. This is what I would prefer for a candidate dictionary __and__ operation.

EDIT: I updated my script to generate test cases for Counters and run the same analysis with the built-in and_, or_, and sub operators.

--- and_ ---
LHS_order  LHS_id  RHS_order  RHS_id
True       True    True       False     2523
True       True    True       True      7477
--- or_ ---
LHS_order  LHS_id  RHS_order  RHS_id
True       True    False      False     1848
True       True    True       False      675
True       True    True       True      7477
--- sub ---
LHS_order  LHS_id  RHS_order  RHS_id
True       True    True       True      10000

LHS order is preserved.


I think these matter if doing the operation in-place (__iand__), i.e. we don’t want to alter the dict more than necessary. So the implementation would look like so:

def __iand__(self, other):
    # Iterate over a snapshot of the keys, since we delete during iteration.
    for k in list(self):
        if k not in other:
            del self[k]
    return self

# or, alternatively:

def __iand__(self, other):
    to_pop = self.keys() - other
    for k in to_pop:
        del self[k]
    return self

I think these satisfy your two requirements. (I’m not sure which version would be faster)
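A quick check (with the methods written as standalone functions, names mine) that both variants end up with the same entries and leave LHS order intact:

```python
def iand_loop(d, other):
    # Variant 1: scan a snapshot of the keys and delete the misses.
    for k in list(d):
        if k not in other:
            del d[k]
    return d

def iand_setdiff(d, other):
    # Variant 2: compute the keys to drop up front with a set difference.
    for k in d.keys() - set(other):
        del d[k]
    return d

d1 = {'c': 3, 'a': 1, 'b': 2}
d2 = dict(d1)
print(iand_loop(d1, {'a', 'c'}))     # {'c': 3, 'a': 1}
print(iand_setdiff(d2, {'a', 'c'}))  # {'c': 3, 'a': 1}
```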

As for the normal __and__, I’m not convinced the two points you listed should be blocking.
For one, this is a new operator, so we could hypothetically warn that the resulting ordering is undefined when using it. People can’t complain that it messes up existing code relying on ordering: either they rely on it and they do it another way (or keep their current implementation), or they don’t care about ordering and key identity and all is fine on the & front.
In any case, the idea of turning to the set.__and__ implementation for inspiration had merit, and I think we can look back to this:

As I understand the objections made to that idea, the issue is with point 2b, because _Py_dict_lookup is private API. In that case, how about implementing __and__ as described for dict and subclasses, which will have no problem if I understand correctly, and only doing it the first way, regardless of the size of the left-hand dict, for other mapping types? In that case we keep order and key identity as wanted (right-hand-side rules), and we optimize the operation whenever possible. Win-win.

In line with @NeilGirdhar’s suggestion, I assert that the right operand of subtractive operations in dictionaries should be confined exclusively to sets[1]. This approach would promote a more intuitive and consistent structure for these operations.

When performing subtractive operations with dictionaries, several complexities arise, such as value priority and order. Using sets instead sidesteps this problem: the order and values must come from the dictionary, because sets have no order and no values corresponding to keys.

Allowing containers (a more general type than iterable) as the right operand can also lead to ambiguity, similar to what is encountered in loosely typed languages.

To illustrate these complexities, consider the following examples[2]:

# The following operations would be valid if containers were allowed as the right operands:
{} - []  # I think I've seen this on wtfjs...
asdict(mydataclass) & os.listdir('my_dir') - range(100)

# This dict-list operation is much slower (about 4 times) than the dict-set one.
dict.fromkeys(range(99950, 100050)) & list(range(100000, 100100))

# This operation is particularly confusing since it appears to intersect keys but doesn't do so.
{i: i * 2 for i in range(100)} & (i ** 2 for i in range(100))
# expected: {0: 0, 1: 2, 4: 8, 9: 18, 16: 32, 25: 50, 36: 72, 49: 98, 64: 128, 81: 162}
# result: {0: 0, 1: 2}

These examples highlight the potential confusion and performance issues that can arise when allowing containers as the right operands.
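The surprising result above comes from how `in` consumes a generator; this standalone snippet shows the mechanics:

```python
gen = (i ** 2 for i in range(100))
print(0 in gen)  # True: the generator stops right after yielding 0
print(1 in gen)  # True: the next value yielded is 1
print(2 in gen)  # False: 4, 9, ..., 9801 are all consumed looking for 2
print(4 in gen)  # False: the generator is now exhausted
```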

Set binary operations are deliberately limited to operate only between sets; that could be another reason not to permit this functionality on subtractive operations of dictionaries[3].

{1, 3} - range(2, 100)  # TypeError: unsupported operand type(s) for -: 'set' and 'range'
{1: 2, 2: 4}.keys() - [2, 3]  # This should be forbidden.

If we categorize the types of operands used in the operations, it looks like this:

  • Additive operations:
    • dict | dict
    • dict ^ dict
  • Subtractive operations:
    • dict - set
    • dict & set

It’s neat and consistent.

A Python implementation of this proposal (note that this implementation returns a plain dictionary, not NewDict):

from collections.abc import Set


class NewDict(dict):
    def __and__[K, V](self: dict[K, V], other: Set) -> dict[K, V]:
        assert isinstance(other, Set), 'Real implementation should raise a TypeError.'
        return {k: v for k, v in self.items() if k in other}

    def __sub__[K, V](self: dict[K, V], other: Set) -> dict[K, V]:
        assert isinstance(other, Set), 'Real implementation should raise a TypeError.'
        return {k: v for k, v in self.items() if k not in other}

    def __xor__[K1, V1, K2, V2](self: dict[K1, V1], other: dict[K2, V2]) -> dict[K1 | K2, V1 | V2]:
        return ({k: v for k, v in self.items() if k not in other}
                | {k: v for k, v in other.items() if k not in self})
        # or alternatively, (self - other.keys()) | (other - self.keys())

    # __rand__ = __and__  # Maybe implementing __rand__ makes sense?

Here are some use cases:

# Using set for the right operand
NewDict({i: i * 2 for i in range(100)}) & {i ** 2 for i in range(100)}

# Instead of allowing a dict as the right operand, pass its KeysView.
NewDict({i: i * 2 for i in range(1, 10)}) & {i: i * 2 for i in range(5, 15)}.keys()

# Examples @Erotemic proposed previously work well too.
def func(**kwargs):
    print(f'Called with: {kwargs = }')

    kwargs = NewDict(kwargs) 
    expected = {'foo', 'bar', 'baz'}

    config = kwargs & expected
    unexpected = kwargs - expected

    if unexpected:
        raise ValueError(f'Unexpected {unexpected = }!')

print('-- Call 1 --')
func(foo=1, baz=3)
print('-- Call 2 --')
func(bar=1, buz=3)

  1. collections.abc.KeysView is a subclass of collections.abc.Set, so whenever I say “set,” it includes KeysView. ↩︎

  2. By removing the assert statement from my NewDict implementation below, you can simulate what it would be like. ↩︎

  3. It’s unclear why dict.keys() was implemented to allow iterables as the right operand, unlike set; the operation should be limited to sets, just as it is for set. ↩︎

The wtfjs example has nothing to do with container operations, and is only weird because JS allows anything to be converted to a number for subtraction. So we can ignore that.

But fundamentally, there’s nothing wrong with removing a sequence of keys - whether that’s a set, a list, or a dict - from a dict. They would all behave exactly the same way. In fact, Pike does exactly that. The syntax is a bit different, but you can use any data type just fine.

> (["asdf": "aaaaa", "qwer": "qqqq", "zxcv": "zzz"]) - ({"asdf"});
(1) Result: ([ /* 2 elements */
              "qwer": "qqqq",
              "zxcv": "zzz"
            ])
> (["asdf": "aaaaa", "qwer": "qqqq", "zxcv": "zzz"]) - (["asdf": 111]);
(2) Result: ([ /* 2 elements */
              "qwer": "qqqq",
              "zxcv": "zzz"
            ])
> (["asdf": "aaaaa", "qwer": "qqqq", "zxcv": "zzz"]) - (<"asdf">);
(3) Result: ([ /* 2 elements */
              "qwer": "qqqq",
              "zxcv": "zzz"
            ])

This is a mapping (dictionary) minus an array (list), another mapping, and a multiset (set), and they all behave the same way.

There may very well be performance issues, but that’s why we have different data structures. There’s nothing wrong with supporting something on all data types and letting programmers decide whether it’s worth going for sets. But confusion issues? I dispute that; a collection of keys is a collection of keys, whether it’s a set or anything else.
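The Pike behavior is easy to mirror in Python with a hypothetical helper that accepts any iterable of keys:

```python
def dict_minus(d, keys):
    # Hypothetical helper: subtract any iterable of keys from a dict.
    # Iterating a dict yields its keys, so dicts work here too.
    drop = set(keys)
    return {k: v for k, v in d.items() if k not in drop}

base = {'asdf': 'aaaaa', 'qwer': 'qqqq', 'zxcv': 'zzz'}
print(dict_minus(base, ['asdf']))       # list of keys
print(dict_minus(base, {'asdf': 111}))  # another dict
print(dict_minus(base, {'asdf'}))       # set
# all three: {'qwer': 'qqqq', 'zxcv': 'zzz'}
```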
