Dictionaries should have key-based set operations

Erotemic · June 24, 2023, 3:41am

In another thread (Scientific Utilities - Contributor & Development Discussion - Scientific Python) I was encouraged to submit some of my ideas to the Python discussion forum. So that’s what I’m doing. This post is about my thoughts on extending Python’s dictionary.

I have a strong opinion that dictionaries in Python should have key-based set operations. I have a baseline implementation in my ubelt library call SetDict - ubelt.util_dict module — UBelt 1.3.2 documentation

I spent a lot of time writing the docs, so I’ll just paste that description:

    """
    A dictionary subclass where all set operations are defined.

    All of the set operations are defined in a key-wise fashion, that is it is
    like performing the operation on sets of keys. Value conflicts are handled
    with left-most priority (default for ``intersection`` and ``difference``),
    right-most priority (default for ``union`` and ``symmetric_difference``),
    or via a custom ``merge`` callable similar to [RubyMerge]_.

    The set operations are:

        * union (or the ``|`` operator) combines multiple dicttionaries into
            one. This is nearly identical to the update operation. Rightmost
            values take priority.

        * intersection (or the ``&`` operator). Takes the items from the
            first dictionary that share keys with the following dictionaries
            (or lists or sets of keys).  Leftmost values take priority.

        * difference (or the ``-`` operator). Takes only items from the first
            dictionary that do not share keys with following dictionaries.
            Leftmost values take priority.

        * symmetric_difference (or the ``^`` operator). Takes the items
            from all dictionaries where the key appears an odd number of times.
            Rightmost values take priority.

    Note:
        The reason righmost values take priority in union /
        symmetric_difference and left-most values take priority in intersection
        / difference is:

            1. intersection / difference is for removing keys --- i.e. is used
            to find values in the first (main) dictionary that are also in some
            other dictionary (or set or list of keys), whereas

            2. union is for adding keys --- i.e. it is basically just an alias
            for dict.update, so the new (rightmost) keys clobber the old.

            3. symmetric_difference is somewhat strange. I'm don't have a great
            argument for it, but it seemed easier to implement this way and it
            does seem closer to a union than it is to a difference. Perhaps
            unpaired union might have been a better name for this, but take
            that up with the set theorists.

        Also, union / symmetric_difference does not make sense if arguments on
        the rights are lists/sets, whereas difference / intersection does.

    Note:
        The SetDict class only defines key-wise set operations.  Value-wise or
        item-wise operations are in general not hashable and therefore not
        supported. A heavier extension would be needed for that.

    TODO:
        - [ ] implement merge callables so the user can specify how to resolve
              value conflicts / combine values.

    References:
        .. [RubyMerge] https://ruby-doc.org/core-2.7.0/Hash.html#method-i-merge

    CommandLine:
        xdoctest -m ubelt.util_dict SetDict

    Example:
        >>> import ubelt as ub
        >>> a = ub.SetDict({'A': 'Aa', 'B': 'Ba',            'D': 'Da'})
        >>> b = ub.SetDict({'A': 'Ab', 'B': 'Bb', 'C': 'Cb',          })
        >>> print(a.union(b))
        >>> print(a.intersection(b))
        >>> print(a.difference(b))
        >>> print(a.symmetric_difference(b))
        {'A': 'Ab', 'B': 'Bb', 'D': 'Da', 'C': 'Cb'}
        {'A': 'Aa', 'B': 'Ba'}
        {'D': 'Da'}
        {'D': 'Da', 'C': 'Cb'}
        >>> print(a | b)  # union
        >>> print(a & b)  # intersection
        >>> print(a - b)  # difference
        >>> print(a ^ b)  # symmetric_difference
        {'A': 'Ab', 'B': 'Bb', 'D': 'Da', 'C': 'Cb'}
        {'A': 'Aa', 'B': 'Ba'}
        {'D': 'Da'}
        {'D': 'Da', 'C': 'Cb'}

Python dicts already has the union “|” operator. It would be really nice to get the rest of them. Set notation is so concise and expressive. It hurts to type {k: v for k, v in d.items() if k in s} when I could just type d & s. Is it possible to get a PEP started for this?

hauntsaninja · June 24, 2023, 3:51am

Note that dictionary key views do implement the full set interface, so you can do d.keys() & s.keys()

Erotemic · June 24, 2023, 4:05am

Yes, it’s good to mention the fact that the dict_keys and dict_values already support most of the proposed operations. I think it makes sense to go the extra step and just give the set operations to the dictionary itself.

In terms of precedence, I think the fact that this already exists in the explicit case and the fact that we can write k in d and also k in d.keys() is a good reason to allow k in d & s and also k in d.keys() & s.

jamestwebber · June 24, 2023, 4:57am

This came up in PEP 584, perhaps the discussion from that PEP would be useful (linked there).

Erotemic · June 30, 2023, 4:11am

I have seen PEP 584 before, and I’ve just included it in the ubelt.SetDict references. As far as I can tell the pep just mentioned that the rest of the set operators were beyond the scope of PEP 584 due to the question of how to handle values in dictionary intersections.

The pep does mention that it might be defined as “Last seen wins”, but I think there are better arguments for breaking consistency with union and using “first seen wins” for set intersection. While it is not homogenous there is a symmetry in this view:

intersection and difference take left most priority because they are subtractive operations.
union and symmetric_difference take right most priority because they are additive (modulo 2 for ^).

Note: I’ve also formulated symmetric_difference as the nary implementation when used as a method. (which is the same as chaining binary ^ ops).

The example in the PEP in my implementation is:

In [1]: d1 = {"spam": 1, "eggs": 2}
   ...: d2 = {"ham": 3, "eggs": 4}

In [2]: d1, d2 = ub.udict(d1), ub.udict(d2)

In [3]: d1 & d2
Out[3]: {'eggs': 2}

The reason this implementation makes sense is to allow for the far more useful pattern: when the second argument is a set, and not a dictionary.

In [4]: d2 = {'ham', 'eggs'}

In [5]: d1 & d2
Out[5]: {'eggs': 2}

This is the modivating case for this proposal. The pattern of being able to intersect kwargs with expected values and then subtract them out to get the unexpected values is something that I believe is lacking. Consider the utility of the following example in no-external-import Pure-Python code.

import ubelt as ub

def func(**kwargs):
    print(f'Called with: {kwargs=}')
    kwargs = ub.udict(kwargs) 
    expected = {'foo', 'bar', 'baz'}
    config = kwargs & expected
    unexpected = kwargs - expected
    print('unexpected = {}'.format(ub.urepr(unexpected, nl=1)))
    if unexpected:
        raise ValueError(f'Unexpected {unexpected=}!')

print('-- Call 1 --')
func(foo=1, baz=3)
print('-- Call 2 --')
func(bar=1, buz=3)

Yes you could wrap things with sets and use keys, but the pure version of the code without ubelt is very ellegant - but also currently unsupported:

def func(**kwargs):
    expected = {'foo', 'bar', 'baz'}
    config = kwargs & expected
    unexpected = kwargs - expected
    if unexpected:
        raise ValueError(f'Unexpected {unexpected=}!')

And if you did want the values, it’s a lot more code than just set(d.keys()). That being said this brings me to another reason why I think this needs to be in stdlib instead of 3rd party library. Even with dictionary comprehensions, selecting a subset of a dictionary is tedious. With dictionary intersection (that also supports sets - which I think is key to this proposal) selecting a subset of a dictionary is trivial. Compare:

# Case 1
{k: v for k, v in d.items() if k in s}

# Case 2
d & s

Case 2 seems much easier to express, and I believe it makes intuitive sense if you are already used to set operations.

I realize I have to differentiate this from the other proposals with unclear foundations. I think there are solid foundational arguments for why this implementation of nary-dict/set operators is “the right one” (or at least pretty close, I do think my handling of values for set intersection is the most reasonable default, but other handling of values is valid, and I think they can be implemented using ideas from ruby’s merge).

WRT to the discussion on PEP 584, was there a particular part of that you had in mind?

I can’t seem to edit the docs I copied in my original post, but the updated descriptions that includes discussion of pep584 are here: ubelt/ubelt/util_dict.py at dev/1.3.3 · Erotemic/ubelt · GitHub

jamestwebber · June 30, 2023, 2:11pm

Your rationale makes sense to me! It’s a bit of sugar but useful, in my opinion. My main concern (which seems to be similar to the historical concern from PEP 584) is that it could be confusing and non-intuitive in how it’s working. But good documentation could help solve that, as long as behavior is consistent across various combinations of types.

I didn’t follow that discussion at the time, and it’s on the old mailing list so it’s a bit of a pain to go through now. I see this post suggested the same type of operations and there is a discussion related to it there.

edit: What do you propose for x | y when y is a set? Should this be supported?

Melendowski · June 30, 2023, 2:24pm

I’m unsure what you mean by symmetric difference being additive and therefore should take right most.

Symmetric difference should be one were there is no question in if the value should be from the left or right (unlike intersection or union).

Symmetric difference of two dicts should yield a new dict whose keys are unique to either the left or right dict but are not present in both. If a key is in both then it’s not included

Imo the intersection and union should act like and and or in an assignment (given their parallel), but this goes directly against what pep 584 implemented.

If you and two truthy objects that are true, the right most is picked and if you or two truthy objects that are true then left most gets picked.

print("a" and "b")
print("a" or "b")

b
a

Erotemic · June 30, 2023, 9:58pm

@Melendowski, What you are saying is true about binary symmetric difference. In addition to this think about the n-ary version of symmetric difference, which is the same as reducing the binary sym-diff over a list of arguments. This becomes equivalent to including the elements that appear an odd number of times. This does raise a question about which value is taken.

Consider the following:

import ubelt as ub

# Define a set of dictionaries for successive symmetric intersection
data = [
    {         'b': 21, 'c': 31},
    {'a': 12, 'b': 22, 'c': 32},
    {'a': 13, 'b': 23, 'c': 33},
    {'a': 14,          'c': 34},
    {         'b': 25, 'c': 35},
]

# Natural iteration order
result1 = ub.udict({})
for item in data:
    result1 ^= item

# Reversed iteration order
result2 = ub.udict({})
for item in data[::-1]:
    result2 ^= item

# Nary versions
result3 = ub.udict.symmetric_difference(*data)
result4 = ub.udict.symmetric_difference(*data[::-1])

print(f'{result1=}')
print(f'{result2=}')
print(f'{result3=}')
print(f'{result4=}')

This results in:

result1={'a': 14, 'c': 35}
result2={'a': 12, 'c': 31}
result3={'a': 14, 'c': 35}
result4={'a': 12, 'c': 31}

Depending on the order in which you iterate through data, “reduced” (better term?) calls to symmetric difference will result in a different order. The item in result1 is the “natural” iteration order, and that corresponds to rightmost priority in the nary operation.

I’m handwaving a bit with the terms “additive” and “subtractive”, and any help in refining this termonology would be appreciated, but I believe there is a real mathematical symmetry here that informs what could defaults should be. What I’m observing is:

intersection and difference always produce a result that is a subset of one of the arguments. I chose the term “subtractive” because these operations are always taking something away from an input.
union and symmetric difference may produce a result that is not a subset of any argument. I chose the term “additive” because these can produce new sets. perhaps “constructive/descructive” would be better?

However, there is an additional modivation for thinking of sym-diff as additive. It is because it mirrors the binary xor operation, which itself is addition modulo 2.

WRT to the relationship to and and or, I think this boolean-algebra interpretation of those operations is at odds with the set-theoretic interpretation I’m proposing. If you view union as an update, then rightmost priority makes the most sense, and if you view intersection as a filter, then leftmost priority makes the most sense. I don’t have a great explanation of this at the moment, but I think I could come up with one given some time to think about it. Regardless, any new dictionary set operations must be compatable with PEP-584, so union will always have rightmost priority by default. (Although I will restate that even though I believe my proposal has natural defaults, the other value-priorities are still valid, and should probably be taken care of using something like Ruby’s merge feature).

@jamestwebber I think some amount of confusion is unavoidable, but I also think the primary use case of intersection and difference dictionary set operations will be between a dictionary and a set, and rarely between two dictionaries. It might make sense to scope any new PEP as such, but I think if you consider the intersection(a: Dict, b: Set | Dict)-case as a first class citizen, then the corresponding dictionary implementations fall into place.

It also seems to me that in the discussion there are some people saying that the rest of the set operations would be useful and noting the ambiguities, but the issue is that nobody has put in the work to propose what Python should do to resolve those ambiguities and then back that reasoning up. I believe my reference implementation in ubelt.SetDict goes a long way towards addressing these open questions.

What do you propose for x | y when y is a set? Should this be supported?

It’s undefined. The union and symmetric difference require that all arguments are dictionaries. I don’t see a way to make sense of it otherwise. This also goes towards my point of them being the “additive” operations: constructing a new dictionary requires the inputs all have values.

On the other hand, the “subtractive” operations of intersection and difference do not have this problem, and arguments can feely be mixed between dictionaries and sets.

Melendowski · June 30, 2023, 11:51pm

Thank you for the clarity. Your explanation instantly reminded me of the use of xor gate as a parity check from a Computer Architecture class. Makes so much more sense then.

pylang · July 8, 2023, 11:25pm

Python dicts already has the union “|” operator.

As indicated in your docstring, it is key to emphasize that these are not genuine set operations between dictionaries, since dict values can’t guarantee set-like properties. The docs correctly makes a distinction through calling | a merge (not “union”) operator.

So I’d suggest 1) correcting “union” to “merge” operator and 2) similarly renaming the other set analogs since the terminology can easily get confusing. Example, the same docstring contradicts itself suggesting “set operations” such as the union operator applies to dictionaries:

The set operations are:

union (or the | operator) combines multiple dicttionaries into one.

This is not true, i.e. dicts don’t have set operations; dict keys do. The union operator doesn’t apply to dicts; the merge operator does.

Beyond that, regarding the overall idea, sure. Why not?

Erotemic · July 9, 2023, 6:56am

We can define set operations on dictionaries. They correspond very well, and are even sometimes interoperable. The only thing “genuine” about regular set operations is that they have nice mathematical properties. Dictionaries with the operations as I’ve defined them have a generalized version of those nice properties. The resemblance to sets is uncanny, so I think it’s important to keep the same names.

I wanted to add a more explicit connection between why “additive”/“subtractive” operations should default to rightmost/leftmost priority.

Notice that set-based union and symmetric_difference share an “additive” property, in that they can result in sets that are not subsets of any input.

In contrast, the set-based intersection and difference operations share a “subtractive” property. They always result in a set that is a subset of one of the inputs.

In this sense, a “additive” operation can expand into a never before seen set, whereas “subtrative” operations cannot do this.

An “additive” operation must consider all inputs because it is not guarenteed to see its rightmost input.

A “subtractive” operation’s leftmost input will always be a superset of the result. You are guarenteed to have seen all possible elements that could be in the result after you have seen the leftmost input.

In the context where you are streaming inputs, so you see the “leftmost” element first and the “rightmost” last, and for “addative” operations you must wait until the rightmost until you are gaurenteed to have the information needed to allocate the output. For “subtractive” operations, you have all of the information you need immediately after you see the leftmost element.

The above is for pure sets. For dictionaries, one could invent any prioritization to define which values tag along with the keys. However, I believe this “minimum number of items to allocate the output” is the “natural” default way to assign a value to each output key.

Imagine each of the union, intersection, difference, and symmetric_difference methods had a keyword argument: merge, which took an iterator of values, with the first being the leftmost and last being the rightmost candidate value for that key.

One could implement fancy things such as: merge=set to take all values and produce a new dictionary with a set of all input values for that key.

Taking the leftmost would correspond to:

def merge(values):
    return next(values)

And the rightmost would be:

def merge(values):
    for v in values:
        ...
    return v

storchaka · July 9, 2023, 7:14am

It is not always so.

>>> set(range(1000)) & {True}
{True}

What result do you expect from {1: 'a', 2: 'b'} & {True: 'c'}? {1: 'a'}, {1: 'c'}, {True: 'a'} or {True: 'c'}?

There is a performance reason why set intersection is implemented this way. Will a dict intersection sacrifice performance to guarantee getting the subset of the leftmost input?

Rosuav · July 9, 2023, 8:18am

Serhiy Storchaka:

It is not always so.
>>> set(range(1000)) & {True}
{True}
What result do you expect from {1: 'a', 2: 'b'} & {True: 'c'}? {1: 'a'}, {1: 'c'}, {True: 'a'} or {True: 'c'}?

There is a performance reason why set intersection is implemented this way. Will a dict intersection sacrifice performance to guarantee getting the subset of the leftmost input?

Sets have the odd behaviour of taking the elements from the smaller set, which is distinctly quirky.

>>> {1,2} & {True}
{True}
>>> {1,2} & {True,3,4}
{1}

The algorithm for s & s broadly is: Iterate over smaller set; does the larger contain this element? Add this element. In theory, the algorithm for a dictionary could be defined to take from the left dictionary without losing too much performance, albeit at the cost of having two separate search styles:

Is the left-hand dictionary smaller than the right-hand?
a. Iterate over the left operand.
b. Look up each element in the right operand.
c. If found, insert the key/value from iteration into the output.
Otherwise (left-hand is larger than right-hand):
a. Iterate over the right dict.
b. Look up each element in the left dict, using _Py_dict_lookup to get the index
c. If found, insert the key/value from the left dict’s index into the output.

This would have a cost, but not as big a cost as “always iterate over the left dict” would have.

Unless I’m completely misreading the CPython sources, in which case, my apologies

storchaka · July 9, 2023, 8:52am

The problem is that there is no public API for getting a key in dict which is equal to the specified key. For implementing intersections between dicts (and only dicts) you need to use private API, but you cannot implement intersection between other mapping. | was implemented for different dict-like types, with & you will have problems.

Other problem – order. Dicts are ordered now, unlike to sets. What order will be preserved in the result: of the leftmost or of the rightmost input?

Erotemic · July 10, 2023, 12:32am

This is good discussion. It’s possible (likely) I’ve made some mistakes. Regardless I think the idea of having d & s be available syntax is too useful to pass up. I’m going to consider these points and think about this over the coming days, but I want to make my goal clear: giving dictionaries an intersection-like operator would be immensely useful, but I don’t want to do it haphazardly, and I want to make sure the implementation and design decisions have solid justifications.

In the meantime, in my current reference implementation:

import ubelt as ub
ub.udict({1: 'a', 2: 'b'}) & {True: 'c'}

results in {1: 'a'}, as it uses the leftmost key and value in its implementation.

Using the smallest input would agree more with the current implementation and it does punch a hole in one of my arguments. However, this may not result in something as useful. Intuitively leftmost priority in intersection feels extremely natural to me, and I’m searching for an argument to make that concrete - perhaps I’m wrong and it it doesn’t exist. I can of course see validity in other priorities, but I hadn’t considered “smallest first” as a potential rule to compare against “leftmost”. Currently I can’t think of something natural that distinguishes the two.

I believe the question of order should come down to the priority, which it seems isn’t as solved as I would have thought.

NeilGirdhar · July 10, 2023, 3:54am

Have you considered intersecting dictionaries with sets instead? Then it’s much simpler:

d = {1: 'a', 2: 'b'}
d2 = {True: 'c'}
s = {True}
d & s  # {k: v for k, v in d.items() if k in s}
d & d2  # Not allowed.
d & d2.keys()  # Makes sense.
s & d  # Allowed?

I think whatever you push for needs some real world motivation.

flyinghyrax · July 10, 2023, 12:31pm

That has been discussed earlier in the thread.

Jon Crall:

The reason this implementation makes sense is to allow for the far more useful pattern: when the second argument is a set, and not a dictionary.
In [4]: d2 = {'ham', 'eggs'}

In [5]: d1 & d2
Out[5]: {'eggs': 2}
This is the modivating case for this proposal. The pattern of being able to intersect kwargs with expected values and then subtract them out to get the unexpected values is something that I believe is lacking. Consider the utility of the following example in no-external-import Pure-Python code.

oscarbenjamin · July 10, 2023, 1:55pm

I don’t think that taking from the smaller set is quirky. What is odd is that sets and dicts allow things like 1, 1.0 and True to be considered equivalent in the first place. The same problems arise without intersection:

>>> {1, True}
{1}
>>> {True, 1}
{True}

If we must have things like 1 == 1.0 then it would be better to have some different equality operator or a different dunder for things like sets and dicts. It could be e.g. a === b and it could call a.__stricteq__(b) and then we could have 1 === 1.0 be False. If there were both __eq__ and __strict_eq__ then it would be possible to choose which of them sets and dicts should use. (I expect it is too late to change this though.)

Every time I implement __eq__ I find this same tension arising. In some situations the 1 == 1.0 behaviour is nice but also in others it is definitely not desired. Essentially the == operator is overloaded in conflicting ways so that I always find that designing an __eq__ method requires compromising one sort of usage in favour of another.

flyinghyrax · July 10, 2023, 1:59pm

I think this is off topic for this thread. Extending the whole implementation of equality and “truthiness” in Python should be split to its own idea thread.

That said, in the context of sets and dict keys, this type coercion issue only applies if you are using heterogenous collections, I.e. your set values aren’t all the same type. I would argue this is generally not useful unless you’re using a polymorphic class hierarchy, in which case you would override __eq__ and __hash__ anyway.

Rosuav · July 10, 2023, 2:06pm

But we MUST match by something, and it would be nightmarishly confusing to use something other than either identtiy or equality. Identity’s out (there’s no way we’d want string keys to fail to match just because they’re the wrong object - yes, there are a few specific situations where you want an identity dict, but for the most part, no), and a “type and value equality” concept like you’re suggesting is its own bizarreness. Introducing a new type of equality, this “strict equality”, only brings in new problems - just ask JavaScript programmers.