Add Symmetric Difference Operators To Dict

eladshoshani · April 19, 2024, 5:30pm

Hello Python community,

Following the introduction of dictionary merging and updating operations (| and |=) in PEP 584, I am proposing the addition of symmetric difference operations for dictionaries. This would introduce two new operators, ^ for symmetric difference and ^= for in-place symmetric difference update.

Rationale:

The symmetric difference between two sets is the set of elements that are in either of the sets but not in their intersection. For dictionaries, this operation could similarly be defined as one that returns a dictionary containing the key-value pairs from both dictionaries whose keys appear in exactly one of them. This operation is particularly straightforward and less controversial because it inherently resolves which values to take: if a key exists in both dictionaries, it simply does not appear in the result. This is a natural extension of the dictionary operations introduced in PEP 584 and aligns with Python’s philosophy of consistency and readability.

Proposed Operations and Discussion Points:

Here’s a brief overview of the proposed operations:

d1 ^ d2: Returns a new dictionary that contains the symmetric difference of d1 and d2. If a key exists in both dictionaries, neither dictionary's key-value pair is included in the result.
d1 ^= d2: Updates d1 with the symmetric difference of d1 and d2. This is the in-place equivalent of the ^ operation.
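The two operations described above can be sketched in today's Python as plain functions. This is only an illustration of the proposed semantics, not an implementation from the proposal: the function names `dict_symmetric_difference` and `dict_symmetric_difference_update` are hypothetical, and the sketch assumes that a key present in both dictionaries is dropped regardless of its values.

```python
def dict_symmetric_difference(d1, d2):
    """Return a new dict with the items whose keys are in exactly one input."""
    # Keep items of d1 whose keys are missing from d2 ...
    result = {k: v for k, v in d1.items() if k not in d2}
    # ... and items of d2 whose keys are missing from d1.
    result.update((k, v) for k, v in d2.items() if k not in d1)
    return result


def dict_symmetric_difference_update(d1, d2):
    """In-place variant: mutate d1 so that it holds the symmetric difference."""
    # Iterate over a snapshot so the function also works when d1 is d2.
    for k, v in list(d2.items()):
        if k in d1:
            del d1[k]   # key is in both dicts: drop it
        else:
            d1[k] = v   # key is only in d2: add it


left = {"a": 1, "b": 2}
right = {"b": 3, "c": 4}
print(dict_symmetric_difference(left, right))  # {'a': 1, 'c': 4}

dict_symmetric_difference_update(left, right)
print(left)  # {'a': 1, 'c': 4}
```

Note that "b" disappears even though its values differ (2 versus 3); that choice is exactly the part of the semantics that would need to be agreed on.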

Additionally, to maintain consistency with Python’s data model, we might consider adding corresponding methods:

dict.symmetric_difference(other)
dict.symmetric_difference_update(other)

Unlike PEP 584, where dict.update already existed, the decision to add these methods provides a new point of discussion: Should we introduce these methods for symmetry with the set API and for use cases that prefer method calls over operator usage?

I am looking forward to your thoughts and any feedback on this proposal, especially regarding the potential inclusion of these additional methods.

Is it possible to get a PEP started for this?
Thank you!

pf_moore · April 19, 2024, 5:42pm

What is the use case for this? Simply being a “natural extension” of existing operations isn’t enough, the new operations need to be able to stand on their own as useful in their own right.

eladshoshani · April 19, 2024, 6:15pm

Thank you @pf_moore for your question. You’re right that new features should offer clear and standalone utility.
The new operations can be useful in many comparison use cases, which are actually quite similar to the use cases of the ^ operation for sets.
One common case where you want to know the keys that are only in one of the dictionaries is in configurations. For example:

staging_config = {"cpu": "8cores", "ram": "32GB", "disk": "1TB", "screen": "16 inches"}
production_config = {"cpu": "8cores", "ram": "32GB", "disk": "1TB", "os": "ubuntu"}

Here we would like to have an easy way to get the {"os": "ubuntu", "screen": "16 inches"} data.

mikeshardmind · April 19, 2024, 6:26pm

Is there a particular use case you have in mind? To me, in the only case where the values might matter, I also care about which dict provides them and whether the values differ. Neither of those is supported by the example of the data you expect this to extract.

For instance,

staging_difference = {k: v for k, v in staging_config.items() if k not in production_config or production_config[k] != v}

and I don’t generally have a problem with explicitly stating it.

In cases where I just want to confirm both dicts have the same set of keys:

if (disjoint_keys := staging_config.keys() ^ production_config.keys()):
    # The configuration structures differ; it's not just staging values being different.
    raise ValueError(f"differing configuration keys: {disjoint_keys}")
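To make the point about "which dict provides them" concrete, here is a minimal sketch that reports the three pieces of information separately. The variable names are illustrative, and the "ram" value in `production_config` is changed from the thread's example (an assumption made only so that the `changed` mapping is non-empty).

```python
staging_config = {"cpu": "8cores", "ram": "32GB", "disk": "1TB", "screen": "16 inches"}
# Assumption for illustration: "ram" differs here, unlike in the thread's example.
production_config = {"cpu": "8cores", "ram": "64GB", "disk": "1TB", "os": "ubuntu"}

# Keys that only one side defines, kept apart so the origin is not lost.
only_in_staging = {k: v for k, v in staging_config.items() if k not in production_config}
only_in_production = {k: v for k, v in production_config.items() if k not in staging_config}

# Keys both sides define but with different values, as (staging, production) pairs.
changed = {
    k: (staging_config[k], production_config[k])
    for k in staging_config.keys() & production_config.keys()
    if staging_config[k] != production_config[k]
}

print(only_in_staging)     # {'screen': '16 inches'}
print(only_in_production)  # {'os': 'ubuntu'}
print(changed)             # {'ram': ('32GB', '64GB')}
```

A single symmetric-difference dict would merge the first two results and omit the third entirely.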

eladshoshani · April 19, 2024, 6:38pm

I think that, just as the ^ operation is less useful than & or | on sets, the same will be true for dictionaries.
Also as @chepner noted in the following comment - this operation is already available for .items() - which suggests the benefit of using it.

chepner · April 19, 2024, 6:40pm

Instances of dict_items already support ^.

>>> dict(staging_config.items() ^ production_config.items())
{'os': 'ubuntu', 'screen': '16 inches'}

I don’t know if this is an argument for or against your proposal.

eladshoshani · April 19, 2024, 6:46pm

@chepner Thanks for noting that. I think this is an argument for the proposal because of the Zen of Python principle:

“There should be one-- and preferably only one --obvious way to do it.”

While it’s possible to achieve the symmetric difference via dict_items, this method is not immediately obvious to many programmers and involves additional steps.

eladshoshani · April 19, 2024, 7:09pm

I would like to note something important: most of the time when we merge dictionaries, we do it when the dictionaries do not have colliding keys.
In some sense the “override” of values by the second dictionary is already bad.

eladshoshani · April 19, 2024, 7:22pm

I believe that the new symmetric difference operation would be more useful and less problematic than the current merge operation in Python. Currently, dict comprehension can cause bugs due to the override mechanism that it uses. For instance, when we have the same key twice in the list, it can lead to confusion and real bugs.

When the symmetric difference operation is available, developers will choose to use it over the | merge operation in cases where they want to ensure there are no colliding keys. The proposed ^ operator for dictionaries, unlike the current | merge operation, focuses on identifying keys unique to each dictionary, which is crucial in many practical scenarios such as configuration and feature management.

While the merge operation (| and |=) combines two dictionaries and allows the second dictionary to override values from the first in the event of key collisions, the ^ operator highlights discrepancies without such overrides. This makes ^ particularly useful where identifying differences, rather than merging settings, is required, providing clearer and more intentional data handling where merging might obscure critical discrepancies.

tim.one · April 19, 2024, 7:44pm

I think this is oversold. The keys that survive after applying symmetric difference to a sequence of dicts are those that appear in an odd number of the inputs. That includes keys that appear in only one (an odd number) dict, but also includes keys that appear in 3, 5, 7, …dicts. So it doesn’t eliminate “override surprises”, it makes them subtler.

If eliminating overrides in a single pair of dicts is the use case, right, it does work for that. But in the general (multiple input) case, a fancier approach is needed to identify keys that appear uniquely.
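A small sketch of the odd-count behavior, assuming a key-based symmetric difference as described in the proposal. The helper `sym_diff` and the sample dicts are hypothetical; `functools.reduce` and `collections.Counter` are standard library.

```python
from collections import Counter
from functools import reduce


def sym_diff(d1, d2):
    """Key-based symmetric difference of two dicts (hypothetical helper)."""
    return {k: (d1[k] if k in d1 else d2[k]) for k in d1.keys() ^ d2.keys()}


configs = [
    {"a": 1, "b": 2},
    {"a": 10, "c": 3},
    {"a": 100, "d": 4},
]

# "a" appears in three inputs (an odd number), so it survives the chain,
# carrying the value from the last dict.
survivors = reduce(sym_diff, configs, {})
print(survivors == {"a": 100, "b": 2, "c": 3, "d": 4})  # True

# Finding keys that appear in exactly one input needs an explicit count.
counts = Counter(k for d in configs for k in d)
unique_keys = {k for k, n in counts.items() if n == 1}
print(unique_keys == {"b", "c", "d"})  # True
```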

eladshoshani · April 19, 2024, 10:37pm

You are right, in the general case, the override problem stays the same. But we should note 2 important things about the symmetric difference operation:

It is self-explanatory: as mentioned in PEP 584, the output of d1 ^ d2 is what you expect it to be, no more and no less. In that sense, I believe there are no downsides to adding this feature.
Most use cases of the operation won’t involve a sequence of dictionaries, only two. Also, in the use cases with a sequence of more than two dictionaries, the result is the one you would expect from an XOR operation.

*Note: maybe in another PEP we should “fix” the overriding problem, i.e. instead of a dict comprehension such as {i: i + 1 for i in [1, 2, 3, 1]} producing normal output, we would get an error indicating that the key 1 appears more than once.
**This note is out of this discussion and should be debated in another unrelated discussion.

Do you think this proposal is ready for a PEP? Any other related topic that we haven’t covered yet?

alicederyn · April 20, 2024, 6:07am

Not without a core dev willing to sponsor, and it doesn’t seem like anyone was excited about this.

eladshoshani · April 20, 2024, 12:54pm

I agree, when applying the symmetric difference to a sequence of dicts it doesn’t eliminate “override surprises”. However this is the behavior you would expect from such an operator, and I believe that is the proposal’s biggest strength - THERE ARE NO SURPRISES.
I would like to prepare a detailed PEP for this proposal, with the implementation in CPython and possible use cases from large open source projects.

Is it possible to ask you to be the sponsor of this PEP as a CPython core developer?

MegaIng · April 20, 2024, 1:03pm

Can you first provide examples of where this would be used? Preferably at least a few examples in the stdlib and a few examples in decently large third party packages. If you can’t find any examples, I would suggest to not go ahead with this proposal.

eladshoshani · April 20, 2024, 2:58pm

Sure. I want to note that, in addition to the examples that I gave in this thread, almost all of the examples given in PEP 584 are more suitable for the ^ operator that I am proposing than for the | operator.

All the examples that I took are from python3.10/site-packages that are on my computer.
On pip/_vendor/distro/distro.py
Before:

        props = {}
        for line in lines:
            kv = line.strip("\n").split(":", 1)
            if len(kv) != 2:
                # Ignore lines without colon.
                continue
            k, v = kv
            props.update({k.replace(" ", "_").lower(): v.strip()})
        return props

After:

        props = {}
        for line in lines:
            kv = line.strip("\n").split(":", 1)
            if len(kv) != 2:
                # Ignore lines without colon.
                continue
            k, v = kv
            props ^= {k.replace(" ", "_").lower(): v.strip()}
        return props

On pip/_internal/configuration.py
Before:

        retval = {}

        for variant in OVERRIDE_ORDER:
            retval.update(self._config[variant])

After:

        retval = {}

        for variant in OVERRIDE_ORDER:
            retval ^= self._config[variant]

This case may look like a case where we would want to use the |= operator, but in the context of this code we are not expecting common keys in the dicts. Thus, it would be better and safer to use the new ^= operator that I am proposing.

Here’s another example that I thought about with a usage of the argparse module:

import argparse

parser = argparse.ArgumentParser(description="Main parser")
parser.add_argument('--foo', help='foo help')

subparsers = parser.add_subparsers(help='sub-command help')

subparser1 = subparsers.add_parser('sub1', help='subparser1 help')
subparser1.add_argument('--bar', help='bar help for subparser1')

subparser2 = subparsers.add_parser('sub2', help='subparser2 help')
subparser2.add_argument('--baz', help='baz help for subparser2')

# Hypothetically using the ^ operator to find exclusive args
exclusive_args = subparser1._option_string_actions ^ subparser2._option_string_actions
print("Do something with the exclusive arguments: ", exclusive_args)

MegaIng · April 20, 2024, 3:14pm

At least for the second example you are very clearly wrong: The list of dictionaries is called OVERRIDE_ORDER. Later dicts overwriting earlier ones is 100% the expected behavior. And I suspect that for the first example this is also the case.

In fact, it isn’t “safer” in any way. Instead of later entries overwriting earlier ones, they now vanish. It might be safer to raise an exception instead, but changing the behavior from one to the other is no improvement (assuming duplicate keys are actually unexpected, which they aren’t for either of these examples).
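For code where duplicate keys really are unexpected, the "raise an exception instead" option could look something like the sketch below. The function name `strict_merge` is hypothetical and nothing like it exists on dict today; it is shown only to contrast with both overriding and deleting.

```python
def strict_merge(*dicts):
    """Merge dicts, raising ValueError if any key appears more than once."""
    merged = {}
    for d in dicts:
        # Keys views support set intersection, so collisions are easy to find.
        duplicates = merged.keys() & d.keys()
        if duplicates:
            raise ValueError(f"duplicate keys: {duplicates!r}")
        merged.update(d)
    return merged


print(strict_merge({"a": 1}, {"b": 2}))  # {'a': 1, 'b': 2}

try:
    strict_merge({"a": 1}, {"a": 2})
except ValueError as exc:
    print("rejected:", exc)
```

With this approach a collision is neither silently overridden nor silently dropped; it is reported.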

You shouldn’t look for cases where currently |, |= or update are being used. You should look for cases where the code is currently calculating something close to the symmetric difference. I don’t know how you would do that with something like grep. I don’t know of common patterns for doing this, since I never really encountered a need for it.

pf_moore · April 20, 2024, 3:43pm

As a pip maintainer I would reject both of the changes to pip that you suggest. Neither improves the maintainability or safety of the code, the second is wrong (as @MegaIng pointed out, the code is designed to overwrite) and “symmetric difference” is a less fundamental concept than dictionary update, so the code is harder to read.

In the argparse example, you don’t give any explanation of why you want to know the exclusive arguments, or what you would do with them. So this is nothing more than “if I need a symmetric difference, the symmetric difference operator would allow me to compute it”. Which is hardly compelling…

gcewing · April 20, 2024, 3:51pm

Elad Kimchi Shoshani (eladshoshani), April 20:

“This case may look like a case where we would want to use the |= operator, but in the context of this code we are not expecting common keys in the dicts. Thus, it would be better and safer to use the new ^= operator that I am proposing.”

Safer in what way?

The question you need to ask yourself is what should happen if there are common keys in the dicts. Using |=, they get silently overridden; using ^=, they get silently deleted. What makes you think that’s an improvement in each of these examples?

eladshoshani · April 20, 2024, 4:09pm

I understand what you are saying, and the pip example is not good. Finding usages in existing code will be harder in this case because there is currently no “template” for such behavior in Python.

I think the discussion went a little off topic - I suggest that the fact that the new operator is so self explanatory is good enough reason to add it in the first place.
Think about it - currently there is no “obvious way” to perform symmetric difference to dictionaries, and the new change will add such a way efficiently.

I don’t think it really matters how common the use cases of this operation are, because every use case that exists will be easier with the new operator:
d1 ^ d2 = {k: d1[k] if k in d1 else d2[k] for k in d1.keys() ^ d2.keys()}

What I am saying is that we don’t know how much good this operation will bring, but certainly there is no harm in adding it.

MegaIng · April 20, 2024, 4:30pm

And multiple core devs are telling you that this isn’t correct. You should demonstrate at least some real-world use cases.

There is always the harm of extra maintenance effort, extra tests that need to be verified, and for this case probably even a PEP that needs to be reviewed. You technically have to justify dozens of hours of work/attention by core devs. Python has many other issues/feature requests/pull requests/PEPs/ideas that also demand this attention. If you don’t want this idea to be ignored, you have to justify it.

Side note, I was waiting to see if anyone else would actually notice this: The suggestion by @chepner above using dict(a.items() ^ b.items()) does not have the same behavior as you described, since the values are taken into account when checking if the entries are equal. And in fact, this IMO highlights an important distinction: I don’t consider the behavior of the suggested ^ operator obvious when both dicts already have the key but with different values. I don’t know what I would expect to happen. Maybe an error, maybe they should both vanish, maybe one should override the other, maybe they should be joined together somehow. Either way, the decision for what to do seems use-case specific. Which is another reason why use cases should come first when proposing a feature: It allows us to make decisions about edge cases.
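The distinction can be seen with a small example. The sample dicts below are made up for illustration; the key-based comprehension stands in for the proposed ^ operator under the "both vanish" reading.

```python
a = {"cpu": "8cores", "ram": "32GB"}
b = {"cpu": "8cores", "ram": "64GB", "os": "ubuntu"}

# Item-based: ("ram", "32GB") and ("ram", "64GB") are different items, so both
# survive the symmetric difference. dict() then keeps whichever one the set
# happens to yield last, so the resulting "ram" value is not predictable.
by_items = dict(a.items() ^ b.items())

# Key-based (the semantics described in the proposal): "ram" is in both dicts,
# so it is dropped no matter what the values are.
by_keys = {k: (a[k] if k in a else b[k]) for k in a.keys() ^ b.keys()}

print(sorted(by_items))  # ['os', 'ram']
print(by_keys)           # {'os': 'ubuntu'}
```

The item-based form also requires every value to be hashable, since the items are compared as set elements.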