PEP 615: Support for the IANA Time Zone Database in the Standard Library

pganssle · April 18, 2020, 5:24pm

There has been a decent amount of discussion about this PEP on the steering council thread on GitHub, and right now one of the remaining questions is @vstinner’s concern about the __eq__ and __hash__ implementation, with the discussion starting here.

In the current implementation, I do not override __eq__ and __hash__, because the semantics of these objects are very much geared around object identity, and so I think it makes sense for equality to correspond to object identity. That said, in the abstract there are at least four valid ways to consider two ZoneInfo objects to be equal. Assuming ZoneInfo objects z1 and z2, I would say the most reasonable choices are:

1. z1 == z2 if and only if z1 is z2
2. z1 == z2 if z1.key == z2.key
3. z1 == z2 if z1 and z2 have the same behavior for all datetimes, which is to say that all the transition information is the same.
4. A combination of 2 and 3: z1 == z2 if the keys are the same and all the transitions are identical.

In almost all real cases, these will all give the same answer, because most people will be calling zoneinfo.ZoneInfo, which will always return the same object for the same key. However, there are some implications around the notion of equality that compares all transition information.

Unlike options 1 and 2, options 3 and 4 do provide extra, otherwise inaccessible, information about the zones, so while you can easily write a comparison function to mimic options 1 and 2 in a world using option 3, you cannot write a comparison function using option 3 in a world where we use option 1 or 2.
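To make the asymmetry concrete, here is a rough sketch of what user code can do from the outside. The helpers same_object and same_key mimic options 1 and 2 exactly, while looks_equivalent is only a sampling heuristic standing in for option 3: it can miss differences that fall between samples or outside the sampled range. All of these names are made up for illustration, and fixed_offset_zone is a hypothetical helper that builds a tiny fixed-offset TZif file in memory so the sketch does not depend on installed time zone data.

```python
import io
import struct
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def fixed_offset_zone(key, utcoffset_seconds, abbr):
    """Hypothetical helper: ZoneInfo from a minimal in-memory TZif v1 file."""
    abbr_bytes = abbr.encode("ascii") + b"\x00"
    # Magic, version byte + 15 reserved bytes, then six counts:
    # isutcnt, isstdcnt, leapcnt, timecnt, typecnt, charcnt.
    header = b"TZif" + b"\x00" * 16 + struct.pack(">6l", 0, 0, 0, 0, 1, len(abbr_bytes))
    ttinfo = struct.pack(">lBB", utcoffset_seconds, 0, 0)  # utoff, isdst, abbr index
    return ZoneInfo.from_file(io.BytesIO(header + ttinfo + abbr_bytes), key=key)


def same_object(z1, z2):
    """Option 1: identity."""
    return z1 is z2


def same_key(z1, z2):
    """Option 2: same key."""
    return z1.key == z2.key


def looks_equivalent(z1, z2, step=timedelta(days=1)):
    """Heuristic for option 3: compare observable behavior at sampled instants."""
    instant = datetime(1970, 1, 1, tzinfo=timezone.utc)
    end = datetime(2038, 1, 1, tzinfo=timezone.utc)
    while instant < end:
        a, b = instant.astimezone(z1), instant.astimezone(z2)
        if (a.utcoffset(), a.dst(), a.tzname()) != (b.utcoffset(), b.dst(), b.tzname()):
            return False
        instant += step
    return True


est_a = fixed_offset_zone("Example/A", -5 * 3600, "EST")
est_b = fixed_offset_zone("Example/B", -5 * 3600, "EST")
cst = fixed_offset_zone("Example/C", -6 * 3600, "CST")

print(same_key(est_a, est_b))          # different keys
print(looks_equivalent(est_a, est_b))  # same sampled behavior
print(looks_equivalent(est_a, cst))    # different offsets
```

With real time zone data installed, the same functions could be pointed at ZoneInfo("America/New_York") and ZoneInfo("US/Eastern"), but a passing result would still only mean "no difference found at the sampled instants".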

We would also presumably have the option of making it so that zoneinfo.ZoneInfo("UTC") == datetime.timezone.utc if we have a custom, value-based comparison method, which might conceivably be convenient for trying to “normalize” your UTC or other fixed-offset time zones (though I suspect this would only be really meaningful for UTC, and you can special-case that by checking against str(zi) == "UTC", which, incidentally, would work for pytz as well).
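A minimal sketch of that str-based special case, assuming a helper named is_utc (my name, not part of any API). The zone labelled "UTC" is built from a tiny in-memory TZif file by a hypothetical fixed_offset_zone helper, purely so the example runs without installed time zone data.

```python
import io
import struct
from datetime import timedelta, timezone
from zoneinfo import ZoneInfo


def fixed_offset_zone(key, utcoffset_seconds, abbr):
    """Hypothetical helper: ZoneInfo from a minimal in-memory TZif v1 file."""
    abbr_bytes = abbr.encode("ascii") + b"\x00"
    header = b"TZif" + b"\x00" * 16 + struct.pack(">6l", 0, 0, 0, 0, 1, len(abbr_bytes))
    ttinfo = struct.pack(">lBB", utcoffset_seconds, 0, 0)  # utoff, isdst, abbr index
    return ZoneInfo.from_file(io.BytesIO(header + ttinfo + abbr_bytes), key=key)


def is_utc(tz):
    """Treat any tzinfo whose string form is exactly "UTC" as UTC."""
    return str(tz) == "UTC"


utc_zone = fixed_offset_zone("UTC", 0, "UTC")  # str() of a ZoneInfo is its key

print(is_utc(timezone.utc))                  # True
print(is_utc(utc_zone))                      # True
print(is_utc(timezone(timedelta(hours=1))))  # False
```

Note that this is a check on the name only: a zone constructed with a key such as "Etc/UTC" would not match, even though it behaves the same way.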

I think the most important thing about this is how it would affect how these things get hashed. If we go with option 2, then it would not be possible to hold two different instances of zones with the same key together in a set:

>>> s = {ZoneInfo("America/New_York"),
...      ZoneInfo.no_cache("America/New_York")}
>>> s
{ZoneInfo('America/New_York')}

Which means that {dt.tzinfo for dt in bunch_of_datetimes} won’t necessarily give you all the ZoneInfo objects used in bunch_of_datetimes.
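Here is a small sketch of that difference, using a hypothetical subclass KeyEqZoneInfo that implements option 2 on top of the current identity-based ZoneInfo. The fixed_offset_zone helper is also hypothetical; it builds two distinct objects with the same key from an in-memory TZif file, playing the role that ZoneInfo.no_cache plays in the example above.

```python
import io
import struct
from zoneinfo import ZoneInfo


class KeyEqZoneInfo(ZoneInfo):
    """Hypothetical subclass with option 2 semantics (equality by key)."""

    def __eq__(self, other):
        if not isinstance(other, KeyEqZoneInfo):
            return NotImplemented
        return self.key == other.key

    def __hash__(self):
        return hash(self.key)


def fixed_offset_zone(key, utcoffset_seconds, abbr, cls=ZoneInfo):
    """Hypothetical helper: zone object from a minimal in-memory TZif v1 file."""
    abbr_bytes = abbr.encode("ascii") + b"\x00"
    header = b"TZif" + b"\x00" * 16 + struct.pack(">6l", 0, 0, 0, 0, 1, len(abbr_bytes))
    ttinfo = struct.pack(">lBB", utcoffset_seconds, 0, 0)  # utoff, isdst, abbr index
    return cls.from_file(io.BytesIO(header + ttinfo + abbr_bytes), key=key)


# Two distinct objects per class, all sharing one key.
plain = [fixed_offset_zone("Example/Zone", -5 * 3600, "EST") for _ in range(2)]
keyed = [fixed_offset_zone("Example/Zone", -5 * 3600, "EST", cls=KeyEqZoneInfo)
         for _ in range(2)]

print(len(set(plain)))  # identity-based: both objects survive
print(len(set(keyed)))  # key-based: collapsed to a single element
```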

If we go with option 3, then zones that are links to one another or are distinct zones with the same behavior could not co-exist in a set together:

>>> s = {ZoneInfo("America/New_York"),
...      ZoneInfo("US/Eastern")}
>>> s
{ZoneInfo('America/New_York')}

If we go with option 4, though, you wouldn’t be able to tell whether two zones with different keys are otherwise identical to one another, so you can’t do something like this:

with open(some_file, "rb") as f:
    unknown_zi = ZoneInfo.from_file(f)

print(unknown_zi == ZoneInfo("America/New_York"))

You also wouldn’t have any way to detect whether two zones have the same behavior but different names (e.g. "US/Eastern" and "America/New_York").

In the end, I can sort of imagine uses for having some sort of value-based equality in ZoneInfo, but there’s no one obvious choice here. I don’t know why people would want to use these things as keys in a dictionary, but maybe they would. I can also see some reasons for putting them in a set, but nothing so common that there’s one obvious use case.

In terms of performance, option 1 is the cheapest for both hashing and equality, and options 3 and 4 are the most expensive, but we can use a cache to at least make computing the hash a one-time cost.

My proposal: I think that we should stick with option 1 (default implementation - comparison by object identity) for equality, because that most closely matches the semantics people will care about (and for the same reasons that we have pickle serializing by key).

If a lot of people are chafing at the inability to do “comparison by value”, in a future version we can offer an .equivalent_transitions() method that exposes the results of option 3. We would also have the option of changing __hash__ to be value-based in the future, since hash values aren’t guaranteed and all we’d be doing is introducing some hash collisions; that would allow people to create subclasses (or wrapper functions) with the __eq__ and __hash__ behavior described in either option 3 or option 4.
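For a sense of what a user-side wrapper might look like in the meantime, here is a sketch of a hypothetical ByBehavior class. It only approximates option 3, because it fingerprints a zone by sampling its observable behavior rather than by reading the transition table, and it caches both the fingerprint and its hash so that the expensive part is a one-time cost per object. The names ByBehavior, sample_instants and fixed_offset_zone are all assumptions made for this example.

```python
import io
import struct
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def fixed_offset_zone(key, utcoffset_seconds, abbr):
    """Hypothetical helper: ZoneInfo from a minimal in-memory TZif v1 file."""
    abbr_bytes = abbr.encode("ascii") + b"\x00"
    header = b"TZif" + b"\x00" * 16 + struct.pack(">6l", 0, 0, 0, 0, 1, len(abbr_bytes))
    ttinfo = struct.pack(">lBB", utcoffset_seconds, 0, 0)  # utoff, isdst, abbr index
    return ZoneInfo.from_file(io.BytesIO(header + ttinfo + abbr_bytes), key=key)


def sample_instants(step=timedelta(days=7)):
    """Yield UTC instants from 1970 up to 2038 at a fixed step."""
    instant = datetime(1970, 1, 1, tzinfo=timezone.utc)
    end = datetime(2038, 1, 1, tzinfo=timezone.utc)
    while instant < end:
        yield instant
        instant += step


class ByBehavior:
    """Hypothetical wrapper: equality and hash from sampled zone behavior."""

    def __init__(self, zone):
        self.zone = zone
        self._fingerprint = None
        self._hash = None

    def _get_fingerprint(self):
        # Computed lazily on first use, then reused.
        if self._fingerprint is None:
            local_times = (i.astimezone(self.zone) for i in sample_instants())
            self._fingerprint = tuple(
                (lt.utcoffset(), lt.dst(), lt.tzname()) for lt in local_times
            )
        return self._fingerprint

    def __eq__(self, other):
        if not isinstance(other, ByBehavior):
            return NotImplemented
        return self._get_fingerprint() == other._get_fingerprint()

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self._get_fingerprint())
        return self._hash


est_a = fixed_offset_zone("Example/A", -5 * 3600, "EST")
est_b = fixed_offset_zone("Example/B", -5 * 3600, "EST")
cst = fixed_offset_zone("Example/C", -6 * 3600, "CST")

distinct = {ByBehavior(z) for z in (est_a, est_b, cst)}
print(len(distinct))  # zones with matching sampled behavior collapse together
```

Requiring equal keys in __eq__ as well would turn this into an approximation of option 4 instead.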