PEP 615: Support for the IANA Time Zone Database in the Standard Library

I do suggest carefully reading the PEP and the discussions, because I believe I’ve covered a lot of this already. The caching behavior here seems complicated when it’s explained in detail, but this complexity is designed so that the default behavior matches what you would expect.

What you seem to be suggesting with this “upgrade” function is something that mutates existing time zones at some point. Even if you’re not mutating the object in memory, this is a non-starter, because it would change equality relationships between existing datetimes. It would also break the invariant that if x and y are hashable and x == y, then hash(x) == hash(y): time zone offsets are calculated lazily, but they are a component of datetime equality and hash calculations. So, under your scheme:

zi = ZoneInfo("Europe/Rome")
dt0 = datetime(2021, 1, 1, tzinfo=zi)
dt0_utc = dt0.astimezone(timezone.utc)
print(dt0 == dt0_utc)   # True
hash(dt0)  # Hash is cached - the alternative is worse

# Italy's rules change and the data is updated, mutating zi
dt1 = datetime(2021, 1, 1, tzinfo=ZoneInfo("Europe/Rome"))
print(dt0 == dt1) # True
print(hash(dt0) == hash(dt1)) # False!
print(dt0 == dt0_utc) # False!

This is actually one reason that I will soon need to update the PEP to rule out the possibility of using ICU on Windows - after some investigation it seems that we cannot load all the data into a single object using the ICU headers that Windows exposes, and without that there are some gnarly edge cases that we can run into.

The current system will give people pretty much what they expect. The .nocache() and .clear_cache() functions exist to enable people with non-standard needs or preferences to make other trade-offs with respect to comparison semantics and freshness of data.
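In the released zoneinfo module these are spelled ZoneInfo.no_cache() and ZoneInfo.clear_cache(). A minimal sketch of the default caching behavior, using the "UTC" key since that should be available on any system with time zone data:

```python
from zoneinfo import ZoneInfo

a = ZoneInfo("UTC")           # primary constructor: consults the cache
b = ZoneInfo("UTC")
print(a is b)                 # True - same object back from the cache

c = ZoneInfo.no_cache("UTC")  # bypasses (and does not populate) the cache
print(c is a)                 # False - a freshly constructed object
```

Because the primary constructor always hands back the cached object, all datetimes built from it share one tzinfo instance with consistent equality and hash semantics for the life of the interpreter.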

It depends on how ZoneInfo.__hash__() is implemented. I suppose that hashing the zone name (“Europe/Rome”) would be unambiguous, and would not depend on data updates (until the zones themselves change!).

I think that this is good. dt0_utc is a future date that was calculated with old information, and it is no longer reliable.

This way you only have to recalculate dt0_utc. Without the ZoneInfo update, you have to recreate all the objects: zi, dt0 and dt0_utc.

No, it does not. It depends only on how datetime.datetime is implemented, which is well established and cannot be changed.

Correct. In other words, these values are immutable - a very well-known property of datetime. People would be surprised if it worked the other way. I suggest looking into the reasoning behind why x == y and hash(x) == hash(y) are coupled together: see for example this blog post on the subject.

There are two invariants that we must not break for very sound reasons:

  1. The hash of an object must not change during its lifetime.
  2. Two objects that compare equal must have the same hash.

See the documentation for __hash__ on this point:

The only required property is that objects which compare equal have the same hash value;

Given that datetime is immutable and that both its equality properties and its hash are determined by the UTC time it represents, we must consistently return the same UTC offset for the same datetime.
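To make the invariant concrete, here is a small illustration using only fixed-offset timezone objects (so no tz database is involved): two aware datetimes that denote the same UTC instant compare equal and hash equal.

```python
from datetime import datetime, timedelta, timezone

est = timezone(timedelta(hours=-5))             # fixed UTC-5 offset
dt_local = datetime(2021, 1, 1, 7, tzinfo=est)  # 07:00 at UTC-5
dt_utc = datetime(2021, 1, 1, 12, tzinfo=timezone.utc)

print(dt_local == dt_utc)               # True - same instant in UTC
print(hash(dt_local) == hash(dt_utc))   # True - hash follows equality
```

If the offset returned for dt_local ever changed, either the equality or the hash invariant above would be violated.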

Mmmmmhhhhh… You’re right. And it’s very good that a tz-aware object depends only on its UTC representation.

So there’s no way to update the timezone in the cache without changing the hash of a tz-aware object.

Anyway, I see you already discussed the cache and the data update, and one point is not clear to me: what happens if the time zone data of a zone changes after you have already cached it and you call ZoneInfo.nocache? Will the old cached ZoneInfo be removed, or does it continue to reside in the cache?

Furthermore, you said that the hash of ZoneInfo will depend on the key only. What do you think about also including the version of the tzdb data?

The version of the source data could also be useful for resolving unpickling: if the pickled data has an older version, the platform data could be used instead.

Sadly I’m going to have to mute this discussion. Paul, if you need help reviewing the PEP contact me via email or one of the mailing lists.

If this is the equality condition, then it’d be fine for us to replace the internal representation with the UTC time and use the time zone object to convert back to its local time, right?

That would likely impact subclasses of datetime, but it shouldn’t matter significantly for tzinfo implementations. Then the only impact of a tzinfo object changing would be that the local time becomes more accurate (assuming that tzdata updates are more accurate).

What have I missed here?

That if a ZoneInfo is updated, a tz-aware datetime that uses it changes. And this can be problematic if, for example, you used it as a dict key or as a set member.

But if its identity is based on its UTC time, then it doesn’t change.

Well… I suppose the code makes it clearer:

static Py_hash_t
datetime_hash(PyDateTime_DateTime *self)
{
    if (self->hashcode == -1) {
        PyObject *offset;
        offset = datetime_utcoffset((PyObject *)self, NULL);

        /* Reduce this to a hash of another object. */
        if (offset == Py_None)
            self->hashcode = [...];  /* naive: hash the raw fields */
        else {
            PyObject *temp1, *temp2;
            int days, seconds;

            days = [...];
            seconds = [...];
            temp1 = new_delta(days, seconds, [...]);
            temp2 = delta_subtract(temp1, offset);
            self->hashcode = PyObject_Hash(temp2);
        }
    }
    return self->hashcode;
}
Briefly, a tz-aware datetime object is converted to a UTC datetime, and the hash is calculated from that. If the offset changes, the hash changes.

To be fair, this is not the actual issue, because what Steve was suggesting is that we adjust the value of the naive portion of the datetime to match the UTC time when it was created. Since the hash value is based on the UTC time for aware datetimes, this wouldn’t have that particular problem.

Steve was suggesting that we change the canonical value of the datetime so that the UTC time remains constant when the offset changes rather than the naive portion being constant with the UTC time changing, so this criticism does not hold in his proposed scenario.

Also, it’s not true that the hash changes, because the hash is cached on first calculation for each datetime object and will not change. If the UTC offset changes the hash will remain the same, but the datetime will no longer compare equal to other objects with the same hash.

I think this is a reasonable way to design a datetime class, but it’s not really the design we have. It would be a pretty significant backwards-incompatible change, and it’s hard to justify when there’s a good alternative. It would mean that the values of dt.hour, dt.month, etc. could no longer be considered constant, even though dt is immutable, because they would become context-specific views on the underlying UTC data. It means that the results of dt.isoformat() and dt.strftime() might change, possibly capriciously, during an interpreter run.

It would also not be cost-free. Note the opening paragraph of the datetime documentation:

While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation.

By storing UTC time as the “true” representation, that means we need to do a conversion operation on every attribute extraction, which would get very costly, particularly if you are using a pure Python implementation of the offset lookup. You could potentially increase the size of the datetime object to store both representations, but then you’d need to work out how best to handle cache invalidation.

I think the idea of almost any changes to how datetime works is a non-starter for this PEP. It’s not really necessary for what we’re trying to accomplish and it would be fraught with difficulties.

Okay, you got some backwards incompatibilities I didn’t think of :slight_smile: Thanks for confirming that it’s a reasonable idea though - I’m still not sure when I’m right or when I’m crazy in this area yet.

So this essentially means that a datetime instance is identified by both its UTC and its local time, right? (Which I get is the same as saying it’s identified by its naive local time and offset, but that phrasing makes more sense to me.)

And also we acknowledge that this means all your live datetime instances need to be recreated if we learn something new about the transitions? That would seem to suggest that the only feasible caching strategies are “interpreter lifetime” or “completely managed by the application”. Because unless I’ve designed thoroughly for invalidation, I’m going to end up with inconsistencies.

Assuming we keep the two constructors for either cached or non-cached lookup, which should libraries and frameworks use?

You can create a function that calls zoneinfo.clear_cache() and register it as a signal handler.

Sorry, forgot to respond to this. If libraries and frameworks accept a string as a specifier for a time zone (rather than a tzinfo object, in which case the point is moot), I would expect them to use the primary constructor - that is almost always what you’d want.

I think the cache-bypassing constructor will be a niche use case for people who need fine-tuned control over the cache invalidation because their applications are sensitive to certain edge cases and they want to make different trade-offs. Libraries and frameworks that do the time zone construction for you should probably also accept arbitrary tzinfo objects as well, to support those use cases.
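As a sketch of that pattern (get_tz is a hypothetical helper name, not part of any proposed API), a library can accept either form and only fall back to the primary constructor for strings:

```python
from datetime import tzinfo
from zoneinfo import ZoneInfo

def get_tz(tz):
    """Accept either an IANA key string or any tzinfo instance."""
    if isinstance(tz, tzinfo):
        return tz            # caller controls construction and caching
    return ZoneInfo(tz)      # string: use the primary (caching) constructor
```

Users with non-standard cache requirements can then pass in a ZoneInfo.no_cache() instance (or any other tzinfo) directly, while everyone else gets the cached object by default.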

While implementing the C extension, I’ve realized that I’m not actually sure about the situation with subinterpreters – if I use a static type for ZoneInfo rather than a heap type, I think that the ZoneInfo cache also ends up being a per-process rather than a per-interpreter cache, and all ZoneInfo objects (not just the class) that hit the cache would end up being shared among all subinterpreters.

I see some examples in PEP 554 of sharing objects via marshal and pickle. If this is the primary way that objects are passed between interpreters, I think it is safe to use a per-interpreter cache; but if objects are sometimes shared directly between interpreters, then it might be preferable to use a process-wide cache for the constructor, to avoid the possibility that identical time zones constructed with the primary constructor are passed between interpreters in a way that violates the invariant of ZoneInfo(key) is ZoneInfo(key).

@eric.snow or @nanjekyejoannah – do either of you have any thoughts or clarifications on this?

I more had in mind libraries that read from a DB or file and construct a tzinfo from that, but never actually give it out to the application developer.

So I agree using the caching constructor is the right default in every case, but I think the “niche” cache management functions should either come with a big doc warning (e.g. 1) or just be omitted/internal (and hence not necessary to make an equivalent API).

(1: “This function may cause any datetime instances in your application to become incomparable, including those created by third party libraries, at unexpected times in the future. Check your dependencies before using.”)

Thanks, @pganssle, for keeping subinterpreters in mind. That really helps. :slight_smile:

That is correct. Static types are shared between interpreters. This is actually one of the things we have to solve, given how we have a bunch of static types in CPython. Using a heap type would avoid the problem.

Going with a static type would be fine until we reach the point that subinterpreters stop sharing the GIL. So in the short term you would probably be fine. However, if you can instead do it in a per-interpreter way, that would be great. It would save us later work.

Take a look at PEP 489 and the newly accepted PEP 573.

PEP 554 is aiming for minimal functionality, including only a basic set of types that can be passed between interpreters. There is no proposed support for actually sharing objects between interpreters. In fact not even their underlying data is shared.

I expect that later we’ll look into broadening the scope of inter-interpreter sharing. However, assume for now that objects in each interpreter are entirely independent of other interpreters.

Thanks for the response Eric.

For now I can try and go with a per-interpreter cache, but maybe I’ll avoid specifying the behavior exactly as part of the PEP, so we have a bit more freedom to make changes as needed in response to changes in how subinterpreters work.

The one thing I’ll note is that I don’t want to make this broadening of the scope harder, and this cache is not there for performance purposes – it could cause bugs in people’s code if ZoneInfo objects from different caches were passed between interpreters.

That said, I believe if the ability to share objects between interpreters becomes broader, we may be able to switch to either a per-process cache shared between all interpreters (or a more complicated design with a per-process cache and per-interpreter caches that query the per-process cache), so I suppose there’s not much need to worry about the choices we make here making it harder to allow the sharing of objects between interpreters.

Apparently I can no longer edit the post with the PEP in it :frowning:, so that text is now out of date. The latest version of the PEP has moved the open issues for Windows ICU support and for different PYTHONTZPATH configuration options into the “Rejected Ideas” section, and I have one more PR to move the “Using the datetime module” section there as well.

I am also thinking that it might be a good idea to rename nocache() to .no_cache(), since that would be more consistent with the naming convention used with .from_file().

Other than that, I believe this is ready to be submitted to the SC for approval, but if anyone has further comments or believes I have missed something, please let me know.

As I mentioned on python-dev, I was sort of hoping this could get approved next Sunday during one of the southern hemisphere’s DST -> STD transitions, so that the “accepted” datetime is an ambiguous datetime somewhere on earth :slight_smile:.

When adjusting the PEP to clarify this, I realized why I originally wanted ZoneInfo.__str__ to work this way: it allows for an easy way to check whether the zone can be serialized by string, since str(zi) will be "" if no key was supplied.

Upon further consideration and in discussion with @barry, I decided that we’ll have __str__ fall back to __repr__ when no key is supplied, and add a key attribute to ZoneInfo, which will be None if no key was supplied, so zi.key is None can replace str(zi) == "".
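In the released zoneinfo module this looks like the following (the from_file case is shown only in a comment, since it needs a real TZif stream):

```python
from zoneinfo import ZoneInfo

zi = ZoneInfo("UTC")
print(zi.key)   # UTC
print(str(zi))  # UTC - __str__ returns the key when one was supplied

# A zone constructed from raw data without a key, e.g.
#     anon = ZoneInfo.from_file(open(path, "rb"))
# has anon.key is None, and str(anon) falls back to repr(anon).
```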

There has been a decent amount of discussion about this PEP on the steering council thread on Github, and right now one of the remaining questions is @vstinner’s concerns about the __eq__ and __hash__ implementation, with the discussion starting here.

In the current implementation, I do not override __eq__ and __hash__, because the semantics of these things are very much geared around object equality, and so I think it makes sense to have object equality correspond to value equality. That said, in the abstract there are at least four valid ways to consider two ZoneInfo objects z1 and z2 equal; the most reasonable choices are:

  1. z1 == z2 if and only if z1 is z2
  2. z1 == z2 if z1.key == z2.key
  3. z1 == z2 if z1 and z2 have the same behavior for all datetimes - which is to say that all the transition information is the same.
  4. A combination of 2 and 3: z1 == z2 if their keys are the same and all the transitions are identical.

In almost all real cases, these will all give the same answer, because most people will be calling zoneinfo.ZoneInfo, which will always return the same object for the same key. However, there are some implications around the notion of equality that compares all transition information.

Unlike options 1 and 2, options 3 and 4 do provide extra, otherwise inaccessible, information about the zones, so while you can easily write a comparison function to mimic options 1 and 2 in a world using option 3, you cannot write a comparison function using option 3 in a world where we use option 1 or 2.

We would also presumably have the option of making it so that zoneinfo.ZoneInfo("UTC") == datetime.timezone.utc if we have a custom, value-based comparison method, which might conceivably be convenient for trying to “normalize” your UTC or other fixed-offset time zones (though I suspect this would only be really meaningful for UTC, and you can special-case that by checking against str(zi) == "UTC", which, incidentally, would work for pytz as well).

I think the most important thing about this is how it would affect how these things get hashed. If we go with option 2, then it would not be possible to hold two different instances of zones with the same key together in a set:

>>> s = {ZoneInfo("America/New_York"),
...      ZoneInfo.no_cache("America/New_York")}
>>> s
{zoneinfo.ZoneInfo(key='America/New_York')}

Which means that {dt.tzinfo for dt in bunch_of_datetimes} won’t necessarily give you all the ZoneInfo objects used in bunch_of_datetimes.

If we go with option 3, then zones that are links to one another or are distinct zones with the same behavior could not co-exist in a set together:

>>> s = {ZoneInfo("America/New_York"),
...      ZoneInfo("US/Eastern")}
>>> s
{zoneinfo.ZoneInfo(key='America/New_York')}

If we go with option 4, though, you wouldn’t be able to use == to tell whether two zones have identical behavior when they have different keys, so you can’t do something like this:

with open(some_file, "rb") as f:
    unknown_zi = ZoneInfo.from_file(f)

print(unknown_zi == ZoneInfo("America/New_York"))

You also wouldn’t have any way to detect whether two zones have the same behavior but different names (e.g. "US/Eastern" and "America/New_York").

In the end, I can sort of imagine uses for having some sort of value-based equality in ZoneInfo, but there’s no one obvious choice here. I don’t know why people would want to use these things as keys in a dictionary, but maybe they would. I can also see some reasons for putting them in a set, but nothing so common that there’s one obvious use case.

In terms of performance, option 1 is the cheapest for both hashing and equality, and options 3 and 4 are the most expensive, but we can use a cache to at least make the hash calculation a one-time cost.

My proposal: I think that we should stick with option 1 (default implementation - comparison by object identity) for equality, because that most closely matches the semantics people will care about (and for the same reasons that we have pickle serializing by key).

If a lot of people are chafing at the inability to do “comparison by value”, in a future version we can offer an .equivalent_transitions() method that exposes the results of option 3. We would also have the option of changing __hash__ to be value-based in the future, since hash values aren’t guaranteed, and all we’d be doing is introducing some hash collisions, but that would allow people to create subclasses (or wrapper functions) with the __eq__ and __hash__ behavior described in either options 3 or 4.