I strongly believe that users want to reload time zone data transparently, without restarting their applications. The on-disk data is updated fairly frequently; see Red Hat Enterprise Linux Timezone Data (tzdata) - Development Status Page for an example of what one distribution does. I do not think that users would want to restart their application (with a scheduled downtime) just to apply one of those updates.
This means that the caching in the ZoneInfo constructor is very problematic.
Thank you for taking the time to comment on the proposal!
I understand where you are coming from here, but there are a lot of reasons to use the cache, and good reasons to believe that using the cache won’t be a problem.
The question of “reloading time zone data transparently” could mean that existing datetimes would be updated if the data on disk changes (which would be problematic from a datetimes-are-immutable point of view), or it could mean that newly-constructed datetimes are always pulled from the latest data. Assuming we can only do the second thing, that means that if you get time zone data updates during a run of your program, you will end up with a mixture of stale and non-stale time zones, which is also pretty non-ideal.
I think there’s also a lot of precedent for this kind of thing:
It is already the case that if system local time changes, you must call time.tzset() in order to invalidate the cache, and this only works on some platforms. We’re basically already in a situation where you must actively take action to get the “latest time zone information” during a run of the interpreter.
Right now more or less everyone uses a cache and there are not really any complaints. pytz and dateutil both use a similar caching behavior (and AFAIK pytz doesn’t even expose a way to opt out of it - everything is unconditionally cached). I don’t think I’ve heard of anyone complaining about this behavior or even noticing much.
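To make the tzset precedent concrete, here is a minimal sketch. It assumes a platform where time.tzset() exists (it does not on Windows), and the helper local_utc_offset_hours is a name made up for this illustration. The TZ values are POSIX-style rule strings, so no time zone database is needed.

```python
import os
import time


def local_utc_offset_hours():
    """Offset of the process-local standard time, in hours east of UTC."""
    return -time.timezone / 3600


if hasattr(time, "tzset"):  # not available on every platform (e.g. Windows)
    os.environ["TZ"] = "EST5EDT,M3.2.0,M11.1.0"
    time.tzset()
    offset_est = local_utc_offset_hours()

    # Changing the environment alone does not update the time module...
    os.environ["TZ"] = "CET-1CEST,M3.5.0,M10.5.0/3"
    stale = local_utc_offset_hours()

    # ...the new setting only takes effect after an explicit tzset() call.
    time.tzset()
    offset_cet = local_utc_offset_hours()
```

The point is only that the interpreter keeps using the old local-time settings until the program asks for a refresh.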
And one of the main drivers for this cache behavior is that the semantics of datetime explicitly assume that time zones are singletons, and you can run into some weird situations if you don’t use singletons. Consider this case:
Note that this example makes no use of the cache: I used the .nocache constructor to simulate the semantics of a non-caching constructor. Without a cache you could get strange path dependencies, where the answer differs depending on whether you arrived at a datetime by arithmetic or by construction / replace operations.
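The case described above can be approximated without any time zone database. The sketch below is not the original example: ToyEastern is a hypothetical stand-in that models only a single STD->DST transition on 2020-03-08, and two separate instances of it play the role of two zone objects built by a non-caching constructor.

```python
from datetime import datetime, timedelta, tzinfo


class ToyEastern(tzinfo):
    """Hypothetical zone: UTC-5 before 2020-03-08 02:00 local, UTC-4 after."""

    _DST_START = datetime(2020, 3, 8, 2)

    def dst(self, dt):
        is_dst = dt.replace(tzinfo=None) >= self._DST_START
        return timedelta(hours=1) if is_dst else timedelta(0)

    def utcoffset(self, dt):
        return timedelta(hours=-5) + self.dst(dt)

    def tzname(self, dt):
        return "EDT" if self.dst(dt) else "EST"


zone_a = ToyEastern()  # one "non-cached" zone object
zone_b = ToyEastern()  # same rules, but a distinct object

dt0 = datetime(2020, 3, 8, tzinfo=zone_a)
dt1 = dt0 + timedelta(days=1)     # arithmetic keeps zone_a
dt2 = dt1.replace(tzinfo=zone_b)  # same wall time, different tzinfo object

print(dt2 == dt1)  # True
print(dt2 - dt1)   # 0:00:00
print(dt2 - dt0)   # 23:00:00
print(dt1 - dt0)   # 1 day, 0:00:00
```

dt1 and dt2 compare equal, yet subtracting dt0 from each gives a different result, purely because of which tzinfo object each one carries.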
So, to summarize my position:
Cache by default because most people will want such a cache, even if they don’t know it.
Document various strategies and their trade-offs for people with long-running applications - including the use of ZoneInfo.clear_cache() for tzset-like behavior and ZoneInfo.nocache for “always give me a fresh copy” behavior.
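As a rough sketch of those strategies, the following uses the method names from the released zoneinfo module: ZoneInfo.no_cache (spelled .nocache in this thread) and ZoneInfo.clear_cache. To stay independent of any installed time zone data, it writes a hypothetical fixed-offset zone "Toy/Zone" into a temporary directory. The helpers write_fixed_offset_tzif and offset_of, and the zone itself, are assumptions made up for this illustration.

```python
import os
import struct
import tempfile
import zoneinfo
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def write_fixed_offset_tzif(path, offset_seconds, abbr):
    """Write a minimal version-1 TZif file with a single fixed offset."""
    designation = abbr.encode("ascii") + b"\x00"
    header = b"TZif" + b"\x00" + b"\x00" * 15  # magic, version 1, reserved
    # Counts: isutcnt, isstdcnt, leapcnt, timecnt, typecnt, charcnt
    counts = struct.pack(">6l", 0, 0, 0, 0, 1, len(designation))
    # One local time type record: UTC offset, is_dst flag, designation index
    record = struct.pack(">lbb", offset_seconds, 0, 0)
    with open(path, "wb") as f:
        f.write(header + counts + record + designation)


def offset_of(zone):
    """UTC offset the given zone reports for an arbitrary fixed date."""
    return datetime(2020, 1, 1, tzinfo=zone).utcoffset()


tz_root = tempfile.mkdtemp()
os.makedirs(os.path.join(tz_root, "Toy"))
zone_file = os.path.join(tz_root, "Toy", "Zone")
zoneinfo.reset_tzpath(to=[tz_root])  # search only the temporary directory

write_fixed_offset_tzif(zone_file, 3600, "TOY")
first = ZoneInfo("Toy/Zone")  # loaded from disk and cached

write_fixed_offset_tzif(zone_file, 7200, "TOY")  # simulate a tzdata update

cached = ZoneInfo("Toy/Zone")          # cache hit: the original object
fresh = ZoneInfo.no_cache("Toy/Zone")  # bypasses the cache, re-reads the file

ZoneInfo.clear_cache(only_keys=["Toy/Zone"])  # tzset-like invalidation
reloaded = ZoneInfo("Toy/Zone")               # cache miss: picks up new data

zoneinfo.reset_tzpath()  # restore the default search path
```

Under these assumptions, cached is the same object as first and still reports the old +1 hour offset, while fresh and reloaded both report +2 hours. Note that reset_tzpath changes the search path for the whole process, so a real application would not normally call it like this.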
This is because there’s an STD->DST transition between 2020-03-08 and 2020-03-09, so the difference in wall time is 24 hours, but the absolute elapsed time is 23 hours. I wrote a blog post about datetime arithmetic semantics that goes into more detail about this, but basically, the way datetime's arithmetic and comparisons work is something like this:
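from datetime import timezone

def subtract(a, b):
    # Same tzinfo object (identity, not equality): wall-time arithmetic
    if a.tzinfo is b.tzinfo:
        return a.replace(tzinfo=None) - b.replace(tzinfo=None)
    # Different tzinfo objects: convert both to UTC first
    UTC = timezone.utc
    return a.astimezone(UTC) - b.astimezone(UTC)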
So dt2 - dt0 is treated as two different zones and the math is done in UTC, whereas dt1 - dt0 is treated as the same zone, and the math is done in local time.
dt1 will necessarily be the same zone as dt0, because it’s the result of an arithmetical operation on dt0. dt2 is a different zone because I bypassed the cache, but if it hit the cache, the two would be the same.
This seems reasonable to me, as given the combination of “tzinfo objects are immutable” and “tzinfo objects are compared by identity in date arithmetic operations”, there’s going to need to be application level logic to cope with a tzdb change without restarting.
Clarifying the logic for not bundling tzdata in the Windows installers in the absence of support for the Windows ICU API: keeping tzdata up to date is going to require separate package installation commands anyway, so it’s reasonable to have the obvious failure (being unable to construct named timezones) happen first, such that users learn the required update command for their system?
If I’ve understood it correctly, I don’t think that rationale fully holds, as there are cases where it would be nice to be able to rely on having tz info available without having to introduce the complexities of package installation, and minor date arithmetic errors here and there would be acceptable. (I’m mostly thinking “teaching Python learners about time zones”, so such errors could even be used to illustrate the importance of keeping timezone DBs up to date)
In a lot of ways “tzdata is installed but not up to date” is just another form of the caching problem in long-running processes, except it’s occurring at the Python environment level.
That said, if we were to ship tzdata initially on Windows (by default), with the intent of eventually removing it from the default package set once the ICU API was supported and support for Windows versions without that API had been dropped, the public top-level module name could be problematic. So perhaps “tzdata” should be imported as “_tzdata” instead, to help make it clear not all systems will provide that module?
Thanks for the thoughtful comments on this. I particularly like the idea of thinking of “tzdata is installed but not up to date” as another form of the caching problem.
So I would say that that is not the logic for not bundling it, and also not even really something I had considered. The main reason I do not want to bundle tzdata together with Python on Windows or any other platform is that I started to think about the complexities of such a thing and realized that 1. this is a fairly hard packaging problem and 2. blocking the addition of any time zone support would be making the perfect the enemy of the good. Bundling tzdata with CPython is something I think we should tackle in the future (possibly, though not likely, as part of 3.9), but it’s something we can do in a perfectly backwards-compatible way, and it’s a very tricky packaging problem.
I also think that Windows may be something of a red herring here in that I’m not entirely sure that everything in the Python support matrix will necessarily have the system time zone data installed. I have recommended in the PEP that if you are a distributor of Python and it is possible for you to make tzdata a dependency of Python, you should do it, but it is not a requirement for compliance with the PEP (though I guess we could make it one).
From a practical perspective, I think "ship tzdata" leaves a lot to be desired:
1. For system Python, installing packages often requires administrator rights, which is undesirable. A somewhat better solution is to use pip install --user, but this means that the base install will never actually get “upgraded”.
2. A corollary to 1. is that I believe that virtual environments will ignore your --user-installed packages, which means that every time you create a virtual environment, you’ll need to update your tzdata even if you’ve already “globally” updated it.
3. There is no pip update command; you need to explicitly select the packages you want to update. This is half the reason I almost exclusively work in virtual environments: it’s easier for me to state the requirements and create a new virtual environment from scratch than to try to track the state of my baseline environment.
The "tzdata unbundled" situation I’m imagining is that if you want to use time zone data and want something 100% pip-installable, you just declare a dependency on tzdata in your application or library (possibly conditional on Windows, if we establish that Windows is the only platform where this is a problem). If you are only targeting platforms that you know will have the system time zone data or you can declare a dependency on "install system tzdata" (e.g. conda), then you can omit the declaration.
If people follow this strategy, then everything should work. pip install -U something-depending-on-tzdata will upgrade your tzdata if and only if you need it. When you create a virtual environment, the latest tzdata will be installed when you pip install tzdata.
And to reiterate, this is not the situation I see going forward forever, just a reasonably tolerable version that is better than something we hack together at the last minute to get this in before feature freeze. At some point I expect those tzdata declarations to turn into tzdata; python_version < "3.10" or whatever.
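For illustration, here is roughly what such a declaration and the corresponding failure mode could look like. The requirement string uses standard environment-marker syntax. The INSTALL_REQUIRES name and the load_zone helper are hypothetical and not part of the proposal.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Requirement strings such as a project might list in its packaging
# metadata: depend on tzdata only where no system database is expected.
INSTALL_REQUIRES = [
    'tzdata; sys_platform == "win32"',
]


def load_zone(key):
    """Construct a zone, turning a missing-data error into a clearer hint."""
    try:
        return ZoneInfo(key)
    except ZoneInfoNotFoundError as exc:
        raise RuntimeError(
            f"No time zone data found for {key!r}; "
            "is the system tz database or the tzdata package installed?"
        ) from exc
```

A key that no source knows about, such as a made-up "Not/A_Real_Zone", would take the error branch whether or not any time zone data is installed.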
After some talking with @steve.dower, I’m not nearly as confident with the ICU-based solution as I was in the past. It seems that the part of ICU that Windows exposes may not be suitable for eagerly populating ZoneInfo, and as a result I’m not confident that it is necessarily appropriate to attempt to transparently fall back to it. We are still exploring somewhat, which is why I have not yet updated the PEP.
That said, one of the values of exposing tzdata as a public module was to allow libraries that ship their own time zone data to depend on it. My plan was to have dateutil >= 3.0 start depending on tzdata, for example. The way it’s designed, it can be used in libraries that support older versions of Python (thus opening the way for a zoneinfo backport), so I definitely want to keep the PyPI version public.
We could theoretically have the default-installed version be _tzdata, or simply install the zoneinfo files somewhere other than site-packages with lower precedence than the tzdata module, but I think that gets into tricky packaging territory and people would find it hard to determine the source of their time zone data. Again, I think this is another reason to push the bundling question to a separate PEP.
Thanks for the explanation, and I agree that “deferring for now because it’s tricky and we can still make things better without it” is a good reason for leaving any form of tzdata bundling out of this initial iteration of the feature. (And since I forgot to actually write it down the first time: definite +1 on the overall proposal. Thank you for putting it together!)
As additional notes for a possible future bundling implementation:
ensurepip-style bundling of tzdata wouldn’t actually help much, for the reasons you gave:
user-level tzdata upgrades will be invisible to all venvs
users may not have permissions for Python installation level upgrades
even installation level upgrades will be invisible in venvs that aren’t using the system site-packages
given those limitations, any bundling would likely need to be as a regular stdlib module, so the fallback would be visible in all venvs. Being a regular stdlib module would restrict updates to Python maintenance releases rather than PyPI package releases. However, if the stdlib fallback used the name “_tzdata”, we could still set up the logic to prefer the system timezone db and the public tzdata module to the private stdlib fallback. Because of the downside of keeping two copies of the time zone db around, we would probably restrict the fallback bundling to platforms with no usable system time zone db (e.g. Windows)
that approach would mean that anyone keeping current on their Python maintenance releases and/or system package updates would be getting reasonably up to date tzdata info (no more than a few months old), while anyone that needed to consistently get tz updates within days would need to install and use the public tzdata module
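The lookup order suggested in these notes could be sketched like this. Everything here is a stand-in: the three sources are plain dictionaries with placeholder payloads, and load_zone_data is a hypothetical name, not an existing API.

```python
def load_zone_data(key, sources):
    """Return (source_name, data) from the first source that knows `key`."""
    for name, loader in sources:
        try:
            return name, loader(key)
        except LookupError:
            continue  # this source has no entry; try the next one
    raise LookupError(f"no time zone data for {key!r}")


# Hypothetical in-memory stand-ins for the three places data could live.
system_db = {}  # e.g. a platform with no usable system time zone db
public_tzdata = {"Europe/Rome": b"newer data"}
private_fallback = {"Europe/Rome": b"older data", "Asia/Tokyo": b"older data"}

# Preference order: system db, then public tzdata, then the private fallback.
sources = [
    ("system", system_db.__getitem__),
    ("tzdata", public_tzdata.__getitem__),
    ("_tzdata", private_fallback.__getitem__),
]

print(load_zone_data("Europe/Rome", sources))    # ('tzdata', b'newer data')
print(load_zone_data("Asia/Tokyo", sources)[0])  # _tzdata
```

The private fallback is only consulted when neither of the preferred sources has the key.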
To be clear, arrow is a wrapper around datetime and arrow does not provide its own time zone support. This PEP is suggesting that we provide a specific concrete implementation of the tzinfo abstract base class and the API questions are mostly around how you construct those tzinfo objects, not around how they interact with datetime.
Think of the cache/no cache distinction as between the questions “What was the first definition seen in the available time zone databases for this time zone by the running application?” and “What is the definition of this time zone in the available time zone databases right now?”
Most applications and services want the first behaviour, where they keep a consistent set of time zone definitions while running, and then pick up changes when they restart.
However, some long-lived applications need to be able to manage their own time zone caching, so the “nocache” APIs exist to let them do that. (Bypassing the cache is also useful for testing purposes)
Yes, I understood that. My only doubt is whether such atomicity is needed. Wouldn’t it be simpler for the API to only allow clearing the entire cache, or to provide a mechanism to disable the cache?
Anyway, I think this is interesting. Maybe it could be applied to all immutables that have a caching system, like str. This way you would also be able to disable all the caching with a single command, maybe with a command-line parameter, similar to -u.
The cache is global and I think mutating a global cache like that would not necessarily be something you want to do lightly. Having a lightweight version of the ZoneInfo constructor that gives you a fresh copy is going to have an entirely localized effect (and in fact I would suspect that some people would want to maintain their own caches, separate from the global cache).
In the end, the cost of including the nocache option is very small:
The implementation is simple because we basically need this functionality for cache misses anyway.
It is an obscure alternate constructor where in order to even think you want to use it, you have to read the documentation to understand how it works anyway, so people are not likely to be confused and think it does something it doesn’t.
In the end, a combination of the ability to clear the global cache on a per-key basis and .nocache gives end users the ability to achieve whatever cache behavior they would prefer with very little in the way of maintenance cost, plus it makes the whole thing easier to test, so I think it’s a clear win even if it’s a relatively obscure use case.
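As a sketch of what “maintaining your own cache” might look like on top of a non-caching constructor: ZoneRegistry and its methods are hypothetical names, and the default loader is ZoneInfo.no_cache from the released module (the .nocache constructor discussed here).

```python
from zoneinfo import ZoneInfo


class ZoneRegistry:
    """Hypothetical application-level cache, separate from the global one."""

    def __init__(self, loader=ZoneInfo.no_cache):
        self._loader = loader  # called with a key, returns a tzinfo object
        self._zones = {}

    def get(self, key):
        """Return this registry's object for `key`, loading it on first use."""
        if key not in self._zones:
            self._zones[key] = self._loader(key)
        return self._zones[key]

    def refresh(self, key):
        """Reload one key; objects handed out earlier are left untouched."""
        self._zones[key] = self._loader(key)
        return self._zones[key]
```

Because the loader is injectable, the registry can be tested with a fake loader, and refreshing it never touches the global ZoneInfo cache that other code in the process relies on.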
This is a separate issue probably best described in “Ideas”, but from my perspective, I’d say that time zones are in a very specific situation that makes it important to be able to control the cache:
Critical parts of the semantics depend on object identity, not equality, so this cache is not being maintained for performance reasons: whether or not you hit the cache actually changes the documented behavior.
The time zone being represented can change during the course of the interpreter run - a string or an integer has a single canonical representation, but these time zone files can and do change frequently.
If you look at the reference implementation, I actually also maintain a cache of timedelta objects used in the ZoneInfo objects for performance reasons. I have not mentioned this at all in the PEP or even provided any way to manipulate that cache in the implementation because it has no bearing on the behavior. I think you’d want a fairly good reason to expose an API to give people fine-grained control over what is effectively an implementation detail.
Yes, maybe there’s no use case for allowing the cache of immutables to be disabled.
What about storing the last-modified date of the source file, on platforms that support it and have it enabled?
If the last-modified date has changed, the cached object is replaced by a new object.
This way you would rarely have to call ZoneInfo.nocache explicitly, and it would also be useful for programs that have to run indefinitely, like servers.
If the last-modified date is unsupported or disabled, you can always use ZoneInfo.nocache or clear the cache manually.
PS: md5 or a stronger hash would be better, but maybe too slow?
@Marco_Sulla Automatic cache invalidation won’t work, as datetime objects are immutable, and all existing datetime objects will be referencing the old cache entry.
So applications that want to support dynamic timezone cache invalidation are going to need to be written specifically to support it (potentially by using a different datetime object implementation).
The PEP doesn’t try to solve that problem, because most applications won’t need it (update on restart will be fine), and the applications that do need it will have clearer requirements on the behaviour they actually want.
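A small sketch of why swapping the cache entry does not help existing objects. Fixed-offset timezone objects stand in for real zones here, and zone_cache is a hypothetical dictionary, not the actual ZoneInfo cache.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cache mapping keys to tzinfo objects.
zone_cache = {"Toy/Zone": timezone(timedelta(hours=1), "TOY")}

before = datetime(2020, 6, 1, 12, tzinfo=zone_cache["Toy/Zone"])

# "Automatic invalidation": the cache entry is swapped for updated rules.
zone_cache["Toy/Zone"] = timezone(timedelta(hours=2), "TOY")

after = datetime(2020, 6, 1, 12, tzinfo=zone_cache["Toy/Zone"])

print(before.utcoffset())             # 1:00:00, still the old rules
print(after.utcoffset())              # 2:00:00
print(before.tzinfo is after.tzinfo)  # False
```

The datetime created before the swap keeps pointing at the old object, so the program ends up with the mixture of stale and fresh zones described earlier in the thread.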
Mmmmhhhh… what about a PyZoneInfo_Update()?
# "Europe/Rome" was never constructed before, so this call caches it
Italy = ZoneInfo("Europe/Rome")
# in the meantime, the EU abolishes daylight saving time...
Italy2 = ZoneInfo("Europe/Rome")
# the source file for European countries changed, so its last-modified
# date is different. PyZoneInfo_Update() is called on the cached
# ZoneInfo("Europe/Rome"), and Italy2 is bound to the cached object
Italy3 = ZoneInfo.nocache("Europe/Rome")
Italy is Italy2 is Italy3  # True under this proposal