PEP 615: Support for the IANA Time Zone Database in the Standard Library

pganssle · February 25, 2020, 3:34pm

Last year at the Language Summit, I proposed to add additional concrete time zones to the standard library. After much work and much more procrastination, I have now put together my first proposal: support for the IANA time zone database (also called tz, zoneinfo or the Olson database; Wikipedia). I have submitted this for consideration as PEP 615.

Please give it a read and provide your constructive criticism. I have documented my reasoning for most of the design decisions, but I am very interested in getting feedback on the choices I’ve made here. I have marked two sections as “Open issues”, as they are the ones I feel most uncertain about, but all parts of it are up for discussion.

Below (second post in this thread) is the full text of the PEP. I will try to keep it updated as the drafts evolve so that it is easy to quote in responses on Discourse, but the PEPs GitHub repository is the canonical source of the PEP.

Links for easy access:

PEP: PEP 615 (Github)
Reference Implementation: https://github.com/pganssle/zoneinfo
tzdata repo: https://github.com/pganssle/tzdata

pganssle · February 25, 2020, 3:34pm

Created: 2020-02-22
Python-Version: 3.9

Abstract

This proposes adding a module, zoneinfo, to provide a concrete time
zone implementation supporting the IANA time zone database. By default,
zoneinfo will use the system's time zone data if available; if no
system time zone data is available, the library will fall back to using
the first-party package tzdata, deployed on PyPI.

Motivation

The datetime library uses a flexible mechanism to handle time zones:
all conversions and time zone information queries are delegated to an
instance of a subclass of the abstract datetime.tzinfo base class.^[1]
This allows users to implement arbitrarily complex time zone rules, but
in practice the majority of users want support for just three types of
time zone: [a]

UTC and fixed offsets thereof
The system local time zone
IANA time zones

In Python 3.2, the datetime.timezone class was introduced to support
the first class of time zone (with a special datetime.timezone.utc
singleton for UTC).

While there is still no "local" time zone, in Python 3.0 the semantics
of naïve time zones was changed to support many "local time"
operations, and it is now possible to get a fixed time zone offset from
a local time:

>>> print(datetime(2020, 2, 22, 12, 0).astimezone())
2020-02-22 12:00:00-05:00
>>> print(datetime(2020, 2, 22, 12, 0).astimezone()
...       .strftime("%Y-%m-%d %H:%M:%S %Z"))
2020-02-22 12:00:00 EST
>>> print(datetime(2020, 2, 22, 12, 0).astimezone(timezone.utc))
2020-02-22 17:00:00+00:00

However, there is still no support for the time zones described in the
IANA time zone database (also called the "tz" database or the Olson
database ^[2]). The time zone database is in the public domain and is
widely distributed — it is present by default on many Unix-like
operating systems. Great care goes into the stability of the database:
there are IETF RFCs both for the maintenance procedures (RFC 6557^[3])
and for the compiled binary (TZif) format (RFC 8536^[4]). As such, it is
likely that adding support for the compiled outputs of the IANA database
will add great value to end users even with the relatively long cadence
of standard library releases.

Proposal

This PEP has three main concerns:

The semantics of the zoneinfo.ZoneInfo class
Time zone data sources used
Options for configuration of the time zone search path

Because of the complexity of the proposal, rather than having separate
"specification" and "rationale" sections the design decisions and
rationales are grouped together by subject.

The `zoneinfo.ZoneInfo` class

Constructors

The initial design of the zoneinfo.ZoneInfo class has several
constructors.

ZoneInfo(key: str)

The primary constructor takes a single argument, key, which is a string
indicating the name of a zone file in the system time zone database (e.g.
"America/New_York", "Europe/London"), and returns a ZoneInfo constructed
from the first matching data source on the search path (see the Sources for
time zone data section for more details). All zone information must be eagerly read from the
data source (usually a TZif file) upon construction, and may not change during
the lifetime of the object (this restriction applies to all ZoneInfo
constructors).

One somewhat unusual guarantee made by this constructor is that calls
with identical arguments must return identical objects. Specifically,
for all values of key, the following assertion must always be valid
[b]:

    a = ZoneInfo(key)
    b = ZoneInfo(key)
    assert a is b

The reason for this comes from the fact that the semantics of datetime
operations (e.g. comparison, arithmetic) depend on whether the datetimes
involved represent the same or different zones; two datetimes are in the
same zone only if dt1.tzinfo is dt2.tzinfo.^[5] In addition to the
modest performance benefit from avoiding unnecessary proliferation of
ZoneInfo objects, providing this guarantee should minimize surprising
behavior for end users.

dateutil.tz.gettz has provided a similar guarantee since version 2.7.0
(released March 2018).^[6]

Note
The implementation may decide how to implement the cache behavior, but
the guarantee made here only requires that as long as two references
exist to the result of identical constructor calls, they must be
references to the same object. This is consistent with a reference
counted cache where ZoneInfo objects are ejected when no references to
them exist (for example, a cache implemented with a
weakref.WeakValueDictionary). It is allowed, but not required or
recommended, to implement this with a "strong" cache, where all
ZoneInfo objects are kept alive indefinitely.
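
One way to meet this guarantee can be sketched as follows. This is only an illustration under stated assumptions: CachedZone is a hypothetical stand-in that stores a key and loads no time zone data, and it is not the reference implementation.

```python
import weakref


class CachedZone:
    """Hypothetical stand-in for ZoneInfo; it stores a key but loads no data."""

    # Weak values: an entry disappears once nothing else references the object.
    _weak_cache = weakref.WeakValueDictionary()

    def __new__(cls, key):
        instance = cls._weak_cache.get(key)
        if instance is None:
            # Cache miss: build the object and remember it under its key.
            instance = super().__new__(cls)
            instance.key = key
            cls._weak_cache[key] = instance
        return instance


a = CachedZone("America/New_York")
b = CachedZone("America/New_York")
assert a is b  # identical arguments yield the identical object
```

Because the dictionary holds only weak references, an entry is dropped once the last outside reference goes away, which matches the reference-counted behavior described in the note.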

ZoneInfo.nocache(key: str)

This is an alternate constructor that bypasses the constructor's cache.
It is identical to the primary constructor, but returns a new object on
each call. This is likely most useful for testing purposes, or to
deliberately induce "different zone" semantics between datetimes with
the same nominal time zone.

Even if an object constructed by this method would have been a cache
miss, it must not be entered into the cache; in other words, the
following assertion should always be true:

>>> a = ZoneInfo.nocache(key)
>>> b = ZoneInfo(key)
>>> a is not b

ZoneInfo.from_file(fobj: IO[bytes], /, key: str = None)

This is an alternate constructor that allows the construction of a
ZoneInfo object from any TZif byte stream. This constructor takes an
optional parameter, key, which sets the name of the zone, for the
purposes of __str__ and __repr__ (see
String representation).

Unlike the primary constructor, this always constructs a new object.
There are two reasons that this deviates from the primary constructor's
caching behavior: stream objects have mutable state and so determining
whether two inputs are identical is difficult or impossible, and it is
likely that users constructing from a file specifically want to load
from that file and not a cache.

As with ZoneInfo.nocache, objects constructed by this method must not
be added to the cache.

Behavior during data updates

If a source of time zone data is updated during a run of the
interpreter, it will not invalidate any caches or modify any existing
ZoneInfo objects, but newly constructed ZoneInfo objects should come
from the updated data source.

This means that the point at which a ZoneInfo object is updated depends
primarily on the semantics of the caching behavior. The only guaranteed
way to get a ZoneInfo object from an updated data source is to induce a
cache miss, either by bypassing the cache and using ZoneInfo.nocache
or by clearing the cache.

Note
The specified cache behavior does not require that the cache be lazily
populated — it is consistent with the specification (though not
recommended) to eagerly pre-populate the cache with time zones that have
never been constructed.

String representation

The ZoneInfo class's __str__ representation will be drawn from the
key parameter. This is partially because the key represents a
human-readable "name" of the zone, but also because it is a useful
parameter that users will want exposed. It is necessary to provide a
mechanism to expose the key for serialization between languages and
because it is also a primary key for localization projects like CLDR
(the Unicode Common Locale Data Repository).

An example:

>>> zone = ZoneInfo("Pacific/Kwajalein")
>>> str(zone)
'Pacific/Kwajalein'

When a key is not specified, the str operation should not fail, but
should return the object's __repr__:

>>> zone = ZoneInfo.from_file(f)
>>> str(zone)
'ZoneInfo.from_file(<_io.BytesIO object at ...>)'

The __repr__ for a ZoneInfo is implementation-defined and not
necessarily stable between versions, but it must not be a valid
ZoneInfo key.

Pickle serialization

Rather than serializing all transition data, ZoneInfo objects will be
serialized by key, and ZoneInfo objects constructed from raw files
(even those with a value for key specified) cannot be pickled.

The pickling behavior of a ZoneInfo object depends on how it was constructed:

ZoneInfo(key): When constructed with the primary constructor, a
ZoneInfo object will be serialized by key, and when deserialized
it will be reconstructed with the primary constructor, and is thus
expected to be the same object as other references to
the same time zone. For example, if europe_berlin_pkl is a bytes object
containing a pickle constructed from ZoneInfo("Europe/Berlin"),
one would expect the following behavior:
```
>>> a = ZoneInfo("Europe/Berlin")
>>> b = pickle.loads(europe_berlin_pkl)
>>> a is b
True
```
ZoneInfo.nocache(key): When constructed from the cache-bypassing
constructor, the ZoneInfo object will still be serialized by key,
but when deserialized, it will use the cache bypassing constructor.
If europe_berlin_pkl_nc is a bytes object containing a pickle
constructed from ZoneInfo.nocache("Europe/Berlin"), one would
expect the following behavior:
```
>>> a = ZoneInfo("Europe/Berlin")
>>> b = pickle.loads(europe_berlin_pkl_nc)
>>> a is b
False
```
ZoneInfo.from_file(fobj, /, key=None): When constructed from a
file, the ZoneInfo object will raise an exception on pickling. If
an end user wants to pickle a ZoneInfo constructed from a file, it
is recommended that they use a wrapper type or a custom
serialization function: either serializing by key or storing the
contents of the file object and serializing that.

This method of serialization requires that the time zone data for the
required key be available on both the serializing and deserializing
side, similar to the way that references to classes and functions are
expected to exist in both the serializing and deserializing
environments. It also means that no guarantees are made about the
consistency of results when unpickling a ZoneInfo pickled in an
environment with a different version of the time zone data.
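
The key-based serialization described above can be sketched with __reduce__. This is a hedged illustration, not the reference implementation: KeyedZone and its private helpers are hypothetical names, the class holds no zone data, and the from_file case (which should raise on pickling) is left out.

```python
import weakref


class KeyedZone:
    """Hypothetical stand-in for ZoneInfo that remembers how it was built."""

    _weak_cache = weakref.WeakValueDictionary()

    def __new__(cls, key):
        instance = cls._weak_cache.get(key)
        if instance is None:
            instance = cls._build(key, from_cache=True)
            cls._weak_cache[key] = instance
        return instance

    @classmethod
    def nocache(cls, key):
        # Bypasses the cache and never populates it.
        return cls._build(key, from_cache=False)

    @classmethod
    def _build(cls, key, from_cache):
        instance = object.__new__(cls)
        instance.key = key
        instance._from_cache = from_cache
        return instance

    @classmethod
    def _unpickle(cls, key, from_cache):
        # Re-run the same kind of constructor that built the original object.
        return cls(key) if from_cache else cls.nocache(key)

    def __reduce__(self):
        # Only the key and the constructor type are serialized, no zone data.
        return (self._unpickle, (self.key, self._from_cache))
```

With such a class defined at module level, a round trip through pickle.dumps and pickle.loads would be expected to return the cached object for KeyedZone("Europe/Berlin") and a fresh object for KeyedZone.nocache("Europe/Berlin").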

Sources for time zone data

One of the hardest challenges for IANA time zone support is keeping the
data up to date; between 1997 and 2020, there have been between 3 and 21
releases per year, often in response to changes in time zone rules with
little to no notice (see^[7] for more details). In order to keep up to
date, and to give the system administrator control over the data source,
we propose to use system-deployed time zone data wherever possible.
However, not all systems ship a publicly accessible time zone database
— notably Windows uses a different system for managing time zones —
and so zoneinfo falls back to an installable first-party
package, tzdata, available on PyPI. If no system zoneinfo files are
found but tzdata is installed, the primary ZoneInfo constructor will
use tzdata as the time zone source.

System time zone information

Many Unix-like systems deploy time zone data by default, or provide a
canonical time zone data package (often called tzdata, as it is on
Arch Linux, Fedora and Debian). Whenever possible, it would be
preferable to defer to the system time zone information, because this
allows time zone information for all language stacks to be updated and
maintained in one place. Python distributors are encouraged to ensure
that time zone data is installed alongside Python whenever possible
(e.g. by declaring tzdata as a dependency for the python package).

The zoneinfo module will use a "search path" strategy analogous to
the PATH environment variable or the sys.path variable in Python;
the zoneinfo.TZPATH variable will be a read-only, ordered list of
time zone data locations to search (see Search path configuration
for more details). When creating a ZoneInfo
instance from a key, the zone file will be constructed from the first
data source on the path in which the key exists, so for example, if
TZPATH were:

    TZPATH = (
        "/usr/share/zoneinfo",
        "/etc/zoneinfo"
        )

and (although this would be very unusual) /usr/share/zoneinfo
contained only America/New_York and /etc/zoneinfo contained both
America/New_York and Europe/Moscow, then
ZoneInfo("America/New_York") would be satisfied by
/usr/share/zoneinfo/America/New_York, while
ZoneInfo("Europe/Moscow") would be satisfied by
/etc/zoneinfo/Europe/Moscow.
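
The "first match wins" lookup described above can be sketched as a short function. The name find_tzfile is hypothetical, and the sketch deliberately skips key validation and TZif parsing; it only shows the search order.

```python
import os


def find_tzfile(key, tzpath):
    """Return the path of the first file matching key on tzpath, else None."""
    for root in tzpath:
        # Keys always use "/" as a separator, whatever the platform.
        candidate = os.path.join(root, *key.split("/"))
        if os.path.isfile(candidate):
            return candidate
    return None
```

With the TZPATH from the example, find_tzfile("Europe/Moscow", TZPATH) would skip /usr/share/zoneinfo, where the key is missing, and return the file under /etc/zoneinfo.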

At the moment, on Windows systems, the search path will default to
empty, because Windows does not officially ship a copy of the time zone
database. On non-Windows systems, the search path will default to a list
of the most commonly observed search paths. Although this is subject to
change in future versions, at launch the default search path will be:

TZPATH = (
    "/usr/share/zoneinfo",
    "/usr/lib/zoneinfo",
    "/usr/share/lib/zoneinfo",
    "/etc/zoneinfo",
)

This may be configured either at compile time or at runtime; see
Search path configuration for more information on the
configuration options.

The `tzdata` Python package

In order to ensure easy access to time zone data for all end users, this
PEP proposes to create a data-only package tzdata as a fallback for
when system data is not available. The tzdata package would be
distributed on PyPI as a "first party" package, maintained by the
CPython development team.

The tzdata package contains only data and metadata, with no
public-facing functions or classes. It will be designed to be compatible
with both newer importlib.resources^[8] access patterns and older
access patterns like pkgutil.get_data^[9] .
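
As a rough sketch of the importlib.resources access pattern, the function below tries to read a zone's raw bytes from an installed tzdata package. The function name load_tzdata_bytes is hypothetical, and the package layout it assumes (a tzdata.zoneinfo subpackage with one resource per key component) is an assumption for illustration, not something this text specifies.

```python
from importlib import resources


def load_tzdata_bytes(key):
    """Return raw bytes for key from an installed tzdata package, else None."""
    try:
        # Assumed layout: tzdata/zoneinfo/<Area>/<Location>
        node = resources.files("tzdata.zoneinfo")
        for part in key.split("/"):
            node = node.joinpath(part)
        return node.read_bytes()
    except (ImportError, OSError):
        # tzdata is not installed, or the key is not a file in the package.
        return None
```

Returning None rather than raising lets a caller treat tzdata as one more data source to try after the system search path.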

While it is designed explicitly for the use of CPython, the tzdata
package is intended as a public package in its own right, and it may be
used as an "official" source of time zone data for third party Python
packages.

Search path configuration

The time zone search path is very system-dependent, and sometimes even
application-dependent, and as such it makes sense to provide options to
customize it. This PEP provides for three such avenues for
customization:

Global configuration via a compile-time option
Per-run configuration via environment variables
Runtime configuration change via a reset_tzpath function

Compile-time options

It is most likely that downstream distributors will know exactly where
their system time zone data is deployed, and so a compile-time option
PYTHONTZPATH will be provided to set the default search path.

The PYTHONTZPATH option should be a string delimited by os.pathsep,
listing possible locations for the time zone data to be deployed (e.g.
/usr/share/zoneinfo).

Environment variables

When initializing TZPATH (and whenever reset_tzpath is called with
no arguments), the zoneinfo module will use the environment variable
PYTHONTZPATH, if it exists, to set the search path.

PYTHONTZPATH is an os.pathsep-delimited string which replaces
(rather than augments) the default time zone path. Some examples of the
proposed semantics:

$ python print_tzpath.py
("/usr/share/zoneinfo",
 "/usr/lib/zoneinfo",
 "/usr/share/lib/zoneinfo",
 "/etc/zoneinfo")

$ PYTHONTZPATH="/etc/zoneinfo:/usr/share/zoneinfo" python print_tzpath.py
("/etc/zoneinfo",
 "/usr/share/zoneinfo")

$ PYTHONTZPATH="" python print_tzpath.py
()
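
The three cases above (variable unset, set to a list, set to the empty string) can be sketched as a small parsing function. The name tzpath_from_env is hypothetical, and DEFAULT_TZPATH simply copies the default list given earlier.

```python
import os

DEFAULT_TZPATH = (
    "/usr/share/zoneinfo",
    "/usr/lib/zoneinfo",
    "/usr/share/lib/zoneinfo",
    "/etc/zoneinfo",
)


def tzpath_from_env(environ, default=DEFAULT_TZPATH):
    """Compute the search path from a mapping such as os.environ."""
    raw = environ.get("PYTHONTZPATH")
    if raw is None:
        return tuple(default)  # variable unset: keep the default path
    if raw == "":
        return ()  # variable set but empty: empty search path
    # The variable replaces the default rather than augmenting it.
    return tuple(raw.split(os.pathsep))
```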

This provides no built-in mechanism for prepending or appending to the
default search path, as these use cases are likely to be somewhat more
niche. It should be possible to populate an environment variable with
the default search path fairly easily:

$ export DEFAULT_TZPATH=$(python -c \
    "import os, zoneinfo; print(os.pathsep.join(zoneinfo.TZPATH))")

`reset_tzpath` function

zoneinfo provides a reset_tzpath function that allows for changing the
search path at runtime.

def reset_tzpath(
    to: Optional[Sequence[Union[str, os.PathLike]]] = None
) -> None:
    ...

When called with a sequence of paths, this function sets
zoneinfo.TZPATH to a tuple constructed from the desired value. When
called with no arguments or None, this function resets
zoneinfo.TZPATH to the default configuration.

This is likely to be primarily useful for (permanently or temporarily)
disabling the use of system time zone paths and forcing the module to
use the tzdata package. It is not likely that reset_tzpath will be a
common operation, save perhaps in test functions sensitive to time zone
configuration, but it seems preferable to provide an official mechanism
for changing this rather than allowing a proliferation of hacks around
the immutability of TZPATH.

Caution

Although changing TZPATH during a run is a supported operation, users
should be advised that doing so may occasionally lead to unusual
semantics, and when making design trade-offs greater weight will be
afforded to using a static TZPATH, which is the much more common use
case.

As noted in Constructors, the primary ZoneInfo
constructor employs a cache to ensure that two identically-constructed
ZoneInfo objects always compare as identical (i.e.
ZoneInfo(key) is ZoneInfo(key)), and the nature of this cache is
implementation-defined. This means that the behavior of the ZoneInfo
constructor may be unpredictably inconsistent in some situations when
used with the same key under different values of TZPATH. For
example:

>>> reset_tzpath(to=["/my/custom/tzdb"])
>>> a = ZoneInfo("My/Custom/Zone")
>>> reset_tzpath()
>>> b = ZoneInfo("My/Custom/Zone")
>>> del a
>>> del b
>>> c = ZoneInfo("My/Custom/Zone")

In this example, My/Custom/Zone exists only in /my/custom/tzdb
and not on the default search path. In all implementations the
constructor for a must succeed. It is implementation-defined whether
the constructor for b succeeds, but if it does, it must be true that
a is b, because both a and b are references to the same key. It is
also implementation-defined whether the constructor for c succeeds.
Implementations of zoneinfo may return the object constructed in
previous constructor calls, or they may fail with an exception.

Backwards Compatibility

This will have no backwards compatibility issues as it will create a new
API.

With only minor modification, a backport with support for Python 3.6+ of
the zoneinfo module could be created.

The tzdata package is designed to be "data only", and should support
any version of Python that it can be built for (including Python 2.7).

Security Implications

This will require parsing zoneinfo data from disk, mostly from system
locations but potentially from user-supplied data. Errors in the
implementation (particularly the C code) could cause potential security
issues, but there is no special risk relative to parsing other file
types.

Because the time zone data keys are essentially paths relative to some
time zone root, implementations should take care to avoid path traversal
attacks. Requesting keys such as ../../../path/to/something should not
reveal anything about the state of the file system outside of the time
zone path.
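
One possible guard against such keys is sketched below. The function name resolve_key is hypothetical, and the check is purely lexical: it does not follow symbolic links, so a real implementation may need additional care.

```python
import os


def resolve_key(root, key):
    """Return the path for key under root, refusing keys that escape root."""
    if os.path.isabs(key):
        raise ValueError(f"absolute keys are not allowed: {key!r}")
    root = os.path.abspath(root)
    # normpath collapses ".." components so the result can be compared.
    candidate = os.path.normpath(os.path.join(root, key))
    if candidate == root or os.path.commonpath([root, candidate]) != root:
        raise ValueError(f"key escapes the time zone root: {key!r}")
    return candidate
```

Rejecting the key before touching the file system means that a request for ../../../path/to/something fails the same way whether or not the target exists.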

Reference Implementation

An initial reference implementation is available at
https://github.com/pganssle/zoneinfo

This may eventually be converted into a backport for 3.6+.

Rejected Ideas

Building a custom tzdb compiler

One major concern with the use of the TZif format is that it does not
actually contain enough information to always correctly determine the
value to return for tzinfo.dst(). This is because for any given time
zone offset, TZif only marks the UTC offset and whether or not it
represents a DST offset, but tzinfo.dst() returns the total amount of
the DST shift, so that the "standard" offset can be reconstructed from
datetime.utcoffset() - datetime.dst(). The value to use for dst()
can be determined by finding the equivalent STD offset and calculating
the difference, but the TZif format does not specify which offsets form
STD/DST pairs, and so heuristics must be used to determine this.

One common heuristic — looking at the most recent standard offset —
notably fails in the case of the time zone changes in Portugal in 1992
and 1996, where the "standard" offset was shifted by 1 hour during a
DST transition, leading to a transition from STD to DST status with no
change in offset. In fact, it is possible (though it has never happened)
for a time zone to be created that is permanently DST and has no
standard offsets.
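
The "most recent standard offset" heuristic, and the way it breaks down, can be sketched as follows. The function name and the input format (a list of (UTC offset in seconds, isdst) pairs in transition order) are assumptions for illustration, and the second data set is a simplified sequence modeled on the no-change-in-offset case described above, not real transition data.

```python
from datetime import timedelta


def dst_by_previous_standard(ttinfos):
    """Guess dst() for each (utcoffset_seconds, isdst) pair, in order.

    Standard entries get a zero shift; DST entries are compared with the most
    recent standard offset. None means no standard offset has been seen yet.
    """
    results = []
    last_std = None
    for utcoff, isdst in ttinfos:
        if not isdst:
            last_std = utcoff
            results.append(timedelta(0))
        elif last_std is None:
            results.append(None)
        else:
            results.append(timedelta(seconds=utcoff - last_std))
    return results


# Ordinary case: UTC-5 standard followed by UTC-4 DST gives a one hour shift.
ordinary = dst_by_previous_standard([(-18000, False), (-14400, True)])

# Problem case: standard and DST entries share the same UTC offset, so the
# heuristic reports a zero shift for a period that is flagged as DST.
problem = dst_by_previous_standard([(3600, False), (3600, True)])
```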

Although this information is missing in the compiled TZif binaries, it
is present in the raw tzdb files, and it would be possible to parse this
information ourselves and create a more suitable binary format.

This idea was rejected for several reasons:

It precludes the use of any system-deployed time zone information,
which is usually present only in TZif format.
The raw tzdb format, while stable, is less stable than the TZif
format; some downstream tzdb parsers have already run into problems
with old deployments of their custom parsers becoming incompatible
with recent tzdb releases, leading to the creation of a
"rearguard" format to ease the transition.^[10]
Heuristics currently suffice in dateutil and pytz for all known
time zones, historical and present, and it is not very likely that
new time zones will appear that cannot be captured by heuristics —
though it is somewhat more likely that new rules that are not
captured by the current generation of heuristics will appear; in
that case, bugfixes would be required to accommodate the changed
situation.
The dst() method's utility (and in fact the isdst parameter in
TZif) is somewhat questionable to start with, as almost all the
useful information is contained in the utcoffset() and tzname()
methods, which are not subject to the same problems.

In short, maintaining a custom tzdb compiler or compiled package adds
maintenance burdens to both the CPython dev team and system
administrators, and its main benefit is to address a hypothetical
failure that would likely have minimal real world effects were it to
occur.

Including `tzdata` in the standard library by default

Although PEP 453^[11], which introduced the ensurepip mechanism to
CPython, provides a convenient template for a standard library module
maintained on PyPI, a potentially similar ensuretzdata mechanism is
somewhat less necessary, and would be complicated enough that it is
considered out of scope for this PEP.

Because the zoneinfo module is designed to use the system time zone
data wherever possible, the tzdata package is unnecessary (and may be
undesirable) on systems that deploy time zone data, and so it does not
seem critical to ship tzdata with CPython.

It is also not yet clear how these hybrid standard library / PyPI
modules should be updated (other than pip, which has a natural
mechanism for updates and notifications), and since it is not critical to
the operation of the module, it seems prudent to defer any such
proposal.

Support for leap seconds

In addition to time zone offset and name rules, the IANA time zone
database also provides a source of leap second data. This is deemed out
of scope because datetime.datetime currently has no support for leap
seconds, and the question of leap second data can be deferred until leap
second support is added.

The first-party tzdata package should ship the leap second data, even
if it is not used by the zoneinfo module.

Using a `pytz`-like interface

A pytz-like (^[12]) interface was proposed in PEP 431^[13], but was
ultimately withdrawn / rejected for lack of ambiguous datetime support.
PEP 495^[14] added the fold attribute to address this problem, but
fold obviates the need for pytz's non-standard tzinfo classes,
and so a pytz-like interface is no longer necessary.^[15]

The zoneinfo approach is more closely based on dateutil.tz, which
implemented support for fold (including a backport to older versions)
just before the release of Python 3.6.

Open Issues

Using the `datetime` module

One possible idea would be to add ZoneInfo to the datetime module,
rather than giving it its own separate module. In the current version of
the PEP, this has been resolved in favor of using a separate module, for
the reasons detailed below, but the use of a nested submodule
datetime.zoneinfo is also under consideration.

Arguments against putting `ZoneInfo` directly into `datetime`

The datetime module is already somewhat crowded, as it has many
classes with somewhat complex behavior — datetime.datetime,
datetime.date, datetime.time, datetime.timedelta,
datetime.timezone and datetime.tzinfo. The module's implementation
and documentation are already quite complicated, and it is probably
beneficial to try not to compound the problem if it can be helped.

The ZoneInfo class is also in some ways different from all the other
classes provided by datetime; the other classes are all intended to be
lean, simple data types, whereas the ZoneInfo class is more complex:
it is a parser for a specific format (TZif), a representation for the
information stored in that format and a mechanism to look up the
information in well-known locations in the system.

Finally, while it is true that someone who needs the zoneinfo module
also needs the datetime module, the reverse is not necessarily true:
many people will want to use datetime without zoneinfo. Considering
that zoneinfo will likely pull in additional, possibly more
heavy-weight standard library modules, it would be preferable to allow
the two to be imported separately — particularly if potential "tree
shaking" ^[16] distributions are in Python's future.

In the final analysis, it makes sense to keep zoneinfo a separate
module with a separate documentation page rather than to put its classes
and functions directly into datetime.

Using `datetime.zoneinfo` instead of `zoneinfo`

A more palatable configuration may be to nest zoneinfo as a module
under datetime, as datetime.zoneinfo.

Arguments in favor of this:

It neatly namespaces zoneinfo together with datetime
The timezone class is already in datetime, and it may seem
strange that some time zones are in datetime and others are in a
top-level module.
As mentioned earlier, importing zoneinfo necessarily requires
importing datetime, so it is no imposition to require importing
the parent module.

Arguments against this:

In order to avoid forcing all datetime users to import zoneinfo,
the zoneinfo module would need to be lazily imported, which means
that end-users would need to explicitly import datetime.zoneinfo
(as opposed to importing datetime and accessing the zoneinfo
attribute on the module). This is the way dateutil works (all
submodules are lazily imported), and it is a perennial source of
confusion for end users.

This confusing requirement from end-users can be avoided using a
module-level __getattr__ and __dir__ per PEP 562, but this would
add some complexity to the implementation of the datetime module.
This sort of behavior in modules or classes tends to confuse static
analysis tools, which may not be desirable for a library as
widely-used and critical as datetime.
Nesting the implementation under datetime would likely require
datetime to be reorganized from a single-file module
(datetime.py) to a directory with an __init__.py. This is a
minor concern, but the structure of the datetime module has been
stable for many years, and it would be preferable to avoid churn if
possible.

This concern could be alleviated by implementing zoneinfo as
_zoneinfo.py and importing it as zoneinfo from within
datetime, but this does not seem desirable from an aesthetic or
code organization standpoint, and it would preclude the version of
nesting where end users are required to explicitly import
datetime.zoneinfo.
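
The PEP 562 approach mentioned in the arguments above can be sketched on a toy module object. Everything here is a stand-in chosen so that the sketch runs anywhere: parent is a hypothetical module playing the role of datetime, and the standard library's json module plays the role of the lazily imported submodule.

```python
import importlib
import types

# Hypothetical parent module standing in for datetime.
parent = types.ModuleType("toy_datetime")
# Attribute name -> module to import on first access ("json" is a stand-in).
_LAZY_SUBMODULES = {"zoneinfo": "json"}


def _module_getattr(name):
    """PEP 562 hook: called only when normal attribute lookup fails."""
    target = _LAZY_SUBMODULES.get(name)
    if target is None:
        raise AttributeError(
            f"module {parent.__name__!r} has no attribute {name!r}"
        )
    module = importlib.import_module(target)
    setattr(parent, name, module)  # cache so the hook is not hit again
    return module


def _module_dir():
    # Advertise lazy attributes so that dir() and tab completion see them.
    return sorted(set(vars(parent)) | set(_LAZY_SUBMODULES))


parent.__getattr__ = _module_getattr
parent.__dir__ = _module_dir
```

Accessing parent.zoneinfo triggers the import on first use only, which is the kind of indirection that static analysis tools can find hard to follow.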

This PEP currently takes the position that on balance it would be best
to use a separate top-level zoneinfo module because the benefits of
nesting are not so great that they overwhelm the practical implementation
concerns, but this still requires some discussion.

Structure of the `PYTHONTZPATH` environment variable

This PEP proposes to use a single environment variable: PYTHONTZPATH.
This is based on the assumption that the majority of users who would
want to manipulate the time zone path would want to fully replace it
(e.g. "I know exactly where my time zone data is"), and other use
cases like prepending to the existing search path would be less common.

There are several other schemes that were considered and weakly
rejected:

Separate PYTHONTZPATH into two environment variables:
DEFAULT_PYTHONTZPATH and PYTHONTZPATH, where PYTHONTZPATH
would contain values to append (or prepend) to the default time zone
path, and DEFAULT_PYTHONTZPATH would replace the default time
zone path. This was rejected because it would likely lead to user
confusion if the primary use case is to replace rather than augment.
Adding either PYTHONTZPATH_PREPEND, PYTHONTZPATH_APPEND or both,
so that users can augment the search path on either end without
attempting to determine what the default time zone path is. This was
rejected as likely to be unnecessary, and because it could easily be
added in a backwards-compatible manner in future updates if there is
much demand for such a feature.
Use only the PYTHONTZPATH variable, but provide a custom special
value that represents the default time zone path, e.g.
<<DEFAULT_TZPATH>>, so that, for example,
PYTHONTZPATH=<<DEFAULT_TZPATH>>:/my/path could be used
to append /my/path to the end of the time zone path.

This was rejected mainly because this sort of special value is
not usually found in PATH-like variables, and it would be hard to
discover mistakes in your implementation.

One advantage to this scheme would be that it would add a natural
extension point for specifying non-file-based elements on the search
path, such as changing the priority of tzdata if it exists, or if
native support for TZDIST were to be added to the library in the
future.

Windows support via Microsoft's ICU API

Windows does not ship the time zone database as TZif files, but as of
Windows 10's 2017 Creators Update, Microsoft has provided an API for
interacting with the International Components for Unicode (ICU) project,
which includes an API for accessing time zone data sourced from
the IANA time zone database.

Providing bindings for this would allow for a mostly seamless
cross-platform experience for users on sufficiently recent versions of
Windows — even without falling back to the tzdata package.

This is a promising area, but is less mature than the remainder of the
proposal, and so there are several open issues with regard to Windows
support:

None of the popular third party time zone libraries provide support
for ICU (dateutil's native windows time zone support relies on
legacy time zones provided in the Windows Registry, which would be
unsuitable as a drop-in replacement for TZif files), so this would
need to be developed de novo in the standard library, rather than
first maturing in the third party ecosystem.
The most likely implementation for this would be to have TZPATH
default to empty on Windows and have a search path precedence of
TZPATH > ICU > tzdata, but this prevents end users from
forcing the use of tzdata by setting an empty TZPATH.

Two possible solutions for this are:
1. Add a mechanism to disable ICU globally independent of setting
  TZPATH.
2. Add a cross-platform mechanism to give tzdata the highest
  precedence.
This is not part of the reference implementation and it is uncertain
whether it can be ready and vetted in time for the Python 3.9
feature freeze. It is an open question whether a failure to
implement native Windows support in 3.9 should defer the release of
zoneinfo or if only the ICU-based Windows support should be
deferred.
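
The precedence problem described above can be illustrated with a toy
model. Everything in it is hypothetical: the function name, the boolean
flags standing in for "this source can supply the key", and the
prefer_tzdata switch, which is only a stand-in for the second possible
solution.

```python
def pick_source(tzpath_has_key, icu_has_key, tzdata_has_key, prefer_tzdata=False):
    """Return the name of the first data source that can supply a zone."""
    order = [
        ("TZPATH", tzpath_has_key),
        ("ICU", icu_has_key),
        ("tzdata", tzdata_has_key),
    ]
    if prefer_tzdata:
        # Hypothetical cross-platform switch: move tzdata to the front
        # (the sort is stable, so TZPATH still precedes ICU).
        order.sort(key=lambda item: item[0] != "tzdata")
    for name, available in order:
        if available:
            return name
    raise LookupError("no time zone data source has this key")


# An empty TZPATH does not force tzdata while ICU is available.
print(pick_source(False, True, True))                      # ICU
print(pick_source(False, True, True, prefer_tzdata=True))  # tzdata
```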

Footnotes

[a]

: The claim that the vast majority of users only want a few types of
time zone is based on anecdotal impressions rather than anything
remotely scientific. As one data point, dateutil provides many
time zone types, but user support mostly focuses on these three
types.

[b]

: The statement that identically constructed ZoneInfo objects should
be identical objects may be violated if the user deliberately clears
the time zone cache.

References

datetime.tzinfo documentation
https://docs.python.org/3/library/datetime.html#datetime.tzinfo
Wikipedia page for Tz database
https://en.wikipedia.org/wiki/Tz_database
RFC 6557: Procedures for Maintaining the Time Zone Database
https://tools.ietf.org/html/rfc6557
RFC 8536: The Time Zone Information Format (TZif)
https://tools.ietf.org/html/rfc8536
Paul Ganssle: "A curious case of non-transitive datetime
comparison" (Published 15 February 2018)
https://blog.ganssle.io/articles/2018/02/a-curious-case-datetimes.html
dateutil.tz https://dateutil.readthedocs.io/en/stable/tz.html
Code of Matt: "On the Timing of Time Zone Changes" (Matt
Johnson-Pint, 23 April 2016)
https://codeofmatt.com/on-the-timing-of-time-zone-changes/
importlib.resources documentation
https://docs.python.org/3/library/importlib.html#module-importlib.resources
pkgutil.get_data documentation
https://docs.python.org/3/library/pkgutil.html#pkgutil.get_data
tz mailing list: [PROPOSED] Support zi parsers that mishandle
negative DST offsets (Paul Eggert, 23 April 2018)
https://mm.icann.org/pipermail/tz/2018-April/026421.html
PEP 453: Explicit bootstrapping of pip in Python installations
https://www.python.org/dev/peps/pep-0453/
PEP 431: Time zone support improvements
https://www.python.org/dev/peps/pep-0431/
PEP 495: Local Time Disambiguation
https://www.python.org/dev/peps/pep-0495/
Paul Ganssle: "pytz: The Fastest Footgun in the West"
(Published 19 March 2018)
https://blog.ganssle.io/articles/2018/03/pytz-fastest-footgun.html
"Russell Keith-Magee: Python On Other Platforms" (15 May 2019,
Jesse Jiryu Davis)
https://pyfound.blogspot.com/2019/05/russell-keith-magee-python-on-other.html
RFC 7808: Time Zone Data Distribution Service
https://tools.ietf.org/html/rfc7808

encukou · February 25, 2020, 4:04pm

Would WeakValueDictionary be a more straightforward example?

This means print(zone) would print nothing, which is quite confusing. Why not fall back to <ZoneInfo object at 0x...>? ISTM that’s also always an invalid key.

As an employee of Red Hat, I feel obliged to point out that it’s spelled with a space, and that it’s a company, not a distro. Instead of “Red Hat Enterprise Linux” (or RHEL), I recommend giving Fedora as an example. (IMO new CPython features should target community distros.)

What’s the use case for PYTHONTZPATH_APPEND?

Either the signature is missing = None, or you can’t call it with no arguments.
Neither set_tzpath() nor set_tzpath(None) looks like it’s resetting the path. Have you considered a separate reset_tzpath() to make the effect clear?

To discourage setting global state, and to make tests or hotfixy workarounds more robust, can tzpaths also be an argument to ZoneInfo (with the same caching caveat as set_tzpath) or ZoneInfo.nocache?

pganssle · February 25, 2020, 4:46pm

Thanks for the quick feedback! Here are my responses (roughly ordered in terms of how complicated the response is):

Yes, agreed (that’s how it’s implemented in the reference implementation, too). I have added it as an example.

Done.

Fixed, thanks.

This I will have to think about more deeply, but I think it’s a very valid way to go, and I’m leaning towards changing it to this behavior.

To be honest, it’s been a while since we came up with this scheme and I am not entirely sure what my justification for selecting “APPEND” only was. I think that it was by analogy to PYTHONPATH, which always appends to the search path. The initial discussion happened in this dateutil PR, and I did a Twitter poll where 4 people said they would want to “augment” the path and 8 people said they would want to replace it.

As you can see from the “Open issues” section, I am quite ambivalent about the whole thing, but if I were to take a stab at possible reasons why people would want to mess with their time zone path, I’d say:

Replace: You would want to do this for testing purposes (I would use it to test against master of the time zone database, for example). You would also want to use this if you deploy your time zones somewhere non-standard but you are not compiling your own Python (and thus can’t use the compile-time argument).
Prepend: You have deployed some time zones somewhere and would like to preferentially use them, but if a zone is missing or something you’d like to fall back to the standard search path.
Append: You have deployed your own custom time zones not in the IANA database, for your own purposes (again possibly testing purposes), and you’d like this to be the fallback location to look.

With more consideration, I am thinking that option 2 is more likely to be a reasonable use case than option 3, though neither of those seems terribly likely to be useful.

I am hesitant to say that these would be completely useless, but maybe these things are so unlikely to be useful that if you want them, it will be sufficient to do: PYTHONTZPATH=/my/path:$(python -c "import os; import zoneinfo; print(os.pathsep.join(zoneinfo.TZPATH))")?
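
For reference, the same prepend can also be done from inside Python. This sketch assumes the API as it eventually shipped in Python 3.9, where the function is spelled zoneinfo.reset_tzpath(to=...) and zoneinfo.TZPATH is a tuple; the extra directory is only a placeholder.

```python
import tempfile
import zoneinfo

# Placeholder directory: any absolute path is accepted, and it does not
# need to contain any time zone data.
extra_dir = tempfile.gettempdir()

original = zoneinfo.TZPATH
# Prepend: search extra_dir first, then fall back to the previous path.
zoneinfo.reset_tzpath(to=(extra_dir, *original))
print(zoneinfo.TZPATH[0] == extra_dir)

# With no argument, the path is recomputed from PYTHONTZPATH or the
# compile-time default.
zoneinfo.reset_tzpath()
```

Relative paths are rejected by reset_tzpath, so the entries passed in must be absolute.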

Yeah, that is a good point. I think I originally conceived of this as an analogy to time.tzset, which actually resets the path.

Alternatively, I could rename set_tzpath(tzpaths=None) to something like reset_tzpath(to=DEFAULT).

Yes, this was actually my original design and I think it’s still on the table, but I went with the global state because a per-call tzpath argument complicated the implementation and semantics of the cache.

Assuming we went with a design like ZoneInfo.from_key(key, *, tzpath=None), the issue would be that there are three options for how the cache would work, all unpalatable:

keep track of a per-tzpath cache - this would be complicated to implement and there are a lot of issues with getting the semantics of that right.
passing tzpath would necessarily mean that you are bypassing the cache
passing tzpath uses the global cache, which means that sometimes ZoneInfo.from_key would use the specified tzpath and sometimes it wouldn’t (and also that using ZoneInfo.from_key could “pollute” your normal cache).

The most common use case I imagine for this sort of feature would be if you want to force your ZoneInfo calls to use tzdata globally for some reason - either for testing purposes or because for some reason you are prevented from using the environment variables. In that case, you would need to modify the constructor calls for anything that uses ZoneInfo, and in some of these options (notably #2), it would have an effect on the semantics of the operations!

Belopolsky · February 25, 2020, 5:13pm

When presenting rejected ideas in a PEP, it is customary to provide links to the discussion(s) that led to a rejection.

pganssle · February 25, 2020, 5:33pm

Yes, I think I may have jumped the gun a bit on the “rejected ideas” sections. These were ideas that I personally considered and rejected, and I wanted to document my rationale, not necessarily ideas that were discussed and rejected.

In this case, though, I believe we discussed this at the Language Summit and my assessment of the consensus was that some people cared very strongly about using system-deployed time zone data (possibly @tiran?), which precludes the use of a custom zic parser as the primary source of data. We had a very brief interchange about this on datetime-SIG (see thread start with your response, my response to that - they were split into “two threads” by the MM3 migration I guess), but I do not think there’s any written record of such discussions. Hopefully I have summarized the relevant points adequately, though.

One thing to note that I did not put in the PEP because I am not yet sure if it is a viable possibility, but it is now increasingly common (though still somewhat uncommon) to ship tzdata.zi, which is a text format that does contain the relevant offset information. There are a few issues with using it, including:

It may be missing
Depending on build options, the format may be “vanguard” or “rearguard”, and I think it may be a less stable format than TZif (which is very concerned with backwards compatibility)
The file contains all the time zones, and it may require parsing or reading the entire thing or a significant fraction of it every time you want to construct a time zone from it.

I think it’s mostly out of scope for this PEP (though we should probably make sure nothing in this PEP explicitly is incompatible with future enhancements in this way), but I was thinking that it might be a reasonable fallback for situations where we detect that something unusual has occurred - e.g. dst() is 0 but isdst=1, or a shift in offset occurs.

Belopolsky · February 25, 2020, 7:19pm

Does this mean that a pickle that contains a serialized instance of aware datetime will include potentially kilobytes of transition data?

Belopolsky · February 25, 2020, 7:21pm

I think this option should be discussed in detail.

pganssle · February 25, 2020, 8:30pm

Yes. This is already the case with dateutil time zones:

from dateutil import tz
from datetime import datetime

import pickle

dt = datetime.now(tz.gettz("America/New_York"))
print(len(pickle.dumps(dt))) # 3539

I don’t have much in the way of use cases for pickle, but I think people would prefer a version of this that always works to a version that has a slimmer package, particularly because it seems that since there’s only one ZoneInfo instance per value of key, pickle will be able to include it by reference, so:

dts = [datetime.now(tz.gettz("America/New_York"))
       for _ in range(100)]
print(len(pickle.dumps(dts))) # 5622

So it will be a few kb per zone included, not a few kb per datetime.
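
The by-reference behavior comes from pickle's memo, and it can be seen without dateutil. In this sketch a plain tuple stands in for a zone that is pickled by value; the 500-element payload is an arbitrary assumption, not real transition data.

```python
import pickle
from datetime import datetime

# Hypothetical stand-in for a zone pickled "by value": key plus bulky data.
zone = ("America/New_York", tuple(range(500)))
stamp = datetime(2020, 2, 25, 15, 30)


def pickled_size(obj):
    return len(pickle.dumps(obj))


# One (datetime, zone) pair.
one = pickled_size([(stamp, zone)])

# 100 pairs that all point at the *same* zone object: the zone is written
# once and every later occurrence is a short memo reference.
shared = pickled_size([(stamp, zone) for _ in range(100)])

# 100 pairs that each carry an equal but distinct copy of the zone data.
copies = pickled_size(
    [(stamp, (zone[0], tuple(list(zone[1])))) for _ in range(100)]
)

print(one, shared, copies)
```

The shared case grows by only a few bytes per extra datetime, while the copied case grows by roughly the full payload each time.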

Do you mean you think my arguments for rejecting it should be included in the PEP, or that we should discuss it in detail as a potential option for the initial version of the module?

I will also mention that argument 2 in the “why reject a custom tzdb parser” section, which is basically “Java has their own tzdb parser and it’s caused them lots of problems” is compounded when we’re now talking about parsing system-deployed .zi files at runtime.

It seemed hard to maintain when it was “let’s parse it at build time and ship a parsed package”, where we at least have the ability to provide a uniform experience for most users of Python (e.g. old versions of Python won’t stop working if the raw format changes), but parsing a potentially less-stable format that we don’t control seems like it could become a significant maintenance burden pretty easily.

I also think that, as of the moment, the problem this purports to fix is something of a non-problem (note that the implementation of local time zone support also relies on heuristics based on “this has never happened in the tzdb yet” - the way that fold is inferred by looking forwards and backwards by 1 day makes assumptions about the size of the fold and about the spacing between folds).

I also think that “DST” is a bit under-defined anyway, and I would argue not a useful piece of information to want to know about a datetime (at least partially for that reason). Imagine, for example, several time zones:

A time zone whose standard offset is +1 and in the winter they shift over to using +0 (negative DST)
A time zone where the offset shifts by 30 minutes every quarter: 0, +30, +1, +30, 0 - which one is the “standard” offset?
A “permanent DST” zone that never shifts but legally is referred to as daylight saving time (e.g. if New York were to use EDT year round - functionally equivalent to switching to Atlantic Standard Time).
A “permanent DST” zone that is called daylight saving time by most people, but legally speaking is permanently standard time (e.g. California shifts over its base offset by one hour and calls it standard time to comply with federal laws requiring that you must either observe the US DST transition times or not observe DST, but still calls the “standard” offset “Pacific Daylight Time”).

In all of these cases, what you mean by “is the zone DST” is “What do people think of as the ‘standard’ offset”? And it seems like this is a very obscure piece of information to want as part of a computer program. Usually what people want to know is something about how to display the time zone information, which they should use .tzname() for, or they want to know something about the offset from UTC, which they should use .utcoffset() for.
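
As a small illustration of the first case, here is a toy tzinfo with “negative DST”. The class, its month-based rule and its zone names are made up for the example; the point is only that utcoffset() and tzname() stay unambiguous while dst() depends on which offset you decide to call “standard”.

```python
from datetime import datetime, timedelta, tzinfo


class NegativeDstZone(tzinfo):
    """Toy zone: +01:00 is 'standard', and winter shifts back to +00:00."""

    @staticmethod
    def _is_winter(dt):
        # Crude month-based rule, purely for illustration.
        return dt.month in (11, 12, 1, 2, 3)

    def utcoffset(self, dt):
        return timedelta(0) if self._is_winter(dt) else timedelta(hours=1)

    def dst(self, dt):
        # "Negative DST": the winter shift is relative to standard time.
        return timedelta(hours=-1) if self._is_winter(dt) else timedelta(0)

    def tzname(self, dt):
        return "WINTER" if self._is_winter(dt) else "STANDARD"


zone = NegativeDstZone()
for month in (1, 7):
    dt = datetime(2020, month, 15, 12, 0, tzinfo=zone)
    print(dt.tzname(), dt.utcoffset(), dt.dst())
```

Swapping which half of the year is labeled “standard” would change every dst() value without changing a single utcoffset() or tzname() result.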

So, basically my contention is that the .dst() method is full of dangerous edge cases anyway, so even if we weren’t able to set this value to what we more or less expect it to be in all cases, I can’t see this bug being a primary source of practical negative consequences for real use cases, but maybe someone else has an example of something that will go very wrong if dst() returns the wrong value?

guido · February 26, 2020, 3:17pm

I have some questions about the cache. You describe how ZoneInfo(key) must use a cache (I think about the only freedom is whether it’s a plain dict or a weak value dict – it can’t even be an LRU cache since evictions would break the required semantics). You also describe how ZoneInfo.nocache(key) returns a new object each time (not consulting the cache).

What’s not 100% crystal clear is whether the new object created by ZoneInfo.nocache() is entered into the cache. Not doing this seems to make the most sense, but it’s not explicit.
Ditto for ZoneInfo.from_file().
Perhaps more importantly, I found no mention of the cache in the section about pickling. I presume that the most common case is that a pickled ZoneInfo object is in fact identical to one with the same key read from the current tz database (this will be the case e.g. if pickles are used for RPC within one host). But I can easily see an application mixing locally-created datetime objects with ones received from a pickle, and those wouldn’t be comparable because the tzinfo objects would have different identities. Ditto for timezones unpickled from different pickles – the nice identical-object caching used by pickle doesn’t work across pickles (for obvious reasons). And because you don’t guarantee that an unpickled ZoneInfo object contains the same information as one created locally from the same key (the latter being what’s in the cache) you can’t enter unpickled ZoneInfo objects in the cache either.

For the latter issue, I don’t see an easy way out other than adopting your rejected proposal of pickling only the key as state. I understand that this means sometimes unpickling will fail (in particular if there’s no key or if the key doesn’t exist in the tz database where it is being unpickled). I see this as little different from unpickling some pickle that contains a reference to a class or function (which is represented as the fully qualified name of its definition) if there is no corresponding definition.

In fact, this points towards a reasonable mental model for ZoneInfo objects as similar to class or function definitions. ZoneInfo objects created bypassing the cache (using nocache() or from_file()) are similar to dynamically created classes or functions – these don’t have a global name and cannot be pickled.

Thoughts?

pganssle · February 26, 2020, 5:21pm

I posted this thread to the tz@iana.org mailing list to gather feedback from the IANA time zone maintainers, and I thought I would forward on some of the comments from Paul Eggert’s response, along with my responses:

I am going to update this to clarify, but I think this is mostly covered by the caching behavior described in the section on constructors, once I make explicit the assumption that the full time zone data must be eagerly read into memory at construction (rather than being implemented in terms of system calls or something of that nature). With that assumption in place, the answer is that the data is updated whenever a cache miss occurs - the first time any given zone is constructed or, depending on the implementation, the first time it is constructed after a previous version has been ejected from the cache (in the reference implementation, we use a “strong” LRU cache of 8 zones and an unbounded “weak” cache, so if you construct 9 zones and hold no references to any of them, constructing the first one again will be a cache miss, and constructing any of the other 8 will be a cache hit).
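
A minimal sketch of that two-tier arrangement, for illustration only. The class names are hypothetical and this is not the reference implementation; FakeZone stands in for an object whose data was read from disk.

```python
import gc
import weakref
from collections import OrderedDict


class FakeZone:
    """Hypothetical stand-in for a zone whose data was read from disk."""

    def __init__(self, key):
        self.key = key


class TwoTierCache:
    """Small strong LRU cache layered over an unbounded weak cache."""

    def __init__(self, strong_size=8):
        self._strong = OrderedDict()
        self._weak = weakref.WeakValueDictionary()
        self._strong_size = strong_size
        self.misses = 0

    def get(self, key):
        zone = self._weak.get(key)
        if zone is None:
            # Cache miss: the only point where data would be (re)read.
            self.misses += 1
            zone = FakeZone(key)
            self._weak[key] = zone
        # Mark as most recently used, then drop the oldest strong reference.
        self._strong[key] = zone
        self._strong.move_to_end(key)
        if len(self._strong) > self._strong_size:
            self._strong.popitem(last=False)
        return zone


cache = TwoTierCache()
for i in range(9):
    cache.get(f"Zone/{i}")  # results are discarded: no outside references
gc.collect()                # not needed with CPython's reference counting
cache.get("Zone/8")         # still held by the strong cache: a hit
cache.get("Zone/0")         # evicted and collected: a miss
print(cache.misses)         # 10
```

A zone that falls out of the strong cache but is still referenced somewhere else stays in the weak cache, so constructing it again returns the same object.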

This does mean that if you call ZoneInfo("America/New_York") when your installed zoneinfo data is 2019c and then you upgrade to 2020a and call ZoneInfo("US/Eastern"), the two objects may have different behaviors, but I think this is mainly unavoidable without a pretty significant performance impact.

I have made some minor changes to the wording of the constructors text and added a section to clarify this.

Beyond the fact that I plan to ship non-“zone” files in the tzdata fallback package (and thus include the leap seconds), leap seconds are out of scope for this proposal. Python’s datetime type has no support for leap seconds currently, and other than being tracked in the same database, I think they’re at least somewhat orthogonal to the primary problem we’re solving here (a tzinfo implementation).

Leap second support is on my long list of improvements for the datetime module, so I’ll probably get around to it at some point in the future.

I have added a subsection on leap seconds to the “Rejected ideas” section.

Yes, I will have to look into this. My main concern is that my hope is to try to use a time zone data source that can be managed at the system level, independent of language stack. I will admit to never having looked into the details, but I was under the impression that tzdist was something that the system would consume, rather than individual programs, is that wrong?

I also am not clear - are there public tzdist servers, or is the suggestion that we would have a Python-specific tzdist service and end users would subscribe to it for updates?

I’m mainly asking because I decided early on (on some very good advice) that effectively distributing the data is a big enough task on its own that it would bog down the initial implementation to try and handle both at once, so my goal with this is to get something that will work if you have the data, and provide a reasonable way to get the data and handle the data distribution in a separate proposal. If tzdist is consistent with a backwards compatible upgrade from a version using TZif files at some point in the future, I’m happy to put it off as, “We should look into this when we try to solve the distribution problem.” It sorta seems like it should be possible to seamlessly transition from system files to tzdist (at least depending on how strong our promises are about the tz search path, anyway).

Note: This is an open action item and I am waiting for either a response or to do a bit more digging and get the answers to this question, but I suspect that we will want to hold off on TZDIST until a later PEP.

Additionally, Matt Johnson-Pint (who works at Microsoft, though he gave me no indication that he was speaking as a representative of Microsoft) pointed me at the new ICU Time Zone API in Windows 10, and so I’ve removed the rejected “Windows native support” and added a new section under “Open issues” detailing a path forward on Windows and the remaining open questions there.

pganssle · February 26, 2020, 5:59pm

Good point, you were right on in what I was planning - I’ve made it explicit.

This issue of the semantics of datetimes recovered from pickles is a very good point, and not one I had thought of, but you are definitely right that it poses a major problem. I am inclined to agree that this makes for a strong case in favor of pickling only the key and expecting it to be reconstructed on deserialization.

To play devil’s advocate, one possible option would be to have the serialization behavior remain the same (all transition information is serialized along with the key name), but to have deserialization go through the cache: if the key is in the cache, use the existing object rather than one built from a pickle, otherwise populate the cache with the unpickled object.

I personally feel like the behavior at that point is getting a lot harder to keep track of, though, and I’d rather just go with serializing the key. The one thing I’m hesitant about is this:

I don’t love the idea of .nocache()-constructed ZoneInfo instances being unpicklable, because they do have a valid key. One possible way around this would be for nocache time zones to carry a nocache flag or something, so that they can be serialized and deserialized by key, but the deserialized objects maintain the same relative semantics.

For from_file() I’m somewhat more comfortable having those throw an error on pickling, though there is still the option to have the ones that have been passed an explicit key value serialize by key. It would not be terribly difficult to roll a ZoneInfo wrapper that uses custom files but serializes by key, and any such use case would necessarily be obscure.

One last concern before I go all in on the “serialize by key” mechanism: I intend for these things to be opaque data structures, so none of the transition data or even the location of the file that the transitions were read from will be exposed to the end user. This introduces an asymmetry between the two options because the end user can create a simple function to serialize these things by key:

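from datetime import datetime
from typing import Optional, Tuple
from zoneinfo import ZoneInfo

# A naive datetime paired with the zone key (None if there is no ZoneInfo).
KeyedDatetime = Tuple[datetime, Optional[str]]

def to_keyed_datetime(dt: datetime) -> KeyedDatetime:
    # str() of a ZoneInfo is its key, so only the key is serialized.
    if isinstance(dt.tzinfo, ZoneInfo):
        return (dt.replace(tzinfo=None), str(dt.tzinfo))
    return (dt, None)

def from_keyed_datetime(keyed_dt: KeyedDatetime) -> datetime:
    dt, key = keyed_dt
    # Rebuild the zone from the local database on deserialization.
    if key is not None:
        dt = dt.replace(tzinfo=ZoneInfo(key))
    return dt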

But using the serialize-by-key method, it’s not possible for end users to manually get the other behavior, so we are essentially foreclosing that option for them.

guido · February 26, 2020, 11:48pm

I don’t know exactly what the use cases for nocache and from_file are, so it’s hard to know whether it’d ever be a problem if these were unpicklable.

I wonder if you could have an opaque RawZoneInfo object that behaves like a nocache ZoneInfo and is pickled by value, and have regular ZoneInfo be a very thin wrapper for that (maybe a subclass with no extra fields) but with by-key pickle behavior?

EpicWink · February 27, 2020, 1:44am

Note

The implementation may decide how to implement the cache behavior, but the guarantee made here only requires that as long as two references exist to the result of identical constructor calls, they must be references to the same object. This is consistent with a reference counted cache where ZoneInfo objects are ejected when no references to them exist — it is allowed but not required or recommended to implement this with a “strong” cache, where all ZoneInfo files are kept alive indefinitely.
source

This can’t be true if the database is updated between subsequent calls to the constructor with the same arguments, right?

Would it be better for the interface to have a function to get a ZoneInfo instance, retrieving from cache or otherwise creating, similar to the logging module? i.e.

>>> tz1 = ZoneInfo("Australia/Brisbane")
>>> tz2 = get_zone_info("Australia/Brisbane")
>>> tz3 = get_zone_info("Australia/Brisbane")
>>> tz4 = ZoneInfo("Australia/Brisbane")
>>> tz1 is tz2
False
>>> tz2 is tz3
True
>>> tz1 == tz2 == tz3 == tz4
True

This would separate cache-control from the data class ZoneInfo to the module or a manager instance, allowing for easier user-extensibility of either.
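
A rough sketch of that split, with hypothetical names: PlainZone stands in for a ZoneInfo class with no built-in cache, and get_zone_info plays the role that logging.getLogger plays for loggers.

```python
_zones = {}  # module-level registry, similar to logging's logger registry


class PlainZone:
    """Hypothetical cache-free zone class; equality is by key."""

    def __init__(self, key):
        self.key = key

    def __eq__(self, other):
        if not isinstance(other, PlainZone):
            return NotImplemented
        return self.key == other.key

    def __hash__(self):
        return hash(self.key)


def get_zone_info(key):
    """Return the shared PlainZone for key, creating it on first use."""
    try:
        return _zones[key]
    except KeyError:
        zone = _zones[key] = PlainZone(key)
        return zone


tz1 = PlainZone("Australia/Brisbane")
tz2 = get_zone_info("Australia/Brisbane")
tz3 = get_zone_info("Australia/Brisbane")
print(tz1 is tz2, tz2 is tz3, tz1 == tz2 == tz3)  # False True True
```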

pganssle · February 27, 2020, 3:57am

It can be, this is how the reference implementation does it, and it’s how dateutil does it. Here’s the implementation of __new__. The database is never consulted except in the case of a cache miss. I clarify that a bit in this PR to the PEP.

In the end, always getting “the latest data” is fraught with edge cases anyway, and the fact that datetime semantics rely on object identity rather than object equality just adds to the edge cases that are possible.

I will note that there is some precedent in this very area: local time information is only updated in response to a call to time.tzset(), and even that doesn’t work on Windows. The equivalent to calling time.tzset() to get updated time zone information would be calling ZoneInfo.clear_cache() to force ZoneInfo to use the updated data (or to always bypass the main constructor and use the .nocache() constructor).

This is partially how dateutil does it, though the main reason dateutil does it is because tz.gettz() takes any kind of string and returns a time zone from it, so tz.gettz("GMT0BST") will return a tz.tzstr, tz.gettz("Europe/London") will return a tzfile, and tz.gettz() will return local time.

I’d be more open to it if we felt that there was some possibility that we wanted the primary interface to be something that might return any number of types, but I am not convinced of the utility of this function. People mostly know what kind of time zone they want to construct and are happy to select the right type, and in fact it leads to problems when they directly use the tz.tzfile constructor (which uses gettz() for caching).

What I like about ZoneInfo using the cache directly and having the functions bypassing the cache be the more obscure alternate constructors is that most of the time users would want this operation cached - it is much faster, it will make comparison operations cheaper and more consistent and you won’t run into obscure bugs like the one detailed in this blog post.

stub42 · February 27, 2020, 8:47am

Hi. First, thanks for working on this. I’ve managed to put off similar work for about a decade now. I look forward to being able to deprecate pytz, making it a thin wrapper around the standard library when run with a supported Python. This kind of needs to happen before 2038, as pytz dates from before the newest tzfile format and does not handle the later timestamps.

On the serialization section, what is really being discussed is the difference between timestamps (fixed instances in time), and wallclock times (time in a location, subject to changes made by politicians, bureaucrats and religious leaders). If I serialize ‘2022-06-05 14:00 Europe/Berlin’ today, and deserialize it in two years’ time after Berlin has ratified EU recommendations and abolished DST, then there are two possible results. If my application requires calendaring semantics, when deserializing I want to apply the current timezone definition, and my appointment at 2pm in Berlin is still at 2pm in Berlin, because I need wallclock time (the time a clock hung on the wall in that location should show). If I wanted a fixed timestamp, best practice is to convert it to UTC to avoid all the potential traps, but it would also be ok to deserialize the time using the old, incorrect offset it was stored with and end up with 1pm wallclock time.

The PEP specifies that datetimes get serialized with all transition data. That seems unnecessary, as the transition data is reasonably likely to be wrong when it is de-serialized, and I can’t think of any use cases where you want to continue using the wrong data. To deserialize a local timestamp as a fixed point in time, you only need the local timestamp and the offset. Perpetuating the use of wrong data is going to end up with all sorts of problems and confusion, where you will end up with several versions of a timezone each with unique rules and giving different results. At some point, you are going to need to convert the data using the old timezone rules to the current timezone rules, which seems to be exactly the sort of problem we had with the pytz API. Failing to normalize the datetime will cause apps to spit out nonsense to end users, such as timestamps that no longer exist (skipped over by new DST transition rules), or ordering issues (wallclock timestamps using old rules compare earlier or later than wallclock timestamps using current rules).

I think better options are to either serialize as a) wallclock time (datetime + zoneinfo key), or b) local timestamp (datetime + offset + optional zoneinfo key), or c) UTC timestamp (utcdatetime + optional offset + optional zoneinfo key). Even if this means special casing custom zoneinfo datafiles, which I suspect will be rare or non-existent outside of the Python test suite.

While a) is often what you want for calendaring applications (and what you get with pytz), it could cause problems in general use because there is no fixed ordering. Data structures will fail if they rely on stable ordering of local timestamps, and I can’t see a way of forcing people to use fixed timestamps instead of wall time beyond hoping they read the documentation.

b) & c) store fixed timestamps, and let you round trip if all three components are included. With b) a question needing to be answered is if the fixed timestamp is corrected when deserialized (if the offset doesn’t match the current zoneinfo rules, it can be adjusted), or if the current zoneinfo rules only take effect when arithmetic starts happening, i.e. is ‘repr(unpickle(d)) == repr(unpickle(d) + timedelta(0))’ true? With c) timestamps would be adjusted to current rules when deserialized.
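
To make the three shapes concrete, here is a small sketch of what each option would store. The function names and tuple layouts are hypothetical, and a fixed +02:00 offset stands in for Europe/Berlin in summer so that the example runs without any time zone database.

```python
from datetime import datetime, timedelta, timezone


def as_wallclock(dt, key):
    """Option a): naive wall time plus the zone key."""
    return (dt.replace(tzinfo=None).isoformat(), key)


def as_local_timestamp(dt, key=None):
    """Option b): naive wall time plus the UTC offset in seconds."""
    offset = int(dt.utcoffset().total_seconds())
    return (dt.replace(tzinfo=None).isoformat(), offset, key)


def as_utc_timestamp(dt, key=None):
    """Option c): UTC time, with the original offset and key as extras."""
    offset = int(dt.utcoffset().total_seconds())
    utc = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return (utc.isoformat(), offset, key)


# Assumed stand-in for Europe/Berlin during summer time.
berlin_summer = timezone(timedelta(hours=2), "CEST")
dt = datetime(2022, 6, 5, 14, 0, tzinfo=berlin_summer)

print(as_wallclock(dt, "Europe/Berlin"))
print(as_local_timestamp(dt, "Europe/Berlin"))
print(as_utc_timestamp(dt, "Europe/Berlin"))
```

Only a) leaves the instant open to reinterpretation under new rules; b) and c) pin it down, and the key is kept purely so the zone can be reattached.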

For comparison, PostgreSQL went with c). Storing a ‘timestamp with timezone’ just stores a UTC timestamp, and information about the source timezone and offset is lost. See https://www.postgresql.org/docs/12/datatype-datetime.html#DATATYPE-TIMEZONES

This all affects how ZoneInfo.nocache and arithmetic work too. As proposed, we can have multiple Europe/Berlin ZoneInfo with different rules. They are sticky, so a datetime referencing an obsolete ZoneInfo is going to keep doing calculations using the obsolete rules. I’m thinking that it would be better if ZoneInfo.nocache would replace the existing cached version, flagging the existing cached version as expired. Existing datetime instances would be unaffected, as their tzinfo would still reference the obsolete ZoneInfo data. But arithmetic would notice the ZoneInfo has been superseded and the result would be using the latest ZoneInfo.

pganssle · February 27, 2020, 5:25pm

In some ways, the debate over the proper serialization format is a strange one, because pickle is by its nature an ephemeral serialization format - you will have many problems if you try and use it to serialize between dissimilar environments, so in some ways I’m inclined to neglect the case where the data changes between serialization and deserialization anyway.

In the end, I am not sure what most users would want. I think @stub42 makes some solid points about thinking of it in terms of serializing civil times (I had neglected this because usually I only think of that problem when storing dates for the long term, but it’s a valid one, particularly with the cache behavior).

To me, the strongest argument in favor of serializing “by value” rather than “by reference” is that if we go with serializing “by value”, end users on either end of the equation have the option of getting the “by reference” behavior on their own, whereas if we go with “by reference”, end users can’t implement a “by value” solution on their own. I actually really like the idea of doing something like @guido’s solution:

Exposing some interface to get the raw data (RawZoneInfo base class seems like the most natural) would basically alleviate my qualms about this entirely, and give a nice, reasonable default behavior for everyone else.

Another option here is to just go with serialization by key in Python 3.9 and if there’s a lot of demand for the feature we can add RawZoneInfo in a later version. Doing so should be backwards-compatible. For people stuck in the middle, dateutil.tz.tzfile will keep the “serialize by value” behavior it’s always had, and people who need that can use dateutil as a stop gap.

So here’s my proposal for what to do with pickling:

Normally-constructed ZoneInfo objects are serialized by reference to their key parameter.
ZoneInfo.nocache objects are also serialized by reference to their key parameter, but with a flag indicating that they are not drawn from the cache, so they will bypass the cache in the deserialization step.
ZoneInfo.from_file objects cannot be pickled. (End users can write wrapper types if they want to serialize them by key).
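
A minimal sketch of how those three rules could be wired up through `__reduce__`, using a toy class rather than the real ZoneInfo. Everything here (KeyedZone, cached, _unpickle, the plain dict cache) is a hypothetical stand-in chosen for illustration and is not taken from the reference implementation:

```python
import pickle


class KeyedZone:
    """Toy stand-in for ZoneInfo: holds a key only, no transition data."""

    _cache = {}  # a plain dict keeps the sketch short

    def __init__(self, key, from_cache):
        self.key = key
        self._from_cache = from_cache

    @classmethod
    def cached(cls, key):
        # Plays the role of the primary constructor: one shared object per key.
        if key not in cls._cache:
            cls._cache[key] = cls(key, from_cache=True)
        return cls._cache[key]

    @classmethod
    def nocache(cls, key):
        # Always builds a fresh object and never touches the cache.
        return cls(key, from_cache=False)

    @classmethod
    def from_file(cls, fobj):
        # No key is known, so there is nothing to serialize by reference.
        return cls(None, from_cache=False)

    @classmethod
    def _unpickle(cls, key, from_cache):
        # Deserialization goes back through the matching constructor.
        return cls.cached(key) if from_cache else cls.nocache(key)

    def __reduce__(self):
        if self.key is None:
            raise pickle.PicklingError("file-based zones have no key to pickle")
        # Only the key and the cache flag are serialized, never the data.
        return (self.__class__._unpickle, (self.key, self._from_cache))
```

With something like this in place, `pickle.loads(pickle.dumps(zone))` would be expected to hand back the cached object for a cached zone, a fresh object for a nocache zone, and to raise for a file-based zone.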

guido · February 27, 2020, 5:42pm

That sounds very reasonable and removes my objections. I agree with @stub42’s analysis that users will most likely expect to be transmitting civil times. The "Asia/Qostanay" problem will be no different or worse than other problems caused by pickling.

I think we already have a serialization format for the full transition data: the “zone file” (such as passed to from_file()). Methinks only people engaged in maintaining zone info databases will be interested in that.

pganssle · February 27, 2020, 6:03pm

Are you suggesting that ZoneInfo.nocache(key) should unconditionally eject key from the cache, or are you suggesting that it should eject key from the cache only when the data is updated? The second would make nocache(key) a much more expensive operation than I’d like for a cache-bypassing constructor, and the first would mean that it would unconditionally mutate global state.

I was thinking that .nocache() would be a safe way to get a time zone that will always have “compare / subtract in UTC” behavior, or to deliberately induce a cache miss for a single call. I would want you to be able to use it safely alongside the primary constructor, which is a way to always get “compare / subtract in civil time” behavior when using the same nominal time zone.
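
To make the two behaviors concrete, here is a small sketch using a toy tzinfo subclass instead of a real ZoneInfo, so it runs without any time zone data. ToyEastern and its single hard-coded transition are assumptions for illustration only. It relies on the documented datetime rule that subtraction ignores the offsets when both operands share the same tzinfo object, and converts to UTC first when they do not:

```python
from datetime import datetime, timedelta, tzinfo


class ToyEastern(tzinfo):
    """Hypothetical zone: UTC-5 before 2020-03-08 02:00 local, UTC-4 after."""

    _switch = datetime(2020, 3, 8, 2)  # naive local wall time of the change

    def utcoffset(self, dt):
        hours = -5 if dt.replace(tzinfo=None) < self._switch else -4
        return timedelta(hours=hours)

    def dst(self, dt):
        return self.utcoffset(dt) + timedelta(hours=5)

    def tzname(self, dt):
        return "EST" if self.dst(dt) == timedelta(0) else "EDT"


shared = ToyEastern()    # plays the role of the cached object
separate = ToyEastern()  # plays the role of a .nocache() object

before = datetime(2020, 3, 7, 12, tzinfo=shared)
after_shared = datetime(2020, 3, 8, 12, tzinfo=shared)
after_separate = datetime(2020, 3, 8, 12, tzinfo=separate)

# Same tzinfo object: wall-clock difference, the offset change is ignored.
civil_delta = after_shared - before

# Different tzinfo objects: both sides are converted to UTC first.
utc_delta = after_separate - before
```

Here civil_delta comes out as 24 hours and utc_delta as 23 hours, even though both zone objects follow identical rules.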

There’s still the clear_cache() function, which is in the reference implementation but not described in the PEP. Right now it takes no arguments and clears the cache entirely. Another option would be to change it to something like this (simplified by ignoring thread-safety and ignoring the fact that there are two caches, not just one):

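import weakref

@classmethod
def clear_cache(cls, *keys):
    if keys:
        # Evict only the named keys; keys that are not cached are ignored.
        for key in keys:
            cls._cache.pop(key, None)
    else:
        # No keys given: drop the whole cache.
        cls._cache = weakref.WeakValueDictionary()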

This would allow them to decide independently if they want to invalidate the cache for a given key (and they can always create a wrapper around .nocache that does this).

I think the only issue is that there’s no easy way to tell when a cached ZoneInfo has gone stale, that is, when its transition data no longer matches what is on disk. One option to allow this would be to provide some comparison function like .equivalent_to(self, other) that indicates when two ZoneInfo objects have all the same transition information, so that someone could write an auto-invalidating zoneinfo factory like so:

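def latest_zone_info(key):
    new_zoneinfo = ZoneInfo.nocache(key)  # freshly loaded, bypasses the cache
    zi = ZoneInfo(key)                    # whatever the cache currently holds

    # Evict the cached entry only when the fresh data differs from it.
    if not new_zoneinfo.equivalent_to(zi):
        ZoneInfo.clear_cache(key)

    return ZoneInfo(key)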
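
For anyone who wants to try the idea without the reference implementation, here is a self-contained sketch with a toy class. ToyZone, its in-memory _database, and the cached() constructor are all hypothetical stand-ins; equivalent_to is the comparison function floated above, not an existing API:

```python
class ToyZone:
    """Toy stand-in for ZoneInfo backed by an in-memory 'database'."""

    _cache = {}
    # Hypothetical stand-in for the on-disk data: key -> tuple of offsets.
    _database = {"Europe/Berlin": (3600, 7200)}

    def __init__(self, key):
        self.key = key
        self.transitions = self._database[key]  # snapshot at load time

    @classmethod
    def cached(cls, key):
        # Plays the role of the primary, cache-backed constructor.
        if key not in cls._cache:
            cls._cache[key] = cls(key)
        return cls._cache[key]

    @classmethod
    def nocache(cls, key):
        # Always reloads from the 'database' and leaves the cache alone.
        return cls(key)

    @classmethod
    def clear_cache(cls, *keys):
        if keys:
            for key in keys:
                cls._cache.pop(key, None)
        else:
            cls._cache = {}

    def equivalent_to(self, other):
        return self.transitions == other.transitions


def latest_zone(key):
    fresh = ToyZone.nocache(key)
    # Evict the cached entry only if the underlying data has changed.
    if not fresh.equivalent_to(ToyZone.cached(key)):
        ToyZone.clear_cache(key)
    return ToyZone.cached(key)
```

Calling latest_zone repeatedly returns the same cached object until the underlying data changes, at which point a new object replaces it while existing references to the old one keep their old data.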

Belopolsky · February 27, 2020, 6:18pm

I agree. If the goal is to unambiguously specify a point in time, only the offset corresponding to the specific moment should be transmitted with the local timestamp. On the other hand, more often we use local timestamps to specify the “wall time” without regard to any notion of absolute time. When we specify the opening time of the New York Stock Exchange, 09:30 means whatever time is in use in New York on the given day, be it EST, EDT, EWT or anything else. If and when New York state abandons daylight saving time transitions, the opening bell will continue to ring at 09:30 throughout the year.