Parse "Z" timezone suffix in datetime

mehaase · August 28, 2019, 2:59pm

This is already opened as BPO 35829 but I wanted to ask about it over here for discussion.

Problem Statement

The function datetime.fromisoformat() parses a datetime in ISO-8601 format:

>>> datetime.fromisoformat('2019-08-28T14:34:25.518993+00:00')
datetime.datetime(2019, 8, 28, 14, 34, 25, 518993, tzinfo=datetime.timezone.utc)

The timezone offset in my example is +00:00, i.e. UTC. The ISO-8601 standard (for which fromisoformat() is presumably named) allows “Z” to be used instead of the zero offset, i.e. 2019-08-28T14:34:25.518993Z, however fromisoformat() cannot parse this:

>>> datetime.fromisoformat('2019-08-28T14:34:25.518993Z')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Invalid isoformat string: '2019-08-28T14:34:25.518993Z'

Paul Ganssle (@pganssle) is the maintainer of the dateutil library and has made numerous improvements to the standard library datetime as well. (Thanks, Paul!) Paul suggested that I should post here to brainstorm possible improvements in the API.

The dateutil library does include support for parsing the Z suffix:

>>> from dateutil import parser
>>> parser.isoparse('2019-08-28T14:34:25.518993Z')
datetime.datetime(2019, 8, 28, 14, 34, 25, 518993, tzinfo=tzutc())

This feels like a missing battery in the standard library, especially since other systems may produce dates that end in “Z”. One big example is JavaScript. (You can run this in your browser console right now!)

>> new Date().toISOString()
"2019-08-28T14:34:25Z"

If you have a web browser (or a node.js system) that sends you an ISO-8601 date in UTC, then you can’t parse it with Python’s standard library.

The obvious workaround (that my colleagues and I have committed to muscle memory at this point) is datetime.fromisoformat(my_date.replace('Z', '+00:00')). This works but it is verbose and this seems like a missing battery in the standard library.
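For reference, a minimal sketch of that workaround wrapped in a helper. The name parse_iso_z is hypothetical (it is not a stdlib function), and unlike a blanket replace() it only rewrites a trailing uppercase "Z":

```python
from datetime import datetime, timezone


def parse_iso_z(text):
    """Parse an isoformat()-style string, treating a trailing 'Z' as UTC.

    Hypothetical helper for illustration; not part of the standard library.
    """
    if text.endswith("Z"):
        # Swap the Zulu suffix for the explicit zero offset that
        # fromisoformat() understands.
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


parsed = parse_iso_z("2019-08-28T14:34:25.518993Z")
print(parsed == datetime(2019, 8, 28, 14, 34, 25, 518993, tzinfo=timezone.utc))
```

Strings that already carry an explicit offset, or no offset at all, pass through to fromisoformat() unchanged.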

Rejected idea

Paul doesn’t want to break the existing contract:

datetime.fromisoformat() is the inverse operation of datetime.isoformat(), which is to say that every valid input to datetime.fromisoformat() is a possible output of datetime.isoformat(), and every possible output of datetime.isoformat() is a valid input to datetime.fromisoformat().

Therefore, if fromisoformat() can parse the Z suffix, then isoformat() will need to emit the Z suffix instead of +00:00, which could create backwards compatibility issues. But then fromisoformat() wouldn’t be able to parse the +00:00 suffix anymore. Therefore, this idea cannot be accepted without breaking the contract.

Proposed Idea

The name fromisoformat() is a bit unfortunate because it doesn’t handle the full ISO-8601 spec. In fact, the spec is quite broad and covers issues that don’t matter in the datetime class such as representing dates, times, and intervals. Furthermore, the ISO spec isn’t an open standard as far as I know. (It looks like I would need to pay money to ISO if I wanted a copy to read?)

However there is a simplified standard that is open: RFC-3339. I suggest adding new methods datetime.rfcformat() and datetime.fromrfcformat() that implement this RFC. As a consequence, this would also allow us to parse dates ending in Z.

Let me know thoughts on this issue. Thanks!

jdemeyer · August 28, 2019, 5:31pm

Having two functions that are almost-but-not-quite-the-same sounds very confusing. I think it’s a worse solution than simply slightly extending the existing functions.

pf_moore · August 28, 2019, 6:20pm

How critical is it in practice that passing a string using the Z format through fromisoformat and isoformat doesn’t give the same string as passed in, but changes the Z to +00:00? It seems a pretty minor discrepancy (and accepting both Z and +00:00 conforms to “be liberal in what you accept and strict in what you produce”). But I’m not an expert in the field, so there may be reasons why perfect round tripping is important.

pganssle · August 28, 2019, 7:49pm

To be clear, it is certainly possible to maintain the “fromisoformat is only the inverse of isoformat” without making isoformat emit Z instead of +00:00 by default since isoformat() could always grow the capability to optionally emit Z in place of +00:00 with a feature flag, in which case fromisoformat() would be required to implement Z parsing. I wouldn’t want to add a feature flag to isoformat() just to maintain an arbitrary contract, though, so I would only consider this option if there’s strong demand for emitting Z in isoformat - and I have not seen any issues on BPO requesting this, so it’s probably not that important to people.
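To make the option concrete, here is a rough sketch of what opting in to "Z" output could look like as a standalone helper. The name isoformat_z is hypothetical; nothing in this thread says isoformat() has such a flag.

```python
from datetime import datetime, timedelta, timezone


def isoformat_z(dt):
    """Format an aware datetime in UTC with a 'Z' suffix (hypothetical helper)."""
    if dt.utcoffset() is None:
        raise ValueError("naive datetimes have no offset to express as 'Z'")
    text = dt.astimezone(timezone.utc).isoformat()
    # After conversion to UTC, isoformat() ends in '+00:00'; swap it for 'Z'.
    return text[: -len("+00:00")] + "Z"


utc_value = datetime(2019, 8, 28, 14, 34, 25, 518993, tzinfo=timezone.utc)
print(isoformat_z(utc_value))
```

Any parser paired with this output would then have to accept the "Z" suffix to keep the round-trip guarantee.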

That said, in some ways it would violate the spirit of the contract, which is that, as of right now, fromisoformat is intended to be used only on the output of .isoformat, which means that all the people who want it to parse Z are in a sense using it in an unsupported way. Modifying it to start accepting these would probably lead to more people hitting bugs in production when parsing a valid ISO 8601 string generated by something other than datetime.datetime.isoformat that happens to be in an unsupported format.

To be clear, currently the idea is that you should parse this name as “from isoformat” rather than “from ISO format”, meaning that it constructs a datetime from the output of isoformat. The same goes for datetime.fromisocalendar and datetime.fromtimestamp.

I strongly disagree with this idea, for the same reasons that @jdemeyer identifies.

This is not true, I just do not want to half-change the contract. If you look at the original issue in which I added fromisoformat, the intention was always that we would start with “reverses isoformat” because it is well-scoped and incredibly easy to explain what it does (it parses the result of isoformat()). I think it would be acceptable for it to eventually grow something like a full ISO 8601 parser, but there are many UI challenges and decisions to be made there.

There is no requirement that strings can be round-tripped from str → datetime → str, the only guarantee is in the other direction, so dt == datetime.fromisoformat(dt.isoformat()) must be true. We are free to expand what fromisoformat() parses and the main reason we have chosen not to is that it’s much more complicated to get it right.
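A small sketch of that one-directional guarantee, using a few arbitrary sample values (the samples are illustrative choices, not taken from the thread):

```python
from datetime import datetime, timedelta, timezone

samples = [
    datetime(2019, 8, 28, 14, 34, 25),                               # naive
    datetime(2019, 8, 28, 14, 34, 25, 518993, tzinfo=timezone.utc),  # aware, UTC
    datetime(2019, 8, 28, 14, 34, 25,
             tzinfo=timezone(timedelta(hours=-5, minutes=-30))),     # aware, odd offset
]

# datetime -> str -> datetime: guaranteed to give back an equal datetime.
round_trip_ok = all(dt == datetime.fromisoformat(dt.isoformat()) for dt in samples)
print(round_trip_ok)

# str -> datetime -> str: not guaranteed to reproduce the input string.
original = "2019-08-28 14:34:25"
normalized = datetime.fromisoformat(original).isoformat()
print(original, "->", normalized)
```

The second half shows why string round-tripping is not promised: the space separator is accepted on input but isoformat() emits "T" by default.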

I strongly disagree with this sentiment in most library code, as it tends to take clear specifications and make them very fuzzy and implementation-defined. In this case failing loudly on common mistakes that we can still interpret as a datetime is an early warning that you are using the function in an unsupported way. As of today, if you are not parsing the output of a dt.isoformat() call or a string guaranteed to be in an equivalent format, you should not be using datetime.fromisoformat, and if it works it only does so by accident.

My goals for a “general-purpose” ISO 8601 parser:

It should support the entire datetime portion of the spec (or as nearly so as we can)
It should have a way to specify which deviations from the spec are not allowed (e.g. no sub-minute offsets)
It should be possible to specify that you want to support certain subsets of all supported functionality (e.g. RFC 3339, which is in some ways a subset and a superset of ISO 8601).
It should support a minimum of deviations from ISO 8601 - essentially those that are specified to be changeable “by agreement” plus support for sub-minute time zone offsets.
It should continue to “just work” on the output of datetime.isoformat.

We will also need to decide what to do with the --MMDD and --MM-DD formats, since they represent a concept that cannot be represented with datetime.datetime. I believe the options are “don’t support at all”, “fill in the missing year from the current year” and “allow the user to specify the default value for the year”. The last two can also be combined (e.g. use current year by default but allow users to override it). I suspect if we didn’t support it no one would care, since most of the people who even know it exists are people who have tried implementing the spec.
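A sketch of the combined option (current year by default, overridable by the caller). The function name parse_month_day and the default_year parameter are hypothetical, and it returns a date rather than a datetime to keep the example short:

```python
import re
from datetime import date

# Matches the year-less forms --MMDD and --MM-DD.
_MONTH_DAY = re.compile(r"--(\d{2})-?(\d{2})", re.ASCII)


def parse_month_day(text, default_year=None):
    """Parse '--MMDD' or '--MM-DD', filling in the missing year (hypothetical)."""
    match = _MONTH_DAY.fullmatch(text)
    if match is None:
        raise ValueError(f"not a --MMDD or --MM-DD string: {text!r}")
    # Fall back to the current year unless the caller supplies one.
    year = date.today().year if default_year is None else default_year
    month, day = (int(part) for part in match.groups())
    return date(year, month, day)


print(parse_month_day("--08-28", default_year=2019))
```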

pf_moore · August 28, 2019, 8:13pm

Ah, if that’s the case then yes, fromisoformat shouldn’t accept Z. I didn’t realise that (although that’s my fault for not checking the docs). I guess the answer “if you want to parse more general ISO format dates, use a 3rd party library” stands, then. Which is fine for my needs, so I’ll stop offering uninformed opinions here

pitrou · August 29, 2019, 7:23am

Well, it’s still pretty annoying that something called fromisoformat doesn’t actually parse the ISO format. And the doc isn’t helpful, as it doesn’t give any alternative.

encukou · August 29, 2019, 9:24am

The way I see it:

ISO 8601 evolved from a very old standard, designed for parsing by humans. The main problem it fixes is ambiguity in traditional formats like 10/9/12. If a human who’s never heard of the standard gets an ISO8601-encoded string, they’ll either parse it correctly or go „this is weird, I better ask the sender what they meant!“.
That’s very good news for the receiver (encoder).

ISO 8601 specifies how to encode a lot more than just datetime: things like durations, repeating intervals. It allows you to use week-based counting. It’s very useful if you want to express something, but it’s not at all practical if you want to write a parser.
A complete parser for ISO 8601 is not only impractical, but also not very useful.

But that’s okay: you can define a subset of ISO 8601 and write a parser for that.
If you need week-based counting, ISO 8601 will give you the best way to encode a week-based date, with all the nice properties (unambiguity, lexical sorting) and all the relevant information (like which edge cases are solved and which are still dangerous).
Writing the encoder/decoder with all the nice properties is then trivial.

That, for me, is the point of ISO 8601: it has extremely good guidance for selecting a datetime encoding. But it is not a spec to be implemented.

Contrast with the other standard: RFC 3339. This is an encoding only of a moment in time, with timezone information – i.e. it’s limited only to what a Python datetime stores. It has nearly all the nice properties of ISO 8601, because it’s a profile/subset. (Not a 100% strict one, but the deviation is well argued.) And crucially, it’s designed to be easily implementable (and testable) – it omits the arcane parts of the ISO standard that are largely irrelevant to datetime.

(Also, RFC 3339 is an open standard: not only can it be reasonably implemented, but anyone can also check if the implementation is actually correct.)

Now contrast with datetime.__str__(), which has almost the same design goals as RFC 3339, but an additional one of being „human-friendly“. It replaces a T (a computer-friendly separator) with a space (a human-friendly separator). RFC 3339 explicitly doesn’t allow this to keep a useful property:

Assuming [important details], then the date and time strings may be sorted as strings […] and a time-ordered sequence will result. The presence of optional punctuation would violate this characteristic.

How does ISO 8601 handle this? It tells you T is the best choice, but allows other characters by „mutual agreement“ of sender and receiver. How typical of the ISO! It’s not a spec, but guidance for making your own spec.
Writing a parser that accepts T or space (or anything else) isn’t a lot of work, and so Python’s isoformat has an option to select the separator. It carefully passes the choice to define your own format on to the user.
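A short illustration of the separator option and of the sorting property quoted above. The two timestamps are arbitrary values chosen for the example:

```python
from datetime import datetime

earlier = datetime(2019, 8, 28, 14, 0, 0)
later = datetime(2019, 8, 28, 15, 0, 0)

with_t = earlier.isoformat()           # default "T" separator
with_space = later.isoformat(sep=" ")  # same layout as str(later)

# Both spellings parse back to the original values.
print(datetime.fromisoformat(with_t) == earlier)
print(datetime.fromisoformat(with_space) == later)

# Mixing separators breaks string ordering: " " sorts before "T",
# so the later timestamp ends up first.
mixed_sorted = sorted([with_t, with_space])
print(mixed_sorted)
```

As long as every string uses the same separator, plain string sorting still matches chronological order for values in the same offset.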

In conclusion, ISO 8601 is not a good spec to implement for datetime, but RFC 3339 is, and it’s a perfect match.

I’d like to quote Paul, but substitute the RFC for the ISO:

pganssle:

My goals for a “general-purpose” [RFC 3339] parser:

It should support the entire datetime portion of the spec (or as nearly so as we can) [and RFC 3339 strives to make this easy]

It should have a way to specify which deviations from the spec are not allowed (e.g. no sub-minute offsets) [RFC 3339 defines this rigorously]

~~It should be possible to specify that you want to support certain subsets of all supported functionality (e.g. RFC 3339 […]).~~ [it is trivial to define and implement a useful subset of ISO 8601, so IMO Python should focus on just its own subset]

It should support a minimum of deviations from ISO 8601 - essentially those that are specified to be changeable “by agreement” plus support for sub-minute time zone offsets.
[we want to keep the human-friendly space separator as an extension]

It should continue to “just work” on the output of datetime.isoformat .

The good news is that we’re almost there: we’re missing details like the Z.

Apologies for:

not having read the ISO standard
being all words and no work

pganssle · August 29, 2019, 3:09pm

Unfortunately RFC 3339 is not a perfect match for fromisoformat, since it requires a time zone, a requirement we most certainly do not have in datetime. Additionally, it doesn’t cover some things that isoformat() allows, such as sub-minute offsets.
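Both mismatches are easy to reproduce. A minimal sketch, with the sample values picked arbitrarily for illustration:

```python
from datetime import datetime, timedelta, timezone

# A naive datetime: isoformat() emits no offset at all,
# which RFC 3339 does not allow.
naive = datetime(2019, 8, 28, 14, 34, 25)
naive_text = naive.isoformat()

# A sub-minute UTC offset: isoformat() adds a seconds field to the offset,
# which RFC 3339 has no syntax for.
odd_zone = timezone(timedelta(hours=1, seconds=30))
odd = datetime(2019, 8, 28, 14, 34, 25, tzinfo=odd_zone)
odd_text = odd.isoformat()

print(naive_text)
print(odd_text)

# fromisoformat() still inverts both outputs.
print(datetime.fromisoformat(naive_text) == naive)
print(datetime.fromisoformat(odd_text) == odd)
```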

Another wrinkle here is that RFC 3339 does not support the use of commas as a separator for fractional components, which is allowed in ISO 8601 and, unfortunately, is included in the default format for the logging module - if we’re already being liberal in accepting anything that is allowed by “mutual consent” we should probably be able to parse the logging module’s format.
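For illustration, a timestamp in the logging module's default style (space separator, comma before the milliseconds) can be parsed with an explicit strptime() pattern. The sample string and the constant name LOGGING_STYLE are made up for this sketch:

```python
from datetime import datetime

# Layout of logging's default asctime: 'YYYY-MM-DD HH:MM:SS,mmm'.
LOGGING_STYLE = "%Y-%m-%d %H:%M:%S,%f"

stamp = "2019-08-28 14:34:25,518"
# %f accepts one to six digits and pads on the right, so ',518' is 518 ms.
parsed = datetime.strptime(stamp, LOGGING_STYLE)
print(parsed)
```

This sidesteps the question of whether fromisoformat() should accept the comma, at the cost of hard-coding one exact layout.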

Restricting ourselves only to the date, time and datetime related portions of the spec, it’s actually not terribly difficult to write a fairly full-featured ISO 8601 parser once you know all the rules. I would also contend that a full-featured ISO 8601 parser is useful, just that supporting additional valid ISO 8601 formats has pretty severely diminishing marginal utility once you get away from the (pseudo-regex) forms YYYY(-?MM(-?DD)?)?(.*HH:?MM:?SS([\.,]\d+)?([+-]HH:?MM([\.,]\d+)?)?)?. The marginal utility of adding additional formats is slight and the marginal cost of accidentally accepting invalid dates is also minor - it’s probably a net positive.

This is an interesting suggestion. The main problem is that you have two types of users: one kind that has a bunch of datetimes and wants to parse them as long as they are any kind of valid format, and another kind who knows the format of the datetime (e.g. “it was generated by isoformat()” or “the spec says it’s in RFC 3339” or “the spec says it’s ISO 8601”) and they want it to be an error if it’s not that. I have gotten requests for stricter versions of both dateutil.parser.parse and dateutil.parser.isoparse (which itself exists as a “strict” version of parse).

It might not be such a bad thing to make the default fromisoformat be maximally permissive (accept anything valid that is allowed “by mutual agreement”, plus extend the timezone offsets to accept any valid format for a naive time), and point people to dateutil.parser.isoparse for a more configurable strict-subset behavior (though that still means I’d need to figure out that API for dateutil.parser.isoparse).

cben · September 2, 2019, 12:32pm

Does datetime.fromisoformat(my_date.replace('Z', '+00:00')) really cover full RFC3339?

It’s an important RFC that’s used in many internet APIs. It’s the recommended way to represent moments in time in JSON Schema, OpenAPI etc.
So it would be nice to have it as a battery, and the stdlib is so tantalizingly close to providing it…

(As usual, an external library is more immediately useful because it can be used today, and python’s datetime is lacking some other things so people use external libraries anyway.)

ViktorHaag · September 4, 2019, 7:22pm

I regularly have to work with integrations and standards in the ed-tech world where zulu-terminated datetimes are required (required to accept as input, and required to produce on output), and would be happy to have this functionality in the standard library.

mehaase · September 19, 2019, 12:58pm

Paul, I know you favor the idea of a more comprehensive ISO-8601 parser, but you have stated it is tricky to design the API (e… feature flags) and nobody in this thread is asking for broader ISO-8601 support. They just want to be able to parse the date strings created by the world’s most popular programming language (JavaScript).

Would you endorse a minimal patch that just adds support for zulu dates (and updates documentation)?

pganssle · September 19, 2019, 2:32pm

No, I realize that from a practical point of view it would be nice to have something that often just works, but we have deliberately designed it this way because it has a very clear scope and by not stepping outside that scope for practicality’s sake, people are more likely to learn early on that they are using the function incorrectly (i.e. for parsing datetimes not guaranteed to produce only the formats that datetime.isoformat produces).

The preferred solution is to have a version of this function that will satisfy both the people who want to invert datetime.isoformat and the people who want to parse ISO 8601 datetimes in general. I think Petr’s suggestion of leaving the feature flags for dateutil.parser.isoparse and creating a liberal ISO 8601 parser might simplify things greatly, however.

In the meantime, I have seen no objections to using dateutil.parser.isoparse other than “but it’s in a third party library”, which is not a great justification, particularly when that library is dateutil - an incredibly popular library maintained by one of the maintainers of datetime (me) and from which datetime.fromisoformat was adapted in the first place. Best case scenario, we change the scope of fromisoformat today, PEP 602 passes and you can get the same functionality you get out of dateutil.parser.isoparse today in November 2020, assuming you are comfortable immediately upgrading your code to be Python 3.9-only. Given that timeline, I don’t think there’s enough urgency here that we should complicate the clearly-communicated scope of this function with a half-measure like supporting parsing “Z” for UTC.

aeros · September 21, 2019, 5:58am

As far as I’m aware, there’s no reliable means of determining the most popular programming language. The stackoverflow 2019 survey shows JavaScript on the top, but the PyPL Index and TIOBE Index would suggest differently. There are of course many other sources with differing results.

Therefore, I would highly recommend replacing “the world’s most popular programming language” with “one of the world’s most popular programming languages”. This is far less controversial, but still represents the same user demand for compatibility.

mehaase · October 14, 2019, 3:56pm

It’s a 1.3MB dependency that for many of us only adds one feature: the ability to parse the letter Z.

I don’t find this persuasive, because the same logic applies to every change ever made to Python.

I respect your authority on this issue, though, and thank you for the hard work on datetime and dateutil. I’ll stick with replace('Z', '+00:00') for now.

oneiros · January 9, 2020, 2:03am

And it violates the robustness principle: “Be conservative in what you send, be liberal in what you accept”.

blacklight86 · January 12, 2020, 1:26pm

I’ve just ended up here after getting sick of writing hackish code such as:

if d.endswith('Z'):
    d = d[:-1] + '+00:00'
return datetime.datetime.fromisoformat(d)

Java, JavaScript and other languages and platforms out there consider the Z-suffixed date strings as the norm.

If your code interacts for instance with a Node.js server then you may see dates formatted like this:

$ node
> d = new Date()
2020-01-20T20:48:30.971Z

I understand that the ISO-8601 is arcane and most of the languages implement only a subset of it, but please make sure that Python can at least understand the formats returned by default by other languages. Otherwise every Python script that interacts with a Java or JavaScript server may have to implement its own brittle fromisostring(), or rely on external libraries just to get the Z parsed properly.

pf_moore · January 12, 2020, 2:19pm

Please consider your tone.

cben · January 12, 2020, 4:04pm

Due to unfortunate naming, this is impractical — the full ISO-8601 format is large with arcane options like ordinal days, decimal fractions on minutes and much more. We can safely assume this will never happen in stdlib. I’m not sure any of the external packages that tried ever implemented 100% of the full standard. dateutil doesn’t either (doc says it doesn’t parse fractional minutes). A couple years ago I searched several other languages too, and didn’t find anybody doing full 8601! [I guess being a pay-to-read standard, with long prose and no BNF, makes this a goal programmers just don’t care enough about…]

What most people actually mean when they think “ISO” is “as long as I pass a valid RFC-3339 string”.

It’s a 1.3MB dependency that for many of us only adds one feature: the ability to parse the letter Z.

To be fair, there are multiple smaller modules that don’t attempt ISO 8601 but only RFC 3339 (not sure if any of them is perfect, but hey if not, let’s perfect one before requesting stdlib does it ):

Let’s see, what are the actual points separating fromisoformat from full RFC 3339?

“Z” or lowercase “z”. @blacklight86, note your code above doesn’t handle “z”.
4.3. Unknown Local Offset Convention.
Not clear how to best represent with datetime.
email.utils.parsedate_to_datetime set a precedent of returning a naive datetime, which you should understand as “UTC but with no indication of the actual source timezone”, which is… meh.
Leap seconds (section 5.7)?
Conversely, fromisoformat seems to accept any character between date and time. Even a digit.
Even \x00! Makes sense because isoformat takes an optional arg to emit any characters (space " " is common but not only). But a “from RFC” function better only allow "T", "t", and optionally " "?
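A rough sketch of a stricter “from RFC” parser along the lines of the points above. Everything here (the name parse_rfc3339_sketch, the regex) is an assumption for illustration, not an existing stdlib or dateutil API. It accepts "Z"/"z", allows only "T", "t" or " " as the separator, requires an offset, treats -00:00 the same as +00:00, truncates fractions beyond microseconds, does not range-check offset minutes, and leaves leap seconds to be rejected by the datetime constructor:

```python
import re
from datetime import datetime, timedelta, timezone

_RFC3339 = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})"            # full-date
    r"[Tt ]"                              # only these separators
    r"(\d{2}):(\d{2}):(\d{2})"            # partial-time
    r"(?:\.(\d+))?"                       # optional fractional seconds
    r"(?:([Zz])|([+-])(\d{2}):(\d{2}))",  # offset is mandatory
    re.ASCII,
)


def parse_rfc3339_sketch(text):
    """Parse an RFC 3339-style date-time into an aware datetime (sketch)."""
    match = _RFC3339.fullmatch(text)
    if match is None:
        raise ValueError(f"not an RFC 3339 date-time: {text!r}")
    year, month, day, hour, minute, second = (
        int(g) for g in match.group(1, 2, 3, 4, 5, 6)
    )
    fraction = match.group(7) or ""
    # Keep at most microsecond precision, padding short fractions on the right.
    microsecond = int(fraction[:6].ljust(6, "0")) if fraction else 0
    if match.group(8):
        tz = timezone.utc
    else:
        sign = -1 if match.group(9) == "-" else 1
        offset = timedelta(hours=int(match.group(10)), minutes=int(match.group(11)))
        tz = timezone(sign * offset)
    # datetime() rejects out-of-range fields, including second == 60.
    return datetime(year, month, day, hour, minute, second, microsecond, tzinfo=tz)


print(parse_rfc3339_sketch("2019-08-28t14:34:25.518993z"))
```

Unqualified local times (no offset) and arbitrary separator characters raise ValueError here, which is the opposite of what fromisoformat() does for those inputs.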

While I respectfully disagree on value of supporting RFC3339, this is a very insightful comment, thanks.

blacklight86 · January 12, 2020, 5:16pm

I agree that the ISO 8601 is large and arcane, and probably it’s not really worth implementing everything in stdlib. However I would argue that:

Python isn’t the first language to bump into the problem of how to implement the ISO datetime standard. From what I know, Java isn’t fully ISO-8601 compliant, but at least it supports both the time offset and the Z suffix notation. JavaScript instead implements most of the standard, even if some arcane features are platform-dependent. C/C++ also supports most of the standard, even though the solutions are platform-dependent (see strptime). My point is that probably Python doesn’t have to reinvent the wheel, and it could see instead how other languages have tackled the problem (answer: most of them aren’t fully compliant either, but they at least do support some reasonable variations, such as the time offset and Z suffix notation).
A quick search on the internet for “Invalid isoformat string: 'YYYY-mm-ddTHH:MM:SSZ'” would return hundreds of people that have been puzzled by this issue. It means that there is an issue, and people frequently bump into it - especially when their Python logic has to handle data returned by other systems/applications. Making sure that Python can parse at least the ISO format strings returned by default by the most common languages out there (at least Java, JavaScript, C/C++) would ensure better inter-compatibility - and prevent developers from coding their own brittle workarounds when they interact with e.g. a NodeJS server, or adding a new random dependency to their code developed by someone on Github.
As long as those behind the ECMA standard keep saying “we’ve always returned UTC datetime strings with the Z suffix, we won’t change it now”, and those who develop Python keep saying “we’ve always only parsed the datetime strings generated by Python itself (with time offset), we don’t care about processing in stdlib those returned by default by other languages”, the divergence can only get worse.

pganssle · January 12, 2020, 9:22pm

You are critically missing at least 4.4. Unqualified local time - RFC 3339 is only suitable for aware datetimes and requires a tzoffset.