PEP 680: "tomllib" Support for parsing TOML in the Standard Library

A TOML file is, most of the time, meant for human reading, so I can’t say a tomli-w-level writer is actually useful.
When generating TOML, we will most likely want to control the formatting; otherwise the output may be an unreadable document. If the file were for machine reading only, we wouldn’t have chosen the TOML format in the first place.

As for PDM, it is a build frontend and an installer; the build backend is extracted into a standalone package, pdm-pep517. In PDM, tomli is used for PEP 517 builds and tomlkit for package management.

1 Like

Please note that my point in the previous post was not to convince anyone that write support should be included in the PEP. I completely understand the authors’/sponsors’ reasons.

I was purely trying to reply to @barry’s question. My view is that yes, adding write support to the stdlib would influence whether or not tools include features (and given that writing TOML has already been mentioned as a brainstorm in the Packaging category, the existence of write support could also change future decisions regarding packaging and standards).

How relevant this is for the PEP or its acceptance is a completely different story. My view is that tomllib is useful/important even without writing support.


This is the discussion where TOML is mentioned: Python metadata format specification and implementation. I completely agree here that JSON would make more sense.

This is not the first time “writing TOML” has shown up in Packaging discussions (always as a brainstorm). Before PEP 643 and PEP 621, there was some discussion about backends modifying pyproject.toml to remove the dynamic fields: PEP 621: round 3 (as we all know, the idea was rejected and PEP 643 was crafted instead).

1 Like

I’d like to suggest widening the type of tomllib.load. As the PEP is currently written, load only accepts a file opened in binary mode. The justification is:

Using a binary file allows us to ensure UTF-8 is the encoding used, and avoid incorrectly parsing single carriage returns as valid TOML due to universal newlines in text mode.

This feels overly pedantic to me. It protects against the library accepting some obscure cases that are not strictly valid TOML, but it also makes it so the simplest way to read a TOML file (with open("my.toml") as f: config = tomllib.load(f)) doesn’t work. Also, it means you can’t use io.StringIO to build up a TOML document and then parse it with tomllib.load.
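
For concreteness, here is a minimal sketch of the difference (the file name is made up; the commented-out calls show what a widened load would additionally accept):

import tomllib

# Works under the PEP as written: a file opened in binary mode.
with open("my.toml", "rb") as f:
    config = tomllib.load(f)

# Rejected under the PEP as written, but would work if load were widened:
# with open("my.toml") as f:                          # text mode
#     config = tomllib.load(f)
# config = tomllib.load(io.StringIO('answer = 42'))   # requires "import io"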

The JSON spec requires JSON to be encoded in UTF-8, but json.load accepts files opened in text mode.

6 Likes

I’m very much against this.

A good API is not one that 99.99% of users use incorrectly. I know people will use it incorrectly because Tomli started with text file objects only (I wasn’t aware of how problematic this is back then), and nice people like @domdfcoding had to go and fix incorrect usage (in pretty much every consumer of the library).

To ensure correct TOML parsing with text file objects one must do

open("conf.toml", encoding="utf8", newline="")

and I have never, ever, seen anyone get that right with any of the TOML libraries available. Even library authors make the mistake of omitting the newline arg. It is IMO much better to error than allow most users to write incorrect code.
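
To illustrate why the newline argument matters, here is a small sketch (the TOML content is made up): with universal newlines enabled, a bare carriage return is silently turned into "\n" before the parser ever sees it.

import io

raw = b'a = 1\rb = 2\r'  # bare CRs: not valid TOML newlines

default = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")              # newline=None (universal)
strict = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")   # no translation

print(repr(default.read()))  # 'a = 1\nb = 2\n' -- looks like valid TOML, but the input wasn't
print(repr(strict.read()))   # 'a = 1\rb = 2\r' -- the parser can correctly reject this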

The use cases you mention are possible with the current API. I don’t think they should be possible at the expense of the 99% misusing the API though.

If you really want to read invalid TOML you can do

with open("conf.toml") as f:
    doc = tomllib.loads(f.read())

If for some reason you want to build an io.StringIO, you can do

doc = tomllib.loads(string_io.getvalue())

This isn’t true. According to the JSON spec, “JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.” According to the TOML spec, “A TOML file must be a valid UTF-8 encoded Unicode document.”


What I would maybe consider is to not accept file objects at all, but only paths instead (as pathlib.Path or str), e.g. the following API signatures

# load file
tomllib.parse(path: pathlib.Path | str) -> dict
# load string
tomllib.parse_string(s: str) -> dict
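
A minimal sketch of how these hypothetical functions could be implemented on top of the proposed load/loads, just to make the ergonomics concrete:

import pathlib
import tomllib

def parse(path: pathlib.Path | str) -> dict:
    # Always open the file ourselves, in binary mode, so callers can never
    # pass an incorrect encoding= or newline= argument.
    with open(path, "rb") as f:
        return tomllib.load(f)

def parse_string(s: str) -> dict:
    return tomllib.loads(s)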

That’s an easier-to-use API than the load[s] API, but it’s also not consistent with the existing load[s] APIs in the stdlib, so I’m not sure whether it’s better overall, or worth bikeshedding over.

As @hukkinj1 mentions, this may be simple, but it is also unfortunately wrong, not just in 0.1% of cases, but on any platform (e.g. all of Windows) where the default locale encoding is not UTF-8, whenever the TOML files being processed contain any non-ASCII text. The unfortunate fact (and a motivator of, e.g., PEP 597) is that it is easy for even experienced devs to forget the critical encoding argument (or, perhaps more of a 0.1% issue, get it wrong), especially if they’re on *nix, and this is very much a real-world issue that I’ve seen a number of times in other contexts.

I’d personally be strongly in favor of something like this; there doesn’t seem to be much of a use case that accepting file-like objects covers which accepting string/pathlib paths (and, separately, plain strings) does not, whereas doing so avoids a line or two of unnecessary boilerplate and a potential source of error in almost all cases. As discussed in the PEP and elsewhere, spot-checking a number of TOML projects revealed that almost all passed paths to load, not file objects (see point 1 of the Appendix), and the small number of users who do need file objects can just use the [s] version with .read().

Perhaps treading into bikeshed territory, but we should at least consider retaining the load[s] name rather than making up a new parse[s] (or bikeshedding over something else), since many/most of the third-party implementations I’m aware of for TOML, JSON, YAML, etc. (other than tomli) use the load[s] name and allow passing paths (at least some exclusively, IIRC). Personally, I’d see coming up with, bikeshedding over, and requiring users to remember some new names as more UX-unfriendly and inconsistent with the ecosystem as a whole than accepting paths rather than file objects for the first argument. But it’s not really my call.

Reading a TOML file (e.g., pyproject.toml) from a zip/tar file (e.g., an sdist). Yes, you can read the data and then convert it to a string and parse it, but then you have the problem of knowing the exact rules you need to use for converting valid TOML bytes to a Python string that tomllib can parse.

Yes, it’s rare, but it’s a real packaging use case.
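
A sketch of that sdist case (the archive name and member path below are made up): tarfile hands back a binary file-like object, which is exactly what load accepts.

import tarfile
import tomllib

with tarfile.open("example-1.0.tar.gz") as tf:
    member = tf.extractfile("example-1.0/pyproject.toml")  # binary file-like object
    metadata = tomllib.load(member)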

I don’t think there are sufficient benefits to having a method that takes a filename - either it’s an extra method which is at best a minor convenience over open/parse, or it replaces the existing parse method with something less flexible.

3 Likes

I’m also in favour of an API that takes a filename (I think I proposed that before). I think it avoids a potential pitfall where the user reads the file in text mode with the wrong encoding / line ending and then passes it to loads, rather than opening it in binary mode and passing it to load. The function can handle all of that internally.

Doing this has three downsides important to me

  • It’s different from the other loads in the standard library (as you say)
  • For reading from a file-like (when you don’t have a file), you can’t just let the tomli library handle string encoding because loads requires a string, as Paul says
  • There would now be no ability for streamed parsing (i.e. parsing part of the TOML before the entire file is downloaded). This is perhaps not that important with the typical file size of TOML documents

Thanks for explaining these use cases! If they are significant enough to potentially justify this, wouldn’t it be simpler to just accept bytes as well as str as input to tomllib.loads(), for which the PEP says:

It is possible to add bytes support in the future if needed, but we are not aware of any use cases for it.

So the cases that need this functionality could just do tomllib.loads(binary_file.read()) instead of tomllib.load(binary_file), while not complicating every other case? (Sure, users might make the mistake of reading in a file as text with the wrong encoding, but they can already do that anyway with the existing loads.)

However, better still might simply be allowing load to accept os.PathLike in addition to SupportsRead[bytes]. Consistency with json.load and pickle.load is the reason cited in the PEP for not doing so, and that objection applies even more strongly to accepting only paths, as @hukkinj1 proposes above, than to accepting paths in addition to files. This approach has the advantages of both and further reduces the delta relative to toml.load, at the cost of a modestly more complex type signature and implementation.
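
For illustration, a rough sketch of what that overloaded load could look like as a thin wrapper over the proposed API (the hasattr dispatch below is my own assumption, not anything specified in the PEP):

import os
import tomllib
from typing import BinaryIO, Union

def load(source: Union[BinaryIO, os.PathLike, str]) -> dict:
    if hasattr(source, "read"):
        # Binary file-like object: delegate directly.
        return tomllib.load(source)
    # Path-like: open in binary mode ourselves, so encoding/newline
    # mistakes are impossible.
    with open(source, "rb") as f:
        return tomllib.load(f)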

Personally, I find consistency the one potential reason not to do this. It ultimately comes down to a more or less subjective judgement: whether consistency with json.load and pickle.load (as well as tomli) outweighs the user-code simplicity and ergonomic benefits, along with consistency with the most popular third-party TOML implementation (if perhaps not for core packaging projects, as of very recently). Of course, tomllib.load as proposed is already somewhat inconsistent with them (for good reasons) in that it does not accept text-mode files, whereas if path-like support were added, it would accept additional types while not reducing compatibility further.

Also, if this were still a blocker, a different function name could be used instead, as @hukkinj1 suggests, at the cost of introducing an inconsistency in name rather than argument type.

This wouldn’t help with streamed parsing, although that is a highly hypothetical use case. The json module, for instance, never needed it; I can’t see why tomllib ever would.


FWIW, I already regret sharing the idea of an API accepting os.PathLike :smiley: . I really don’t think it’s worth bikeshedding over, and don’t think we should change the PEP.

(Perhaps my message was that I prefer os.PathLike over text file objects because such an API makes it impossible to open the file with incorrect arguments. But binary file objects are just fine!)

2 Likes

Of course, tomllib.load as proposed is already somewhat inconsistent with them (for good reasons) in that it does not accept text-mode files

Note that pickle.load does not accept text-mode files. Consistency arguments for the first argument only really apply to “file-like” vs “path-like” vs “both file-like and path-like”.

The suggestion of accepting path-like objects has come up several times for json.load, pickle.load, etc, and doesn’t seem to have had good reception:

2 Likes

Definitely agreed there. I’ve personally run into far too many bugs with other Python projects/code not getting encoding (and even trickier, newlines) handling right.

Yeah, I was thinking more about json there, but my broader point is that at least to me, the consistency argument doesn’t seem as compelling when it prevents load() from becoming strictly more compatible in the types it will accept rather than less (per Liskov), so long as the added type (os.PathLike) doesn’t create a significant hazard of (especially silent) misuse (which SupportsRead[str] does). But that’s ultimately somewhat subjective.

1 Like

I think the issue with floats is a bit of a red herring.

TOML uses decimal notation to express numbers, just as every text file format I know of does, and as language literals, including Python’s, do. So this is completely familiar to virtually everyone.

However, unlike JSON, which does not specify that numbers are to be interpreted as binary floats, TOML is explicit about it – that is really nice.

(Note a thread about a year ago about the json lib, in which a user had an issue with the fact that a 16-digit number didn’t round-trip through the json lib exactly (it was as exact as float64 could be) – that led to the proposal that json should use Decimal, which would indeed better match the JSON spec, but isn’t very practical.)

Sure, maybe some implementations only support float32, but some implementations might not support Decimal, or ??, either. If you don’t have float64, you can’t have floats with that much precision; that has nothing to do with TOML.

So all good here :slight_smile:

I came to this discussion specifically to address the issue of what load() will accept.

Like others have said, the json lib has exactly the same issues, and ideally they will be solved. Personally, while I think the current situation is not great, I’d rather see the new tomllib be consistent with the current json lib, and then we can solve the problem for the whole stdlib at once. Maybe tomllib will help provide the extra motivation.

As for solutions:

  • We already have UTF-8 mode; IIUC, it will become the default one day (though I can’t find a reference for that)

  • I’d love to see a PathLike API for all the text-file readers – I hope to write a PEP one day, but maybe someone will beat me to it?

In any case, TOML actually provides some extra motivation to do so :slight_smile:

Now that I think of it – we could introduce a PathLike API with tomllib, and then later, maybe, add it to the others.

1 Like

I’m a little unclear—could you explain what your concrete proposal is for this?

Yes, but per the spec, JSON can be UTF-8, UTF-16 or UTF-32, unlike the explicitly specified UTF-8 of TOML, and it doesn’t have explicitly specified EOL characters as TOML does. Furthermore, just because json has these issues doesn’t mean that we should inherit them with tomllib, just like json didn’t necessarily have to inherit the limitations of older APIs for other data formats.

If we have the chance to do things right this time, I don’t think the fact that a different stdlib module did things “wrong” should prevent us from taking it, especially since by far the easiest time to make a change here is when adding the module, not when breaking compatibility in a future release. See @brettcannon’s reply for some background on that.

Doing the “wrong thing” now just to give us more “motivation” to break backward compatibility later doesn’t seem to be a wise course of action, considering it makes the change doubly difficult: twice as many modules would be affected by a backward-incompatible change, with a correspondingly greater amount of user code broken and having to manage the transition, instead of getting it right the first time.

PEP 597 posits that, and PEP 538/540 originally specified that, but so far there is unfortunately not yet a concrete plan AFAIK.

As would I, and I have advocated for such in this thread for tomllib, but this is a highly non-trivial proposition for all the other format packages, and it is not really in scope here outside of tomllib itself.

This incremental approach would be the most potentially workable, though one must keep in mind that while I personally feel it makes sense for tomllib, it may not make sense for all the others, without adding extra args to handle things like encoding, EOL, etc.


Sure – but if it makes sense for tomllib, then let’s do it for tomllib – though at least considering the idea that we might want to establish a similar API for other file readers.

Finally: I just realized that TextIOWrapper has an encoding attribute – couldn’t tomllib.load() take a look and raise an exception if it’s not “UTF-8”?

A Path API looks reasonable for TOML libraries, or even as a future addition to tomllib, but this proposal is intentionally minimal. It’s a building block, and it’s up to the user to add the bells and whistles.
Also, the code (and tests!) for this isn’t written yet. How much would it complicate the implementation? Would it be ready for 3.11?

Finally: I just realized that TextIOWrapper has an encoding attribute – couldn’t tomllib.load() take a look and raise an exception if it’s not “UTF-8”?

This is a question that’s best solved in a library on PyPI, not by Python stdlib. Third-party libraries are usually easy to install, upgrade, or even pin to older versions if you need more time to deal with deprecations/removals (in case the idea doesn’t work out).

For example, a relatively trivial implementation of this could do:

import tomllib

def load_toml_text(f):
    enc = getattr(f, "encoding", None)
    nl = getattr(f, "newline", None)  # NB: text wrappers expose no public attribute for this
    buf = getattr(f, "buffer", None)
    # "correct" here means: some spelling of UTF-8 and no newline translation
    if buf is not None and enc in ("utf-8", "utf8", "UTF-8", "utf_8") and nl == "":
        return tomllib.load(buf)
    raise InvalidTomlFile(f)  # InvalidTomlFile: a hypothetical error type for this sketch

But there are enough fiddly choices to make here (making sure that both “utf-8” and “utf_8” are accepted as valid UTF-8 encodings, for example) to make it reasonable to keep this sort of API out of the stdlib until it’s been sufficiently battle-tested. So a 3rd-party library on top of a stdlib tomllib sounds ideal to me.
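
For the encoding-name question specifically, one approach (an assumption on my part, not something the stdlib would necessarily adopt) is to normalize via the codecs registry rather than string-compare:

import codecs

for name in ("utf-8", "UTF-8", "utf8", "utf_8"):
    print(codecs.lookup(name).name)  # "utf-8" every time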

I did consider this a while back for Tomli (i.e. accepting text IO but erroring on incorrect attributes).

The issues are

  • normalization of encoding values (as noted by others)
  • there is no public attribute for checking what newline value was passed in. There is newlines but that is very different.
  • It isn’t completely out of the question that the TOML spec changes and that the accepted open() arguments would change as a consequence. In fact, TOML was close to adding support for bare carriage return newlines recently, which would’ve been relevant here. Why would we want an API that is susceptible to such changes?
  • If we’re gonna be strict about arguments, only accepting certain values, that’s a very clear indicator we should rather have an API that has no arguments (binary file objects (or os.PathLike, but again I wouldn’t want to bikeshed over this)).

I think it’s a false assumption that if a file is human-readable text, then it also works well with Python text file objects. TOML is human readable, but it is still a strict format, incompatible with the incorrect encoding and newline translations made by text file objects.

It seems to me that the TOML spec can be interpreted in two ways here.

On the one hand, a TOML file is clearly stated as having a very explicit format - UTF-8 with an explicit definition of what constitutes a “newline”. On the other hand, the bulk of the TOML spec can be read as defining how to interpret a series of lines of Unicode text - and that can conceptually handle any source of “lines of text”, even if that data doesn’t come from a “TOML file” as per the spec.

People looking at TOML as “a text format for serialising data” are likely thinking in terms of that “broader” interpretation, and that’s why there’s a disconnect here. Overall, I agree that the stdlib should be strict, but that doesn’t mean it shouldn’t support the “lines of text” interpretation. That’s why the loads() API is present (and useful).

The thing is, the loads() API isn’t as flexible as an API that takes an iterable of text lines - which is itself a superset of an “API taking a file open in text mode”. But while a more flexible API might be better, is it worth the inconsistency with other stdlib APIs (JSON and pickle)? I’m inclined to say “probably not” - even though I’d prefer such an API. At the end of the day, TOML data is likely to be relatively small (after all, it’s meant to be human readable, and 100MB of data isn’t human readable!) so "\n".join(iterable) is going to be perfectly acceptable in practice.
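
For example, a hypothetical source of already-decoded lines (however they were obtained) can be fed to loads() in one line:

import tomllib

lines = ['title = "example"', '[server]', 'port = 8080']  # any iterable of text lines
doc = tomllib.loads("\n".join(lines))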

1 Like