PEP 680: "tomllib" Support for parsing TOML in the Standard Library

A TOML file is, most of the time, meant for humans to read, so I can’t say a tomli-w-level writer is actually useful.
When generating TOML, we will likely want to control the formatting; otherwise the output may be an unreadable document. If the file were for machine reading only, we wouldn’t have chosen TOML in the first place.

As for PDM, it is a build frontend and an installer; the build backend is extracted as a standalone package, pdm-pep517. PDM uses tomli for PEP 517 building and tomlkit for package management.

Please note that my previous post was not an attempt to convince anyone that write support should be included in the PEP. I completely understand the authors’/sponsors’ reasons.

I was purely trying to reply to @barry’s question. My view is that yes, adding write support to the stdlib would influence whether or not tools include features (and given that writing TOML has already been mentioned as a brainstorm in the Packaging category, the existence of write support could also change future decisions regarding packaging and standards).

How relevant this is for the PEP or its acceptance is a completely different story. My view is that tomllib is useful/important even without writing support.


This is the discussion where TOML is mentioned: Python metadata format specification and implementation. I completely agree here that JSON would make more sense.

This is not the first time “writing TOML” has shown up in the Packaging discussion (always as a brainstorm). Before PEP 643 and PEP 621, there was some discussion about backends modifying pyproject.toml to remove the dynamic fields: PEP 621: round 3 (as we all know, the idea was rejected and PEP 643 was crafted instead).

I’d like to suggest widening the type of tomllib.load. As the PEP is currently written, load only accepts a file opened in binary mode. The justification is:

Using a binary file allows us to ensure UTF-8 is the encoding used, and avoid incorrectly parsing single carriage returns as valid TOML due to universal newlines in text mode.

This feels overly pedantic to me. It protects against the library accepting some obscure cases that are not strictly valid TOML, but it also makes it so the simplest way to read a TOML file (with open("my.toml") as f: config = tomllib.load(f)) doesn’t work. Also, it means you can’t use io.StringIO to build up a TOML document and then parse it with tomllib.load.

The JSON spec requires JSON to be encoded in UTF-8, but json.load accepts files opened in text mode.

I’m very much against this.

A good API is not one that 99.99% of users use incorrectly. I know people will use it incorrectly because Tomli started with text file objects only (I wasn’t aware of how problematic this is back then), and nice people like @domdfcoding had to go and fix incorrect usage (in pretty much every consumer of the library).

To ensure correct TOML parsing with text file objects one must do

open("conf.toml", encoding="utf8", newline="")

and I have never, ever, seen anyone get that right with any of the TOML libraries available. Even library authors make the mistake of omitting the newline arg. It is IMO much better to error than allow most users to write incorrect code.
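
To illustrate the newline half of this, here is a minimal sketch, using only the standard library, of how default text mode silently rewrites a lone carriage return before any parser sees it. The file name and contents are made up for the demonstration.

```python
import os
import tempfile

# A lone carriage return (\r) is not a valid TOML newline.
raw = b"a = 1\rb = 2\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "conf.toml")
    with open(path, "wb") as f:
        f.write(raw)

    # Default text mode: universal newlines turn "\r" into "\n",
    # so the invalid separator now looks like a valid one.
    with open(path, encoding="utf-8") as f:
        translated = f.read()

    # newline="" switches the translation off and keeps the "\r".
    with open(path, encoding="utf-8", newline="") as f:
        untranslated = f.read()

print(repr(translated))
print(repr(untranslated))
```

With the default arguments the parser receives text that looks like two valid lines, so it has no way to notice that the file on disk was not valid TOML.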

The use cases you mention are possible with the current API. I don’t think they should be possible at the expense of the 99% misusing the API though.

If you really want to read invalid TOML you can do

with open("conf.toml") as f:
    doc = tomllib.loads(f.read())

if for some reason you want to build an io.StringIO you can do

doc = tomllib.loads(string_io.getvalue())

This isn’t true. According to the JSON spec, “JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.” According to the TOML spec, “A TOML file must be a valid UTF-8 encoded Unicode document”.


What I would maybe consider is to not accept file objects at all, but only paths instead (as pathlib.Path or str), e.g. the following API signatures

# load file
tomllib.parse(path: pathlib.Path | str) -> dict
# load string
tomllib.parse_string(s: str) -> dict

That’s an easier to use API than the load[s] API, but also not consistent with the existing load[s] APIs in the stdlib, so not sure if better overall, or worth bikeshedding.

As @hukkinj1 mentions, this may be simple, but it is also unfortunately wrong, not just for 0.1% of cases, but on any platform (e.g. all Windows) where the default locale encoding is not UTF-8, whenever a TOML file it processes contains any non-ASCII text. The unfortunate fact (and a motivator of, e.g., PEP 597) is that it is easy for even experienced devs to forget the critical encoding argument (or, perhaps more of a 0.1% issue, get it wrong), especially if they’re on *nix, and this is very much a real-world issue that I’ve seen a number of times in other contexts.
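
A quick sketch of the failure mode, with cp1252 standing in (as an assumption) for a non-UTF-8 locale encoding and a made-up one-line document:

```python
# UTF-8 bytes as they would sit on disk in a valid TOML file.
raw = 'name = "café"\n'.encode("utf-8")

# What open(path) without encoding= would effectively do on a
# system whose locale encoding is cp1252.
wrong = raw.decode("cp1252")

# What the TOML spec requires.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Nothing raises here; the wrong decoding simply yields a different string, which is why this kind of bug tends to go unnoticed until non-ASCII data shows up.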

I’d personally be strongly in favor of something like this; there doesn’t seem to be much of a use case that accepting file-like objects covers that accepting string/pathlib paths, and (separately) string objects does not, whereas doing so avoids a line or two of unnecessary boilerplate and potential error for almost all cases. As discussed in the PEP and elsewhere, spot-checking a number of toml projects revealed that almost all passed paths to load, not file objects (see point 1 of the Appendix), and for a small number of users needing such, they can just use the [s] version with .read().

Perhaps treading into bikeshed territory, but we should at least consider retaining the load[s] name rather than making up a new parse[s] (or bikeshedding over something else), since many/most of the third-party implementations I’m aware of for toml, json, yaml, etc. (other than tomli) use the load[s] name and allow passing paths (at least some exclusively, IIRC). Personally, I’d see coming up with, bikeshedding over and requiring users to remember some new names as more UX-unfriendly and inconsistent with the ecosystem as a whole than accepting paths rather than file objs for the first argument. But it’s not really my call.

Reading a TOML file (e.g., pyproject.toml) from a zip/tar file (e.g., an sdist). Yes, you can read the data and then convert it to a string and parse it, but then you have the problem of knowing the exact rules you need to use for converting valid TOML bytes to a Python string that tomllib can parse.

Yes, it’s rare, but it’s a real packaging use case.

I don’t think there are sufficient benefits to having a method that takes a filename - either it’s an extra method which is at best a minor convenience over open/parse, or it replaces the existing parse method with something less flexible.

I’m also in favour of an API that takes a filename (I think I proposed that before). I think it avoids a potential pitfall where the user reads the file in text mode with the wrong encoding / line ending and then passes it to loads, rather than opening it in binary mode and passing it to load. The function can handle all of that internally.

Doing this has three downsides important to me

  • It’s different from the other loads in the standard library (as you say)
  • For reading from a file-like (when you don’t have a file), you can’t just let the tomli library handle string encoding because loads requires a string, as Paul says
  • There would now be no ability for streamed parsing (ie parsing part of the TOML before the entire file is downloaded). This is perhaps not that important with the typical file size of TOML documents

Thanks for explaining these use cases! If they are significant enough to potentially justify this, wouldn’t it be simpler to just accept bytes as well as str as input to tomllib.loads(), for which the PEP says:

It is possible to add bytes support in the future if needed, but we are not aware of any use cases for it.

So the cases that need this functionality could just do tomllib.loads(binary_file.read()) instead of tomllib.load(binary_file), while not complicating every other case? (Sure, users might make the mistake of reading in a file as text with the wrong encoding, but they can already do that anyway with the existing loads.)

However, better still might simply be allowing load to accept os.PathLike in addition to SupportsRead[bytes]. Consistency with json.load and pickle.load is the reason cited in the PEP for not doing so, which is even more so true of accepting only paths as @hukkinj1 proposes above, instead of paths in addition to files. This approach has the advantages of both and further reduces the delta to toml.load, at the cost of a modestly more complex type signature and implementation.

Personally, I find this the one potential reason to not do this. It ultimately comes down to a more or less subjective judgement: whether consistency with json.load and pickle.load (as well as tomli) outweighs the user-code simplicity and ergonomic benefits, along with consistency with the most popular toml implementation (if perhaps not for core packaging projects, as of very recently). Of course, tomllib.load as proposed is already somewhat inconsistent with them (for good reasons) in that it does not accept text-mode files, whereas if path-like support were added, it would accept additional types while not reducing compat further.

Also, if this was still a blocker, a different function name could be used instead as @hukkinj1 suggests, at the cost of introducing an inconsistency in name rather than argument type.

This wouldn’t help with streamed parsing, although that is a highly hypothetical use case. The json module, for instance, never needed it, and I can’t see why tomllib ever would.


FWIW, I already regret sharing the idea of an API accepting os.PathLike :smiley:. I really don’t think it’s worth bikeshedding over, and don’t think we should change the PEP.

(Perhaps my message was that I prefer os.PathLike over text file objects because such an API makes it impossible to open the file with incorrect arguments. But binary file objects are just fine!)

Of course, tomllib.load as proposed is already somewhat inconsistent with them (for good reasons) in that it does not accept text-mode files

Note that pickle.load does not accept text-mode files. Consistency arguments for the first argument only really apply to “file-like” vs “path-like” vs “both file-like and path-like”.

The suggestion of accepting path-like objects has come up several times for json.load, pickle.load, etc, and doesn’t seem to have had good reception:

Definitely agreed there. I’ve personally run into far too many bugs with other Python projects/code not getting encoding (and even trickier, newlines) handling right.

Yeah, I was thinking more about json there; but my broader point is that at least to me, the consistency argument doesn’t seem as compelling in the context of preventing load() from being strictly more compatible in the types it will accept rather than less (per Liskov), so long as the added type (os.PathLike) doesn’t create a significant hazard of (especially silent) misuse (which SupportsRead[str] does). But that’s ultimately somewhat subjective.
