PEP 680: "tomllib" Support for parsing TOML in the Standard Library

I was curious about this so made a PoC (the tests only fail in CI because test dependencies try to use the built and incompatible package; tox solves this locally!). I don’t think this complicates the implementation. So I don’t see this as a blocker if we really hate binary file objects that much :grinning_face_with_smiling_eyes:

EDIT: The bulk of the change is pretty much exactly this one line.
EDIT 2: I think that one line also highlights the question: Do we want to be consistent with the API and abstraction level at which the json, marshal, pickle, plistlib (perhaps others?) modules operate, or spare users from writing that one with open("f.toml", "rb") as f: line?

Personally I don’t think it’s about “sparing people one line” as about making the API more or less flexible. TOML data can come from a lot of places that aren’t a file in the filesystem (for example, a file from a zipfile/tarfile). So having a means to parse TOML from a file object is necessary.

Yes, a user can do tomllib.loads(f.read().decode("utf8")) - but how do they get the newline handling right?

So essentially, my view is:

  • “Parse from binary data” is essential.
  • “Parse from text data” is (extremely) convenient.
  • Making everything work from in-memory data is viable, but feels uncomfortable (TOML shouldn’t be so large that memory overhead is a concern, but people tend to think in terms of streaming interfaces when they have streaming data).
  • Parsing a named file in the filesystem (via a Path object) is a useful convenience function, and would probably cover a significant proportion of use cases, but not all.

With that said, my preference would be for 3 APIs

  • Parse a binary file stream
  • Parse an iterable of Unicode strings representing lines of text
  • Parse TOML from a Path object

Matching existing APIs by providing “parse a (binary) file object” and “parse a single string” has the advantage of consistency, but otherwise feels inferior. But unless I’m mis-remembering the tomli interface, it also has the advantage of being “take the tomli interface as is, and stop bikeshedding”, so it has some real-world experience (for TOML) to back it up.

Other variations feel to me like they are neither optimal nor consistent. In particular, I accept the argument that “parse a text file object” is a bug magnet because it doesn’t enforce the TOML spec - but even though “parse an iterable of lines” is functionally capable of doing the same thing, it feels like it would be less prone to misuse.

Accepting paths seems fine to me, but if that isn’t acceptable, please include an open function so people can just call tomllib.open(…) instead of having to remember the keyword arguments, and the spelling of utf8, and whether they need to deal with encoding errors…

This is a very good PEP. I think it would be reasonable to add a simple toml reader just to cover the data format, and therefore appreciate the discussion of why a writer is not included.

I think the file usage (have to get the right kind of file, and open with the appropriate magic) is awkward, but it sounds like “also accept a filepath” is a reasonable solution.

I greatly appreciate the willingness to say <<this is only a basic version, you may prefer third-party implementations such as X, Y, or Z, particularly if you want to write these files>>. Ideally, that will also be in both the docs and the docstring for the module itself.

Is it actually difficult to hand-code a TOML writer for a specific file structure? I may be missing something, but perhaps regular string interpolation can get you there.

What would be the practical benefit over just vendoring a minimal, battle-tested, already-existing package like tomli_w, which is only 167 sloc in a single file (more or less, if you vendored it)?

The benefit is to avoid having to maintain it. :slight_smile:
Of course, if someone volunteers to do so on a long-lasting basis, it’s not a problem.

I mean, I’d argue that just dropping in a new version of tomli-w if ever needed is easier than having to maintain oneself whatever hacky parsing code one wrote :slightly_smiling_face:

We’re talking about what to include in the standard library here.

Oh, sorry; I assumed you were referring to an alternate solution for the situation experienced by the “third party tools” that need to write a specific TOML file that @barry mentioned in the quote you replied to, rather than seriously suggesting including a string-interpolation-based TOML writer in the stdlib.

In that case, it’s hard to see how that would be more maintainable and much if any less code than using the 167-ish sloc of fully TOML v1.0.0-supporting tomli-w as a starting point, unless we declared a pretty narrowly constrained subset of the TOML format that we could write (like just first-level tables and second-level keys, basically typed INI, with no nested dicts, table-arrays, datetime parsing, lists, etc).

That would only be useful for a small subset of applications, there would likely be wide user confusion over why it doesn’t work for others and it still raises all the questions above about style/formatting, API design, etc. And of course, we’d have to decide exactly what that constrained subset would be, which would substantially delay inclusion.

So to answer your question: no, I don’t see how that really solves much, but maybe I’m the one missing something?

No. What I said is that if someone needs a TOML writer and doesn’t want to rely on anything else than the standard library, then they can still hand-code a rudimentary TOML writer tailored to their particular use case.

Again, in the context of this PEP, use cases where it’s ok to use third-party libraries are out of the picture.

Okay, thanks for clarifying (again) and sorry for the confusion. In that case, that was what I thought you said in your first comment, and that’s what I was referring to in my reply, but I now see how the latter could be misinterpreted as referring to the benefits for the stdlib rather than the third party authors, sorry. I then misinterpreted your response to refer to the benefits for the latter (that I had been referring to) and in turn was confused by your further reply, which I thought was talking about your first comment rather than your second. Now it all makes more sense, thanks.

In that case, my original comment still stands; that’s a pretty hacky and limited solution for a small minority of cases, but in the real world, unless the project is operating under rather artificial and contrived restrictions that do not allow them to even vendor or incorporate (not just depend upon) existing minimal, tested, permissive-licensed code, they can simply depend on/vendor/drop in tomli-w instead of reinventing the wheel.

You don’t have to remember any of that as tomllib.load() takes a file opened in binary mode (just like json.load() supports).

No worse than any other file you may want to construct in a certain way.

Of course, you still do have to open the file and remember to do so in binary mode but you get a clear error message if you don’t.

That said, both of the above could be obviated in one of three ways: allowing tomllib.load (or whatever it is named) to accept os.PathLike in addition to a binary file object; adding a new function that takes os.PathLike; or, if neither is acceptable, having only two functions, one that takes only os.PathLike, and tomllib.loads (or equivalent) extended to accept bytes as well as str (the only reason given for not doing so being the lack of a clear use case). In that last case, most uses that require a file object would be covered by passing the result of the file object’s .read(), which would handle everything but the theoretical corner case of streaming data (which the first two approaches would handle, along with everything else the current implementation does).

Just so we’re all on the same page, the proposed API works like this:

import tomllib

with open("foo.toml", "rb") as f:
    doc = tomllib.load(f)

If instead you do open("foo.toml", "r"), you’ll get an error like TypeError("File must be opened in binary mode, e.g. use open('foo.toml', 'rb')")

Overall, I feel this is both quite friendly and familiar.

I personally agree an API that takes paths could be a nice convenience (either by extending load or by adding a new function). As mentioned in the PEP, the PyPI toml library allows load to take a path, not just a file-like object, and this functionality is widely used.
However, as mentioned upthread, the suggestion of an API to take path-like objects has come up several times for json.load, pickle.load, etc, to an overall somewhat negative reception.

This is why the PEP does not propose it, as discussed in the relevant section of PEP 680 (tomllib: Support for Parsing TOML in the Standard Library, on peps.python.org).

I’d love to see a PathLike API for all the textfile readers … Now that I think of it – we could introduce a PathLike API with tomllib, and then later, maybe, add it to the others.

If we determine we want to make this change for json, pickle, etc, we should also do so for tomllib. But given that there doesn’t seem to be clear consensus, I’d rather this be its own separate discussion, rather than backdoor-ing an API via tomllib.

Just to be clear, allowing path-like objects to those functions is indeed a separate discussion and shouldn’t be conflated here. I do think, however, there are some reasons specific to tomllib that could potentially justify allowing it.

In particular, existing use of an implementation that allows both filelike and pathlike objects is widespread and well-proven. Not featuring this capability, despite a very similar module and function name, could be surprising and frustrating to users, prevents it being a drop-in replacement, and requires non-trivial structural changes to users’ code (adding an additional nested block with an additional indent level), which increases both friction and the chance of bugs when switching to it, and prevents it from being as easily swapped out for other non-stdlib alternatives. Accepting pathlike objects would also avoid tempting users to not close file objects properly.

Conversely, it is a new module, implementing pathlike support is very straightforward, and it would accept a superset of what the other similar functions do, so there wouldn’t be compatibility or confusion issues if users do assume it simply takes a filelike object. That said, it isn’t the end of the world and could be added later, but since it would require structural changes to the surrounding code, it is a bit of a hassle to use pathlike objects until dropping any Python version that doesn’t support them.

I thought the primary goal was to allow packaging tools to not use third-party dependencies, not to entice users to switch to the standard library when perfectly valid TOML libraries already exist on PyPI.

Sure, that was the initial motivation of this, but I don’t think it means we shouldn’t at least consider making it more generally useful for and compatible with other packages, provided it doesn’t interfere with that use case or overly complicate the API or implementation (which, unlike write support, it sounds like it doesn’t, and there seems to be a lot more support for adding the former).

The issue is not that a path-like API is a fundamentally bad idea, but that there will always be another good idea about what to add.
A line needs to be drawn somewhere, and IMO, "tomli as it is now" is a pretty good place to draw it.

And the key word there is “convenience”. We need to remember this PEP is about getting a bare-bones, break-the-dependency-cycle TOML parser into the stdlib. If you want fancier you can either go to PyPI or it can be added to the stdlib later. But we should not be prematurely optimizing the API when the key goal is basic support in the stdlib. It is much harder to remove something than it is to add something to the stdlib.

Nope, this is veering off-topic. It’s a separate discussion, as you said, so we should stop talking about it and focus on any issues people have that would lead to the PEP being rejected instead of trying to expand its API scope.
