Adopting/recommending a toml parser?

tjreedy · May 17, 2021, 11:35pm

The stdlib has a usable html.parser but no html.writer, so I think we could start with just a toml parser. Sphinx has a writer, which formats the output however it does, but I am sure others make other choices.

hukkinj1 · May 27, 2021, 11:59pm

Hi all! Wanted to let you know that I just wrote Tomli which I think is pretty much what you describe here, if that’s what you’re looking for, i.e. as minimalistic as possible. It’s spec 1.0.0 compliant and has 100% test coverage. It also seems to perform better than other pure Python parsers, although that was not really a goal for me. You’ll find it here, fresh out of the oven.

frostming · June 8, 2021, 9:51am

Yes, the issues and PRs are not responded to in time. I even tried to offer to take over maintenance but haven’t received any response: [Maintainance] Willing to maintain tomlkit · Issue #62 · python-poetry/tomlkit · GitHub

What’s more, the ability to preserve comments and order also brings a lot of problems, so I gave up creating my own fork of it: GitHub - frostming/atoml: Yet another style-preserving TOML library for Python. But after I released the first version, tomlkit also adopted all those fixes and made a new release after not having a release for 10 months.

I personally am against adopting tomlkit; toml should be a better choice if PyPA can take over the maintenance.

h-vetinari · June 8, 2021, 10:24am

There are probably two different (and opposing) camps in terms of preference between performance and style-preservation. I want to make the case that the stdlib should prioritize the latter and improve performance later.

Preservation of order, comments & whitespace of a (suitably normalized) toml file is essential for all automated reprocessing of config-files that are supposed to also be read/edited by humans. In fact, while some people use TOML as just another config format, I’d argue that the choice of that format already implies the importance of having humans interact with the file, since readability is arguably the key differentiator of TOML vs. JSON.

The problem becomes very quickly that automatic reprocessing of files becomes impossible once all carefully constructed metainformation (such as comments & order) falls away, and this closes entire classes of usecases that TOML should rightfully enable (IMO).

Therefore, I’d really love to see the stdlib aim for a “batteries-included” approach here (i.e. adopting or implementing a solution that is able to preserve comments etc.), and improve performance later. Those on the critical path can still use optimized third-party libraries (like pytomlpp).

takluyver · June 8, 2021, 10:36am

My prioritisation is much more about complexity than performance. Parsing a TOML file, discarding all the style information, and returning a dict of standard data types should be much simpler than capturing all of the formatting & comments, working with ‘TOML node’ objects and preserving the ability to write it out with formatting intact. Complexity means more bugs and more room for disagreement (e.g. how hard does it try to match existing styling for new data).

I’m not denying the importance of programmatically modifying TOML files while preserving the formatting. That’s clearly very valuable. But a plain parser to read TOML with very few options necessary seems like a better fit for the standard library.

pf_moore · June 8, 2021, 10:47am

Do we actually even need write capability at all? It makes sense for parity with JSON, I guess, but in practical terms I’d expect usage of TOML to be much more like ini files - written manually by humans using an editor, and only really read by application code. For the occasional use case where read/write is needed, using a 3rd party library seems justifiable. Much like we have configparser in the stdlib, but ConfigObj available on PyPI for read/write cases.

takluyver · June 8, 2021, 11:05am

tomli looks like a good example of this (thanks @hukkinj1!). It’s about 800 lines of code in total (tomlkit is several times larger), and has two public functions: load and loads.
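
For readers who have not seen that two-function interface, here is a minimal sketch of how it is used. It assumes Python 3.11+, where the stdlib `tomllib` module exposes the same `load`/`loads` pair, and falls back to a recent Tomli release otherwise; the document contents are made up for illustration.

```python
import io

try:
    import tomllib  # stdlib from Python 3.11 (assumption: same load/loads API)
except ModuleNotFoundError:
    import tomli as tomllib  # third-party fallback on older Pythons

# Hypothetical config text, invented for this example.
doc = """
[tool.example]
name = "demo"
retries = 3
"""

# loads() takes a str and returns plain dicts, lists, strs, ints, ...
from_str = tomllib.loads(doc)

# load() takes a file object opened in binary mode.
from_file = tomllib.load(io.BytesIO(doc.encode("utf-8")))

print(from_str)
print(from_file == from_str)
```

Nothing in the result remembers comments, key order choices, or whitespace, which is what keeps the API this small.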

encukou · June 8, 2021, 11:46am

Preserving whitespace and comments is not just a performance issue; it’s also an issue of representation. Straightforward lists/dicts cannot store comments, so you need more complex data structures.
Designing these is not easy; worse, there’s no one clear obvious way to do it.
On the other hand, parsing TOML into dict/list/str/int etc. does have one obviously correct behavior. There are very few API design choices (e.g. return a list or tuple?), and they can usually be answered by following json's precedent.
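
As a small illustration of the json precedent mentioned here, this sketch shows which built-in types the stdlib `json` module hands back. The input string is invented for the example.

```python
import json

# Hypothetical input, made up for illustration.
parsed = json.loads('{"name": "demo", "ports": [8000, 8001], "debug": false}')

# Objects come back as dict, arrays as list (not tuple), booleans as bool.
print(type(parsed).__name__)           # dict
print(type(parsed["ports"]).__name__)  # list
print(parsed["debug"] is False)        # True

# A plain dict/list tree has no slot where a comment or the original
# whitespace could be kept, which is the representation problem above.
```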

In the standard library, we have one shot to do the semantics right before they need to be maintained forever. Unlike performance, they can’t be improved.

h-vetinari · June 8, 2021, 12:02pm

I’d argue that yes, the stdlib really should (though arguably that feature can be added at a later stage). Most use-cases start out as “I just need to read this config”, but as mentioned, the choice of format already signifies that human interaction is an important requirement, and automated reprocessing (if only for validation of the human edits) is something that almost always comes up soon afterwards.

Until the comment by @frostming above, I was not aware of a toml library that is able to do this for now (though it’s been a while since I had researched that), so it becomes a self-fulfilling prophecy that writing comes as an afterthought in use-cases (or rather, everyone has to swallow that they cannot do that automated reprocessing of human-edited files).

Humans (want to) choose TOML because config files are often part of the code and need to be readable, but while it’s trivial to write JSON (because the format has effectively no metadata), a huge number of use cases would be prevented from switching to TOML if there’s no write capability.

h-vetinari · June 8, 2021, 12:08pm

But TOML has a spec, so we’re not talking about lists/dicts, but arrays, tables, etc. And while parsing that syntax, comments can be attached to any given node (I’d say always afterwards, because that would unify the line-ending comments with the free-standing ones, with one artificial node at the top to handle leading comments).

But yes, for an interface that supports both reading and writing, the nodes would need to be stored in something more capable than a vanilla dict. My point is not that all features need to be there from day zero, but that the stdlib should not accept an API that precludes the writing case from being added as a natural extension (e.g. a boolean on ingestion preserve_formatting or whatever).

hukkinj1 · June 8, 2021, 12:32pm

^ This is the reason why I dropped plans of style preservation very early on for Tomli. In fact, performance wise I don’t think it’s even an issue: you’re gonna parse through all the whitespace and comments anyways, so storing them in a data structure shouldn’t add much of a performance hit at all.

When it comes to the real problem, style-preserving representation, a naive one is easy: just parse a sequence of “statements” (key/value pair, table declaration, whitespace, etc.).

The problem with that is the API for editing a doc in that representation is horrible compared to nested dicts and lists.
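
To make the trade-off concrete, here is a toy sketch of such a statement sequence. It is not how Tomli or tomlkit work; `Statement`, `parse_statements` and `dump_statements` are hypothetical names, and the splitter only understands one-line statements (no multi-line strings or arrays).

```python
from dataclasses import dataclass


@dataclass
class Statement:
    kind: str  # "table", "keyval", "comment" or "blank"
    text: str  # the original line, kept verbatim


def parse_statements(src: str) -> list:
    """Toy splitter: one statement per line, multi-line values unsupported."""
    statements = []
    for line in src.splitlines():
        stripped = line.strip()
        if not stripped:
            kind = "blank"
        elif stripped.startswith("#"):
            kind = "comment"
        elif stripped.startswith("["):
            kind = "table"
        else:
            kind = "keyval"
        statements.append(Statement(kind, line))
    return statements


def dump_statements(statements) -> str:
    return "\n".join(s.text for s in statements) + "\n"


src = "# settings\n[server]\nport = 8000  # default\n\n[client]\nport = 9000\n"
doc = parse_statements(src)
round_tripped = dump_statements(doc)  # identical to src: style is preserved

# Editing is the painful part: to change server.port we have to walk the
# statements and track which table we are in, instead of doc["server"]["port"].
in_server = False
for s in doc:
    if s.kind == "table":
        in_server = s.text.strip() == "[server]"
    elif in_server and s.kind == "keyval" and s.text.split("=")[0].strip() == "port":
        s.text = "port = 8080  # default"

edited = dump_statements(doc)
print(edited)
```

Storing the style is trivial here; the loop needed for a one-value change is the part that does not scale to a pleasant API.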

I believe tomlkit tries to solve this, and provides a nested dict/list like API, but the problem with that is style ambiguity.

Consider the following TOML docs:

a.b = [{}]

[a]
b = [{}]

[[a.b]]

[a]
[[a.b]]

These structures are equivalent, and funnily enough I believe with some imagination the list could go on.
Storing the stylistic difference of these TOML docs is a non-issue, but providing a programmatic API that doesn’t lose existing style, and where the user can decide the style to be used, is an incredibly tough problem (it might just be that pathlib.Path(file).write_text(toml_doc) is the best user experience, lol).
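
The equivalence of the four documents above can be checked with a plain parser. This is a sketch assuming Python 3.11+ (stdlib `tomllib`) or a recent Tomli release as a fallback.

```python
try:
    import tomllib  # stdlib from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party fallback on older Pythons

# The four spellings from the post, one string per TOML document.
docs = [
    "a.b = [{}]\n",
    "[a]\nb = [{}]\n",
    "[[a.b]]\n",
    "[a]\n[[a.b]]\n",
]

parsed = [tomllib.loads(d) for d in docs]

# Every spelling yields the same nested structure, so a writer that starts
# from this structure alone cannot know which spelling to emit.
expected = {"a": {"b": [{}]}}
print(all(p == expected for p in parsed))
```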

h-vetinari · June 8, 2021, 12:42pm

Regarding the different valid toml-docs, roundtripping doesn’t have to be no-op, it just has to be idempotent (i.e. after doing it once, you keep getting the same representation, which humans can get used to writing).
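
A rough sketch of the "not a no-op, but idempotent" property, using the stdlib `json` module as a stand-in because it ships with a writer. The `normalize` helper is hypothetical, and unlike the TOML case under discussion there are no comments to carry along here.

```python
import json


def normalize(text: str) -> str:
    """Parse and re-emit in one fixed canonical layout (hypothetical helper)."""
    return json.dumps(json.loads(text), indent=2) + "\n"


# Hand-written input with its own spacing, invented for this example.
hand_written = '{"name":"demo",   "ports":[8000,8001]}'

once = normalize(hand_written)
twice = normalize(once)

print(once != hand_written)  # the first pass changes the layout
print(twice == once)         # later passes change nothing
```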

(Presumably there could be some few PEP8-style knobs to expose, e.g. at which length to break inline tables into fully-fledged ones, etc.)

hukkinj1 · June 8, 2021, 12:48pm

Well I guess that depends on the use case, no? But essentially for some use cases it no longer is style preserving if roundtripping isn’t a no-op.

And talking about use cases, in my subjective experience, any use case other than “read a config file and be done with it” seems very rare in the wild.

Some data serialization uses may exist (poetry lockfile) but data serialization doesn’t care about style.

h-vetinari · June 8, 2021, 1:00pm

That’s IMO just step 0 on a long journey. Config is ubiquitous, and TOML is an excellent format for it. But once you start scaling, you’ll want to update it in an automated manner (e.g. “add this extra value” to all deployed configs of your app/pipeline/whatever so that you can release a new version) - if you can’t do this without destroying the metadata, you’re either back to doing it manually, or saying goodbye to using metadata at all (or hand-parsing, blergh).

I accept that it might not be everyone’s experience, but for me it has come up eventually in basically all use cases. And I’d like to highlight also that the absence of metadata-preserving toml libraries directly contributes to that, because by and large, people who need to (re)serialize their configs just learn that it’s not possible with toml (currently). And that’s IMO a completely artificial constraint, and ideally one that shouldn’t be reinforced by the stdlib.

hukkinj1 · June 8, 2021, 1:03pm

Sure thing, I’m not claiming there isn’t a use case, but that the use case might be rare. And if it is rare, maybe it’s not justified in the stdlib, but should be a third-party package instead.

h-vetinari · June 8, 2021, 1:37pm

All I’m asking is that the design for the stdlib does not close the door for this to be added later. Even if it’s rare now (which I see differently, see above…), it’s going to be ever-more important.

EpicWink · June 8, 2021, 1:51pm

It’s not like load and loads have to be the only functions to parse TOML: you could use those functions to return a nested dict of simple types, and introduce new functions (e.g. deserialise) later on for rich TOML parsing.

uranusjr · June 8, 2021, 2:53pm

I wonder if it would be good to have a read-only library in stdlib, and a read-write-style-preserving one under official-ish maintenance like psf or pypa. Configuration file use cases (not just TOML, but generally) are overwhelmingly human-edited, machine-read, so we should prioritise that for the stdlib, and leave extra functionalities to third-party libraries.

IMO it is wrong to use JSON as an analogy. TOML is designed as a human-readable configuration format, and JSON is more of a data interchange format. JSON is used as a configuration format by very few tools (only Node-related stuff off the top of my head), and does a pretty bad job at it. A better comparison would be either YAML (which the stdlib doesn’t have) or INI (for which the stdlib writer is not style-preserving).
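
The INI point can be seen with a short round trip through the stdlib `configparser`. This is a minimal sketch; the file contents are invented for the example.

```python
import configparser
import io

# Hypothetical hand-edited INI file with a comment and compact spacing.
original = "# tuned by hand\n[server]\nport=8000\n"

parser = configparser.ConfigParser()
parser.read_string(original)

# Write the parsed data straight back out.
buffer = io.StringIO()
parser.write(buffer)
rewritten = buffer.getvalue()

print(rewritten)
# The data survives, but the comment is gone and the spacing around "="
# follows the writer's own convention rather than the original file.
```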

pganssle · June 8, 2021, 3:00pm

I tend to be in the anarchistic laissez faire camp about this, but I don’t see why there must be something official or semi-official. The big benefit I see to including a TOML parser in the standard library is that there’s a standard file that packaging libraries need to read in many contexts, and bootstrapping packaging stuff is difficult, so it’s helpful to minimize dependencies. If we adopt a TOML reader for those purposes, also developing a new, full-featured TOML library just so that there’s an “official recommendation” seems unnecessary. Just let the best library win.

uranusjr · June 8, 2021, 3:04pm

I think the main issue here is there’s no best library; every one of the full-featured solutions kind of sucks.