Adopting/recommending a toml parser?

dholth · June 8, 2021, 3:20pm

Maybe bootstrapping packaging stuff is not as difficult as it used to be.

Adopting a toml library means pypa maintains one.

Putting it in the standard library is something else.

encukou · June 8, 2021, 3:54pm

IMO, it’d be very hard to make a full-featured solution that won’t suck for anyone.

This journey?

I don’t need to write TOML at all.
Just write a TOML file so other tools can read it; I don’t care about formatting.
…
…
Also, numeric values under the download_count key should use thousands separators by default, but only if they’re greater than 10_000.
…

A library that would solve the last step would be so complex that it would no longer be a good fit for users at the beginning of the journey.

I think that rather than making sure the stdlib version can be extended, we need to encourage external libraries for formatting TOML. Just like attrs exists to fill the gaps in dataclasses.

uranusjr · June 8, 2021, 4:05pm

To clarify, I wasn’t referring to the implementation, but the projects. All the projects mentioned here except tomli are either abandonware or well on their way to becoming abandonware. The only other one with hope is atoml, but from what I’ve read (sorry if I misunderstood the situation), the project needs maintenance help.

h-vetinari · June 8, 2021, 4:16pm

pytomlpp is maintained and was mentioned above, but does rely on C++17 in the underlying C++ implementation.

Note that the discussion regarding preservation of comments has come up there as well. Problem is that the underlying std::map is not order-preserving (unlike python’s dict), and would need some pretty fancy C++ to do that while also enabling heterogeneous lookup. It’s somewhere on a long list of things I’d like to get to at some point (independently of what the stdlib does; I don’t expect cpython to adopt something as recent as C++17 anytime soon).

sdispater · June 8, 2021, 4:50pm

tomlkit is not abandoned in any way, but between Poetry, Pendulum (which is being reworked) and other projects I am spread pretty thin. Tomlkit was born from a need for Poetry to have a style-preserving parser/formatter, and I have updated it for each TOML spec bump. So it’s currently compliant with the latest version of the spec.

And, honestly, saying it sucks while pipenv jumped on it when it was released is ironic at best, cynical at worst.

As far as I know there is no other style-preserving parser out there, in any language, so I had to take it upon myself to build my own from the ground up. And apart from @frostming – Sorry for not getting back to you about the proposition to co-maintain tomlkit, hit me up if you still want to talk about it – no one here who complained about it has stepped up to help with its development.

Is tomlkit perfect? No. Does it have bugs? Yes. Does it fit in the standard library? No, and in my opinion it does not make sense to integrate a full-featured library into the stdlib.

bernatgabor · June 8, 2021, 4:57pm

Considering we’ve managed to standardize the toml without having a toml parser in the stdlib and the fact that the language tries to remove packages from the stdlib, I’d be ok with not having it in the stdlib.

h-vetinari · June 8, 2021, 5:20pm

First off, thanks a lot for your efforts! TBH, there’s such a proliferation of toml-libraries that despite searching for them, I missed tomlkit, even though it apparently does exactly what I’d need (I’ll try it as soon as I can). This proliferation itself is IMO an argument for inclusion in the stdlib, but reasonable people can disagree about that.

Could you explain your thinking on that a bit more? I’d understand if the API still needs to move quickly, but TOML itself looks like it will be a very stable format, so assuming the right abstraction has been found (granted, that might take 2-3 years) - why should general python users not also be enabled to write the format that is becoming more and more ubiquitous, even in the python ecosystem itself?

dholth · June 8, 2021, 6:42pm

The argument probably goes that you might need to read TOML in order to install the TOML-writing library that you prefer. There are a lot of reasons you might want a different writer, but the reader only has to produce a correct dict from a TOML file. sdispater’s library is probably not ready for the permanent feature freeze required by the standard library.
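A minimal sketch of the "reader only has to produce a correct dict" point, assuming Python 3.11 or later, where a read-only parser is available in the standard library as tomllib (on older versions the third-party tomli package exposes the same loads function). The PYPROJECT contents and the build_requirements helper are illustrative assumptions, not taken from any project in this thread.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party package offering the same loads()

# Illustrative pyproject.toml contents (an assumption for this sketch).
PYPROJECT = """
[build-system]
requires = ["flit_core>=3.2"]
build-backend = "flit_core.buildapi"
"""


def build_requirements(pyproject_text):
    """Return the build-system.requires list from pyproject.toml text."""
    data = tomllib.loads(pyproject_text)  # plain dict of str, list, dict, ...
    return data.get("build-system", {}).get("requires", [])


print(build_requirements(PYPROJECT))
```

An installer only needs this read path to discover what to install next; writing TOML never enters the picture.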

Cilyan · July 2, 2021, 3:09am

I very often worked in a constrained environment where only the standard library is available and relying on additional modules is either impossible (e.g. no Internet) or needs some way to be vendored (installation folder is locked and pip cannot pass through the proxy). Possibly this is a niche situation, but I know a lot of companies where such a policy applies. For me, the “batteries included” has been a decision changer in favor of Python compared to other languages/frameworks for development of tools.

For all the tools I developed in these environments, I came to the same trade-offs when selecting how I would store my application’s configuration.

- Conf/INI is human-friendly, and the implementation in Python holds no surprises (the concept for the syntax of conf/ini is widely known and the parser available in Python supports a large variety of separators, quotation, …). However, it falls short when the necessary structure grows in complexity.
- JSON has much better support for structured data, but it doesn’t count as a human-editable format. So it can be used for persistence, where data compatibility makes a binary format less suited (cross-version or cross-tool compatibility, or when you need developers to edit the configuration but users should not).
- Python data, but it’s not easily writable, and you may not want (inexperienced) users to mess with the internals of your program (and for example defeat a nice error catcher you wrote expressly so that users don’t poke you every day because they got a cryptic error).
- XML is both human-readable and provides complex structuring, with comments for self-documentation. However, there is also a trade-off, as the format is very verbose and extremely strict. So the more complex your structure, the less human-readable and human-editable it will be.

I understand that style-preserving de/serialization can be out of scope for the standard library, due to the desire to keep performance high and size or maintenance effort low. That said, compared to the enormous footprint of the XML/eTree support library despite alternatives like lxml, even tomlkit looks thin. Providing style-preserving modification for TOML would bring to Python a new format that could compete against what is currently the only solution, XML, which is slowly showing its age and limits.

IMHO, TOML has its place in the stdlib, reading obviously, writing surely, style-preserving modification… probably. I mean… We have an SMTP server in the stdlib, don’t we?

DayDreamer · July 2, 2021, 9:28am

I’d like to link this article to the discussion:

pf_moore · July 2, 2021, 10:14am

The points made in that article are interesting but ultimately not relevant. Packaging has chosen TOML as a format, and it’s frankly too late to change that easily even if we wanted to. The question for this thread is what TOML parser to use, not whether to use TOML in the first place.

Yura52 · July 9, 2021, 5:04pm

Hi, everyone! I would like to describe a use case where writing TOML files is required and formatting is not an issue. It can serve as an example of how including the most basic (i.e. without guarantees on formatting) writing capabilities can be useful.

Namely, I used TOML files as inputs to my scripts when I was working on the following research project: GitHub - yandex-research/rtdl: The `rtdl` library + The official implementation of the paper "Revisiting Deep Learning Models for Tabular Data". There are more than 1000 TOML files across the repository, and the vast majority of them were generated automatically. The pattern I see here is “generating inputs for other programs”. The formatting was not important to me at all.

Additionally, the “writing without guarantees on formatting” approach is conceptually simple and means the same to everyone: “the content is preserved, the formatting is up to the maintainers of the TOML-writer” (honestly, I did not like the style of the TOML-writer I used at all, but it did its job just fine).
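To make “writing without guarantees on formatting” concrete, here is a rough sketch of such a writer. The dumps helper below is hypothetical and is not the API of any library mentioned in this thread; it assumes the input contains only str, int, float, bool, flat lists of those, and nested dicts. Dates, arrays of tables, and unusual control characters in strings are deliberately out of scope.

```python
import json
import re

_BARE_KEY = re.compile(r"[A-Za-z0-9_-]+")


def _key(name):
    # Bare keys may be written as-is; anything else is quoted.
    return name if _BARE_KEY.fullmatch(name) else json.dumps(name, ensure_ascii=False)


def _value(value):
    if isinstance(value, bool):  # must be checked before int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # JSON string escapes are used here as an approximation of
        # TOML basic-string escapes (an assumption of this sketch).
        return json.dumps(value, ensure_ascii=False)
    if isinstance(value, list):
        return "[" + ", ".join(_value(item) for item in value) + "]"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def dumps(table, _prefix=()):
    """Serialize a nested dict as TOML text; layout is whatever falls out."""
    lines = []
    if _prefix:
        lines.append("[" + ".".join(_key(part) for part in _prefix) + "]")
    # Plain key/value pairs first, sub-tables afterwards.
    for name, value in table.items():
        if not isinstance(value, dict):
            lines.append(f"{_key(name)} = {_value(value)}")
    out = "\n".join(lines) + "\n" if lines else ""
    for name, value in table.items():
        if isinstance(value, dict):
            out += dumps(value, _prefix + (name,))
    return out


config = {"seed": 0, "model": {"name": "mlp", "lr": 0.001, "layers": [64, 64], "bias": True}}
print(dumps(config))
```

The content survives; where keys land and how they are spaced is entirely the writer's choice, which is the contract described above.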

astrojuanlu · December 18, 2021, 7:02pm

Things are getting quite hairy lately on that front:

github.com/hukkin/tomli

tomli violates PEP517 Build Requirements

opened 05:21PM - 17 Dec 21 UTC

closed 02:47AM - 27 Dec 21 UTC

jameshilliard

question

The [PEP517 Build Requirements](https://www.python.org/dev/peps/pep-0517/#build-…requirements) specifically disallow dependency cycles:

> - **Project build requirements will define a directed graph of requirements (project A needs B to build, B needs C and D, etc.) This graph MUST NOT contain cycles.** If (due to lack of co-ordination between projects, for example) a cycle is present, front ends MAY refuse to build the project.
>
> - Where build requirements are available as wheels, front ends SHOULD use these where practical, to avoid deeply nested builds. **However front ends MAY have modes where they do not consider wheels when locating build requirements, and so projects MUST NOT assume that publishing wheels is sufficient to break a requirement cycle**.
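The “directed graph that MUST NOT contain cycles” requirement quoted above can be checked with a depth-first search. This is only a sketch: the find_cycle helper and the project names are hypothetical and do not describe how any real front end implements the check.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of names, or None if acyclic.

    graph maps each project name to the names it needs in order to build.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in graph.get(node, ()):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path: we found a cycle.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical graphs: the first mirrors the "A needs B, B needs C and D"
# wording, the second has a build backend and a parser that need each other.
acyclic = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}
cyclic = {"project": ["backend"], "backend": ["parser"], "parser": ["backend"]}

print(find_cycle(acyclic))
print(find_cycle(cyclic))
```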

Do I understand correctly that having a TOML parser with reading capabilities helps address the packaging bootstrapping issues? And if so, is there a chance to reach a consensus on having read-only TOML parsing in the stdlib?

The discussion in this thread has several branches, but scanning quickly, I recall that:

- @sdispater has stated he doesn’t want tomlkit in the stdlib, so that rules tomlkit out.
- @brettcannon has asked whether a “(probably massive) discussion about the future of the stdlib” should be had before making this decision, but I perceive that TOML parsing is more urgent.
- @bernatgabor is inclined not to include any TOML parser in the stdlib, but it looks like it would indeed be useful.
- Several people have stated that they’d rather this not turn into bikeshedding over the pros/cons of TOML with respect to other formats.
- While writing capabilities seem desirable for some use cases, it is not clear that these use cases warrant adding such complexity to the stdlib.
- @hukkinj1 has stated that tomli is “spec 1.0.0 compliant and has 100% test coverage”

Thoughts on adopting tomli for the stdlib?

hukkinj1 · December 18, 2021, 7:22pm

My thoughts are here.

My understanding is however that this discussion is about PyPA recommending a parser so stdlib inclusion may be offtopic?

Note that there’s already an ongoing discussion about stdlib inclusion of a TOML parser here Issue 40059: Provide a toml module in the standard library - Python tracker

And a general discussion about stdlib inclusions/removals Standardizing how to handle adding/removing modules from the stdlib · Issue #92 · python/steering-council · GitHub

pitrou · December 19, 2021, 12:55am

While everyone can have reservations about TOML as a format (I find it utterly useless and misguided myself), if a TOML reader is needed in the stdlib for packaging sanity, and since the tomli author seems to agree with putting it in the stdlib, then why isn’t it happening already? Surely practicality can beat purity here and spare us lengthy discussions about writers, style preservation and whatnot.

brettcannon · December 20, 2021, 8:04pm

Because adding something to the stdlib that is less than a year old and exists outside the stdlib by a non-core dev is always a big discussion, especially when there’s a pre-existing toml package which I suspect people would want to use as the name in the stdlib.

bernatgabor · December 21, 2021, 9:55am

This part is a bit of a problem, but I don’t think it is a deal-breaker. The public package can evolve as tomli, and the vendoring into the stdlib can transform it into toml. This is similar to how importlib_metadata is the 3rd party API but importlib.metadata is the stdlib API. A bigger problem would be backwards compatibility: unless the tomli package’s API matches the toml API, shipping a new module might break some applications when the import resolves from the standard library rather than the 3rd party package.
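For reference, the importlib_metadata / importlib.metadata split is usually consumed with a fallback import like the sketch below. The installed_version helper is a hypothetical name used only for illustration.

```python
try:
    from importlib import metadata  # stdlib API, available since Python 3.8
except ImportError:
    import importlib_metadata as metadata  # 3rd party package with the same API


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("pip"))
```

This works because the two modules have different import names, so neither can shadow the other. Reusing the name toml for a stdlib module with a different API would not have that property.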

I think the only contentious point here compared to importlib.metadata is that that package is maintained by a core developer, whereas this one is not. That could be solved by accepting the maintainer (@hukkinj1) as a core developer, which would make sense as he would maintain part of the standard library. Alternatively, a solution could be to convince an existing core developer to become a co-maintainer of the tomli package, if @hukkinj1 agrees.

hukkinj1 · December 21, 2021, 11:38am

A bigger problem would be backwards compatibility: unless the tomli package’s API matches the toml API, shipping a new module might break some applications when the import resolves from the standard library rather than the 3rd party package.

Yep, this is a problem. The APIs are very similar, but there are a few differences where I’m unfortunately not interested at all in matching the toml API, and I do think it would be a mistake to add the toml API to the standard library.

I’ll try to list the key differences and reasons why toml API is not always great.

- toml.load takes as input one of the following types: a text file object, pathlib.Path, a list of pathlib.Paths, or a string (representing a file path).

  In contrast, tomli.load only takes binary file objects as input.

  Accepting the various data types that toml does is a problem because:
  - it is unlike the behavior of any other load function in the standard library
  - accepting many types makes for code that is hard to read. My first thought when I see toml.load("path_to/conf_file.toml") is always “that must be a TypeError, one should open the file first”
  - accepting list[pathlib.Path] is just needless IMO, and whatever problem it solves should be trivial to solve by the consumer of the library
  - accepting a text file object (instead of a binary file object) is the easiest footgun ever, because correctly parsing TOML requires setting arguments as follows: open(path, encoding="utf8", newline=""). Omitting one of these two arguments or using other values runs the risk of incorrect parse results. TOML, specifying file encoding and valid newline sequences among other things, is simply a lot stricter format than what a text file object represents.
- toml.load and toml.loads accept a _dict keyword argument for parsing TOML tables to mapping types other than dict. In contrast, tomli has no such keyword argument.

  It’s not exactly clear what the value of using a type other than dict here would be, but this sure seems like an easy way to introduce bugs, and also to load objects that raise TypeError when dumped.
- toml.load and toml.loads accept a decoder keyword argument for customizing decoding. The decoder must implement the toml.TomlDecoder interface.

  tomli doesn’t have any of this.

  It seems this is mostly useful for comment preservation, which I don’t want a poor implementation of. Also, the toml.TomlDecoder interface / base class with its 9 public methods seems a bit messy, not something I’d want to recreate or support.
- toml uses and exposes custom toml.tz.TomlTz timezone objects. In contrast, tomli uses datetime.timezone objects from the standard library.
- toml raises TomlDecodeErrors while tomli raises TOMLDecodeErrors. The casing that toml uses conflicts with PEP8 and standard library conventions.
- toml includes the whole encode/dump API while tomli does not. This is probably the most breaking difference out of all of these.
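A few of the tomli-side behaviors listed above can be seen in a short sketch. It assumes Python 3.11 or later, where the same API is available as tomllib, or tomli 2.x as a fallback on older versions; the helper names and the sample document are illustrative only.

```python
import io
from datetime import timezone

try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumed to be tomli 2.x, same API

DOC = b"released = 1979-05-27T07:32:00-08:00\n"


def load_binary():
    # load() expects a binary file object; the parser does the decoding.
    return tomllib.load(io.BytesIO(DOC))


def load_text_fails():
    # A text file object is rejected instead of being silently accepted.
    try:
        tomllib.load(io.StringIO(DOC.decode("utf-8")))
    except TypeError:
        return True
    return False


def parse_or_none(text):
    # Errors use the TOMLDecodeError spelling, a subclass of ValueError.
    try:
        return tomllib.loads(text)
    except tomllib.TOMLDecodeError:
        return None


released = load_binary()["released"]
print(type(released.tzinfo))        # offsets become datetime.timezone objects
print(load_text_fails())
print(parse_or_none("key = "))      # invalid TOML: the value is missing
```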

So yeah, there are actually quite a few differences considering how small the APIs are. Not sure I even have everything listed here.

In conclusion, if it’s required to match toml API perfectly, then I think I prefer to not add tomli to the standard library.

I think the only contentious point here compared to importlib.metadata is that that package is maintained by a core developer, whereas this one is not. That could be solved by accepting the maintainer (@hukkinj1) as a core developer, which would make sense as he would maintain part of the standard library. Alternatively, a solution could be to convince an existing core developer to become a co-maintainer of the tomli package, if @hukkinj1 agrees.

I don’t really have a problem with either (or both) of these approaches.

bernatgabor · December 21, 2021, 12:08pm

I mostly agree with the design choices.

Based on this, paths ahead:

1. Adopt the toml library, as it guarantees backwards compatibility.
2. Adopt the tomli library as is and use the tomli namespace.
3. Adopt the tomli library under the toml name and put it under some namespace (similar to how importlib.metadata is under importlib). There isn’t any good place I see at the moment though.

So I’m personally in favour of 2. Either way, this would likely require a PEP. @hukkinj1, I’m open to writing that up if you co-sign it. Probably the least controversial path would be to find a core developer who is willing to co-maintain the library and also sign off on it. Perhaps @pf_moore might be willing to help out here.

tiran · December 21, 2021, 12:39pm

We cannot do (3). It would break all software that uses the toml package.

We can always use a different name for a TOML parser module in the stdlib, e.g. tomllib (like pathlib or contextlib) or tomlparser (like configparser). This would prevent any conflict with upstream projects.