Adopting/recommending a toml parser?

Maybe bootstrapping packaging stuff is not as difficult as it used to be.

Adopting a toml library means pypa maintains one.

Putting it in the standard library is something else.

1 Like

IMO, it’d be very hard to make a full-featured solution that won’t suck for anyone.

This journey?

  • I don’t need to write TOML at all.
  • Just write a TOML file so other tools can read it; I don’t care about formatting.
  • …
  • …
  • Also, numeric values under the download_count key should use thousands separators by default, but only if they’re greater than 10_000.
  • …

A library that would solve the last step would be so complex that it would no longer be a good fit for users at the beginning of the journey.

I think that rather than making sure the stdlib version can be extended, we need to encourage external libraries for formatting TOML. Just like attrs exists to fill the gaps in dataclasses.

3 Likes

To clarify, I wasn’t referring to the implementation, but the projects. All the projects mentioned here except tomli are either abandonware or well on their way to becoming abandonware. The only other one with hope is atoml, but from what I’ve read (sorry if I misunderstood the situation), the project needs maintenance help.

1 Like

pytomlpp is maintained and was mentioned above, but does rely on C++17 in the underlying C++ implementation.

Note that the discussion regarding preservation of comments has come up there as well. The problem is that the underlying std::map is not order-preserving (unlike Python’s dict), and it would need some pretty fancy C++ to do that while also enabling heterogeneous lookup. It’s somewhere on a long list of things I’d like to get to at some point (independently of what the stdlib does; I don’t expect CPython to adopt something as recent as C++17 anytime soon).

1 Like

tomlkit is not abandoned in any way, but between Poetry, Pendulum (which is being reworked) and other projects I am spread pretty thin. Tomlkit was born from a need for Poetry to have a style-preserving parser/formatter, and I have updated it for each TOML spec bump. So it’s currently compliant with the latest version of the spec.

And, honestly, saying it sucks while pipenv jumped on it when it was released is ironic at best, cynical at worst.

As far as I know there is no other style-preserving parser out there, in any language, so I had to take it upon myself to build my own from the ground up. And apart from @frostming (sorry for not getting back to you about the proposition to co-maintain tomlkit, hit me up if you still want to talk about it), no one here who complained about it has stepped up to help with its development.

Is tomlkit perfect? No. Does it have bugs? Yes. Does it fit in the standard library? No, and in my opinion it does not make sense to integrate a full-featured library into the stdlib.

9 Likes

Considering we’ve managed to standardize the toml without having a toml parser in the stdlib and the fact that the language tries to remove packages from the stdlib, I’d be ok with not having it in the stdlib.

3 Likes

First off, thanks a lot for your efforts! TBH, there’s such a proliferation of toml-libraries that despite searching for them, I missed tomlkit, even though it apparently does exactly what I’d need (I’ll try it as soon as I can). This proliferation itself is IMO an argument for inclusion in the stdlib, but reasonable people can disagree about that.

Could you explain your thinking on that a bit more? I’d understand if the API still needs to move quickly, but TOML itself looks like it will be a very stable format, so assuming the right abstraction has been found (granted, that might take 2-3 years), why should general Python users not also be enabled to write the format that is becoming more and more ubiquitous, even in the Python ecosystem itself?

1 Like

The argument probably goes that you might need to read toml to install the toml-writing library that you prefer. There are a lot of reasons you might want a different writer, but the reader only has to produce a correct dict from a toml file. sdispater’s library is probably not ready for the permanent feature freeze required by the standard library.

1 Like

I have very often worked in constrained environments where only the standard library is available and relying on additional modules is either impossible (e.g. no Internet) or requires vendoring them somehow (installation folder is locked and pip cannot pass through the proxy). Possibly this is a niche situation, but I know a lot of companies where such a policy applies. For me, “batteries included” has been a deciding factor in favor of Python compared to other languages/frameworks for tool development.

For all the tools I developed in these environments, I came to the same trade-offs when selecting how I would store my application’s configuration.

  • Conf/INI is human-friendly, and the Python implementation holds no surprises (the conf/INI syntax is widely known, and the parser available in Python supports a wide variety of separators, quoting styles, …). However, it falls short when the necessary structure grows in complexity.
  • JSON has much better support for structured data, but it doesn’t count as a human-editable format. So it can be used for persistence where data compatibility makes a binary format less suitable (cross-version or cross-tool compatibility, or when you need developers to edit the configuration but users should not).
  • Python data works, but it’s not easily writable, and you may not want (inexperienced) users messing with the internals of your program (and, for example, defeating a nice error catcher you wrote expressly so that users don’t poke you every day because they got a cryptic error).
  • XML is both human-readable and capable of complex structuring, with comments for self-documentation. However, there is a trade-off here too, as the format is very verbose and extremely strict. So the more complex your structure, the less human-readable and human-editable it will be.
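
A rough illustration of the first two trade-offs, using only the standard library. The sample settings (`server`, `host`, `port`, `ports`) are made up for this sketch and are not taken from any real tool:

```python
import configparser
import json

# INI: only two levels (section -> key), and every value is read as a string.
ini_text = """
[server]
host = localhost
port = 8080
"""
parser = configparser.ConfigParser()
parser.read_string(ini_text)
port_as_text = parser["server"]["port"]        # a str, not an int
port_as_int = parser.getint("server", "port")  # the caller has to ask for the type

# JSON: nesting and basic types are native, but there are no comments
# and the syntax is unforgiving for hand editing.
json_text = '{"server": {"host": "localhost", "ports": [8080, 8081]}}'
config = json.loads(json_text)
second_port = config["server"]["ports"][1]
```

Anything deeper than section/key in the INI version would need an ad hoc convention on top of configparser, which is where the format starts to fall short.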

I understand that style-preserving de/serialization can be out of scope for the standard library, due to the desire to keep performance high and size or maintenance effort low. That said, compared to the enormous footprint of the XML/eTree support library despite alternatives like lxml, even tomlkit looks thin. Providing style-preserving modification for TOML would bring to Python a new format that could compete against what is currently the only solution, XML, which is slowly showing its age and limits.

IMHO, TOML has its place in the stdlib, reading obviously, writing surely, style-preserving modification… probably. I mean… We have an SMTP server in the stdlib, don’t we? :wink:

3 Likes

I’d like to link this article to the discussion:

1 Like

The points made in that article are interesting but ultimately not relevant. Packaging has chosen TOML as a format, and it’s frankly too late to change that easily even if we wanted to. The question for this thread is what TOML parser to use, not whether to use TOML in the first place.

5 Likes

Hi, everyone! I would like to describe a use case where writing TOML files is required and formatting is not an issue. It can serve as an example of how including the most basic writing capabilities (i.e. without guarantees on formatting) can be useful.

Namely, I used TOML files as inputs to my scripts when I was working on the following research project: GitHub - yandex-research/rtdl: The `rtdl` library + The official implementation of the paper "Revisiting Deep Learning Models for Tabular Data". There are >1000 TOML files across the repository and the vast majority of them were generated automatically. The pattern I see here is “generating inputs for other programs”. The formatting was not important to me at all.

Additionally, the “writing without guarantees on formatting” approach is conceptually simple and means the same to everyone: “the content is preserved, the formatting is up to maintainers of the TOML-writer” (honestly, I did not like the style of the TOML-writer I used at all, but it did its job just fine).

2 Likes

Things are getting quite hairy lately on that front:

Do I understand correctly that having a TOML parser with reading capabilities helps address the packaging bootstrapping issues? And if so, is there a chance to reach a consensus on having read-only TOML parsing in the stdlib?

The discussion in this thread has several branches, but scanning quickly, I recall that:

  • @sdispater has stated he doesn’t want tomlkit in the stdlib, so that rules tomlkit out.
  • @brettcannon has asked whether a “(probably massive) discussion about the future of the stdlib” should be had before making this decision, but I perceive that TOML parsing is more urgent.
  • @bernatgabor is inclined not to include any TOML parser in the stdlib, but it looks like it would indeed be useful.
  • Several people have stated that they’d rather this not bikeshed into the pros/cons of TOML with respect to other formats.
  • While writing capabilities seem desirable for some use cases, it is not clear that these use cases warrant adding such complexity to the stdlib.
  • @hukkinj1 has stated that tomli is “spec 1.0.0 compliant and has 100% test coverage”

Thoughts on adopting tomli for the stdlib?

2 Likes

My thoughts are here.

My understanding is however that this discussion is about PyPA recommending a parser so stdlib inclusion may be offtopic?

Note that thereā€™s already an ongoing discussion about stdlib inclusion of a TOML parser here Issue 40059: Provide a toml module in the standard library - Python tracker

And a general discussion about stdlib inclusions/removals Standardizing how to handle adding/removing modules from the stdlib · Issue #92 · python/steering-council · GitHub

1 Like

While everyone can have reservations about TOML as a format (I find it utterly useless and misguided myself), if a TOML reader is needed in the stdlib for packaging sanity, and since the tomli author seems to agree with putting it in the stdlib, then why isn’t it happening already? Surely practicality can beat purity here and spare us lengthy discussions about writers, style preservation and whatnot.

8 Likes

Because adding something to the stdlib that is less than a year old and is maintained outside the stdlib by a non-core dev is always a big discussion, especially when there’s a pre-existing toml package whose name I suspect people would want to use in the stdlib.

3 Likes

This part is a bit of a problem, but I don’t think it is a deal-breaker. The public package can evolve as tomli, and the vendoring into the stdlib can transform it into toml. Similar to how importlib_metadata is the 3rd party API but importlib.metadata is the stdlib API. A bigger problem would be backwards compatibility: unless the tomli package’s API matches the toml API, shipping a new module might break some applications when the import resolves from the standard library rather than the 3rd party package.
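
For reference, consumers usually handle the importlib_metadata / importlib.metadata split with a fallback import, roughly like this (a sketch; the alias `metadata` is just a local name chosen here):

```python
try:
    # Python 3.8+: the copy that ships with the interpreter.
    from importlib import metadata
except ImportError:
    # Older interpreters: the third-party backport installed from PyPI.
    import importlib_metadata as metadata

# Either way the caller uses one name, e.g. metadata.version("some-dist").
has_version_lookup = callable(metadata.version)
```

This only works cleanly because the two import names never collide. A stdlib module named toml would not have that property: an existing `import toml` would start resolving to a module with a different API.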

I think the only contentious point here compared to importlib.metadata is that that package is maintained by a core developer, whereas this one is not. That could be solved by accepting the maintainer (@hukkinj1) as a core developer, which would make sense as he would maintain part of the standard library. Alternatively, a solution could be to convince an existing core developer to become a co-maintainer of the tomli package, if @hukkinj1 agrees.

2 Likes

A bigger problem would be backwards compatibility: unless tomli packages API matches the toml API shipping a new module might break some applications when the import resolves from the standard library rather than the 3rd party package.

Yep, this is a problem. The APIs are very similar, but there are a few differences where I’m unfortunately not interested at all in matching the toml API, and I do think it would be a mistake to add the toml API to the standard library.

I’ll try to list the key differences and reasons why the toml API is not always great.

  1. toml.load takes as input one of the following types: a text file object, pathlib.Path, a list of pathlib.Paths, or string (representing filepath).

    In contrast tomli.load only takes binary file objects as input.

    Accepting the various data types that toml does is a problem because:

    • it is unlike the behavior of any other load function in the standard library
    • Accepting many types makes for code that is hard to read. My first thought when I see toml.load("path_to/conf_file.toml") is always “that must be a TypeError, one should open the file first”
    • accepting list[pathlib.Path] is just needless IMO, and whatever problem it solves should be trivial to solve by the consumer of the library
    • accepting a text file object (instead of a binary file object) is the easiest footgun ever, because correctly parsing TOML requires setting arguments as follows: open(path, encoding="utf8", newline=""). Omitting one of these two arguments or using other values runs the risk of incorrect parse results. TOML, which specifies the file encoding and the valid newline sequences among other things, is simply a much stricter format than what a text file object represents.
  2. toml.load and toml.loads accept a _dict keyword argument for parsing TOML tables to mapping types other than dict. In contrast, tomli has no such keyword argument.

    It’s not exactly clear what the value of using a type other than dict here would be, but this sure seems like an easy way to introduce bugs, and also to load objects that raise TypeError when dumped.

  3. toml.load and toml.loads accept a decoder keyword argument for customizing decoding. The decoder must implement toml.TomlDecoder interface.

    tomli doesn’t have any of this.

    It seems this is mostly useful for comment preservation, which I don’t want a poor implementation of. Also, the toml.TomlDecoder interface / base class with its 9 public methods seems a bit messy, not something I’d want to recreate or support.

  4. toml uses and exposes custom toml.tz.TomlTz timezone objects. In contrast tomli uses datetime.timezones from the standard library.

  5. toml raises TomlDecodeErrors while tomli raises TOMLDecodeErrors. The casing that toml uses conflicts with PEP8 and standard library conventions.

  6. toml includes the whole encode/dump API while tomli does not. This is probably the most breaking difference out of all of these.
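
The text-file footgun from point 1 can be reproduced without any TOML library. A minimal sketch, assuming a made-up file `conf.toml` that contains a bare carriage return (which the TOML spec does not accept as a newline):

```python
import os
import tempfile

# A bare "\r" separates the two pairs; TOML only allows "\n" or "\r\n".
raw = b"a = 1\rb = 2\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "conf.toml")
    with open(path, "wb") as f:
        f.write(raw)

    # Default text mode: universal newlines rewrite the lone "\r" as "\n".
    with open(path, encoding="utf-8") as f:
        translated = f.read()

    # newline="" turns the translation off, so the text matches the bytes.
    with open(path, encoding="utf-8", newline="") as f:
        untranslated = f.read()

    # Binary mode: what a bytes-only load() gets to see.
    with open(path, "rb") as f:
        binary = f.read()
```

With the default `newline=None`, a parser reading `translated` never sees the stray `\r`, so an invalid document can look like two perfectly valid lines. Requiring a binary file object takes that choice away from the caller.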

So yeah, there are actually quite a few differences considering how small the APIs are. :grinning_face_with_smiling_eyes: Not sure I even have everything listed here.

In conclusion, if it’s required to match the toml API perfectly, then I think I’d prefer not to add tomli to the standard library.

I think the only contentious point here compared to importlib.metadata is that that package is maintained by a core developer, where this is not. Something that could be solved by accepting the maintainer ( @hukkinj1) as a core developer; which would make sense as he would maintain part of the standard library. Alternatively could also be a solution to convince an existing core developer to become a co-maintainer for the tomli package, if @hukkinj1 agrees.

I don’t really have a problem with either (or both) of these approaches.

4 Likes

I mostly agree with the design choices.

Based on this, paths ahead:

  1. Adopt the toml library, as that guarantees backwards compatibility.
  2. Adopt the tomli library as is and use the tomli namespace.
  3. Adopt the tomli library under the toml name and put it under some namespace (similar to how importlib.metadata is under importlib). I don’t see any good place for it at the moment, though.

So I’m personally in favour of 2. Either way, this would likely require a PEP. @hukkinj1, I’m open to writing that up if you co-sign it. Probably the least controversial path would be to find a core developer who’s willing to co-maintain the library and also sign off on it. Perhaps @pf_moore might be willing to help out here. :blush:

2 Likes

We cannot do (3). It would break all software that uses the toml package.

We can always use a different name for a toml parser in the stdlib module, e.g. tomllib (like pathlib or contextlib) or tomlparser (like configparser). This would prevent any conflict with upstream projects.
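
With a distinct name, consumers could pick whichever reader is available without any ambiguity. A sketch, assuming the stdlib module ends up being called `tomllib` as suggested above; the helper `find_toml_reader` is hypothetical:

```python
import importlib


def find_toml_reader(candidates=("tomllib", "tomli")):
    """Return the first importable module from `candidates`, or None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


reader = find_toml_reader()
if reader is not None:
    # Both candidates are assumed to expose loads() for parsing a str.
    parsed = reader.loads('name = "demo"')
else:
    # Neither module is installed in this environment.
    parsed = None
```

Because neither candidate name is `toml`, code that already does `import toml` keeps getting the existing third-party package.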

3 Likes