Adopting/recommending a toml parser?

While everyone can have reservations about TOML as a format (I find it utterly useless and misguided myself), if a TOML reader is needed in the stdlib for packaging sanity, and since the tomli author seems to agree with putting it in the stdlib, then why isn’t it happening already? Surely practicality can beat purity here and spare us lengthy discussions about writers, style preservation and whatnot.

8 Likes

Because adding something to the stdlib that is less than a year old and is maintained outside the stdlib by a non-core dev is always a big discussion, especially when there’s a pre-existing toml package whose name I suspect people would want to use in the stdlib.

3 Likes

This part is a bit of a problem, but I don’t think it is a deal-breaker. The public package can evolve as tomli, and the vendoring into the stdlib can transform it into toml, similar to how importlib_metadata is the 3rd party API but importlib.metadata is the stdlib API. A bigger problem would be backwards compatibility: unless the tomli package’s API matches the toml API, shipping a new module might break some applications when the import resolves from the standard library rather than the 3rd party package.
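For readers unfamiliar with the importlib_metadata / importlib.metadata arrangement mentioned above, here is a minimal sketch of how consumers typically bridge the two. The helper name `installed_version` is hypothetical, and the fallback branch assumes the importlib_metadata backport is installed on older interpreters.

```python
try:
    # Python 3.8+: the copy that ships in the standard library.
    from importlib import metadata
except ImportError:
    # Older interpreters: the PyPI backport (assumed to be installed).
    import importlib_metadata as metadata


def installed_version(name):
    """Hypothetical helper: version string of a distribution, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("pip"))
```

Because the two import names differ, both copies can coexist without shadowing each other, which is the property a same-named stdlib toml module would not have.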

I think the only contentious point here compared to importlib.metadata is that that package is maintained by a core developer, whereas this one is not. That could be solved by accepting the maintainer (@hukkinj1) as a core developer, which would make sense as he would maintain part of the standard library. Alternatively, another solution could be to convince an existing core developer to become a co-maintainer of the tomli package, if @hukkinj1 agrees.

2 Likes

A bigger problem would be backwards compatibility: unless the tomli package’s API matches the toml API, shipping a new module might break some applications when the import resolves from the standard library rather than the 3rd party package.

Yep, this is a problem. The APIs are very similar, but there are a few differences where I’m unfortunately not interested at all in matching the toml API, and I do think it would be a mistake to add the toml API to the standard library.

I’ll try to list the key differences and the reasons why the toml API is not always great.

  1. toml.load takes as input one of the following types: a text file object, pathlib.Path, a list of pathlib.Paths, or string (representing filepath).

    In contrast tomli.load only takes binary file objects as input.

    Accepting the various data types that toml does is a problem because:

    • it is unlike the behavior of any other load function in the standard library
    • accepting many types makes for code that is hard to read. My first thought when I see toml.load("path_to/conf_file.toml") is always “that must be a TypeError, one should open the file first”
    • accepting list[pathlib.Path] is just needless IMO, and whatever problem it solves should be trivial to solve by the consumer of the library
    • accepting a text file object (instead of a binary file object) is the easiest footgun ever, because correctly parsing TOML requires setting arguments as follows: open(path, encoding="utf8", newline=""). Omitting one of these two arguments or using other values runs the risk of incorrect parse results. TOML, which specifies file encoding and valid newline sequences among other things, is simply a much stricter format than what a text file object represents.
  2. toml.load and toml.loads accept a _dict keyword argument for parsing TOML tables to mapping types other than dict. In contrast, tomli has no such keyword argument.

    It’s not exactly clear what the value of using a type other than dict here would be, but this sure seems like an easy way to introduce bugs, and also to load objects that raise TypeError when dumped.

  3. toml.load and toml.loads accept a decoder keyword argument for customizing decoding. The decoder must implement toml.TomlDecoder interface.

    tomli doesn’t have any of this.

    It seems this is mostly useful for comment preservation, which I don’t want a poor implementation of. Also, the toml.TomlDecoder interface / base class with its 9 public methods seems a bit messy, not something I’d want to recreate or support.

  4. toml uses and exposes custom toml.tz.TomlTz timezone objects. In contrast, tomli uses datetime.timezone objects from the standard library.

  5. toml raises TomlDecodeError while tomli raises TOMLDecodeError. The casing that toml uses conflicts with PEP8 and standard library conventions.

  6. toml includes the whole encode/dump API while tomli does not. This is probably the most breaking difference out of all of these.
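To make the text-file footgun from item 1 concrete, here is a small sketch using only the standard library. It assumes a file containing a bare carriage return, which TOML does not accept as a newline, and shows how the default text mode silently rewrites it before any parser sees it.

```python
import os
import tempfile

# A bare "\r" is not a valid TOML newline; "\r\n" and "\n" are.
raw = b"a = 1\rb = 2\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "conf.toml")
    with open(path, "wb") as f:
        f.write(raw)

    # Default text mode: universal newlines translate "\r" and "\r\n" to "\n".
    with open(path, encoding="utf-8") as f:
        translated = f.read()

    # newline="" keeps the original line endings untouched.
    with open(path, encoding="utf-8", newline="") as f:
        untranslated = f.read()

    # Binary mode leaves both decoding and newline handling to the parser.
    with open(path, "rb") as f:
        binary = f.read()

print(repr(translated))
print(repr(untranslated))
print(repr(binary))
```

With the default settings the invalid bare carriage return has already become a valid line feed, so a parser fed that text has no way to reject the document.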

So yeah, there are actually quite a few differences considering how small the APIs are. :grinning_face_with_smiling_eyes: I’m not sure I even have everything listed here.
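As a rough illustration of the tomli side of these differences, the sketch below loads from a binary file object, inspects the timezone type and catches the decode error. It assumes tomli is installed; the fallback import is an assumption for interpreters (Python 3.11+) whose standard library ships tomllib with the same load/loads functions. The sample document is made up.

```python
import datetime
import io

try:
    import tomli  # the third-party package discussed in this thread
except ModuleNotFoundError:
    # Assumption: Python 3.11+, where tomllib offers the same load/loads API.
    import tomllib as tomli

# Made-up sample document, as the bytes a binary file object would yield.
doc = b'title = "example"\nwhen = 2021-07-01T12:00:00+02:00\n'

# load() takes a binary file object; io.BytesIO stands in for open(path, "rb").
config = tomli.load(io.BytesIO(doc))

# Offsets come back as plain datetime.timezone objects from the stdlib.
tz = config["when"].tzinfo

# Invalid input raises TOMLDecodeError, a ValueError subclass.
try:
    tomli.loads("not valid toml")
except tomli.TOMLDecodeError as exc:
    error = exc

print(config["title"], type(tz).__name__, type(error).__name__)
```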

In conclusion, if it’s required to match the toml API perfectly, then I think I’d prefer not to add tomli to the standard library.

I think the only contentious point here compared to importlib.metadata is that that package is maintained by a core developer, whereas this one is not. That could be solved by accepting the maintainer (@hukkinj1) as a core developer, which would make sense as he would maintain part of the standard library. Alternatively, another solution could be to convince an existing core developer to become a co-maintainer of the tomli package, if @hukkinj1 agrees.

I don’t really have a problem with either (or both) of these approaches.

4 Likes

I mostly agree with the design choices.

Based on this, paths ahead:

  1. Adopt the toml library, as it guarantees backwards compatibility.
  2. Adopt the tomli library as is and use the tomli namespace.
  3. Adopt the tomli library under the toml name, but place it under some namespace (similar to how importlib.metadata is under importlib). I don’t see any good place for it at the moment, though.

So I’m personally in favour of 2. Either way, this would likely require a PEP. @hukkinj1, I’m open to writing that up if you co-sign it. Probably the least controversial path would be to find a core developer who’s willing to co-maintain the library and also sign off on it. Perhaps @pf_moore might be willing to help out here. :blush:

2 Likes

We cannot do (3). It would break all software that uses the toml package.

We can always use a different name for a TOML parser module in the stdlib, e.g. tomllib (like pathlib or contextlib) or tomlparser (like configparser). This would prevent any conflict with upstream projects.

3 Likes

Just to clarify, I was under the impression that the proposal of adopting ‘tomli’ also included ‘tomli_w’ to write files. Is that the case?

(In the past I heard people in the packaging community talking about backends writing TOML files, e.g. for core metadata or refined versions of pyproject.toml with dynamic fields resolved. I believe that including a writer brings substantial benefit.)

1 Like

Aside: I’d guess the motivation was to allow preserving order (by using collections.OrderedDict), before regular dicts preserved order (implemented in 3.6, documented as part of the language for 3.7). The json module has a similar parameter. I agree that tomli shouldn’t need this.
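A quick sketch of the point about ordering, using the json module’s object_pairs_hook parameter (presumably the similar parameter meant here). On Python 3.7+ a plain dict already keeps the document order, so the custom mapping type adds little for that purpose.

```python
import collections
import json

text = '{"b": 1, "a": 2, "c": 3}'

# Plain dicts preserve insertion order (a language guarantee since 3.7).
plain = json.loads(text)

# The older way to keep order: ask the decoder for an OrderedDict.
ordered = json.loads(text, object_pairs_hook=collections.OrderedDict)

print(list(plain))
print(list(ordered))
```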

1 Like

I think this is still an open question. I’m inclined to leave writing out at first, because it involves a whole lot more design choices, and thus more discussion. Writing TOML files isn’t (currently) fundamental to packaging like reading them is, so it’s easier to use a TOML writer library as a regular dependency installed from PyPI.

I might avoid tomlparser as a name even so, because I imagine we might want to add the write part later on. But configparser can write INI files, so it wouldn’t be a big problem if tomlparser also wrote TOML.

1 Like

It wouldn’t be.

I don’t think it would break all toml users, but it is a reason to not go with that name in the stdlib.

Not necessarily.

At this point I think a PEP is necessary. I can sponsor it, but I don’t have the bandwidth to be an author on it. I would review this thread and the issue on bugs.python.org for potential concerns to address in the PEP. And to be clear, this would be a standards-track PEP and not a packaging PEP, so that means it will be going to python-dev and the SC.

1 Like

FWIW, I think it would make sense to make changes to toml on PyPI (potentially even some backwards incompatible ones), to get it to a point where including it in the standard library would be sensible (getting it to parse + dump TOML 1.0.0, have it better match the json API and so on). It could certainly be disruptive and backwards incompatible, but the work for that can be undertaken before considering adding it to 3.11 or 3.12.

I’ve not had the time to do this and it would effectively be the same idea as @bernatgabor’s case (3) with all the disruptive work happening outside of the Python standard library; and prior to considering addition into the standard library.

I think the end state here is a much better one though: There’s a single toml package (likely based off of tomli’s current implementation) that evolves to a stage where including it in the Python standard library is a straightforward thing to do.

PS: This is obviously contingent on getting the current author for the toml package on PyPI on board for doing this.

5 Likes

Hmm firstly, I feel like people are systematically misunderstanding @bernatgabor’s case 3. :grinning_face_with_smiling_eyes:
I don’t think they intended to break/steal the import toml namespace, but rather to use a name like import parser.toml (or from parser import toml if you prefer).

Great. If there is something you disagree with, I’d be happy to hear it (but it’s perhaps better to take it to tomli’s issue tracker).

I’d be happy to help and co-sign!

Would it make sense to name squat tomllib and tomlparser just in case we end up wanting to use one of them?

I agree 100%. The case for TOML parsing is pretty easy to justify solely based on the fact that it fixes packaging/bootstrapping circular dependency madness. The case for writing is not nearly as clear.

Yeah this would be great and definitely have the nicest end state. I’m curious, how much would you be willing to break uiri/toml? For instance, would you remove write capability? If not then we end up having the debate whether writing belongs in the standard library etc…

1 Like

I’d be willing to support this proposal, but I’m cautious about offering to co-maintain, as I’m likely to be pretty busy over the next few months, so I don’t want to commit to too much. Longer term, maybe.

I’ve looked at tomli, I agree with its minimalist+strict philosophy, and I believe I’d be able to maintain it.
@hukkinj1, if you want to do the heavy lifting of integrating tomli into the stdlib and maintaining it there (probably along with a backport on PyPI, à la importlib_resources), I can co-maintain (i.e. advise, merge your PRs, and take over in the worst-case scenario of you disappearing).

One thing that worries me is how future versions of TOML will be supported. There’s precedent in e.g. json and pickle, but it’ll need to be in the PEP, so everyone can agree on it.

Yes. I’d pick one, and when there’s a PEP draft, ask @dustin to reserve it. (Please don’t squat by uploading a placeholder.)

That has a major disadvantage: it would break anyone using a pinned version of toml.

2 Likes

Has there ever been any discussion about adding a module named formats or something similar to the stdlib, that could be used for these kinds of encoding library promotions? It would make backward compatibility easier since fewer new package names would need to be “taken over” in the future, and allow for simple format-to-module-name conventions. So for example, if a YAML parser were added later, it could simply live in formats.yaml, next to formats.toml. I guess encodings is the most similar name that could be reused.

I guess the downsides would be “Flat is better than nested”, although “Readability counts”, and it would make it easier for a new user to check for a list of available serialisation formats. And there are a gazillion other parsers for other formats (.json, .zip, .eml, .csv, .html, .cfg…), would helpers for them be added as well?

Guess I’ve answered my own question, but I just wanted to mention the idea if it is of any help to anyone. :slightly_smiling_face:

2 Likes

Unfortunately, formats is already taken:

It’s unclear whether this package is actively maintained or has ever been widely used.

-Fred

1 Like

Great. Yeah I can do stdlib integration, maintenance, and backport. I probably won’t start until we’ve drafted a PEP.

There’s some work already done in this Tomli issue. I’ve tagged you there!

2 Likes

Just for clarity, to ensure everyone is aware: uiri/toml has been completely unmaintained for well over a year now, but it seems that @pradyunsg may be able to get the name transferred. As a side note, @hukkinj1, if this will be an officially blessed and supported project, it might be a good idea to learn from that experience, minimize the bus factor and follow responsible practices by ensuring it lives in a GitHub/GitLab/etc. org and has multiple maintainers on PyPI, to avoid a single point of failure, which was indeed the case with uiri/toml. Though that might only be relevant for the backport/upstream, if there is no longer an independent, maintained hukkin/tomli repo and tomli PyPI project at all (not sure what your plans are there).

I don’t think this is relevant, personally, because I don’t see how we’ll avoid breaking at least some people if we start having a stdlib module and a 3rd party library under the same name.

3 Likes

If people don’t specify their dependencies appropriately then you’re right that some people will break. But if people use python_version markers and such appropriately it works fine (or just choose to always use the PyPI version thanks to the stdlib being later on sys.path).

1 Like