Adding extra fields in the pyproject.toml authors/maintainers list

luizirber · June 26, 2022, 4:10pm

During the conversion from setup.cfg to pyproject.toml in sourmash I followed the setuptools documentation for pyproject.toml and was very happy with the new list format for authors/maintainers. I noticed that name/email are suggested but optional, and it occurred to me that adding other identifiers (like ORCID) is something that makes sense for our use case, where all contributors to the codebase are added as authors to scientific papers describing the software.

My question: since only name/email are discussed, am I overreaching when adding an orcid field for this purpose? Are there reserved field names to be used in the future, or am I just setting myself up for fixing this pyproject.toml file later because the valid fields in authors/maintainers might be enforced at some point?

Thanks!

JDLH · June 27, 2022, 2:15am

Good question. I don’t know the answer. I would normally turn to a pyproject.toml specification to answer it, but that spec has not yet been assembled from the multiple PEPs which define various aspects of pyproject.toml. I have noted this question in pypa/packaging.python.org issue 955, which tracks the need for the spec.

CAM-Gerlach · June 27, 2022, 7:55am

Per the top of the Declaring project [source] metadata specification (emphasis mine),

The fields defined in this specification MUST be in a table named [project] in pyproject.toml. No tools may add fields to this table which are not defined by this specification. For tools wishing to store their own settings in pyproject.toml, they may use the [tool] table as defined in the build dependency declaration specification.

And, for completeness, in the section specifying the permitted contents of the authors/maintainers key:

These fields accept an array of tables with 2 keys: name and email. Both values must be strings. The name value MUST be a valid email name (i.e. whatever can be put as a name, before an email, in RFC 822) and not contain commas. The email value MUST be a valid email address. Both keys are optional.

If you added your own arbitrary sub-keys to the author sub-table, any standards-conforming tools would not know what to do with them and at best ignore them, since they have no defined mapping to the underlying Core Metadata; and may raise an error (as it could be indicative of a typo or other mistake). You would also, of course, have to use your own custom packaging tool to actually include them in the metadata or use them for anything. Furthermore, if you or someone else were to propose a PEP to add such, any existing non-standard usage may conflict with whatever eventually gets defined in the PEP.
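To make the strict reading concrete, here is a minimal Python sketch of the check a conforming tool might run on the authors/maintainers arrays. It assumes the [project] table has already been parsed into a plain dict; the names check_people and ALLOWED_PERSON_KEYS are hypothetical and not taken from any real packaging tool.

```python
# Hypothetical helper: the only keys the spec defines for each author/maintainer table.
ALLOWED_PERSON_KEYS = {"name", "email"}


def check_people(project: dict, field: str = "authors") -> None:
    """Raise ValueError if an entry of project[field] departs from the quoted rules."""
    for index, person in enumerate(project.get(field, [])):
        where = f"project.{field}[{index}]"
        # Any key other than name/email is treated as an error (the strict reading).
        extra = set(person) - ALLOWED_PERSON_KEYS
        if extra:
            raise ValueError(f"{where} has unsupported keys: {sorted(extra)}")
        # Both values must be strings.
        for key, value in person.items():
            if not isinstance(value, str):
                raise ValueError(f"{where}.{key} must be a string")
        # The name must not contain commas.
        if "," in person.get("name", ""):
            raise ValueError(f"{where}.name must not contain commas")


# An entry with only name/email passes silently.
check_people({"authors": [{"name": "John Smith", "email": "john@example.com"}]})
```

An entry carrying an extra orcid key would raise ValueError here, which is one of the two outcomes (ignore or error) described above.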

However, as the above spec suggests, there’s a better solution—custom fields like the ones you mentioned can be added to a custom key in a table under [tool] with the name of whatever custom build tool you will be using to handle them. You could either just specify the additional author details in the tool section, e.g.

[project]
authors = [
    {name = "John Smith", email = "john@example.com"},
    {name = "Jane Doe", email = "jane@example.com"},
]

[tool.your-orcid-builder]
authors = [
    {author = "John Smith", orcid = "..."},
    {author = "Jane Doe", orcid = "..."},
]

Or, you could specify the authors key as dynamic and leave specifying all the author information to your custom tool:

[project]
dynamic = ["authors"]

[tool.your-orcid-builder]
authors = [
    {author = "John Smith", email = "john@example.com", orcid = "..."},
    {author = "Jane Doe", email = "jane@example.com", orcid = "..."},
]
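One rule worth keeping in mind with this second variant: the spec requires build back-ends to raise an error if a field is both listed in dynamic and given statically. A minimal sketch of that check, assuming the [project] table is already parsed into a dict (check_dynamic is a hypothetical name):

```python
def check_dynamic(project: dict) -> None:
    """Raise ValueError if a key is listed in `dynamic` and also set statically."""
    declared_dynamic = set(project.get("dynamic", []))
    # Every other key present in the table counts as a static definition.
    static_keys = set(project) - {"dynamic"}
    clashes = sorted(declared_dynamic & static_keys)
    if clashes:
        raise ValueError(f"fields both static and dynamic: {clashes}")


# Fine: authors is left entirely to the tool.
check_dynamic({"name": "example", "dynamic": ["authors"]})
```

So with dynamic = ["authors"], the authors array must not also appear under [project].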

Best of luck!

JDLH · June 27, 2022, 9:33pm

Thank you, @CAM-Gerlach , for answering @luizirber 's original question.

For me, this question leads to four observations about how the specifications could be improved, to be even more helpful in answering similar questions in the future.

Discoverability matters, and therefore document titles matter. If I have a question about whether I can put something in a pyproject.toml file, I would not think to look in a document entitled Declaring project metadata. I hope we end up with a document that has a title like pyproject.toml File Specification, or is in some other way discoverable for someone who is thinking, “pyproject.toml file”. That document need not be monolithic, however. If it refers readers to a separate Declaring project metadata document for the specification of the [project] table, that can still work well.
The sentence, “No tools may add fields to [the [project]] table which are not defined by this specification”, does not seem applicable to the original question. The question is about the value of the “authors” field and the “maintainers” field. Both these fields are defined by that specification.
I read the original question as asking, is it permitted to add other keys such as ORCID to the inline tables in the list which is the value of the authors field or maintainers field. The relevant specification text appears to be, “These fields accept an array of tables with 2 keys”. The verb “accept” is squishy in this context. It is reasonable to interpret it as “the following 2 strings are the only permitted keys in each of these tables”. But it is also somewhat reasonable to interpret it as, “the tables normally have these 2 keys, but other keys are possible”, maybe with a rule that consumers ignore keys they don’t recognise. I think the specification’s language could be clearer.
Come to think of it, the authors/maintainers section ends up worded in a slightly clumsy way, perhaps in an effort at brevity. The specification does not actually specify what the list of permitted field names is. The structure of the document implies that the text in each <H2><code> nested element pair is a permitted field name, but it does not seem to say that explicitly. And this particular section has an <H2> element with the text “authors/maintainers”, which implies that there is a field with the name, “authors/maintainers”. But the text is actually two <code> nested elements, separated by a non-code slash character. That is a rather indirect way of saying, “There is a field authors and a field maintainers. They have similar types and semantics, so they are described together.…”.

I suggest that naive questions like the one in this thread are a great way of testing the clarity of specifications in a way that authors and editors find difficult, because authors and editors know too much, and miss the gaps and implicit claims. I suggest it would be helpful to have a process for making editorial changes to sections like this, to make them clearer and more precise over time.

sinoroc · June 27, 2022, 9:41pm

Maybe this discussion is related:

Convention for encouraging citation of python packages

CAM-Gerlach · June 28, 2022, 2:13am

Indeed, the structure I propose to implement (hopefully with a draft PR up this week) takes exactly this approach, with a top-level document named “The pyproject.toml project configuration file” containing one heading for each top-level table, which links to the specification of each (or states it directly, for the tool table, since it is sufficiently simple and included in the original spec).

I suppose that’s arguably possible according to the literal letter of that paragraph of the spec, though I also included it to communicate the high-level intent (that overall, [project] is not the place for non-standard, tool-specific custom configuration/metadata) and to point to the tool table where this belongs instead, which is also why I included the specific section specifying the defined value of the author key.

Well, I suppose it could be interpreted that way, but it would be a rather non-obvious interpretation, especially if naturally extended to the other similar tables to all implicitly allow any other arbitrary keys/values rather than just the stated structure of a sub-table with the two specified keys. Furthermore, if the specification had meant “the tables normally have these 2 keys, but other keys are possible, with a rule that consumers ignore keys they don’t recognize”, it stands to reason that it would have actually specified all (or any) of that.

Furthermore, the JSON schema for pyproject.toml as of PEP 518 includes "additionalProperties": false at the top level, which prohibits any unspecified keys outside of the tool table that explicitly allows them (as it is type object), and the tools section further emphasizes it is the place for custom tool-specific configuration.
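For readers less familiar with JSON Schema, the effect of "additionalProperties": false can be approximated by hand. The sketch below is a simplified stand-in, not the PEP 518 schema itself and not a JSON Schema validator: SCHEMA, OPEN and unknown_keys are hypothetical names, and the layout only mirrors the idea that build-system is closed while tool accepts anything.

```python
OPEN = object()  # marker: any content is accepted below this key

# Simplified stand-in for a closed schema: only these keys are allowed.
SCHEMA = {
    "build-system": {"requires": OPEN},
    "tool": OPEN,
}


def unknown_keys(data: dict, schema: dict, path: str = "") -> list[str]:
    """Return dotted paths of keys in data that the closed schema does not list."""
    found = []
    for key, value in data.items():
        here = f"{path}.{key}" if path else key
        if key not in schema:
            found.append(here)  # what additionalProperties: false would reject
        elif schema[key] is not OPEN and isinstance(value, dict):
            found.extend(unknown_keys(value, schema[key], here))
    return found


document = {
    "build-system": {"requires": ["setuptools"]},
    "tool": {"your-orcid-builder": {"authors": []}},
    "extras": {},
}
print(unknown_keys(document, SCHEMA))
```

Anything under tool passes untouched, while the unlisted top-level key is reported.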

Finally, it is at cross-purposes with the overall stated purpose of the specification, to map project source metadata to distribution Core Metadata in a strictly standardized format that will result in the same output between any conforming tools. If a non-standard key were added to an authors table, one of the following would be true:

No tool actually uses the value for anything, so it is not really useful
A tool uses it in some way as part of filling the Author/Author-email/etc. fields in Core Metadata, which would contradict the spec for how these fields are to be constructed from the author values, and further violate the fundamental guarantee provided by the values in the [project] table, that they can be read by any conforming tool and used to unambiguously determine the values of the output Core Metadata
A tool uses it to fill a new, bespoke core metadata field, which would not be following standard core metadata or the project table specs
A tool uses it for something other than Core Metadata, which contradicts the purpose of the table, which is to be used to fill Core Metadata, as is the case for every other field and table key.

If you feel it should be further made explicit, you’re welcome to propose a PR to the spec, subject to discussion and review by the community. The simplest change I would suggest to achieve this consistently for all sub-table values without belaboring the spec is modifying

No tools may add fields to this table which are not defined by this specification.

to read something like

No tools may add fields to this table, or any child table, which are not defined by this specification.

I mean, I’m all in favor of explicitness, but I feel like this is going down the rabbit hole of pathological (mis)interpretation. To be honest, the spec has much, much more meaningful ambiguities than these.

Well, for one, that is not a valid key name in unquoted TOML. While the precise prose could be clarified (e.g. adding “The authors field describes the people or organizations…” to the first sentence of the authors field), I don’t see how the actual description could be realistically read to support such a pathological interpretation, as it pretty clearly describes them as separate fields (“the ‘maintainers’ field is similar to the ‘authors’ field”, “these fields”).

This is a great and very important point, and one I’ve been on both sides of, most recently in the role of the beginner in reviewing the docs for e.g. the sqlite3 module, where a number of key details were left implicit or assumed and ended up being confusing and unclear due to the authors being much more familiar with the module than the average reader, even with the same explicitly trying to account for such—it just can be really, really hard to do if you’re not a relative beginner.

I hope you don’t think with the above that I don’t appreciate the effort. It’s just that as someone in the spec’s intended audience (spec/docs writers and packaging community members) who’s struggled myself with some very crucial implicit ambiguities in the same spec, the prioritization of somewhat pathological interpretations doesn’t seem to be nearly as helpful as addressing a number of much more plausible, important and impactful ambiguities that have caused more significant issues in the packaging community.

For example, the content and mapping of the license.file field, which confused both spec authors (myself) and packaging tool implementors (during the setuptools pyproject.toml implementation, both individuals who’d read and re-read the spec many times), or the exact meaning of dynamic (whether it referred to [project] keys or Core Metadata fields, and what the specific constraints on dynamism were).

Personally, I’d much rather first take care of those issues, not to mention actually migrating the other specs, rather than spending time debating a point that can be clarified among the relatively narrower audience much more easily. But I suppose I mostly have myself to blame for that

sorcio · June 28, 2022, 1:32pm

Do you expect the information you want to include in the extra fields to be considered part of the project’s core metadata? I.e. it gets embedded in the wheel’s METADATA, instead of just sitting in the pyproject.toml file.

There is no obvious place for that information in core metadata, but with some stretch you could use the Author field.

Considerations about using Author for additional information

The core metadata specification has a provision for additional “contact information” in the Author and Maintainer fields:

A string containing the author’s name at a minimum; additional contact information may be provided.

Example:
Author: C. Schultz, Universal Features Syndicate,
        Los Angeles, CA <cschultz@peanuts.example.com>

It would not be completely unreasonable for a user to try and fit different types of identifiers, such as the ORCID, in the same space:

Author: Josiah S. Carberry https://orcid.org/0000-0002-1825-0097

I’m not recommending it, but I wouldn’t be surprised if some user already used the field for something like this. Digging further in cursed territory, the [project] specification (quoted above) says that a lone name field is mapped directly to Author/Maintainer with only the additional restriction that it must be valid as a RFC 822 name and contain no commas. So you could have:

[project]
authors = [
    { name = "John Smith", email = "john@example.com" },
    { name = "Josiah S. Carberry https://orcid.org/0000-0002-1825-0097" },
]

Note that the specification for Author-email does not mention “additional contact information” and only says “name”. Let’s say I interpret this to mean that only the author’s name should be included, and additional information is not legal. Then there is no way, by just using the [project] static fields, to add both additional information and an email address, i.e. this would become suddenly illegal:

    { name = "Josiah S. Carberry https://orcid.org/0000-0002-1825-0097",
      email = "j@example.com" },

because it goes to the Author-email field, and not the Author field as in the other example.
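The mapping being discussed can be sketched in a few lines of Python. This follows the rules in the [project] specification as I understand them (name only goes to Author, email only or name plus email goes to Author-email as "name <email>", multiple values joined by commas) but deliberately skips the RFC 822 quoting a real back-end would apply; author_headers is a hypothetical name.

```python
def author_headers(authors: list[dict]) -> dict[str, str]:
    """Map [project] authors entries to Author / Author-email header values."""
    names_only = []
    with_email = []
    for person in authors:
        name = person.get("name")
        email = person.get("email")
        if name and email:
            # Both present: the pair goes to Author-email (no quoting in this sketch).
            with_email.append(f"{name} <{email}>")
        elif email:
            with_email.append(email)
        elif name:
            # A lone name goes to Author verbatim, whatever it contains.
            names_only.append(name)
    headers = {}
    if names_only:
        headers["Author"] = ", ".join(names_only)
    if with_email:
        headers["Author-email"] = ", ".join(with_email)
    return headers


print(author_headers([
    {"name": "John Smith", "email": "john@example.com"},
    {"name": "Josiah S. Carberry https://orcid.org/0000-0002-1825-0097"},
]))
```

With the two-author example above, the lone name with the ORCID URL lands in Author, while the name/email pair lands in Author-email, which is exactly the split that makes the combined form questionable.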

I imagine that the authors of PEP 621 made a deliberate decision that only name and email are interesting, so all the above is an abuse of the specification to do something that was not intended. But it’s in the spec, so the name field can’t be restricted (in syntax or semantics) without affecting backwards compatibility.

Maybe it’s a candidate for a recommendation rather than a restriction (à la “the field should represent the author/maintainer’s human name”), but standardizing anything around names is a gigantic and risky hassle and I don’t think there is enough pressure for it to be worth it.

I wrote the notes above before this comment was posted, so let me clarify that I share the feeling! The above pedantry is just a side effect of myself trying to get more acquainted with the specs and the problems they are solving, so that I can eventually contribute something more significant.

Aside. One sentiment is that the language of specifications tends to be informal^[1] and favors pragmatism, which is great and very much in the spirit of Python. Just like Python code, sometimes this means leaning on the side of underspecification, which leaves room for flexibility and innovation, but also leads to hard-to-maintain hacks or backwards-compatibility issues with cases that were not considered by the authors. To be clear, I think this is a very effective process to make progress, as long as complexity is kept at bay, which is an active effort. Like Python encourages testing and introduced static types; the equivalent would be to adopt formal language when useful (e.g. JSON Schema for PEP 621) and to encourage a larger share of the community^[2] to build solutions based on the standards at an earlier stage of their development.

Note that PEP 621 mandates that tools raise an error:

If metadata is improperly specified then tools MUST raise an error to notify the user about their mistake.

If respected, this leaves room to revise the standard in the future without having to worry about non-standard extensions. Tools that don’t implement the new revision would just fail.

Some software can be lenient (e.g. a script to extract data from a body of packages) but I believe the better choice for packaging tools is to validate as strictly as reasonable, and for users to be cautious around edge cases.

relatively informal, somewhere in between in a spectrum where the JSON specification is too informal and the ISO C++ standard is unapproachable by most ↩︎
I mean community outside of the open source tool developers and index maintainers and what @CAM-Gerlach referred to as “core audience”; packaging is already a broad interest with a lot of impact, and with standardization and plugin-based tools the audience is set to grow ↩︎

sinoroc · June 28, 2022, 8:45pm

Another potentially related discussion:

The author/maintainer distinction problem (and PEP 621)

CAM-Gerlach · June 28, 2022, 10:13pm

These all look fairly reasonable; the key constraint here, which this follows, is that any structure is contained within the values of the author keys, which (as mentioned) are not generally restricted or strictly defined by the spec, so they will both interoperate with existing tools and not affect the guarantees that the spec provides.

Just to be clear, I certainly did not mean to speak any ill of this discussion in general or of the sort noted above in particular; indeed, it answers the OP’s question quite thoroughly and perhaps more directly and effectively than I did originally.

Ah, I missed that. Indeed, that would weigh the argument more heavily against supporting such an alternative interpretation, as it would make the [project] table at least de-jure incompatible with standard tools, as opposed to using the designated tool table for tool-specific metadata, configuration and extensions, coupled with marking core metadata fields as dynamic in the project table if they are modified as a result.

abravalheri · June 29, 2022, 12:03am

I have some prior work on that:

abravalheri/validate-pyproject/blob/main/src/validate_pyproject/project_metadata.schema.json

{
  "$schema": "http://json-schema.org/draft-07/schema#",

  "$id": "https://packaging.python.org/en/latest/specifications/declaring-project-metadata/",
  "title": "Package metadata stored in the ``project`` table",
  "$$description": [
    "Data structure for the **project** table inside ``pyproject.toml``",
    "(as initially defined in :pep:`621`)"
  ],

  "type": "object",
  "properties": {
    "name": {
       "type": "string",
       "description":
         "The name (primary identifier) of the project. MUST be statically defined.",
       "format": "pep508-identifier"
    },
    "version": {
      "type": "string",

This file has been truncated.

Please feel free to submit any PR if there is something wrong (sometimes JSON schema can be tricky).

JDLH · July 1, 2022, 10:17pm

Good idea. I’ll start with this issue: No clear list of permitted keys in “Declaring project metadata” specification #1101. I think it is easy and purely editorial. If there is support for this change, I’m happy to draft a PR.

Do we expect that every reader of the specification should also have read the JSON schema for pyproject.toml, understand the semantics of JSON schema, and also have read the TOML specification? If so, then it is only fair that the specification state that. But instead, I think that it would be kinder to the reader for the specification to include the most important conclusions from these other documents right there in its own text.

I think this is a really important point for the packaging specification and documentation activity as a whole. What is the right level of formality and precision for the various Python Packaging specs and docs? What some are seeing as “pedantry”, I am seeing as opportunities for more clarity and effectiveness.

I suppose a lot depends on how you see the shape of the curve between (informality, underspecification, hard-to-maintain hacks, faster writing) and (precision, clarity, interoperability, more editing cycles). I grew up as a software engineer with documentation that was pretty readable and accessible, while still being precise and rigorous. I aspire for the packaging documentation to reach that same level. If others don’t have that as a goal, or think it is impractical to achieve, then we will end up with different opinions on the resulting documents.

CAM-Gerlach · July 2, 2022, 3:25am

Well no, not necessarily, but I would expect them to have read the relatively short description of them, which makes fairly clear that they are separate fields, e.g.

The ‘maintainers’ field is similar to the ‘authors’ field …

I also suggested a small textual tweak that would further reduce the possibility of any ambiguity; you could propose a PR to implement that and the other one I proposed to address the other ambiguity you raised.

To be fair, perhaps more so than any other PEP author, I tend to lean on the side of explicitness, clarity and what some others may see as “overspecification”. However, avoiding the specific points of potential confusion identified here should be resolvable with a few small tweaks to the spec language, which I’ve suggested above.