Should the stdlib provide an API to examine function/class field descriptions in docstrings?

ajoino · September 4, 2023, 8:32pm

While reading the PEP 727 discussion, many posters were of the opinion that the goals of that PEP can be achieved with docstrings. I started thinking about how the stdlib could be extended to help parse and validate docstrings and I wanted to share my ideas for further discussion.

The inspect module is extended with the following functions:

inspect.getfielddoc(object),
inspect.validatefielddoc(object), and
inspect.register_docstr_parser(docstr_parser)

which work as follows:

inspect.getfielddoc(object) will parse the docstring and return a dictionary where each key is the name of a field from the parsed docstring, and each value is a pair (type, description) with the type and field description from the parsed docstring. The reason for returning a pair is that some developers document the types in the docstring and don’t use type annotations. This API would primarily be consumed by documentation generators like sphinx. Not sure exactly how to deal with the key of the return value but I’m leaning towards marking that with the key "->". It could also be possible to have this function return documented exceptions.

inspect.validatefielddoc(object) will parse the docstring and validate that the fields in the docstring and function definition are the same. This function could have some flags, e.g. making validation fail unless each field has a description, although separate functions could be used instead of flags. This API would primarily be consumed by linters and LSPs.
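
To make the validation idea concrete, here is a rough sketch of the comparison step only. It assumes the docstring has already been parsed into the dictionary shape described above; the name validate_field_doc, the require_descriptions flag, and the list-of-problems return value are all hypothetical and not part of the real inspect module.

```python
import inspect


def validate_field_doc(func, fields, require_descriptions=False):
    """Compare parsed docstring fields against the function signature.

    `fields` maps field names to (type, description) pairs; the "->" key,
    if present, describes the return value. Returns a list of problems,
    where an empty list means the docstring and signature agree.
    """
    documented = {name for name in fields if name != "->"}
    actual = set(inspect.signature(func).parameters)

    problems = []
    for name in sorted(actual - documented):
        problems.append(f"parameter {name!r} is not documented")
    for name in sorted(documented - actual):
        problems.append(f"documented field {name!r} is not a parameter")

    # Optional stricter mode: every documented field needs a description.
    if require_descriptions:
        for name, (_type, description) in sorted(fields.items()):
            if not description:
                problems.append(f"field {name!r} has no description")
    return problems


def add(a, b):
    return a + b


complete = {
    "a": (None, "first term of addition"),
    "b": (None, "second term of addition"),
    "->": (None, "The sum of two terms"),
}
stale = {"a": (None, "first term of addition"), "c": (None, "")}

print(validate_field_doc(add, complete))
print(validate_field_doc(add, stale, require_descriptions=True))
```

Returning a list of messages rather than raising is just one option; a linter would probably want structured diagnostics instead.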

inspect.register_docstr_parser(docstr_parser) lets the consumer of the first two APIs register docstring parsers. A docstring parser is a function that takes a string as input and returns the same kind of dictionary returned by inspect.getfielddoc(object). If the consumer wants to parse numpy-style docstrings, they can register a NumpyDocstrParser with inspect.register_docstr_parser(NumpyDocstrParser()), and now inspect.getfielddoc and inspect.validatefielddoc will work with all objects using numpy-style docstrings. A consumer can register multiple docstring parsers, each of which will be tried in turn until a non-empty dictionary is returned. If this registration procedure can’t work for reasons I’m unaware of, the docstring parser could instead be provided directly to inspect.validatefielddoc. (inspect.getfielddoc wouldn’t be needed in that case.)
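
A minimal sketch of how the registration and fall-through could behave, written as free functions rather than as real inspect APIs. Everything here is an assumption for illustration: the module-level list, the names register_docstr_parser and get_field_doc, and the two toy parsers (which only recognise ":param name: text" and "@name text" lines and are far simpler than real Sphinx or numpy-style parsers).

```python
import inspect
import re

# Hypothetical registry: parsers are tried in registration order.
_docstr_parsers = []


def register_docstr_parser(parser):
    """Register a callable that maps a docstring to {name: (type, description)}."""
    _docstr_parsers.append(parser)


def get_field_doc(obj):
    """Return the first non-empty result from the registered parsers."""
    doc = inspect.getdoc(obj) or ""
    for parser in _docstr_parsers:
        fields = parser(doc)
        if fields:
            return fields
    return {}


def param_colon_parser(doc):
    """Toy parser for lines of the form ':param name: description'."""
    found = re.findall(r"^:param (\w+):(.*)$", doc, re.MULTILINE)
    return {name: (None, text.strip()) for name, text in found}


def at_sign_parser(doc):
    """Toy parser for lines of the form '@name description'."""
    found = re.findall(r"^@(\w+)\s+(.*)$", doc, re.MULTILINE)
    return {name: (None, text.strip()) for name, text in found}


register_docstr_parser(param_colon_parser)
register_docstr_parser(at_sign_parser)


def add(a, b):
    """Add two terms.

    :param a: first term of addition
    :param b: second term of addition
    """
    return a + b


def sub(a, b):
    """Subtract b from a.

    @a value to subtract from
    @b value to subtract
    """
    return a - b


print(get_field_doc(add))  # handled by the first parser
print(get_field_doc(sub))  # first parser finds nothing, second one is tried
```

One consequence of "first non-empty result wins" is that registration order matters whenever two formats could both match the same docstring.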

A note on the names of the first two functions: their names are not snake_case to mirror the inspect.getdoc name.

This proposal is based on the idea that we want to codify, in the stdlib, the concept of field documentation in docstrings. Whether or not this proposal would achieve those goals (I have zero experience implementing documentation generators and linters), this proposal is useless if we don’t wish to put this feature in the stdlib. I think it would be most fruitful if we start by discussing if the API proposed above, or any similar kind of API to work with field descriptions in docstrings, is suitable for the stdlib. If there is consensus that such an API is a good fit for the stdlib, then we can continue discussing the details of such an API. I added the proposed API to a) serve as inspiration for the discussion and b) because I’ve been thinking about it for about a week and needed to get it out of my system.

brettcannon · September 5, 2023, 11:58pm

Except the stdlib via PEP 8 has no standard on how to specify parameter docs (PEP 257 – Docstring Conventions | peps.python.org is the closest and it’s in passing). I think before you can even consider an API you have to first convince the core devs that parameter docs are good, something the stdlib should start doing itself, and then agree on a format. Then you can worry about how such an API might look.

PIG208 · September 6, 2023, 1:14am

I see how you are approaching this without hitting the “yet-another-standard” problem, by having the consumer specify the docstr parser. But I think it is the core issue here.

I think this has been brought up in the previous thread, as in:

so we really want it to be useful enough for it to be added to the stdlib.

Where should the implementation of NumpyDocstrParser be hosted? I feel that this set of APIs is useful only internally to linters/parsers written in or partially in Python, but it is missing the point of PEP 727 which is to have a standardized format for parameter docstrings.

ajoino · September 6, 2023, 5:21am

I am personally fairly happy with the current state of docstrings, i.e. I use Google-style docstrings for documenting parameters and then use a docgen with the appropriate settings. I started this open-ended discussion because I had some ideas I wanted to share, and because there was apparent interest in this general topic, looking at the most liked posts in the PEP 727 thread. But if the interest in this thread (from both core developers and other users) is anything to go by, it seems that documenting parameters is best left to 3rd parties.

ajoino · September 6, 2023, 5:47am

My very first idea was to use special markers in docstrings and toml to specify parameter descriptions, but as you noticed I felt that was an xkcd 927 moment. Thus, the idea I chose to present was one that I feel actually represents the consensus, that there are multiple equally-valid formats that should all be usable with a new API. In my mind, the parsers would live in 3rd-party repositories on PyPI (or maybe under some PSF banner). But the main point of the post was to open a discussion about how this could be done in general, with the suggestion serving as an appetizer of sorts.

Regarding my more-specific-but-flawed-in-an-xkcd-927-way proposal, it would look like this:

Potential parameter description docstring specification

I will admit that I hadn’t put this particular idea on paper until now, and I realize that toml doesn’t feel quite right. However, I do feel that marking the lines containing the parameter descriptions in the docstring is a good idea. But good enough to be standardized? Not sure.

Parameter descriptions in docstrings are represented as a toml document with three tables: parameters, return, and exceptions. The parameters of function def add(a, b): return a + b would be described by the following document:

[parameters]
a = "first term of addition"
b = "second term of addition"

[return]
"->" = "The sum of two terms"

The toml document is inlined in the docstring, marked by strings starting with ###, similar to how lines executed by doctests start with >>>. Thus, in a Python script the full docstring could look like this:

def add(a, b):
    """Adds terms a and b together.
    
    ### [parameters]
    ### a = "first term of addition"
    ### b = "second term of addition"
    ###
    ### [return]
    ### "->" = "The sum of two terms"
    """

    return a + b

Comments on this idea:
While this scheme, i.e. marking the lines containing the parameter descriptions, is nice IMO, the exact format to be used inside them, if others like it, is not very important to me. It could be toml, yaml, rST, your-favorite-flavor markdown, or a new nice format. Then we could add some functions to the stdlib a la the first post. But since this idea goes against the current consensus on docstrings (i.e. no consensus on the format) I think it’s doomed from the start.

drunkwcodes · September 6, 2023, 10:32am

I’m also used to google-style docstrings.

I think it’s good to have a consensus about this and leave typing alone.

PEP 727 is a big no for me.
Typing will get more and more complex as time goes by.

I hope it will not be convoluted by irrelevant features.

pawamoy · September 14, 2023, 2:49pm

I very much agree with @brettcannon here.

Before even thinking about providing APIs to parse different docstring formats into structured data, these formats must at least be properly specified (they are not), and at best support documenting the same things (they do not).

Many tools have implemented parsers for Google-style or Numpydoc-style docstrings: it’s not that difficult. The issue is that they all probably differ a bit (because these styles have no specs), and that the structured data you get out of them are not the same. It means that documentation generators and other tools must both parse these different formats and render these different formats differently. This is super cumbersome. Adding new docstring formats without standardizing the data itself will just make these tools’ job more complicated.

I tried to address the second point in Griffe by declaring data classes that are used by both parsers (Google/Numpydoc), like a common denominator of structured data. The documentation renderer (mkdocstrings-python) can then declare templates for each of these classes, without even knowing if they come from Google or Numpydoc docstrings.

It works, but I had to drop some features of Numpydoc (See also, Warnings, References, because they are just markup), add some of them to the Google-style (named returned/yielded/received values, Methods), and add some to both (Functions, Classes, Modules, generic admonitions). As long as only a subset of all styles’ features have a common ground regarding data, it will be difficult to maintain or evolve.

So IMO the absolute first thing to do before creating new docstring formats (even if they’re already based on a data-friendly declarative syntax like TOML) is to standardize the data itself.

Here is the data that Griffe currently handles:

regular text sections: plain markup, like Markdown, rST, Asciidoc, etc.
parameters sections: a list of parameters, each with a name, type, description (markup) and default value
other parameters sections: same thing, for keyword arguments, without default values
raises sections: list of exceptions raised by a function/method/property, each with a description
warns sections: list of warnings emitted by a function/method/property, each with a description
returns sections: list of returned values (think tuples), each with an optional name, a type, and a description
yields sections: list of yielded values (again, tuples) for iterators/generators, each with an optional name, a type, and a description
receives sections: list of received values (again, tuples) for generators, each with an optional name, a type, and a description

…as well as summary sections, like attributes, functions/methods, classes, and modules:

attributes have a name, a type, a description, and an optional value
functions/methods have a signature (can be just their name) and a description
classes have a signature (can be just their name) and a description
modules have a name and a description

…as well as admonition-like sections, such as examples, notes, warnings, deprecations, and any other generic kind that uses the syntax of the chosen style (tip, danger, quote, see also, preview, you name it), because users don’t like to mix style syntax with markup syntax in their docstrings.
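
To illustrate what "standardizing the data itself" could mean, here is a small sketch covering a few of the sections listed above. These classes are invented for this example and are not Griffe's actual classes; the point is only that a renderer can work from the data alone, without knowing which docstring style produced it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Parameter:
    name: str
    description: str
    annotation: Optional[str] = None
    default: Optional[str] = None


@dataclass
class Raised:
    annotation: str
    description: str


@dataclass
class Returned:
    description: str
    annotation: Optional[str] = None
    name: Optional[str] = None


@dataclass
class DocstringData:
    """Style-independent result that any docstring parser could produce."""
    text: str = ""
    parameters: list[Parameter] = field(default_factory=list)
    raises: list[Raised] = field(default_factory=list)
    returns: list[Returned] = field(default_factory=list)


def render(data: DocstringData) -> str:
    """Toy renderer that only depends on the data classes."""
    lines = [data.text] if data.text else []
    for p in data.parameters:
        kind = f" ({p.annotation})" if p.annotation else ""
        lines.append(f"param {p.name}{kind}: {p.description}")
    for r in data.raises:
        lines.append(f"raises {r.annotation}: {r.description}")
    for r in data.returns:
        lines.append(f"{r.name or 'return'}: {r.description}")
    return "\n".join(lines)


doc = DocstringData(
    text="Adds terms a and b together.",
    parameters=[
        Parameter("a", "first term of addition", annotation="int"),
        Parameter("b", "second term of addition"),
    ],
    returns=[Returned("The sum of two terms")],
)
print(render(doc))
```

A Google-style parser and a Numpydoc-style parser would both return DocstringData in this picture, so the renderer needs only one template per class.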

I was seduced by the idea behind PEP 727 because it moves the data out of the docstrings, so that docstrings can simply be written using the chosen markup (Markdown, rST, etc.), without mixing it with a particular docstring style. Here is an example of what it can accomplish: Examples - Griffe TypingDoc. No docstring style used here, so no docstring parsing required (the whole docstring is “parsed” as a single regular text section, and collected data is inserted/appended before/after it).

pekkaklarck · November 18, 2023, 2:33pm

Python having a standard for documenting parameters, return values and possibly exceptions as part of the docstring would be great. It probably should be based on one of the current pseudo-standards, but I wouldn’t have a problem adopting something new either. The syntax would need to be easy to read and understand for humans, but structured enough to be parsed unambiguously. Including types should be optional to allow using type hints instead.

With such a standard available, linters, IDEs, documentation generators, etc. would have a much easier time mapping actual parameters to parameter documentation. I’m sure we’d very quickly have generic modules for parsing parameter docs on PyPI, and eventually something could be added to the standard library as well.

The obvious problem is agreeing on the syntax.