Convention for encouraging citation of Python packages

Python is very, and increasingly, popular in science, engineering and related fields. Science suffers from under-recognising the contributions of software engineers and the open source community, when in reality FOSS is fundamental to the workflows which generate an increasing proportion of scientific progress. One small corner of this issue is that citing software packages is awkward: the best you can do with the industry standards is “misc” (BibTeX) or “webpage” (CSL). Citing a virtualenv-full of Python packages is a painstaking process of clicking through each project’s PyPI page, git(hub|lab) README, readthedocs page, and website, in the hope that there might be some reference to a conference proceeding from a decade ago, most likely in the form of a text citation which you then have to encode into your serialisation format of choice (bibtex, csl-xml etc.). If not, you cobble together what you can from the PyPI metadata.

I wrote a small package automating that last option, but it’s far from perfect: if a maintainer does have a “real” publication they’d prefer to be cited by, it’ll be missed entirely. Author names are also treated as a single literal string rather than the structured form (given name, family name and so on) that real human names need.

I propose recommending a convention to greatly ease the citation of Python packages. In the same way that we recommend the __version__ module attribute, we could recommend a __cite__ attribute for scientific packages, containing everything a prospective referencer may need.

This needs to be introspectable (fields can be accessed without additional parsing) and optional (does not require external libraries), so an assembly of python builtins is better than, say, a bibtex-formatted string, or raw XML. There are two JSON reference data formats worth considering: BibJSON and CSL-data JSON. Neither has much of an ecosystem, and neither seems to acknowledge the other so comparing is a bit tricky, but CSL seems more generic and has a schema.

This would allow the scipys of the world to incorporate their actual publications into the __cite__ field, and smaller projects to just point directly at their PyPI page. A referencer could look for the field, and fall back on generating the information from PyPI metadata. Everyone gets cited as accurately as they want, but there is at least one-- and preferably only one-- obvious way of doing it. You can include multiple items, so you can have one which automatically updates with every release, and one “static” one which points at a publication.
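A minimal sketch of that lookup, assuming the referencer runs in the same environment as the package. The function name find_citation is hypothetical, and the fallback reads the locally installed distribution metadata as a stand-in for querying PyPI. It also assumes the distribution name matches the import name, which is not always the case.

```python
import importlib
from importlib import metadata


def find_citation(package_name):
    """Return a list of CSL-data-like dicts for an installed package."""
    module = importlib.import_module(package_name)
    cite = getattr(module, "__cite__", None)
    if cite:
        # The maintainer has said how they want to be cited.
        return list(cite)

    # Fallback: build a bare-bones item from the installed metadata.
    # Assumption: distribution name == import name.
    try:
        meta = metadata.metadata(package_name)
    except metadata.PackageNotFoundError:
        return []

    item = {
        "id": package_name,
        "type": "webpage",
        "title": meta.get("Name", package_name),
        "version": meta.get("Version", ""),
    }
    url = meta.get("Home-page")
    if url:
        item["URL"] = url
    author = meta.get("Author")
    if author:
        # Metadata only offers a single string, hence "literal".
        item["author"] = [{"literal": author}]
    return [item]
```

An empty list means neither source was available, so the caller can decide whether to warn or skip the package.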

Here’s an example:

from datetime import date

__version__ = "0.2.3"

__cite__ = [
    {
        "URL": "https://www.github.com/clbarnes/citepy",
        "abstract": "Automatically create citations for packages",
        "accessed": {
            "date-parts": [
                list(date.today().timetuple()[:3])
            ]
        },
        "author": [
            {
                "given": "Chris L.",
                "family": "Barnes"
            }
        ],
        "categories": [
            "software",
            "python",
            "libraries",
            "pypi"
        ],
        "id": "citepy",
        "issued": {
            "date-parts": [
                [2019, 5, 28]  # release date
            ]
        },
        "original-date": {
            "date-parts": [
                [2019, 5, 25]  # first release
            ]
        },
        "publisher": "GitHub",
        "title": "citepy",
        "type": "webpage",
        "version": __version__
    }
]

__author__ = "{given} {family}".format(**__cite__[0]["author"][0])

This would need a bit more downstream tooling to make it valuable: citeproc-py (for converting that data into other formats) is dead, and there is some sort of disagreement between the CSL-data JSON schema and python’s jsonschema implementation. Hand-writing and validating JSON-like structures is obviously a pain but citepy has some convenience classes for that purpose.

This doesn’t require any code, just a PEP establishing the standard. Do you have any thoughts on whether it’s even valuable enough to warrant that?

Does it even need a PEP? Is it not something that scientific package authors can just agree on as a convention? It’s not a matter of whether the proposal is “valuable enough” to warrant a PEP, but rather whether having a PEP rather than just a convention is of sufficient benefit to justify the cost (the PEP process is pretty laborious, and once finalised, making changes is relatively difficult).

Regardless, the proposal sounds like a reasonable idea to me.

For me, the value of making a PEP out of it is that there is something concrete to say “This has been looked at and discussed by The Community, who have agreed it’s a good standard to use going forward”: it has more authority than me just writing a blog post and expecting the ecosystem to revolve around it. It’s hard enough trying to get people to accept standards which are PEPs (e.g. PEP8); I think this would be beneficial to scientific Python and science in general, but am not optimistic about it gaining any traction without some sort of thumbs up from people more integral to the community than me, and somewhat-official documentation.

The proposal sounds complicated enough to warrant an actual standard. There’s no one obvious schema to use, the exact mechanics of “falling back to PyPI metadata” won’t be straightforward, and so on.

My worry is that the PEP process won’t give you the right audience. You’ll need to discuss with the scientific community, the people who wrote SciPy’s citation page, astropy.__citation__ or duecredit. A PEP will normally attract CPython developers and/or the PyPA (Python Packaging Authority). I’m afraid there’s not much overlap.

I wonder whether this should be included in package metadata instead. Doing this in code makes it difficult to validate the correctness of the citation. It would be equally simple to introduce a collection of optional wheel metadata fields to convey the information.

I’d agree with @uranusjr. Looks like something to include in pyproject.toml.

My rationale for including it in the code is that you could get it from an installed instance of the library, just like __version__. I may be misunderstanding, but pyproject.toml doesn’t get pulled down when you pip install something, right? In the future it would be awesome if PyPI had fields for citation information on the site or in the REST endpoint, but that’s a pretty big change to hope for as it’s relatively niche.

Another positive of doing it in code is the ability to dynamically generate things like the access date, and reuse the existing __version__ field rather than having to keep track of yet another version string.

> Doing this in code makes it difficult to validate the correctness of the citation.

Correctness in terms of the structure, or that it points to something worthwhile? The format of the citation could be checked by jsonschema (when they work together, of course). In terms of pointing to something worthwhile, I’m not sure using metadata helps much: anyone can write semantically incorrect text wherever they want in any kind of project.
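As a rough illustration of what a structural check could look like without jsonschema, here is a hand-rolled sketch. It is deliberately narrower than the CSL-data schema: it only requires "id" and "type" on each item and only understands the "date-parts" form of dates used in the example above. The function name check_cite is hypothetical.

```python
DATE_KEYS = ("issued", "accessed", "original-date")


def _valid_date(value):
    """True if value looks like {"date-parts": [[year, month, day], ...]}."""
    parts = value.get("date-parts") if isinstance(value, dict) else None
    return (
        isinstance(parts, list)
        and bool(parts)
        and all(
            isinstance(p, list)
            and 1 <= len(p) <= 3
            and all(isinstance(n, int) for n in p)
            for p in parts
        )
    )


def check_cite(items):
    """Return a list of problems found; an empty list means it passed."""
    if not isinstance(items, list):
        return ["__cite__ should be a list of items"]
    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not a dict")
            continue
        for key in ("id", "type"):
            if key not in item:
                problems.append(f"item {i}: missing {key!r}")
        for key in DATE_KEYS:
            if key in item and not _valid_date(item[key]):
                problems.append(f"item {i}: {key!r} is not a valid date-parts date")
    return problems
```

This says nothing about whether the citation points at something worthwhile, only that the structure is plausible.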

The downside of doing it in code is that you need an intermediate step to turn it into something usable by e.g. citation-js. But keeping an extra metadata file around is not great either, given you’d then have to include it in the manifest (the procedure for which doesn’t have the clearest documentation) and then dig it out of site-packages.

> Looks like something to include in pyproject.toml.

Another concern with this is that we wouldn’t be able to leverage existing citation data standards like CSL-data JSON. We could invent our own which trivially maps to it (CSL-data TOML), but given we already need a third-party tool (citation-js) to turn the JSON into other formats, going TOML -> JSON -> bibtex/whatever seems unnecessarily tortuous. We could store a raw JSON or bibtex string in the pyproject but that is even harder to write and validate, and doesn’t really make sense IMO.

This is all great feedback!

This actually wouldn’t be appropriate for pyproject.toml as this has nothing to do with building a package (and pyproject.toml isn’t included in a wheel, which is what people will be installing and using anyway). Embedding it in wheel metadata as @uranusjr suggests is a better fit if it were to be kept with the code that’s actually installed.

I want to clarify my comment, and also provide some context for those not knee-deep in packaging nuances.

I view the citation information as part of the package metadata, i.e. data that describes the package, but is not strictly required for its general functionality. In that sense, citation info is no different from attributes like package version (as previously mentioned), author, documentation URL, etc. All of that information is canonically stored in dist-info. Attributes like __version__ are only a common place to also expose the same information. They are nice to have, but not canonical.

This is why I said the citation should be included in package metadata; whether the end user may access the same information via import foo; foo.__cite__ is up to the package maintainer (and community convention), but package metadata should be the canonical storage, since it is where the information can be validated, either when the maintainer produces the wheel to upload to PyPI, or when a user installs the package from source (during pip’s build step).

A part of the question I didn’t cover is how this information can be put into package metadata. If we follow the thought of treating it like package metadata, the answer would be to write it somewhere for a build backend to read, and the backend can read, validate, and record it when building a wheel. The somewhere could be pyproject.toml (as attributes under a tool section), a separate config file (like setuptools’s setup.py and setup.cfg), or foo.__cite__ (like how flit reads the package version).

Those all make sense, but it’s up to the tools to decide on that; if PyPA is to do anything, it’s to lay out how the information is stored (my recommendation is inside the dist-info directory), not how the information should be produced.
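To make the dist-info option concrete, here is a sketch of how a consumer might read such a file back. The file name citation.json is purely hypothetical (no standard defines it), and the sketch assumes a build backend has already written CSL-data JSON into the dist-info directory.

```python
import json
from importlib import metadata

CITATION_FILE = "citation.json"  # hypothetical; not defined by any standard


def citation_from_dist(dist):
    """Read the citation file from a Distribution, or None if absent."""
    text = dist.read_text(CITATION_FILE)
    if text is None:
        return None
    return json.loads(text)


def citation_for(dist_name):
    """Look up an installed distribution by name and read its citation."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return citation_from_dist(dist)
```

Because this goes through importlib.metadata, it works without importing the package itself.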

One of the insights of duecredit is that you want to be able to attach citations to parts of packages, e.g. many individual functions in scipy have associated citations. If you’re just using scipy.signal, there’s no reason to cite the algorithms in scipy.spatial. So that argues for putting the citation metadata in the code itself, rather than the package metadata.
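A toy illustration of per-function citations follows. This is not duecredit's API: the cites decorator, the registry and the placeholder citation are all hypothetical, and only show how code-level metadata lets a tool report just the references for functions that were actually called.

```python
import functools

_USED = {}  # citation id -> item, filled in as cited functions run


def cites(*items):
    """Attach CSL-like citation items to a function and record their use."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for item in items:
                _USED[item["id"]] = item
            return func(*args, **kwargs)
        wrapper.__cite__ = list(items)  # introspectable without calling
        return wrapper
    return decorator


def used_citations():
    """Citations for every decorated function called so far."""
    return list(_USED.values())


# Placeholder item for illustration only; not a real publication.
PLACEHOLDER = {"id": "placeholder-2019", "type": "article-journal", "title": "Placeholder"}


@cites(PLACEHOLDER)
def mean(values):
    return sum(values) / len(values)
```

After a script runs, used_citations() would list only what it touched, so using one submodule does not pull in references for another.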

Everyone upthread is right though that you can’t solve this problem with a PEP. The critical people aren’t here. You need to start by getting the major scientific projects on board. The SciPy conference is a conventional place to have conversations like this, or some place like the scipy-user mailing list could work.