Python is very popular, and increasingly so, in science, engineering and related fields. Science under-recognises the contributions of software engineers and the open source community, even though FOSS is fundamental to the workflows which generate an increasing proportion of scientific progress. One small corner of this issue is that citing software packages is awkward: the best entry types the standard formats offer are “misc” (BibTeX) or “webpage” (CSL). Citing a virtualenv full of Python packages is a painstaking process of clicking through each project’s PyPI page, git(hub|lab) README, readthedocs page, and website, in the hope of finding a reference to a conference proceeding from a decade ago. That reference is most likely a text citation, which you then have to encode into your serialisation format of choice (bibtex, csl-xml etc.). If there isn’t one, you cobble together what you can from the PyPI metadata.
I wrote a small package automating that last option, but it’s far from perfect: if a maintainer does have a “real” publication they’d prefer to be cited by, it’ll be missed entirely. Author names are also taken as a single literal string rather than the structured form (given name, family name and so on) that real human names need.
I propose recommending a convention to greatly ease the citation of Python packages. In the same way that we recommend the __version__ module attribute, we could recommend a __cite__ attribute for scientific packages, containing everything a prospective referencer may need.
This needs to be introspectable (fields can be accessed without additional parsing) and optional (does not require external libraries), so an assembly of Python builtins is better than, say, a bibtex-formatted string or raw XML. There are two JSON reference data formats worth considering: BibJSON and CSL-data JSON. Neither has much of an ecosystem, and neither seems to acknowledge the other, so comparing them is a bit tricky, but CSL seems more generic and has a schema.
This would allow the scipys of the world to incorporate their actual publications into the __cite__ field, and smaller projects to just point directly at their PyPI page. A referencer could look for the field, and fall back on generating the information from PyPI metadata if it is absent. Everyone gets cited as accurately as they want, and there is at least one (and preferably only one) obvious way of doing it. The field is a list, so it can hold multiple items: for example, one which automatically updates with every release, and a “static” one which points at a publication.
Here’s an example:
from datetime import date

__version__ = "0.2.3"

__cite__ = [
    {
        "URL": "https://www.github.com/clbarnes/citepy",
        "abstract": "Automatically create citations for packages",
        "accessed": {
            "date-parts": [
                list(date.today().timetuple()[:3])
            ]
        },
        "author": [
            {
                "given": "Chris L.",
                "family": "Barnes"
            }
        ],
        "categories": [
            "software",
            "python",
            "libraries",
            "pypi"
        ],
        "id": "citepy",
        "issued": {
            "date-parts": [
                [2019, 5, 28]  # release date
            ]
        },
        "original-date": {
            "date-parts": [
                [2019, 5, 25]  # first release
            ]
        },
        "publisher": "GitHub",
        "title": "citepy",
        "type": "webpage",
        "version": __version__
    }
]

__author__ = "{given} {family}".format(**__cite__[0]["author"][0])
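For illustration, the "look for the field, then fall back" behaviour of a referencer might look something like the sketch below. The function name `citations_for` and the demo module names are hypothetical, and the fallback assumes the import name matches the installed distribution name, which is not always true. It reads locally installed metadata via `importlib.metadata` rather than querying PyPI.

```python
import types
from importlib import metadata


def citations_for(module):
    """Return CSL-data-style items for an already-imported module.

    Prefers the proposed ``__cite__`` attribute; otherwise builds a rough
    "webpage" item from whatever installed-package metadata is available.
    """
    cite = getattr(module, "__cite__", None)
    if cite:
        return list(cite)

    # Assumption: the top-level import name is also the distribution name.
    name = module.__name__.split(".")[0]
    item = {"id": name, "title": name, "type": "webpage"}
    try:
        meta = metadata.metadata(name)
    except metadata.PackageNotFoundError:
        meta = None

    if meta is not None:
        if meta.get("Summary"):
            item["abstract"] = meta.get("Summary")
        if meta.get("Home-page"):
            item["URL"] = meta.get("Home-page")
        if meta.get("Author"):
            # One literal string: the imprecision that __cite__ would avoid.
            item["author"] = [{"literal": meta.get("Author")}]
        if meta.get("Version"):
            item["version"] = meta.get("Version")
    elif hasattr(module, "__version__"):
        item["version"] = module.__version__
    return [item]


# Two throwaway modules standing in for real packages.
with_cite = types.ModuleType("hypothetical_pkg_with_cite")
with_cite.__cite__ = [{"id": "demo", "title": "demo", "type": "webpage"}]

without_cite = types.ModuleType("hypothetical_pkg_without_cite")
without_cite.__version__ = "1.0"

explicit = citations_for(with_cite)     # taken straight from __cite__
fallback = citations_for(without_cite)  # assembled from what little is known
```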
This would need a bit more downstream tooling to make it valuable: citeproc-py (for converting that data into other formats) is dead, and there is some sort of disagreement between the CSL-data JSON schema and Python’s jsonschema implementation. Hand-writing and validating JSON-like structures is obviously a pain, but citepy has some convenience classes for that purpose.
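To give a feel for how little is needed for the simplest kind of downstream conversion, here is a rough sketch that renders one CSL-data-style item as a BibTeX “misc” entry. The function name `csl_to_bibtex_misc` is hypothetical (it is not part of citepy or citeproc-py), only a handful of fields are handled, and no escaping of special BibTeX characters is attempted.

```python
def csl_to_bibtex_misc(item):
    """Render one CSL-data-style dict as a BibTeX @misc entry (rough sketch)."""
    names = []
    for person in item.get("author", []):
        if "family" in person:
            name = person["family"]
            if person.get("given"):
                name += ", " + person["given"]
            names.append(name)
        elif "literal" in person:
            # Extra braces stop BibTeX from splitting an unstructured name.
            names.append("{" + person["literal"] + "}")

    # CSL dates are nested lists: [[year, month, day]].
    date_parts = item.get("issued", {}).get("date-parts", [])
    year = date_parts[0][0] if date_parts and date_parts[0] else None

    fields = {
        "title": item.get("title"),
        "author": " and ".join(names),
        "year": year,
        "url": item.get("URL"),
        "note": "Version {}".format(item["version"]) if item.get("version") else None,
    }
    # Skip anything missing or empty.
    lines = ["  {} = {{{}}},".format(key, value) for key, value in fields.items() if value]
    return "@misc{" + str(item.get("id", "unknown")) + ",\n" + "\n".join(lines) + "\n}"


example_item = {
    "URL": "https://www.github.com/clbarnes/citepy",
    "author": [{"given": "Chris L.", "family": "Barnes"}],
    "id": "citepy",
    "issued": {"date-parts": [[2019, 5, 28]]},
    "title": "citepy",
    "type": "webpage",
    "version": "0.2.3",
}
bibtex_entry = csl_to_bibtex_misc(example_item)
```

Because the author is stored as separate given and family names, the converter can emit the “Family, Given” form BibTeX expects without guessing where the name splits.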
This doesn’t require any code, just a PEP establishing the standard. Do you have any thoughts on the idea, and on whether it’s even valuable enough to warrant a PEP?