Package database

EpicWink · January 2, 2023, 10:45pm

I’ve seen more ideas for additional package^[1] metadata recently, including security advisories, non-Python dependencies, hardware capabilities, and links to external package managers.

How excited would people be about getting the extrinsic metadata from artefacts (eg dependencies, classifiers) and putting them in a bespoke database?

This metadata would be free to be updated at any point, meaning new metadata could be added and existing metadata could be changed.

Initially, this database would be read through a single query endpoint, so to get one distribution you would need to specify the environment in the query filters (eg project name, version, Python version, OS/manylinux version, etc).
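To make the idea a bit more concrete, here is a rough sketch of how a client might build such a query. The endpoint URL and the filter names (`project`, `version`, `python_version`, `platform`) are made up for illustration; nothing here is specified yet.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: no URL has been proposed, this is a placeholder.
QUERY_ENDPOINT = "https://example.invalid/query"


def build_query_url(project, version, python_version, platform_tag):
    """Build a query URL that narrows the results down to one distribution.

    The filter names are assumptions, not part of any spec.
    """
    filters = {
        "project": project,
        "version": version,
        "python_version": python_version,
        "platform": platform_tag,
    }
    return f"{QUERY_ENDPOINT}?{urlencode(filters)}"


url = build_query_url("requests", "2.28.1", "3.11", "manylinux_2_17_x86_64")
print(url)
```

Which of these filters would be mandatory is one of the open questions.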

Query results would contain all the aforementioned details (extrinsic metadata for now, suggested metadata later), along with a link to the artefact for download (and its verification details).
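For illustration only, a result record could look something like the sketch below. The field names, the `results` wrapper key and the sample values are all assumptions on my part, not a proposed schema.

```python
import json
from dataclasses import dataclass


@dataclass
class QueryResult:
    """One distribution in a query response (hypothetical shape)."""

    project: str
    version: str
    requires_dist: list  # dependencies, as requirement strings
    classifiers: list
    url: str  # link to the artefact for download
    sha256: str  # verification details for the artefact


def parse_results(body):
    """Turn a JSON response body into a list of QueryResult records."""
    return [QueryResult(**item) for item in json.loads(body)["results"]]


# Made-up sample response; not real data from any index.
SAMPLE = json.dumps({
    "results": [
        {
            "project": "example-pkg",
            "version": "1.0.0",
            "requires_dist": ["idna>=3"],
            "classifiers": ["Programming Language :: Python :: 3"],
            "url": "https://example.invalid/example_pkg-1.0.0-py3-none-any.whl",
            "sha256": "0" * 64,
        }
    ]
})

results = parse_results(SAMPLE)
print(results[0].project, results[0].requires_dist)
```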

I expect the query results could be used in place of the simple index API, although I’m not advocating for its deprecation.

One endpoint each for create, update and delete would also exist for writing to the database, although create and delete might not be exposed.

Some things I haven’t figured out:

Would this be part of the simple API? Even optionally? It couldn’t be served through static files, but it would be nice to have this metadata in a distribution-specific endpoint, obsoleting PEP 658.
Who has permission to create/update/delete metadata?
What filters should be required in query requests?
What happens when dependencies are updated and differ from what’s in the artefact? I expect updated dependencies to be more correct.
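On the last point, one possible rule (just a sketch of my expectation, not a decided behaviour) is that values updated in the database take precedence over what the artefact itself declares, with a way to see which fields diverge. The function and field names below are hypothetical.

```python
def effective_metadata(artefact_metadata, database_updates):
    """Return the metadata a client would see.

    Assumption: updated values in the database win over the values
    baked into the artefact. Neither input is modified.
    """
    merged = dict(artefact_metadata)
    merged.update(database_updates)
    return merged


def diverging_fields(artefact_metadata, database_updates):
    """List the fields whose updated value differs from the artefact's."""
    return sorted(
        key
        for key, value in database_updates.items()
        if artefact_metadata.get(key) != value
    )


# Made-up example values.
from_artefact = {"requires_dist": ["idna"], "requires_python": ">=3.7"}
updated = {"requires_dist": ["idna>=3"]}

print(effective_metadata(from_artefact, updated))
print(diverging_fields(from_artefact, updated))
```

An installer could use something like `diverging_fields` to warn when the artefact and the database disagree.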

If people are excited, would you want a PEP?

^[1] Really for distributions and releases.

brettcannon · January 4, 2023, 12:57am

You mean something like sethmlarson/pypi-data on GitHub (data about packages and maintainers on PyPI), but more official?

EpicWink · January 4, 2023, 1:23am

Yes, but the official part would be the (web) API, not the underlying DB^[1]. The API would be JSON, and of course it would be managed and maintained officially, with specs in the packaging docs site.

^[1] I would expect the DB to be more scalable and distributable, likely NoSQL (eg MongoDB), for the ~10000 requests per second it would be handling. In addition, having a single table/collection for the artefacts simplifies the implementation.