I’m sure most would agree that the PyPI.org web interface is an invaluable entry point to the Python Package Index, offering essential search and browsing capabilities. It is so useful that I often found myself wishing for the same search and browse functionality for the Python package repository at my place of work. Unfortunately, warehouse, the software powering PyPI, isn’t intended to be used for running a general package repository, as its primary purpose is to serve the robust operational needs of PyPI.org itself.
However, it turns out that the evolutionary improvements to the simple repository interface, originally standardised in PEP-503 and refined in PEP-658/714 (serving metadata separately from the package), PEP-691 (JSON interface), and PEP-700 (additional metadata), offer interesting functionality beyond their original goal of improving package installation.
With all of the aforementioned PEPs backing a repository, accessing metadata of projects becomes cheap, and there is (almost) as much information in the resulting simple endpoints as there is on PyPI.org. Add a crawler of interesting projects into the mix, and one could implement the search and browse functionality of PyPI.org with nothing more than a simple-repository interface as a data source, and without access to the underlying PyPI database.
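To illustrate why these PEPs make metadata access cheap, here is a sketch of reading a PEP-691 JSON simple-index response. The payload below is an abridged, made-up example; a real response comes from a repository (e.g. `GET https://pypi.org/simple/<project>/` with the header `Accept: application/vnd.pypi.simple.v1+json`):

```python
import json

# Abridged, illustrative PEP-691 response (the project name, URL and hashes
# here are placeholders, not real data).
sample = json.loads("""
{
  "meta": {"api-version": "1.1"},
  "name": "example-project",
  "versions": ["1.0", "1.1"],
  "files": [
    {
      "filename": "example_project-1.1-py3-none-any.whl",
      "url": "https://files.example.invalid/example_project-1.1-py3-none-any.whl",
      "hashes": {"sha256": "..."},
      "core-metadata": {"sha256": "..."}
    }
  ]
}
""")

# PEP-700 adds the "versions" key; PEP-658/714 availability is signalled per
# file by the "core-metadata" key (renamed by PEP-714 from
# "dist-info-metadata"). When present, the core metadata can be fetched from
# the file URL with ".metadata" appended - no need to download the dist.
for file_info in sample["files"]:
    if file_info.get("core-metadata"):
        metadata_url = file_info["url"] + ".metadata"
        print(file_info["filename"], "->", metadata_url)
```

This per-file `.metadata` URL is what makes it feasible to build rich project pages on top of a simple repository alone.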
In fact, at CERN we prototyped such an interface, and our implementation has now been running solidly for nearly 2 years, giving us powerful search and discoverability of our internal repository, and offering a single entry-point for our Python community. In July we released the source under the MIT license and presented it at EuroPython, and we recently published a paper and poster on the topic at the International Conference on Accelerator and Large Experimental Physics Control Systems (ICALEPCS).
In this tech-preview instance, you may notice that you can browse packages that don’t yet have PEP-658/714 metadata on PyPI (including sdist-only projects). I will go into much more detail in another post, but this stems from the fact that we have developed a standards-compliant simple repository server as a separate project, and have architected it such that we can re-use key components in the simple-repository-browser. In short, you can point simple-repository-browser at the most basic PEP-503 compliant repository (or even just a directory of dists grouped by project name), and the browser can internally enhance the repository (e.g. by extracting the metadata from the dists) on the fly if it isn’t provided by the repository. If you go looking at old releases of projects with big artefacts (e.g. tensorflow), you might find that we have to do on-the-fly metadata extraction because PyPI doesn’t (yet) serve it, and this can take up to 30 seconds since we have to download the dist on the server. Once it has been extracted, we cache the metadata for faster future lookup (but discard the dist itself for space-efficiency reasons).
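For the wheel case, the on-the-fly extraction step can be sketched in a few lines: a wheel is a zip archive whose core metadata lives in the `<name>-<version>.dist-info/METADATA` member, so it can be read out without installing anything. This is a simplified illustration, not the actual simple-repository code, and the function name is hypothetical:

```python
import io
import zipfile


def extract_wheel_metadata(wheel_bytes: bytes) -> str:
    """Return the core metadata (METADATA file) contained in a wheel.

    Sketch of on-the-fly extraction: wheels are zip archives, and the
    core metadata is stored at <name>-<version>.dist-info/METADATA.
    """
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        for member in zf.namelist():
            if member.endswith(".dist-info/METADATA"):
                return zf.read(member).decode("utf-8")
    raise ValueError("no METADATA file found in wheel")
```

A real implementation also has to handle sdists (tarballs with PKG-INFO) and pathological archives, which is where most of the complexity lives.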
Another feature of this tech-preview is the search functionality. Unless the repository is pre-indexed, you will only be able to search by project name. In this instance, we have crawled the top 100 PyPI projects and their transitive dependencies (~2000 projects), which means that for those projects you get search based on the project summary as well as its name (e.g. search for array and see that some of the results don’t have array in their project name). Notice that we do on-the-fly indexing, such that once you’ve visited a project that hasn’t previously been indexed, it can then also be searched for by summary.
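The on-the-fly indexing behaviour can be sketched as follows. This is a toy in-memory index, not the real implementation; the class name and the example summaries are illustrative only:

```python
class ProjectIndex:
    """Toy sketch of visit-time indexing: a project enters the index the
    first time its page is visited, after which summary search finds it."""

    def __init__(self) -> None:
        self._index: dict[str, str] = {}  # project name -> summary

    def visit(self, name: str, summary: str) -> None:
        # Called when a project page is rendered for the first time.
        self._index.setdefault(name, summary)

    def search(self, term: str) -> list[str]:
        # Match against both the project name and the indexed summary.
        term = term.lower()
        return sorted(
            name
            for name, summary in self._index.items()
            if term in name.lower() or term in summary.lower()
        )


idx = ProjectIndex()
idx.visit("numpy", "Fundamental package for array computing in Python")
idx.visit("requests", "Python HTTP for Humans.")
print(idx.search("array"))  # -> ['numpy'] (matched via the summary)
```

A production index would of course use a proper text index (and persist it), but the visit-then-search flow is the same.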
To re-iterate the key points:
- simple-repository-browser will work just as well against any PEP-503 compliant repository, no matter the software actually running it (devpi, Artifactory, Sonatype Nexus, bandersnatch + python -m http.server, etc.).
- If you have a repository which serves PEP-658/714 and/or PEP-700 metadata, simple-repository-browser will have to do less work to obtain a project’s metadata.
- It can run without internet (e.g. it works in air-gapped environments), and just needs access to the simple-repository via http/file.
- Indexing of projects takes place when a project page is visited for the first time. There is also a basic crawler implemented to systematically index projects (e.g. the top 500 PyPI projects).
- Search by name is available for non-indexed projects; search by summary (and, in the future, classifiers, description, etc.) for indexed projects.
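As a sketch of the "directory of dists" case mentioned above: a minimal PEP-503-style index can be generated for a `<root>/<project>/<files>` layout with a few lines of Python. This is a hypothetical helper, not part of simple-repository-browser; a production index would also need normalised project names and hash fragments on the file links:

```python
import html
from pathlib import Path


def write_pep503_index(repo_root: Path) -> None:
    """Generate minimal PEP-503-style index pages for a directory laid out
    as <repo_root>/<project>/<dist files>.

    Sketch only: real PEP-503 requires normalised project names and
    recommends #<hash>=... fragments on the file links.
    """
    projects = sorted(p for p in repo_root.iterdir() if p.is_dir())

    # Root index: one anchor per project directory.
    root_links = "\n".join(
        f'<a href="{p.name}/">{html.escape(p.name)}</a><br/>' for p in projects
    )
    (repo_root / "index.html").write_text(
        f"<!DOCTYPE html><html><body>\n{root_links}\n</body></html>"
    )

    # Per-project index: one anchor per dist file.
    for project in projects:
        file_links = "\n".join(
            f'<a href="{f.name}">{html.escape(f.name)}</a><br/>'
            for f in sorted(project.iterdir())
            if f.is_file()
        )
        (project / "index.html").write_text(
            f"<!DOCTYPE html><html><body>\n{file_links}\n</body></html>"
        )
```

Serve the resulting tree with python -m http.server and you have a repository that simple-repository-browser can point at, enriching the metadata on the fly as described above.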
Internally, we have specialised the interface for our own needs (e.g. we have authentication, and an interface to manage yanking (PEP-592) through an external API). Maintaining a base simple-repository-browser interface which can be generically extended for these specialisations isn’t cost-free, and so before committing to move simple-repository-browser from a one-off prototype to an openly developed project, we are seeking feedback to understand the interest and potential impact of the project. Specifically, we would be interested to know if there are others who might be interested in running this interface for their own package repositories. If so, how likely is it that you would contribute enhancements to the project in the future?
Whilst the implementation needs some refinement (we know the back-end needs work, and we definitely aren’t front-end specialists), we think that this project is applicable in many places, and so we are looking to validate that assumption before making any commitments in terms of maintaining this as an open-source project.
We are all eager to hear your feedback and thoughts,