I’m sure most would agree that the PyPI.org web interface is an invaluable entry point to the Python Package Index, offering essential search and browsing capabilities. It is so useful that I often found myself wishing for the same search and browse functionality for the Python package repository at my place of work. Unfortunately, warehouse, the software powering PyPI, isn’t intended to be used for running a general package repository, as its primary purpose is to serve the robust operational needs of PyPI.org itself.
However, it turns out that the evolutionary improvements to the simple repository interface, originally standardised in PEP-503 and refined in PEP-658/714 (serving metadata separately from the package), PEP-691 (JSON interface), and PEP-700 (additional metadata), offer interesting functionality beyond their original goal of improving package installation.
With all of the aforementioned PEPs backing a repository, accessing metadata of projects becomes cheap, and there is (almost) as much information in the resulting simple endpoints as there is on PyPI.org. Add a crawler of interesting projects into the mix, and one could implement the search and browse functionality of PyPI.org with nothing more than a simple-repository interface as a data source, and without access to the underlying PyPI database.
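To illustrate why these PEPs make metadata access cheap, here is a sketch of reading a PEP-691 JSON simple-index response. The payload below is an abridged, made-up example; a real response comes from a repository (e.g. `GET https://pypi.org/simple/<project>/` with the header `Accept: application/vnd.pypi.simple.v1+json`):

```python
import json

# Abridged, illustrative PEP-691 response (the project name, URL and hashes
# here are placeholders, not real data).
sample = json.loads("""
{
  "meta": {"api-version": "1.1"},
  "name": "example-project",
  "versions": ["1.0", "1.1"],
  "files": [
    {
      "filename": "example_project-1.1-py3-none-any.whl",
      "url": "https://files.example.invalid/example_project-1.1-py3-none-any.whl",
      "hashes": {"sha256": "..."},
      "core-metadata": {"sha256": "..."}
    }
  ]
}
""")

# PEP-700 adds the "versions" key; PEP-658/714 availability is signalled per
# file by the "core-metadata" key (renamed by PEP-714 from
# "dist-info-metadata"). When present, the core metadata can be fetched from
# the file URL with ".metadata" appended - no need to download the dist.
for file_info in sample["files"]:
    if file_info.get("core-metadata"):
        metadata_url = file_info["url"] + ".metadata"
        print(file_info["filename"], "->", metadata_url)
```

This per-file `.metadata` URL is what makes it feasible to build rich project pages on top of a simple repository alone.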
In fact, at CERN we prototyped such an interface, and our implementation has now been running solidly for nearly 2 years, giving us powerful search and discoverability of our internal repository, and offering a single entry-point for our Python community. In July we released the source under the MIT license and presented it at EuroPython, and we recently published a paper and poster on the topic at the International Conference on Accelerator and Large Experimental Physics Control Systems (ICALEPCS).
In this tech-preview instance, you may notice that you can browse packages that don’t yet have PEP-658/714 metadata on PyPI (including sdist-only projects). I will go into much more detail in another post, but this stems from the fact that we have developed a standards-compliant simple repository server as a separate project, and have architected it such that we can re-use key components in the simple-repository-browser. In short, you can point simple-repository-browser at the most basic PEP-503 compliant repository (or even just a directory of dists grouped by project name), and the browser can internally enhance the repository (e.g. by extracting the metadata from the dists) on the fly if it isn’t provided by the repository. If you go looking at old releases of projects with big artefacts (e.g. tensorflow), you might find that we have to do on-the-fly metadata extraction because PyPI doesn’t (yet) serve it, and this can take up to 30 seconds since we have to download the dist on the server. Once it has been extracted, we cache the metadata for faster future lookup (but discard the dist itself for space-efficiency reasons).
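For the wheel case, the on-the-fly extraction step can be sketched in a few lines: a wheel is a zip archive whose core metadata lives in the `<name>-<version>.dist-info/METADATA` member, so it can be read out without installing anything. This is a simplified illustration, not the actual simple-repository code, and the function name is hypothetical:

```python
import io
import zipfile


def extract_wheel_metadata(wheel_bytes: bytes) -> str:
    """Return the core metadata (METADATA file) contained in a wheel.

    Sketch of on-the-fly extraction: wheels are zip archives, and the
    core metadata is stored at <name>-<version>.dist-info/METADATA.
    """
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        for member in zf.namelist():
            if member.endswith(".dist-info/METADATA"):
                return zf.read(member).decode("utf-8")
    raise ValueError("no METADATA file found in wheel")
```

A real implementation also has to handle sdists (tarballs with PKG-INFO) and pathological archives, which is where most of the complexity lives.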
Another feature of this tech-preview is the search functionality. Unless the repository is pre-indexed, you will only be able to search by project name. In this instance, we have crawled the top 100 PyPI projects and their transitive dependencies (~2000 projects), which means that for those projects you get search based on the project summary as well as its name (e.g. search for array and see that some of the results don’t have array in their project name). Notice that we do on-the-fly indexing, such that once you’ve visited a project that hasn’t previously been indexed, it can then also be searched for by summary.
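The on-the-fly indexing behaviour can be sketched as follows. This is a toy in-memory index, not the real implementation; the class name and the example summaries are illustrative only:

```python
class ProjectIndex:
    """Toy sketch of visit-time indexing: a project enters the index the
    first time its page is visited, after which summary search finds it."""

    def __init__(self) -> None:
        self._index: dict[str, str] = {}  # project name -> summary

    def visit(self, name: str, summary: str) -> None:
        # Called when a project page is rendered for the first time.
        self._index.setdefault(name, summary)

    def search(self, term: str) -> list[str]:
        # Match against both the project name and the indexed summary.
        term = term.lower()
        return sorted(
            name
            for name, summary in self._index.items()
            if term in name.lower() or term in summary.lower()
        )


idx = ProjectIndex()
idx.visit("numpy", "Fundamental package for array computing in Python")
idx.visit("requests", "Python HTTP for Humans.")
print(idx.search("array"))  # -> ['numpy'] (matched via the summary)
```

A production index would of course use a proper text index (and persist it), but the visit-then-search flow is the same.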
To re-iterate the key points:
- simple-repository-browser will work just as well against any PEP-503 compliant repository, no matter the software actually running it (devpi, Artifactory, Sonatype Nexus, bandersnatch + python -m http.server, etc.).
- If you have a repository which serves PEP-658/714 and/or PEP-700 metadata, simple-repository-browser will have to do less work to obtain a project’s metadata.
- It can run without internet (e.g. it works in air-gapped environments), and just needs access to the simple-repository via http/file.
- Indexing of projects takes place when a project page is visited for the first time. There is also a basic crawler implemented to systematically index projects (e.g. the top 500 PyPI projects).
- Search by name is available for non-indexed projects; search by summary (and, in the future, classifiers, description, etc.) for indexed projects.
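As a sketch of the "directory of dists" case mentioned above: a minimal PEP-503-style index can be generated for a `<root>/<project>/<files>` layout with a few lines of Python. This is a hypothetical helper, not part of simple-repository-browser; a production index would also need normalised project names and hash fragments on the file links:

```python
import html
from pathlib import Path


def write_pep503_index(repo_root: Path) -> None:
    """Generate minimal PEP-503-style index pages for a directory laid out
    as <repo_root>/<project>/<dist files>.

    Sketch only: real PEP-503 requires normalised project names and
    recommends #<hash>=... fragments on the file links.
    """
    projects = sorted(p for p in repo_root.iterdir() if p.is_dir())

    # Root index: one anchor per project directory.
    root_links = "\n".join(
        f'<a href="{p.name}/">{html.escape(p.name)}</a><br/>' for p in projects
    )
    (repo_root / "index.html").write_text(
        f"<!DOCTYPE html><html><body>\n{root_links}\n</body></html>"
    )

    # Per-project index: one anchor per dist file.
    for project in projects:
        file_links = "\n".join(
            f'<a href="{f.name}">{html.escape(f.name)}</a><br/>'
            for f in sorted(project.iterdir())
            if f.is_file()
        )
        (project / "index.html").write_text(
            f"<!DOCTYPE html><html><body>\n{file_links}\n</body></html>"
        )
```

Serve the resulting tree with python -m http.server and you have a repository that simple-repository-browser can point at, enriching the metadata on the fly as described above.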
Internally, we have specialised the interface for our own needs (e.g. we have authentication, and an interface to manage yanking (PEP-592) through an external API). Maintaining a base simple-repository-browser interface which can be generically extended for these specialisations isn’t cost-free, and so before committing to move simple-repository-browser from a one-off prototype to an openly developed project, we are seeking feedback to understand the interest and potential impact of the project. Specifically, we would be interested to know if there are others who might be interested in running this interface for their own package repositories. If so, how likely is it that you would contribute enhancements to the project in the future?
Whilst the implementation needs some refinement (we know the back-end needs work, and we definitely aren’t front-end specialists), we think that this project is applicable in many places, and so we are looking to validate that assumption before making any commitments in terms of maintaining this as an open-source project.
We are all eager to hear your feedback and thoughts,