Pre-PEP: Optional search endpoint for the Simple Repository API

With the XML-RPC API disabled, likely for a long time, and eventually going away, here’s a low-effort idea for a replacement.

The Simple Repository API will define an optional endpoint, /_/search/. The endpoint will accept one GET query argument, q, that contains a search term. How the search term is interpreted is up to the implementation. The response should be a UTF-8 encoded HTML5 page. Each tbody > tr row in it represents a project entry. Each entry row should contain at least one td cell containing the package name. An optional second td contains the Summary field of the latest version. If multiple distributions of the latest version are available, the implementation is free to choose any of them to read Summary from.
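To make the proposed response format concrete, here is a minimal client-side sketch using only the standard library. It assumes the package name is in the first td of each row, the Summary in the second, and that the server writes explicit closing tags (HTML5 allows omitting them, which this sketch does not handle). The names SearchResultParser and parse_search_results are made up for illustration.

```python
from html.parser import HTMLParser


class SearchResultParser(HTMLParser):
    """Collect (name, summary) pairs from each <tbody> <tr> row."""

    def __init__(self):
        super().__init__()
        self.results = []       # finished (name, summary) tuples
        self._in_tbody = False
        self._row = None        # cell texts of the row being read
        self._cell = None       # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                # Assumption: first cell is the name, second the Summary.
                name = self._row[0]
                summary = self._row[1] if len(self._row) > 1 else None
                self.results.append((name, summary))
            self._row = None
        elif tag == "tbody":
            self._in_tbody = False


def parse_search_results(html_text):
    """Return a list of (name, summary-or-None) tuples."""
    parser = SearchResultParser()
    parser.feed(html_text)
    parser.close()
    return parser.results
```

A client would fetch /_/search/?q=term, decode the body as UTF-8, and pass the text to parse_search_results.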

Opinions? I can turn this into a PEP if it sounds like a reasonable idea.

I mildly dislike this, because I prefer the fact that the simple API is currently defined in a way that can be served completely statically. I think search fits better in the JSON API, which I believe someone is looking at standardising.

The idea I had when this issue first came up was to have an endpoint which lists the package name and summary in the same kind of layout you’re suggesting, but for all packages, like /simple does. That way the search can be performed locally, after one request to PyPI. The response can be cached, potentially on both ends. I think this is similar to how apt-cache search works – I’m sure that’s all client side.

I have no idea how much more load generating that page would put on the PyPI servers compared to /simple, but surely it would be less than doing all searches server side?
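As a rough sketch of what "search performed locally" could look like once the full listing has been downloaded and cached: the function below does a case-insensitive substring match over (name, summary) pairs. The function name and the shape of the data are assumptions for illustration, not anything pip or PyPI provides.

```python
def search_listing(entries, term):
    """Return entries whose name or summary contains the term.

    `entries` is an iterable of (name, summary) pairs; summary may be None.
    Matching is a plain case-insensitive substring test.
    """
    needle = term.casefold()
    return [
        (name, summary)
        for name, summary in entries
        if needle in name.casefold() or needle in (summary or "").casefold()
    ]
```

Anything smarter (ranking, tokenising, fuzzy matching) would be a client decision, which is part of the appeal of doing it locally.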

Just a thought.

Does pip search need to change?
https://github.com/pypa/pip/blob/master/src/pip/_internal/commands/search.py

If you’re putting data in HTML, it might as well be RDFa (because there are many standardized parsers) instead of ad-hoc data attributes.

Loading a PyPI-catalog-worth of JSON[-LD] into RAM is going to be slow.
Wouldn’t it make more sense to just ship a SQLite database [over rsync over SSH over HTTPS]? … Do we have technology for handling cryptographically signed p2p data replication (… “CT Certificate Transparency blockchain”) that scales without a single point of failure? … Status quo: CDN, cached JSON API, ElasticSearch, TUF PKI keychain and catalog to dist

“Regular dump of PyPI database”
https://github.com/pypa/warehouse/issues/1478#issuecomment-373050940

“Add API endpoint to get latest version of all projects”
https://github.com/pypa/warehouse/issues/347
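For the SQLite idea, a minimal sketch of what a client could do with such a database. The table name, columns, and function names here are invented for illustration; no such dump format exists as far as this thread establishes, and the example builds the database in memory instead of downloading one.

```python
import sqlite3


def build_index(entries):
    """Create an in-memory database from (name, summary) pairs."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per project.
    conn.execute("CREATE TABLE projects (name TEXT PRIMARY KEY, summary TEXT)")
    conn.executemany("INSERT INTO projects VALUES (?, ?)", entries)
    conn.commit()
    return conn


def search(conn, term):
    """Substring search on name and summary, ordered by name.

    SQLite's LIKE is case-insensitive for ASCII by default. Note that
    '%' and '_' inside `term` act as wildcards in this sketch.
    """
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT name, summary FROM projects "
        "WHERE name LIKE ? OR summary LIKE ? ORDER BY name",
        (pattern, pattern),
    )
    return rows.fetchall()
```

The query runs against the file on disk rather than a catalog loaded into RAM, which is the point of the suggestion.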

This is correct, but that is because APT is able to download a pre-built set of repository metadata from the server, and knows (via timestamps) when the local metadata is out of date. At the moment ‘simple’ Python repositories don’t have any metadata available (just directory listings), and adding metadata involves ensuring that it is updated atomically (as new packages are uploaded). It is not practical to ‘cache’ the directory listings as if they were metadata, because the moment a new package is uploaded, the cached copy is out of date and a user can get the wrong result from a search request.

So yes, there can be significant benefits to providing a comprehensive set of metadata for client-side usage (which can also include dependency information) but providing it is a non-trivial thing to do and will definitely not be a ‘simple’ repository service.
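To illustrate the two requirements mentioned above (atomic updates on the server side, timestamp-based staleness checks on the client side), here is a small sketch. The file layout, the generated_at field, and both function names are assumptions made up for this example; it is not how APT or any existing Python index works.

```python
import json
import os
import tempfile
import time


def publish_metadata(path, projects, now=None):
    """Write the metadata file so readers never see a partial update.

    The document is written to a temporary file in the same directory
    and then renamed over the target, which is atomic on one filesystem.
    """
    document = {
        "generated_at": int(time.time() if now is None else now),
        "projects": projects,
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(document, handle)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # do not leave a half-written temp file behind
        raise
    return document["generated_at"]


def is_stale(local_generated_at, remote_generated_at):
    """True when the client has no copy or an older copy than the server."""
    return local_generated_at is None or local_generated_at < remote_generated_at
```

The hard part is not this code but making sure publish_metadata runs on every upload, which is exactly why it stops being a ‘simple’ repository.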

a ‘simple’ repository service.

./pip-20.3.3-py2.py3-none-any.whl
./pip-20.3.3-py2.py3-none-any.whl.json
./index.html  # RDFa and/or JSON-LD
<script type="application/ld+json">
{"@context":{
    "schema": "https://schema.org/",
    "url": "https://schema.org/url",
    "name": "https://schema.org/name",
    "pypa": "https://pypa#"
    },
 "@graph": {
    "@type": "schema:CreativeWork",
    "name": "Simple PythonPackageRepository",
    "hasPart": [
       {"name":"pip",
        "pypa:_urls_": "./pip-20.3.3-py2.py3-none-any.whl",
        ...
},
    ]
  }
}
</script>

The directory listing with metadata view should be a generic handler function that can be called from an http.server.SimpleHTTPRequestHandler, a WSGI app, or a CGI script. A POST to something like /_/cache/invalidate could update the cache key prefix for the view (e.g. every time an inotify event for ./* occurs) so that the metadata listing needn’t be recalculated for every request.

Search engines can index the schema.org CreativeWork type and, I believe, also the schema.org hasPart property (which isn’t necessary for a simple repo generated from a directory listing).

The Codemeta “Crosswalk for Python distutils” lists which setup.py attributes map to schema.org CreativeWork properties:

Property              Python Distutils (PyPI)
--------------------  -------------------------------------
codeRepository        url
programmingLanguage   classifiers['Programming Language']
applicationCategory   classifiers['Topic']
operatingSystem       classifiers['Operating System']
softwareRequirements  install_requires
keywords              keywords
license               license
version               Version
description           description, long_description
name                  name
email                 author_email
developmentStatus     classifiers['Development Status']
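As an illustration of how that crosswalk could be applied, here is a small sketch that turns setup()-style keyword arguments into a dict keyed by the properties in the table. The function name and the handling of classifiers (splitting on the first " :: ") are assumptions for this example; it is not an official Codemeta tool, and it emits no @context.

```python
# Direct attribute-to-property mappings taken from the table above.
CROSSWALK = {
    "url": "codeRepository",
    "install_requires": "softwareRequirements",
    "keywords": "keywords",
    "license": "license",
    "version": "version",
    "name": "name",
    "author_email": "email",
}

# Trove classifier prefixes and the property each one feeds.
CLASSIFIER_CROSSWALK = {
    "Programming Language": "programmingLanguage",
    "Topic": "applicationCategory",
    "Operating System": "operatingSystem",
    "Development Status": "developmentStatus",
}


def to_codemeta(setup_kwargs):
    """Map setup() keyword arguments onto the crosswalk's property names."""
    doc = {}
    for attribute, prop in CROSSWALK.items():
        if attribute in setup_kwargs:
            doc[prop] = setup_kwargs[attribute]
    # The table lists both description and long_description; prefer the short one.
    description = setup_kwargs.get("description") or setup_kwargs.get("long_description")
    if description:
        doc["description"] = description
    for classifier in setup_kwargs.get("classifiers", []):
        prefix, _, value = classifier.partition(" :: ")
        prop = CLASSIFIER_CROSSWALK.get(prefix)
        if prop and value:
            doc.setdefault(prop, []).append(value)
    return doc
```

For example, a classifier such as "Programming Language :: Python :: 3" ends up as the value "Python :: 3" under programmingLanguage.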

FWIU, Google, for example, does store most schema.org data, but there are no display cards or special support for many or most properties; there is, though, a search engine for the schema.org Dataset type.

I think there’s a strong case to be made for supporting Codemeta in order to support search across many ScholarlyArticle, SoftwareApplication, and SoftwareSourceCode resources with a consistent schema.