Pre-PEP: Optional search endpoint for the Simple Repository API

uranusjr · December 19, 2020, 2:36pm

So with XML-RPC likely to stay disabled for a long time and eventually go away, here’s a low-effort idea for a replacement.

The Simple Repository API will define an optional endpoint /_/search/. The endpoint will accept one GET query argument q that contains a search term. How a tool interprets the search term is up to the implementer. The response should be a UTF-8 encoded HTML5 page. Each tbody > tr row in it represents a project entry. Each entry row should contain at least one cell containing the package name. An optional second td contains the Summary field of the latest version. If multiple distributions of the latest version are available, the implementation is free to choose any of them to read Summary from.
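To make the proposal concrete, here is a minimal client-side sketch of how a tool might build the query URL and read the response. It assumes the endpoint lives under the index root and that cells are plain td elements with explicit closing tags; the names search_url, SearchResultParser, and parse_search_results, the example.org URL, and the sample HTML are all made up for illustration and are not part of any existing tool.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def search_url(index_url, term):
    """Build the query URL, assuming /_/search/ sits under the index root."""
    return index_url.rstrip("/") + "/_/search/?" + urlencode({"q": term})


class SearchResultParser(HTMLParser):
    """Collect (name, summary) pairs from tbody > tr rows."""

    def __init__(self):
        super().__init__()
        self.results = []       # list of (name, summary or None)
        self._in_tbody = False
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                # First cell is the project name, optional second is Summary.
                summary = self._row[1] if len(self._row) > 1 else None
                self.results.append((self._row[0], summary))
            self._row = None
        elif tag == "tbody":
            self._in_tbody = False


def parse_search_results(html_text):
    parser = SearchResultParser()
    parser.feed(html_text)
    parser.close()
    return parser.results


# Hypothetical response body, for illustration only.
SAMPLE = """<!DOCTYPE html>
<html><body><table><tbody>
<tr><td>requests</td><td>Python HTTP for Humans.</td></tr>
<tr><td>requests-toolbelt</td></tr>
</tbody></table></body></html>"""

results = parse_search_results(SAMPLE)
```

The parser only relies on the standard library, which fits the "low-effort" goal; a real client would also need to fetch the page and handle rows with omitted closing tags.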

Opinions? I can turn this into a PEP if it sounds like a reasonable idea.

pf_moore · December 19, 2020, 3:04pm

I mildly dislike this, because I prefer the fact that the simple API is currently defined in a way that can be served completely statically. I think search fits better in the JSON API, which I believe someone is looking at standardising.

domdfcoding · December 19, 2020, 5:43pm

The idea I had when this issue first came up was to have an endpoint which lists the package name and summary in the same kind of layout you’re suggesting, but for all packages, like /simple does. That way the search can be performed locally, after one request to PyPI. The response can be cached, potentially on both ends. I think this is similar to how apt-cache search works – I’m sure that’s all client side.

I have no idea how much more load generating that page would put on the PyPI servers compared to /simple, but surely it would be less than doing all searches server side?

Just a thought.
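A rough sketch of what the local search could look like once the full name/summary listing has been downloaded and cached. The function name search_local, the substring matching rule, and the sample entries are assumptions for illustration; the thread does not specify how matching should work.

```python
def search_local(entries, term):
    """Return entries whose name or summary contains term (case-insensitive)."""
    needle = term.casefold()
    return [
        (name, summary)
        for name, summary in entries
        if needle in name.casefold() or needle in (summary or "").casefold()
    ]


# Stand-in for a cached copy of the listing; summaries may be missing.
CACHED = [
    ("requests", "Python HTTP for Humans."),
    ("flask", "A simple framework for building complex web applications."),
    ("attrs", None),
]

http_hits = search_local(CACHED, "http")
attr_hits = search_local(CACHED, "ATTR")
```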

westurner · December 19, 2020, 6:13pm

Does pip search need to change?
https://github.com/pypa/pip/blob/master/src/pip/_internal/commands/search.py

If you’re putting data in HTML, it might as well be RDFa (because there are many standardized parsers) instead of ad-hoc data attributes.

Loading a PyPI-catalog-worth of JSON[-LD] into RAM is going to be slow.
Wouldn’t it make more sense to just ship a SQLite database [over rsync over SSH over HTTPS], …, do we have technology for handling cryptographically-signed p2p data replication (… “CT Certificate Transparency blockchain”) that scales without a single point of failure … status quo: CDN, cached JSON API, ElasticSearch, TUF PKI keychain and catalog to dist

“Regular dump of PyPI database”
https://github.com/pypa/warehouse/issues/1478#issuecomment-373050940

“Add API endpoint to get latest version of all projects”
https://github.com/pypa/warehouse/issues/347

kpfleming · December 19, 2020, 7:17pm

This is correct, but that is because APT is able to download a pre-built set of repository metadata from the server, and knows (via timestamps) when the local metadata is out of date. At the moment ‘simple’ Python repositories don’t have any metadata available (just directory listings), and adding metadata involves ensuring that it is updated atomically (as new packages are uploaded). It is not practical to ‘cache’ the directory listings as if they were metadata, because the moment a new package is uploaded, the cached copy is out of date and a user can get the wrong result from a search request.

So yes, there can be significant benefits to providing a comprehensive set of metadata for client-side usage (which can also include dependency information) but providing it is a non-trivial thing to do and will definitely not be a ‘simple’ repository service.
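The timestamp idea can be sketched as follows. This only illustrates the comparison a client would make before trusting its local copy; it is not how APT is actually implemented, and CachedMetadata, generated_at, and needs_refresh are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class CachedMetadata:
    generated_at: int                 # server timestamp (seconds) of the snapshot held locally
    entries: list = field(default_factory=list)


def needs_refresh(cache, server_generated_at):
    """True when there is no local copy or the server advertises a newer snapshot."""
    return cache is None or server_generated_at > cache.generated_at


local = CachedMetadata(generated_at=1_608_400_000, entries=[("pip", "20.3.3")])
```

The hard part described above is on the server: it has to publish the snapshot and its timestamp atomically, which this sketch does not address.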

westurner · January 2, 2021, 8:22pm

a ‘simple’ repository service.

./pip-20.3.3-py2.py3-none-any.whl
./pip-20.3.3-py2.py3-none-any.whl.json
./index.html  # RDFa and/or JSON-LD

<script type="application/ld+json">
{
  "@context": {
    "schema": "https://schema.org/",
    "url": "https://schema.org/url",
    "name": "https://schema.org/name",
    "pypa": "https://pypa#"
  },
  "@graph": {
    "@type": "schema:CreativeWork",
    "name": "Simple PythonPackageRepository",
    "hasPart": [
      {
        "name": "pip",
        "pypa:_urls_": "./pip-20.3.3-py2.py3-none-any.whl",
        ...
      }
    ]
  }
}
</script>

The directory-listing-with-metadata view should be a generic handler function that can be called from an http.server.SimpleHTTPRequestHandler, a WSGI app, or a CGI script. A POST to something like /_/cache/invalidate could update the cache key prefix for the view (e.g. every time an inotify event for ./* occurs) so that the metadata listing needn’t be recalculated for every request.

Search engines can index the schema.org CreativeWork type and, I believe, also the hasPart property (which isn’t necessary for a simple repo generated from a directory listing).
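A minimal sketch of such a handler function, independent of any web framework. The names list_distributions and invalidate_cache are made up, the project name is naively taken from the part of the filename before the first hyphen, and the output keys simply mirror the JSON-LD example above; none of this is an existing API.

```python
import os
import tempfile

_cache = {}
_cache_generation = 0   # acts as the cache key prefix; bumping it invalidates everything


def invalidate_cache():
    """What a POST to something like /_/cache/invalidate could call."""
    global _cache_generation
    _cache_generation += 1


def list_distributions(directory):
    """Return a JSON-LD-style dict describing the wheels and sdists in directory."""
    key = (_cache_generation, os.path.abspath(directory))
    if key not in _cache:
        files = sorted(
            name for name in os.listdir(directory)
            if name.endswith((".whl", ".tar.gz"))
        )
        _cache[key] = {
            "@type": "schema:CreativeWork",
            "name": "Simple PythonPackageRepository",
            "hasPart": [
                {"name": f.split("-")[0], "pypa:_urls_": "./" + f} for f in files
            ],
        }
    # Note: entries from older generations are never evicted in this sketch.
    return _cache[key]


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as repo:
    open(os.path.join(repo, "pip-20.3.3-py2.py3-none-any.whl"), "w").close()
    first = list_distributions(repo)
    open(os.path.join(repo, "example-1.0.tar.gz"), "w").close()
    cached = list_distributions(repo)   # still the old listing
    invalidate_cache()
    fresh = list_distributions(repo)    # recomputed after invalidation
```

Because the function takes a directory and returns a plain dict, the same code could be wrapped by an http.server handler, a WSGI app, or a CGI script, each serializing the result however it likes.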

westurner · January 2, 2021, 9:35pm

The Codemeta “Crosswalk for Python distutils” lists which setup.py attributes map to schema.org CreativeWork properties:

Property	Python Distutils (PyPI)
codeRepository	url
programmingLanguage	classifiers[‘Programming Language’]
applicationCategory	classifiers[‘Topic’]
operatingSystem	classifiers[‘Operating System’]
softwareRequirements	install_requires
keywords	keywords
license	license
version	Version
description	description, long_description
name	name
email	author_email
developmentStatus	classifiers[‘Development Status’]
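As a rough illustration, the crosswalk can be applied mechanically to a dict of setup() keyword arguments. The helper to_codemeta and the sample package are hypothetical; the sketch also assumes the lowercase version keyword, prefers description over long_description when both are present, and treats classifiers as 'Prefix :: value' strings.

```python
# Direct field mappings taken from the crosswalk table.
CROSSWALK = {
    "codeRepository": ["url"],
    "softwareRequirements": ["install_requires"],
    "keywords": ["keywords"],
    "license": ["license"],
    "version": ["version"],
    "description": ["description", "long_description"],
    "name": ["name"],
    "email": ["author_email"],
}

# Properties derived from trove classifiers with a given prefix.
CLASSIFIER_CROSSWALK = {
    "programmingLanguage": "Programming Language",
    "applicationCategory": "Topic",
    "operatingSystem": "Operating System",
    "developmentStatus": "Development Status",
}


def to_codemeta(setup_kwargs):
    """Map setup() keyword arguments to Codemeta/schema.org property names."""
    result = {}
    for prop, fields in CROSSWALK.items():
        for field_name in fields:
            if setup_kwargs.get(field_name):
                result[prop] = setup_kwargs[field_name]
                break   # first non-empty source field wins
    for prop, prefix in CLASSIFIER_CROSSWALK.items():
        values = [
            c.split(" :: ", 1)[1]
            for c in setup_kwargs.get("classifiers", [])
            if c.startswith(prefix + " :: ")
        ]
        if values:
            result[prop] = values
    return result


example = to_codemeta({
    "name": "example-pkg",
    "version": "1.0",
    "url": "https://example.org/example-pkg",
    "classifiers": [
        "Programming Language :: Python :: 3",
        "Development Status :: 4 - Beta",
    ],
})
```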

FWIU, Google, for example, does store most schema.org data, but there are no display cards or special support for most properties; there is, though, a search engine for the schema.org Dataset type.

I think there’s a strong case to be made for supporting Codemeta in order to support search across many ScholarlyArticle, SoftwareApplication, and SoftwareSourceCode resources with a consistent schema.
