PEP 658 - Clarification on how to construct the metadata file URL

This just came up in PEP658 metadata URL calculation bug · Issue #12033 · pypa/pip · GitHub - pip is currently calculating the metadata file URL in a way that doesn’t account for query strings in the original file URL.

Reading the PEP, I couldn’t find a clear explanation of how to build the metadata file URL. My assumption is that the following code would do what’s intended:

from urllib.parse import urlsplit, urlunsplit
def get_metadata_url(file_url):

    parts = urlsplit(file_url)
    # I hate the "URL parts" API...
    new_parts = list(parts)
    new_parts[2] += ".metadata"
    return urlunsplit(new_parts)

This doesn’t handle “parameters” (mentioned in the urlsplit docs as “(see RFC 2396”) properly - something like ../myproject.whl;parameter would result in a misconstructed metadata URL, I think. I have no idea if that’s important and PEP 658 doesn’t mention it.

Is the above function a good implementation of the intended semantics? And would a clarifying PR for the PEP to include this implementation be acceptable? @uranusjr as PEP author, are you OK with this?

One other possibility is that we should simply drop the query part of the file URL. Pip already drops the fragment part here. I don’t know what makes the most sense semantically.

It should probably not drop the query?

1 Like

It would be interesting to see what browsers do though with a page that has a query, with a <a href="blah.metadata">link</a>, do they keep the query?

Browsers do not forward the query part.

If we have to pick, my instinct is that doing what browsers do with relative URLs is probably better - at a minimum, it matches what people are likely to assume on the basis that it’s (in effect) a relative URL.

So maybe the right approach would be:

def metadata_url(url):
  path = urlsplit(url).path + ".metadata"
  return urljoin(url, path)

which is explicit that this is a new path based on the given URL. And as a bonus, we can blame any weirdness in the behaviour on urllib :slightly_smiling_face:

2 Likes

Somewhat related: Supporting signed file URLs in the simple package repository API


I agree that query parameters should simply be dropped

If the query is important then I would assume it would be part of the href itself and not something that is implicitly carried forward (although I’m sure folks would like that in certain instances to track the original of a request or something).

Howdy, as the issue reporter let me explain what and why we do:

  • our clients running pip are in “air-gapped” environment: no internet, only our index server is accessible to them
  • hence, we need to rewrite the index, to make pip “come back” for packages as well
  • clients are usually VMs being provisioned, hence load comes in huge “swarms”: performance is critical to us
  • before PEP658 (before pip 22.3) the flow was simple: we rewrote the index and we reused the fact that pip “did the work” for us: it parsed HTML and extracted full href, where we added the query “originalHref”, so the server job was simple: check package cache, if stale, use originalHref query sent by pip to grab the package from internet and cache it.
  • We have a “safety” fallback: if for any reason originalHref query was not present, we mapped easily the incoming URL to index entry (as they overlap) and reused our own “data-original-href” attribute from index, but this required HTML parsing for each request on server side with all the perf penalties.
  • this also had the benefit that index server knew ahead the time (as it rewrote the index HTML) all the remote URLs in play (prefetch). Also, we had 1:1 correspondence between incoming request URLs, the indexed packages and remote/internet URLs we should go for (in our custom attribute data-original-href).

With PEP658 things change: pip is constructing URLs on it’s own and server getting URLs that are not on index.

IMHO ideally the PEP658 attribute “data-dist-info-metadata” would contain the href value of metadata (that we also can rewrite). As in that case premise would not change: index would still have all the URLs from outer world (maybe at the cost of some redundancy, as metadata is usually “next to” package file, but URL could be relative as well).

Our problem currently is:

  • index rewrote by us contains entry “package.whl?originalHref=http://internet/packages/package.whl” (query value encoded as Base64)
  • pip gets this index, and then sends us a request for “/package.whl.metadata” (well, currently due bug request is package.whl?originalHref=http://internet/packages/package.whl.metadata as it blindly appends .metadata to URL string)
  • this URL is not present on index (or the URL base64 encoded originalHref is broken as appended “.metadata” is not base64)
  • so we have no straightforward means to figure out is this a package, or some package subordinate request, and finally, where should we go remote to get this, unless we reverse engineer the URL composition that happens on pip side. And that makes us fragile (also, as I’ve been told, same stands for “data-gpg-sig”)?
  • without originalHref we are being penalized for each request, to parse HTML page, plus the extra logic would make us fragile as well.

This is why I believe that retaining original “index” premise is the correct way: pip not constructing any URLs on it’s own, just following URLs from index would be the proper way. By reusing same logic for metadata, PGP signature attributes etc as it stands for original a.href index entry would make things consistent…

But, as PEP658 is out in the wild, I believe all this is too late.

For this reason, I’d ask for following:

  • any attribute-derived-URL on pip side (like metadata URL, where pip constructs URL for metadata on client side, without having it present in index) should fully retain all the original a.href URL queries, keep the query unchanged. Anything present in URL should handled as “opaque” by pip IMHO.
  • as this would keep incoming request “logically linked” to the index entry
  • index server would need no heuristic or any kind of “logic” to deduce what this request is about
  • index server can still easily “figure” and maybe adjust originalHref to make remote request for whatever is being asked for
  • as according to PEP658 spec, the index server CANNOT respond with anything else than HTTP 200

I don’t think this is appropriate, and modelling after what browsers do is cleaner. Could you elaborate on why this is more useful/preferable to doing what browsers do?