PEP RFC: Python Package Index (Warehouse) JSON API v1

@cooperlees I’m guessing nobody was about. I have just arrived from the pip issues on GitHub, where progress on providing a package query interface depends on this.

I think generating an initial standard that pip could target, as quickly as is reasonable, would be a good idea. As it is, there is no real way to predict what pip will do before running a download or install command.

There is a lot of work needed here to make pip truly able to tell you what it will do. One piece is that we need Warehouse to store better and stricter metadata …

Have you seen `/tmp/test/bin/pip install --dry-run --report /tmp/black_install.json PACKAGE`?
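
For anyone who hasn’t tried it: the report is plain JSON, so pulling out what pip resolved takes only a few lines. This is a minimal sketch assuming the v1 installation report format from pip 22.2+; treat the key names as approximate, per the pip docs:

```python
import json

# Generated by: pip install --dry-run --report report.json PACKAGE
with open("report.json") as f:
    report = json.load(f)

# Each "install" entry is a distribution pip *would* install,
# with its core metadata embedded under "metadata".
for item in report.get("install", []):
    meta = item["metadata"]
    print(meta["name"], meta["version"])
    for dep in meta.get("requires_dist", []):
        print("   ", dep)
```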

Also - metadata-only resolve with `pip download --dry-run --report`! by cosmicexplorer · Pull Request #10748 · pypa/pip · GitHub might be of interest.

I’m not sure if this is the correct place to bring this up, but one issue I’ve run into several times with Poetry’s (and likely PDM’s) usage of the current JSON API is that dependencies are frozen based on the first wheel that is uploaded. This means that if a wheel for another architecture has additional dependencies, but happens to be uploaded by the maintainer second, the JSON API will not list all of the required dependencies.

Here is one example of this happening with Poetry and Open3D:

If the new API made it easier for tools such as Poetry and PDM to get an accurate cross-platform list of dependencies, that would be a nice improvement.

The simple API in theory allows you to fetch metadata per-wheel. In practice, Warehouse doesn’t yet expose that data, but it could. So if that’s what you want, it’s just a matter of waiting for that to be implemented. This is the tracking issue for that work.

The JSON API is fundamentally flawed, as it exposes metadata at the project/version level, and in fact, the way metadata is defined, it’s not possible to say what the dependencies of “X version 1.0” are - they can differ between builds.
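
To make the flaw concrete, here is roughly what consumers of the legacy JSON API do today: `info.requires_dist` is a single list for the whole release, with nothing tying it to any particular file (the project and version below are just for illustration):

```python
import json
from urllib.request import urlopen

# One requires_dist list per release, even though the individual
# wheels listed under "urls" may have been built with different deps.
with urlopen("https://pypi.org/pypi/open3d/0.15.1/json") as resp:
    data = json.load(resp)

print(data["info"]["requires_dist"])          # release-level metadata
print([f["filename"] for f in data["urls"]])  # the actual per-file artifacts
```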

It’s far easier to manage dependencies if you ignore this issue, and assume that the same dependencies are valid for all builds of a given project/version, but it simply isn’t true. We’re slowly trying to move projects to a position where we could feasibly make a rule that enforces that, but even if we do, there will still be older projects that won’t follow that rule, and it’ll be many years before we can ignore those :slightly_frowning_face:

Without a change to the way metadata is standardised, “an accurate cross-platform list of dependencies” isn’t an achievable goal, except for projects that self-impose that restriction (by use of markers and similar techniques).
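
As an illustration of the marker technique: the same requirement string can ship verbatim in the sdist and every wheel of a release, with only its evaluation differing per platform. A small sketch using the `packaging` library:

```python
from packaging.requirements import Requirement

# Identical metadata text across all files of a release;
# what actually gets installed is decided at install time.
req = Requirement('pywin32 >= 226; sys_platform == "win32"')

print(req.name)               # pywin32
print(req.marker.evaluate())  # True on Windows, False elsewhere
```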

The Simple Repository API specifies this per-file, so once PyPI exposes it, that should help alleviate this problem.
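
For a sense of what that enables once implemented: PEP 658 serves each file’s core metadata at the file URL plus a `.metadata` suffix, advertised in the index. Assuming the PEP 691 JSON form of the Simple API, per-file fetching would look roughly like this (untested sketch):

```python
import json
from urllib.request import Request, urlopen

# PEP 691 JSON Simple API; "dist-info-metadata" is the PEP 658
# per-file availability flag (with hashes) on each file entry.
req = Request(
    "https://pypi.org/simple/open3d/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json"},
)
with urlopen(req) as resp:
    index = json.load(resp)

for f in index["files"]:
    if f.get("dist-info-metadata"):
        # Core metadata for this one file, no wheel download needed.
        with urlopen(f["url"] + ".metadata") as m:
            metadata = m.read().decode()
```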

Isn’t there a topic somewhere involving the lock file discussions where we talked about pushing projects to have their dependencies be consistent across sdists and wheels by using markers to help alleviate this problem?

Quite possibly. I’m sure it’s a good idea (and markers make it feasible) but I’m not sure how we go about getting projects to change.

I have wheel metadata from PyPI, so I might be able to formulate a query to get a feel for how many projects right now are shipping metadata that’s consistent across builds, but that leaves a lot of projects we’d still not know about. So I’d be very hesitant about making decisions based on it.

Could PyPI flag it somehow, especially if PyPI adopts the Simple Repository API and chooses to expose the metadata separately?

To be clear, Poetry has a concrete goal of using PEP 658 (per-distfile metadata) to replace the JSON API and behave better. Currently, supporting split metadata across packages is a non-goal; the current plan for implementation is to bail out with an error when trying to lock a package with split metadata, which will greatly reduce user confusion and provide gentle pressure to projects to provide consistent metadata with markers if they want to be compatible with Poetry.
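
For illustration, the bail-out check could be as simple as this sketch, assuming per-file metadata has already been fetched (e.g. via PEP 658); the names here are hypothetical, not Poetry’s actual internals:

```python
# filename -> Requires-Dist lines for that file (hypothetical shape).
def shared_requires(per_file: dict[str, list[str]]) -> list[str]:
    """Return the common dependency list, or error out on split metadata."""
    distinct = {frozenset(reqs) for reqs in per_file.values()}
    if len(distinct) > 1:
        raise RuntimeError(
            "Refusing to lock: this release ships different dependency "
            "metadata per file; markers would make it consistent."
        )
    return sorted(next(iter(distinct))) if distinct else []
```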

We haven’t ruled out supporting split metadata in the future, but it would require major invasive changes with unknown performance and complexity implications, so for now it’s a far-off goal.

FWIW, I think that sounds reasonable.

It’s unlikely that there would be multiple compatible wheels on a single platform, and as long as Poetry handles that discrepancy and correctly uses the metadata coming from that wheel (rather than from whichever one happened to be uploaded first), things should work correctly.

I don’t think allowing different dependencies for different distribution files is something that every installation tool needs to support – we have multiple tools, which I have opinions on. :slight_smile:

I’m a bit confused - PyPI does use that API, and PEP 658 (dist-info-metadata) isn’t particularly related.

If you’re saying what I think you’re saying, then I imagine a two-step upload for PyPI would help resolve this: PyPI would check/validate the distributions before the final publish.

It’s in regard to people downloading a wheel or some such and then relying on it as representative of _all_ wheels, partially because downloading is an “expensive” operation.

As in flag whether a version/release has consistent dependencies? We could, but as the index format is designed, we do treat each file as independent. We could provide a new key with that information, but I’m not sure how valuable it is, especially once dist-info-metadata gets added.

As you say, I think we have standards in place that, when they finally get implemented, would make it relatively painless to get per-file metadata:

- PEP 658 for exposing the data in the index
- PEP 643 for sdist metadata (note that PEP 658 can be used for sdists, not just wheels)

With tools like markers, it’s also reasonable for projects to write their metadata in such a way that it’s entirely static across all built distributions (sdists and wheels) if they want to. But they aren’t required to, hence this discussion.

I don’t think a per-version “is the metadata static” flag is helpful. It might act as a stopgap until the standards noted above are fully implemented, but it also pushes tools into a “get per version metadata when possible” mindset, which will make it a lot harder to retire APIs like the JSON API which publish per-version metadata.

But if you do like per-release metadata, here’s a thought experiment. Rather than having a flag saying whether the release has consistent metadata, only provide metadata if it’s known consistent. Then if metadata is present, it’s reliable, and if it’s not then you shouldn’t have been using it anyway. If you’re not happy with that approach, explain why :slightly_smiling_face:
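
In sketch form, the server-side rule is just: compute the metadata per file, and surface a release-level value only when every file agrees (hypothetical names again):

```python
def release_requires_dist(per_file: dict[str, frozenset]) -> frozenset | None:
    """Release-level metadata only if all files agree; otherwise omit it."""
    values = set(per_file.values())
    return values.pop() if len(values) == 1 else None
```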

PS Just to be clear, I wish we did have metadata that was guaranteed the same for all files in a release. It would be way easier to handle. It’s not the reality we live in, though…

I do like that idea. :slightly_smiling_face:

Is there something more we can do to help make this a reality at this point?

Let’s get PEPs 658 and 643 universally implemented in backends (and PyPI), and then we can query projects for what metadata is still marked as dynamic in sdists. Then we can see where the remaining pain points are.
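
Checking an sdist for that is mechanical once PEP 643 lands: metadata 2.2 lists any non-static fields under `Dynamic` in PKG-INFO. A rough sketch:

```python
from email.parser import HeaderParser

# PKG-INFO from an unpacked sdist; PEP 643 (Metadata-Version 2.2)
# records any fields the build backend may still change as "Dynamic".
with open("PKG-INFO") as f:
    meta = HeaderParser().parse(f)

dynamic = meta.get_all("Dynamic") or []
print(meta["Metadata-Version"], dynamic or "fully static")
```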

We would still have to decide what to do about the terabytes of legacy code on PyPI, though. There’s no transition mechanism I can think of for that, apart from “wait for it all to be superseded by new releases”. And nobody has ever dared to broach the subject of housekeeping PyPI yet, so I have no feel for whether that is viable. (Frontends like pip could decide to ignore sdists with dynamic metadata, but that’s just housekeeping done at the client end, in practice).
