Yes. Unless you rely on end users to explicitly indicate to the client that they want to use the JSON repository (i.e. effectively host 2 indexes, HTML only and JSON only and users configure the URL for the one their client supports).
Yea, backwards compatibility is the big concern. That doesn’t exist for new things we add though, so it’d be reasonable (I think) to have clients be stricter for new things, even if older things have to be more relaxed due to backwards compatibility concerns. To me that falls under the “where reasonable” vagueness I added to my statement. I wouldn’t think it is reasonable for pip & friends to just randomly break tons of indexes out of the blue, but I do think installers should strive to be strict in what they accept when adding new things.
That’s not unique to installers though, I think in general packaging tooling should try to be stricter in what it accepts, at least for new things.
I think the key aspect here is simplicity, i.e. to get to a set-up that works with minimal complexity, not compliance to any particular standard. In practical terms, if pip can install from a static page, that’s good enough for these use cases.
Why do organisations like this bother laying out an index structure in any case? What makes doing that worthwhile, considering that just serving up a flat directory of wheels and files using --find-links is just as effective, and (slightly) less work to implement? In your experience, is it simply that you didn’t know that was an option, or is there a deeper reason?
--find-links is great in some scenarios, but not in others. E.g. if you have a developer workstation setup in a corporate Windows environment, --find-links works great because you can connect to a shared drive and it will work. Add a Linux server that needs to install packages from the same location, or a CI/CD server that needs to build Docker images for deployment to a K8s cluster, and the shared drive approach doesn’t work anymore. That’s where the ability to have a static page set up is easier, despite involving an extra step.
If there was a tool that could be used to build an index-compliant static website from a directory full of wheels and sdists, would people use that? It would just add a build step every time a new wheel was published - is even that too much? How annoying would installers need to be in order to make teams accept the extra step?
Yes, if setting up the static site is merely some additional command or option when downloading the packages, e.g. something like pip download --static, and the result can still be served from a static HTML server, my guess is people would use it. As I said above, the key aspect is simplicity; people are not actively trying to avoid meeting standards (most would probably not be aware, or care, that there is a standard in the first place).
So I’d like to explore whether such static publishing could be a realistic compromise (between “trivially easy to throw up an index” and “needs to expose rich enough data to allow tools to do their job effectively”).
Thanks for asking these questions. It would be interesting indeed to get more data on this.
But with regard to sub-optimal performance, I assume (and I’d appreciate your perspective from real-life experience) that such indexes are either small enough that the optimisations aren’t important, or performance isn’t important enough for them to matter. Is that a fair assessment?
Fair - the ability to serve from a static page is the optimisation. Performance is not a concern.
To add some context, the teams I work with are typically small, 1 to 5 people, or several teams of that size working in different departments across an organisation (e.g. data scientists/analytics, devops teams, internal tools development for department-level automation, early stage startups etc.) Their main focus is on delivering day-to-day results, and they use Python to improve their efficiency and productivity. Usually there is no time or background, nor the business justification, to build sophisticated infrastructure. Thus the least-effort alternative available wins, unless there is a specific constraint or policy that requires a more sophisticated setup (e.g. some regulatory constraint with more control - which adds business justification for a more complex setup).
To be clear, the primary difference between --find-links and --(extra-)index-url is that the former is a flat HTML page with all the links to files on it, and the latter is expected to be a standards compliant repository, with a specific structure, serialization rules, etc.
While I agree, if we add new features to the JSON index only, I doubt there’s much issue. JSON (and the JSON index standard) is pretty well-defined, so deviations are likely to be uncommon, and clear bugs. As the OP here implied, it’s the HTML index that has a ton of non-compliant variations and implementations, and little or no realistic opportunity for enforcement.
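To illustrate why deviations in the JSON form tend to show up as clear bugs, here is a minimal sketch of a structural check on a PEP 691 project page. It only covers the keys PEP 691 marks as required (`meta.api-version`, `name`, and `filename`/`url`/`hashes` per file); the function name `check_project_page` is made up for this example, and it is not a full validator.

```python
def check_project_page(doc: dict) -> list:
    """Return a list of problems found in a PEP 691 project page (subset of checks)."""
    problems = []

    meta = doc.get("meta")
    version = meta.get("api-version") if isinstance(meta, dict) else None
    # Clients are expected to accept any 1.x version and reject other majors.
    if not isinstance(version, str) or version.split(".")[0] != "1":
        problems.append("meta.api-version missing or not a 1.x version")

    if not isinstance(doc.get("name"), str):
        problems.append("name missing or not a string")

    files = doc.get("files")
    if not isinstance(files, list):
        problems.append("files missing or not a list")
        files = []

    required = (("filename", str), ("url", str), ("hashes", dict))
    for i, entry in enumerate(files):
        if not isinstance(entry, dict):
            problems.append(f"files[{i}] is not an object")
            continue
        for key, kind in required:
            if not isinstance(entry.get(key), kind):
                problems.append(f"files[{i}].{key} missing or wrong type")

    return problems
```

An equivalent check on an HTML page would have to cope with parser-dependent behaviour first, which is a large part of why the HTML variations are harder to police.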
So I think we’re agreeing, just with differing levels of optimism.
“Can install from” suggests that efficiency is irrelevant. That’s disappointing, but about what I expected.
But --find-links can still be used with a flat directory statically served over http(s). I wasn’t talking about a shared drive, which I agree is a much more limited option.
So build_index would be a separate tool, rather than a pip subcommand. In theory it could be a pip subcommand, or an option to pip download, but if everything is blocked on “make it a pip subcommand or we won’t use it”, then we have a different problem
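For a sense of scale, a separate tool of this kind could be quite small. The following is only a sketch of what a hypothetical build_index might do, not an existing tool: it takes a flat directory of wheels and sdists and writes PEP 503 style pages under `simple/`. The filename parsing is a rough heuristic, and the `base_url` parameter (where the files themselves are served from) is an assumption of this example.

```python
import hashlib
import html
import re
from collections import defaultdict
from pathlib import Path
from typing import Optional

PAGE = "<!DOCTYPE html>\n<html><body>\n{body}</body></html>\n"


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 specifies."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_name(filename: str) -> Optional[str]:
    """Guess the project name from a wheel or sdist filename (heuristic)."""
    if filename.endswith(".whl"):
        return filename.split("-", 1)[0]
    for suffix in (".tar.gz", ".zip"):
        if filename.endswith(suffix):
            return filename[: -len(suffix)].rsplit("-", 1)[0]
    return None  # not a distribution file we recognise


def build_index(dist_dir: Path, out_dir: Path, base_url: str) -> None:
    """Write out_dir/simple/ pages linking to files served under base_url."""
    projects = defaultdict(list)
    for path in sorted(dist_dir.iterdir()):
        name = project_name(path.name) if path.is_file() else None
        if name:
            projects[normalize(name)].append(path)

    simple = out_dir / "simple"
    simple.mkdir(parents=True, exist_ok=True)
    root = "".join(f'<a href="{p}/">{p}</a><br>\n' for p in sorted(projects))
    (simple / "index.html").write_text(PAGE.format(body=root), encoding="utf-8")

    for project, files in projects.items():
        links = []
        for path in files:
            # A hash fragment lets installers verify what they download.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            href = html.escape(f"{base_url}{path.name}#sha256={digest}")
            links.append(f'<a href="{href}">{html.escape(path.name)}</a><br>\n')
        page_dir = simple / project
        page_dir.mkdir(exist_ok=True)
        (page_dir / "index.html").write_text(
            PAGE.format(body="".join(links)), encoding="utf-8"
        )
```

The output is plain files, so it can be served by any static web server and rebuilt as a publish step whenever a new wheel lands in the directory.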
I note that your example was about downloading stuff from PyPI and then serving it from a local directory. What’s the driver there? Why isn’t it OK to just use PyPI directly? (I can think of a number of potential answers here, but I’m interested in what actually matters to you, so we’re addressing the right issue).
I should also be clear here that if your honest answer is “we just did whatever worked, never thought beyond that, and didn’t even know other options existed”, that’s a perfectly valid position. I have a strong suspicion that a significant part of the reason we have so many backward compatibility and adoption concerns like this one is because we’re very bad at explaining to people what the best practices are, and how to set up a good, practical workflow.[1]
I say this because I, speaking as an experienced member of the packaging community, still struggle to set up a good workflow every time I start a new project. This is not something I think we should be proud of ↩︎
Strongly agree with Paul here. Going back to my earlier point, a tool like build_index has the issue that people want to make indexes in a lot of different ways. Some users build indexes locally, some want to push packages and updated index files, others don’t take wheels as input at all (e.g. dumb-pypi).
The issue in my mind is not that web servers don’t support conneg, it’s that they require more configuration to do so.
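As a rough illustration of the extra configuration involved, this is a minimal sketch of the server-side selection logic for an index that offers both representations. The content types are the ones PEP 691 defines; the function names and the choice of `text/html` as the fallback are assumptions of this example, and a real server would also need to wire this into its request handling.

```python
from typing import List, Optional, Tuple

# Representations this hypothetical server can produce, in preference order.
SUPPORTED = (
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",
)


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header into (media type, q) pairs, highest q first."""
    items = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q as "not acceptable"
        items.append((media_type, quality))
    # The sort is stable, so equal-q entries keep the client's order.
    items.sort(key=lambda item: item[1], reverse=True)
    return items


def choose_content_type(accept: str, default: str = "text/html") -> Optional[str]:
    """Pick a representation for the request, or None if nothing matches."""
    requested = parse_accept(accept)
    if not requested:
        return default  # no Accept header: fall back to the legacy format
    for media_type, quality in requested:
        if quality <= 0:
            continue
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return default
    return None  # caller can answer 406 Not Acceptable
```

A static file server has no natural place to run logic like this, which is why a 503-only index remains the path of least resistance.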
Going back to the original question of “Can we deprecate PEP 503”, my answer would be no, because not enough clients support PEP 691 and the tooling doesn’t exist to easily host an index that supports both PEP 503 and 691. I definitely think you can host such a server but the path of least resistance is hosting a PEP 503-only index at the moment. So in my mind to deprecate PEP 503 either:
2-3 years pass and everyone can start setting up PEP 691 indexes
We make it extremely easy to host indexes that support both PEP 503 and PEP 691
I do not think adding new features only to the JSON representation will encourage people hosting PEP 503-only indexes to switch. If PEP 503 is working for them and supported by clients, inertia keeps them on a 503-only index.
I also think a tool like build_index really only helps after the 2-3 year window when clients that don’t support PEP 691 phase out. Not to say it shouldn’t be built! I think it could be useful for some users.
What kind of efficiencies were you expecting everyone to want? Being able to filter versions based on python_requires or solve dependency conflicts without downloading tonnes of wheels only comes into play if you’re dealing with tightly constrained dependency trees or Python versions behind what’s supported upstream. Keep away from those issues and anything beyond a dumb file system server feels overkill.
Yeah, that’s sort of my point. But equally, a flat directory, served over HTTP, accessed via --find-links, serves the “dumb filesystem server” use case just as well. So I guess where I’m coming from is trying to understand why people are using an index, which is overspecified for their use case. I suspect there are 2 answers:
Education - people simply don’t know that --find-links is just as good for this use case.
Standardisation - the index API is standardised, whereas --find-links is pip-specific[1].
If we could get people who don’t need the complexity of a real index server to use the simpler alternative, maybe we’d be left with a group of users for whom there are benefits to supporting the more advanced features, and we could make progress that way.
Although uv pip supports it as well, and I can’t imagine any other installer will be able to get away with not supporting something basically the same ↩︎
I expect this is the main reason, and so if there really was a desire to restrict/limit --index-url to require JSON protocol, perhaps making sure --find-links has good enough heuristics to handle the HTML format is a suitable escape hatch?
Seems I confused this with my recent use of --find-links in a shared drive scenario. Yes it works just as well for the same directory served over http(s).
So build_index would be a separate tool, rather than a pip subcommand. In theory it could be a pip subcommand, or an option to pip download, but if everything is blocked on “make it a pip subcommand or we won’t use it”, then we have a different problem
In principle using a separate tool is not a problem, at least if there is one that is stable and maintained. Over the years there have been quite a few attempts to solve this problem outside of pip, e.g. pip2pi, dumb-pypi, simple503, and a few others I have come across but can’t seem to recall right now. While the extra step is not a problem per se, a maintained and integrated option to pip download would be useful imho.
In mid to large corporate IT environments PyPI is typically blocked or at least behind a guarded proxy, and people are either discouraged or restricted from downloading packages freely. The core reasons to maintain an internal subset of packages are security and traceability concerns, i.e. to lower the probability of supply chain attacks, and to know what actually gets installed in internal systems, and from what source.
I wouldn’t say it is whatever works, rather the simplest approach that meets our needs. This is of course not due to a lack of appreciation for a well managed solution. Rather it is with the primary concern for simplicity, for the reasons mentioned in my previous answer (no focus/time for complexity, or at least no immediate need for the features and efficiency that “proper” repositories like Artifactory provide).
There is some truth in that I guess, although I think the situation has improved tremendously (e.g. the Python Packaging User Guide is a really useful reference). The larger impact however is that migrating to new standards is typically hard relative to the importance of other projects and thus often gets put in the backlog for as long as possible. I’m not saying that’s a good practice, just a common one.
I will say (to probably no one’s surprise) that I have a PEP idea that would require index support and I was not looking forward to trying to support the HTML index. So at least I will happily skip HTML support.
But that is one static file server, not most that innately know what to do with a .json file.
I’m not saying drop support for conneg, just at least support application/json as the YOLO version.
Well, packages have to be brought into the company somehow, in a consistent manner. pip download is a good way to build a “golden” copy, either from inside or outside. Some corp IT will allow downloads from PyPI for that (by a dedicated role/person/system), others have a provider do it outside and copy through some file transfer means - e.g. simply set up an HTTPS site available to internal users.
Sorry for my delayed response! I just meant that right now the living standard language is “may” for supporting this and “should not” for installers relying on it, but we could make it “should” (or “must”) for supporting this and “may” (or “should”) for installers relying on it.
(But per the other responses, I don’t have a strong sense that this is the best way to go about things.)
Also, I want to offer a hearty thanks to everybody who has responded and offered their insights and context so far – I threw this idea out there, and it’s exceedingly useful to hear people offer reasons for PEP 503’s usage that weren’t obvious to me! I wish some of the larger third-party index providers would chime in as well, but that alone is incredibly valuable.
I personally think this would be valuable to do – even if it’s not a removal (or even deprecation per se) of PEP 503, having a PEP that declares the HTML representation as formally frozen in terms of features would be IMO a useful signal that consumers should begin migrating away from it. I’d be happy to work on a PEP that codifies that.
I have doubts this will cause folks to migrate since it’s clear a number of folks just want something that works. But I think it is reasonable to not maintain two parallel representations of the same data format, so I’m in favor of deprecating PEP 503 by means of no new features.
I agree. Freezing PEP 503 seems like a reasonable thing to do, but as you say it wouldn’t persuade the people who just want something that works to switch. For them, the additional features the JSON format offers wouldn’t be of interest.
For the bigger providers (PyTorch, piwheels, Artifactory, Azure, Gitlab, …) if they aren’t already providing the JSON format, I doubt this will change their minds either. But they are the ones we really want to switch, so IMO we still need to do some sort of user survey to find out what it would take to get them to change. They don’t participate here, so we need explicit outreach - there’s really no way round that.
So we can freeze PEP 503, but we should be clear that it’s for our convenience, not to drive adoption of the JSON format.
I’ll add a scenario to the mix of where HTML can still be better than JSON for clients, not just servers:
When resolving packages that have a large number of releases, e.g. boto3, a client can save a lot of memory by iterating through the HTML lazily, but that’s tricky in pure Python with the JSON response. e.g. this has some real world impact on pip’s memory footprint when doing a large resolve on PyPI vs. a private HTML index.
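A minimal sketch of that lazy iteration using only the stdlib: `html.parser.HTMLParser` accepts input incrementally, so links can be handed to the caller as each chunk of the response arrives instead of after the whole page is in memory. The `iter_links` name and the idea that the response is available as an iterable of text chunks are assumptions of this example; it is not how pip itself is implemented.

```python
from html.parser import HTMLParser
from typing import Iterable, Iterator


class _AnchorCollector(HTMLParser):
    """Collects href values of <a> tags seen since the last drain."""

    def __init__(self) -> None:
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.found.append(href)


def iter_links(chunks: Iterable[str]) -> Iterator[str]:
    """Yield link targets lazily as chunks of an HTML page are fed in."""
    parser = _AnchorCollector()
    for chunk in chunks:
        # The parser buffers any tag that is cut off at a chunk boundary.
        parser.feed(chunk)
        yield from parser.found
        parser.found.clear()
    parser.close()
    yield from parser.found
```

A caller that stops iterating early (for example once a suitable version is found) never pays for parsing the rest of the page, which `json.loads` cannot offer.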
If I understand correctly this could be addressed by a SAX-style JSON parser, right? Streamwise JSON parsing is possible, just not with the stdlib at the moment.
The general ecosystem has not adopted JSON: there is PyPI and a few niche services. We clearly cannot mandate that they switch.
Further, there are not even many good libraries to serve the JSON API whereas the HTML API is native to python -m http.server and dozens of other libraries and languages going back to near the dawn of HTML (and I know the simple API is technically HTML5 but in practice it is not).
Given that, I think it would significantly serve the ecosystem to add the missing fields, or at a bare minimum the optional upload-time, to the HTML standard.
Once upload-time is served in the HTML of PyPI it can easily and automatically be mirrored by private artifactories, allowing tools to offer features like cooldown with private indexes.
Given the current state, I would be opposed to officially freezing the HTML API. In fact I think it should be at parity with the JSON API wherever there is a trivial one-to-one mapping, to reduce the chance of fragmentation.
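For context on what a cooldown feature needs from the index, here is a minimal sketch that filters file entries by age. It reads `upload-time` from JSON file entries, since that is where the field is currently defined (PEP 700); the `outside_cooldown` name and the policy of dropping files with no timestamp are assumptions of this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_upload_time(value: str) -> datetime:
    """Parse a PEP 700 upload-time such as 2024-01-01T00:00:00Z."""
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so rewrite it as an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def outside_cooldown(files: list, cooldown: timedelta,
                     now: Optional[datetime] = None) -> list:
    """Keep only files uploaded at least `cooldown` ago."""
    now = now or datetime.now(timezone.utc)
    kept = []
    for entry in files:
        uploaded = entry.get("upload-time")
        if uploaded is None:
            continue  # policy choice: unknown age is treated as too new
        if now - parse_upload_time(uploaded) >= cooldown:
            kept.append(entry)
    return kept
```

With an index that serves no upload time at all, every file falls into the "unknown age" branch, so the feature cannot work there.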