AFAIK, OpenSSL doesn’t have any HTTP protocol support built-in. libcurl does.
The idea is to have a simple-to-use URL fetch interface, which takes a URL and returns the result in e.g. a Response instance. And ideally, it should be independent of the underlying TLS layer and fully maintained by the OS, rather than anything we do in Python. [1]
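To make the idea concrete, here is a minimal sketch of what such a fetch API could look like; the fetch() function and the Response fields are purely illustrative assumptions, not an agreed design:

```python
# Hypothetical sketch only: the fetch() signature and Response fields are
# assumptions, not an existing or agreed API. TLS, certificates and proxy
# handling would be delegated to the OS (or an OS-maintained library).
from dataclasses import dataclass

@dataclass
class Response:
    url: str        # final URL after any redirects
    status: int     # HTTP status code
    headers: dict   # response headers
    body: bytes     # raw response body

def fetch(url: str, *, timeout: float = 60.0) -> Response:
    """Fetch a URL using the platform's native HTTP/TLS facilities."""
    raise NotImplementedError("illustrative placeholder")

# Intended usage:
#   resp = fetch("https://pypi.org/simple/")
#   print(resp.status, len(resp.body))
```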
That way we get a mechanism which opens up more options for the lower-level TLS support, i.e. it could come from PyPI rather than the stdlib, which would then permit the stdlib protocol modules to branch out to different TLS providers. We could have such providers for OpenSSL, GnuTLS, OS-provided TLS stacks, etc.
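As a rough illustration (and only an assumption about the shape of such an interface, loosely in the spirit of the withdrawn PEP 543), a provider could be as small as this:

```python
# Illustrative assumption of a pluggable TLS provider interface; none of these
# names exist today. Each backend (OpenSSL, GnuTLS, SChannel, Secure Transport,
# ...) would supply one implementation, and the stdlib protocol modules would
# talk to whichever provider is installed instead of importing ssl directly.
from abc import ABC, abstractmethod
import socket

class TLSProvider(ABC):
    @abstractmethod
    def wrap_socket(self, sock: socket.socket, *, server_hostname: str):
        """Return a TLS-wrapped socket, using the backend's own defaults
        (OS trust store, OS-level configuration, etc.)."""
```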
I’m not sure whether having the stdlib use PyPI providers is a good approach, as it creates stronger ties between the stdlib and PyPI, but it’s certainly an approach to consider, rather than moving both the low-level and the higher-level modules completely out of the stdlib and onto PyPI. Why is that better? Because our users could continue to rely on the common APIs defined in the stdlib, while the low-level details can be maintained outside the stdlib. This will only work if there is interest in maintaining such code on PyPI, of course, since we’re essentially just moving the maintenance cost elsewhere.
[1]: For Windows and macOS there are OS provided APIs. For Unixes, the choice is not clear, but libcurl certainly comes to mind.
Bootstrapping PyPI support needs pip, which requires requests (which pip vendors). So going down this route would require a “fetch API” that would allow requests to continue working without needing any binary extensions (pip has a hard requirement that vendored libraries are pure Python).
Writing a “mini pip” that uses a stdlib fetch API and can only install wheels would be plausible, but turning that into a usable “getting Python up and running” experience for end users would be challenging (even the current experience is a struggle for many users).
Whatever we do, I think that in the short to medium term, “keeping requests a pure-python library” will be a key part of any solution.
Does pip need anything more than the fetch API (requests.get()) from the requests package for basic package installs? Perhaps it would be possible to replace requests with the stdlib fetch API in this part of pip (or perhaps even everywhere).
AFAIK, pip uses many of Requests’ advanced features – sessions (connection pools), proxies, custom certificates…
OTOH, most of those would be needed for “fetch”. (Web fetch API can reuse the browser’s settings, so it’s deceptively simple.)
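For illustration, this is roughly the kind of requests usage meant here (not pip’s actual code; the proxy host and CA bundle path are made-up values):

```python
# Not pip's actual code; just the kinds of requests features mentioned above.
import requests

session = requests.Session()                                   # connection pooling / keep-alive
session.proxies = {"https": "http://proxy.example.com:8080"}   # proxy configuration (made-up host)
session.verify = "/etc/pki/corp-ca.pem"                        # custom CA bundle (made-up path)

resp = session.get("https://pypi.org/simple/", timeout=30)
resp.raise_for_status()
```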
Hmm, the point of having a generic, basic, OS-based fetch API would be to not have a requests replacement with all the bells and whistles, but merely to provide something which could be used for simple query and download purposes. Nothing optimized for speed, just for usability (= no knobs) and safety (the OS takes care of updates, certificates and config).
Would it be possible to have pip optionally use a much simpler fetch API instead of requests (via a command-line switch)?
What would it gain for pip? If we can’t de-vendor requests, there’s no incentive for us to not use it. And if an alternative doesn’t support all of our functionality, we still need requests. De-bundling pip (so that we don’t vendor our dependencies) is a much more complex question, and not something we’re likely to consider in the foreseeable future.
In all honesty, this is going to need a lot more co-operation between the PyPA and the core devs, if we want to do anything significant here. Unfortunately, the trend has been in the opposite direction - treating packaging as a separate topic outside of “core Python”. I’d be more than happy to support an initiative to review the whole “bootstrapping Python to the point where using 3rd party libraries is seamless” mechanism. We currently have ensurepip which is based on having a standalone self-bootstrapping pip. We could do something very different, but not without a reasonable amount of work on both sides.
PS This does nicely illustrate one of the constraints on the idea of debundling chunks of the stdlib. We have to be careful to leave a big enough stdlib to bootstrap the packaging toolchain in a user-friendly way.
FYI, having supported a Python distribution in a large enterprise, my experience is that most of the knobs are there just to get pip working. At a minimum, you need to make sure you can:
Read and use custom CAs; even if you are using the store provided by the OS, the user may have a CA set up that is not in the OS store
Support proxies; these can sometimes be the only way to make an HTTP request outside the machine
Use sessions so that proxies can identify you on each request; otherwise you can end up in an infinite proxy request loop
Configure proxies; even if you are sourcing OS-defined proxies, there’s no guarantee that’s the one the user needs to get pip working
Interpret JavaScript if you do support OS-defined proxies, as proxies on Windows can be defined via a PAC file, which is a code-as-config file using JavaScript
Handle proxy authentication, ideally both HTTP auth and OS-specific schemes such as SSPI
Support both HTTP and HTTPS (an internal repo might be provided via HTTP only)
Support HTTPS with certificate validation off; sometimes an internal repo is HTTPS-only but just cannot be validated by the user on first run
If an http.get() were created that successfully used the system APIs to make an HTTP request and allowed all the above knobs to be configured, you would also find a large number of users starting to depend on it heavily outside of bootstrapping things like pip, as Python’s HTTP ecosystem currently does not handle a lot of these edge cases well.
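As a sketch only, and purely an assumption about what such an API surface might look like, an http.get() covering those knobs could have a signature along these lines:

```python
# Assumed shape only: none of these names or parameters exist in the stdlib.
# The point is that the knobs above are exposed, while proxy discovery, PAC
# evaluation and the trust store default to whatever the OS provides.
def get(url, *,
        ca_bundle=None,         # additional/custom CAs on top of the OS store
        proxies=None,           # explicit proxy override, e.g. {"https": "http://proxy:8080"}
        use_system_proxy=True,  # fall back to OS-defined proxies (including PAC on Windows)
        proxy_auth=None,        # (user, password) or an OS-specific scheme such as SSPI
        verify=True,            # allow turning certificate validation off
        session=None,           # reuse connections/cookies so proxies can identify the client
        timeout=60.0):
    raise NotImplementedError("illustrative placeholder")
```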
If an http.get() with this level of capability was available in the stdlib, it could well be something that pip would use as a replacement for our use of requests[1].
There would still be a significant amount of work involved, though. We’d need to deprecate pip’s --proxy option, for example, and ask users to change to configuring their proxy via whatever mechanism http.get() supported (which I assume would be some OS-level configuration). That would likely be a very painful transition (repeated across a number of options), so it wouldn’t be a quick process.
[1]: One additional feature we’d need is the ability to transparently cache downloads.
The bundled pip would still include the code for requests and use it by default, but for the purpose of bootstrapping into this mode, it would use the basic http.fetch() API.
So pip could continue to use all the optimizations requests provides for regular operation, but also support bootstrapping without having access to the ssl module. At least that’s the idea - not sure whether other vendored modules need the ssl module as well.
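A minimal sketch of that bootstrapping fallback, assuming a hypothetical stdlib http.fetch() that relies on the OS for TLS:

```python
# Sketch of the fallback described above. http.fetch() is hypothetical; the
# requests path is what bundled pip effectively does today.
def download(url: str) -> bytes:
    try:
        import ssl  # noqa: F401  -- ssl is available, so vendored requests works
        import requests
        return requests.get(url, timeout=60).content
    except ImportError:
        import http  # assumption: a future stdlib http.fetch() built on OS facilities
        return http.fetch(url).body
```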
As I have already mentioned on the PEP index topic recently, I very much believe that the Python dev teams should try to reunite rather than drift apart even more, so definitely agreed on this.
Very true.
The alternative to all these musings would be to develop a completely new crypto API, which unifies available OS level APIs and provides the same interface on all supported platforms. However, this would be a lot of work and surely won’t be ready and stable in time for the EOL of OpenSSL 1.1.1.
That in effect means we’re relying on a debundled ssl module. Which has its own problems (a user who wants to install a different version of ssl won’t be able to unless pip works with that version, for example). It’s possible, but from experience it’s a lot more trouble than you might think.
I think it’s reasonable to simply say at this point that having any of pip’s existing functionality depend on a module that isn’t vendored with pip would be a major project, and would almost certainly need close consultation with users, distribution vendors, and other stakeholders with complex requirements (CI/cloud providers, for example). Not something the pip developers can take on by themselves.
Agreed - I can’t judge whether it would be more work than a “pip using a fetch API” solution, but my gut feeling is that both would involve about the same amount of work, just by different groups.
Well, ensurepip simply installs the bundled copy of pip. That’s by design, to ensure that all Python installations, even ones done offline, have pip ready to go. So bootstrapping pip via ensurepip doesn’t use any of the network stack. (Sorry, that’s a nuance that I just realised now probably isn’t obvious to everyone here).
The question here is how we get from “python is installed and python -m pip works” to “pip is fully functional”. Currently, there’s nothing more needed - ensurepip results in a fully functional pip. The user may need to add configuration to make pip aware of their proxy, or details like that, but such configuration isn’t really any different from any other program.
With any sort of “ensurepip results in a limited copy of pip” scenario, the user has an extra step to do - to install the necessary extra components needed to make pip fully functional. And they need to do that with a version of pip that has limited functionality by design. Probably a version of pip that doesn’t work in their environment (otherwise why not stick with the “limited pip”?). That’s not a good user experience, IMO - and that’s in the packaging ecosystem, where some pretty terrible user experiences end up actually being considered “better than average”.
For Windows at least (I don’t know how macOS handles CAs), libcurl supports using the OS’s certificates via the CURLSSLOPT_NATIVE_CA flag: CURLOPT_SSL_OPTIONS
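Hedged example: assuming pycurl exposes the option/flag pair under its usual naming convention (pycurl.SSL_OPTIONS / pycurl.SSLOPT_NATIVE_CA, mirroring the libcurl names; needs libcurl 7.71.0 or newer), opting into the OS certificate store looks roughly like this:

```python
# Assumes pycurl mirrors CURLOPT_SSL_OPTIONS and CURLSSLOPT_NATIVE_CA as
# SSL_OPTIONS and SSLOPT_NATIVE_CA, and that libcurl is >= 7.71.0.
from io import BytesIO
import pycurl

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://pypi.org/simple/")
c.setopt(pycurl.WRITEDATA, buf)
c.setopt(pycurl.SSL_OPTIONS, pycurl.SSLOPT_NATIVE_CA)  # use the OS certificate store
c.perform()
c.close()
```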
I’m not sure adding new non-trivial + security-sensitive APIs to the stdlib is really going to help things here. The whole problem is that whenever we add APIs like that they end up eventually becoming more of a burden than a help (see ssl, urllib).
I’m likely showing my ignorance here, but TLS support seems to be something that’s both widely useful and can have a stable interface for most use cases. As such it seems to me that this is functionality that can be kept in the standard library.
As is often the case, a major challenge is the availability of developers able and willing to work on this, and that’s not necessarily something that would change if we removed TLS support from the standard library.
In a way, yes, but normally the OS will already have the right configuration for things like proxies, certificates, etc.
Since we’re focusing on the ssl module here, I don’t think we’d run into such problems. People will in general always want to use the latest version to avoid security issues.
That’s a good point, and one we often tend to forget when talking about how great things would be if we’d move things out of the stdlib and onto PyPI. We take it for granted that someone will jump on the tooling on PyPI and simply take over. This oversimplifies things and in fact we’re just pushing problems out of sight that way.
It may help in some cases where there are people who want to take over maintenance, but without those volunteers, the situation won’t be any different compared to keeping the code in the stdlib.
Overall, I’d say the short excursion into testing whether a simple HTTP fetch API would help solve the ssl module maintenance issue has shown that this will not effectively help us.
So perhaps it’s better to bite the bullet, wait for OpenSSL 3.0 to stabilize and start work on a new unified TLS API, which focuses on using OS facilities as much as possible, without binding to a single tool.
OpenSSL would continue to be this tool on Unixes (with OS vendors taking care of the maintenance, certificate stores, etc.), but on Windows and macOS (and perhaps other platforms), the OS-provided TLS layer would be the one to target.
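A sketch of how such a unified API might pick its backend per platform (the backend names are placeholders, not existing bindings):

```python
# Placeholder backend names; the idea is only that selection happens per platform.
import sys

def select_tls_backend() -> str:
    if sys.platform == "win32":
        return "schannel"         # Windows: OS-provided TLS
    if sys.platform == "darwin":
        return "securetransport"  # macOS: OS-provided TLS (or its successor)
    return "openssl"              # Unixes: OpenSSL maintained by the OS vendor
```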
This is the answer I was hoping to hear, because this API is totally doable. Damian’s list matches my own experience, which is that pip’s thorough use of the API is merely to make normal scenarios work, and not because pip needs anything special from those APIs.
Unfortunately, this is the hard part. As I mentioned earlier, we’ve trained our (CPython) users to specify OpenSSL specific settings and files to make their normal scenarios work (we see the same for Azure SDK users). Migrating to “your OS defaults now work as you expected originally” is fine, but plenty of people are going to be using these settings for Python-specific overrides (we see this, too), and those will not migrate easily.
This is why I suggest that as long as OpenSSL or a substitute keeps the same configuration settings, we may as well stick with them. Once we have to force users to change environment variables or configuration files to keep up, we may as well move directly to “Python behaves the same as your system browser”.
For additional context, I do have a full wrapper of the Windows HTTP APIs that mimics requests/aiohttp and is in use by some teams at work. I currently don’t have the time/resourcing to maintain it publicly, and because I built it on work time I can’t just release it, but I’ve jumped through the hoops before and it works (and is significantly faster than requests/aiohttp on certain workloads - slower on others). So I feel I’m not totally shooting in the dark on this topic.
When you say “default browser”, do you mean only the one(s) shipped with the OS, or do all browsers query the OS for their settings? It’s been a long time since I did anything with proxy settings on Windows, but back in the day, Internet Explorer (yeah, I did say it was a long time) used the system settings, but Mozilla-based browsers had their own internal settings. Which was quite useful, since Windows Update used the IE settings, so you could configure IE to go one way and the user’s actual preferred browser to go another.
Fair point. I meant to imply any browser that follows the system settings (which back in the day were the Internet Explorer settings, but eventually became OS settings).
I’ll update my original post to read “system browser” just to be clear.