Pip install from S3 bucket without pypi server

meatgrinder · February 22, 2023, 6:04pm

Sorry guys, if there are any etiquette or procedural faux pas. I’m totally new here.
I’ve looked for this as a pre-existing topic and couldn’t find one.

basic use case:
gitlab pipelines have access to private pypi server.
However EC2 instances running user-data scripts to initialize and do “pip install” don’t.
As a result, the top level package needs to be downloaded via awscli.
That package needs to be inspected (eg. via zgrep or pkginfo) to see the versions of other private packages that will need to be downloaded via the awscli… etc.
Then they can be installed as local files in reverse order.
It’s a total hack.
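
For readers following along, the inspection step of that hack could be sketched in Python instead of zgrep. This is only an illustration, assuming the downloaded artifact is a wheel (a zip archive whose `*.dist-info/METADATA` file lists dependencies as `Requires-Dist` headers). The function names and the `ps-` prefix filter are hypothetical, not part of any existing tool.

```python
import zipfile
from email.parser import Parser


def read_requires_dist(wheel_path):
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # A wheel keeps its core metadata at <name>-<version>.dist-info/METADATA.
        metadata_name = next(
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        metadata_text = wheel.read(metadata_name).decode("utf-8")
    # Core metadata uses RFC 822 style headers, so the email parser can read it.
    headers = Parser().parsestr(metadata_text)
    return headers.get_all("Requires-Dist") or []


def private_requirements(wheel_path, prefixes=("ps-",)):
    """Keep only requirements whose name starts with a (hypothetical) private prefix."""
    return [
        requirement for requirement in read_requires_dist(wheel_path)
        if requirement.lower().startswith(prefixes)
    ]
```

Each private requirement found this way would still have to be downloaded and inspected in turn, which is exactly the manual dependency walk that makes the approach painful.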

yes a pypi server can use an S3 bucket on the backend and that seems like a good solution for publishing since there are a number of different client tools to do that.
However “pip install” should be able to specify --extra-index-url s3:/// --profile
and do download/installs directly from S3 buckets.
You can use a bucket if you make it public and create the appropriate index.html files.
I believe you can also use a private bucket, but the bucket policy restrictions are very limited (e.g. sourceIP).

I’ve forked the latest pip and done just that.
It depends upon the awscli being installed. I did not want to use boto.
It will also take an optional “--profile” arg for the awscli.
There’s no need for index.html files as those are dynamically generated using the “list-objects” command
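
A rough sketch of what generating the index from a listing might look like, for anyone curious. It is not the code from the fork: the function names are made up for illustration, it assumes the `aws s3api list-objects` command with JSON output (whose result carries a `Contents` list of objects with a `Key` field), and it assumes all files sit flat under one prefix.

```python
import html
import json
import subprocess


def list_bucket_keys(bucket, prefix, profile=None):
    """List object keys under a prefix by shelling out to the AWS CLI."""
    command = [
        "aws", "s3api", "list-objects",
        "--bucket", bucket,
        "--prefix", prefix,
        "--output", "json",
    ]
    if profile:
        command += ["--profile", profile]
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    # Treat empty output as an empty listing.
    listing = json.loads(result.stdout or "{}")
    return [entry["Key"] for entry in listing.get("Contents", [])]


def render_index(keys, base_url):
    """Render a flat HTML page with one anchor per distribution file."""
    links = []
    for key in sorted(keys):
        filename = key.rsplit("/", 1)[-1]
        if not filename.endswith((".whl", ".tar.gz")):
            continue  # skip anything that is not a wheel or sdist
        href = html.escape(f"{base_url.rstrip('/')}/{filename}", quote=True)
        links.append(f'<a href="{href}">{html.escape(filename)}</a>')
    return "<!DOCTYPE html>\n<html><body>\n" + "\n".join(links) + "\n</body></html>\n"
```

Keeping the rendering separate from the CLI call means the HTML part can be exercised without AWS credentials.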

I’ve written an S3Adapter(HttpAdapter) to do the heavy lifting.
I wanted to just write it as a wrapper, but that was a little difficult since the LinkCollector imports a global function like “build_source” and a few other hurdles.
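
To give a feel for the kind of translation such an adapter has to do, here is a minimal sketch of turning an `s3://` URL into an `aws s3 cp` command. It does not show the adapter class itself and is not taken from the fork; `split_s3_url` and `download_command` are hypothetical helper names, and the only CLI features assumed are `aws s3 cp <source> <destination>` and the global `--profile` option.

```python
from urllib.parse import urlsplit


def split_s3_url(url):
    """Split s3://bucket/key into (bucket, key)."""
    parts = urlsplit(url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parts.netloc, parts.path.lstrip("/")


def download_command(url, destination, profile=None):
    """Build the `aws s3 cp` argument list that fetches one object."""
    bucket, key = split_s3_url(url)
    command = ["aws", "s3", "cp", f"s3://{bucket}/{key}", destination]
    if profile:
        command += ["--profile", profile]
    return command
```

The returned list can be handed to `subprocess.run`, after which the adapter would wrap the downloaded file in a response object for the rest of pip to consume.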
Anyways. I was wondering if there was any interest in integrating this into pip and if so, what the process would be to do that from my fork?
I’ve never contributed to a public repo before, So I’m lost in this whole thing but I’d love to make a contribution if you all think it would be valuable?

EpicWink · February 22, 2023, 11:14pm

The AWS CLI is implemented in boto3, though it is contained in a separate environment.

You can protect an S3 static server by putting it behind AWS CloudFront, which can have a custom authenticator and authoriser.

If you’re only accessing the S3 bucket from an EC2 instance, you can set up the S3 bucket to only allow traffic from within a VPC, then run the EC2 instance inside the VPC (or link it in from another VPC).

There are a number of existing PyPI servers which support being backed by S3. If you’re lucky, you’ll find someone’s Terraform config to set up a server which uses AWS Lambda and API Gateway.

meatgrinder · February 23, 2023, 3:18am

awscli implemented in boto3…
Good point, but I was noting this more because I simply didn’t want to do the implementation in boto directly. A simple 1 line awscli call was
I’m not a big fan of it. It’s cumbersome, especially if you’re changing the implementation. Changing a single command line and parsing the json is so easy.

You’ve come up with a lot of work to get around simple direct S3 bucket access, which leverages all of the bucket policy access for free. What if you have artifacts in a bucket and deployments in different VPCs or accounts? If your iam profile has access… no problem.
Why would you go through the hassle and expense of setting up cloudfront and going through security reviews on that, when you already have awscli access to the bucket and pip could just use that?

Yes, there are pypi servers backed by S3 buckets. We are currently using one for publishing the artifacts from our internal pipelines. The problem is that you need to secure the pypi server for install requests wherever the client is being deployed.

using the default profile:
pip install ps-scanner-service==99.99.98 --extra-index-url=s3://272739225222-sms-data-dev/packages --trusted-host=272739225222-sms-data-dev/packages

Now looking at the private packages…
pip freeze | egrep "ps-scanner|ps-common|static-scanning"
ps-common==23.2.4
ps-scanner-service==99.99.98
static-scanning-events==0.2.4

meatgrinder · February 23, 2023, 3:29am

And just because you could allow an EC2 instance from a certain VPC to have read privileges on the bucket, that doesn’t affect access via http/https via pip. You would need to set up an endpoint url and work policies around that protocol. It’s not the same as “S3 GetObject…”
You don’t have those same sets of access policies as you do via S3.

if only you could just put s3://bucket/prefix --profile
from a user perspective it’s so simple and painless

sinoroc · February 23, 2023, 10:22am

I understand that this is about having the feature directly in pip itself. But I still feel that, for anyone who stumbles on this topic by searching for similar keywords, it is worth mentioning these technical paths to pip install from AWS S3 (or something similar):

GitHub - uranusjr/simpleindex (I link the source code repository because the PyPI page does not have the long description containing the relevant details)
pywharf · PyPI
pypicloud · PyPI
and probably others, but I listed the ones that seem the most frictionless

(I guess that is what was already hinted at with “There are a number of existing PyPI servers which support being backed by S3”.)

Additionally, if a feature such as in the following ticket were to be implemented in pip, this could result in a relatively smooth experience for such use cases:

An option to start a local index proxy when running pip · Issue #11771 · pypa/pip · GitHub

If you have code ready, I believe you could make a pull request. I doubt it would be integrated (I am not a maintainer, so that is not my call to make), but it could lead to an interesting conversation anyway. On the other hand I spotted this seemingly related pull request:

Adding S3 and GoogleStorage URL support by brian-dlee · Pull Request #10789 · pypa/pip · GitHub

sinoroc · February 23, 2023, 10:50am

Maybe another (indirect) way of getting pip to learn how to install from AWS S3 is to help with the currently ongoing work on pathlib. As far as I understood, one of the goals of this work is to get pathlib to handle s3://. I am really not familiar with the topic so I do not want to give false hopes, but maybe it is worth looking into it.

steve.dower · February 23, 2023, 8:01pm

It’s more to allow a third-party library to extend pathlib with that scheme (or any other filesystem-like namespace). We’re not planning to add any support for specific cloud services into the Python standard library.

I would strongly recommend looking at simpleindex (or an equivalent, but I like simpleindex). It allows you to run a PyPI-compatible index entirely on your local machine, and then if you want it to direct certain packages to S3 you can write a small extension to do that.

Otherwise, just as the Python standard library is unlikely to directly support a specific cloud service, I expect pip will be unlikely to do it. If you can mount your S3 bucket as a local/networked file system, you should be able to use that with --find-links. Otherwise, the supported protocol is PEP 503 – Simple Repository API | peps.python.org, and so you’ll need to satisfy that to use with --index-url.
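
As a sketch of what satisfying PEP 503 involves on the layout side: each project gets its own page under a normalized name (runs of `-`, `_` and `.` collapsed to a single `-`, then lowercased), and that page links to the project's files. The normalization rule below is the one PEP 503 gives; the filename parsing and the function names are simplifying assumptions for illustration and will not cover every valid filename.

```python
import re
from collections import defaultdict


def normalize(name):
    """Normalize a project name as PEP 503 specifies."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_of(filename):
    """Guess the project name from a wheel or sdist filename."""
    if filename.endswith(".whl"):
        # Wheel names start with "<distribution>-<version>-...".
        return filename.split("-", 1)[0]
    # Sdist names look like "<name>-<version>.tar.gz"; the name may contain dashes.
    stem = filename[: -len(".tar.gz")] if filename.endswith(".tar.gz") else filename
    return stem.rsplit("-", 1)[0]


def simple_layout(filenames):
    """Map each '<normalized-project>/' index path to the files it should link."""
    layout = defaultdict(list)
    for filename in sorted(filenames):
        layout[normalize(project_of(filename)) + "/"].append(filename)
    return dict(layout)
```

A static bucket or a small local server then only needs to serve one HTML page per key of that mapping, with an anchor for each listed file.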

EpicWink · February 24, 2023, 2:45am

See the rendered packaging guide for my PR for more details, but here’s a short-list:

uranusjr · February 24, 2023, 4:54am

Regarding simpleindex, if you do end up using it and implementing a custom s3 route, please also publish it to PyPI. I’ll add the reference to the project page so others can also benefit from the effort

meatgrinder · March 2, 2023, 3:59pm

just back from vacation…

Yes. I’ve already completed it and been using it.
I posted here to see if there was interest in having this as part of the main release.
Shocked that it isn’t a no-brainer.

Got to do one more commit to remove a whole bunch of superfluous logging.

EpicWink · March 2, 2023, 11:14pm

The reason it isn’t a no-brainer is that it supports a small segment of commercial users (who use AWS) to the detriment of all others: not only will the pip download size become a little bigger (when including all reasonable authentication flows, and all possible exception handling), but it would take away maintainer time to support the feature, especially in the long term.

Not to mention there are other cloud-storage providers out there: why not also support those? At least Azure Storage (blob) has similar popularity, and I’m sure people would request support for Google Cloud, Alibaba Cloud, and smaller providers.

Most of these providers (including AWS) already have a service to host Python packages today, which pip can use without change.

meatgrinder · March 3, 2023, 3:57pm

to the detriment of all other users?!! Size? There’s 1 addl class. There must be less than 100 lines of code added.

“small segment”… idk about that. I didn’t research the segment size or demand and I don’t think you did either. I had a need and it’s a no-brainer for my use case. S3 is pretty prevalent. I would def be surprised if it didn’t simplify workarounds for a lot of users. I’d go so far as to guess that it would benefit more users than the vcs support that’s being added.

I didn’t want to refactor pip, but I did think about it so that you could make it pluggable. You could load your own adapter and just register the prefix and any command line args needed specific to your adapter.

EpicWink · March 3, 2023, 10:32pm

Ah sorry, I made a logical leap without explaining myself. I had assumed the pip maintainers wouldn’t want to add support without including the dependencies. Obviously, you wouldn’t bundle AWS CLI with pip, so my thinking was the request construction and signing and exception handling would be reimplemented in pip.

Of course, there’s precedent for using external libraries if present (keyring), so that would be more reasonable.

Nope, but I would certainly bet that the vast majority of pip users don’t even know what an AWS or an S3 is.

I actually don’t know what this is (I thought pip already had VCS support). Could you please send a link to the PR / code lines / discussion that you saw?

uranusjr · March 4, 2023, 1:18am

To offer some more concrete examples: pip supports VCS URLs by calling out to the respective VCS tools. When that was first implemented (before it was even called pip), everyone used SVN, so it was implemented first. Now only a very limited number of people use that feature, if there are any at all (it’s the only VCS backend without a bug report in the last couple of years as far as I can recall). It’s only an additional class for you, but pip maintainers are stuck maintaining it until practically pip dies, because open source is free as in “free,” and these things add up. It’s easy to argue for every small bit of them to go in, but it’s always too easy to tip the scale over if you are too happy to accept any of them. And S3 is not even half as popular as those things we added back in the day to begin with.

meatgrinder · March 5, 2023, 6:42pm

well this is at least a much more constructive conversation. Thank you.
It was just sounding negative without any serious consideration before.

There is no dependency on awscli unless there is an actual attempt to use the S3 protocol and direct bucket access. There’s no initial importing or anything like that.
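
In sketch form, that kind of on-demand dependency could look like the following. The names are hypothetical and this is not the code from the fork; the only assumption is that the CLI executable is called `aws` and is looked up on PATH at the moment an s3:// URL is actually used.

```python
import shutil


class AwsCliMissing(RuntimeError):
    """Raised when an s3:// URL is requested but no `aws` executable is found."""


def find_aws_cli(which=shutil.which):
    """Locate the AWS CLI on demand; `which` is injectable for testing."""
    path = which("aws")
    if path is None:
        raise AwsCliMissing("s3:// index URLs need the AWS CLI on PATH")
    return path


def needs_aws_cli(index_urls):
    """Only s3:// URLs trigger the lookup, so other installs are unaffected."""
    return any(url.startswith("s3://") for url in index_urls)
```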

I’d imagine that the people who use an S3 bucket as a backing store for a private pypi server would want this. It would eliminate all access security considerations to the server for “pip install”. I would also add auditing access, etc.

If the resistance is due to a small segment of users, maybe a simple refactoring would be better. It would def not take much to get rid of that global build_source function and just have a class that lets you create wrapper projects where you just register your S3, Azure, etc. adapter class along with the url prefix, and register any additional adapter-specific command line args.
This would make pip extensible to any protocols.

I was originally looking to do that but it didn’t seem worth it for my specific task.

I’ll have to talk with my work, but I believe I can and will post my current project so people can just check it out.

meatgrinder · March 5, 2023, 7:01pm

by allowing you to import and register your own adapter/prefix
you could also improve package install access from gitlab repos and they have added package registries. The current use case shows using curl and a job token but only from within git pipelines themselves.

This is not the only place where you’d want to access that package repository, so it’s not a complete solution. You’re just asking for problems needing to publish packages to multiple places.

import pip_adapter_registry as r
r.add(GlabAdapter, "glab://", GlabCmdlineArgs)
…
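
For illustration only, a registry along those lines might be as small as the sketch below. Nothing here exists in pip: `AdapterRegistry`, its `add` and `lookup` methods, and the placeholder adapter classes are all hypothetical, and a real version would also need pip to consult the registry when it builds its session and its option parser.

```python
class AdapterRegistry:
    """Map URL prefixes to adapter classes (a hypothetical extension point)."""

    def __init__(self):
        self._adapters = {}

    def add(self, adapter_class, prefix, cmdline_args=None):
        """Register an adapter class and its optional extra command-line args."""
        if prefix in self._adapters:
            raise ValueError(f"prefix already registered: {prefix}")
        self._adapters[prefix] = (adapter_class, cmdline_args)

    def lookup(self, url):
        """Return the (adapter_class, cmdline_args) pair for the longest matching prefix."""
        matches = [prefix for prefix in self._adapters if url.startswith(prefix)]
        if not matches:
            return None
        return self._adapters[max(matches, key=len)]


class S3Adapter:
    """Placeholder; a real adapter would subclass the HTTP adapter."""


class GlabAdapter:
    """Placeholder for a GitLab package registry adapter."""


registry = AdapterRegistry()
registry.add(S3Adapter, "s3://", ["--profile"])
registry.add(GlabAdapter, "glab://")
```

Looking up `"s3://bucket/packages"` would then return the S3 adapter together with its extra arguments, while an ordinary `https://` index URL returns `None` and falls through to the default behaviour.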