Pip download (just the source packages, no building, no metadata, etc.)

Hi.

Sometimes all I want is the source for modules - and I’ll work out the dependencies later - manually.

So, I have been trying something such as:

pip3 download --no-deps --no-build-isolation pandas

However, it still goes into a long process (building wheel metadata, I guess, from what it finally says).

(py37) aixtools@x070:[/home/aixtools/python/download]pip3 download --no-deps --no-build-isolation pandas
Collecting pandas
  Using cached pandas-1.0.5.tar.gz (5.0 MB)
    Preparing wheel metadata ... done
  Saved ./pandas-1.0.5.tar.gz
Successfully downloaded pandas
(py37) aixtools@x070:[/home/aixtools/python/download]ls -l panda*
-rw-r--r--    1 aixtools aixtools    5007108 Jul 04 11:03 pandas-1.0.5.tar.gz
(py37) aixtools@x070:[/home/aixtools/python/download]

However, to get this I first had to install cython and numpy. What I would rather see happen is that either ONLY pandas package source gets downloaded, or pandas, numpy and cython sources get downloaded (and I have to figure out the dependencies).

Likely, I am not understanding one of the command-line arguments - if so, my many many thanks for pointing it out!

You probably want --no-binary :all: to tell pip not to try to find wheels. You shouldn’t need --no-build-isolation as you don’t want to build :slight_smile:

This is similar to pip#8387. To summarise:

  1. pip needs to get package metadata to verify a downloaded sdist.
  2. Since sdist metadata is currently not trustworthy (problem 1), pip needs to build a wheel for that.
  3. Pandas has a highly customised setup.py that makes its invocation very involved, even if you don’t want to actually build the package, but just retrieving metadata (problem 2).
  4. Without build isolation, pip cannot even run its setup.py.

To deal with the problem long-term, we need to make sdist metadata useful, but that’d take a long time. Until then, it’s unfortunately up to individual package maintainers to make metadata retrieval a viable operation.

Alternatively pip can quit verifying downloaded artifacts (beyond checking the hash). That may or may not be a good idea, I don’t have an opinion on that.

Shouldn’t using --no-deps avoid building the metadata in this case?

It somehow still does, I’m not sure why, probably something in InstallRequirement triggers it. I think it’s very possible to delay that until the installation phase with some refactoring though (although that’d also mean pip download loses the metadata integrity checks that come with preparing the sdist, e.g. ensuring the package name and version is correct).

OK. What I read here is, roughly:

For data integrity, pip builds things (regardless of arguments) - and to do that it will always need the dependencies (which is why even with --no-deps both cython and numpy had to be ‘installed’ first before the download of pandas could proceed). And, I am guessing, that also explains why --no-build-isolation had no noticeable effect.

Would it be conceivable to have an option e.g., --no-integrity-check that is intended for source downloads (e.g., this option forces --no-binary :all:).

p.s. I did not check for pip issues - as I did not want to assume there was a bug here - I expected “user error”.

Not quite. There’s a lot of messy history here, which I doubt would be of interest, around what we have and haven’t standardised, and how pip treats sdists (which have no standard).

There is a possible pip issue here. In theory, pip should only need the project name and version as long as you say --no-deps, and those are available from the sdist filename. So there should be no need to build when you have --no-deps. However, we don’t yet have a standard that says sdist filenames must include the name and version, so there’s a remote possibility that by assuming that, pip gets things wrong. We double-check during the build (when we do get a reliable name and version) and give an error if there’s a problem.
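
As a rough illustration of that point, the sketch below guesses the name and version from an sdist filename and then compares the guess with what a build reports. This is not pip’s actual code: `split_sdist_filename`, `normalize` and `matches_build` are hypothetical helpers, and the sketch assumes the common NAME-VERSION.tar.gz / NAME-VERSION.zip convention, which (as noted above) no standard guarantees.

```python
import re

# Assumption: only these two archive extensions are considered.
ARCHIVE_EXTENSIONS = (".tar.gz", ".zip")


def normalize(name):
    """Normalise a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def split_sdist_filename(filename):
    """Guess (name, version) from a filename such as 'pandas-1.0.5.tar.gz'.

    The split happens at the last hyphen, so this is only a guess based on
    convention, not something the archive is required to honour.
    """
    for ext in ARCHIVE_EXTENSIONS:
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        raise ValueError(f"unrecognised archive extension: {filename!r}")
    name, sep, version = stem.rpartition("-")
    if not sep or not name or not version:
        raise ValueError(f"cannot split name and version: {filename!r}")
    return name, version


def matches_build(filename, built_name, built_version):
    """Return True if the filename guess agrees with the built metadata."""
    name, version = split_sdist_filename(filename)
    return normalize(name) == normalize(built_name) and version == built_version
```

With --no-deps, the first function is all that would be needed if the filename could be trusted; the second is the kind of double-check that currently only a build can provide.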

I don’t believe we’re deliberately doing that check for pip download - it’s a side-effect of implementation details. Certainly I don’t think the check is valuable enough to warrant a command line option.

The real solution here is to finally get round to standardising the sdist format. Obviously that’s not much help right now, though. We could try to stop pip doing this check on pip download as a short-term fix, as @uranusjr suggested. We may have to document that we dropped that check, but I doubt (famous last words!) anyone would care.

However, this is all in an area that the “new resolver” work is changing quite significantly. It may be that the new resolver code doesn’t even have this issue, which would mean that “use the new resolver” would be a sufficient workaround in the short term.

So, could you try pip download --no-deps --no-binary :all: --unstable-feature=resolver pandas, and see if that avoids the build step?

The new resolver also installs build deps and builds the wheel. I think what @aixtools wants (because this is what I also want sometimes) is something equivalent to apt source, which solely fetches the source distribution. IIUC pip download was designed for a different purpose, namely downloading packages for later installation, but I’m not entirely sure.

In this very case, it seems the fastest way is to run

pip download --no-deps --no-binary pandas --no-build-isolation pandas

which skips the build (even on new resolver this is still needed). However, it’s worth noticing that metadata is still prepared

Running command /usr/bin/python3 .../pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmp12w7gxw0

and due to a long standing Cython-setuptools integration issue, cythonization was still invoked and it took quite some time.

The best work-around for this AFAIK is to go to PyPI and find the distribution directly, unfortunately.
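
That manual step can be scripted. Below is a minimal sketch that asks PyPI’s JSON API (https://pypi.org/pypi/PROJECT/VERSION/json) for the files of one release, picks the entry marked as an sdist, and downloads it while checking only the sha256 hash. The function names are made up for illustration, and pip is not involved at all, so no metadata is prepared and no setup.py is run.

```python
import hashlib
import json
import os
import urllib.request


def fetch_release_info(project, version):
    """Fetch the JSON description of one release from PyPI."""
    url = f"https://pypi.org/pypi/{project}/{version}/json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def find_sdists(release_info):
    """Return the file entries that PyPI marks as source distributions."""
    return [
        entry
        for entry in release_info.get("urls", [])
        if entry.get("packagetype") == "sdist"
    ]


def download_sdist(project, version):
    """Download the first sdist of a release into the current directory."""
    sdists = find_sdists(fetch_release_info(project, version))
    if not sdists:
        raise LookupError(f"no sdist published for {project} {version}")
    entry = sdists[0]
    with urllib.request.urlopen(entry["url"]) as response:
        data = response.read()
    # Check the hash PyPI reports; nothing is built or executed.
    if hashlib.sha256(data).hexdigest() != entry["digests"]["sha256"]:
        raise ValueError(f"sha256 mismatch for {entry['filename']}")
    # Keep only the base name so the file lands in the current directory.
    target = os.path.basename(entry["filename"])
    with open(target, "wb") as out:
        out.write(data)
    return target
```

Calling download_sdist("pandas", "1.0.5") should then save the same pandas-1.0.5.tar.gz that pip reported above, without needing cython or numpy installed.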

I dug around and thought about this a bit more, and came to the conclusion that we really, really need to standardise sdist. pip needs the name and version from a distribution artifact. Both are practically specified in the file name, but pip cannot really use that without checking since there is no guarantee an archive file follows the sdist naming convention (even that is not standardised).

I can think of the following choices:

  1. Standardise sdist metadata, and let pip use that instead, eliminating the need to build wheel metadata.
  2. Standardise sdist filename, and amend PEP 503 (Simple Repository API) to mandate that a file served under this name pattern MUST be an sdist, so we can guarantee foo-1.2.zip always means “project foo, version 1.2” and pip does not need to check for consistency. (We need to limit this to PEP 503 indexes because e.g. on GitHub you can download the repository as a zip, and for a repo named foo and branch 1.2… ouch).
  3. Standardise sdist filename, but invent a new extension like wheel did. This would avoid the requirement to amend the Simple Repository API.

We’ve always hit complicated debates when we try to standardise sdist metadata. I’m not 100% sure why, I think there are some cases where people get quite anxious about the possibility that backends could generate different metadata when building the wheel than they did when building the sdist. I honestly don’t know why that could happen, or why we can’t simply declare that as something that backends are no longer allowed to do - but it does make standardising sdist metadata a potentially time-consuming process.

But I see no reason why we can’t standardise the filename - projects are already effectively required to freeze the name and version when building the sdist, so we’re not imposing anything new.

I wish we could just bless the current format as standard, but you make a good point that other sites like github generate names in that format that aren’t sdists. But conversely, I’d somehow feel uncomfortable if sdists got a new extension. I know that’s silly, so I’m not going to argue too strongly, but how about this:

  1. sdist filenames MUST take the form NAME-VERSION.sdist.tar.gz. The “name” and “version” portions must be canonicalised the same way as wheels. Tools MUST assume that any file with extension .sdist.tar.gz is a sdist.
  2. The NAME and VERSION parts of the filename MUST match the distribution metadata - both the metadata in the sdist itself (when that gets standardised) and the metadata of any wheel built from that sdist. It is a backend error to create a wheel whose name and version don’t match the sdist filename.

Currently PEP 503 doesn’t make any statement about what “project files” an index can serve. I’m inclined to leave that unchanged, as it requires tools to make judgements about files without considering where they came from (which is overall a good thing). But I would make one exception to that, for compatibility purposes, and say that tools MAY assume that files named *.tar.gz and served from a PEP 503 index are sdists, and proceed as if they had been named *.sdist.tar.gz.
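
To make the proposal concrete, here is a small sketch of how a tool might classify a file under these rules. `is_sdist` and its `from_pep503_index` parameter are hypothetical, and the second branch shows a tool that chooses to exercise the MAY; a stricter tool could return False there.

```python
def is_sdist(filename, from_pep503_index=False):
    """Decide whether a file should be treated as an sdist under the proposal."""
    # Rule 1: the proposed extension always marks an sdist.
    if filename.endswith(".sdist.tar.gz"):
        return True
    # Compatibility exception: a plain .tar.gz MAY be treated as an sdist,
    # but only when it was served from a PEP 503 index.
    if filename.endswith(".tar.gz"):
        return from_pep503_index
    # Anything else (for example a .zip downloaded from GitHub) is not assumed
    # to be an sdist.
    return False
```

The same foo-1.2.tar.gz is therefore accepted when it comes from an index and left alone when it comes from anywhere else.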

(It’s not inconceivable that some tool will choose to treat all .tar.gz files like this, but I’d view that as an implementation choice about how to treat non-standard files, rather than something the standard should take a view on).

Even if we do want to go further and standardise sdist metadata, I’d still advocate for the above as the specification of the sdist filename. It feels like the minimum change needed to give us reliable information.

.sdist.tar.gz counts as a new extension in my mind, so that works in this regard :slightly_smiling_face:

One problem with this particular design though would be backwards compatibility. If my memory of pip’s implementation serves (I didn’t actually check), a foo-1.0.sdist.tar.gz would be identified as an sdist of project foo and version 1.0.sdist (a legacy, non PEP 440 version), get picked up and downloaded, and then fail to install. We need to invent something that does not accidentally get picked up by old versions of pip, and maybe also easy_install.
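
A sketch of the failure mode, under the assumption that an older tool strips a known archive extension and splits at the last hyphen. `legacy_parse` is a hypothetical stand-in, not pip’s real parser, and `looks_standard` is a deliberately simplified check (dotted numbers only) rather than a full version parser.

```python
import re

# Simplified stand-in for a standards-compliant version check:
# dotted numbers only.
SIMPLE_VERSION = re.compile(r"\d+(\.\d+)*")


def legacy_parse(filename):
    """Mimic a tool that only knows the '.tar.gz' and '.zip' extensions."""
    for ext in (".tar.gz", ".zip"):
        if filename.endswith(ext):
            name, _, version = filename[: -len(ext)].rpartition("-")
            return name, version
    return None  # unknown extension: the file is ignored


def looks_standard(version):
    """Return True if the version consists of dotted numbers only."""
    return SIMPLE_VERSION.fullmatch(version) is not None


# The proposed name is still recognised as an archive, with a bogus version.
name, version = legacy_parse("foo-1.0.sdist.tar.gz")
print(name, version, looks_standard(version))

# A name that does not end in a known extension is skipped entirely.
print(legacy_parse("foo-1.0.sdist"))
```

In this sketch the first file is picked up with a version that fails the check, while the second is never considered, which is the property a backwards-compatible extension needs.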

Sigh. That basically means a completely new extension (i.e., not one that ends with .tar.gz) :frowning:

I foresee endless bikeshedding. But I’ll start by suggesting NAME-VERSION.sdist, in the hope that it’s sufficiently obvious to be uncontroversial.

I created a PEP draft and put things discussed here in it.


Edit: The PEP is in discussion at PEP 625: File name of a Source Distribution
