Fetch zip files with HTTP range requests for wheel metadata?

dholth · April 6, 2020, 2:34pm

In another thread someone mentioned a version of pip that fetches wheel metadata without downloading the entire wheel. Normally the metadata will be at the end of the file, so a good bet is to start by downloading the last n kilobytes of the file, based on typical metadata size, and go from there. I dug up something to do that some time ago and put it on GitHub. It emulates a seekable file object. https://github.com/dholth/httpfile/blob/master/httpfile.py

No warranty about whether this is a good idea or not.
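
For readers who want to see the shape of the idea, here is a minimal sketch of a seekable, read-only file object backed by byte-range fetches. It is not the code from the linked httpfile.py. The names `RangeFile`, `fetch` and `http_range_fetcher` are made up for illustration, and the file size is assumed to be known up front (for example from a Content-Length header).

```python
import io
import urllib.request
import zipfile
from typing import Callable


class RangeFile(io.RawIOBase):
    """Read-only, seekable file object that pulls bytes on demand.

    fetch(start, end) is a hypothetical callable returning the bytes
    start..end inclusive, like an HTTP "Range: bytes=start-end" request.
    """

    def __init__(self, size: int, fetch: Callable[[int, int], bytes]):
        super().__init__()
        self._size = size
        self._fetch = fetch
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        pos = base + offset
        if pos < 0:
            # Behave like a real file here; zipfile expects OSError.
            raise OSError("negative seek position")
        self._pos = pos
        return pos

    def readinto(self, buffer) -> int:
        n = min(len(buffer), self._size - self._pos)
        if n <= 0:
            return 0  # end of file
        data = self._fetch(self._pos, self._pos + n - 1)
        buffer[: len(data)] = data
        self._pos += len(data)
        return len(data)


def http_range_fetcher(url: str) -> Callable[[int, int], bytes]:
    """Build a fetch() callable on top of urllib (assumes the server honours Range)."""

    def fetch(start: int, end: int) -> bytes:
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:
                raise OSError("server did not answer with a partial response")
            return resp.read()

    return fetch
```

With something like this, `zipfile.ZipFile(RangeFile(size, fetch))` only pulls the byte ranges that zipfile actually seeks to and reads. Every read becomes a separate request, so a real tool would probably want to cache or read ahead.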

EpicWink · April 7, 2020, 7:45am

Extracting metadata by parsing a binary stream (aka sniffing) feels implicit. Two ideas for a more explicit approach (both are new features that would require changes to the package index code) are:

1. A completely separate request to get a package’s metadata (in the HTTP body).
2. Using HTTP HEAD to ask for only the package metadata (AWS S3 does this for file metadata, for example). Perhaps the package installer would fall back to downloading the entire package if the required information isn’t in the headers, ensuring compatibility.

In any case, I can’t see any reason to not implement the suggestion anyway, but just make it an implementation detail for pip info which can’t be relied on.

uranusjr · April 7, 2020, 10:08am

This was actually brought up in pip’s issue tracker a while ago, and @chrahunt also has a working prototype. I can’t find the thread now, but IIRC @EWDurbin confirmed this would work for PyPI, and no maintainers thought it was a bad idea.

Support on Python indexes in general, though, would depend on the server implementation, so eventually a general tool would need to implement feature detection and fall back to downloading the whole wheel if range requests are not available. But I would definitely not discourage this if you only want to work with PyPI.
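
A rough sketch of what that detection and fallback could look like. The `get` callable is a stand-in for whatever HTTP client is in use (it is not a real library API); it takes request headers and returns `(status, headers, body)`. The logic relies only on standard HTTP behaviour: 206 means the range was honoured, and a 200 means the server ignored the Range header and sent the whole file.

```python
from typing import Callable, Mapping, Tuple

# (status code, response headers, body); a hypothetical shape for this sketch.
Response = Tuple[int, Mapping[str, str], bytes]


def fetch_tail_or_whole(
    get: Callable[[Mapping[str, str]], Response],
    tail_bytes: int = 16384,
) -> Tuple[bytes, bool]:
    """Return (body, is_partial).

    Ask for the last tail_bytes of the file first. If the server does not
    support range requests, end up with the whole file instead.
    """
    status, headers, body = get({"Range": f"bytes=-{tail_bytes}"})
    header_names = {name.lower() for name in headers}

    if status == 206 and "content-range" in header_names:
        return body, True  # range request honoured
    if status == 200:
        return body, False  # Range ignored, this is already the full file

    # Anything else (e.g. 416): retry as a plain download.
    status, headers, body = get({})
    if status != 200:
        raise RuntimeError(f"download failed with HTTP status {status}")
    return body, False
```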

dholth · April 7, 2020, 6:12pm

I’m not likely to pursue this idea further myself, but if anyone else is, go for it.

Some more information about ZIP, which whl is based on: The very end of the file points to the zip index, which is also at the end of the file. If you fetch that much of the end of the file then you know which byte ranges in the ZIP store each zipped file.
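
To make the "very end of the file points to the zip index" part concrete, here is a small sketch that finds the end-of-central-directory record in the tail bytes and reads the index location out of it. The function name is made up for illustration. It assumes a plain (non-ZIP64) archive and that the signature does not also occur inside an archive comment.

```python
import struct

EOCD_SIG = b"PK\x05\x06"
# signature, disk number, disk with index, entries on this disk,
# total entries, index size, index offset, comment length
EOCD_FMT = "<4s4H2LH"
EOCD_SIZE = struct.calcsize(EOCD_FMT)  # 22 bytes


def locate_central_directory(tail: bytes):
    """Return (offset, size, entry_count) of the zip index.

    tail is the last chunk of the archive, e.g. from a "bytes=-N" request.
    The returned offset is relative to the start of the whole file.
    """
    pos = tail.rfind(EOCD_SIG)
    if pos < 0 or len(tail) - pos < EOCD_SIZE:
        raise ValueError("end-of-central-directory record not found in tail")
    fields = struct.unpack_from(EOCD_FMT, tail, pos)
    total_entries, cd_size, cd_offset = fields[4], fields[5], fields[6]
    if total_entries == 0xFFFF or 0xFFFFFFFF in (cd_size, cd_offset):
        raise ValueError("ZIP64 archive, not handled in this sketch")
    return cd_offset, cd_size, total_entries
```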

uranusjr · April 7, 2020, 6:57pm

Remember, though, that this feature is not mandatory, as explicitly said in the wheel PEP, although AFAIK all mainstream production-level wheel builders implement this recommendation.

dholth · April 7, 2020, 7:01pm

The ZIP metadata is at the end of the file by definition. But you could also try fetching more bytes than you think the ZIP metadata takes and if you were lucky get files out of the .dist-info directory.

bdist_wheel still puts .dist-info at the end in sorted order. https://github.com/pypa/wheel/blob/master/src/wheel/wheelfile.py#L122

gpshead · April 7, 2020, 8:23pm

Depending on the ordering of data within a .zip archive is fragile. @EpicWink’s suggestion of an explicit API to request just the metadata makes more sense. That leaves PyPI as the single canonical source of truth parsing the binary blob (a zip today) to determine what the metadata to be served is.

There are potential dangers to a range request trying to get lucky with a partial zip file read. You now need to worry about the zip file format internals itself including if someone crafts a malicious zip file that appears to have the metadata desired in the range-requested end but actually yields different metadata when parsed as a whole file.

I know too much about the zip archive format and zipfile implementations to be comfortable here: while this type of attack feels unlikely… I can’t rule it out. The code paths involved everywhere are complex and not something I recommend depending on. Doing this is relying upon an implementation of file format parsing outside of our direct control to be lucky enough to have a convenient side effect.

Yes, zip files are supposed to have a central directory at the end of the file. But despite that, they also have inline per-file headers which can be used instead, and various tools do not guarantee that this redundant information actually matches up, making it possible to see two different views of the contents.

If it were an archive format of our own design and entirely under our control I would feel less nervous about this concept. It isn’t.
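
The "two views" concern can be illustrated with a sketch that compares each central directory entry against the local file header it points to. This is only an illustration of the mismatch, not a complete defence, and it needs the whole archive in memory, so it says nothing about what a partial read can verify. The function name is made up for this example.

```python
import io
import struct
import zipfile

# Fixed part of a local file header: signature, version (2 bytes), flags,
# method, time, date, CRC, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_FMT = "<4s2B4HL2L2H"
LOCAL_SIZE = struct.calcsize(LOCAL_FMT)  # 30 bytes


def local_header_mismatches(blob: bytes) -> list:
    """List central-directory names whose local header disagrees."""
    bad = []
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        for info in zf.infolist():
            fields = struct.unpack_from(LOCAL_FMT, blob, info.header_offset)
            sig, flags, method, name_len = fields[0], fields[3], fields[4], fields[10]
            start = info.header_offset + LOCAL_SIZE
            raw_name = blob[start:start + name_len]
            # Bit 11 of the flags marks a UTF-8 file name.
            encoding = "utf-8" if flags & 0x800 else "cp437"
            same = (
                sig == b"PK\x03\x04"
                and raw_name.decode(encoding, "replace") == info.orig_filename
                and method == info.compress_type
            )
            if not same:
                bad.append(info.orig_filename)
    return bad
```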

dholth · April 7, 2020, 8:46pm

If that attack works, then the same attack would surely work when you download the entire wheel and seek to METADATA when installing, using the same builtin Python zip handling locally as on pypi.org… zipfile.py at least doesn’t let the per-file headers mismatch the index…

This is not random seeking within the zip. It’s more like a remote filesystem, you don’t have to download the whole file.

dholth · April 21, 2020, 4:03pm

From the command line.

If you were doing it for real, you would make up to three requests. 1. The last few bytes of the archive to get a pointer to the zip index. 2. The entire zip index. 3. The byte range that contains the file you want. It would be exactly like reading a zip from any seekable stream.

$ curl -H "Range: bytes=-16384" -o endofwheel -v https://files.pythonhosted.org/packages/36/ac/c8627c214954b18b197f137ee96bc99e1cc31913d80d7ad59fbab3b05945/kiwisolver-1.2.0-cp38-cp38-manylinux1_x86_64.whl

$ unzip -l endofwheel 
Archive:  endofwheel
error [endofwheel]:  missing 75835 bytes in zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
   254056  03-27-2020 02:27   kiwisolver.cpython-38-x86_64-linux-gnu.so
       11  03-27-2020 02:27   kiwisolver-1.2.0.dist-info/top_level.txt
      416  03-27-2020 02:27   kiwisolver-1.2.0.dist-info/RECORD
     1718  03-27-2020 02:27   kiwisolver-1.2.0.dist-info/METADATA
      108  03-27-2020 02:27   kiwisolver-1.2.0.dist-info/WHEEL
---------                     -------
   256309                     5 files

$ unzip endofwheel 
Archive:  endofwheel
error [endofwheel]:  missing 75835 bytes in zipfile
  (attempting to process anyway)
error [endofwheel]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
  (attempting to re-compensate)
file #1:  bad zipfile offset (local header sig):  0
  (attempting to re-compensate)
error [endofwheel]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
  inflating: kiwisolver-1.2.0.dist-info/top_level.txt  
  inflating: kiwisolver-1.2.0.dist-info/RECORD  
  inflating: kiwisolver-1.2.0.dist-info/METADATA  
  inflating: kiwisolver-1.2.0.dist-info/WHEEL
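
The "up to three requests" flow described above can be sketched in Python as follows. This is an illustration under assumptions, not a tested client: `fetch(start, end)` is a hypothetical callable returning the inclusive byte range, the total file size is assumed to be known, and ZIP64, encryption and archive comments containing the signature are not handled.

```python
import struct
import zlib
from typing import Callable

CD_FMT = "<4s4B4HL2L5H2L"  # fixed part of one central directory entry
CD_SIZE = struct.calcsize(CD_FMT)  # 46 bytes


def read_remote_member(fetch: Callable[[int, int], bytes], size: int,
                       name: str, tail_len: int = 16384) -> bytes:
    # Request 1: the end of the archive, which holds the pointer to the index.
    tail_start = max(0, size - tail_len)
    tail = fetch(tail_start, size - 1)
    pos = tail.rfind(b"PK\x05\x06")
    if pos < 0 or len(tail) - pos < 22:
        raise ValueError("end-of-central-directory record not found")
    cd_size, cd_offset = struct.unpack_from("<2L", tail, pos + 12)

    # Request 2: the entire index, unless the tail already contains it.
    if cd_offset >= tail_start:
        cd = tail[cd_offset - tail_start:cd_offset - tail_start + cd_size]
    else:
        cd = fetch(cd_offset, cd_offset + cd_size - 1)

    # Walk the index: name -> (method, compressed size, CRC, header offset).
    entries = {}
    p = 0
    while p + CD_SIZE <= len(cd):
        f = struct.unpack_from(CD_FMT, cd, p)
        if f[0] != b"PK\x01\x02":
            raise ValueError("bad central directory entry")
        name_len, extra_len, comment_len = f[12], f[13], f[14]
        raw = cd[p + CD_SIZE:p + CD_SIZE + name_len]
        fname = raw.decode("utf-8" if f[5] & 0x800 else "cp437")
        entries[fname] = (f[6], f[10], f[9], f[18])
        p += CD_SIZE + name_len + extra_len + comment_len

    method, csize, crc, start = entries[name]

    # Request 3: from this file's local header up to the next header
    # (or up to the index, for the last file in the archive).
    end = min([off for (_, _, _, off) in entries.values() if off > start] + [cd_offset])
    chunk = fetch(start, end - 1)
    if chunk[:4] != b"PK\x03\x04":
        raise ValueError("bad local file header")
    name_len, extra_len = struct.unpack_from("<2H", chunk, 26)
    data_start = 30 + name_len + extra_len
    data = chunk[data_start:data_start + csize]

    if method == 0:  # stored
        out = data
    elif method == 8:  # deflated, raw stream without zlib header
        out = zlib.decompress(data, -15)
    else:
        raise ValueError(f"unsupported compression method {method}")
    if zlib.crc32(out) != crc:
        raise ValueError("CRC mismatch")
    return out
```

With a tail as large as the 16384 bytes used in the curl example, the index of a small wheel usually fits inside the first response, so the second request is skipped.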
