Sdist metadata: Store in special fields?

pganssle · July 14, 2020, 12:08pm

I don’t have time at the moment to address the question of the format of the metadata (I’m in favor of some variation of METADATA rather than pyproject.toml) or what can and cannot be specified, but one thing I’d like to suggest is that whatever the format of the static metadata, it might be a good idea to store it in a gzip extra field. There’s also something analogous for zip files. There’s currently an open bug for adding support for this to the standard library, but I think writing ad hoc code to read the extra fields wouldn’t be too hard.

This seems like it would be particularly useful if we’re going to continue using the .tar.gz format, since there’s no way to get at individual files within the tarball without first decompressing the whole thing, which I think would be much slower than seeking to a known offset in the file and reading the data there for large source tarballs (and source tarballs may easily be large if they contain tests with test data).

I believe we could specify each metadata field separately using the format itself, but probably the easiest thing to do would just be to have a field called something like PY_SDIST_METADATA (or whatever the naming convention is) containing the contents of whatever file we choose for metadata.
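To make the idea concrete, here is a minimal sketch of writing and reading such a field by hand, following the gzip header layout in RFC 1952. The two-byte subfield ID `PY` and the function names are placeholders invented for illustration, not an agreed convention. Note that the extra field's total length is stored as a 16-bit value, so the metadata has to fit in roughly 64 KiB.

```python
import struct
import zlib
from typing import Optional

FEXTRA = 0x04          # gzip header flag bit for "extra field present" (RFC 1952)
SUBFIELD_ID = b"PY"    # hypothetical two-byte subfield ID, chosen for this sketch


def write_gzip_with_extra(payload: bytes, metadata: bytes) -> bytes:
    """Return a gzip stream of `payload` whose header carries `metadata`."""
    subfield = SUBFIELD_ID + struct.pack("<H", len(metadata)) + metadata
    if len(subfield) > 0xFFFF:
        raise ValueError("metadata too large for a gzip extra field")
    # Fixed 10-byte header: magic, method=deflate, flags, mtime=0, xfl=0, os=unknown
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    header += struct.pack("<H", len(subfield)) + subfield
    # Negative wbits asks zlib for a raw deflate stream (no zlib wrapper).
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = compressor.compress(payload) + compressor.flush()
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + body + trailer


def read_extra_field(blob: bytes) -> Optional[bytes]:
    """Return the metadata subfield, or None if the header has none."""
    if len(blob) < 12 or blob[:2] != b"\x1f\x8b" or not blob[3] & FEXTRA:
        return None
    (xlen,) = struct.unpack_from("<H", blob, 10)
    pos, end = 12, min(12 + xlen, len(blob))
    # Walk the subfields: 2-byte ID, 2-byte little-endian length, then data.
    while pos + 4 <= end:
        subfield_id = blob[pos:pos + 2]
        (length,) = struct.unpack_from("<H", blob, pos + 2)
        if subfield_id == SUBFIELD_ID:
            return blob[pos + 4:pos + 4 + length]
        pos += 4 + length
    return None
```

Reading only touches the first `12 + XLEN` bytes of the file, so nothing has to be decompressed, and the result should still be an ordinary gzip stream that existing tools can decompress.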

pf_moore · July 14, 2020, 1:35pm

From a pip maintainer POV, -1 on using special fields. Apart from the lack of stdlib support, I’d be concerned about something like that not being properly recognised by other tools (something that loses the data while repacking the file, for example, or virus checkers flagging “suspicious content”).

If getting at individual files is important (and it might be, although I’d rather not get into arguments about whether shipping large amounts of data not needed for the wheel is acceptable) then I’d rather we switched to a different format which supported that directly.

dholth · July 14, 2020, 3:28pm

We had talked about doing a nested wheel, where the main content was stored in an archive and the .dist-info was the same: randomly accessible .zip archive members. That strategy would make a decent sdist 2.0 since you could save space and time compared to the current .tar.gz, on top of having quickly accessible metadata.

I’m suggesting that the .dist-info directory would be the metadata. Like how sdists contain just PKG-INFO. I don’t recall if they sometimes contain the *.egg-info directory?
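As a rough illustration of the nested layout, the sketch below builds a zip whose members are a `.dist-info/METADATA` file and an inner `.tar.gz` holding the sources, then reads the metadata without touching the inner archive. The member names, the choice of `.tar.gz` for the inner archive, and the function names are all assumptions made for this example; nothing here is a specified format.

```python
import io
import tarfile
import zipfile
from typing import Dict


def build_nested_sdist(name: str, version: str, metadata: bytes,
                       sources: Dict[str, bytes]) -> bytes:
    """Build a zip containing <name>-<version>.dist-info/METADATA plus an
    inner <name>-<version>.tar.gz with the source tree (hypothetical layout)."""
    inner = io.BytesIO()
    with tarfile.open(fileobj=inner, mode="w:gz") as tar:
        for path, data in sources.items():
            info = tarfile.TarInfo(f"{name}-{version}/{path}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as zf:
        zf.writestr(f"{name}-{version}.dist-info/METADATA", metadata,
                    compress_type=zipfile.ZIP_DEFLATED)
        # The inner archive is already compressed, so store it as-is.
        zf.writestr(f"{name}-{version}.tar.gz", inner.getvalue(),
                    compress_type=zipfile.ZIP_STORED)
    return outer.getvalue()


def read_metadata(archive: bytes) -> bytes:
    """Read METADATA via the zip central directory, skipping the inner archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for member in zf.namelist():
            if member.endswith(".dist-info/METADATA"):
                return zf.read(member)
    raise KeyError("no .dist-info/METADATA member found")
```

Because zip members are individually addressable, `read_metadata` only decompresses the one small member it needs, however large the inner archive is.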

pf_moore · July 14, 2020, 3:32pm

That was the sort of thing that I had in mind. Or just a simple zip, if we don’t want to get over-complicated (I don’t know how significant the file size issue is in practice). But it’s somewhat off-topic here, where we are talking about specifying metadata, and not about the file format itself.

pganssle · July 16, 2020, 6:19pm

I asked the admins to move this into its own thread, to avoid derailing the main thread.

I don’t think it’s a good idea to decide on our format based on what bad virus checkers will or won’t do, and hopefully virus checkers are OK with data stored in a specified field (though, TBH, it’s hard to overestimate how stupid virus checkers are).

As for losing the data as part of repacking, I also think we don’t have to worry about that too much, because that would just be taking something that is an sdist and creating something that is not an sdist.

That said, I don’t think it would hurt much to have this be an additional storage place for the data, which is likely to be small. If it’s missing, you fall back to reading it from .dist-info/METADATA or wherever it is supposed to go.
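A minimal sketch of that lookup order, assuming the metadata is stored in a gzip extra subfield with the made-up ID `PY` and that the in-archive copy is a top-level `PKG-INFO` as in current sdists (both are assumptions for illustration, as are the function names):

```python
import io
import struct
import tarfile
from typing import Optional

SUBFIELD_ID = b"PY"  # hypothetical subfield ID, not an agreed convention


def metadata_from_extra_field(blob: bytes) -> Optional[bytes]:
    """Fast path: look for the subfield in the gzip header (RFC 1952 layout)."""
    if len(blob) < 12 or blob[:2] != b"\x1f\x8b" or not blob[3] & 0x04:
        return None
    (xlen,) = struct.unpack_from("<H", blob, 10)
    pos, end = 12, min(12 + xlen, len(blob))
    while pos + 4 <= end:
        (length,) = struct.unpack_from("<H", blob, pos + 2)
        if blob[pos:pos + 2] == SUBFIELD_ID:
            return blob[pos + 4:pos + 4 + length]
        pos += 4 + length
    return None


def metadata_from_tarball(blob: bytes) -> Optional[bytes]:
    """Slow path: decompress the tarball until <top-level dir>/PKG-INFO turns up."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        for member in tar:
            if (member.isfile() and member.name.count("/") == 1
                    and member.name.endswith("/PKG-INFO")):
                return tar.extractfile(member).read()
    return None


def read_sdist_metadata(blob: bytes) -> Optional[bytes]:
    """Prefer the header copy; fall back to the file inside the archive."""
    fast = metadata_from_extra_field(blob)
    return fast if fast is not None else metadata_from_tarball(blob)
```

An archive that lost its extra field during repacking would still be readable this way, just more slowly.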

I agree that this would be good, but I would imagine that the projects most likely to benefit from not needing to be extracted (large projects) would also benefit most from the improved compression achieved by using a tarball. In practice, I spot-checked a bunch of projects that have requested size increases: most of them only distribute wheels, and the one that distributes an sdist has all its files individually gzipped, so it’s not really getting much compression. I don’t know how typical this is, though.

I think the solution of a nested tarball (zip file containing tarball + metadata) is nice and clever, but I imagine that it will introduce more problems than using the extra fields will, since it will break all workflows that involve a simple “unzip the tarball and then build it”. That’s not terribly difficult to fix, but adding more possible ways to unpack a source distribution will make a lot of simple bash scripts a lot more complicated.

dholth · July 16, 2020, 7:11pm

The bash script would call a new Python command line tool for unpacking sdists, simplicity restored. Anyway we use pip for that. It does a lot to make things build.
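For illustration only, the core of such an unpacking tool could be quite small. This sketch assumes a nested layout where the outer zip holds exactly one inner `.tar.gz` with the source tree; the layout and the `unpack_sdist` name are hypothetical, and this is not how pip currently works.

```python
import os
import tarfile
import zipfile


def unpack_sdist(archive_path: str, dest: str) -> None:
    """Extract the source tree from a nested sdist (zip wrapping a .tar.gz)."""
    os.makedirs(dest, exist_ok=True)
    with zipfile.ZipFile(archive_path) as outer:
        inner_names = [n for n in outer.namelist() if n.endswith(".tar.gz")]
        if len(inner_names) != 1:
            raise ValueError("expected exactly one inner .tar.gz member")
        # Stream the inner tarball straight out of the zip member.
        with outer.open(inner_names[0]) as inner:
            with tarfile.open(fileobj=inner, mode="r|gz") as tar:
                if hasattr(tarfile, "data_filter"):
                    # Use the safer extraction filter where Python provides it.
                    tar.extractall(dest, filter="data")
                else:
                    tar.extractall(dest)
```

A shell script would then replace its `tar xzf` line with a single call to a command wrapping this function.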

The point of the idea is that you only need random access to the metadata (the dist-info directory).

If you were to check, I think you would find that literally 10 PyPI packages use about half the bandwidth. If your 300MB tensorflow wheel shrinks to 90MB you would notice. You could literally special-case that particular one and get most of the benefit. I haven’t evaluated the case of lots of smaller packages.

The nesting and un-nesting operation is very cheap compared to the time spent running the compression algorithm.