File metadata after wheel installation

Suppose I’m building wheels for a package that needs cache files, which are detected as up-to-date or not based on the modification date in their file metadata. Basically similar to pycs, except that the language is not Python. Is there a way to specify that these files must get a higher modification date than the files they refer to after unpacking the wheel?

Is reordering files in the wheel archive a reliable workaround?

I don’t think there’s any approach that’s going to be reliable here.

If the files installed by the wheels may be replaced at a later date, you really ought to store the newer cache files elsewhere (since the install directory may not be writable later on). In this case, if there are no files in the “mutable cache” location, or the ones that are there are older than the pre-filled cache[1], use the pre-filled cache.


  1. Which should always be true enough, whether the installed cache kept the original creation times or got the install times on the files. If there’s some link to the version of the package then you’ll need that separately anyway.
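A minimal sketch of that selection rule, assuming the two cache locations are plain directories. The names `newest_mtime` and `pick_cache_dir` are made up for illustration and are not part of any existing tool.

```python
from pathlib import Path
from typing import Optional


def newest_mtime(directory: Path) -> Optional[float]:
    """Return the newest mtime among files under `directory`, or None if there are none."""
    if not directory.is_dir():
        return None
    mtimes = [p.stat().st_mtime for p in directory.rglob("*") if p.is_file()]
    return max(mtimes, default=None)


def pick_cache_dir(prefilled: Path, mutable: Path) -> Path:
    """Use the mutable cache unless it is empty or older than the pre-filled one."""
    mutable_time = newest_mtime(mutable)
    prefilled_time = newest_mtime(prefilled)
    if mutable_time is None:
        # Nothing has been written to the mutable location yet.
        return prefilled
    if prefilled_time is not None and mutable_time < prefilled_time:
        # The package was reinstalled after the mutable cache was written.
        return prefilled
    return mutable
```

Comparing only the newest file in each directory is a simplification; a per-file comparison would follow the same pattern.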

Strictly speaking, I don’t think you can rely on anything in particular about modification times or extraction order.

Is there a specific reason you want to include cached data in wheels? Note that .pyc files aren’t included in wheels; they are created after the .py files from wheels are installed. Unless the process of generating them takes a very long time, perhaps it’d be best to simply generate them on the first run. This would also mean smaller wheels.

Also it’s a good idea not to rely solely on mtimes to determine whether the cache is up-to-date. Python’s .pyc files can optionally be validated against a hash of the source file rather than its mtime (PEP 552), as that is more reliable, although the mtime check remains the default. You could also use a two layer approach, i.e. use mtimes first and if they indicate that the file is outdated, check hash to avoid regenerating unnecessarily.
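The two layer approach could look roughly like the sketch below. It assumes you control the cache-checking code (which is not the case for Guile) and that a digest of the source is stored in a side file next to the cache; `file_digest`, `cache_is_fresh` and the side-file layout are illustrative assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cache_is_fresh(source: Path, cache: Path, digest_file: Path) -> bool:
    """Cheap mtime check first; fall back to a hash only if mtimes look stale."""
    if not cache.exists():
        return False
    # Layer 1: the cache is at least as new as the source.
    if cache.stat().st_mtime >= source.stat().st_mtime:
        return True
    # Layer 2: mtimes say "outdated", so compare against the recorded digest
    # to avoid regenerating when only the timestamps changed.
    if not digest_file.exists():
        return False
    return digest_file.read_text().strip() == file_digest(source)
```

With this arrangement an installer that scrambles timestamps costs one hash computation per file instead of a full rebuild.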

Thank you @steve.dower and @mgorny for your thoughts.

Well, I’m packing wheels of LilyPond, an application written in C++ using Guile Scheme as its extension language, and these files are Guile bytecode files, essentially Guile’s equivalent of .pyc files. (The goal is to make it easy for Python programs like abjad to manage their dependency on LilyPond.)

I know that Guile should really really use hashes of the source files instead of mtimes, but this is not under my control. I’m just reusing the officially distributed static binaries, so I’d like to avoid patching Guile, as it is far more convenient for me not to do my own builds.

It’s not possible to generate these files on the first execution for a couple of reasons; most prominently, initializing the Guile runtime relies on some of these files being already available in precompiled form. The bootstrap process to generate these takes an incredibly long time.

After reordering the members of the archive to put the folder containing these files last, pip install seems to yield functional LilyPond binaries. Of course this is suboptimal, although to be honest I’m happy enough with it.

I wonder though: would it be worthwhile to specify somewhere what the mtimes of files in extracted wheels are set to (i.e. current time in order of extraction or time from archive)?

If more people ask the same question, then it’s probably worthwhile.

I predict the team will specify them as “unspecified, do not rely on mtimes” though, which isn’t really helpful for you. (But it’s incredibly helpful for pip, as they don’t have to implement a new archive protocol across every platform in order to maintain a requirement that 99.9% of users don’t need or care about.)

For me, both (a) “make mtimes strictly increasing with respect to the order of archive members” and (b) “set mtimes to what the archive specifies” are workable options. (In the first one, I order the members as I want, and in the second one, I set the times accordingly. Now that I think about it, I could do both to be extra sure.)

On the other hand, stuff like “make mtimes increasing with respect to the lexicographic order of archive members” would not work. I’m not really sure there would be a reason for pip to do that.
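Doing both (a) and (b) at once could be sketched as follows, using only the standard `zipfile` module. The function name `move_cache_last` and the `cache_prefix` argument are hypothetical, and nothing here makes an installer honour either the order or the stored times.

```python
import zipfile
from datetime import datetime, timedelta


def move_cache_last(src: str, dst: str, cache_prefix: str) -> None:
    """Copy a zip archive, writing cache members last with a later timestamp."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        infos = zin.infolist()
        # Stable sort: non-cache members keep their order and come first.
        ordered = sorted(infos, key=lambda i: i.filename.startswith(cache_prefix))
        newest = max(
            (datetime(*i.date_time) for i in infos
             if not i.filename.startswith(cache_prefix)),
            default=datetime(1980, 1, 1),
        )
        # ZIP timestamps have a two-second resolution, so step by 2 seconds.
        cache_time = (newest + timedelta(seconds=2)).timetuple()[:6]
        for info in ordered:
            is_cache = info.filename.startswith(cache_prefix)
            new_info = zipfile.ZipInfo(
                info.filename,
                date_time=cache_time if is_cache else info.date_time,
            )
            new_info.external_attr = info.external_attr  # keep permissions
            new_info.compress_type = info.compress_type
            zout.writestr(new_info, zin.read(info))
```

File contents are unchanged, so the hashes listed in the wheel’s RECORD file stay valid.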

Currently, pip does (a) (as confirmed by unzip_file in src/pip/_internal/utils/unpacking.py). I’m just wondering if pip could make this behavior a documented guarantee instead of internal detail, not if it could change its source code in any way.

(PS: By Hyrum’s law, I’m pretty sure there are already people who are relying on this…)

Have you considered reporting this to them? I’m not saying that this will help you right now but it may be a good future development for Guile.

In my personal opinion, it’s not a good idea. This is a rather tricky corner case: in the best case people will simply miss it, and in the worst case they’ll have to hack around the unpacker implementation to get the correct behavior.

The relationship between LilyPond and Guile developers is… let’s say complicated. The problems with Guile bytecode files (which include things like a total lack of dependency management, so Guile will happily compile code with macros into a bytecode file, then “miss” recompilation when those macros, provided by another module, change) are infamous in the small community Guile has, but the developers have not made improving this situation their priority in the 12 years since bytecode files were introduced.

Sorry, I don’t understand what you mean. What is the “correct behavior” according to you?

I mean “correct” according to whatever gets specified. People easily miss things like this.

Well, I was suggesting to specify pip’s current behavior.

Are there important tools besides pip that need to install wheels? (I know lots of tools consume them, like auditwheel and such, but pip is the only installer I’m aware of, though I’m not an expert at all).

Wheels are an interchange format, which means you need to specify that wheels preserve (or have rules about) file metadata. The current specification does not.

pip’s behaviour isn’t relevant here, it’s just following what the spec says.

That said, you’re in the right place to propose a change to the specification. I think it’s unlikely to be accepted, but you’re welcome to make the proposal.

OK, then I am wise enough not to make the proposal :slight_smile: Thank you for your time.

I went ahead and published lilypond · PyPI, fully accepting the risk that if pip’s behaviour changes, I might need to switch to doing my own builds with a patched Guile.
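One way to notice such a change early is a post-install check that flags bytecode files older than their sources. The sketch below assumes a hypothetical layout where each `.scm` source has a `.go` file at the same relative path under a separate cache root; the real LilyPond layout and the exact comparison Guile performs are not verified here.

```python
from pathlib import Path
from typing import List


def stale_cache_files(source_root: Path, cache_root: Path,
                      source_suffix: str = ".scm",
                      cache_suffix: str = ".go") -> List[Path]:
    """Return cache files whose mtime is strictly older than their source's."""
    stale = []
    for source in source_root.rglob(f"*{source_suffix}"):
        # Assumed mapping: same relative path, different suffix.
        relative = source.relative_to(source_root)
        cached = (cache_root / relative).with_suffix(cache_suffix)
        if cached.exists() and cached.stat().st_mtime < source.stat().st_mtime:
            stale.append(cached)
    return sorted(stale)
```

Running this in CI against a fresh `pip install` of the wheel would turn a silent slowdown into a visible failure.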

While it’s unlikely that pip would deliberately change this behaviour, it’s certainly possible it could be changed inadvertently as a result of some other change. As a purely hypothetical example, if pip needed to process the set of file names in the wheel, we might extract the names into a (Python) set and “do stuff”, and then, rather than re-read the zip index, iterate over the set to extract the files, which would make the order of extraction arbitrary. As I say, this is very unlikely, but it’s the sort of thing that could happen, which would make maintaining the current behaviour problematic (and hence, not something we’d normally do unless the spec mandated it). So yes, I’m afraid “unspecified, do not rely on mtimes” is the answer you’d get from me.

One thought - you describe the files as like Python bytecode, could you maybe ship only the bytecode files (or ship the source in a location that’s not visible to the Guile interpreter), so that it will be the only file the interpreter sees and there’s no “newer source” to worry about?

Interesting. Unfortunately, from a quick test, Guile doesn’t look prepared for that (but thank you for the idea).

Hmmmm… Looking into it a bit more, I think it might be feasible to design a solution around this. I’ll see what we can do for this in LilyPond upstream.

If recreating the files is hard, could you maybe keep the files in a separate location and install/move them somewhere to a writeable location upon the first run?
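A rough sketch of that first-run step, assuming the packaged cache is a directory inside the installed package and the target is some per-user writable directory. `ensure_user_cache` and the `.complete` marker file are illustrative names, and pointing the interpreter at the new location is outside this sketch.

```python
import os
import shutil
import time
from pathlib import Path


def ensure_user_cache(packaged: Path, user_cache: Path) -> Path:
    """Copy the packaged cache to a writable location once and refresh its mtimes."""
    marker = user_cache / ".complete"
    if not marker.exists():
        shutil.copytree(packaged, user_cache, dirs_exist_ok=True)
        # copytree preserves the original mtimes, so set them explicitly to
        # "now", which is later than anything written at install time.
        now = time.time()
        for path in user_cache.rglob("*"):
            if path.is_file():
                os.utime(path, (now, now))
        # Written last, so an interrupted copy is retried on the next run.
        marker.touch()
    return user_cache
```

Because the timestamps are set at copy time, this no longer depends on what the installer did with mtimes.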

installer · PyPI is an important tool/library as well. Plus, most of the operating systems and distributions have their package managers that can affect the results as well, though they are generally mtime-preserving by design.