"Safe" archive unpacking and path resolution?

I’d like to be able to distribute resource bundles (say) as a zip file that contains some kind of manifest at the top level, that may refer to paths to specific files within the zipped hierarchy. On the client side I want to be able to validate those paths and ensure that they only refer within the hierarchy, and don’t try to cast around for existing files on the user’s filesystem.

I think there are two steps to that task:

  • Ensure, before unzipping, that the zip file either does not represent any symlinks or that they are only to other files within the archive.
  • Ensure, when processing a manifest, that the paths:
    • are relative;
    • either do not use .., or do not go past the top level of the zip file’s hierarchy when using it

There’s a security warning in the documentation for zipfile.ZipFile.extractall, but I don’t find it very elucidating:

Never extract archives from untrusted sources without prior inspection. It is possible that files are created outside of path, e.g. members that have absolute filenames starting with "/" or filenames with two dots "..". This module attempts to prevent that. See extract() note.

It doesn’t explain how to do such inspection; nor is it clear why inspection is still necessary as a general security consideration, in spite of the module’s “attempts to prevent that”.

On the other hand, for my own purposes, it seems as though the normalization will do things I don’t necessarily want. I’d like to reject the archive, for example, if it tries to put something in /foo or C:\foo, rather than having it create a relative foo instead.

Finally, it’s not entirely clear to me whether archives can contain representations of symlinks, hard links, Windows shortcuts or whatever else I might need to worry about, or what the semantics of extracting those will be, especially if it happens cross-platform.

Multiply all of the above concerns by the number of archive formats, of course.

For path resolution, I know that I can use .is_absolute and .resolve on a pathlib.Path for basic checks (and ensure that the resolved result is within the hierarchy’s resolved path). However, I’d prefer a resolution that, unlike .resolve, doesn’t have a chance to “get lucky” by guessing a path back into the archive hierarchy. I’d prefer if e.g. ../alice/resource.txt didn’t work to find a top-level resource.txt, even if Alice unzips the resource bundle in her home directory. I want the step-by-step resolution of the path to never step outside the hierarchy, while ideally still being able to process .. and symlinks.

Do I have to build that myself?

PEP 706 to the rescue, implemented in Python 3.12…

While currently only implemented for tar, because the relatively safe data strategy is similar to what zipfile already does by default, the nominal plan is to extend this to zipfile in the future, which would allow custom filters for additional inspection, and being able to choose what error handling you want for them. If you’re interested, maybe you could reach out to @encukou about that, since it seemed one of the main blockers to that, per the PEP, was “research” into existing use cross-platform cases for extracting ZIPs, which you seem to have a solid background in.

1 Like

Hmm, I don’t know that I do have much expertise there; mainly what I have is a specific use case motivated by one of my designs. That looks like an excellent starting point, though, thank you.

I do this:

  • path is not empty
  • check that not(isabs(path))
  • check that path == normpath(path)
  • check that not(path.startswith('../'))

You may want to check PyEmpaq’s source code. It does several of those validations before packing the directories/files tree into the zip, you can reuse test cases, etc.