I’d like to be able to distribute resource bundles (say) as a zip file that contains some kind of manifest at the top level, that may refer to paths to specific files within the zipped hierarchy. On the client side I want to be able to validate those paths and ensure that they only refer within the hierarchy, and don’t try to cast around for existing files on the user’s filesystem.
I think there are two steps to that task:
- Ensure, before unzipping, that the zip file either does not represent any symlinks or that they are only to other files within the archive.
- Ensure, when processing a manifest, that the paths:
  - are relative;
  - either do not use `..`, or do not go past the top level of the zip file’s hierarchy when using it.
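The first step might be sketched roughly as follows. It relies on the fact that archives written by Unix tools usually store the file mode in the upper 16 bits of `ZipInfo.external_attr`. Other creators may leave those bits zero, so this is a heuristic rather than a guarantee. The helper name `find_symlink_members` and the demo archive are made up for illustration.

```python
import io
import stat
import zipfile


def find_symlink_members(zf: zipfile.ZipFile) -> list[str]:
    """Return names of members whose stored Unix mode marks them as symlinks."""
    links = []
    for info in zf.infolist():
        # Upper 16 bits of external_attr carry st_mode for Unix-made archives.
        # The check deliberately ignores create_system, since that field is
        # also under the archive author's control.
        mode = info.external_attr >> 16
        if stat.S_ISLNK(mode):
            links.append(info.filename)
    return links


# Build a small in-memory archive: one regular file, one symlink-like entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("manifest.txt", "data/resource.txt\n")
    link = zipfile.ZipInfo("data/link")
    link.create_system = 3  # 3 means "Unix" in the zip format
    link.external_attr = (stat.S_IFLNK | 0o777) << 16
    zf.writestr(link, "../../etc/passwd")  # the link target is stored as data

# Inspect before extracting anything.
with zipfile.ZipFile(buf) as zf:
    suspicious = find_symlink_members(zf)
```

A stricter policy would reject the archive outright whenever `suspicious` is non-empty, rather than trying to validate each link target.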
There’s a security warning in the documentation for `zipfile.ZipFile.extractall`, but I don’t find it very elucidating:
> Never extract archives from untrusted sources without prior inspection. It is possible that files are created outside of path, e.g. members that have absolute filenames starting with "/" or filenames with two dots "..". This module attempts to prevent that. See `extract()` note.
It doesn’t explain how to do such inspection; nor is it clear why inspection is still necessary as a general security consideration, in spite of the module’s “attempts to prevent that”.
On the other hand, for my own purposes, it seems as though the normalization will do things I don’t necessarily want. I’d like to reject the archive, for example, if it tries to put something in `/foo` or `C:\foo`, rather than having it create a relative `foo` instead.
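A strict "reject, don't rewrite" check could look something like the sketch below. It only inspects member names as strings, using the pure path classes so nothing touches the filesystem. The function names are hypothetical, and refusing every `..` in a member name is a policy assumption on my part, not something `zipfile` requires.

```python
import zipfile
from pathlib import PurePosixPath, PureWindowsPath


def check_member_name(name: str) -> None:
    """Raise ValueError for names that extraction would otherwise rewrite."""
    if "\\" in name:
        raise ValueError(f"backslash in member name: {name!r}")
    win = PureWindowsPath(name)
    # drive catches "C:" and UNC prefixes; root catches a leading slash.
    if win.drive or win.root or PurePosixPath(name).is_absolute():
        raise ValueError(f"absolute member name: {name!r}")
    if ".." in PurePosixPath(name).parts:
        raise ValueError(f"'..' component in member name: {name!r}")


def check_archive(zf: zipfile.ZipFile) -> None:
    """Validate every member name before anything is extracted."""
    for info in zf.infolist():
        check_member_name(info.filename)
```

With this, `check_member_name("/foo")` and `check_member_name("C:\\foo")` both raise, while `check_member_name("foo")` returns quietly.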
Finally, it’s not entirely clear to me whether archives can contain representations of symlinks, hard links, Windows shortcuts or whatever else I might need to worry about, or what the semantics of extracting those will be, especially if it happens cross-platform.
Multiply all of the above concerns by the number of archive formats, of course.
For path resolution, I know that I can use `.is_absolute` and `.resolve` on a `pathlib.Path` for basic checks (and ensure that the resolved result is within the hierarchy’s resolved path). However, I’d prefer a resolution that, unlike `.resolve`, doesn’t have a chance to “get lucky” by guessing a path back into the archive hierarchy. I’d prefer if e.g. `../alice/resource.txt` didn’t work to find a top-level `resource.txt`, even if Alice unzips the resource bundle in her home directory. I want the step-by-step resolution of the path to never step outside the hierarchy, while ideally still being able to process `..` and symlinks.
Do I have to build that myself?
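For reference, a minimal lexical version of what I have in mind is sketched below. It walks the path one component at a time and fails as soon as a `..` would step above the bundle root, so `../alice/resource.txt` is rejected no matter where the bundle was unzipped. It assumes manifest paths use `/` as the separator, and it does not follow symlinks at all, so it only makes sense if symlinks were already ruled out. The name `resolve_inside` is just illustrative.

```python
from pathlib import PurePosixPath, PureWindowsPath


def resolve_inside(manifest_path: str) -> PurePosixPath:
    """Resolve a manifest path lexically, refusing to step above the root."""
    # Reject anything that would be absolute on either POSIX or Windows.
    if "\\" in manifest_path or PureWindowsPath(manifest_path).drive:
        raise ValueError(f"not a portable relative path: {manifest_path!r}")
    path = PurePosixPath(manifest_path)
    if path.is_absolute():
        raise ValueError(f"absolute path: {manifest_path!r}")

    stack: list[str] = []
    for part in path.parts:  # "." components are already dropped by pathlib
        if part == "..":
            if not stack:
                # We are at the bundle root; going up would leave it.
                raise ValueError(f"path escapes the bundle root: {manifest_path!r}")
            stack.pop()
        else:
            stack.append(part)
    return PurePosixPath(*stack)
```

Here `resolve_inside("data/../resource.txt")` gives `PurePosixPath("resource.txt")`, whereas `resolve_inside("../alice/resource.txt")` raises `ValueError`. The result can then be joined onto the extraction directory.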