I want a way to resize images that is guaranteed to give the same result forever, given the same input.

In other words, I want the process to behave like a pure function, always producing identical output, so I don’t end up with duplicate results that have different hashes.

Say I download an image, resize it with Image.resize from PIL, and delete the original. Then I take the sha256 of the result, so I can later verify that the resized file is intact by comparing it against the sha256 I have stored.

If, years later, I accidentally download the original file again and run Image.resize with the exact same arguments on the exact same file, am I guaranteed to get the same result? Or might I get a similar but different file, with a different sha256, that would be automatically stored alongside the result I got before?

If I make the program check what version of PIL is being used, so it’s always the same, would that work?
I could have my own copy of PIL somewhere on my PC that never changes.
Maybe that isn’t necessary, and it would be fine to update PIL.
Maybe there is something better to use than PIL.
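For concreteness, the version check I have in mind would look something like this. This is just a sketch; the helper name is mine, and the version string you pin would be whatever you recorded when you generated your hashes:

```python
import importlib

def require_version(module_name: str, expected: str):
    """Import module_name and refuse to continue unless its __version__
    matches the version the stored hashes were generated with."""
    mod = importlib.import_module(module_name)
    found = getattr(mod, "__version__", "<no __version__>")
    if found != expected:
        raise RuntimeError(
            f"{module_name} is {found}, but the hashes were made with {expected}"
        )
    return mod

# In real use you would hard-code the recorded version, e.g.:
#     PIL = require_version("PIL", "10.3.0")   # "10.3.0" is a placeholder
```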

Thanks for making it this far.

I have the impression it is beyond the scope of Python, in the sense that you would need to answer the same questions if you were working with any other programming language.

I do not know much about this topic specifically, but you might want to read about “reproducibility”. Pinning the same tools (in your case the Python interpreter version, the PIL library version and its dependencies, and so on) is probably an important part of the solution. A containerization solution (like Docker) might help.

But you probably also need to check the timestamps of the files you are working with. Since we are talking about images, they often contain a lot of metadata (EXIF or something like that), and in that metadata there may be other timestamps you want to keep an eye on. Typically you either remove all timestamps or set them to a known fixed value (the epoch, maybe).
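As a sketch of the metadata point: with Pillow you can re-save only the pixel data, so EXIF and other metadata never reach the output. The helper name is hypothetical, and this simple copy assumes a plain RGB-style image (it doesn’t try to preserve palettes or ICC profiles):

```python
from PIL import Image

def strip_metadata(src_path: str, dst_path: str) -> None:
    """Copy only mode, size and pixels into a fresh image, so EXIF,
    timestamps and other metadata from the source are dropped."""
    with Image.open(src_path) as im:
        clean = Image.new(im.mode, im.size)
        clean.putdata(list(im.getdata()))
        clean.save(dst_path)
```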

It may also depend on which file format you save to. Because of the different compression algorithms used in various formats, bitwise identity might not be assured. I would recommend doing a test with an uncompressed format like TIFF (be aware that TIFF also has compressed variants, but take the uncompressed one), and verifying that you get the exact same number of bytes each time. Even then you only MIGHT get the exact same file.
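A quick way to run that TIFF test with Pillow (whose TIFF writer saves uncompressed strips by default; the function names here are just illustrative):

```python
import hashlib
from PIL import Image

def sha256_file(path: str) -> str:
    """sha256 hex digest of a file's bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def save_tiff_twice(im: Image.Image, path_a: str, path_b: str):
    """Save the same image twice as TIFF and return both digests; if they
    differ, your toolchain is not byte-reproducible even for TIFF."""
    im.save(path_a, format="TIFF")  # Pillow writes uncompressed data by default
    im.save(path_b, format="TIFF")
    return sha256_file(path_a), sha256_file(path_b)
```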

But I would be inclined to work differently. More on that later. For these examples, I took this 1x1 PNG (base64’d for the post):


which has this SHA256: 2aa4fa20701cdd6d8d56046069001186b5267e3ee7d0ef618ad2f4a683723e11

Now, I could resize this up to 50x50, re-encode it to PNG, and save it as a file. (I didn’t use PIL for this, I used ImageMagick instead, but the same effect happens.) These are the files:

b68999e40c9ef6a0f20884b5e21411a6016e6000c1ce89518e5ce756e942fa4c  test50:1.png
f8b3d7f8d8ddbd4d78e155a1a59a3b571a7e5f6f5a1abb319fd3622c80ad3570  test50:2.png
5f9ef5eba314d44008ecbf48c6c2e66cd34a3dfb57160fb6c2debed177be6823  test50:3.png
f23609f2b5d6e1390fbc77808cc93a50a202b7be2cedaa7d1c475fe51dca454d  test50:4.png
1870451f5d518eb5261edf634e953644d69e456f0a731c1452333b19e9957e0f  test50:5.png

So, even with the exact same command, same library, etc., no, you’re not guaranteed the same result. (These files all have the same file size, but they aren’t bit-for-bit identical.)
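If you want to check whether your own resize-and-encode pipeline behaves like this, here’s an in-memory sketch with Pillow (the function name is mine; your real pipeline may differ):

```python
import hashlib
import io
from PIL import Image

def resize_digest(im: Image.Image, size, fmt: str = "PNG") -> str:
    """Resize, re-encode in memory, and return the sha256 of the encoded bytes."""
    buf = io.BytesIO()
    im.resize(size).save(buf, format=fmt)
    return hashlib.sha256(buf.getvalue()).hexdigest()

# Run it several times on the same input; more than one distinct digest
# means the encoder is not byte-reproducible on your setup.
```

Collecting a few digests in a set makes the check a one-liner: `len(digests) == 1`.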

Using TIFF, I do achieve bitwise identical files:

96d82681cf9a24452d5a63ce2044f5184d9686e366676d92fb5e7ffbb7fdf493  test50:1.tiff
96d82681cf9a24452d5a63ce2044f5184d9686e366676d92fb5e7ffbb7fdf493  test50:2.tiff
96d82681cf9a24452d5a63ce2044f5184d9686e366676d92fb5e7ffbb7fdf493  test50:3.tiff
96d82681cf9a24452d5a63ce2044f5184d9686e366676d92fb5e7ffbb7fdf493  test50:4.tiff
96d82681cf9a24452d5a63ce2044f5184d9686e366676d92fb5e7ffbb7fdf493  test50:5.tiff

But even this isn’t going to be absolutely 100% guaranteed. So here’s my recommendation: Don’t take a sha256 of the result; instead, take a sha256 of the input, and annotate that with the changes made. So store a nice usable PNG file, but instead of storing its own hash, store something like this:


This is absolutely guaranteed to be semantically correct, but it doesn’t specify a precise bit pattern. In HTTP terms, this is what would be called a “weak ETag” rather than a “strong ETag”. A web browser will happily retain its cached copy of an object, knowing that it’s semantically equivalent to whatever it could fetch.
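The scheme above can be sketched without any imaging library at all; the record layout here is just an illustration of "hash of the input plus a description of the transformation":

```python
import hashlib
import json

def make_manifest(original_bytes: bytes, operation: dict) -> dict:
    """Store the hash of the *input* plus a description of the transformation,
    instead of hashing the (possibly unstable) output bytes."""
    return {
        "source_sha256": hashlib.sha256(original_bytes).hexdigest(),
        "operation": operation,
    }

manifest = make_manifest(
    b"...original image bytes...",                    # placeholder input
    {"tool": "PIL.Image.resize", "size": [50, 50]},   # illustrative annotation
)
print(json.dumps(manifest, indent=2))
```

To verify a re-downloaded file later, you hash the new download and compare it to `source_sha256`; the output file is only checked for semantic consistency with `operation`.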

Would that be suitable for your purposes?

Thank you for taking the time to advise once again.

I know what to do now.

I never even considered that running the same resizing program on the same file would give different results. That’s bizarre, and I never once found it in my testing, which was big → smaller, png → jpg and jpg → jpg. But, to be fair, I didn’t do a big auto test on lots of files.

I should have made it clearer what my goal actually was. If anyone’s interested, I had two goals: To use stored hashes to make sure the files I have aren’t corrupted, and to automatically detect if the same file was downloaded twice, despite not keeping the full-size versions.

I could have easily done that by storing two hashes for each file: one from the original, and one from the current, resized file.

I thought I could be extra clever and only store the hash of the resized file. The idea was that a reproducible conversion process would let me compare any new file against the resized hash, without needing the original’s hash at all.
And the reason I liked that system, was that if something went wrong with my stored hashes, I could just regenerate them from the stored files.

You make some good points. Maybe it wasn’t meant to be.

For future reference, though: On the point of metadata interfering with the uniqueness of the hash, I planned to use the ImageMagick signature to avoid that problem.

“Note, the image signature is generated from the pixel components, not the image metadata.”
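The same pixel-only idea is easy to get in Pillow by hashing the decoded image rather than the file bytes (`pixel_sha256` is my name for it, not a Pillow or ImageMagick API):

```python
import hashlib
from PIL import Image

def pixel_sha256(path: str) -> str:
    """Digest of mode, size and raw pixel data only; container metadata
    such as EXIF or PNG text chunks cannot change this value."""
    with Image.open(path) as im:
        h = hashlib.sha256()
        h.update(im.mode.encode())
        h.update(repr(im.size).encode())
        h.update(im.tobytes())
        return h.hexdigest()
```

Like the ImageMagick signature, two files that differ only in metadata will share this digest even while their file-level sha256s differ.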

Ahhh yes, that makes a lot of sense. It would be very nice if you COULD double up like that, but unfortunately, I don’t think it’d be possible, or at least, not without some very careful tuning. I don’t know whether you’ll actually run into this, but one thing I’ve seen occasionally that gets in the way of reproducibility is residual bits.

Let’s say you’re compressing data with a Huffman encoder (a component of many compression schemes, including DEFLATE, which PNG uses, and the entropy-coding stage of JPEG). A byte of input is translated into some sequence of bits, and those bits are then packaged back up into bytes with no particular meaning. What happens if you need, say, 734215 bits to represent your data? That isn’t a multiple of eight, but you still have to write out whole bytes. The spare few bits don’t matter and the decoder will ignore them; so what will the encoder put in them? Unfortunately, some encoders allow random bits from memory to end up in those bytes. That could disrupt your hash just for the sake of a couple of completely insignificant bits.

Maybe you’ll be lucky and this won’t happen, but I do worry that the quest for reproducibility might place undesirable restrictions on your choice of file format, encoder, etc.

Fortunately though, hashes aren’t long. Storing two hashes seems like the best way to do this.
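A sketch of the two-hash record (all the names here are illustrative, not from any library):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def make_record(original_bytes: bytes, resized_bytes: bytes) -> dict:
    """The original hash detects re-downloads; the resized hash detects
    corruption of the file you actually keep."""
    return {
        "original_sha256": sha256_hex(original_bytes),
        "resized_sha256": sha256_hex(resized_bytes),
    }

def is_duplicate(candidate_bytes: bytes, records: list) -> bool:
    """True if a newly downloaded file matches any stored original hash."""
    h = sha256_hex(candidate_bytes)
    return any(r["original_sha256"] == h for r in records)
```

This covers both goals without depending on the resize step being reproducible at all.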

Running a later version of the same resizing program might easily give
different results, and so might the same program on another platform (which
might link to different or revised underlying libraries, or have
different defaults such as compression levels etc.).

Cameron Simpson cs@cskk.id.au