Difflib should support iterators as input

ahvigil · April 21, 2023, 3:44pm

Currently, difflib.unified_diff and difflib.context_diff raise errors at initial type checking if you try to pass in generators (TypeError: 'generator' object is not subscriptable). Would be nice if you could open a file, pass in a generator over the lines in that file, and have difflib read lines as necessary so callers don’t have to read the inputs into memory themselves.

a = ['some\n', 'expected\n', 'text\n']
b = ['some\n', 'unexpected\n', 'text\n']
difflib.unified_diff(a, b))

vs

c = (line for line in a)
d = (line for line in b)
difflib.unified_diff(c, d))

which produces

File ~/.pyenv/versions/3.10.1/lib/python3.10/difflib.py:1263, in _check_types(a, b, *args)
   1256 def _check_types(a, b, *args):
   1257     # Checking types is weird, but the alternative is garbled output when
   1258     # someone passes mixed bytes and str to {unified,context}_diff(). E.g.
   (...)
   1261     #   +++ b'newfile.txt'
   1262     # because of how str.format() incorporates bytes objects.
-> 1263     if a and not isinstance(a[0], str):
   1264         raise TypeError('lines to compare must be str, not %s (%r)' %
   1265                         (type(a[0]).__name__, a[0]))
   1266     if b and not isinstance(b[0], str):

TypeError: 'generator' object is not subscriptable

tjreedy · April 21, 2023, 5:22pm

You should be asking that difflib functions support iterators as input, as generators are a specific subcategory or iterator. Iterators in general are not subscriptable. Have you checked whether the idea is feasible? IE, can you replace indexing with next calls in any of the functions?

ahvigil · April 21, 2023, 8:31pm

Implementing proper support for iterators in a way that doesn’t just materialize the entire iterator isn’t trivial. The underlying SequenceMatcher class uses an algorithm based on finding the longest matching subsequence which heavily indexes into the underlying sequence. One might be able to reimplement the algorithm there to add windowing so that only part of the sequence is considered as the diff gets generated, but that would be a little bit of a project with potentially some algorithmic changes.

A small optimization might be to read through the two iterators until the first difference and only start the matching against the contents at the first diff. That would save you from having to read everything into memory for the cases where two texts are identical or have sparse diffs placed well into each file.

of course you could always trivially add logic at the beginning to unroll an iterator, though that becomes more of a stylistic thing.

tjreedy · April 21, 2023, 8:43pm

I don’t understand. Supporting generators is no different from supporting iterators as generators are iterators and are accessed the same way. Anyway, I strongly suspect that in general a pair would have to be materialized in total to be diffed.

ahvigil · April 21, 2023, 10:42pm

yes, I updated the title of the thread to iterators based on your previous comment- you can forget I mentioned generators before.

The point for me of using an iterator would be to avoid having to read the lines of a very large file into memory. Certainly the differ needs to read all 40GB of a 40GB file, but for files that differ by a few lines most of that 40GB might never be used again

I suggested 3 approaches in my comment above one might take to support iterators streaming in data for these difflib functions:

rewrite the algorithm so the entire contents to be diffed doesnt need to be available at any given point in memory (this seems like a hard problem and probably less feasible unless someone really got into it)
add a step at the beginning of the function that seeks to the first difference and starts the diff there, since we never have to look at anything before ever again. This would somewhat help with large sparsely diffing files where most of a hypothetical 40GB diff is matching content.
unroll the iterator at the very beginning and turn it into a list. this would let me pass an iterator in but doesn’t really help with my 40GB file diff problem

I should’ve been more specific with my underlying issue- I don’t care about iterators per se, just wondering if there is a way to diff very large files efficiently without reading them into a list all at once

ahvigil · April 21, 2023, 10:50pm

more specifically, my idea here is to reduce the average space complexity of difflib functions

tjreedy · April 22, 2023, 3:12am

The problem with changing difflib very much is that diffing is a specialized algorithm area. Difflib was written by Tim Peters, who retired years ago, and who has not been replaced.

If one wants an efficient line-change only differ, one could iterate through line sequences A and B in parallel. If lines a and b are not equal, use difflib to make a within-line diff. If one wants to allow single line inserts and deletions and line swaps, then on a != b, look ahead to the next lines a’ and b’ and compare a to b’ and a’ to b, and if either are equal, proceed appropriately. Otherwise revert to a within-line diff. Given a deletion or insert, one could iterate one file looking to detect a larger deleted or inserted block. I don’t know if such simple cases are common enough to make posting something on pypi worth while.

ahvigil · April 25, 2023, 2:49pm

Thanks for your suggestion! I’ll go ahead and wrap the current difflib functions to meet my needs.