Refactoring the difflib

Hi,
I would like to refactor the difflib for a few reasons. One of them is kind of a bug in the ndiff which leads to waiting forever for the result. I would start with dividing this library into smaller chunks for better maintainability. After that I would fix that bug. Fix is surprisingly easy which concerns me if something in the logic below is ok. I would also like to improve overall code quality of this library. What do you think about this whole “action”?
Best regards

Why is the bug fix only possible after a refactor?

1 Like

It can be done without refactoring. Nevertheless I would like to improve the overall quality of code in that library.

1 Like

This wouldn’t affect the end user, only people who already do maintenance on the library. It’s not really an idea that needs to be discussed. I would start by fixing the bug and submitting a pull request (becoming in a small part one of the people for whom refactoring matters), then possibly starting with some small refactoring and see if you can get that accepted, etc.

4 Likes

Sounds reasonable. I’ll make a change that fixes/addresses issue I’ve spotted and then I’ll start by fixing other issues related to the difflib. Thanks for the answer :slight_smile:

1 Like

Perhaps you’ve already done this. Bug-fixing should start by writing an issue describing the bug.

4 Likes

What do you think about this whole “action”?

It would be best if you got buy in from the module maintainer, Tim Peters who is still an active core developer.

Also, it is not reasonable to ask for carte blanche to substantially rewrite a module. Whether a refactor is beneficial or not can only be known if people know specifically what you intend to do. Perhaps start with a small edit so that reviewers can judge whether it is an actual improvement.

A cursory review of difflib shows that it already is heavily factored. The ndiff function is a one-liner that calls Differ.compare which is itself a thin wrapper about SequenceMatcher and a few internal methods that appropriately factor-out the replace and dump logic.

To my eyes, the module already has a “single source of truth” for every concept and it is mostly broken down into methods that have a “single responsibility”.

One of them is kind of a bug in the ndiff which leads to waiting forever for the result.

Do you have an example that invalidates Tim’s performance claim?

SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

Tim Peters is almost never wrong about such things.

I would also like to improve overall code quality of this library.

Also take into a account that this is long stable code. There is almost no benefit to be had be rearranging it at this point.

Hyrum’s Law says at least someone will depend on every observable behavior. It would be challenging to prove that a substantial refactor is 100% correct, does not change observable behaviors, and has no reduction in performance. The purported “improvement to overall code quality” would need to be substantial win in order to offset the risks to users. Otherwise, it is just code churn.

4 Likes

As others have said, open an issue report first focused solely on the behavior you believe is buggy, There’s not enough information here to say anything more specific.

In general, rewrites aren’t solicited. difflib is very old and stable code, and is not a maintenance burden. Especially for difflib, Hyrum’s Law applies: at the time it was written, Python was a playground for enthusiasts and tinkerers, and computers were much feebler (both much slower and much less RAM). That in turn drove a number of engineering tradeoffs as compromises with practical realities, many of which have visible consequences that existing code relies (knowingly or not) on.

For example, its similarity function (.ratio()) is not symmetric.

>>> import difflib
>>> difflib.SequenceMatcher(None, "diet", "tide").ratio()
0.5
>>> difflib.SequenceMatcher(None, "tide", "diet").ratio()
0.25

The far more common Levenshtein (“edit”) distance is symmetric, but at the time a pure Python implementation of that was too costly to endure.

Does it matter? I’m not sure :wink:. Symmetry is “intuitively appealing”, but unclear beyond that. For example, what’s 64% of 25? Most people scratch their heads, or reach for a calculator. But what’s 25% of 64? It’s really the same question, but with the operands in the other order, and most people have no trouble thinking “OK, a quarter of 64 is half of 32, which is 16”. difflib’s whole purpose for existence was to do a better job of producing code diffs that “made sense” to humans than did Unix​:trade_mark: diff at the time, and “human factors” concerns like that weren’t brushed off. There remains no obvious reason I can see to imagine that changing string A to string B is exactly as costly as changing B to A, to human eyes. But, then again, there remains no obvious reason I can see either to imagine that difflib’s .ratio() reflects that in a useful way.

But it’s been that way forever, and too late to change now.

Relatedly, the later-added autojunk feature strikes me as a failed experiment. It was rammed in quickly to greatly speed a specific user’s applications, but without sufficient thought. Overall I’d guess it did more harm than good, but a whole bunch of apps would slow down a whole lot if it were removed. People remain mostly unaware of it, and have no idea whether they benefit from it.

And so on. The problem isn’t that “grand rewrites” in general are too aggressive, but that they’re too timid. If people want to improve difflib, go whole hog: propose a new library to supplant it, with many “pluggable pieces”. Advances in bioinformatics in particular have pioneered whole new universes of interesting ideas for similarity and differencing approaches. Just follow the links here and come back in a month (:wink:):

Far less ambitious than that, git’s “patience” diff takes a different approach to creating “diffs that make sense to humsns”, and git’s “histogram” diffs look like an improved generaliztion of that.

To be fair, those docs are about SequenceMatcher, but the OP is talking about ndiff. The latter builds on the former, but is more expensive. In the most recent release, one of ndiff’s unique-to-it quadratic-time subsystems was replaced by a worst-case linear-time one, which may or may not be related to the OP’s case.

That could change results too, but in context I was persuaded the algorithm was trying so hard to make some human sense out of seemingly hopeless input that there was no appreciable “human sense” to be found regardless of cost.

What do I mean by “human sense”? An example from the comments captures it:

Before:

private Thread currentThread;

After:

private volatile Thread currentThread;

It’s dead obvious to human eyes that someone inserted " volatile" after “private” (or "volatile " before “Thread”).

But, as far as Levenstein (or many other approaches) are concerned, “minimal edit distance” is achieved by inserting “e volatil” between the “t” and “e” in “private”. Nobody on the planet\ would do that. DIffs building on such stuff don’t help understanding, they impede it. dfflib works against in two ways (viewing blanks as low-value “junk” characters, and favoring contiguous junk-free matches).

Pointing that out at the time brought out legions of “diff zealots” who insisted that humans were wrong to think that at all. I note with some amusement that now it seems to be universally accepted as self-evident truth :smile:

10 Likes

I don’t understand most of what is being said in this thread, but as a generic user I’d like to see an autojunk parameter on difflib.unified_diff. Right now I have to override the module level SequenceMatcher to get the desired effect. I’ve found autojunk is really bad when diffing pretty-formatted JSON.

It would be an added bonus if turning off autojunk didn’t slow the diff down significantly.

1 Like

Do open an issue report. Offhand it sounds like a good idea for higher-level functions to pass on important options to the functions they rely on. This isn’t the place for it, though.

Not enough information to say anything useful. Details are everything. In general, difflib’s original goal was to produce superior diffs for code changes made by humans, by hand. The “human factors” it tries to exploit often just don’t apply to machine-generated output (even just “pretty printing” can destroy clues about changes a human made).

That’s unlikely. autojunk can enormously speed the process, by vastly reducing the space of candidate matches that need to be looked at. That’s why it was introduced. The original motivating use cases were sped by over a factor of 10, with no apparent harm to subjective diff quality to human eyes. But those cases were in fact instances of creating diffs for code changes (mostly C and Python) made by humans, by hand.

For a total disaster, newcomers to Python and bioinformatics often try to use difflib for genome analysis. Very long strings using very few letters (ACGT). “autojunk” foolishly “deduces” they’re so heavily used that they’re all “junk” characters, and produces diffs that amount to no more than "delete all of the original sequence and replace it by all of the second sequence’. But it does so quickly :wink:.

2 Likes