The ignore parameter of shutil.copytree should return a set

I was playing with the shutil.copytree function and its ignore parameter. The docs and the copytree example show that the ignore callback should return a sequence (The provided example returns an empty list). Then I looked at the source code of the ignore_patterns factory function, and I saw that it returns a set rather than a list. At first, I thought it was simply to remove duplicates, but then I looked at the source code of shutil.copytree itself, and I realized there is a deeper reason to prefer sets over lists. The code only uses the in operator to test if the current entry is in the sequence returned by the ignore callback and nothing else. Sets are obviously specialized for fast membership tests (O(1) for the in operator as opposed to O(n) for lists), so if all we do with the result of the ignore callback is testing whether a given entry is in it or not, then it makes sense to use a set rather than a list. However, the docs don’t mention anything about returning a set to improve performance, and the example even returns a list (the specific example in the docs always returns an empty list, so there are no performance gains from returning a set instead, but the general point still stands).

So, if I write a custom callback for the ignore parameter, and I only read the docs and not the source code, I would probably return a list rather than a set (by filtering the given names list or something similar), which would be less efficient, especially when copying directories that contain a lot of entries.

I think this situation should be solved in one of the following ways:

  1. The docs should mention that it’s recommended to return a set rather than a list, and update the example, or add another “heavier” example that doesn’t return an empty sequence and could actually benefit from returning a set instead of a list.
  2. The docs can stay as they are, but the source code should be changed to convert the result of the ignore parameter to a set. It’s as simple as passing the ignore result in the code to the set constructor. This approach has the added benefit that ignore can now return any iterable and not just a sequence, so even a generator can be used.
1 Like

I suspect the correct abstraction to mention is a Container.

1 Like