Os.path.commonpath doesn't need a Sequence

os.path.commonpath is documented to require a sequence:

Return the longest common sub-path of each pathname in the sequence paths .

Typeshed respects this promise:

def commonpath(paths: Sequence[LiteralString]) -> LiteralString: ...
@overload
def commonpath(paths: Sequence[StrPath]) -> str: ...
@overload
def commonpath(paths: Sequence[BytesPath]) -> bytes: ...

However, the implementation for both ntpath and posixpath can work with an Iterable just fine, since 3.6. At a glance, both implementations could be re-written without requiring random access and re-iteration, i.e. optimized to work in one go.

In my particular use-case I have a large collection of objects that contain paths. Materialization of the generator comprehension seems wasteful if it can be avoided.

1 Like

Are you asking for the implementation to change or a doc fix?
If a implementation change then discuss in ideas?

Both, but separately.

The documentation and typeshed change can happen right now since both implementations can accept an iterator.

Optimization can be done later.

Unless I’m missing something obvious here that prevents a one-go implementation and the input must be a sequence.

I suggest you get it agreed that an iterator is supported.
As opposed to accidentally works first.

Indeed, as of now it won‘t work for an empty collection, but that is a trivial fix.

Do all Python implementations’ os.path.commonpath support iterables?

There is no reason they couldn’t. If a particular implementation cannot do the work in one go then it can materialize the iterator internally.

I think the posixpath and ntpath implementations can be done in one go, I’m working on a PoC and will report back here.

This seems to be more a typing issue than a docs one, no?

It appears to be an unnecessarily strict requirement due to implementation detail. Current implementation will work for iterators as long as they non-empty. It can be trivially modified to allow empty iterators as well.

But I think I can reasonably rewrite ntpath and posibpath implementations to work in one-go, so the iterators can be truly supported instead of under-the-hood materialization.

To the contrary, I think that it happens to accept iterables by accident of implementation and we shouldn’t commit all future implementations to the same implementation.

1 Like

I’m seeing three independent concepts here.

  1. Does commonpath currently accept non-sequence iterables? This is a simple question of implementation, and currently the answer is “Yes”, with the very very small exception that it does a falsiness test before tuplifying, which could easily be tweaked.
  2. Should commonpath be documented as accepting arbitrary iterables? IMO this is completely orthogonal, and I have no strong opinion on the matter.
  3. Can commonpath be reimplemented to perform one pass over the input? Simple question of implementation, again, and really just a matter of performance.

What’s the advantage of doing precisely one pass over the input (as opposed to forcing it to a tuple and then using it any way it likes)? Are you passing vast numbers of paths, such that the multiple passes actually cost more than the cost of combining them would be? From my reading of the code, it does a few checks, then grabs the min and max of the sequence. Yes, those could be done simultaneously in a single pass, but it wouldn’t really save much.

So of the three questions, one isn’t a proposal, and I’m -0 on the other two.

  1. Faster code: the underlying algorithm should, in principle, find common path in one go
  2. Cleaner code: IMO it’s annoying seeing a collection iterated over five times for no good reason
  3. Relaxed requirement of the API (Sequence gets downgraded to Iterable)

What’s not to like?

The first two need to be proven (which probably means writing the code), but the third is quite orthogonal; the current implementation can accept arbitrary iterables, because it just tuplifies straight away.

I made an issue on GitHub. I will follow up with code and tests there.

Added the PR and the benchmark results.

Can you add the code that you used for the benchmark please.
I cannot tell what is the old vs. new timings.