os.path.commonpath doesn't need a Sequence

Kentzo · November 30, 2022, 7:55am

os.path.commonpath is documented to require a sequence:

Return the longest common sub-path of each pathname in the sequence paths.

Typeshed respects this promise:

@overload
def commonpath(paths: Sequence[LiteralString]) -> LiteralString: ...
@overload
def commonpath(paths: Sequence[StrPath]) -> str: ...
@overload
def commonpath(paths: Sequence[BytesPath]) -> bytes: ...

However, the implementation for both ntpath and posixpath can work with an Iterable just fine, since 3.6. At a glance, both implementations could be re-written without requiring random access and re-iteration, i.e. optimized to work in one go.

In my particular use case I have a large collection of objects that contain paths. Materializing the generator expression seems wasteful if it can be avoided.
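To make the use case concrete, here is a minimal sketch of passing a generator expression straight to `commonpath`. The `Record` type and the sample paths are hypothetical, and `posixpath` is used directly so the result does not depend on the platform. On CPython versions before the behaviour was documented, this relies on the implementation converting its argument to a tuple internally.

```python
import posixpath
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical object that carries a path among other fields."""
    name: str
    path: str


records = [
    Record("cat", "/srv/data/images/cat.png"),
    Record("dog", "/srv/data/images/dog.png"),
    Record("meow", "/srv/data/audio/meow.wav"),
]

# No intermediate list is built by the caller: the generator is consumed
# by commonpath itself.
common = posixpath.commonpath(r.path for r in records)
print(common)  # /srv/data
```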

barry-scott · November 30, 2022, 8:48am

Are you asking for the implementation to change or a doc fix?
If an implementation change, then discuss it in Ideas?

Kentzo · November 30, 2022, 4:41pm

Both, but separately.

The documentation and typeshed change can happen right now since both implementations can accept an iterator.

Optimization can be done later.

Unless I’m missing something obvious here that prevents a one-go implementation and the input must be a sequence.

barry-scott · November 30, 2022, 4:47pm

I suggest you first get it agreed that an iterator is supported,
as opposed to it accidentally working.

Kentzo · November 30, 2022, 4:54pm

Indeed, as of now it won't work for an empty collection, but that is a trivial fix.
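One way to picture the empty-input problem: an exhausted generator is still truthy, so an emptiness check made before materialization never fires. A minimal sketch of a caller-side workaround follows. The name `commonpath_any_iterable` is hypothetical and not part of the standard library; it simply materializes first and checks afterwards.

```python
import os


def commonpath_any_iterable(paths):
    """Hypothetical wrapper: accept any iterable, including an empty one."""
    # Materialize first so the emptiness check sees the real contents;
    # `not some_generator` is always False, even when it yields nothing.
    materialized = tuple(paths)
    if not materialized:
        raise ValueError("commonpath() arg is an empty iterable")
    return os.path.commonpath(materialized)


print(commonpath_any_iterable(iter(["a/b", "a/c"])))  # a
```

This still holds every path in memory at once, so it addresses only the API question, not the one-pass optimization.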

EpicWink · December 1, 2022, 1:00am

Do all Python implementations’ os.path.commonpath support iterables?

Kentzo · December 1, 2022, 2:01am

There is no reason they couldn’t. If a particular implementation cannot do the work in one go then it can materialize the iterator internally.

I think the posixpath and ntpath implementations can be done in one go; I'm working on a PoC and will report back here.

CAM-Gerlach · December 1, 2022, 3:32am

This seems to be more a typing issue than a docs one, no?

Kentzo · December 1, 2022, 5:27am

It appears to be an unnecessarily strict requirement due to an implementation detail. The current implementation will work for iterators as long as they are non-empty. It can be trivially modified to allow empty iterators as well.

But I think I can reasonably rewrite the ntpath and posixpath implementations to work in one go, so that iterators can be truly supported instead of being materialized under the hood.

guido · December 1, 2022, 6:10am

To the contrary, I think that it happens to accept iterables by accident of implementation and we shouldn’t commit all future implementations to the same implementation.

Rosuav · December 1, 2022, 6:32am

I’m seeing three independent concepts here.

1. Does commonpath currently accept non-sequence iterables? This is a simple question of implementation, and currently the answer is “Yes”, with the very very small exception that it does a falsiness test before tuplifying, which could easily be tweaked.
2. Should commonpath be documented as accepting arbitrary iterables? IMO this is completely orthogonal, and I have no strong opinion on the matter.
3. Can commonpath be reimplemented to perform one pass over the input? Simple question of implementation, again, and really just a matter of performance.

What’s the advantage of doing precisely one pass over the input (as opposed to forcing it to a tuple and then using it any way it likes)? Are you passing vast numbers of paths, such that the multiple passes actually cost more than the cost of combining them would be? From my reading of the code, it does a few checks, then grabs the min and max of the sequence. Yes, those could be done simultaneously in a single pass, but it wouldn’t really save much.

So of the three questions, one isn’t a proposal, and I’m -0 on the other two.
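The min/max observation above can be sketched as follows: the common prefix of the lexicographically smallest and largest split paths is the common prefix of all of them, and both extremes can be found in a single pass. The helpers `min_max` and `common_prefix` are hypothetical names for illustration, not functions from `posixpath` or `ntpath`.

```python
def min_max(iterable):
    """Return (smallest, largest) item while iterating only once."""
    it = iter(iterable)
    try:
        lo = hi = next(it)
    except StopIteration:
        raise ValueError("min_max() arg is an empty iterable") from None
    for item in it:
        if item < lo:
            lo = item
        elif item > hi:
            hi = item
    return lo, hi


def common_prefix(a, b):
    """Return the leading elements that two lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


paths = ["usr/lib/python3", "usr/lib/x", "usr/local/bin"]
lo, hi = min_max(p.split("/") for p in paths)
print(common_prefix(lo, hi))  # ['usr']
```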

Kentzo · December 1, 2022, 6:56am

1. Faster code: the underlying algorithm should, in principle, find the common path in one go
2. Cleaner code: IMO it’s annoying seeing a collection iterated over five times for no good reason
3. Relaxed requirement of the API (Sequence gets downgraded to Iterable)

What’s not to like?

Rosuav · December 1, 2022, 7:56am

The first two need to be proven (which probably means writing the code), but the third is quite orthogonal; the current implementation can accept arbitrary iterables, because it just tuplifies straight away.
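For a sense of what "writing the code" might involve, here is a simplified one-pass sketch. It is not the code from the PR mentioned later in the thread; the function name `commonpath_one_pass` is hypothetical, and it assumes `str` paths with POSIX separators only (no `bytes`, no `os.PathLike`, no Windows drives). It keeps a running common prefix and shrinks it as each path arrives.

```python
import posixpath


def commonpath_one_pass(paths):
    """Simplified, POSIX-only, str-only sketch of a single-pass commonpath."""
    sep = "/"
    common = None   # running common prefix, as a list of path parts
    is_abs = None   # whether the first path was absolute
    for path in paths:
        # Drop empty parts and "." so "/usr//lib/./x" splits like "/usr/lib/x".
        parts = [p for p in path.split(sep) if p and p != "."]
        path_is_abs = path.startswith(sep)
        if common is None:
            common, is_abs = parts, path_is_abs
            continue
        if path_is_abs != is_abs:
            raise ValueError("Can't mix absolute and relative paths")
        # Shrink the running prefix to what it shares with this path.
        n = 0
        for a, b in zip(common, parts):
            if a != b:
                break
            n += 1
        common = common[:n]
    if common is None:
        raise ValueError("commonpath() arg is an empty iterable")
    return (sep if is_abs else "") + sep.join(common)


sample = ["/usr/lib/python3", "/usr/lib/x", "/usr/local/bin"]
# The generator is consumed once; the result matches the stdlib function.
print(commonpath_one_pass(p for p in sample) == posixpath.commonpath(sample))
```

Whether a loop like this is actually faster than the tuple-based approach is exactly what would need measuring; the sketch only shows that a single pass is possible for this simplified case.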

Kentzo · December 3, 2022, 2:04am

I made an issue on GitHub. I will follow up with code and tests there.

Kentzo · December 14, 2022, 3:44am

Added the PR and the benchmark results.

barry-scott · December 14, 2022, 7:47am

Can you add the code that you used for the benchmark, please?
I cannot tell which are the old vs. new timings.

Kentzo · December 14, 2022, 4:38pm

https://gist.github.com/Kentzo/a36decb53cedbe9da0646a395a1accbb

README.md

The benchmark measures and compares execution time and peak memory consumption of the current and new commonpath methods.

Measurement is performed using batches of *1000* paths. Each batch contains paths generated with the following variables:
1. Length of each path part (`range(16, 65, 16)`)
2. Number of parts (`range(4, 17, 4)`)
3. Number of common parts (`range(parts_count + 1)`)

The batch is then split into chunks of equal size such that paths in each following chunk reduce the number of common parts so far by 1 up to the selected number [3], e.g. if the selected number of common parts is *1* then the batch would be *"a/b/c", "a/b/c", "a/b/d", "a/b/d", "a/c/d", "a/c/d", "a/c/d"*.

Each batch is tested in two permutations:

This file has been truncated.

os_path_commonpath.py

import itertools
import os
import timeit
import statistics
import tracemalloc
from typing import NamedTuple

import genericpath
from ntpath import splitdrive

This file has been truncated.