Add pathlib.Path.stems

AJNeufeld · April 20, 2023, 8:26pm

Path.suffix is to Path.stem as Path.suffixes is to …?

>>> from pathlib import Path
>>> p = Path("CurveFile.vs2022.vcxproj")
>>> p.suffix
'.vcxproj'
>>> p.stem
'CurveFile.vs2022'
>>> p.suffixes
['.vs2022', '.vcxproj']

While you can programmatically get each and every suffix, there is currently no way to extract the first part ('CurveFile') without doing string manipulation.

Suggestion: Path.stems would return a list of the stems, to each of which the remaining suffixes could be appended to reconstruct the original Path.name component, e.g.:

>>> p.stems
['CurveFile', 'CurveFile.vs2022']
>>> for idx, stem in enumerate(p.stems):
...     print(stem + "".join(p.suffixes[idx:]))
...
CurveFile.vs2022.vcxproj
CurveFile.vs2022.vcxproj

Alternately, just a Path.initial_stem which returns the 'CurveFile'?

Alternately, a Path.name_parts which returns ['CurveFile', '.vs2022', '.vcxproj'] and Path.suffixes could be re-implemented as just return self.name_parts[1:]
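For illustration, the name_parts alternative can be approximated today with a free function built on Path.suffixes. This is only a sketch: the function name and its edge-case behaviour (for example, a leading dot stays in the first element) are assumptions, not an existing pathlib API.

```python
from pathlib import PurePath

def name_parts(path):
    """Hypothetical helper: split path.name into [prefix, *suffixes]."""
    path = PurePath(path)
    suffixes = path.suffixes
    name = path.name
    # The prefix is whatever remains once all suffixes are cut off the end.
    prefix = name[: len(name) - len("".join(suffixes))]
    return [prefix, *suffixes]

parts = name_parts("CurveFile.vs2022.vcxproj")
print(parts)           # ['CurveFile', '.vs2022', '.vcxproj']
print("".join(parts))  # CurveFile.vs2022.vcxproj
```

Joining the returned list always gives back the original name, which is the round-trip property asked for above.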

h-vetinari · April 20, 2023, 11:55pm

Constructing a list of stems which is redundant with the suffixes (requiring careful matching to not drop or duplicate one) sounds like a bad idea to me. IOW

[p.name == st + "".join(p.suffixes[i:]) for i, st in enumerate(p.stems)]

would yield a list full of True; in other words, every additional way of reconstructing p is redundant, IMO. And woe betide anyone who gets some indexing wrong between stems and suffixes: you end up with 'CurveFile.vs2022.vs2022.vcxproj', or 'CurveFile.vcxproj', etc.

To me, Path.stem in that example should just be 'CurveFile', though I understand we probably can’t (easily) change that behaviour. Still, that way things would be unambiguous, minimal & complete for the purposes of being able to take paths apart & put them back together again, which sounds more desirable IMO than the other alternatives (which would however still be better than p.suffixes, IMO).

pylang · April 21, 2023, 12:45am

The problem this solution is after has bitten me a few times.

I think I’ve always wanted a convenience property/method that basically returns the name without any extensions, similar to fname.split(".", maxsplit=1)[0]:

>>> Path("path/to/archive.tar.gz").filename_no_exts
'archive'
>>> Path("/path/to/.bashrc").filename_no_exts
''

Though there may be some edge cases I’m not remembering right now.
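A minimal sketch of that behaviour as a free function, assuming the split-on-first-dot rule described above is really what is wanted (filename_no_exts is a hypothetical name, not a pathlib attribute):

```python
from pathlib import PurePath

def filename_no_exts(path):
    """Hypothetical helper: the part of the file name before the first dot."""
    return PurePath(path).name.split(".", maxsplit=1)[0]

print(filename_no_exts("path/to/archive.tar.gz"))  # archive
print(repr(filename_no_exts("/path/to/.bashrc")))  # ''
```

One edge case this rule does not handle well is a name with dots in its version number: "pip-8.1.1.tar.gz" comes out as 'pip-8'.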

barneygale · April 21, 2023, 7:09pm

File extensions are difficult to get right, because filenames like pip-8.1.1.tar.gz are commonplace, and humans rely on context and experience to figure out where the stem ends and the extension begins. I think this is why os.path provides splitext(), but not a function that repeatedly applies splitext() to get “all the extensions”.

Similarly pathlib has path.stem and path.suffix, which split on the rightmost period. This produces reasonable results 99% of the time. There’s path.suffixes too, but it should be treated with care: items earlier in the returned list are less likely to be file extensions:

>>> import pathlib
>>> pathlib.Path('pip-8.1.1.tar.gz').suffixes
['.1', '.1', '.tar', '.gz']

Personally I don’t much like path.suffixes - it’s too easy to get misleading results like the above. A path.stems property would be affected by the same problem, I think.

AJNeufeld · April 21, 2023, 7:37pm

Well, clearly, that file should have been pip-8.1.1.tgz then.

In all seriousness, the result ['.1', '.1', '.tar', '.gz'] highlights the incompleteness of .suffixes. Path.parts yields all of the parts of the given path, and the operation can be reversed to get back the original Path.
I think we need that for the Path.name as well. Path.suffixes is close, but omits the “prefix”.

If Path.name_parts returned ['pip-8', '.1', '.1', '.tar', '.gz'], then the user could use "".join(p.name_parts[:-2]) to recover the path’s name without the final two extensions, be that pip-8.1.1 or CurveFile, but that is falling back on string concatenation.

Something simpler than:

>>> Path("a/b/pip-8.1.1.tar.gz").with_suffix("").with_suffix("")
WindowsPath('a/b/pip-8.1.1')

Maybe Path("a/b/pip-8.1.1.tar.gz").without_suffixes(2)?
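As a sketch of what without_suffixes(2) might do, here is a free function that simply repeats with_suffix(""). The name and the choice to stop quietly when no suffix is left are assumptions made for illustration:

```python
from pathlib import PurePosixPath

def without_suffixes(path, count):
    """Hypothetical helper: drop up to `count` trailing suffixes from the name."""
    for _ in range(count):
        if not path.suffix:
            break  # nothing left to strip
        path = path.with_suffix("")
    return path

print(without_suffixes(PurePosixPath("a/b/pip-8.1.1.tar.gz"), 2))  # a/b/pip-8.1.1
```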

barneygale · April 21, 2023, 7:48pm

How about pop_suffix(), similar to splitext()?

>>> path = Path("a/b/pip-8.1.1.tar.gz")
>>> path, suffix = path.pop_suffix()
>>> path
Path("a/b/pip-8.1.1.tar")
>>> suffix
".gz"

It could be called in a loop until some condition is met (e.g. the suffix is no longer in an allowlist).
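Since pop_suffix() does not exist, the loop idea can be sketched with the existing suffix and with_suffix(""). The allowlist contents and the function name below are illustrative assumptions:

```python
from pathlib import PurePosixPath

# Illustrative allowlist; a real one would depend on the application.
KNOWN_SUFFIXES = {".tar", ".gz", ".bz2", ".xz", ".zip"}

def strip_known_suffixes(path, allowed=KNOWN_SUFFIXES):
    """Hypothetical helper: peel off trailing suffixes while they are in `allowed`."""
    removed = []
    while path.suffix.lower() in allowed:
        removed.append(path.suffix)
        path = path.with_suffix("")
    # Suffixes were collected right to left, so restore the original order.
    return path, "".join(reversed(removed))

print(strip_known_suffixes(PurePosixPath("a/b/pip-8.1.1.tar.gz")))
# (PurePosixPath('a/b/pip-8.1.1'), '.tar.gz')
```

The loop stops at '.1' because it is not in the allowlist, which avoids the version-number problem without guessing how many suffixes to remove.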

AJNeufeld · April 25, 2023, 8:02pm

Using path.pop_suffix() would be misleading, in that we aren’t popping anything. The path object is immutable. It would effectively just be an alternative spelling for:

>>> path, suffix = path.with_suffix(""), path.suffix

Maybe it is sufficient to strip off a known suffix?

for path in dir.rglob("*.tar.gz"):
    archive_name = path.without_suffix(".tar.gz")
    ...

… and the method would throw an exception if the path didn’t end with the given suffix (or equivalent of it on a case-insensitive filesystem).
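A rough sketch of that behaviour as a free function follows. It compares case-sensitively, so the case-insensitive filesystem handling mentioned above is left out, and the name and error messages are assumptions:

```python
from pathlib import PurePosixPath

def without_suffix(path, suffix):
    """Hypothetical helper: remove a known (possibly compound) suffix or raise."""
    name = path.name
    if not suffix.startswith("."):
        raise ValueError(f"invalid suffix {suffix!r}")
    # Refuse to strip if the name does not end with the suffix,
    # or if stripping would leave an empty file name.
    if not name.endswith(suffix) or name == suffix:
        raise ValueError(f"{str(path)!r} does not end with {suffix!r}")
    return path.with_name(name[: -len(suffix)])

print(without_suffix(PurePosixPath("a/b/pip-8.1.1.tar.gz"), ".tar.gz"))  # a/b/pip-8.1.1
```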

Or simply have a Path.prefix attribute, which is everything not removed by .suffixes?

AJNeufeld · May 9, 2023, 3:58pm

I’m not seeing much support for .without_suffixes(2) or .without_suffix(".tar.gz"). How about a .splits attribute, which returned all prefix/suffix pairs as a list of (named) tuples? If you want just the prefix before the first period, that is just the prefix of the first element. If you want to split off exactly 2 suffixes, that is the second-last element.

>>> p = Path("a/b/pip-8.1.1.tar.gz")
>>> p.splits
[('pip-8', '.1.1.tar.gz'), ('pip-8.1', '.1.tar.gz'), ('pip-8.1.1', '.tar.gz'), ('pip-8.1.1.tar', '.gz')]
>>> p.splits[0][0]
'pip-8'
>>> p.splits[-2]
('pip-8.1.1', '.tar.gz')

Alternate spellings: .name_splits, or .name_parts or …?

Or would a function be better:

>>> p = Path("a/b/pip-8.1.1.tar.gz")
>>> p.name_split(0)      # Split leaving 0 dots in prefix
('pip-8', '.1.1.tar.gz')
>>> p.name_split(1)      # Split leaving 1 dot in prefix
('pip-8.1', '.1.tar.gz')
>>> p.name_split(-2)     # Split leaving 2 dots in suffix
('pip-8.1.1', '.tar.gz')
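Both spellings can be sketched as free functions on top of Path.suffixes. The names name_splits and name_split are taken from the proposal, but the implementation and the IndexError for names without suffixes are assumptions:

```python
from pathlib import PurePosixPath

def name_splits(path):
    """Hypothetical helper: every (prefix, suffix) split of path.name."""
    suffixes = path.suffixes
    name = path.name
    first = name[: len(name) - len("".join(suffixes))]
    parts = [first, *suffixes]
    # Split after the first part, then after each further suffix except the last.
    return [("".join(parts[:i]), "".join(parts[i:])) for i in range(1, len(parts))]

def name_split(path, index):
    """Hypothetical helper: one split, indexed like a list (negatives count from the end)."""
    return name_splits(path)[index]  # IndexError if there is no such split

p = PurePosixPath("a/b/pip-8.1.1.tar.gz")
print(name_split(p, 0))   # ('pip-8', '.1.1.tar.gz')
print(name_split(p, -2))  # ('pip-8.1.1', '.tar.gz')
```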

pylang · May 9, 2023, 6:34pm

At first glance, I like

- name_split over split (to disambiguate from str.split())
- passing in the “index” to a name_split() method rather than the double bracket indexing.

ntessore · May 9, 2023, 11:24pm

If p.suffixes is ['.vs2022', '.vcxproj'], then p.stems [1] could be ['CurveFile', '.vs2022'], so that all parts are had with p.stems + [p.suffix].

[1] Although I like p.nodes better, since a stem ('CurveFile.vs2022') contains nodes (['CurveFile', '.vs2022']).

AJNeufeld · May 12, 2023, 7:21pm

p.stems returning ['CurveFile', '.vs2022'] isn’t quite right. That would actually be p.stem_parts. I don’t think we can actually call '.vs2022' a “stem”.