Add tuple support to more str functions

These are all the functions of str (and their bytes counterparts) that could get tuple support, like str.startswith() & str.endswith() already have.

I think it would be most useful for str.find(), str.rfind(), str.rsplit() & str.split().
This would improve some code in the stdlib and would also benefit end users.

from collections.abc import Sequence
import sys
from typing import LiteralString, SupportsIndex, overload


class str(Sequence[str]):
    def count(self, sub: str, start: SupportsIndex | None = ..., end: SupportsIndex | None = ..., /) -> int: ...
    def find(self, sub: str, start: SupportsIndex | None = ..., end: SupportsIndex | None = ..., /) -> int: ...
    def index(self, sub: str, start: SupportsIndex | None = ..., end: SupportsIndex | None = ..., /) -> int: ...
    @overload
    def partition(self: LiteralString, sep: LiteralString, /) -> tuple[LiteralString, LiteralString, LiteralString]: ...
    @overload
    def partition(self, sep: str, /) -> tuple[str, str, str]: ...
    if sys.version_info >= (3, 9):
        @overload
        def removeprefix(self: LiteralString, prefix: LiteralString, /) -> LiteralString: ...
        @overload
        def removeprefix(self, prefix: str, /) -> str: ...
        @overload
        def removesuffix(self: LiteralString, suffix: LiteralString, /) -> LiteralString: ...
        @overload
        def removesuffix(self, suffix: str, /) -> str: ...
    def rfind(self, sub: str, start: SupportsIndex | None = ..., end: SupportsIndex | None = ..., /) -> int: ...
    def rindex(self, sub: str, start: SupportsIndex | None = ..., end: SupportsIndex | None = ..., /) -> int: ...
    @overload
    def rpartition(self: LiteralString, sep: LiteralString, /) -> tuple[LiteralString, LiteralString, LiteralString]: ...
    @overload
    def rpartition(self, sep: str, /) -> tuple[str, str, str]: ...
    @overload
    def rsplit(self: LiteralString, sep: LiteralString | None = None, maxsplit: SupportsIndex = -1) -> list[LiteralString]: ...
    @overload
    def rsplit(self, sep: str | None = None, maxsplit: SupportsIndex = -1) -> list[str]: ...
    @overload
    def split(self: LiteralString, sep: LiteralString | None = None, maxsplit: SupportsIndex = -1) -> list[LiteralString]: ...
    @overload
    def split(self, sep: str | None = None, maxsplit: SupportsIndex = -1) -> list[str]: ...
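
For illustration, calls under this proposal could look something like this (hypothetical, since none of these methods accept tuples today):

# Hypothetical usage -- tuple arguments are the proposed extension, not current behaviour.
path = "C:/Users\\guest/report.txt"

i = path.find(("\\", "/"))         # index of the first sep or altsep
j = path.rfind(("\\", "/"))        # index of the last sep or altsep
parts = path.split(("\\", "/"))    # split on both separators at once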

I forgot to mention that this would also be useful for bytes functions.

What ugly code?

This is what I have been able to find so far (but I expect there to be more):

None of them look particularly ugly to me, and I’m not clear how they’d look better with the proposed changes. Also, it’s worth pointing out that all of the proposed changes are available already using the re module. So to make a case for these changes, you need to show that splitting on (or finding) a list of values is a very common need, and that it’s possible to implement it more efficiently than the regex equivalent. (And if you can do that, you might be better off implementing the equivalent optimisation in the re module, as that would benefit more people).
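
For reference, the re-based equivalent would be along these lines (a sketch; the separators are assumed to be plain literals that need escaping):

import re

seps = ("\\", "/")
pattern = re.compile("|".join(map(re.escape, seps)))

pattern.split("spam\\eggs/ham")      # ['spam', 'eggs', 'ham'], like a multi-sep split()
m = pattern.search("spam\\eggs/ham")
i = m.start() if m else -1           # like a multi-sep find()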

1 Like

[edit] Ignore me - I mistakenly thought the proposal was to add string methods to tuples, whereas it is to add tuple args to more string methods.

They at least look very inefficient! Here’s how they would look with the new syntax:

i = p.rfind((sep, altsep)) + 1
path = os.fsdecode(splitroot(path)[2])
return any(_isreservedname(name) for name in reversed(path.split((sep, altsep))))
i = p.find((sep, altsep), 1)
if i < 0:
    i = len(path)
sepIndex = p.rfind(seps)

Much cleaner in my opinion.

At least some of the nearly a million viewers here or 1.5 million viewers here would seem to have this need for splitting. I’m not sure whether multiple-character delimiters are supposed to be in scope for OP.

For some cases, this is possible just combining other existing pieces:

>>> timeit.timeit('p.split(x)', setup='import re; x="fooafoobfoocfoo";p=re.compile("[abc]")')
0.6335026849992573
>>> timeit.timeit('x.translate(t).split()', setup='x="fooafoobfoocfoo";t=str.maketrans("abc", "   ")')
0.49312453702441417

It stands to reason that at the C level, a simple split on multiple characters could be implemented faster than a translation of multiple characters through a lookup table followed by a split… on multiple characters (a hard-coded selection of them) that also handles runs.
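
For what it’s worth, a rough pure-Python sketch of that kind of single pass (merging runs of separators the way str.split() with no argument does; the real benefit would come from doing this at the C level):

def split_on_chars(s, chars):
    # Cut s at every character in `chars`; empty pieces from runs of
    # separators are dropped, mirroring the translate-then-split trick.
    parts, start = [], 0
    for i, ch in enumerate(s):
        if ch in chars:
            if i > start:
                parts.append(s[start:i])
            start = i + 1
    if start < len(s):
        parts.append(s[start:])
    return parts

>>> split_on_chars("fooafoobfoocfoo", "abc")
['foo', 'foo', 'foo', 'foo']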

(Fixed timing test; patterns from repeated calls to re methods in a tight loop are supposed to get precompiled and cached, but it seems not to be very effective here.)

1 Like

FTR, instead of labelling the work of other developers as “ugly”, try to choose your wording more carefully; it’s better to focus on the benefits rather than stepping on toes and being dismissive about other people’s work. “The existing code can be improved” is a nicer way to say it.

6 Likes

But str.split doesn’t split on characters, it splits on strings (which can be multi-character). And splitting on (or finding) multiple strings is non-trivial. (Not hard, and there are well-known efficient algorithms, but not trivial either).
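
The obvious pure-Python version is easy to write but rescans the haystack once per needle (a sketch, not what a C implementation would do; the well-known efficient approaches are multi-pattern algorithms such as Aho–Corasick):

def find_any(s, subs, start=0):
    # One find() per needle, keep the leftmost hit.
    best = -1
    for sub in subs:
        i = s.find(sub, start)
        if i != -1 and (best == -1 or i < best):
            best = i
    return best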

This gives a different result - the original code points i before trailing separators, this points i after the first separator.

I don’t find this one any more readable.

Same problem as above, this doesn’t do the same thing.

This one is better, although you’ve ignored the need to cater for an empty altsep:

seps = [sep]
if altsep:
    seps.append(altsep)
sepIndex = p.rfind(seps)

Not as much of an improvement now.

Regex is kind of overkill for a list of simple strings, and it requires more code (such as escaping characters, getting the index from a re.Match object, etc.).
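
As a sketch of that extra code, compared with the proposed spelling (the tuple call is hypothetical):

import re

p = "spam\\eggs/ham"

# With re: escape each literal, build an alternation, unpack the match.
pattern = re.compile("|".join(map(re.escape, ("\\", "/"))))
m = pattern.search(p)
i = m.start() if m else -1

# With the proposal (hypothetical):
i = p.find(("\\", "/"))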

That wasn’t my intention, I simply wasn’t happy with the current code (updated the thread description).

Sorry, I posted them in the wrong order:

i = p.rfind((sep, altsep)) + 1
i = p.find((sep, altsep), 1)
if i < 0:
    i = len(path)

It’s clearer about what it’s trying to achieve: split on every sep & altsep.

I didn’t, the arguments need to be updated (sep is a str or tuple here):

# genericpath.py
def _splitext(p, sep, extsep): ...
# ntpath.py
def splitext(p):
    p = os.fspath(p)
    if isinstance(p, bytes):
        return genericpath._splitext(p, (b'\\', b'/'), b'.')
    else:
        return genericpath._splitext(p, ('\\', '/'), '.')
# posixpath.py
def splitext(p):
    p = os.fspath(p)
    if isinstance(p, bytes):
        return genericpath._splitext(p, b'/', b'.')
    else:
        return genericpath._splitext(p, '/', '.')

If str.startswith() and str.endswith() didn’t support tuples, I’m not even sure that the proposal to add support for tuples would be accepted now. s.endswith(('.py', '.pyc')) has too little advantage over s.endswith('.py') or s.endswith('.pyc').

For str.find() and str.split() the benefits may be greater, but these cases are rarer and are covered by the re module.

It is also not obvious how to implement them efficiently in C – there are two straightforward ways (one is used in the os.path implementations, and the other in the re module), and which of them is faster depends on the nature of the data. In any case an efficient implementation will add many tens or hundreds of lines of C code to save a few Python lines here and there. Not every few lines of Python code deserves adding a builtin function or method.
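
In Python terms, the two approaches might be sketched as: one find() per needle keeping the leftmost hit (the max()-over-two-rfind()s pattern seen elsewhere in this thread), versus a single left-to-right scan that tests every needle at each position. A rough sketch of the latter:

def find_any_scan(s, subs):
    # Single pass over the haystack, testing each needle at each index.
    # Cheap when a match comes early; O(len(s) * len(subs)) comparisons
    # in the worst case.
    for i in range(len(s)):
        for sub in subs:
            if s.startswith(sub, i):
                return i
    return -1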

4 Likes

I’m not seeing a definitive improvement here, even for all the extra code (but yes it’s faster for long strings):

import re

PATTERN = re.compile(r"\\|/")

def test1(p):
    seps = "\\/"
    i = len(p)
    while i and p[i-1] not in seps:
        i -= 1
    return i

def test2(p):
    match = PATTERN.search(p[::-1])
    i = len(p) - match.start() if match else 0
    return i

::test.bat
@echo off
echo 1 character && python -m timeit -s "import test2" "test2.test1('/a')" && python -m timeit -s "import test2" "test2.test2('/a')"
echo 10 characters && python -m timeit -s "import test2" "test2.test1('/' + 'a' * 10)" && python -m timeit -s "import test2" "test2.test2('/' + 'a' * 10)"
echo 100 characters && python -m timeit -s "import test2" "test2.test1('/' + 'a' * 100)" && python -m timeit -s "import test2" "test2.test2('/' + 'a' * 100)"

1 character
2000000 loops, best of 5: 162 nsec per loop # while loop
500000 loops, best of 5: 457 nsec per loop # re.Pattern.search
# -> 2.82x slower
10 characters
500000 loops, best of 5: 678 nsec per loop # while loop
500000 loops, best of 5: 519 nsec per loop # re.Pattern.search
# -> 1.31x faster
100 characters
50000 loops, best of 5: 5.63 usec per loop # while loop
200000 loops, best of 5: 1.14 usec per loop # re.Pattern.search
# -> 4.94x faster

Well, you see that the relative time of the two algorithms depends on the input.

1 Like

Adding tuple support to str.find() & str.rfind() would improve readability and performance.
Work in progress implementation: GitHub - nineteendo/cpython at extend-str-and-bytes
According to my benchmark, it’s superior to all alternatives:

import re

PATTERN = re.compile(r"\\|/")

def find1(p):
    seps = "\\/"
    i, n = 1, len(p)
    while i < n and p[i] not in seps:
        i += 1
    return i

def find2(p):
    match = PATTERN.search(p)
    i = match.start() if match else len(p)
    return i

def find3(p):
    sep = "\\"
    altsep = "/"
    i = p.find(sep)
    new_i = p.find(altsep)
    if new_i >= 0 and (new_i < i or i < 0):
        i = new_i
    if i < 0:
        i = len(p)
    return i

def find4(p):
    seps = ("\\", "/")
    i = p.find(seps)
    if i < 0:
        i = len(p)
    return i

def rfind1(p):
    seps = "\\/"
    i = len(p)
    while i and p[i-1] not in seps:
        i -= 1
    return i

def rfind2(p):
    match = PATTERN.search(p[::-1])
    i = len(p) - match.start() if match else 0
    return i

def rfind3(p):
    sep = "\\"
    altsep = "/"
    i = max(p.rfind(sep), p.rfind(altsep)) + 1
    return i

def rfind4(p):
    seps = ("\\", "/")
    i = p.rfind(seps) + 1
    return i

find
1 character
2000000 loops, best of 5: 119 nsec per loop # while loop
1000000 loops, best of 5: 357 nsec per loop # regex
1000000 loops, best of 5: 203 nsec per loop # 2 finds
2000000 loops, best of 5: 113 nsec per loop # tuple
10 characters 
500000 loops, best of 5: 615 nsec per loop # while loop
1000000 loops, best of 5: 395 nsec per loop # regex
1000000 loops, best of 5: 214 nsec per loop # 2 finds
2000000 loops, best of 5: 123 nsec per loop # tuple
100 characters 
50000 loops, best of 5: 5.48 usec per loop # while loop
500000 loops, best of 5: 769 nsec per loop # regex
1000000 loops, best of 5: 218 nsec per loop # 2 finds
2000000 loops, best of 5: 126 nsec per loop # tuple
rfind
1 character 
2000000 loops, best of 5: 164 nsec per loop # while loop
500000 loops, best of 5: 470 nsec per loop # regex
2000000 loops, best of 5: 164 nsec per loop # 2 finds
2000000 loops, best of 5: 102 nsec per loop # tuple
10 characters 
500000 loops, best of 5: 692 nsec per loop # while loop
500000 loops, best of 5: 518 nsec per loop # regex
2000000 loops, best of 5: 172 nsec per loop # 2 finds
2000000 loops, best of 5: 111 nsec per loop # tuple
100 characters 
50000 loops, best of 5: 5.74 usec per loop # while loop
200000 loops, best of 5: 1.01 usec per loop # regex
1000000 loops, best of 5: 280 nsec per loop # 2 finds
1000000 loops, best of 5: 219 nsec per loop # tuple

I updated the tests to make them equivalent.

17 thousand people for find (but they need to know which word is matched): regex - What's the most efficient way to find one of several substrings in Python? - Stack Overflow
In the worst case scenario it’s

  • 3.16x (1 char), 3.21x (10 chars), 6.10x (100 chars) faster for str.find()
  • 4.61x (1 char), 4.67x (10 chars), 4.61x (100 chars) faster for str.rfind()

I also tried to implement this using tailmatch, but that was much slower.
And I don’t see a simple way to do it for str.split(), but that could be worth it.
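
For completeness, a pure-Python sketch of what a tuple-aware split could do, using one find() per separator (a C version could share machinery with the proposed multi-find; the exact semantics, such as preferring the longest separator at a given position, would still need to be pinned down):

def split_any(s, seps, maxsplit=-1):
    # Purely illustrative: repeatedly locate the leftmost separator,
    # preferring the longest one when two match at the same position.
    parts, start = [], 0
    while maxsplit != 0:
        hits = [(i, sep) for sep in seps if (i := s.find(sep, start)) != -1]
        if not hits:
            break
        i, sep = min(hits, key=lambda hit: (hit[0], -len(hit[1])))
        parts.append(s[start:i])
        start = i + len(sep)
        maxsplit -= 1
    parts.append(s[start:])
    return parts

>>> split_any("spam\\eggs/ham", ("\\", "/"))
['spam', 'eggs', 'ham']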