This is part of String Search Overview, Coverage and API.
I propose the following changes to string methods. These would:
a) Provide functional completeness to a large degree
b) Bring the string API a bit closer to re
c) Eliminate the r.* naming inconsistency.
1. str.count
Add a maxcount argument, so that the new API is:
str.count(self, sub[, start[, end]], /, maxcount=-1)
This functionality already exists in C, so all that is left is exposing it. There are many use cases of the form str.count(...) <cmp> int.
Regex for searching the CPython repo or GitHub: \.count\(.*\).*[><=]
This is a very low-cost change with non-negligible benefits.
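To illustrate the intended behaviour, here is a minimal pure-Python sketch. The helper name count_upto and its argument order are assumptions made for illustration only; the proposal itself is to expose the existing C-level limit through str.count.

```python
def count_upto(s, sub, maxcount=-1, start=0, end=None):
    """Count non-overlapping occurrences of sub in s[start:end], stopping at maxcount.

    Hypothetical helper, not an existing str method. Assumes start and end
    are non-negative.
    """
    if end is None:
        end = len(s)
    if maxcount < 0:
        return s.count(sub, start, end)  # no limit: same as today's str.count
    step = max(len(sub), 1)  # advance past the match (by 1 for an empty sub)
    n = 0
    pos = start
    while n < maxcount:
        pos = s.find(sub, pos, end)
        if pos == -1:
            break
        n += 1
        pos += step
    return n


text = "a,b,c,d,e"
# "More than one comma?" only needs to look at the first two matches.
has_many = count_upto(text, ",", maxcount=2) > 1
```

With a limit, a comparison such as count > 1 can stop scanning after the second match instead of walking the whole string.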
2. str.split - POSTPONED
Add a keepsep argument, so that the new API is:
str.split(self, /, sep=None, maxsplit=-1, keepsep=False)
This matches the behaviour of re.split when the separator is wrapped in a capturing group.
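For comparison, this is roughly how the same result can be obtained today through re.split. The helper name split_keepsep is hypothetical, and the sketch only covers an explicit, non-empty literal separator (not the sep=None whitespace mode).

```python
import re


def split_keepsep(s, sep, maxsplit=-1):
    """Split s on the literal sep and keep each separator in the result.

    Hypothetical helper standing in for str.split(..., keepsep=True).
    Only an explicit, non-empty sep is supported in this sketch.
    """
    if not sep:
        raise ValueError("empty separator")  # mirrors str.split for sep=""
    if maxsplit == 0:
        # str.split treats 0 as "no splits"; re.split treats 0 as "no limit".
        return [s]
    # A capturing group makes re.split return the separators as well.
    pattern = "(" + re.escape(sep) + ")"
    return re.split(pattern, s, maxsplit=max(maxsplit, 0))


parts = split_keepsep("a,b,c", ",")              # ['a', ',', 'b', ',', 'c']
first = split_keepsep("a,b,c", ",", maxsplit=1)  # ['a', ',', 'b,c']
```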
3. str.replace - POSTPONED
Return the number of replacements. Either:
a) Add str.replacen (a.k.a. re.subn)
b) str.replace(..., withcount=False)
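As a rough sketch of option a), the pair returned by re.subn can be emulated for literal substrings as follows. The function replacen here is a hypothetical stand-in, and it needs two passes over the string, which a built-in version would presumably avoid.

```python
import re


def replacen(s, old, new, count=-1):
    """Return (new_string, number_of_replacements), like re.subn for literals.

    Hypothetical helper. It scans the string twice (count, then replace);
    a built-in version could report the number from a single pass.
    """
    n = s.count(old)
    if count >= 0:
        n = min(n, count)
    return s.replace(old, new, count), n


result = replacen("a,b,c", ",", ";")
via_re = re.subn(re.escape(","), ";", "a,b,c")  # same pair via the re module
```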
4. str.finditer (re.finditer equivalent)
The API is equivalent to str.find, but it returns an iterator over the starting positions of non-overlapping matches.
This is an important component for further work on the re module.
Apart from that main reason, str.finditer is also useful when parsing text lines without needing to split them all. Currently, calling str.find many times is simply too slow to be worth it, although it should be the optimal approach.
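The semantics can be sketched in pure Python with repeated str.find calls, which is exactly the slow pattern described above; a native implementation would avoid the per-call overhead. The free function finditer below is an illustrative assumption, not an existing API, and it assumes non-negative start and end.

```python
def finditer(s, sub, start=0, end=None):
    """Yield start positions of non-overlapping matches of sub in s[start:end].

    Hypothetical pure-Python stand-in for the proposed str.finditer.
    Assumes start and end are non-negative.
    """
    if end is None:
        end = len(s)
    step = max(len(sub), 1)  # skip past each match so matches never overlap
    pos = s.find(sub, start, end)
    while pos != -1:
        yield pos
        pos = s.find(sub, pos + step, end)


line = "key=value;k2=v2;k3=v3"
positions = list(finditer(line, ";"))
```

Because the result is an iterator, a caller that only needs the first few fields of a line can stop early without splitting the rest.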
5. Reversion
Currently there are methods prefixed with r: str.rfind, str.rstrip, …
However, r means different things. There are 2 cases:
a) r = right (vs left)
b) r = reversed (vs forward)
Although they might seem the same, they are not. Case b) is an either/or relationship: the search is either forward or reversed. In case a) more than two options exist: right, left, or both.
So my proposal is to keep r for lstrip / rstrip / strip and to add a rev=False flag to all search-related methods. These are all of the methods mentioned above plus str.index.
And to slowly deprecate str.rfind, str.rsplit, …
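A minimal sketch of what the flag would mean, expressed through the existing methods. The wrapper names find_dir and split_dir are hypothetical; the proposal is for rev to be a keyword on the str methods themselves.

```python
def find_dir(s, sub, start=0, end=None, *, rev=False):
    """Hypothetical wrapper: str.find with a rev flag instead of a separate rfind."""
    if end is None:
        end = len(s)
    return s.rfind(sub, start, end) if rev else s.find(sub, start, end)


def split_dir(s, sep=None, maxsplit=-1, *, rev=False):
    """Hypothetical wrapper: str.split with a rev flag instead of a separate rsplit."""
    return s.rsplit(sep, maxsplit) if rev else s.split(sep, maxsplit)


first_dot = find_dir("a.b.c", ".")            # forward search
last_dot = find_dir("a.b.c", ".", rev=True)   # reversed search
head_tail = split_dir("a.b.c", ".", 1, rev=True)
```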
Benefits of this are:
a) An API more in line with general conventions. E.g. re.search would have a flag or an argument to indicate reversed search, as opposed to introducing re.rsearch. See e.g. the regex package on PyPI.
b) A more compact and consistent API. Given that reversed search is used much less frequently, a flag that is disabled by default is arguably a more compact way to expose it than a separate method.
c) Elimination of possible confusion about the meaning of r, where 2 different concepts share the same API.
Finally, this is a natural simplifying change that follows from the direction-agnostic implementation of string search. See gh-119702: New dynamic algorithm selection for string search (+ `rfind` alignment) by dg-pb · Pull Request #120025 · python/cpython · GitHub
To put this into context:
- Part 1 of this series is still waiting for someone brave enough to pick up a review: gh-119702: New dynamic algorithm selection for string search (+ `rfind` alignment) by dg-pb · Pull Request #120025 · python/cpython · GitHub.
- This one is Part 2.
- Part 3 is improvements in the re module.
These 3 would bring a certain level of completeness/consistency in both functionality and computational efficiency for string search in Python.