String API changes

dg-pb · September 5, 2024, 9:47am

This is part of String Search Overview, Coverage and API.

I propose the following changes to string methods. These would:
a) Provide functional completeness to a large degree
b) Bring the string API a bit closer to re
c) Eliminate the r.* naming inconsistency.

1. str.count

Add a maxcount argument, so that the new API is:

str.count(self, sub[, start[, end]], /, maxcount=-1)

This functionality is already in C, so all that is left is exposing it. There are many use cases of the form str.count(...) <cmp> int.

Regex for searching the CPython repo or GitHub: \.count\(.*\).*[><=]

Very low cost change with non-negligible benefits.
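To illustrate the intended semantics, here is a minimal pure-Python sketch. The helper name count_upto is hypothetical and only emulates what a maxcount argument might do; the actual proposal is to expose the existing C-level limit, not to loop over str.find like this.

```python
def count_upto(s, sub, maxcount=-1):
    """Count non-overlapping occurrences of sub in s, stopping at maxcount.

    Hypothetical helper: emulates the proposed str.count(..., maxcount=...).
    A negative maxcount means "no limit", mirroring maxsplit=-1.
    """
    if maxcount < 0:
        return s.count(sub)
    step = max(len(sub), 1)  # always advance, even for an empty sub
    found = 0
    pos = 0
    while found < maxcount:
        pos = s.find(sub, pos)
        if pos == -1:
            break
        found += 1
        pos += step
    return found


# Typical "count <cmp> int" use: the scan can stop after the second hit.
has_two_commas = count_upto("a,b,c,d,e", ",", maxcount=2) >= 2
```

With a limit, a comparison such as count >= 2 no longer needs to scan the rest of the string once the answer is known.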

2. str.split - POSTPONED

Add a keepsep argument, so that the new API is:

str.split(self, /, sep=None, maxsplit=-1, keepsep=False)

a.k.a. re.split when separator is grouped.

3. str.replace - POSTPONED

Return number of replacements. Either:
a) Add str.replacen (a.k.a. re.subn)
b) str.replace(..., withcount=False)
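As a rough sketch of option a), assuming the result should mirror the (string, count) tuple of re.subn: the function below is a hypothetical module-level stand-in, not an existing str method, and it makes two passes over the string where a real implementation would need only one.

```python
import re


def replacen(s, old, new, count=-1):
    """Return (new_string, number_of_replacements), similar to re.subn.

    Hypothetical stand-in for the proposed str.replacen.
    """
    made = s.count(old)
    if count >= 0:
        made = min(made, count)
    return s.replace(old, new, count), made


result = replacen("a-b-c", "-", "+")
# For a plain (escaped) substring, re.subn gives the same tuple.
reference = re.subn(re.escape("-"), "+", "a-b-c")
```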

4. str.finditer (`re.finditer` equivalent)

API equivalent to str.find, but returns an iterator of starting positions of non-overlapping matches.

An important component for further work on the re module.

Apart from the main reason, str.finditer is also useful when parsing text lines where there is no need to split them all. Currently, calling str.find many times is simply too slow to be worth it, although it should be the optimal thing to do.
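A minimal sketch of the semantics, written as a pure-Python generator. The free function finditer here is hypothetical and only stands in for the proposed method; it shows what would be yielded, not the performance a C implementation is meant to provide.

```python
def finditer(s, sub):
    """Yield start indexes of non-overlapping occurrences of sub in s.

    Hypothetical stand-in for the proposed str.finditer.
    """
    step = max(len(sub), 1)  # always advance, even for an empty sub
    pos = s.find(sub)
    while pos != -1:
        yield pos
        pos = s.find(sub, pos + step)


line = "key=value;other=thing;last=one"
positions = list(finditer(line, ";"))
```

Because the result is lazy, a caller that only needs the first few separator positions never pays for scanning or splitting the rest of the line.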

5. Reversion

Currently there are methods prefixed with r: str.rfind, str.rstrip, …

However, r means different things. There are 2 cases:
a) r = right (vs left)
b) r = reversed (vs forward)

Although they might seem the same, they are not. In case b) it is an either-or relationship: either forward or reversed. In case a) there are three options: right, left, or both.
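The difference can be seen with the methods that exist today. The snippet below uses only current str methods; the variable names are just for illustration.

```python
padded = "  a.b.c  "

# Case a) r = right: three options (left, right, both).
left_only = padded.lstrip()
right_only = padded.rstrip()
both_sides = padded.strip()

dotted = "a.b.c"

# Case b) r = reversed: exactly two options (forward or reversed).
first_dot = dotted.find(".")
last_dot = dotted.rfind(".")
from_front = dotted.split(".", 1)
from_back = dotted.rsplit(".", 1)
```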

So my proposal is to keep the r prefix only for lstrip / rstrip / strip, and to add a rev=False flag to all search-related methods. These are all of the methods mentioned above, plus str.index.

And to slowly deprecate str.rfind, str.rsplit, …

Benefits of this are:
a) An API more in line with general conventions. E.g. re.search would take a flag / an argument to indicate reversed search, as opposed to introducing re.rsearch. See e.g. regex · PyPI
b) A more compact and consistent API. Given that reversed search is used much less frequently, a flag for it (disabled by default) is arguably more compact than a separate method.
c) Elimination of possible confusion in the meaning of r, where 2 different concepts share the same API.

Finally, this is a natural simplifying change following the direction-agnostic implementation of string search. See gh-119702: New dynamic algorithm selection for string search (+ `rfind` alignment) by dg-pb · Pull Request #120025 · python/cpython · GitHub

To put into context

Part 1 of this series is still waiting for someone brave enough to pick up a review: gh-119702: New dynamic algorithm selection for string search (+ `rfind` alignment) by dg-pb · Pull Request #120025 · python/cpython · GitHub.
This one is Part 2.
Part 3 is improvements in re module.

These 3 would bring a certain level of completeness/consistency in both functionality and computational efficiency for string search in Python.

NeilGirdhar · September 5, 2024, 11:45am

Have you done any research on real code that would use these features? I guess the first one is a clear win if performance matters?

Edit: FWIW, I really like this proposal’s flag-based interface, which is simpler than the multiple-method interface we have now.

bwoodsend · September 5, 2024, 12:03pm

What exactly would the keepsep for str.split() do? Given:

"foo|bar||pop|".split("|", keepsep=True)

would I get ["foo", "|", "bar", "|", "|", "pop", "|"] or ["foo|", "bar||", "pop|"] or something else?

Nineteendo · September 5, 2024, 12:06pm

I’m not sure about deprecating rfind(), rsplit() and rindex(). Having a dedicated function is very convenient.

Presumably keepsep would work like this (not particularly useful when the separator is a single string):

>>> "foo|bar||pop|".split("|", keepsep=True)
['foo', '|', 'bar', '|', '', '|', 'pop', '|', '']

dg-pb · September 5, 2024, 12:12pm

I have been looking at these for a while now and I did such research for some, maybe not for all. I will do it again as I move forward.

From my POV, the ones that could potentially be ruled out by insufficient usage are: str.split(keepsep) and str.replace(withcount).

str.count(maxcount) seems to have a fairly extensive usage.

However, the rest are only partially motivated by use cases. Namely, finditer and rev.

Having finditer would allow implementing efficient fallbacks and other optimizations to re.

And rev is a general improvement. E.g. it is easier to pass on a flag when writing a bi-directional search abstraction than to select a method. I.e.:

def search_plus(..., rev=False):
    ...
    return <?>.find(..., rev=rev)

# vs
def search_plus(..., rev=False):
    ...
    func = str.rfind if rev else str.find
    return func(<?>, ...)

… among benefits indicated above.

dg-pb · September 5, 2024, 12:21pm

That is orthogonal to the rev flag really.

Personally, I would not keep them in my own code base, just for the sake of having fewer methods to maintain. And to remove conflicting concepts.

But if people are used to these conveniences I see no issue in keeping them. Although I would still suggest a very, very slow deprecation. I don’t think they are used that much in practice.

Yes, it would be the same as:

import re

string = "foo|bar||pop|"
sep = '|'
re_sep = '(' + re.escape(sep) + ')'
re.compile(re_sep).split(string)  # ['foo', '|', 'bar', '|', '', '|', 'pop', '|', '']

barry-scott · September 5, 2024, 1:23pm

Removing them will break existing code.

dg-pb · September 5, 2024, 1:47pm

Yeah, so the other 2 aren’t used that much. I will keep them noted, but for now I think it’s best to leave these 2 aside:

str.split(keepsep) - non-regex usages of re.split("(

/re\.split\([\"\']\([^\\\?\+\*\.\^\|]*\)/ Language:Python   1.9 k

str.replacen - non-regex usages of re.subn

/re\.subn\([^\\\?\+\*\.\^\|]*\)/ Language:Python            3.8 k

Also, there are false positives.

So although I still think these might be worthwhile given the relatively small effort required and simply for general completeness, I think they are best left for times when there is nothing more important to do.

So 3 points left:

str.count - small & easy change
str.finditer - important component for further work on re module.
rev flag - architectural change following direction agnostic implementation of string search.

Also, str.finditer is useful when parsing text lines where there is no need to split them all. Currently, calling str.find many times is simply too slow to be worth it, although it should be the optimal thing to do.

zuo · September 5, 2024, 5:04pm

Personally, I like these two (i.e., they’d have +1 from me):

str.count(..., maxcount=-1)
str.finditer(...)

(except that perhaps finditer is not the best name – I’d rather opt for ifind – because the former could be confusing, considering what is generated by re.finditer(): strings or tuples thereof, rather than indexes).

For the remaining parts of the proposal I’m at most +0.

Additionally, what I would find useful would be iterator-returning variants of str.split()/str.rsplit() (let’s assume they’d be named isplit/risplit – as they seem more readable than splititer/rsplititer – though, obviously, I’d be OK with other ideas for their names…):

>>> parts = "    Ala ma \t Kota\n".isplit()
>>> next(parts)
'Ala'
>>> next(parts)
'ma'
>>> next(parts)
'Kota'
>>> next(parts, None) is None
True

>>> parts = "    Ala ma \t Kota\n".isplit(maxsplit=1)
>>> next(parts)
'Ala'
>>> next(parts)
'ma \t Kota\n'
>>> next(parts, None) is None
True

>>> parts = "    Ala ma \t Kota\n".isplit("a")
>>> next(parts)
'    Al'
>>> next(parts)
' m'
>>> next(parts)
' \t Kot'
>>> next(parts)
'\n'
>>> next(parts, None) is None
True

etc.
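For the explicit-separator case, such a lazy variant can be sketched today as a generator on top of str.find. The free function isplit below is hypothetical, does not cover the sep=None whitespace mode, and assumes the results should match str.split item for item.

```python
def isplit(s, sep, maxsplit=-1):
    """Lazily yield the same pieces as s.split(sep, maxsplit).

    Hypothetical stand-in; only handles an explicit, non-empty separator.
    """
    if not sep:
        raise ValueError("empty separator")
    start = 0
    splits = 0
    while maxsplit < 0 or splits < maxsplit:
        idx = s.find(sep, start)
        if idx == -1:
            break
        yield s[start:idx]
        start = idx + len(sep)
        splits += 1
    # Whatever remains is the final piece (possibly empty).
    yield s[start:]


parts = isplit("    Ala ma \t Kota\n", "a")
first_piece = next(parts)
```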

zuo · September 5, 2024, 5:24pm

PS When it comes to naming, neither finditer nor ifind/rifind, nor isplit/risplit satisfy me.

Perhaps a more promising approach would be to add .iter() callables on top of those method objects?

E.g.: .find.iter(), .rfind.iter(), .split.iter(), .rsplit.iter() (and maybe others, probably .index.iter(), .rindex.iter() and .splitlines.iter()) – doing the same as .find(), .rfind(), .split(), .rsplit() (respectively), but in a lazy manner (returning an iterator).

dg-pb · September 5, 2024, 5:56pm

re.search returns a re.Match object, thus re.finditer returns an iterator of re.Match objects.

In the same way str.find returns int and str.finditer would return an iterator of int.
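The parallel can be checked with the existing re and str functions. The sample text and variable names below are only for illustration.

```python
import re

text = "ab12cd34"

# re.finditer yields re.Match objects; positions come from .start().
match_starts = [m.start() for m in re.finditer(r"\d+", text)]

# re.findall returns the matched strings (or tuples of groups) instead.
found = re.findall(r"\d+", text)

# str.find returns a plain int index, so an iterating variant would yield ints.
first = text.find("12")
```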

I am not attached to naming, although I feel finditer would be easiest to remember, given it is used in re.

ifind is not a good option. The prefix i is often used in different cases, e.g. glob vs iglob. It means “returning an iterator instead of a sequence”, and this pattern does not apply here.

Regarding .find.iter(), etc.: I would seriously consider this if I was designing things from nothing. However, given current practices this would be a very hard sell. Such patterns can more often be found in 3rd party libraries, but not in CPython itself. Correct me if I am wrong though, I would be interested to see if there is something similar in CPython.

zuo · September 5, 2024, 7:11pm

You are right, I confused re.finditer() with re.findall().

zuo · September 5, 2024, 7:13pm

When it comes to having a callable being an attribute of another callable, there is a precedent in the stdlib: itertools.chain() with itertools.chain.from_iterable().

dg-pb · September 5, 2024, 7:27pm

Yeah, there are a few more similar instances in that space as far as I remember. However, these concern the input and not the output.

And from what I have seen these are on the side of special cases as opposed to standard practice. After all, itertools components masquerade as simple functions without methods.

chepner · September 5, 2024, 8:21pm

Just to note, chain is a type and from_iterable is a class method. Which is not to say that find couldn’t be turned into an iterable type, but having the current function have another function as an attribute is decidedly unorthodox.
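This is easy to confirm with the standard library alone; the variable names are only for illustration.

```python
import inspect
import itertools

# itertools.chain is a class, so from_iterable is an alternate constructor on it.
chain_is_class = inspect.isclass(itertools.chain)
flattened = list(itertools.chain.from_iterable([[1, 2], [3]]))

# str.find, by contrast, is a method on str rather than a type of its own.
find_is_class = inspect.isclass(str.find)
```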

avi.gross · September 5, 2024, 9:51pm

Barry,

Yes, removing a function people have been using does indeed break existing code. But in this case, I wonder if there is a way to avoid that and yet deprecate future use.

For each function, you can create a function with the same name that merely passes along all arguments followed by your new argument of rev=True as below:

def rsplit(*args):
    return split(*args, rev=True)

Obviously, this is less efficient with an extra layer of function calls and perhaps the possibility of other inconsistencies. If the old version is really wanted, you could rename it to rsplitOLD() and anyone wanting to use it would make a change in one place in their code to import that under the original name or some other similar device.

I am NOT saying this feature is necessary, albeit it seems reasonable if done in a way that minimizes disruption.
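A slightly fuller sketch of that idea, using module-level functions because str itself cannot be patched from Python. Both split and its rev flag are hypothetical here and simply dispatch to today's methods; the warning text and the choice of DeprecationWarning are assumptions about how such a transition might be signalled.

```python
import warnings


def split(s, sep=None, maxsplit=-1, rev=False):
    """Hypothetical flag-based split; dispatches to the existing methods."""
    return s.rsplit(sep, maxsplit) if rev else s.split(sep, maxsplit)


def rsplit(s, sep=None, maxsplit=-1):
    """Legacy name kept as a thin forwarding wrapper that warns on use."""
    warnings.warn(
        "rsplit() is deprecated; use split(..., rev=True)",
        DeprecationWarning,
        stacklevel=2,
    )
    return split(s, sep, maxsplit, rev=True)


forward = split("a.b.c", ".", 1)
backward = split("a.b.c", ".", 1, rev=True)
```

Existing callers of the old name keep working and only see a warning, which is what makes a slow deprecation possible.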

A^{v_i}

avi.gross · September 5, 2024, 10:10pm

Neil,

As I suspect the time when there is nothing more important to do will arrive about an eon after everyone has switched to another language that only an AI can program in, I want to raise a question.

It boils down to whether sometimes rather than make changes for some kind of completeness, it may make sense to just create a new module and leave the old one alone. The new one need not be official and may contain some duplication.

As an example, what if any module out there could be supplemented as needed, not by changing “str” but by creating an “strSupp” or “str2” or some consistent similar name whose meaning is that it contains functionality not in the original?

You could use this namespace to do anything from adding lesser-used functions, to doing an exercise where you supply every function you think will make this set of tools complete. As an example from another language, someone created every imaginable function to display subsets of a date/time object in all kinds of order. I mean:

ymd_hms()
ymd_hm()
ymd_h()
dmy_hms()
...
mdy_hms()
...
ydm_hms()
...

Clearly, adding an exhaustive set of such functions, many of which will rarely if ever be used, is not ideal in a main module but perhaps reasonable in a secondary one.

So perhaps some changes, rather than being deferred indefinitely, could be placed in an alternate module that can be imported and used, perhaps eventually be shown to be worthwhile, and maybe then be copied or moved into the main unit.

Of course, those planning on how to use available people/resources may not want to support this unless others volunteer to do it.

Or, is there already something like this out there?

dg-pb · September 5, 2024, 10:30pm

Not a big issue: 1 extra layer of function call in C is noise compared to a function call in Python.

Initially, the separate reverse methods will need to stay anyway (as per your example).