Add tuple support to more str functions

dg-pb · October 9, 2024, 1:00am

If this were implemented, it would be fairly straightforward to adapt it to str.split. That is in my plan, obviously with this as a prerequisite. I would most likely look at it together with other string API changes.

I also think this is necessary; otherwise, having to work out which needle matched outside the call kills the performance benefit in a lot of cases (especially for smaller inputs). I have experimented with a Python version for a bit, and it is convenient to return tuple[pos, i_needle] when the input is a tuple of needles.

The algorithm uses str.find under the hood - so it is as risky as using str.find.
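A minimal pure-Python sketch of that return shape, assuming the leftmost match wins and ties go to the needle listed first. The name find_any and the (-1, -1) "not found" value are hypothetical, chosen for illustration; this is not the code from the PR.

```python
def find_any(haystack: str, needles: tuple[str, ...], start: int = 0) -> tuple[int, int]:
    """Return (pos, i_needle) for the leftmost match, or (-1, -1) if nothing matches.

    Hypothetical helper: it simply calls str.find once per needle.
    """
    best_pos, best_i = -1, -1
    for i, needle in enumerate(needles):
        pos = haystack.find(needle, start)
        # Strict "<" keeps the earlier-listed needle when two match at the same position.
        if pos != -1 and (best_pos == -1 or pos < best_pos):
            best_pos, best_i = pos, i
    return best_pos, best_i
```

Getting the needle index back means the caller does not have to re-test each needle at pos to learn which one matched.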

Nineteendo · October 9, 2024, 12:02pm

We could add a return_sub argument returning None or (pos, sub), but I’m not sure if more functionality would be desired (longest or shortest match, all matches, …). On the other hand, a regex always returns the first alternative that matches:

>>> import re
>>> re.match("foo|foobar", "foobar")
<re.Match object; span=(0, 3), match='foo'>

It probably doesn’t make sense for str.startswith() and str.endswith() though.
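On the regex comparison above: the order of the alternatives decides which substring is reported, so a "longest match" policy can be emulated by listing longer substrings first. A small sketch, with hypothetical helper names:

```python
import re

def first_listed_match(text, subs):
    """Return the first alternative (in listed order) that matches at the start of text."""
    pattern = "|".join(re.escape(sub) for sub in subs)
    match = re.match(pattern, text)
    return match.group() if match else None

def longest_match(text, subs):
    """Same as above, but try longer substrings first so the longest one wins."""
    by_length = sorted(subs, key=len, reverse=True)
    return first_listed_match(text, by_length)
```

With subs = ("foo", "foobar") and text = "foobar", the first helper reports "foo" and the second reports "foobar".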

Yeah, it’s choosing between consistency with str.find() or with Sequence.index() (which already isn’t fully compatible). If it’s really a problem we can leave it out.

Note that performance for a single substring hasn’t changed (maybe a few nanoseconds faster per call), in that case the old algorithm is still called. str.find(subs) is conceptually just a wrapper around str.find(sub) with less overhead.

Even though the documentation recommends using re for advanced pattern matching, I’m afraid some people will still try passing more than a couple of substrings to these functions, leading to sub-optimal performance. Maybe we should cap the number of substrings at 20 to prevent misuse?

For a tuple with one substring it just calls str.find(sub), so I can’t really use existing benchmarks and no one has offered to help with writing more.

larry · December 22, 2024, 10:44am

Howdy howdy. I just wanted to throw my two cents in. My comments aren’t directly applicable to this issue / PR; however, the idea of adding “tuple support” to str.split has come up, and that’s what I want to talk about.

I’ve implemented my own string processing routines that support multiple separator strings, in my big library. The crown jewel is multisplit, which is like str.split but supporting multiple separators. It also allows fine-grained control over how it splits and what it returns. It’s implemented using re.split but it’s a lot easier to use than re.split. My goal was to make it the “Swiss army knife” of string splitting routines–the last string splitting routine you’d ever need. It’s powerful enough that it can be used to reimplement a bunch of str methods: partition, rpartition, rsplit, split, and splitlines. (And with a little more work, you could probably reimplement lstrip, rstrip, and strip using multisplit too.)

I was motivated to write multisplit out of years of frustration with str.split. Many’s the time I’ve wanted to split a string on multiple separators, and I always found it frustrating that str.split didn’t support it. Having written multisplit, I now use it all the time–it’s great for simple parsing, when you don’t feel like doing battle with regular expressions. (Which, for me, is nearly always.)

The most important lesson learned from working with multisplit concerns the value returned–or, preferably, yielded. I contend str.split maintains an important invariant. For any string X and any non-empty string Y:

V = X.split(Y)
S = Y.join(V)
assert S == X

This assertion is always true. You can always reconstitute the original string using the separator string and the result of the split. (This is also an invariant of str.partition, which I think is an excellent API design. I now try to design my string-splitting functions to also maintain this invariant.)

If you simply added support for multiple separators to str.split, but didn’t change the return value, you’d break this invariant. Each time there was a split, you wouldn’t know which of the separators was present in the original string, so you wouldn’t know what to put there when you glued it back together.
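To make the breakage concrete, here is a small sketch that uses re.split as a stand-in for a hypothetical multi-separator str.split that returns only the segments:

```python
import re

x = "a,b;c"

# Single separator: joining with that separator reconstitutes the original.
parts = x.split(",")
roundtrip_single = ",".join(parts)

# Multiple separators, segments only: the information about which separator
# sat at each split point is gone, so no single joiner restores the original.
segments = re.split(r"[,;]", x)
roundtrip_comma = ",".join(segments)
roundtrip_semicolon = ";".join(segments)
```

Here roundtrip_single equals x, while roundtrip_comma is "a,b,c" and roundtrip_semicolon is "a;b;c".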

My solution with multisplit was to return the separators too. The trouble was, this was a new function, so at the time I didn’t know what form you’d want it to take. So I added four different styles of returning the separators, which admittedly was kind of an API experiment. (That’s three styles of returning the separators, and a fourth where we throw the separators away.) I figured the only way to figure out what I wanted was to start using the function in real code.

I can tell you now which one is easily the most useful: I called it AS_PAIRS. When you select that, multisplit yields 2-tuples:

(segment, separator)

The segment is the text not containing any separators, and the second value is the separator it used to split the string on the right side of segment. And, like str.split, if you have two adjacent separators, multisplit will happily yield a tuple where segment is an empty string. Also, I guarantee that separator is always an empty string in the last 2-tuple yielded; this produces some funny-looking results at times, but I’ve convinced myself this really is the behavior you want.
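A rough sketch of that output shape, built on re.split with a capturing group. The function name split_as_pairs is hypothetical and this is not multisplit's actual implementation; trying longer separators first is an assumption made for the sketch.

```python
import re
from typing import Iterator

def split_as_pairs(s: str, separators: tuple[str, ...]) -> Iterator[tuple[str, str]]:
    """Yield (segment, separator) 2-tuples; the last separator is always ''."""
    if not separators or "" in separators:
        raise ValueError("separators must be non-empty strings")
    # Assumption for this sketch: longer separators are tried first.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(sep) for sep in ordered) + ")"
    # With a capturing group, re.split returns [seg, sep, seg, sep, ..., seg].
    pieces = re.split(pattern, s)
    pieces.append("")  # pair the final segment with an empty separator
    for i in range(0, len(pieces), 2):
        yield pieces[i], pieces[i + 1]
```

For "a,b;;c" with separators (",", ";") this yields ("a", ","), ("b", ";"), ("", ";"), ("c", ""). A trailing separator, as in "a,", produces the funny-looking final pair ("", "").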

AS_PAIRS is also the ur-return value, in that it can be used to construct any of the other styles of return values multisplit supports. I also support a style called ALTERNATING, where multisplit yields individual alternating segment and separator strings. And the default “return the separators” mode just appends the separators to the segments. (This last one isn’t very useful, unless you’re reimplementing str.splitlines and you want to reimplement its keepends feature too.) If you have AS_PAIRS style output, you can easily convert it to ALTERNATING style or “append the separators” style. If you have ALTERNATING it’s slightly harder to produce the other two; if you have “append mode” it’s way way harder.
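The two easy conversions could look roughly like this. The helper names are hypothetical, and dropping the final empty separator in the alternating form is an assumption of this sketch rather than a statement about what multisplit itself yields.

```python
def to_alternating(pairs):
    """Flatten (segment, separator) pairs into segment, separator, segment, ..."""
    flat = []
    for segment, separator in pairs:
        flat.append(segment)
        flat.append(separator)
    # Assumption: drop the trailing '' separator so the list ends with a segment.
    if flat and flat[-1] == "":
        flat.pop()
    return flat

def to_appended(pairs):
    """Glue each separator onto the end of the segment it follows."""
    return [segment + separator for segment, separator in pairs]

# Sample AS_PAIRS-style input for "a,b;c" split on "," and ";".
pairs = [("a", ","), ("b", ";"), ("c", "")]
```

Going the other way from the appended form requires knowing the separator set again and re-scanning the end of every string, which is where the extra difficulty comes from.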

AS_PAIRS is conveniently unambiguous; for every value returned, you know whether it’s a separator or a non-separator string. And of course, it maintains the invariant: if you concatenate all the strings together you reconstitute the original string.

Knowing what the separator was is also useful in and of itself. In fact it can be the whole point of splitting the string. For example, I used multisplit to implement a simple “balanced delimiter parser” for big called split_delimiters, which makes it easy to parse simple .rc files and shell scripts and such. You tell split_delimiters all the delimiters you want it to recognize, and I use that as the list of separators I split on. Then when I parse a text, I call multisplit with that list of delimiters, and I use AS_PAIRS mode. This lets me leap past all the uninteresting stuff and just examine the delimiters–the uninteresting stuff is in the segment, and the delimiter is conveniently isolated in the separator. This quickly became my preferred approach for writing these little mini-parsers.
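As a toy illustration of the "skip the segments, look only at the separators" approach, here is a balanced-bracket check. It is far simpler than split_delimiters and is not its implementation; the names OPEN_TO_CLOSE and is_balanced are hypothetical, and re.split with a capturing group stands in for multisplit.

```python
import re

# Hypothetical delimiter table for this sketch: opener -> closer.
OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}"}

def is_balanced(text: str) -> bool:
    """Return True if every opening delimiter is closed in the right order."""
    delimiters = list(OPEN_TO_CLOSE) + list(OPEN_TO_CLOSE.values())
    pattern = "(" + "|".join(re.escape(d) for d in delimiters) + ")"
    pieces = re.split(pattern, text)  # [segment, delimiter, segment, ...]
    expected_closers = []
    # Odd indexes hold the delimiters; the segments in between are ignored.
    for delimiter in pieces[1::2]:
        if delimiter in OPEN_TO_CLOSE:
            expected_closers.append(OPEN_TO_CLOSE[delimiter])
        elif not expected_closers or expected_closers.pop() != delimiter:
            return False
    return not expected_closers
```

The loop never inspects the uninteresting text; it only reacts to the isolated delimiters.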

My point with all this: if you consider adding “tuple support” to str.split, I hope you change the return value so it returns something like my AS_PAIRS 2-tuples.

Nineteendo · December 22, 2024, 11:54am

Thanks for your reply, Larry. I have thought about str.split(tuple), but I’m not sure how this can be implemented efficiently. It might be possible to use str.find(tuple) returning (pos, end_pos), but I can’t really predict the performance in advance. I would add a keep_seps argument though, to alternate segments and separators, in line with a capturing group (sep) in re.split.
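A pure-Python sketch of how such a split could sit on top of a multi-substring find. Both function names are hypothetical, and the keep_seps behaviour shown here is only one possible reading of the idea, modelled on re.split with a capturing group.

```python
def find_span(s, subs, start=0):
    """Return (pos, end_pos) of the leftmost substring, or (-1, -1) if none is found."""
    best = (-1, -1)
    for sub in subs:
        pos = s.find(sub, start)
        if pos != -1 and (best[0] == -1 or pos < best[0]):
            best = (pos, pos + len(sub))
    return best

def split_any(s, seps, keep_seps=False):
    """Split s on any of seps; with keep_seps, interleave the separators found."""
    if not seps or "" in seps:
        raise ValueError("separators must be non-empty strings")
    result = []
    start = 0
    while True:
        pos, end = find_span(s, seps, start)
        if pos == -1:
            result.append(s[start:])  # the remainder is the last segment
            return result
        result.append(s[start:pos])
        if keep_seps:
            result.append(s[pos:end])
        start = end
```

Note that this naive version rescans the rest of the string once per separator after every split, which is exactly the kind of cost that makes the performance hard to predict.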

I’m also not very motivated to spend more time in this area, because so far I haven’t gotten much support (maybe the discussion is just too long) and the implementation is far from trivial (benchmarks, tests, documentation, etc.). I would rather work on something fun that has a bigger chance of being accepted.