Howdy howdy. I just wanted to throw my two cents in. My comments aren't directly applicable to this issue / PR; however, the idea of adding "tuple support" to str.split has come up, and that's what I want to talk about.
I've implemented my own string processing routines that support multiple separator strings, in my big library. The crown jewel is multisplit, which is like str.split but supporting multiple separators. It also allows fine-grained control over how it splits and what it returns. It's implemented using re.split, but it's a lot easier to use than re.split. My goal was to make it the "Swiss army knife" of string splitting routines: the last string splitting routine you'd ever need. It's powerful enough that it can be used to reimplement a bunch of str methods: partition, rpartition, rsplit, split, and splitlines. (And with a little more work, you could probably reimplement lstrip, rstrip, and strip using multisplit too.)
I was motivated to write multisplit out of years of frustration with str.split. Many's the time I've wanted to split a string on multiple separators, and I always found it frustrating that str.split didn't support it. Having written multisplit, I now use it all the time; it's great for simple parsing, when you don't feel like doing battle with regular expressions. (Which, for me, is nearly always.)
The most important lesson learned from working with multisplit is about the value returned (or, preferably, yielded). I contend str.split maintains an important invariant. For any string X and any non-empty separator string Y:
V = X.split(Y)
S = Y.join(V)
assert S == X
This assertion is always true. You can always reconstitute the original string using the separator string and the result of the split. (This is also an invariant of str.partition, which I think is an excellent API design. I now try to design my string-splitting functions to also maintain this invariant.)
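As a quick illustration of that invariant, here is a minimal sketch. The helper name roundtrips is made up for this example. Note that the round trip only applies when you pass an explicit, non-empty separator; the whitespace mode you get from calling split() with no separator collapses runs of whitespace, so it can't be reversed this way.

```python
def roundtrips(text, sep):
    """Return True if splitting on sep and re-joining restores text.

    Hypothetical helper for illustration; sep must be a non-empty string.
    """
    return sep.join(text.split(sep)) == text


# Adjacent, leading, and trailing separators all survive the round trip,
# because str.split emits empty strings for them.
assert roundtrips("a,b,,c", ",")
assert roundtrips(",leading and trailing,", ",")
assert roundtrips("no separator here", ",")
assert roundtrips("", ",")

# str.partition returns the separator as part of its result,
# so plain concatenation restores the original string.
before, sep, after = "key=value=more".partition("=")
assert before + sep + after == "key=value=more"
```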
If you simply added support for multiple separators to str.split, but didn't change the return value, you'd break this invariant. Each time there was a split, you wouldn't know which of the separators was present in the original string, so you wouldn't know what to put there when you glued it back together.
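To make the problem concrete, here is a rough sketch of what "tuple support" with an unchanged return value might look like. The function naive_multisplit is hypothetical; it is not str.split and it is not my multisplit. It just leans on re.split and throws the separators away.

```python
import re


def naive_multisplit(text, separators):
    """Split text on any of the separators, discarding the separators.

    Hypothetical illustration only; separators must be non-empty strings.
    """
    pattern = "|".join(re.escape(s) for s in separators)
    return re.split(pattern, text)


# Two different inputs produce the same list of segments...
assert naive_multisplit("a,b;c", [",", ";"]) == ["a", "b", "c"]
assert naive_multisplit("a;b,c", [",", ";"]) == ["a", "b", "c"]
# ...so there is no way to tell from the result which original you started with.
```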
My solution with multisplit was to return the separators too. The trouble was, this was a new function, so at the time I didn't know what form you'd want it to take. So I added four different styles of returning the separators, which admittedly was kind of an API experiment. (That's three styles of returning the separators, and a fourth where we throw the separators away.) I figured the only way to figure out what I wanted was to start using the function in real code.
I can tell you now which one is easily the most useful: I called it AS_PAIRS. When you select that, multisplit yields 2-tuples:
(segment, separator)
The segment is a run of text not containing any separators, and separator is the separator string that multisplit found immediately to the right of segment. And, like str.split, if you have two adjacent separators, multisplit will happily yield a tuple where segment is an empty string. Also, I guarantee that separator is always an empty string in the last 2-tuple yielded; this produces some funny-looking results at times, but I've convinced myself this really is the behavior you want.
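For readers who want to see the shape of that output, here is a minimal sketch of a pair-yielding splitter built on re.split with a capturing group. The name split_as_pairs and its signature are assumptions made for this example; this is not the actual multisplit code or its API, just the general idea.

```python
import re


def split_as_pairs(text, separators):
    """Yield (segment, separator) 2-tuples for text.

    Hypothetical sketch, not the real multisplit. separators must be a
    non-empty collection of non-empty strings.
    """
    if not separators or any(not s for s in separators):
        raise ValueError("separators must be non-empty strings")
    # Try longer separators first, so "\r\n" is preferred over "\r".
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in ordered) + ")"
    # With one capturing group, re.split returns
    # [segment, separator, segment, separator, ..., segment].
    parts = re.split(pattern, text)
    # The final segment has nothing to its right: pair it with "".
    parts.append("")
    for i in range(0, len(parts), 2):
        yield parts[i], parts[i + 1]


pairs = list(split_as_pairs("a,b;;c", [",", ";"]))
assert pairs == [("a", ","), ("b", ";"), ("", ";"), ("c", "")]

# Concatenating everything in order reconstitutes the original string.
assert "".join(seg + sep for seg, sep in pairs) == "a,b;;c"

# A trailing separator gives one of those funny-looking results:
assert list(split_as_pairs("a,", [","])) == [("a", ","), ("", "")]
```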
AS_PAIRS is also the ur-return value, in that it can be used to construct any of the other styles of return values multisplit supports. I also support a style called ALTERNATING, where multisplit yields individual alternating segment and separator strings. And the default "return the separators" mode just appends the separators to the segments. (This last one isn't very useful, unless you're reimplementing str.splitlines and you want to reimplement its keepends feature too.) If you have AS_PAIRS style output, you can easily convert it to ALTERNATING style or "append the separators" style. If you have ALTERNATING it's slightly harder to produce the other two; if you have "append mode" it's way way harder.
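The conversions are short enough to sketch. The function names below are invented for this example, and the input is a hand-written list of pairs rather than real multisplit output. One assumption to flag: this sketch drops the final empty separator when producing alternating output, so the result starts and ends with a segment; treat that detail as a guess about the format, not a statement of what ALTERNATING does.

```python
def pairs_to_alternating(pairs):
    """Yield segment, separator, segment, ... ending on a segment."""
    pairs = list(pairs)
    for index, (segment, separator) in enumerate(pairs):
        yield segment
        # Skip the last separator, which is always the empty string.
        if index < len(pairs) - 1:
            yield separator


def pairs_to_appended(pairs):
    """Yield each segment with its separator glued onto the end."""
    for segment, separator in pairs:
        yield segment + separator


def alternating_to_pairs(strings):
    """Go the other way: slightly more fiddly, since you must re-pair."""
    strings = list(strings) + [""]  # restore the empty final separator
    return list(zip(strings[0::2], strings[1::2]))


pairs = [("a", ","), ("b", ";"), ("c", "")]
alternating = list(pairs_to_alternating(pairs))
assert alternating == ["a", ",", "b", ";", "c"]
assert list(pairs_to_appended(pairs)) == ["a,", "b;", "c"]
assert alternating_to_pairs(alternating) == pairs
```

Going from the appended style back to pairs has no such one-liner: you would have to re-examine the end of every string against every separator.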
AS_PAIRS is conveniently unambiguous; for every value returned, you know whether it's a separator or a non-separator string. And of course, it maintains the invariant: if you concatenate all the strings together you reconstitute the original string.
Knowing what the separator was is also useful in and of itself. In fact it can be the whole point of splitting the string. For example, I used multisplit to implement a simple "balanced delimiter parser" for big called split_delimiters, which makes it easy to parse simple .rc files and shell scripts and such. You tell split_delimiters all the delimiters you want it to recognize, and I use that as the list of separators I split on. Then when I parse a text, I call multisplit with that list of delimiters, and I use AS_PAIRS mode. This lets me leap past all the uninteresting stuff and just examine the delimiters: the uninteresting stuff is in the segment, and the delimiter is conveniently isolated in the separator. This quickly became my preferred approach for writing these little mini-parsers.
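To give a feel for the approach, here is a toy balance checker written in that style. It is not split_delimiters; the names iter_pairs and is_balanced, and the brackets mapping, are all made up for this sketch, and iter_pairs is only a small re.split-based stand-in for AS_PAIRS-style output.

```python
import re


def iter_pairs(text, delimiters):
    """Hypothetical stand-in for AS_PAIRS output, built on re.split."""
    pattern = "(" + "|".join(re.escape(d) for d in delimiters) + ")"
    parts = re.split(pattern, text) + [""]
    return zip(parts[0::2], parts[1::2])


def is_balanced(text, brackets):
    """Return True if every opening delimiter is closed in the right order.

    brackets maps each opening delimiter to its closing delimiter.
    """
    closers = {close: open_ for open_, close in brackets.items()}
    delimiters = list(brackets) + list(closers)
    stack = []
    # The segment is the uninteresting stuff; only the delimiter matters here.
    for _segment, delimiter in iter_pairs(text, delimiters):
        if delimiter in brackets:
            stack.append(delimiter)
        elif delimiter in closers:
            if not stack or stack.pop() != closers[delimiter]:
                return False
        # The final pair has an empty delimiter and falls through both tests.
    return not stack


brackets = {"(": ")", "[": "]", "{": "}"}
assert is_balanced("f(a[1], {b})", brackets)
assert not is_balanced("f(a[1)]", brackets)
assert not is_balanced("(unclosed", brackets)
```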
My point with all this: if you consider adding "tuple support" to str.split, I hope you change the return value so it returns something like my AS_PAIRS 2-tuples.