Howdy howdy. I just wanted to throw my two cents in. My comments aren't directly applicable to this issue / PR; however, the idea of adding "tuple support" to str.split has come up, and that's what I want to talk about.
I've implemented my own string processing routines that support multiple separator strings, in my big library. The crown jewel is multisplit, which is like str.split but supporting multiple separators. It also allows fine-grained control over how it splits and what it returns. It's implemented using re.split but it's a lot easier to use than re.split. My goal was to make it the "Swiss army knife" of string splitting routines: the last string splitting routine you'd ever need. It's powerful enough that it can be used to reimplement a bunch of str methods: partition, rpartition, rsplit, split, and splitlines. (And with a little more work, you could probably reimplement lstrip, rstrip, and strip using multisplit too.)
I was motivated to write multisplit out of years of frustration with str.split. Many's the time I've wanted to split a string on multiple separators, and I always found it frustrating that str.split didn't support it. Having written multisplit, I now use it all the time; it's great for simple parsing, when you don't feel like doing battle with regular expressions. (Which, for me, is nearly always.)
The most important lesson I learned from working with multisplit concerns the value returned (or, preferably, yielded). I contend str.split maintains an important invariant. For any two strings X and Y (with Y non-empty, since str.split rejects an empty separator):
V = X.split(Y)
S = Y.join(V)
assert S == X
This assertion is always true. You can always reconstitute the original string using the separator string and the result of the split. (This is also an invariant of str.partition, which I think is an excellent API design. I now try to design my string-splitting functions to also maintain this invariant.)
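A quick sketch of that round trip using only builtins; the sample strings are made up for illustration. Note that it relies on passing an explicit separator, since split() with no argument collapses runs of whitespace and cannot be reversed this way.

```python
# Check the split/join round trip on a few made-up inputs.
samples = [
    ("a,b,c", ","),
    (",a,,b,", ","),       # leading, trailing, and adjacent separators
    ("no separator", ","),
    ("", ","),             # the empty string splits into ['']
    ("a<>b<>c", "<>"),     # multi-character separator
]

for x, y in samples:
    v = x.split(y)
    assert y.join(v) == x

# str.partition keeps the separator in its result, so plain concatenation
# restores the original string.
before, sep, after = "key=value=more".partition("=")
assert before + sep + after == "key=value=more"
```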
If you simply added support for multiple separators to str.split, but didn't change the return value, you'd break this invariant. Each time there was a split, you wouldn't know which of the separators was present in the original string, so you wouldn't know what to put there when you glued it back together.
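To make the breakage concrete, here is a small sketch that uses re.split as a stand-in for a multi-separator split that throws the separators away (str.split itself accepts only one separator today). The input strings are made up.

```python
import re

text = "a,b;c"

# Split on either "," or ";" and discard the separators.
parts = re.split(r"[,;]", text)
assert parts == ["a", "b", "c"]

# A different input produces the same list, so information has been lost ...
assert re.split(r"[,;]", "a;b,c") == parts

# ... and no single join string restores the original.
assert ",".join(parts) != text
assert ";".join(parts) != text
```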
My solution with multisplit was to return the separators too. The trouble was, this was a new function, so at the time I didn't know what form you'd want it to take. So I added four different styles of returning the separators, which admittedly was kind of an API experiment. (That's three styles of returning the separators, and a fourth where we throw the separators away.) I figured the only way to find out what I wanted was to start using the function in real code.
I can tell you now which one is easily the most useful: I called it AS_PAIRS. When you select that, multisplit yields 2-tuples:
(segment, separator)
The segment is the text not containing any separators, and separator is the separator string that immediately followed it in the original string, the one multisplit split on at the right edge of segment. And, like str.split, if you have two adjacent separators, multisplit will happily yield a tuple where segment is an empty string. Also, I guarantee that separator is always an empty string in the last 2-tuple yielded; this produces some funny-looking results at times, but I've convinced myself this really is the behavior you want.
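A minimal sketch of the pairs idea, built on re.split with a capturing group. The name split_as_pairs and its signature are hypothetical, made up for illustration; this is not big's multisplit and supports none of its other options. Trying longer separators first is an assumption of this sketch.

```python
import re

def split_as_pairs(s, separators):
    """Yield (segment, separator) 2-tuples. Hypothetical sketch, not big.multisplit."""
    if not separators or any(not sep for sep in separators):
        raise ValueError("separators must be non-empty strings")
    # Try longer separators first, so "\r\n" wins over "\r" (an assumption here).
    alternatives = sorted(set(separators), key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(sep) for sep in alternatives) + ")"
    # With one capturing group, re.split returns
    # [segment, separator, segment, ..., segment], always an odd length.
    pieces = re.split(pattern, s)
    pieces.append("")  # the last pair always carries an empty separator
    for i in range(0, len(pieces), 2):
        yield pieces[i], pieces[i + 1]

text = "a,b;;c"
pairs = list(split_as_pairs(text, [",", ";"]))
print(pairs)  # [('a', ','), ('b', ';'), ('', ';'), ('c', '')]

# Concatenating everything restores the original string.
assert "".join(segment + separator for segment, separator in pairs) == text
```

A trailing separator shows the funny-looking case: splitting "a," on "," gives ('a', ',') followed by ('', '').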
AS_PAIRS is also the ur-return value, in that it can be used to construct any of the other styles of return values multisplit supports. I also support a style called ALTERNATING, where multisplit yields individual alternating segment and separator strings. And the default "return the separators" mode just appends the separators to the segments. (This last one isn't very useful, unless you're reimplementing str.splitlines and you want to reimplement its keepends feature too.) If you have AS_PAIRS style output, you can easily convert it to ALTERNATING style or "append the separators" style. If you have ALTERNATING it's slightly harder to produce the other two; if you have "append mode" it's way way harder.
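A rough sketch of those conversions, starting from a hand-written list of pairs for the made-up input "a,b;;c". One assumption here: that ALTERNATING output ends on a segment, so the final always-empty separator is dropped. The variable names are illustrative only.

```python
# Pairs-style output for "a,b;;c" split on "," and ";", written out by hand.
pairs = [("a", ","), ("b", ";"), ("", ";"), ("c", "")]

# Pairs -> alternating: flatten, then drop the final empty separator
# (assumes the alternating style ends on a segment).
alternating = [piece for pair in pairs for piece in pair][:-1]
assert alternating == ["a", ",", "b", ";", "", ";", "c"]

# Pairs -> "append the separators" style: glue each separator onto its segment.
appended = [segment + separator for segment, separator in pairs]
assert appended == ["a,", "b;", ";", "c"]

# Alternating -> pairs takes a little more work: pad, then re-pair by position.
padded = alternating + [""]
rebuilt = list(zip(padded[0::2], padded[1::2]))
assert rebuilt == pairs

# All three styles still concatenate back to the original string.
assert "".join(alternating) == "".join(appended) == "a,b;;c"
```

Going from the appended style back to pairs has no such shortcut: you would have to search each string for a trailing separator again, which is why that direction is the hard one.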
AS_PAIRS is conveniently unambiguous; for every value returned, you know whether it's a separator or a non-separator string. And of course, it maintains the invariant: if you concatenate all the strings together you reconstitute the original string.
Knowing what the separator was is also useful in and of itself. In fact it can be the whole point of splitting the string. For example, I used multisplit to implement a simple "balanced delimiter parser" for big called split_delimiters, which makes it easy to parse simple .rc files and shell scripts and such. You tell split_delimiters all the delimiters you want it to recognize, and I use that as the list of separators I split on. Then when I parse a text, I call multisplit with that list of delimiters, and I use AS_PAIRS mode. This lets me leap past all the uninteresting stuff and just examine the delimiters: the uninteresting stuff is in the segment, and the delimiter is conveniently isolated in the separator. This quickly became my preferred approach for writing these little mini-parsers.
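To illustrate the "skip the segments, look at the separators" pattern, here is a toy balance checker. It is not big's split_delimiters; the function name, the dict-of-delimiters argument, and the behavior are all assumptions made for this sketch, and it only handles delimiters whose opening and closing strings differ.

```python
import re

def check_balanced(text, delimiters):
    """Return True if delimiters nest properly. Hypothetical helper.

    delimiters maps each opening string to its closing string;
    opening and closing strings are assumed to be different.
    """
    tokens = set(delimiters) | set(delimiters.values())
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(token) for token in ordered) + ")"
    # Pad so the pieces pair up as (segment, separator), with "" as the last separator.
    pieces = re.split(pattern, text) + [""]
    stack = []
    for segment, separator in zip(pieces[0::2], pieces[1::2]):
        # The segment is the uninteresting text; only the separator matters here.
        if separator in delimiters:
            stack.append(delimiters[separator])  # remember the expected closer
        elif separator:
            if not stack or stack.pop() != separator:
                return False
    return not stack

brackets = {"(": ")", "[": "]", "{": "}"}
assert check_balanced("f(a[1], {b: 2})", brackets)
assert not check_balanced("f(a[1)]", brackets)
```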
My point with all this: if you consider adding "tuple support" to str.split, I hope you change the return value so it returns something like my AS_PAIRS 2-tuples.