str.split: allow grouping separators under an optional parameter

Currently, we have two different behaviors for str.split depending on the sep value:

  • If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings.
  • If sep is not specified or is None, runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

While both behaviors are useful, the second one (grouping separators) is limited to whitespace and cannot be used with custom separators, even though the logic is already present in the code.

Proposal

I suggest adding an optional group_sep bool parameter, which will be used as follows:

  • By default, group_sep = False.
  • If sep is not specified or is None, group_sep has no effect (no changes to the existing behavior).
  • If sep is given, group_sep = False will preserve the current behavior.
  • If sep is given, group_sep = True will treat consecutive sep characters as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing sep characters (mimicking the current behavior for non-specified sep).

The above will preserve the existing behavior if group_sep is not specified, and will allow grouping separators by explicitly setting this parameter to True.
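To make the proposed semantics concrete, here is a minimal pure-Python sketch that emulates them on top of the existing str.split. The function name split_grouped is hypothetical and only for illustration, maxsplit is deliberately left out, and this is not how the C implementation would do it.

```python
def split_grouped(s, sep=None, group_sep=False):
    """Hypothetical emulation of the proposed str.split(sep, group_sep=...)."""
    # sep omitted/None, or grouping not requested: existing behavior unchanged.
    if sep is None or not group_sep:
        return s.split(sep)
    # Grouping requested: runs of sep act as one separator, and leading or
    # trailing separators produce no empty strings. Dropping the empty
    # pieces from the ordinary split gives exactly that result.
    return [part for part in s.split(sep) if part]


print(split_grouped(':foo:::bar::', ':'))                  # current behavior
print(split_grouped(':foo:::bar::', ':', group_sep=True))  # proposed behavior
```

With group_sep=True the second call yields ['foo', 'bar'], while the first keeps the empty strings as str.split does today.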

Implementation

The needed logic is already defined in STRINGLIB(split_whitespace). To reuse it, we could convert this function to one that accepts an is_separator function as a parameter, and then declare STRINGLIB(split_whitespace) and STRINGLIB(split_char_group) (or whatever) as its wrappers, passing the right parameter value.
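The refactoring idea can be sketched in Python (the real code is C in stringlib, so this only illustrates the shape of it). The names split_by_predicate, split_whitespace and split_char_group are assumptions for the example, and the single-character sep is a simplification.

```python
def split_by_predicate(s, is_separator):
    """Split s on runs of characters for which is_separator(ch) is true."""
    parts = []
    i, n = 0, len(s)
    while i < n:
        # Skip a run of separator characters.
        while i < n and is_separator(s[i]):
            i += 1
        if i == n:
            break
        # Collect the following run of non-separator characters.
        j = i
        while j < n and not is_separator(s[j]):
            j += 1
        parts.append(s[i:j])
        i = j
    return parts


def split_whitespace(s):
    # Wrapper: whitespace is the separator (approximates s.split()).
    return split_by_predicate(s, str.isspace)


def split_char_group(s, sep):
    # Wrapper: a single custom character is the separator (assumes len(sep) == 1).
    return split_by_predicate(s, lambda ch: ch == sep)
```

Both wrappers share the same loop and differ only in the predicate they pass, which is the point of the suggested change.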


Do you have real world examples of where this would be useful, or is this simply generalisation for the sake of it?

Personally, I have needed both of the current behaviours pretty often, but have never, to my recollection, needed the proposed new behaviour.

You can use filter() to drop out empty strings.

>>> ':foo:::bar::'.split(':')
['', 'foo', '', '', 'bar', '', '']
>>> list(filter(None, ':foo:::bar::'.split(':')))
['foo', 'bar']
>>> ':foo:::bar::'.replace(':', ' ').split()
['foo', 'bar']

Using filter() also has the benefit of ignoring leading/trailing separators, and it correctly returns an empty list if given an empty string.


One workaround is to use re.findall with a negated character class:

re.findall(r'[^:]+', ':foo:::bar::') # ['foo', 'bar']

Note that, unlike the replace(':', ' ').split() workaround, this approach keeps spaces in the original input as part of the values rather than treating them as separators.
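To see how each workaround from this thread treats spaces inside the values, here is a small comparison. The sample input is made up for illustration.

```python
import re

text = ':foo bar:::baz::'  # made-up input with a space inside one value

# filter() drops the empty strings and leaves the values untouched.
via_filter = list(filter(None, text.split(':')))

# replace() + split() turns every ':' into whitespace, so existing
# spaces also act as separators.
via_replace = text.replace(':', ' ').split()

# The negated character class matches runs of anything except ':'.
via_regex = re.findall(r'[^:]+', text)

print(via_filter)   # ['foo bar', 'baz']
print(via_replace)  # ['foo', 'bar', 'baz']
print(via_regex)    # ['foo bar', 'baz']
```

Only the replace() variant splits 'foo bar' in two, so it is safe only when the values cannot contain whitespace.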