String.isplit (iterator-based split for strings)

bobkayakbob · March 5, 2021, 8:12pm

Proposing an iterator-based split for strings.
I can’t tell you how many times I see:

for substr in str.split():
    match = operation(substr)
    if match:
        break

With isplit, we needn’t split the whole string when we break out early.

brettcannon · March 5, 2021, 8:31pm

What’s the exact API you are proposing? Is it literally just to have lazy splitting, or did you want to incorporate filtering as well?

And you can already avoid splitting over the whole string by iterating and slicing on the string yourself.

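start = 0
for index, char in enumerate(str):
    if char == " ":
        substr = str[start:index]
        if match := operation(substr):
            break
        start = index + 1
else:
    # No break: the final piece after the last space still needs checking.
    match = operation(str[start:])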

bobkayakbob · March 5, 2021, 9:54pm

probably named isplit, takes the same args as split, does the same thing izip did to zip – just creates an iterator that yields the same values, just not in an explicit collection (just an iterable)
so mainly lazy splitting, filtering can happen as a different thing (thru filter or the like)

for substr in wholestr.isplit():
    operation(substr)

the above you present is nice, but having it in the language itself would be best, just like how split() isn’t implemented as above in everyone’s codebase

steven.daprano · March 6, 2021, 6:02am

An iterator-based version of split has been requested many times. If you
search the archives of Python-List and Python-Ideas, I’m sure you
will find many requests for it.

I think that it’s never succeeded because the discussions get bogged
down in the usual Python-Ideas conservativeness and lack of anyone
willing to do the actual work.

Unfortunately the Python culture can sometimes be remarkably
conservative. (I should know, I’m often one of the nay-sayers.) Without
a core developer willing to push an issue through to completion, even
straight-forward feature requests with clear interest and support can
just fade away due to neglect or lack of obvious direction. This is one
of them.

Here was a simple, straightforward feature request for an iterator
version of str.split that immediately got derailed into issues like
string views and whether or not os.listdir should return an iterator,
and then closed as Rejected.

https://bugs.python.org/issue17343

Here are a couple of examples on StackOverflow:

1. Are we talking about a backwards incompatible change to str.split, or
   a backwards compatible new method?
2. If a change to split, that would require a long deprecation period.
3. But if a new method, cue the bikeshedding: itersplit, iter_split,
   isplit, something else?
4. str.split alone, or also include str.splitlines?
5. Should it support the full split API(s)?

   string.split(sep=None, maxsplit=-1)
   string.splitlines(keepends=False)

6. Why not just use re.finditer?

I think the answer to number 5 is “Of course!” but the question needs to
be asked in case there is some non-obvious reason why we would not
support the existing API.
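
As a rough illustration of the re.finditer route raised in the last question, here is a minimal sketch that lazily yields whitespace-separated words, roughly like str.split() with no arguments. The names iter_words and first_match are made up for this example, and only the no-argument case is covered.

```python
import re


def iter_words(text):
    """Lazily yield whitespace-separated words, like text.split() with no args.

    Hypothetical helper for illustration; not a stdlib API.
    """
    for m in re.finditer(r"\S+", text):
        yield m.group()


def first_match(text, predicate):
    """Return the first word satisfying predicate, or None, stopping early."""
    for word in iter_words(text):
        if predicate(word):
            return word
    return None
```

Because the generator is consumed one word at a time, first_match stops scanning as soon as the predicate succeeds, which is the early-break case the original post describes.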

pepoluan · March 6, 2021, 6:56am

TBH this is the first time I’ve ever heard of re.finditer(), and I have to thank everyone here for this fine little nugget.

Python has so many bells and whistles that can optimize one’s programs so much … but they don’t get discussed much. Most tutorials I’ve found just regurgitate the same things: comprehensions, map() & filter(), enumerate(), and so on.

Beautiful little helpers such as partition() (and its twin, rpartition()), re.finditer(), sys.version_info, hex(), and so on are rarely discussed, oftentimes resulting in someone (me) slapping their forehead and saying, “Where have you been my whole life…”
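
For instance, str.partition() alone is enough for a simple lazy split. The sketch below is only an illustration: isplit_partition is a hypothetical name, and it assumes an explicit, non-empty separator.

```python
def isplit_partition(text, sep):
    """Lazily yield the pieces of text.split(sep) using str.partition.

    Hypothetical helper; sep must be a non-empty string.
    """
    if not sep:
        raise ValueError("empty separator")
    rest = text
    while True:
        # partition returns (head, sep, tail); the middle item is empty
        # when the separator was not found.
        head, found, rest = rest.partition(sep)
        yield head
        if not found:
            return
```

Note that each partition() call copies the remaining tail, so this version is lazy but not light on memory for long strings; tracking indices with str.find avoids the copies.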

I think for the next PyCon, someone needs to do an inventory of stdlib, scan all open repos on Git{Hub|Lab}|Source{Hut|Forge} and see which functions are severely underused, and do a talk about these little helpful functions.

apalala · March 6, 2021, 11:31am

This is the version of isplit() I wrote for internal use. It preserves the semantics of split() except that it doesn’t handle encodings different from UTF-8 (split() doesn’t do them correctly either):

import re


def isplit(text, sep=None, maxsplit=-1):
    """
    A low-memory-footprint version of:

        iter(text.split(sep, maxsplit))

    see also:
      https://zyte.atlassian.net/browse/BV-9866
      https://bugs.python.org/issue17343
    """

    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep is not None and type(sep) != type(text):
        raise TypeError(f'must be {type(text).__name__} or None, not {type(sep).__name__}')
    if sep in ('', b''):
        raise ValueError('empty separator')

    if maxsplit == 0 or not text:
        yield text
        return

    sep = sep.decode() if isinstance(sep, bytes) else sep
    rsep = re.escape(sep) if sep is not None else r'\s+'
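    # \A rather than ^ with (?m): a multiline ^ would also match right after a
    # newline and yield a spurious empty piece when a separator follows it.
    regex = fr'(?s)(?:\A|{rsep})((?:(?!{rsep}).)*)'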
    regex = regex if isinstance(text, str) else regex.encode()

    for n, p in enumerate(re.finditer(regex, text)):
        if 0 <= maxsplit <= n:
            yield p.string[p.start(1):]
            return
        yield p.group(1)

apalala · March 6, 2021, 11:33am

This is the version of finditer() (and related helpers like findfirst() and findalliter()) that I wrote for internal use:

import re

_undefined = object()  # assumed sentinel meaning "no default supplied"


def first(iterable, default=_undefined):
    """Return the first item of *iterable*, or *default* if *iterable* is
    empty.

        >>> first([0, 1, 2, 3])
        0
        >>> first([], 'some default')
        'some default'

    If *default* is not provided and there are no items in the iterable,
    raise ``ValueError``.

    :func:`first` is useful when you have a generator of expensive-to-retrieve
    values and want any arbitrary one. It is marginally shorter than
    ``next(iter(iterable), default)``.

    """
    # NOTE: https://more-itertools.readthedocs.io/en/stable/_modules/more_itertools/more.html#first
    try:
        return next(iter(iterable))
    except StopIteration:
        # I'm on the edge about raising ValueError instead of StopIteration. At
        # the moment, ValueError wins, because the caller could conceivably
        # want to do something different with flow control when I raise the
        # exception, and it's weird to explicitly catch StopIteration.
        if default is _undefined:
            raise ValueError('first() was called on an empty iterable, and no '
                             'default value was provided.')
        return default


def findalliter(pattern, string, flags=0):
    '''
        like finditer(), but with return values like findall()

        implementation taken from cpython/Modules/_sre.c/findall()
    '''
    for m in re.finditer(pattern, string, flags=flags):
        default = string[0:0]
        g = m.groups(default=default)
        if len(g) == 1:
            yield g[0]
        elif g:
            yield g
        else:
            yield m.group()


def findfirst(pattern, string, flags=0, default=_undefined):
    """
    Avoids using the inefficient findall(...)[0], or first(findall(...))
    """
    return first(findalliter(pattern, string, flags=flags), default=default)

EpicWink · March 7, 2021, 1:37am

Perhaps we should add to the documentation of str.split: “for more advanced functionality, see re.finditer”

apalala · March 7, 2021, 11:03pm

re.finditer() returns match objects.

The semantics of split() and findall() are much more amicable, but they return a list.
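
A small sketch of that difference, using a made-up sample string and pattern:

```python
import re

text = "alpha,beta,gamma"

# findall() builds the whole list of strings up front.
eager = re.findall(r"[^,]+", text)

# finditer() is lazy, but each item is a re.Match object...
matches = re.finditer(r"[^,]+", text)
first = next(matches)

# ...so a generator expression is needed to get split()-like strings.
lazy_strings = (m.group() for m in re.finditer(r"[^,]+", text))
```

The pattern here skips empty fields, so it is not a drop-in replacement for text.split(","); it only shows what each function hands back.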

gwerbin · March 10, 2021, 7:06pm

I like the idea of isplit and isplitlines. There’s already precedent for this naming convention in multiprocessing.

Arguably we should have had ifilter and imap in the Python 3 stdlib alongside filter and map, but that’s ancient history by now

brettcannon · March 10, 2021, 9:26pm

map() and filter() became lazy in Python 3; reduce() moved to functools.reduce() and still returns a single value.
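
A quick way to see the laziness of map() and filter() in Python 3 (the square function and the calls list are just scaffolding for this sketch):

```python
calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


lazy = map(square, [1, 2, 3])  # nothing is computed yet
before = list(calls)           # snapshot: still empty

first = next(lazy)             # computes only the first item
after_first = list(calls)

evens = filter(lambda n: n % 2 == 0, range(10))  # also a lazy iterator
```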

gwerbin · March 10, 2021, 10:39pm

I was being facetious. In an alternate reality, we might have had both map() (returning a list) and imap (returning the lazy iterable we currently have).

bobkayakbob · March 11, 2021, 8:06pm

Thanks all.
re.finditer is probably closest to what I’d use as an in-my-codebase hackaround.
I very strongly agree with Laurie O’s suggestion of explicitly adding re.finditer to the str.split docs (if isplit isn’t something that will be implemented).

irish_beast · November 7, 2023, 3:52am

I read this with interest. On micropython platforms with 64 kB RAM iters are almost mandatory or at least stupid not to use.

I’ve written a library for micropython with isplitstr

"""Lightweight, memory efficient enum & str.isplit implementation

micropython, and cpython:

isplitstr (str.isplit)

dut = isplitstr('/usr/bin/local/.././myloc/./hello.py', '/')

list(dut)                   # ['', 'usr', 'bin', 'local', '..', '.', 'myloc', '.', 'hello.py']
dut[1:4]                    # ['usr', 'bin', 'local']
for word in dut:            # 9 lazy iterations, last word empty str
    print(word)

'myloc' in dut              # True
dut.index('myloc')          # 3

dut.normalise()             # ['', 'usr', 'bin', 'myloc', 'hello.py']
dut.normpath()              # '/usr/bin/myloc/hello.py'
dut.normpath(0, -1)         # '/usr/bin/myloc'

pathjoin is not exactly a method and its return must be sliced [a:b:c]
But it can be preceded by a call to change the joinstr default: '/'

dut.pathjoin[:-1]           # '/usr/bin/myloc'
dut.pathjoin('\\')[:]       # '\usr\bin\myloc\hello.py'
dut.pathjoin                # <object isplitstr>

alexprengere · December 5, 2023, 8:09am

I also regularly need the lazy splitting of strings and bytes.
The re.finditer solution works well to match the str.split semantics, but to emulate str.rsplit, we would need the ability to have a regex match starting from the end of the string.
This is available in the third-party regex module with the regex.REVERSE flag, but unfortunately it is not available in the stdlib re module (it is mentioned in this 21-year-old issue).
So overall I tend to use repeated str.find/str.rfind calls, manually keeping track of indices, but I agree that adding either str.isplit/str.irsplit or re.REVERSE would be a nice way to “solve” this.

EDIT: adding the code I use (this does not support split()/rsplit() with no argument)

def isplit(s, sep):
    """Lazy version of s.split(sep)

    >>> list(isplit("", ","))
    ['']
    >>> list(isplit("AAA", ","))
    ['AAA']
    >>> list(isplit("AAA,", ","))
    ['AAA', '']
    >>> list(isplit("AAA,BBB", ","))
    ['AAA', 'BBB']
    >>> list(isplit("AAA,,BBB", ",,"))
    ['AAA', 'BBB']
    """
    seplen = len(sep)
    if seplen == 0:
        raise ValueError("empty separator")

    start = 0
    while True:
        index = s.find(sep, start)
        if index == -1:
            yield s[start:]
            return
        yield s[start:index]
        start = index + seplen


def irsplit(s, sep):
    """Lazy version of s.rsplit(sep)

    >>> list(irsplit("", ","))
    ['']
    >>> list(irsplit("AAA", ","))
    ['AAA']
    >>> list(irsplit("AAA,", ","))
    ['', 'AAA']
    >>> list(irsplit("AAA,BBB", ","))
    ['BBB', 'AAA']
    >>> list(irsplit("AAA,,BBB", ",,"))
    ['BBB', 'AAA']
    """
    seplen = len(sep)
    if seplen == 0:
        raise ValueError("empty separator")

    end = len(s)
    while True:
        index = s.rfind(sep, 0, end)
        if index == -1:
            yield s[:end]
            return
        yield s[index + seplen : end]
        end = index
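
The find-based approach extends fairly naturally to maxsplit. The following is a sketch rather than part of the code posted above: isplit_max is a hypothetical name, and like the original it requires an explicit separator.

```python
def isplit_max(s, sep, maxsplit=-1):
    """Lazy version of s.split(sep, maxsplit); sep is required.

    Hypothetical extension for illustration; works for str and bytes.
    """
    seplen = len(sep)
    if seplen == 0:
        raise ValueError("empty separator")

    start = 0
    splits = 0
    # A negative maxsplit means "no limit", as in str.split.
    while maxsplit < 0 or splits < maxsplit:
        index = s.find(sep, start)
        if index == -1:
            break
        yield s[start:index]
        start = index + seplen
        splits += 1
    # Whatever is left (possibly empty) is the final piece.
    yield s[start:]
```

Only slices are created, so nothing beyond the current piece is copied until the final remainder is yielded.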