Add a `count` keyword to str.index and str.find to find the n-th occurrence of a substring in a string

Current signature:
str.index(sub[, start[, end]])

New suggested signature:
str.index(sub[, start[, end[, count]]])

str.index would then return the count-th occurrence. By default, count is of course 1.
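For illustration, a call under the proposed semantics would look like this (hypothetical, since the keyword does not exist; expected results as comments):

s = "one,two,three,four"
s.index(",", count=1)   # 3, same as s.index(",") today
s.index(",", count=3)   # 13, the position of the third comma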

EDIT: Removed getting the nth column from a TSV example as it derailed the discussion.

An example is chunking FASTQ files, which have entries that consist of 4 lines each. It would be so nice to be able to do something like this:

binary_data = remainder + file.read(128*1024 - len(remainder))
newlines = binary_data.count(b"\n")
number_of_records = newlines // 4
fastq_block_end = binary_data.index(b"\n", count=number_of_records * 4) + 1  # Currently only possible with custom C/Cython code
yield binary_data[:fastq_block_end]
remainder = binary_data[fastq_block_end:]

First off, your proposed method is not reliable; a comma could be part of a quoted field value rather than a separator.

Second, the standard library already provides a CSV parser, not to mention higher-level options like Pandas.
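For example, a minimal sketch with the stdlib csv module (the sample row is made up):

import csv

row = 'ham,"spam, extra",eggs,jam,lamb'
# csv.reader understands quoting, so the comma inside "spam, extra" is not treated as a separator
fields = next(csv.reader([row]))
print(fields[2])  # eggs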

Third:

$ python -m timeit --setup "s = 'ham,spam,eggs,jam,lamb'" -- "s.split(',', maxsplit=3)[2]"
1000000 loops, best of 5: 224 nsec per loop

My computer was nothing special when I bought it, 9 years ago. Is this not fast enough?

Fourth: intuition about performance can be wrong. Here, using maxsplit actually slows it down if there aren’t a lot of extra fields:

$ python -m timeit --setup "s = 'ham,spam,eggs,jam,lamb'" -- "s.split(',')[2]"
2000000 loops, best of 5: 176 nsec per loop

(Those timing results are with 3.8; 3.11 is a bit faster still.)

FWIW, 3.12 is adding an old itertools recipe for this sort of thing.

Looks like a great job for string splitting. First fracture the entire string, then index within that; much more reliable than counting characters every time.

I should never have brought up parsing a single column from a CSV as an example, because it apparently muddies the discussion. The point is that I have use cases, also outside CSV parsing, for getting the n-th occurrence of a substring or character. EDIT: I have removed that example from the original post.

Currently it requires multiple Python statements to get that n-th occurrence, while it could be very simple with a count keyword on find and index. If other people also see a use for this, it is worth implementing. If not, so be it; I can write my own C extensions just fine. It is quite trivial to implement, and if I can make other people happy by implementing it in the core language, I am glad to do so.
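For reference, this is roughly what the pure-Python workaround looks like today (find_nth is a hypothetical helper name, not an existing API):

def find_nth(s, sub, count, start=0):
    # One str.find call per occurrence; every call pays Python call overhead
    # before dropping back into the C search code.
    pos = s.find(sub, start)
    for _ in range(count - 1):
        if pos == -1:
            break
        pos = s.find(sub, pos + len(sub))
    return pos

find_nth("a,b,c,d", ",", 3)  # 5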

FWIW, 3.12 is adding an old itertools recipe for this sort of thing.

That does not work in this case. The itertools recipe only works when the intervals are a fixed size. Here they are not, because the records are variable-sized, so search methods are needed to find out how large the chunks are.

My proposal is to split on lines (perhaps lazily, by searching for the next newline on demand and yielding a line), and then batch groups of 4 and get the one you want from each batch.

(or, since you seemingly just want the nth of each group of lines, just islice the line iterator.)

I am talking about the FASTQ format here. We have files with literally more than 4_000_000_000 lines. Chunking them efficiently and distributing these chunks among compute cores is a great way to reduce wall clock time.
Splitting that up into 1_000_000_000 individual records, each consisting of 4 individual lines, in some sort of generator function is not efficient. That is why we use single blocks of binary data that are then parsed properly in the individual threads. Efficient searching for record delimiters is quite useful there.
At these file sizes speed really does matter, and having a simple count keyword so that I do not have to loop over multiple find calls is really useful for speed reasons.

The count parameter would have to basically be implemented that way though, so I don’t know how much you’d really gain.

Since this is a method for both bytes and str this would be implemented in C and be much faster than the equivalent Python loop.


If I just need the sequence, I parse fastq files with itertools.islice:

import gzip, itertools

with gzip.open(fastq_path, "rt") as fh:
    for seq in itertools.islice(fh, 1, None, 4):  # lines 1, 5, 9, ... are the sequence lines
        who_cares_about_qscores(seq)

If you want to group the four lines, the batched tool does in fact work. The key is to read the file as text, not bytes.

with gzip.open(fastq_path, "rt") as fh:
    for h, seq, _, qscores in itertools.batched(fh, 4):
        me_I_care_about_qscores(seq, qscores)

If you need faster reading of the file I would write (or use) something in another language.

Moving this to help instead of ideas as it’s a discussion about how to parse a data format efficiently.

Adding count here doesn't really do anything; it still incurs a full scan of the data up to that point to determine where count is met, even in C. If you're parsing a specific data format, you'll always be better off using a purpose-made tool. In this case islice or another itertools or more-itertools recipe would be much more efficient, and there are probably dedicated libraries for FASTQ already. Your current implementation has a memory issue as well: you're creating huge new strings constantly with remainder = binary_data[fastq_block_end:] on each iteration.

@davidism, this is NOT a discussion about how to parse a data format efficiently.

The case is simple.

  1. Finding the nth occurrence of a substring or character in Python is slow. Finding the first or last is fast (using find or rfind, respectively).
  2. Finding the nth occurrence can potentially be very fast by implementing a single keyword.
  3. I can implement that, I am offering my time.
  4. I see use cases for that.

My question: do other people see use cases? If so, I will implement this idea; if not, I won't.

Instead I get people arguing that I can do my examples differently. Yes, I can; I know I can. But that is not my question. I have an idea that I want to discuss.

If you need faster reading of the file I would write (or use) something in another language.

I have contributed to Cython parsers of the data formats I use. dnaio (GitHub - marcelm/dnaio: Read and write FASTQ and FASTA efficiently from Python) is the fastest FASTQ parser in any language that I know of (I have benchmarked it against parsers in other languages as well). This is off-topic, though. Sometimes I just write one-off scripts, and some problems I run into could be solved faster, and sometimes more easily, by adding this count keyword. That does not mean I cannot solve my problems without the keyword. I can, but that is not the point.

Adding count here doesn’t really do anything, it still incurs a full scan of the data up to that point to determine where count is met, even in C.

Scanning for a single character multiple times in a block of memory is very fast in C. In a 128 KiB chunk with, say, a count of 1000 occurrences, this is much faster in C than having a Python function called 1000 times; Python call overhead is substantial. In fact, for the discussed split method, the fastest part is actually finding the separators. The slow part is allocating the split strings on the heap. But again, that is not the point I am making here.


do other people see use cases

I think the answer is no at this point, because what people keep seeing is that there are better ways to implement whatever examples are given. You need to make the case for why it’s essential that this gets in, not us.

I fully agree that it is on me to prove that the idea is useful. However, right now I feel I am not discussing the idea; rather, people think I simply have a problem that needs solving, as evidenced by you moving this to a completely different topic forum.

Furthermore, it does not help that the example I give is about chunking data into blocks of multiple records, while the feedback I get is about how to handle singular records. That is solving a different problem. And yes, for singular records the count keyword is not going to be useful. But that is precisely the reason why I gave the example with blocks of multiple records.

The best way to show this would be to give timing results for a practical example, ideally a) including a comparison to a C implementation for reference and b) profiling to show that the overhead of creating temporary objects is causing a significant slowdown (as opposed to either I/O or the actual string search algorithm).
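Something along these lines, for example (the setup data is synthetic and made up, and I have not run these, so no numbers attached):

$ python -m timeit --setup "data = (b'A' * 31 + b'\n') * 4096" \
    "pos = -1" \
    "for _ in range(4000): pos = data.find(b'\n', pos + 1)"
$ python -m timeit --setup "data = (b'A' * 31 + b'\n') * 4096" \
    "data.split(b'\n', 4000)"

The first command times a Python-level find loop locating the 4000th newline in a ~128 KiB block; the second times the split-based alternative with maxsplit.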

The idea seems plausible but, so far, not compelling to others. I think this is partly because defined-format strings often have specialized handlers, such as for csv, and partly because it is unclear how often people in general will want the position of the nth occurrence of a substring and nothing else.

Do you know of any places where the proposal would make a clear improvement in the code?

Do note that additions have a non-zero continuing cost for education, maintenance, and execution, so we try to balance anticipated but sometimes hard to estimate costs and benefits.


Do you know of any places where the proposal would make a clear improvement in the code?

I can think of one or two places in my code. Having said that, I looked through the stdlib to see if there was any use case for this, and I couldn’t find any.

The best way to show this would be to give timing results for a practical example, ideally a) including a comparison to a C implementation for reference and b) profiling to show that the overhead of creating temporary objects is causing a significant slowdown (as opposed to either I/O or the actual string search algorithm).

I did a comparison between find and split and found that split is actually more useful and faster for most use cases where the string is reasonably small. On large strings (or binary data) find is of course faster, but that is already more of a fringe use case and wanting the n-th occurrence rather than the first or last even more so.

Do note that additions have a non-zero continuing cost for education, maintenance, and execution, so we try to balance anticipated but sometimes hard to estimate costs and benefits.

Yes, my primary gripe with this would actually be consistency: list also has an index method with the same signature, so to be fully consistent it would also need this count keyword, or there would be an inconsistency.

So all in all, not worth it to be added indeed. Thank you all for the feedback.