Add a `count` keyword to str.index and str.find to find the n-th occurrence of a substring in a string

Current signature:
str.index(sub[, start[, end]])

New suggested signature:
str.index(sub[, start[, end[, count]]])

str.index would then return the count-th occurrence. By default, count is of course 1.
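For illustration, a call under the proposed semantics would look like this (hypothetical, since the keyword does not exist; expected results as comments):

s = "one,two,three,four"
s.index(",", count=1)   # 3, same as s.index(",") today
s.index(",", count=3)   # 13, the position of the third comma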

EDIT: Removed getting the nth column from a TSV example as it derailed the discussion.

An example is chunking FASTQ files, which have entries that consist of 4 lines each. It would be so nice to be able to do something like this:

binary_data = remainder + file.read(128*1024 - len(remainder))
newlines = binary_data.count(b"\n")
number_of_records = newlines // 4
fastq_block_end = binary_data.index(b"\n", count=number_of_records * 4) + 1  # Currently only possible with custom C/Cython code
yield binary_data[:fastq_block_end]
remainder = binary_data[fastq_block_end:]

First off, your proposed method is not reliable; a comma could be part of a quoted field value rather than a separator.

Second, the standard library already provides a CSV parser, not to mention higher-level options like Pandas.
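For example, a minimal sketch with the stdlib csv module (the sample row is made up):

import csv

row = 'ham,"spam, extra",eggs,jam,lamb'
# csv.reader understands quoting, so the comma inside "spam, extra" is not treated as a separator
fields = next(csv.reader([row]))
print(fields[2])  # eggs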

Third:

$ python -m timeit --setup "s = 'ham,spam,eggs,jam,lamb'" -- "s.split(',', maxsplit=3)[2]"
1000000 loops, best of 5: 224 nsec per loop

My computer was nothing special when I bought it, 9 years ago. Is this not fast enough?

Fourth: intuition about performance can be wrong. Here, using maxsplit actually slows it down if there aren’t a lot of extra fields:

$ python -m timeit --setup "s = 'ham,spam,eggs,jam,lamb'" -- "s.split(',')[2]"
2000000 loops, best of 5: 176 nsec per loop

(Those timing results are with 3.8; 3.11 is a bit faster still.)

FWIW, 3.12 is adding an old itertools recipe for this sort of thing.

Looks like a great job for string splitting. First fracture the entire string, then index within that; much more reliable than counting characters every time.

I should never have brought up parsing a single column from a CSV as an example, because it apparently muddies the discussion. The point is that I have use cases, also outside CSV parsing, for getting the n-th occurrence of a substring or character. EDIT: I have removed that example from the original post.

Currently it requires multiple Python statements to get that n-th occurrence, while it could be very simple with a count keyword on find and index. If other people also see a use for this, it is worth implementing. If not, so be it; I can write my own C extensions just fine. It is quite trivial to implement, and if I can make other people happy by implementing it in the core language, I am glad to do so.
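For reference, this is roughly what the pure-Python workaround looks like today (find_nth is a hypothetical helper name, not an existing API):

def find_nth(s, sub, count, start=0):
    # One str.find call per occurrence; every call pays Python call overhead
    # before dropping back into the C search code.
    pos = s.find(sub, start)
    for _ in range(count - 1):
        if pos == -1:
            break
        pos = s.find(sub, pos + len(sub))
    return pos

find_nth("a,b,c,d", ",", 3)  # 5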

FWIW, 3.12 is adding an old itertools recipe for this sort of thing.

That does not work in this case. The itertools recipe only works when the intervals are a fixed size. Here they are not, because the records are variable-sized, so search methods are needed to find out how large the chunks are.

My proposal is to split on lines (perhaps lazily, by searching for the next newline on demand and yielding a line), and then batch groups of 4 and get the one you want from each batch.

(or, since you seemingly just want the nth of each group of lines, just islice the line iterator.)

I am talking about the FASTQ format here. We have files with literally more than 4_000_000_000 lines. Chunking them efficiently and distributing these chunks among compute cores is a great way to reduce wall clock time.
Splitting that up into 1_000_000_000 individual records, each consisting of 4 individual lines, in some sort of generator function is not efficient. That is why we use single blocks of binary data that are then parsed properly in the individual threads. Efficient searching for record delimiters is quite useful there.
At these file sizes speed really does matter, and having a simple count keyword so that I do not have to loop over multiple find calls is really useful for speed reasons.

The count parameter would have to basically be implemented that way though, so I don’t know how much you’d really gain.

Since this is a method for both bytes and str this would be implemented in C and be much faster than the equivalent Python loop.


If I just need the sequence, I parse fastq files with itertools.islice:

import gzip, itertools

with gzip.open(fastq_path, "rt") as fh:
    for seq in itertools.islice(fh, 1, None, 4):  # lines 1, 5, 9, ... are the sequence lines
        who_cares_about_qscores(seq)

If you want to group the four lines, the batched tool does in fact work. The key is to read the file as text, not bytes.

with gzip.open(fastq_path, "rt") as fh:
    for h, seq, _, qscores in itertools.batched(fh, 4):
        me_I_care_about_qscores(seq, qscores)

If you need faster reading of the file I would write (or use) something in another language.

Moving this to help instead of ideas as it’s a discussion about how to parse a data format efficiently.

Adding count here doesn't really do anything; it still incurs a full scan of the data up to that point to determine where count is met, even in C. If you're parsing a specific data format, you'll always be better off using a purpose-made tool. In this case islice or another itertools or more-itertools recipe would be much more efficient, and there are probably dedicated libraries for FASTQ already. Your current implementation has a memory issue as well: you're creating huge new strings constantly with remainder = binary_data[fastq_block_end:] on each iteration.

@davidism, this is NOT a discussion about how to parse a data format efficiently.

The case is simple.

  1. Finding the nth occurrence of a substring or character in Python is slow. Finding the first or last is fast (using find or rfind, respectively).
  2. Finding the nth occurrence can potentially be very fast by implementing a single keyword.
  3. I can implement that, I am offering my time.
  4. I see use cases for that.

My question: do other people see use cases? If so, I will implement this idea; if not, I won't.

Instead I get people arguing that I can do my examples differently. Yes, I can; I know I can. But that is not my question. I have an idea that I want to discuss.

If you need faster reading of the file I would write (or use) something in another language.

I have contributed to Cython parsers of the data formats I use. dnaio (GitHub - marcelm/dnaio: Read and write FASTQ and FASTA efficiently from Python) is the fastest FASTQ parser in any language that I know of (I have benchmarked it against parsers in other languages as well). This is off-topic, though. Sometimes I just write one-off scripts, and some problems I run into could be solved faster, and sometimes more easily, by adding this count keyword. That does not mean I cannot solve my problems without the keyword. I can, but that is not the point.

Adding count here doesn’t really do anything, it still incurs a full scan of the data up to that point to determine where count is met, even in C.

Scanning for a single character multiple times in a block of memory is very fast in C. In a 128 KiB chunk with, say, a count of 1000 occurrences, this is much faster in C than having a Python function called 1000 times; Python call overhead is substantial. In fact, for the discussed split method, the fastest part is actually finding the separators. The slow part is allocating the split strings on the heap. But again, that is not the point I am making here.


do other people see use cases

I think the answer is no at this point, because what people keep seeing is that there are better ways to implement whatever examples are given. You need to make the case for why it’s essential that this gets in, not us.

I fully agree that it is on me to prove that the idea is useful. However, right now I feel I am not discussing the idea; rather, people think I simply have a problem that needs solving, as evidenced by you moving this to a completely different topic forum.

Furthermore, it does not help that the example I give is about chunking data into blocks of multiple records, while the feedback I get is about how to handle singular records. That is solving a different problem. And yes, for singular records the count keyword is not going to be useful. But that is precisely the reason why I gave the example with blocks of multiple records.

The best way to show this would be to give timing results for a practical example, ideally a) including a comparison to a C implementation for reference and b) profiling to show that the overhead of creating temporary objects is causing a significant slowdown (as opposed to either I/O or the actual string search algorithm).
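Something along these lines, for example (the setup data is synthetic and made up, and I have not run these, so no numbers attached):

$ python -m timeit --setup "data = (b'A' * 31 + b'\n') * 4096" \
    "pos = -1" \
    "for _ in range(4000): pos = data.find(b'\n', pos + 1)"
$ python -m timeit --setup "data = (b'A' * 31 + b'\n') * 4096" \
    "data.split(b'\n', 4000)"

The first command times a Python-level find loop locating the 4000th newline in a ~128 KiB block; the second times the split-based alternative with maxsplit.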

The idea seems plausible but, so far, not compelling to others. I think this is partly because defined-format strings often have specialized handlers, such as for csv, and partly because it is unclear how often people in general will want the position of the nth occurrence of a substring and nothing else.

Do you know of any places where the proposal would make a clear improvement in the code?

Do note that additions have a non-zero continuing cost for education, maintenance, and execution, so we try to balance anticipated but sometimes hard to estimate costs and benefits.


Do you know of any places where the proposal would make a clear improvement in the code?

I can think of one or two places in my code. Having said that, I looked through the stdlib to see if there was any use case for this, and I couldn’t find any.

The best way to show this would be to give timing results for a practical example, ideally a) including a comparison to a C implementation for reference and b) profiling to show that the overhead of creating temporary objects is causing a significant slowdown (as opposed to either I/O or the actual string search algorithm).

I did a comparison between find and split and found that split is actually more useful and faster for most use cases where the string is reasonably small. On large strings (or binary data) find is of course faster, but that is already more of a fringe use case and wanting the n-th occurrence rather than the first or last even more so.

Do note that additions have a non-zero continuing cost for education, maintenance, and execution, so we try to balance anticipated but sometimes hard to estimate costs and benefits.

Yes, my primary gripe with this would actually be consistency: list also has an index method with the same signature, so to be fully consistent it would also need this count keyword, or there would be an inconsistency.

So all in all, not worth it to be added indeed. Thank you all for the feedback.