Str.count to allow counting of overlapping substrings

sr-murthy · June 15, 2025, 5:52pm

Substring counting could surely be made easier by allowing the existing str.count method to support counting of overlapping substrings, via an optional arg. e.g. overlap=True. This would mean that, for example, the problem of counting 'AAAA' in a string such as 'AAAAAAC' would be extremely simple:

>>> s = 'AAAAAAC'
>>> s.count('AAAA', overlap=True)
3

rather than having to write your own solution:

sum(1 for i in range(n - k + 1) if s[i: i + k] == b)

where s is a string, b the substring to be counted, and n and k are the lengths of s and b respectively.

To be clear: my proposal is that <str>.count(<substr>, overlapping=True) would result in a count of all occurrences of <substr> (non-overlapping and overlapping) in the given string. So that 'AAABAA'.count('AA', overlapping=True) will produce a value of 3 (2 overlapping occurrences at the start, and the one at the end of the string), whereas currently count will produce 2. But count without overlapping=True (overlapping=False, the default) will continue to work the same way, so there would be no side effects.

This problem comes up naturally in many contexts, and I’m aware of previous posters who’ve bumped into this problem when working on their own problems.

I think it would be a relatively simple but useful extension of str.count.

dg-pb · June 15, 2025, 6:49pm

Simple? maybe. Useful? Uncertain. Is this causing bottlenecks or is it simply for convenience?

I noted this and will consider it together with other potential extensions for string search when the time comes.

By the way, you could do it a bit better.
Especially if you are dealing with long strings with sparse hits.

def count_nonoverlaping(s, b):
    c = 0
    i = 0
    while 1:
        i = s.find(b, i)
        if i == -1:
            break
        c += 1
        i += 1
    return c
s = 'AAAAAAC'
count_nonoverlaping(s, 'AAAA')

sr-murthy · June 15, 2025, 7:02pm

I noted this and will consider it together with other potential extensions for string search when the time comes.

Thanks, it’s just an idea for consideration. I don’t claim that counting overlapping substrings is the most pressing problem, but it certainly comes up in different domains (bioinformatics, NLP, CS, information theory, to think of a few), and surely a built-in, standard solution is preferable to each person writing their own code. I see at least some prior discussion on this.

dg-pb · June 15, 2025, 7:21pm

In str.find/count domain, it rarely something is if it can be done with re. Didn’t even think of that at first - apparently I never had such need myself.

len(re.findall(r'(?=(AAAA))', 'AAAAAAC'))

sr-murthy · June 15, 2025, 7:28pm

Sure, you can do it with re, but it seems more natural to use something like str.count(overlap=True). As you’ve mentioned str.find I wonder if you could have str.findall that gives you all starting indices of the substring (in an overlapping way): an example:

>>> 'AAACAAG'.findall('AA')
[0, 1, 4]

-1 would be returned in case of no hits. This is also worth considering, in my view, as an alternative to changing str.count.

In CS we take substrings to be any contiguous subsequence of characters of a string, so I’ve never understood why str.count was implemented to only count non-overlapping occurrences. I’m sure there is a reason for it, but in any case, it does seem natural to extend it in the way I’ve proposed.

mikeshardmind · June 15, 2025, 7:39pm

While I dont have a strong opposition to this, given the problem space, I would still suggest using a library, both for the kinds of optimizations unlikely to be made in CPython, and for having access to it today.

In particular, I’d recommend anyone doing frequent string-based operations, especially those in bioinformatics, look into use of stringzilla. It’s heavily optimized for basically this specific use case.

If it were going to be added, I do think it might be better to have functions in the string module than expand the method API of the builtin str type

dg-pb · June 15, 2025, 7:40pm

Not likely. People would get confused why re.findall has such inconsistency with str.findall.

I think the str.count(..., overlapping=True) would be ok if this turned out to be useful enough. But this would be weird because this would be 2 steps ahead - there is not str.findall/finditer, but there is a count which counts overlapping matches. So it would be a bit strange that substring counting is so much more advanced than string search on which substring counting is based on.

So I wouldn’t hold my breath. It is ok candidate for extension, but it will have to wait for its time. Also, maybe others will chip in in the meantime - it would be helpful to gauge if there is any actual need for this.

sr-murthy · June 15, 2025, 7:42pm

thanks, stringzilla looks fast, which is important.

sr-murthy · June 15, 2025, 7:45pm

Thanks for your consideration. Would I be able to submit a PR? No problem if this is typically done internally, but it would be nice to.

Stefan2 · June 15, 2025, 9:46pm

You forgot to mention another reason for it: soooo many people using

for i in range(n - k)

instead of the correct

for i in range(n - k + 1)

sr-murthy · June 15, 2025, 10:07pm

Thanks, typo corrected. The iterative find solution is much better, of course.

dg-pb · June 16, 2025, 7:42am

In theory you can do whatever you want.

But as I said, currently it isn’t a good time for such and you would be wasting your time. The first item that needs to go in is https://github.com/python/cpython/pull/120025 .

This will be a certain completion of underlying search engine and prepare a bit better ground for further extensions, such as str.finditer/findall. PRs in this area before this is merged would introduce unnecessary complications/conflicts/etc. So it is simply not practical.

But it might take some time as this is not priority (same as this proposal) and also it is a bit difficult to find willing reviewers.

So my best advice would be patience. If this is something that is desirable and the time is right I am sure you could do a PR - just have notifications turned on for this thread and double check once in 6 months / a year. It might take some time and also there is no certainty that this will go through in the end.

sr-murthy · June 16, 2025, 8:07am

Not a problem. I would only submit a PR if there was a sponsor, otherwise not, and with prior agreement.

jsbueno · June 17, 2025, 2:13pm

Taking the considerations on the thread about this being somewat a niche input,
on the other hand, it is an obvious nicety improvement to the current API, and one that is a “write once” thing, not increasing the burden of maintenance of evolving - (to the contrary, even if more optimal algorithms for non-overlapping count are subsequently found, those would be a nice low-hanging fruit for first-time contributors to tackle)

I’d be a full +1 for this.

(And, actually, there is no little interest in this - I have more than 2000 answers written a Q.A. platform, and my one line answer about counting substring occurrences in Python is the most voted one - by far. And in there, about 10% of the readers are looking for a overlapping-string-counting capability, as can be inferred from the voting/comments)

Stefan2 · June 17, 2025, 3:43pm

Do you mean for an overlapping method? You’re talking about this, right?

jsbueno · June 17, 2025, 4:51pm

yes and yes.