Substring counting could surely be made easier by allowing the existing str.count method to support counting of overlapping substrings, via an optional arg. e.g. overlap=True. This would mean that, for example, the problem of counting 'AAAA' in a string such as 'AAAAAAC' would be extremely simple:
>>> s = 'AAAAAAC'
>>> s.count('AAAA', overlap=True)
3
rather than having to write your own solution:
sum(1 for i in range(n - k + 1) if s[i: i + k] == b)
where s is a string, b the substring to be counted, and n and k are the lengths of s and b respectively.
To be clear: my proposal is that <str>.count(<substr>, overlapping=True) would result in a count of all occurrences of <substr> (non-overlapping and overlapping) in the given string. So that 'AAABAA'.count('AA', overlapping=True) will produce a value of 3 (2 overlapping occurrences at the start, and the one at the end of the string), whereas currently count will produce 2. But count without overlapping=True (overlapping=False, the default) will continue to work the same way, so there would be no side effects.
This problem comes up naturally in many contexts, and I’m aware of previous posters who’ve bumped into this problem when working on their own problems.
I think it would be a relatively simple but useful extension of str.count.
Simple? maybe. Useful? Uncertain. Is this causing bottlenecks or is it simply for convenience?
I noted this and will consider it together with other potential extensions for string search when the time comes.
By the way, you could do it a bit better.
Especially if you are dealing with long strings with sparse hits.
def count_nonoverlaping(s, b):
c = 0
i = 0
while 1:
i = s.find(b, i)
if i == -1:
break
c += 1
i += 1
return c
s = 'AAAAAAC'
count_nonoverlaping(s, 'AAAA')
I noted this and will consider it together with other potential extensions for string search when the time comes.
Thanks, it’s just an idea for consideration. I don’t claim that counting overlapping substrings is the most pressing problem, but it certainly comes up in different domains (bioinformatics, NLP, CS, information theory, to think of a few), and surely a built-in, standard solution is preferable to each person writing their own code. I see at least some prior discussion on this.
In str.find/count domain, it rarely something is if it can be done with re. Didn’t even think of that at first - apparently I never had such need myself.
Sure, you can do it with re, but it seems more natural to use something like str.count(overlap=True). As you’ve mentioned str.find I wonder if you could have str.findall that gives you all starting indices of the substring (in an overlapping way): an example:
>>> 'AAACAAG'.findall('AA')
[0, 1, 4]
-1 would be returned in case of no hits. This is also worth considering, in my view, as an alternative to changing str.count.
In CS we take substrings to be any contiguous subsequence of characters of a string, so I’ve never understood why str.count was implemented to only count non-overlapping occurrences. I’m sure there is a reason for it, but in any case, it does seem natural to extend it in the way I’ve proposed.
While I dont have a strong opposition to this, given the problem space, I would still suggest using a library, both for the kinds of optimizations unlikely to be made in CPython, and for having access to it today.
In particular, I’d recommend anyone doing frequent string-based operations, especially those in bioinformatics, look into use of stringzilla. It’s heavily optimized for basically this specific use case.
If it were going to be added, I do think it might be better to have functions in the string module than expand the method API of the builtin str type
Not likely. People would get confused why re.findall has such inconsistency with str.findall.
I think the str.count(..., overlapping=True) would be ok if this turned out to be useful enough. But this would be weird because this would be 2 steps ahead - there is not str.findall/finditer, but there is a count which counts overlapping matches. So it would be a bit strange that substring counting is so much more advanced than string search on which substring counting is based on.
So I wouldn’t hold my breath. It is ok candidate for extension, but it will have to wait for its time. Also, maybe others will chip in in the meantime - it would be helpful to gauge if there is any actual need for this.
This will be a certain completion of underlying search engine and prepare a bit better ground for further extensions, such as str.finditer/findall. PRs in this area before this is merged would introduce unnecessary complications/conflicts/etc. So it is simply not practical.
But it might take some time as this is not priority (same as this proposal) and also it is a bit difficult to find willing reviewers.
So my best advice would be patience. If this is something that is desirable and the time is right I am sure you could do a PR - just have notifications turned on for this thread and double check once in 6 months / a year. It might take some time and also there is no certainty that this will go through in the end.
Taking the considerations on the thread about this being somewat a niche input,
on the other hand, it is an obvious nicety improvement to the current API, and one that is a “write once” thing, not increasing the burden of maintenance of evolving - (to the contrary, even if more optimal algorithms for non-overlapping count are subsequently found, those would be a nice low-hanging fruit for first-time contributors to tackle)
I’d be a full +1 for this.
(And, actually, there is no little interest in this - I have more than 2000 answers written a Q.A. platform, and my one line answer about counting substring occurrences in Python is the most voted one - by far. And in there, about 10% of the readers are looking for a overlapping-string-counting capability, as can be inferred from the voting/comments)