Naive sub string search algorithm

I am writing a naive string search algorithm where my function takes two arguments, a longer string, as well as a smaller string. I am trying to get the funciton to return the number of times (count total) that the the sub string appears in the longer string. In my best effort script (below), I am using two while loops. When I execute the function, the loop runs perpetually. I’ve tried repositioning the return operatator farther in and farther out and I can’t get it working.

For starters, I came up with this function:

def string_seek(long_string, sub_string):
    if sub_string in long_string:
        return True
    return False

When I run the script and pass in variables like this: string_seek("wowomgzomg","omg"), Python returns True. So that works but I need something which would return 2 because “omg” appears twice inside “wowomgzomg”. Having said that, here is my best attempt at completing that task:

def string_seek(long, short):
    i = 0
    j = 0
    count = 0
    while i < len(long):
        i += 1
        while j < len(short):
            if short[j] != long[i+j]:
                break
            if j == len(short)-1:
                count += 1
                continue
        return count

No dice.

I tried re-writing the function this time with for loops:

def string_seek(long, short):
    i = 0
    j = 0
    count = 0
    for index, i in enumerate(long):
        index += 1
        for j in short:
            if short[j] != long[i+j]:
                break
            if j == len(short)-1:
                count += 1                
    return count

Also not where I want to be.

What might you people recommend I try next? Comments and advice welcome.

Hi !

I am trying to get the function to return the number of times (count total) that the the sub string appears in the longer string.

I would recommend you to take a look at the re.findall function (re — Regular expression operations — Python 3.12.1 documentation), that does exactly what you’re trying to achieve ! Example:

import re
len(re.findall("omg", "wowomgzomg"))
# returns 2

If you want to keep going the “naive” way, here’s one very basic option:

l = len(sub_string)
count = 0
for i in range(len(long_string) - l + 1):
    s = long_string[i: i+l]
    count += int(s == sub_string)

There is an ambiguity in “the number of times that a substring appears in the longer string”: should overlapping matches also be counted or not? For instance, if your needle is “OMGOMG” and your haystack is “WOMGOMGOMG” should the function return 1 or 2?
The easiest way to find overlapping matches, is to use the regex library:

import regex
regex.findall("OMGOMG", "WOMGOMGOMG", overlapped=True)

The regex library is a plug-in replacement for re. It supports everything that re supports, plus a lot more. And it is faster. (Even the official Python re docs refer to it.)

1 Like

Are you using this to learn about algorithms and coding or do you just need a workable solution?

It’s not necessary to reinvent the wheel. The count method of the string gives the number of non-overlapping matches. If you want overlapping matches, there are many possible approaches that use existing tools to simplify the calculation. See:

Nothing in this code could cause j to increase, so the inner loop is infinite.

If it did exit, the outer loop could only run once, because return count is inside that loop and is unconditional.

When you write for j in short for the inner loop, this means that each time through the loop, j will be an element of short (i.e., single-character string), not an integer index. To have an index you’d need to use the same sort of trick as you did with the outer loop. See also:

This is a very important distinction. I am writing a naive sub string search function to learn algorithmic design. While it is interesting to see other people share concise workable solutions using regex like Antoine and @hansgeunsmeyer, I am more interested in learning algorithmic design.

This is elegant. It is simple, nearly flat, explicit, and readable. So it ticks quite a few boxes outlined in Tim Peters’ Zen of Python. The only thing throwing me off in this implementation of a naive substring search algorithm is the use of single characters as variable names. For a beginner like me, I struggle with attributing semantic meanings to l, i, and s. So I attempted to refactor this same algorithm but this time with more meaningful variable names. Here it is:

def string_seek(long_string, sub_string):
    sub_length = len(sub_string)
    count = 0
    for item in range(len(long_string) - sub_length + 1):
        s = long_string[item: item+sub_length]
        count += int(s == sub_string)
    return count

print(string_seek("wowomgzomgomg","omg"))

This algorithm works. The only remaining issue is where the s variable is declared. I don’t know what s is supposed to stand for. It can’t be for string because there are more than one strings at play here. @WeisLeDocto : How might you (or anyone else reading this) refactor that variable to give it some more semantic meaning to better convey what is going on there? Without clarification on what s stands for, I don’t really understand fully why this algorithm works so well.

@kknechtel : Thank you for sharing these SO answers. I found one script I really liked:

def count_substring(main_string,sub_string):
    main_len = len(main_string)
    sub_len = len(sub_string)
    iteration = 0
    counter = 0
    while iteration < main_len:
        if main_string[iteration] == sub_string[0]:
            if main_string[iteration:iteration+sub_len] == sub_string:
                counter += 1
        iteration += 1

    return counter

print(count_substring("ABCDCDCCDC","CDC")) # returns 3

At first I struggled with understanding why this one works the way that it does as well. But after playing around with it in my debugger, what threw me off originally was that on the third iteration of the while loop, the iteration value is 2 and the conditional is tested and evaluates to True for the first time and so Python descends down to the next line (also a conditional) where the main string is sliced by 2 to 2+3 which interpolates to CDC. Therefore it matches “CDC” (the sub_string) and so increments the counter. That is clear now. I have included this explanation here as a note for my future reference.

While writing an actual app, using the Python standard library and built-ins including count method makes sense and this is what I would do it most circumsances. But in this case I am taking a course on Udemy about algorithmic software design and writing a sub string search algorithm is the challenge at hand. I’ve implemented the count method (reinventing the while) for the sake of programming in general and for fun.

Good to hear that my answer could be of help to you!

How might you (or anyone else reading this) refactor that variable to give it some more semantic meaning to better convey what is going on there?

Indeed, I wrote the code quickly and did not bother finding meaningful names to my variables. That’s not a good practice and I avoid it in production code. Here’s how I would rename the variables:

def string_seek(string, pattern):
    pattern_length = len(pattern)
    count = 0
    for index in range(len(string) - pattern_length + 1):
        sub_string = string[index: index+pattern_length]
        count += int(sub_string == pattern)
    return count

print(string_seek("wowomgzomgomg","omg"))

string and pattern are the names used in the re module, so they’re directly understandable to anyone who has used re before. I think it is also quite clear in natural language what searching for a pattern in a string would mean.
I would replace item with index, because this variable represents an index.
And replace s with sub_string, because this variable is a subset of the string variable. In the example, during the first loop its value is 'wow', then in the second loop 'owo', then 'wom', etc.

In my editor I implemented string search using an algorithm that has a nice twist to it.

When looking for a “needle” in “a hay stack with needs and needy before needle appears” it first looks for the “n” to match then it check the last character (the “e”) before the ones in between. This was found to speed up search becuase its common to have prefixes match but less common for last char to match.

I C/C++ this is a measurable improvement, not sure it how it will benchmark in python if you do not do single character matches.

Might not be worth it unless you have very long common prefixes. A little test, where the “optimized” version takes about twice as much time as the other (Attempt This Online!):

from timeit import repeat

setup = '''
s = 'a' * 1000
t = 'a' * 999 + 'b'
'''
codes = [
    's == t',
    's[-1] == t[-1] and s == t',
]

for c in codes:
    t = min(repeat(c, setup))
    print(t, c)

Sample output:

0.04607243463397026 s == t
0.08840696513652802 s[-1] == t[-1] and s == t

With ten times longer strings, the optimized version is about twice as fast as the other. At length 4000, both are about equally fast.