Naive sub string search algorithm

enoren5 · January 13, 2024, 1:22pm

I am writing a naive string search algorithm where my function takes two arguments, a longer string, as well as a smaller string. I am trying to get the funciton to return the number of times (count total) that the the sub string appears in the longer string. In my best effort script (below), I am using two while loops. When I execute the function, the loop runs perpetually. I’ve tried repositioning the return operatator farther in and farther out and I can’t get it working.

For starters, I came up with this function:

def string_seek(long_string, sub_string):
    if sub_string in long_string:
        return True
    return False

When I run the script and pass in variables like this: string_seek("wowomgzomg","omg"), Python returns True. So that works but I need something which would return 2 because “omg” appears twice inside “wowomgzomg”. Having said that, here is my best attempt at completing that task:

def string_seek(long, short):
    i = 0
    j = 0
    count = 0
    while i < len(long):
        i += 1
        while j < len(short):
            if short[j] != long[i+j]:
                break
            if j == len(short)-1:
                count += 1
                continue
        return count

No dice.

I tried re-writing the function this time with for loops:

def string_seek(long, short):
    i = 0
    j = 0
    count = 0
    for index, i in enumerate(long):
        index += 1
        for j in short:
            if short[j] != long[i+j]:
                break
            if j == len(short)-1:
                count += 1                
    return count

Also not where I want to be.

What might you people recommend I try next? Comments and advice welcome.

WeisLeDocto · January 13, 2024, 1:46pm

Hi !

I am trying to get the function to return the number of times (count total) that the the sub string appears in the longer string.

I would recommend you to take a look at the re.findall function (re — Regular expression operations — Python 3.12.1 documentation), that does exactly what you’re trying to achieve ! Example:

import re
len(re.findall("omg", "wowomgzomg"))
# returns 2

If you want to keep going the “naive” way, here’s one very basic option:

l = len(sub_string)
count = 0
for i in range(len(long_string) - l + 1):
    s = long_string[i: i+l]
    count += int(s == sub_string)

hansgeunsmeyer · January 13, 2024, 2:46pm

There is an ambiguity in “the number of times that a substring appears in the longer string”: should overlapping matches also be counted or not? For instance, if your needle is “OMGOMG” and your haystack is “WOMGOMGOMG” should the function return 1 or 2?
The easiest way to find overlapping matches, is to use the regex library:

import regex
regex.findall("OMGOMG", "WOMGOMGOMG", overlapped=True)

The regex library is a plug-in replacement for re. It supports everything that re supports, plus a lot more. And it is faster. (Even the official Python re docs refer to it.)

barry-scott · January 13, 2024, 9:41pm

Are you using this to learn about algorithms and coding or do you just need a workable solution?

kknechtel · January 14, 2024, 6:47pm

It’s not necessary to reinvent the wheel. The count method of the string gives the number of non-overlapping matches. If you want overlapping matches, there are many possible approaches that use existing tools to simplify the calculation. See:

Nothing in this code could cause j to increase, so the inner loop is infinite.

If it did exit, the outer loop could only run once, because return count is inside that loop and is unconditional.

When you write for j in short for the inner loop, this means that each time through the loop, j will be an element of short (i.e., single-character string), not an integer index. To have an index you’d need to use the same sort of trick as you did with the outer loop. See also:

enoren5 · January 16, 2024, 12:35pm

This is a very important distinction. I am writing a naive sub string search function to learn algorithmic design. While it is interesting to see other people share concise workable solutions using regex like Antoine and @hansgeunsmeyer, I am more interested in learning algorithmic design.

Antoine Weisrock:

If you want to keep going the “naive” way, here’s one very basic option:
l = len(sub_string)
count = 0
for i in range(len(long_string) - l + 1):
    s = long_string[i: i+l]
    count += int(s == sub_string)

This is elegant. It is simple, nearly flat, explicit, and readable. So it ticks quite a few boxes outlined in Tim Peters’ Zen of Python. The only thing throwing me off in this implementation of a naive substring search algorithm is the use of single characters as variable names. For a beginner like me, I struggle with attributing semantic meanings to l, i, and s. So I attempted to refactor this same algorithm but this time with more meaningful variable names. Here it is:

def string_seek(long_string, sub_string):
    sub_length = len(sub_string)
    count = 0
    for item in range(len(long_string) - sub_length + 1):
        s = long_string[item: item+sub_length]
        count += int(s == sub_string)
    return count

print(string_seek("wowomgzomgomg","omg"))

This algorithm works. The only remaining issue is where the s variable is declared. I don’t know what s is supposed to stand for. It can’t be for string because there are more than one strings at play here. @WeisLeDocto : How might you (or anyone else reading this) refactor that variable to give it some more semantic meaning to better convey what is going on there? Without clarification on what s stands for, I don’t really understand fully why this algorithm works so well.

@kknechtel : Thank you for sharing these SO answers. I found one script I really liked:

def count_substring(main_string,sub_string):
    main_len = len(main_string)
    sub_len = len(sub_string)
    iteration = 0
    counter = 0
    while iteration < main_len:
        if main_string[iteration] == sub_string[0]:
            if main_string[iteration:iteration+sub_len] == sub_string:
                counter += 1
        iteration += 1

    return counter

print(count_substring("ABCDCDCCDC","CDC")) # returns 3

At first I struggled with understanding why this one works the way that it does as well. But after playing around with it in my debugger, what threw me off originally was that on the third iteration of the while loop, the iteration value is 2 and the conditional is tested and evaluates to True for the first time and so Python descends down to the next line (also a conditional) where the main string is sliced by 2 to 2+3 which interpolates to CDC. Therefore it matches “CDC” (the sub_string) and so increments the counter. That is clear now. I have included this explanation here as a note for my future reference.

enoren5 · January 16, 2024, 12:48pm

While writing an actual app, using the Python standard library and built-ins including count method makes sense and this is what I would do it most circumsances. But in this case I am taking a course on Udemy about algorithmic software design and writing a sub string search algorithm is the challenge at hand. I’ve implemented the count method (reinventing the while) for the sake of programming in general and for fun.

WeisLeDocto · January 16, 2024, 1:26pm

Good to hear that my answer could be of help to you!

How might you (or anyone else reading this) refactor that variable to give it some more semantic meaning to better convey what is going on there?

Indeed, I wrote the code quickly and did not bother finding meaningful names to my variables. That’s not a good practice and I avoid it in production code. Here’s how I would rename the variables:

def string_seek(string, pattern):
    pattern_length = len(pattern)
    count = 0
    for index in range(len(string) - pattern_length + 1):
        sub_string = string[index: index+pattern_length]
        count += int(sub_string == pattern)
    return count

print(string_seek("wowomgzomgomg","omg"))

string and pattern are the names used in the re module, so they’re directly understandable to anyone who has used re before. I think it is also quite clear in natural language what searching for a pattern in a string would mean.
I would replace item with index, because this variable represents an index.
And replace s with sub_string, because this variable is a subset of the string variable. In the example, during the first loop its value is 'wow', then in the second loop 'owo', then 'wom', etc.

barry-scott · January 16, 2024, 5:01pm

In my editor I implemented string search using an algorithm that has a nice twist to it.

When looking for a “needle” in “a hay stack with needs and needy before needle appears” it first looks for the “n” to match then it check the last character (the “e”) before the ones in between. This was found to speed up search becuase its common to have prefixes match but less common for last char to match.

I C/C++ this is a measurable improvement, not sure it how it will benchmark in python if you do not do single character matches.

Stefan2 · January 16, 2024, 6:22pm

Might not be worth it unless you have very long common prefixes. A little test, where the “optimized” version takes about twice as much time as the other (Attempt This Online!):

from timeit import repeat

setup = '''
s = 'a' * 1000
t = 'a' * 999 + 'b'
'''
codes = [
    's == t',
    's[-1] == t[-1] and s == t',
]

for c in codes:
    t = min(repeat(c, setup))
    print(t, c)

Sample output:

0.04607243463397026 s == t
0.08840696513652802 s[-1] == t[-1] and s == t

With ten times longer strings, the optimized version is about twice as fast as the other. At length 4000, both are about equally fast.

Topic		Replies	Views
Needing help with a while loop function call Python Help	6	1932	November 11, 2021
Writing a Linear Search algorithm - malformed string representation Python Help	16	1052	February 18, 2024
Python string find() examples Python Help	1	702	October 17, 2022
Sliding window question -- how to optimize the loop usage Python Help	4	1420	April 1, 2022
Basic binary search algorithm - wrangling with while loops and their conditions (x2) Python Help	27	1464	February 21, 2024

Naive sub string search algorithm

Related Topics