Removing all but one instance of a phrase from a string

I’m trying to code a function that takes a sentence and a phrase as parameters and eliminates all but the first instance of the phrase in the sentence.

Note, by sentence I mean a string comprised of one or more words, not a list of words.

At present my code is as follows:

def delete_repeated_phrase(sentence, phrase):

    # first count the number of instances of the phrase in the sentence; if there are no occurrences, return the original sentence
    if sentence.count(phrase) == 0:
        return sentence

    # get phrase length
    phrase_len = len(phrase)
        
    # reverse the string because we want to remove the phrase from end of sentence to start of sentence
    reversed_sentence = "".join(reversed(sentence))
    reversed_phrase = "".join(reversed(phrase))
    
    # while there remains more than 1 instance of phrase in sentence
    while reversed_sentence.count(reversed_phrase) > 1:
        
        # slice string to remove the first occurrence of the phrase
        index = reversed_sentence.find(reversed_phrase)

        if index == 0:
            reversed_sentence = reversed_sentence[phrase_len + 1:]
           
        else:
            reversed_sentence = reversed_sentence[0:index] + reversed_sentence[index + 1 + phrase_len:]

    sentence = "".join(reversed(reversed_sentence))
    return sentence

Usage:

the_sentence = 'the cat the cat the cat the cat the cat the cat the cat'
seek_phrase = 'the cat'
print(delete_repeated_phrase(the_sentence, seek_phrase))

When I originally posted I missed the fact it was doing exactly what I intended - removing all but one instance of the phrase.

Having said that I’ve still learned a good few things from this post and the replies that have followed, so thanks to everyone that has responded.

I assume that you split the sentence on SPACE.
That means you have a list of words.
E.g. ['a', 'test', 'of', 'my', 'code']
When you look for the phrase “my code”, it contains a SPACE.
But the list has that as two elements, not one.

Does that help?

Hi, I’ve not split it at all, I’ve just used str.find() and string slicing.

Removing from the end is a good idea, but works better for items from lists you iterate over.
You want to remove all occurrences of the phrase except for the first one. That means that after you’ve found the first occurrence you can remove all other occurrences.
You know that you can find in a string from a specific starting point, right?
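
To illustrate the hint above: str.find accepts an optional start index, so a search can resume just past the previous match. The sketch below collects the positions of all non-overlapping matches. The helper name find_all is made up for this example and is not part of the code in this thread.

```python
def find_all(sentence, phrase):
    """Return the start index of every non-overlapping match (hypothetical helper)."""
    if not phrase:
        # An empty phrase matches everywhere; treat it as "no matches" here.
        return []
    positions = []
    start = sentence.find(phrase)
    while start != -1:
        positions.append(start)
        # Resume the search just past the match that was found.
        start = sentence.find(phrase, start + len(phrase))
    return positions


print(find_all('the cat the cat the cat', 'the cat'))  # [0, 8, 16]
```

Every position after the first one in that list is an occurrence that could be removed.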

Yes, but forgive my ignorance - I’m not sure how that helps me solve the problem I’m faced with.

'help Hello help hello hello help'.find('hello help')

returns 22

'help Hello help hello hello help'.find('hello helps')

returns -1

'help [Hello help] hello hello help'.find('[Hello help]')

returns 5

So str.find() does correctly find substrings that include spaces within another string. That being the case I’m struggling to see why my code isn’t finding and eliminating substrings containing spaces.

In a generalised case I want to solve two problems (with two different functions).

  1. remove all but the first occurrence of a phrase within a sentence
  2. remove only the last occurrence of a phrase within a sentence

in both cases accepting that the phrase may be ‘a string of words’ or a ‘[bracketed string of words that_may_or_may_not_contain_spaces-or-other-delimiters]’
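
For the second problem, one possible approach is str.rfind, which searches from the right. This is only a sketch: the function name remove_last_occurrence is an assumption, and it removes the phrase alone, leaving any surrounding spaces in place.

```python
def remove_last_occurrence(sentence, phrase):
    """Remove only the last occurrence of phrase (hypothetical helper)."""
    if not phrase:
        return sentence
    last = sentence.rfind(phrase)  # -1 if the phrase does not occur
    if last == -1:
        return sentence
    # Keep everything before the match and everything after it.
    return sentence[:last] + sentence[last + len(phrase):]


print(repr(remove_last_occurrence('a [b c] d [b c]', '[b c]')))  # 'a [b c] d '
```

Because the phrase is treated as a plain substring, brackets, spaces and other delimiters inside it need no special handling.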

Can you please provide an example that doesn’t work? Otherwise it becomes a lot harder for us to debug.

Sure. Call the function in my first post as follows:

string = 'help Hello help hello hello help'
seek_phrase =  'hello help'
print(delete_repeated_phrase( string, seek_phrase))

it’ll return:

‘help Hello help hello hello help’

which is the string originally provided, despite there being a match.

Yes, exactly, there is one match, at the very end. find is case sensitive, so Hello help won’t be found.
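
A quick check of the case sensitivity, using the string from the example. The last line assumes a case-insensitive search is wanted, which the thread does not ask for. It is shown only for comparison.

```python
text = 'help Hello help hello hello help'

lower_match = text.find('hello help')   # only the all-lowercase occurrence matches
upper_match = text.find('Hello help')   # the capitalised occurrence is a different substring
# Assumption: fold the case of both strings if case should be ignored.
folded_match = text.casefold().find('hello help'.casefold())

print(lower_match, upper_match, folded_match)  # 22 5 5
```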

Yip I get that, however I just realised my own stupidity. The code is written so as to eliminate all but the first instance of phrase, hence it returning the original string because it contains only one instance of the phrase. In other words, it’s working as intended. Doh! Sorry for wasting everyone’s time.

Looking more closely at the code, the index == 0 branch is redundant: when str.find() returns 0, reversed_sentence[0:index] is just the empty string, so the general slice gives the same result. The code could therefore be revised as:

def delete_repeated_phrase(sentence, phrase):

    # first count the number of instances of the phrase in the sentence; if there are no occurrences, return the original sentence
    if sentence.count(phrase) == 0:
        return sentence

    # get phrase length
    phrase_len = len(phrase)
        
    # reverse the string because we want to remove the phrase from end of sentence to start of sentence
    reversed_sentence = "".join(reversed(sentence))
    reversed_phrase = "".join(reversed(phrase))
    
    # while there remains more than 1 instance of phrase in sentence
    while reversed_sentence.count(reversed_phrase) > 1:
        
        # slice string to remove the first occurrence of the phrase
        index = reversed_sentence.find(reversed_phrase)

        reversed_sentence = reversed_sentence[0:index] + reversed_sentence[index + 1 + phrase_len:]


    new_sentence = "".join(reversed(reversed_sentence))
    return new_sentence

Where did you learn to reverse a string like that? That’s very inefficient. Better use sentence[::-1] like everybody else.

That’s also misleading, I thought sentence was a list or so based on that code. You really should add a usage example in the first post, not just later.

The hint is in my username. Self taught on the fly, no formal training. Thanks for the tip.

Understood, will do so in future. Thank you.

I am on mobile right now, so I can’t write the code properly, but a way shorter and faster solution is going to be to not reverse the string at all. Just find the first match, and use replace on the part of the string after it to replace all later matches with the empty string.
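
A minimal sketch of that idea. str.replace has no start argument, so this version slices the string just past the first match and calls replace on the remainder. The name keep_first_occurrence is made up for illustration.

```python
def keep_first_occurrence(sentence, phrase):
    """Remove every occurrence of phrase after the first one (hypothetical helper)."""
    if not phrase:
        return sentence
    first = sentence.find(phrase)
    if first == -1:
        return sentence
    split_at = first + len(phrase)
    head, tail = sentence[:split_at], sentence[split_at:]
    # Only the text after the first match is searched, so the first match stays whole.
    return head + tail.replace(phrase, '')


print(repr(keep_first_occurrence('the cat the cat the cat', 'the cat')))  # 'the cat  '
```

Unlike the original function, this leaves the separating spaces behind. Replacing ' ' + phrase in the tail instead would be one way to drop them, assuming every later occurrence is preceded by a single space.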

Just to be sure CPython doesn’t optimize that (it could, since str.join could recognize a string iterator), I benchmarked:

sentence = 'help Hello help hello hello help'
1.65 μs  "".join(reversed(sentence))
0.20 μs  sentence[::-1]

sentence = 'help Hello help hello hello help' * 1000
 938.52 μs  "".join(reversed(sentence))
  48.56 μs  sentence[::-1]

Benchmark script:

from timeit import repeat

setup= '''
sentence = 'help Hello help hello hello help' * 1000
'''

codes = '''\
"".join(reversed(sentence))
sentence[::-1]
'''.splitlines()

for code in codes:
    t = min(repeat(code, setup, number=10**2)) * 10**4
    print(f'{t:7.2f} μs ', code)

What about this time? I just wasted time being confused by that, and others still might be.

And (peak) memory usage:

sentence = 'help Hello help hello hello help' * 10**6
288.0 MB  "".join(reversed(sentence))
 32.0 MB  sentence[::-1]

Script:

import tracemalloc as tm

sentence = 'help Hello help hello hello help' * 10**6

codes = '''\
"".join(reversed(sentence))
sentence[::-1]
'''.splitlines()

for code in codes * 2:
    tm.start()
    eval(code)
    mem = tm.get_traced_memory()[1]
    tm.stop()
    print(f'{mem/1e6:5.1f} MB ', code)


So am I. No reason not to code :-). There are plenty of sites for that.

If you want to code the algorithm I described, go ahead. I am not going to do that till I get back home. “Mobile” is in fact a summary of multiple reasons why I am not going to code it right now.

Which operation takes precedence between removing occurrences and preserving the first occurrence?

For example, when phrase = 'aa' and sentence = 'aaa',

  1. do you want to get 'a' as the result of removing the second occurrence, even though it partially removed the first occurrence,
  2. or do you want to get 'aa' as the result of removing the second occurrence while preserving the first occurrence?

string = 'aaa aaa'
seek_phrase = 'aa'
print(string.find(seek_phrase))

returns 0, which is as intended. Same outcome if string = 'aaa'.

What I meant is: when removing all but the first instance of the phrase, does the removing happen after the first instance has been preserved, or does one remove first and then the first instance gets preserved (whole)?

The way I’ve coded it, the first instance is always left untouched, as the removal only happens

while reversed_sentence.count(reversed_phrase) > 1:
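
For the 'aaa' case this follows from how str.count works: it counts non-overlapping occurrences only, so the loop condition is never true there. A small check:

```python
# str.count scans left to right and does not count overlapping matches.
single = 'aaa'.count('aa')    # the second 'aa' would overlap the first
double = 'aaaa'.count('aa')   # two matches side by side, no overlap

print(single, double)  # 1 2
```

This covers only the overlapping case. When two occurrences sit side by side with nothing between them, as in 'aaaa', the loop does run, and the + 1 in the slice, which assumes a separating space, removes one extra character.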