Searching for an English word in a string without delimiters

Hello,
I'm a beginner and I can't find the solution.
I need to find out whether a string without delimiters contains an English word.
For example, I have the string "BONJOURHELLOCIAO" and I want to extract the word HELLO.
I tried the code below, but it doesn't work because my string has no delimiters, and I don't know how to do a "contains" check.

import enchant

def contains_english_word(text):
    # create a dictionary for the English language
    english_dict = enchant.Dict("en_US")

    # split the text into words
    words_in_text = text.split()

    # check whether any word is English
    for word in words_in_text:
        if english_dict.check(word):
            return True

    return False

if contains_english_word(resultat2):
    print("OK")
else:
    print("KO")
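For reference, here is a quick demo of why the split() approach fails here: with no whitespace in the string, split() returns the whole string as a single "word", so the dictionary check is applied to the full string:

```python
text = "BONJOURHELLOCIAO"
print(text.split())  # ['BONJOURHELLOCIAO'] -- one element, nothing was split
# The dictionary check is then run on the whole string,
# which is not itself an English word, so it returns False.
```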

Are you searching for a specific word each time, or for the existence of any word from a certain dictionary?

Yes, I am searching in an English dictionary. If I have the result "BONJOURHAPPYCIAO", I just need to know that an English word exists in it.

You have to use a Trie data structure to find the English words in the string. You could use a dict, but a Trie is more memory efficient.

  1. Create a Trie with all the English words.
  2. Here is an example (not tested):
string = "BONJOURHELLOCIAO"
for i in range(len(string)):  # start with first char
    current_word = ""
    for char in string[i:]:  # check all next chars
        current_word += char
        if trie.search(current_word):
            print(current_word)  # each hit is a dictionary word starting at position i

Hello,
Your function works (I think); I had also thought of doing it that way, but you can't build a table/dictionary with all the English words. That is our problem.

Modern English has moved towards being an analytic language; there may be over 600,000 English words. I don't think it is a performance problem if you use a dict or set. Old English was an inflectional language.

You may use a Trie or a DAWG data structure, which have O(len(word)) key lookup and very efficient memory usage.
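As a sketch of what such a Trie could look like: a minimal nested-dict version (not a memory-optimized DAWG; the '$' end-of-word marker is an arbitrary choice for illustration):

```python
def build_trie(words):
    """Build a nested-dict trie; '$' marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

trie = build_trie(["HELL", "HELLO", "CIAO"])

# Lookup is O(len(word)): one dict hop per character.
node = trie
for ch in "HELLO":
    node = node.get(ch, {})
print('$' in node)  # True -> "HELLO" is a stored word
```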

You can’t? /usr/share/dict/words would like to have a word, or rather a hundred thousand of them.

The problem of multiple pattern matching is well studied.

You don't need to match each substring, or to start again from the root of the trie when a match fails.

See the Aho-Corasick or Commentz-Walter algorithms.

You can see a Python implementation of the first one here.

The OP does not know the English word in advance. They want to know whether an English word exists in a given string.

I need to find if in a string without delimiters there is an english word.

The time complexity of the Aho-Corasick algorithm is O(n + l + z), where n is the length of the text, l is the total length of the words to be searched, and z is the number of matches. In our example it would be O(16 + 600,000 + z) for a single string search. If the string is an article or a whole book, it becomes unfeasible to run this algorithm.

Example:
If len(string) == 1,000,000, the time complexity is O(1,000,000 + 600,000 + z)

None of your claims make any sense. Take your time.

Irrelevant. That is not an assumption of the algorithms.

And they can stop the search after the first match has happened, if they want to only check if some word from the dictionary appears.

If you want to practice your asymptotic time complexity, check what the two nested for loops have to do, entering the trie from the root every time. Also, don't append to a(n immutable) str: every time, a new str needs to be created and later destroyed.

for i in range(len(string)):  # start with first char
    current_word = ""
    for char in string[i:]:
        current_word += char
        if trie.search(current_word):
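To illustrate the point: instead of calling trie.search() on a growing string, which re-walks the trie from the root every time and builds a new str per character, you can walk the trie node by node. A sketch, assuming a nested-dict trie where '$' marks the end of a word (a hypothetical layout for illustration, not the poster's actual trie class):

```python
def find_words(text, trie):
    """Yield every dictionary word in text, walking the trie node by node."""
    for i in range(len(text)):
        node = trie
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:      # no dictionary word continues along this path
                break
            if '$' in node:       # reached an end-of-word marker
                yield text[i:j + 1]

# Hypothetical tiny dictionary for the demo:
trie = {'H': {'E': {'L': {'L': {'$': True, 'O': {'$': True}}}}},
        'C': {'I': {'A': {'O': {'$': True}}}}}
print(list(find_words("BONJOURHELLOCIAO", trie)))  # ['HELL', 'HELLO', 'CIAO']
```

Each inner loop advances one node per character and stops as soon as no word can continue, so no substring is re-scanned and no intermediate strings are built except for the actual matches.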

The time complexity of my algorithm, given a string of length 16, is O(136 * S), where S is the time complexity of a Trie search (16 + 15 + ... + 1 = 136 substrings). It does not depend on the number of words to be searched, in this case the English words.

Aho-Corasick is suitable when text to search is large, and number of words to be searched is small.

That's my understanding of the algorithm's usage. Criticism is welcome; I won't take it personally. I am open to learning the algorithm better, but I couldn't make use of it even though it was implemented in C, because it was so slow for a large number of words to be searched.

This is the search phase of the Aho-Corasick implementation that you mentioned:

…its worst case is O(len(text) * len(self.words)): O(16 * 600,000)!

for i in range(len(text)):
    current_state = self.__find_next_state(current_state, text[i])
    if self.out[current_state] == 0: continue
    for j in range(len(self.words)):   # 600,000 words!
        if (self.out[current_state] & (1<<j)) > 0:
            word = self.words[j]
            result[word].append(i-len(word)+1)

@Angelus Here is the Python implementation of Aho-Corasick from above with some modifications/corrections.

"""Testing Aho-Corasick algorithm for multiple pattern matching."""

from collections import defaultdict

# Just because the list of words I used to test had words with letters not from a to z.
from unidecode import unidecode 


class AhoCorasick:
    """Aho-Corasick algorithm for multiple pattern matching"""
    def __init__(self, words):
        __words = []
        __max_word_length = 0
        for _word in words:
            _word = unidecode(_word.strip().lower())
            # Remove this IF. This is only to remove some small words
            # and words with characters not from a to z that appear
            # in the particular collection of words that I used.
            # You might also want to change the line
            #     self.max_characters = 26
            # if you are going to use other characters.
            if (
                not _word.isalpha()
                or _word == 'x'
                or _word == 'y'
                or _word == 'a'
                or _word == 'ad'
                or _word == 'd'
                or _word == 'dd'
            ):
                continue
            __this_word = _word.lower()
            __words.append(__this_word)
            __max_word_length += len(__this_word)  # total length of all words: upper bound on the state count
        self.max_states = __max_word_length
        self.max_characters = 26
        self.out = [0] * (self.max_states + 1)
        self.fail = [-1] * (self.max_states + 1)
        self.goto = [
            [-1] * self.max_characters for _ in range(self.max_states + 1)
        ]
        self.words = __words
        print(f'{self.max_states=}')
        print(f'{self.max_characters=}')
        print(f'{len(self.words)=}')
        self.states_count = self.__build_matching_machine()

    def __build_matching_machine(self):
        states = 1
        for _i, _word in enumerate(self.words):
            current_state = 0
            for character in _word:
                _ch = ord(character) - 97
                if self.goto[current_state][_ch] == -1:
                    self.goto[current_state][_ch] = states
                    states += 1
                current_state = self.goto[current_state][_ch]
            self.out[current_state] |= (1 << _i)
        for _ch in range(self.max_characters):
            if self.goto[0][_ch] == -1:
                self.goto[0][_ch] = 0
        queue = []
        for _ch in range(self.max_characters):
            if self.goto[0][_ch] != 0:
                self.fail[self.goto[0][_ch]] = 0
                queue.append(self.goto[0][_ch])
        while queue:
            state = queue.pop(0)
            for _ch in range(self.max_characters):
                if self.goto[state][_ch] != -1:
                    failure = self.fail[state]
                    while self.goto[failure][_ch] == -1:
                        failure = self.fail[failure]
                    failure = self.goto[failure][_ch]
                    self.fail[self.goto[state][_ch]] = failure
                    self.out[self.goto[state][_ch]] |= self.out[failure]
                    queue.append(self.goto[state][_ch])
        return states

    def __find_next_state(self, current_state, next_input):
        answer = current_state
        _ch = ord(next_input) - 97
        while self.goto[answer][_ch] == -1:
            answer = self.fail[answer]
        return self.goto[answer][_ch]

    def search_words(self, text, stop_at_first=False):
        """Search the text for all the words we know."""
        print(f'{stop_at_first=}')
        text = text.lower()
        current_state = 0
        _result = defaultdict(list)
        for _i, _character in enumerate(text):
            current_state = self.__find_next_state(
                current_state,
                _character
            )
            if self.out[current_state] == 0:
                continue
            for j, _word in enumerate(self.words):
                if (self.out[current_state] & (1 << j)) > 0:
                    _result[_word].append(_i - len(_word) + 1)
                    if stop_at_first:
                        return _result
        return _result


if __name__ == "__main__":
    with open('text_to_search.txt', 'r', encoding='utf-8') as infile_text:
        TEXT = unidecode(infile_text.read().strip())
        print(f'{len(TEXT)=}')
    # TEXT = "ahishers"
    with open('/usr/share/dict/words', 'r', encoding='utf-8') as infile_words:
        aho_corasick = AhoCorasick(infile_words)
        result = aho_corasick.search_words(TEXT, stop_at_first=False)
        for word in result:
            for i in result[word]:
                print("Word", word, "appears from", i, "to", i + len(word) - 1)

It takes a blink to search for all occurrences, inside a text of 666,768 characters, of all the words in a list of 74,733 words. Printing all the occurrences is what takes the most time. You can change stop_at_first=False to True to stop at the first match.


You are using single characters as words to be searched.
/usr/share/dict/words contains single characters. This returns false positives, because single characters are the same in many languages.

Also, try to find the word "zygotes" in a TEXT string where "zygotes" appears at the end of the string.

Create a long TEXT string with 700,000 "a" characters and "zygotes" at the end:
TEXT = "aaaaaaaaaaaaaaaaaaaaa....zygotes"

Show me how long it takes to find it.

It took 6 seconds. When I switched to 700,000 random letters, it took 7 seconds.

Make sure stop_at_first is set to False, to avoid false positives.
Test this code:

if __name__ == "__main__":

    TEXT = "a" * 700000 + "zygotes"
    print(TEXT[-7:])
    with open('/usr/share/dict/words', 'r', encoding='utf-8') as infile_words:
        aho_corasick = AhoCorasick(infile_words)
        result = aho_corasick.search_words(TEXT, stop_at_first=False)
        for word in result:
            for i in result[word]:
                print("Word", word, "appears from", i, "to", i + len(word) - 1)

Hello, thank you for your time.
When I test it, I get the following error:
[screenshot of the error]

My txt file contains only:
HELLO
BONJOUR

Note that you might need to adapt/enhance that code. It is a starting point, not meant to be used out of the box.

It was made assuming that the alphabet consists only of the characters from a to z. Check whether you either want to consider a different alphabet, or need to clean your input of characters that you don't need, like \n or spaces. For the first, you could make the class take your custom alphabet as input and replace the places where it computes ord(character) - 97 with the appropriate ordinal in the input alphabet, etc. The latter may be as simple as applying strip() to your input to get rid of end-of-line characters.
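For example, cleaning the text down to the assumed a-z alphabet could be as simple as this sketch (the function name is made up for illustration):

```python
import re

def to_az(text):
    # Lowercase, then drop everything outside a-z (newlines, spaces,
    # punctuation, digits) -- the alphabet the class above assumes.
    return re.sub(r'[^a-z]', '', text.lower())

print(to_az("HELLO\nBONJOUR\n"))  # hellobonjour
```

Note that for the word list you would clean each line separately (e.g. strip() per line), so that words from different lines are not glued together.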

Hello, yes indeed you are right: if there are end-of-line characters, it doesn't work.
hello
bonjour
=> KO
but hellobonjour => OK

A big thank you for your time, but be aware that I do not know Python at all :slight_smile:
I need to search a string for an English word, so I need to use an English dictionary. Is it possible to improve the script by using the enchant library?

regards

Here is a small script that searches for words from an enchant Dict (but it could easily be adapted to use any word list). It uses a brute-force method, but it is more than fast enough if your string is not extremely long.
You can specify a minimum and maximum word length, and a maximum number of words to find. The default minimum length is 2, to skip single characters. Some examples and their output are included at the end. The last example finds words in the script itself (if you have saved it as findwords.py).

"""Check if a string contains any words from an enchant Dict

"""
import enchant
wordlist = enchant.Dict("en_US")

def findwords(text, wordlist, minlength=2, maxlength=-1, maxwords=-1):
    textlength = len(text)
    if maxlength < 0:
        maxlength = textlength
    cntwords = 0
    for length in range(minlength, maxlength + 1):
        prevcnt = cntwords
        last1 = textlength - length + 1
        for start in range(last1):
            s = text[start:start+length]
            if wordlist.check(s):
                cntwords += 1
                print(f"{cntwords}: '{s}' at position {start} ({length})")
                if maxwords > 0 and cntwords >= maxwords:
                    return
        if cntwords == prevcnt:
            return

text = "BONJOURHELLOCIAO"

findwords(text, wordlist, minlength=3, maxwords=1)
# 1: 'OUR' at position 4 (3)

findwords(text, wordlist, minlength=4)
# 1: 'HELL' at position 7 (4)
# 2: 'LOCI' at position 10 (4)
# 3: 'CIAO' at position 12 (4)
# 4: 'HELLO' at position 7 (5)

findwords(open('findwords.py', 'r').read(), wordlist, minlength=7, maxlength=7)
# 1: 'contain' at position 21 (7)
# 2: 'enchant' at position 48 (7)
# 3: 'enchant' at position 73 (7)
# 4: 'enchant' at position 92 (7)

Hello,
thank you for your time, it works perfectly.
How can I get the information that the string contains several words on a single line, instead of on several lines?
For example, I would like:

2 words: 'HELL' at position 7 (4) and 'LOCI' at position 10 (4)
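One way to get that is to collect the matches in a list instead of printing inside the loop, and then format them on one line. A sketch of the same brute-force search; here a small set stands in for the enchant Dict so the example is self-contained (pass wordlist.check as the check function for real use):

```python
def findwords_list(text, check, minlength=2, maxlength=-1):
    """Return (word, position, length) tuples for every match found."""
    found = []
    textlength = len(text)
    if maxlength < 0:
        maxlength = textlength
    for length in range(minlength, maxlength + 1):
        prevcnt = len(found)
        for start in range(textlength - length + 1):
            s = text[start:start + length]
            if check(s):
                found.append((s, start, length))
        if len(found) == prevcnt:  # no words of this length: stop, as before
            break
    return found

# Stand-in dictionary for the demo; use enchant.Dict("en_US").check instead.
WORDS = {"HELL", "HELLO", "LOCI", "CIAO", "OUR"}
matches = findwords_list("BONJOURHELLOCIAO", WORDS.__contains__, minlength=4)
print(f"{len(matches)} words:",
      " and ".join(f"'{w}' at position {p} ({n})" for w, p, n in matches))
# 4 words: 'HELL' at position 7 (4) and 'LOCI' at position 10 (4) and ...
```

The search order is the same as before (shortest words first), so only the reporting changes: everything is gathered first and joined with " and " on one line at the end.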