Multiple Word Anagram Solver

cascadediver · February 3, 2024, 10:57pm

Good day all. I am working on a multi-word anagram solver but am running into a road block. I am only able to execute this is there is only a single word result.

from itertools import permutations

# Load a dictionary (a simple list of words)
with open('SimpleList.txt', 'r', encoding='utf-8') as file:
    dictionary = set(word.strip().lower() for word in file)

def generate_anagrams(words):
    anagrams = set()
    for perm in permutations(words):
        anagram = ''.join(perm)
        if all(part in dictionary for part in split_into_words(anagram)):
            anagrams.add(anagram)
    return anagrams

def split_into_words(s):
    # This function splits a string into words of length 5
    return [s[i:i+5] for i in range(0, len(s), 5)]

def score_anagram(anagram):
    return len(anagram)

def anagram_solver(input_string):
    words = input_string.lower().replace(" ", "")
    anagrams = generate_anagrams(words)
    sorted_anagrams = sorted(anagrams, key=score_anagram, reverse=True)
    return sorted_anagrams

# Example usage for a single string of letters without spaces
input_string = "hsiftac"
result = anagram_solver(input_string)

# Print the results
formatted_result = [' '.join(split_into_words(result_word)) for result_word in result]
print(formatted_result)

In this example the input_string is “catfish” backwards.
SimpleList.txt has only 5 words in it: CAT, BONE, FISH, CATFISH, BONEFISH
So the results from “hsiftac” should be: cat fish, fish cat, and catfish

I am realizing I need the split_into_words to be variable on not a specific number

Any suggestions?

Thank you

BowlOfRed · February 3, 2024, 11:44pm

You are asking permutations() to return all possible permutations of the input string of the same length and then trying to whittle that down later to smaller words.

I think simpler would be to ask permutations() to run for all the lengths from 1 to the size. Then you can match directly.

from itertools import permutations

dictionary = {"cat", "fish", "catfish", "bonefish"}
input_word = "hsiftac"

for r in range(1, len(input_word) + 1):
    for perm in permutations(input_word, r):
        word = "".join(perm)
        if word in dictionary:
            print(word)

cat
fish
catfish

If your input string has duplicate letters, then you can get duplicate matches. You could test for that so you don’t report duplicates.

hansgeunsmeyer · February 4, 2024, 12:36am

In generate_anagrams you require all parts of the split to be in the dictionary, but this won’t work if the split always generates out-of-dictionary fragments (as it does).

One trick may be to drop the “split_into_words” and let permutations do all the work - by adding k (k=1, 2, …) spaces to the words string:

def generate_anagrams(word, k=1):
    anagrams = set()
    for p in permutations(word + " " * k):
        a = "".join(p).split()
        if all(x in dictionary for x in a):
            anagrams.add(" ".join(a))  
    return anagrams

This will return ‘fish cat’, ‘cat fish’, ‘catfish’ as results.

If the input phrase becomes too long, the script will drag to a slow crawl, though, since permutations has O(n!) runtime complexity.

cascadediver · February 4, 2024, 12:40am

I have a list of about 10,000 words in a foreign language. I plan to enter a small phrase, sentence, or a few words as the input text and see if there are any anagrams of it. I would need all possible outcomes from the word list printed and each outcome needs to use all of the letters from the input text.

hansgeunsmeyer · February 4, 2024, 12:45am

If the dictionary has only 10K items - then it may make more sense to iterate over the dictionary.
Then after processing each letter (including possible spaces), you tick that off as used if it has a match in the the input phrase and if there is no match, you continue with the next word. This should be faster, I guess, then iterating over all permutations.

kknechtel · February 4, 2024, 12:47am

Even that won’t be enough. Say for example you have fciasth; you have all the letters needed to spell “cat” and “fish”, but there’s nowhere that you can split the string so that the letters of “fish” are on one side of the split and the letters of “cat” are on the other side.

This approach is fundamentally misguided - there are too many ways to partition the string and then try to make “complete” anagrams out of the remainder. Instead, try to find “incomplete” anagrams, and then use recursion. The idea is, if you have fciasth, you can determine that this is enough letters to make cat, fish or catfish possibly with letters left over; for each of those possibilities, you determine what letters are left over, recursively figure out the multi-word anagrams of those leftover words, and use those to make new sets of possibilities. So when you recurse from cat for example, you discover that the remaining letters can make fish (and no other possibilities), so you determine that overall, you could make the list ['cat', 'fish']. Similarly, when you recurse from catfish, you fins that there are no more letters, so you are done already; the list ['catfish'] is one possible result.

This isn’t easy to write, but it’s hard to give more guidance than that without just writing it

The overall program - especially considering good ways to optimize it for practical situations - is not at all trivial. There’s quite a bit of information already out there, though.

hansgeunsmeyer · February 4, 2024, 1:01am

There are various more or less complicated ways of handling this. One simple optimization – especially for a small dict of 10K items – may be as follows. Map all words in the dict to the sorted list of their letters (“catfish” → “acifhst”). Then make the reverse mapping (“acifhst” → list of all words with those letters). Then do one dict lookup for the sorted letter string of the target phrase.
Tricky part here is only if you also want to allow extra spaces or not (but you could add or not add those to the target phrase).
Nice aspect of this solution is that the preprocessing only needs to happen once. Generating anagrams after that runs in O(1).
To make this work so that target phrase “catfish” also generates “cat fish” and “fish cat”, you’d need to add more preprocessing though (using a trie or suffix trie would probably be fastest then).