Bugs with Reading Large List

I’m fairly new to Python and I’m trying to write a program that will automatically come up with a list of words that will complete the New York Times Spelling Bee game. For this project I have a .txt file containing a large number of words (about 450,000). The Python script is in theory supposed to take every word on the list containing a certain letter. From there it would remove items from the list if they contain a letter that is not used in the game. In the end it should print out a list of these words for the user to enter.

My problem is that as I am trying to remove extra items from the list, I’ll get the error:

IndexError: list index out of range

Although I get this error, I don’t see the problem with my code. If someone could help me that would be great. Thanks!

Here is my code:

letterList = ['a', 'b', 'c', 'd', 'e', 'f']
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
centerLetter = 'g'

loopNumber = 0

while True:
    position = alphabet.index(letterList[loopNumber])
    alphabet.pop(position)
    loopNumber = loopNumber + 1
    if loopNumber == 6:
        break
position = alphabet.index(centerLetter)
alphabet.pop(position)

loopNumber = 0
loopNumber2 = 0
loopNumber3 = 0

file1 = open("WordList.txt")
flag = 0
index = 0

listOfLines = []
string1 = centerLetter

while True:
    for line in file1:
        index += 1
        if string1 in line:
            flag = 1
            listOfLines.append(line)
            if index == 370103:
                break
    break
stuffToPop = []
while True:
    loopNumber3 = 0
    loopNumber2 = 0
    while True:
        listOfLines2 = listOfLines
        if alphabet[loopNumber] in listOfLines[loopNumber2]:
            print(loopNumber2)
            stuffToPop.append(loopNumber2)
            loopNumber3 = loopNumber3 + 1
        loopNumber2 = loopNumber2 + 1
        if loopNumber2 == len(listOfLines2) + 1:
            break
    loopNumber = loopNumber + 1
    if loopNumber == 19:
        break
loopNumber = 0
while True:
    listOfLines.pop(loopNumber)
    loopNumber = loopNumber + 1
    print("\n \n \n \n")
    print("Lists of words")
    print(listOfLines)

I’m fairly new to Python and I’m trying to write a program that will
automatically come up with a list of words that will complete the New
York Times Spelling Bee game. For this project I have a .txt file
containing a large number of words (about 450,000). The Python script
is in theory supposed to take every word on the list containing a
certain letter. From there it would remove items from the list if they
contain a letter that is not used in the game. In the end it should
print out a list of these words for the user to enter.

My problem is that as I am trying to remove extra items from the list, I’ll get the error:

IndexError: list index out of range

Please paste the complete traceback. This pinpoints the line in your
code where the error occurred, rather than requiring us to scan your
entire programme.

Comments inline below.

Although I get this error, I don’t see the problem with my code. If someone could help me that would be great. Thanks!

Here is my code:

letterList = ['a', 'b', 'c', 'd', 'e', 'f']
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
centerLetter = 'g'

loopNumber = 0

while True:
   position = alphabet.index(letterList[loopNumber])
   alphabet.pop(position)
   loopNumber = loopNumber + 1
   if loopNumber == 6:
       break

There’s a shorter way to write this loop, which avoids knowing the magic
number “6”:

for letter in letterList:
    position = alphabet.index(letterList[loopNumber])
    alphabet.pop(position)

This just iterates over letterList, assigning each letter to “letter”.

position = alphabet.index(centerLetter)
alphabet.pop(position)

loopNumber = 0
loopNumber2 = 0
loopNumber3 = 0

file1 = open("WordList.txt")
flag = 0
index = 0

listOfLines = []
string1 = centerLetter

while True:
    for line in file1:
        index += 1
        if string1 in line:
            flag = 1
            listOfLines.append(line)
            if index == 370103:
                break
    break

It isn’t clear why you have a while-loop here at all. It always runs
exactly once (courtesy of the “break” at the bottom). You can drop the
while-loop entirely and just write:

for line in file1:
    index += 1
    if string1 in line:
        flag = 1
        listOfLines.append(line)
        if index == 370103:
            break

stuffToPop = []
while True:
    loopNumber3 = 0
    loopNumber2 = 0
    while True:
        listOfLines2 = listOfLines

This may not do what you think. I imagine you want a copy of
listOfLines in listOfLines2. However, Python variables are references to
objects. So the result of the above is that listOfLines2 references the
exact same list as listOfLines. If you modify listOfLines2, listOfLines
is also modified because it is the same object (the original list).

Try this instead:

listOfLines2 = listOfLines.copy()

since lists come with a handy copy method for just this kind of thing.
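
A small illustration of the difference, as a minimal sketch: the three-word list is made-up sample data, and the names alias and shallow are only for this example.

```python
# Hypothetical sample data, just to show aliasing versus copying.
listOfLines = ["gag\n", "egg\n", "badge\n"]

alias = listOfLines           # a second name for the SAME list object
shallow = listOfLines.copy()  # a new list holding the same strings

alias.pop(0)                  # this also changes listOfLines

print(alias is listOfLines)   # True
print(len(listOfLines))       # 2
print(len(shallow))           # 3
```

Popping through the alias shortens the original list too, while the copy keeps all three items.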

        if alphabet[loopNumber] in listOfLines[loopNumber2]:
            print(loopNumber2)
            stuffToPop.append(loopNumber2)
            loopNumber3 = loopNumber3 + 1
        loopNumber2 = loopNumber2 + 1
        if loopNumber2 == len(listOfLines2) + 1:
            break

I suspect this is your IndexError problem.

This allows loopNumber2 to be used when it is equal to
len(listOfLines2). But because lists (and other sequences) in Python
count from 0, the valid indices of listOfLines2 run from 0 through to
len(listOfLines2)-1. So the iteration with
loopNumber2=len(listOfLines2) will try to access
listOfLines[loopNumber2], which is also listOfLines[len(listOfLines2)]
on that loop iteration. That is out of bounds.

Your if-statement should test loopNumber2 == len(listOfLines2),
dropping the +1.

But a better way to do this looks like this:

for loopNumber2, line in enumerate(listOfLines):
    if alphabet[loopNumber] in line:
        print(loopNumber2)
        stuffToPop.append(loopNumber2)

This iterates over enumerate(listOfLines), a built in function which
itself iterates over listOfLines and yields (index,item) pairs for
each item in listOfLines. So it yields:

0, listOfLines[0]
1, listOfLines[1]
2, listOfLines[2]
... and so on ...

which we assign to loopNumber2,line in the for-loop. This means you
(a) don’t need to do fiddly loopNumber2=loopNumber2+1 stuff and (b)
don’t need to test loopNumber2 against the length of the list - the
for-loop will run the correct number of times all on its own, because
the list itself defines what the loop iterates over.
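
To see the enumerate version run on its own, here is a minimal sketch. The four-word list and the single unwanted letter are made-up stand-ins for your real word list and for alphabet[loopNumber].

```python
# Hypothetical stand-ins for the real word list and one unwanted letter.
listOfLines = ["gag\n", "high\n", "egg\n", "ghost\n"]
unwanted = "h"

stuffToPop = []
for loopNumber2, line in enumerate(listOfLines):
    if unwanted in line:
        stuffToPop.append(loopNumber2)   # remember where the bad word sits

print(stuffToPop)  # [1, 3]
```

No counter is incremented by hand and no length is compared, yet every item is visited exactly once.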

[…snip…]

loopNumber = 0
while True:
    listOfLines.pop(loopNumber)
    loopNumber = loopNumber + 1
    print("\n \n \n \n")
    print("Lists of words")
    print(listOfLines)

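Are you sure you want to do this while-True? It has no break, so it cannot finish normally: it keeps popping until loopNumber runs past the end of the shrinking list, and then pop() raises an IndexError.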
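
If the goal of this last loop is to drop the unwanted words (that is my assumption about the intent, not something the code above does), one alternative is to build a new list of the words worth keeping instead of popping by index. Popping shifts every later item down by one, so indices saved earlier stop pointing at the right words. A minimal sketch with made-up sample data, where allowed stands for the seven game letters:

```python
# Hypothetical sample data; "allowed" plays the role of the seven game letters.
listOfLines = ["gag\n", "high\n", "egg\n", "ghost\n", "badge\n"]
allowed = set("abcdefg")

kept = []
for line in listOfLines:
    word = line.strip()          # drop the trailing newline
    if set(word) <= allowed:     # every letter of the word is an allowed one
        kept.append(word)

print(kept)  # ['gag', 'egg', 'badge']
```

The original list is never modified, so there are no indices to go stale.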

Cheers,
Cameron Simpson cs@cskk.id.au

I wrote:

There’s a shorter way to write this loop, which avoids knowing the
magic number “6”:

for letter in letterList:
    position = alphabet.index(letterList[loopNumber])
    alphabet.pop(position)

This just iterates over letterList, assigning each letter to “letter”.

That should read:

for letter in letterList:
    position = alphabet.index(letter)
    alphabet.pop(position)

Sorry,
Cameron Simpson cs@cskk.id.au

I’ve made the changes you suggested, but I still get the “list index out of range” error, now reported for line 40, which is the for loopNumber2, line in enumerate(listOfLines): line of code.

I chose to do a loop here mostly because I didn’t want to deal with stopping the loop at the right time, so I figured once I got the error List index out of range I would scroll up to the large gap in the text, and take all the words below that.

Here is my current code:

letterList = ['a', 'b', 'c', 'd', 'e', 'f']
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
centerLetter = 'g'

loopNumber = 0

for letter in letterList:
    position = alphabet.index(letterList[loopNumber])
    alphabet.pop(position)
    loopNumber = loopNumber + 1
position = alphabet.index(centerLetter)
alphabet.pop(position)

loopNumber = 0
loopNumber2 = 0
loopNumber3 = 0

file1 = open("WordList.txt")
flag = 0
index = 0

listOfLines = []
string1 = centerLetter

for line in file1:
    index += 1
    if string1 in line:
        flag = 1
        listOfLines.append(line)
        if index == 370103:
            break
stuffToPop = []
while True:
    loopNumber3 = 0
    loopNumber2 = 0
    for loopNumber2, line in enumerate(listOfLines):
        if alphabet[loopNumber] in line:
            print(loopNumber2)
            stuffToPop.append(loopNumber2)
        loopNumber2 += 1
    loopNumber = loopNumber + 1
    if loopNumber == 19:
        break
loopNumber = 0
while True:
    listOfLines.pop(loopNumber)
    loopNumber = loopNumber + 1
    print("\n \n \n \n")
    print("Lists of words")
    print(listOfLines)

Thanks!

Also, I would like to point out that occasionally the code will print a lot of values of “LoopNumber2,” and then it will stop running, seemingly passing through the last loop without doing anything. I cannot seem to find any reason that it would do this, and when I run the code again I get a List index out of range error, and it does not do it again for another couple times.
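
One possible cause of the remaining IndexError, offered as an observation rather than a confirmed diagnosis: the alphabet list in the code above has no 'l', so it starts with 25 letters. Removing the 7 game letters leaves 18, with valid indices 0 to 17, but the outer loop only breaks once loopNumber reaches 19, so alphabet[18] gets looked up. A minimal sketch that builds the list from string.ascii_lowercase and loops over it directly, so no count is needed (the names typed, game_letters and unused are only for this example):

```python
import string

letterList = ['a', 'b', 'c', 'd', 'e', 'f']
centerLetter = 'g'

# The list as typed in the post: 'l' is missing, so it has 25 entries.
typed = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'm', 'n',
         'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
print(len(typed))   # 25

# Letters NOT used in the game, built from the standard library constant.
game_letters = set(letterList) | {centerLetter}
unused = [ch for ch in string.ascii_lowercase if ch not in game_letters]
print(len(unused))  # 19

for letter in unused:   # no magic number needed
    pass                # test each word against `letter` here
```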

According to NYT, rules of the game are:

  • Words must contain at least 4 letters.
  • Words must include the center letter.
  • Our word list does not include words that are obscure, hyphenated, or proper nouns.
  • No cussing either, sorry.
  • Letters can be used more than once.

Let’s say I have the data in a suitable data structure. How would I query it to find
suitable words?

  • ignore all words shorter than 4 letters
  • select only words which contain the center letter
  • find words whose (unique) letters are a subset of the hive letters

So based on the ‘desired query’ I can design the file import to produce a suitable data
structure. For starters, it would be a good idea to ignore all the shorter words. Then I could build a nested dictionary out of the qualifying words. For example, I could collect under
each key (a letter) all the words that contain this letter. These words would be in a dictionary where
the key is the word itself and the value is the set of letters in the word. Something like: 'a': {'spam': {'s', 'p', 'a', 'm'}, 'maps': {'m', 'a', 'p', 's'} }

A word’s actual length matters only at the short end, for ignoring words with fewer
than 4 letters. Since letters can be used more than once, there is no
upper limit on word length (for example, ‘letter’ has a length of 6 but
only four unique letters).

Given a center letter and a hive of letters, one can just look up the center letter, which
gives all the words that contain it, and then iterate over those words to
determine whether each word’s unique letters are a subset of the hive letters.

Below is a simple implementation of the idea described above. As I have no actual file I used a list; if reading from an actual file, one must keep line endings in mind:

words = ['spam', 'ham', 'bacon', 'eggs', 'maps']

data = dict()

for word in words:
    if 4 <= len(word):
        letters = set(word)
        for letter in letters:
            try:
                data[letter][word] = letters
            except KeyError:
                data[letter] = {word: letters}


hive = 'amps'
center = 'a'

for word, letters in data[center].items():
    if letters.issubset(hive):
        print(word)

# will print:
# spam
# maps

If I were to play it every day, I would save the data structure to a file (JSON?) or even to several files (one for every letter), so there would be no need to rebuild it every time. I would also refactor the code to use functions. This data structure will be large (every qualifying word appears as many times as it has unique letters), but only a portion of the words must be iterated over for any one query.
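
A rough sketch of what that refactoring and saving might look like. The function names (build_index, find_words, to_json, from_json) are made up for this example, and storing each set as a sorted string is just one possible choice, needed because JSON has no set type.

```python
import json

def build_index(words, min_length=4):
    """Map each letter to {word: set of the word's letters}."""
    index = {}
    for word in words:
        if len(word) >= min_length:
            letters = set(word)
            for letter in letters:
                index.setdefault(letter, {})[word] = letters
    return index

def find_words(index, hive, center):
    """Return words filed under `center` whose letters all lie in `hive`."""
    candidates = index.get(center, {})
    return [word for word, letters in candidates.items()
            if letters.issubset(hive)]

def to_json(index):
    """Sets are not JSON-serializable, so store each one as a sorted string."""
    plain = {letter: {word: "".join(sorted(letters))
                      for word, letters in words.items()}
             for letter, words in index.items()}
    return json.dumps(plain)

def from_json(text):
    """Rebuild the sets from the strings written by to_json."""
    plain = json.loads(text)
    return {letter: {word: set(letters) for word, letters in words.items()}
            for letter, words in plain.items()}

index = build_index(['spam', 'ham', 'bacon', 'eggs', 'maps'])
restored = from_json(to_json(index))
print(find_words(restored, 'amps', 'a'))  # ['spam', 'maps']
```

Writing the string returned by to_json to a file, and reading it back before calling from_json, would give the save-once, query-daily behaviour.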

Quoting my copy of the New York Times magazine of January 9, 2022, page 48, the rules of the Spelling Bee state:

How many common words of 5 or more letters can you spell using the letters in the hive? Every answer must use the center letter at least once. Letters may be reused in a word. At least one word will use all 7 letters. Proper names and hyphenated words are not allowed. Score 1 point for each answer, and 3 points for a word that uses all 7 letters.

Maybe there is another edition that has different rules.

In any case, @jack_newport, I would suggest writing two functions in order to better organize your code. One of them, qualifies(word, center, peripherals), would return True if all of the following conditions apply:

  • The word has at least 5 letters.
  • The word contains the center letter.
  • The word only contains letters that are either in the center or the periphery.

Otherwise, it would return False.

The other function, qualifies_and_has_all(word, center, peripherals), would return True if the following conditions apply:

  • The word qualifies according to the previous function.
  • The word contains all of the 7 letters at least once.

Otherwise, it would return False.

Then loop through your list of about 450,000 words, using both functions to test each word. If a function returns True, either display the word, or add it to an appropriate list for later processing or display.

Here’s an example of how the first function might be written:

def qualifies(word, center, peripherals):
    letters = center + peripherals
    if len(word) < 5:
        return False
    if center not in word:
        return False
    for ch in word:
        if ch not in letters:
            return False
    return True
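
And here is one way the second function and the loop over the word list might look, as a minimal sketch rather than a finished program. The letters (center 'g', peripherals 'pitchn') and the six words are made up for the example, io.StringIO stands in for open("WordList.txt"), and qualifies is restated with a set comparison so the block runs on its own.

```python
import io

def qualifies(word, center, peripherals):
    # Same rules as the function above, written with a set comparison.
    letters = set(center + peripherals)
    return len(word) >= 5 and center in word and set(word) <= letters

def qualifies_and_has_all(word, center, peripherals):
    # A qualifying word that also uses every one of the 7 letters.
    return (qualifies(word, center, peripherals)
            and set(center + peripherals) <= set(word))

# io.StringIO stands in for the real file; the words are hypothetical data.
fake_file = io.StringIO("pitching\nnight\npinch\nthing\ngig\nghost\n")

center, peripherals = 'g', 'pitchn'   # hypothetical hive
answers, all_seven = [], []
for line in fake_file:
    word = line.strip().lower()       # remove the newline before testing
    if qualifies(word, center, peripherals):
        answers.append(word)
    if qualifies_and_has_all(word, center, peripherals):
        all_seven.append(word)

print(answers)    # ['pitching', 'night', 'thing']
print(all_seven)  # ['pitching']
```

Stripping each line matters: without it, the trailing newline counts as a character that is not among the allowed letters, and no word would qualify.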