Finding the frequency of words in a text document

Hi,

I have tried this code:
What I have tried to do is ask the computer to find all words in my text that have three or more letters (hence finding = re.findall), and then tell me how many times each of these words occurs.

file = open('file2.txt','r')
read = file.read()
finding = re.findall(r'\b\w{3,100}\b',read).count
#I do not think there will be any words with more than 100 letters so I 
#put that as my maximum
count = 0
read.split()
for word in finding:
    word.count()
print(word)

However, with this I get TypeError: 'builtin_function_or_method' object is not iterable. So then I tried moving read.split() into the findall call, finding = re.findall(r'\b\w{3,100}\b', read.split()), but I got TypeError: expected string or bytes-like object.

Please can someone advise how to tweak this?

Look at line 3 in the code. Because of the trailing .count, you assign finding to the list's count method itself, not to the list of words or to the result of calling the method.

Note that you also set a variable named count to zero but never use it, while later you call word.count() without an argument. That count is a string method, and it needs to be told what to count.
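
To make the difference between a method and its result concrete, here is a small sketch. The sample string is made up and stands in for the contents of file2.txt, so treat it as an assumption rather than the real data.

```python
import re

# Hypothetical sample text standing in for the contents of file2.txt.
text = "the cat and the dog and the bird"

# findall returns a plain list of matching words.
words = re.findall(r'\b\w{3,100}\b', text)

method = words.count          # no parentheses: this is the method object itself
result = words.count('the')   # called with an argument: an integer

print(callable(method))  # True
print(result)            # 3
```

Looping over method raises the "'builtin_function_or_method' object is not iterable" error, whereas words is an ordinary list that can be iterated.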

Hi,

I wasn’t 100% sure of what you meant but I tried this instead:

file = open('file2.txt','r')
read = file.read()
finding = re.findall(r'\b\w{3,100}\b',read)
count = 0
for i in read:
    if i in finding:
        count += 1
sort_three_letters = sorted(count.items(),key=lambda i:i.count())
len(count)
print(count)

If I just print(finding) it reproduces every single word of three or more letters, which is great but not what I want.
I want to find the frequency of all the words with three or more letters, e.g. the: 5, played: 7, etc., but I don’t get anything in my output.

read() reads in the entire file and creates a string. So for i in read is going to iterate over all the characters in the file.
I’m guessing you want to iterate over the words in the file, because in the next line you want to see if the word is in the result of re.findall (which is a list of words that you are able to print out as you mention).

When you increment the counter, you probably want to keep track of which word you’re incrementing the counter for.
Currently it’s just a single integer counter, which has no items() method. A mapping of words to counts (a dictionary) would have one. Once you have that, it should be easy to print the words and the number of times they occur.
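
A quick sketch of both points, using a made-up string in place of the file contents (the sample text is an assumption):

```python
import re

# Hypothetical sample text standing in for what file.read() returns.
text = "the cat sat"

# Iterating over a string yields single characters, not words.
chars = [c for c in text]
print(chars[:3])  # ['t', 'h', 'e']

# re.findall yields the words themselves.
words = re.findall(r'\b\w{3,100}\b', text)
print(words)  # ['the', 'cat', 'sat']

# An integer has no items() method, but a dictionary does.
print(hasattr(0, 'items'))   # False
print(hasattr({}, 'items'))  # True
```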

Hi,

I think I sort of understood what you are saying, so I have tried to create a dictionary. This is my code below; however, I got a TypeError: unhashable type: 'list' error.

dict_1 = {}
file = open('file2.txt','r')
read = file.read()
finding = re.findall(r'\b\w{3,100}\b',read)
for i in finding:
    if i in dict_1:
        dict_1[finding] += 1
    else:
        dict_1[finding] = 1
print(dict_1)

finding is a list of words. You don’t want to put the list in dict_1, you want to put the word into it:

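import re

dict_1 = {}
# Read the whole file into one string; "with" closes the file afterwards
with open('file2.txt', 'r') as file:
    read = file.read()
# All words of 3 to 100 word characters
finding = re.findall(r'\b\w{3,100}\b', read)
for word in finding:
    # Use each word (not the whole list) as the dictionary key
    if word in dict_1:
        dict_1[word] += 1
    else:
        dict_1[word] = 1
print(dict_1)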

A shorter way is to use Counter from the collections module. You just pass it the list of words.
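
A minimal sketch of the Counter approach, again with a made-up string in place of file2.txt (an assumption, not the real file):

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the contents of file2.txt.
text = "the cat and the dog and the bird"

words = re.findall(r'\b\w{3,100}\b', text)

# Counter builds the word -> count mapping in one step.
counts = Counter(words)

print(counts['the'])          # 3
print(counts.most_common(2))  # [('the', 3), ('and', 2)]
```

Counter is a dict subclass, so counts.items() works just as it does on dict_1, and most_common() returns the pairs sorted from most to least frequent.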

Oh my gosh thank you so much!
I did some research and I knew that was the error but was so unsure of how to fix it.

Now that I have done that, I would like to sort these words by frequency, from most to least frequent.
I tried writing this:

sorting = sorted(dict_1.items(),key = lambda word:word.count())

In my head this means I want to sort all of the dictionary items so that the words of three letters or more that appear most frequently come first, i.e.:
'the': 70
'and': 65
'troupe': 23
etc.

dict_1.items() gives you a series of tuples consisting of keys (in this case, your word) and values (which is your count). So, for your key function, change it to:

lambda word: word[1]

Here, word[1] refers to the second part of the tuple, where the value is stored.

If you want it in descending order, also pass reverse=True to the sorted function.
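
Put together, it could look like the sketch below. The dictionary is filled in by hand with the example counts from the post above, just so the snippet runs on its own.

```python
# Example counts taken from the post above, entered by hand for illustration.
dict_1 = {'troupe': 23, 'the': 70, 'and': 65}

# Each item is a (word, count) tuple; word[1] is the count.
sorting = sorted(dict_1.items(), key=lambda word: word[1], reverse=True)

print(sorting)  # [('the', 70), ('and', 65), ('troupe', 23)]
```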

Thank you!

I have now tried to ask the computer to find the frequency of each word length.
I.e., I want it to tell me how many words in my text have three, four, five letters, etc.

dict_2 = {}
file1 = open('file2.txt','r')
reads = file1.read()
findings = re.findall (int('\w{3,100}',reads)

for word in findings:
    if word in dict_2:
        dict_2[length] += 1
    else:
        dict_2[length] = 1 
sortings = sorted(dict_2.items(), key = lambda length:length[1], reverse = True)
print(sortings)

However, I get an invalid syntax error and I’m not even sure I’m asking it to do the right thing.
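
The invalid syntax comes from the findall line: the parentheses are unbalanced, and int() cannot be wrapped around a pattern like that. Separately, length is never defined inside the loop. One possible way to do it, as a sketch rather than the only approach, is to take len(word) for each word and use that number as the dictionary key. The sample string below is made up and stands in for file2.txt.

```python
import re

# Hypothetical sample text standing in for the contents of file2.txt.
text = "the cat and the dog chased birds"

findings = re.findall(r'\b\w{3,100}\b', text)

dict_2 = {}
for word in findings:
    length = len(word)  # number of letters in this word
    if length in dict_2:
        dict_2[length] += 1
    else:
        dict_2[length] = 1

# Each item is a (length, count) tuple; sort by count, largest first.
sortings = sorted(dict_2.items(), key=lambda pair: pair[1], reverse=True)
print(sortings)
```

With this sample, five words have three letters and the remaining two lengths occur once each, so the three-letter entry comes first.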