Find the frequency of words of more than three letters

Icantcode · April 29, 2024, 9:41am

I want to find the frequency of all the words in my text file that are more than three letters long, so I can make a distribution curve of them.
This is my code so far:

import re
for word in open('play (4).txt','r'):
    re.findall len(word>=3).count

Can someone look over it and advise me on where to go with it?

FelixLeg · April 29, 2024, 9:49am

This is not correct Python code.

If you really want to use regular expressions, then my advice is to first remove all words of three letters or fewer, then count the rest.

word = re.sub(r"\b\w{1,3}\b", "", word)

Now you can count the rest
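
A minimal sketch of that approach, assuming the goal is to drop every word of three letters or fewer and count what is left. The sample text and variable names are made up for illustration. The quantifier {1,3} matters here: \w{3} on its own would match only words of exactly three letters.

```python
import re

# Hypothetical sample text standing in for the contents of the file.
text = "the cat sat on the mat while testing the testing rig"

# Remove every word made of one to three word characters.
remaining = re.sub(r"\b\w{1,3}\b", "", text)

# split() with no arguments discards the leftover runs of whitespace.
remaining_words = remaining.split()

print(remaining_words)
print(len(remaining_words))
```

Note that \w also matches digits and underscores, so a token like "42" would be treated as a short word.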

TomRitchford · April 29, 2024, 10:12am

Posting code that isn’t even Python doesn’t convince us that you’ve actually tried to do this before asking us.

dist = {}
with open('play (4).txt','r') as fp:
    for word in fp.read().split():
        if len(word) > 3:
            dist[len(word)] = 1 + dist.get(len(word), 0)

The result, dist, is a dictionary mapping each word length to the number of words with that length.
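
To inspect the result, one option is to print the dictionary sorted by length. The sketch below repeats the same loop on an inline sample string instead of the file; the sample text is made up for illustration.

```python
# Hypothetical sample text standing in for the file contents.
text = "this is some sample text with several longer words"

dist = {}
for word in text.split():
    if len(word) > 3:
        dist[len(word)] = 1 + dist.get(len(word), 0)

# Print each word length alongside the number of words of that length.
for length in sorted(dist):
    print(f"{length}: {dist[length]}")
```

Because split() only separates on whitespace, punctuation attached to a word counts toward its length in this version.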

Icantcode · April 29, 2024, 12:22pm

Hi,

I just had an idea for code I could potentially run in Python. I have to do a few things before I can actually use it, but I wanted to think ahead. I haven’t run the code I posted yet, but I thought I might be along the right lines.

For the code that you have given - would I then print(word)?

c-rob · April 29, 2024, 1:39pm

Are you required to use regex or can you use any other method?
You didn’t say in your first post. Your first post should contain your requirements.

Icantcode · April 29, 2024, 1:42pm

I do not have to use regex, I just have been using it for a while so I am becoming slightly more confident with using it.

c-rob · April 29, 2024, 1:47pm

Here’s another solution that works for me. By looking at the different solutions you can get an idea how this works. BTW, I recommend you do a full Python tutorial.

import sys # Needed for sys.exit().
from os.path import exists
import re # For regex.

file1 = 'play (4).txt' # Get one file.
# See if the file actually exists. 
if not exists(file1):
    print(f"ERROR: File {file1} does not exist")
    sys.exit() # Exit program.

filein = open(file1, 'r')
textin = filein.read()
filein.close()
# Split on non-word characters, which is \W (raw string so the backslash is kept).
textlist = re.split(r'\W', textin) # Turn our file into a list.
cnt = 0
for word in textlist: # Loop through each word.
    if len(word)>3: 
        cnt += 1
        
print(f"Words with at least 4 characters: {cnt}")

r'''This is my file contents:
This is the play 4 file [with brackets]. 
And with {braces}.
[Some more brackets].
'''

Icantcode · April 29, 2024, 2:03pm

Thank you for sharing

Why do you open the file for reading then close it?
Does re.split turn each word into a string, and so create a list of all the strings/words in the document?
Is the count basically so the computer counts every time it sees a string of more than 3/4/5 etc. letters?

Will this print each of the words with more than three letters in them in chronological order like a dictionary, e.g. would it print:
the: 28
find: 22
testing: 20
etc.

c-rob · April 29, 2024, 2:58pm

Because we already have the data in the variable textin. The whole file contents are in textin as a single string, with lines separated by \n (in text mode Python translates the OS line ending, such as CRLF on Windows, to \n).

Yes, this is a regular expression split on non-word characters. Word characters are \w (letters, digits, and the underscore); non-word characters are \W. This method does not include punctuation in your words, so you are only counting word characters.

Yes.

No, I just showed you some steps, I will let you do some research to do the rest. If you run the program you will see what it does.
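
For reference, a sketch of what the split produces on the sample contents quoted earlier, and how a per-word tally could be built on top of it. collections.Counter does not appear in the thread and is an assumption here, as is lowercasing the words before counting.

```python
import re
from collections import Counter

# Sample file contents quoted earlier in the thread.
text = (
    "This is the play 4 file [with brackets].\n"
    "And with {braces}.\n"
    "[Some more brackets].\n"
)

# Splitting on single non-word characters leaves empty strings wherever
# two separators are adjacent; the length check below filters them out.
words = re.split(r"\W", text)

# Tally each word longer than three characters, ignoring case.
freq = Counter(w.lower() for w in words if len(w) > 3)

# most_common() lists the words from most to least frequent.
for word, n in freq.most_common():
    print(f"{word}: {n}")
```

The earlier program keeps a single running total, which is why it prints one number rather than one line per word.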

Icantcode · April 30, 2024, 5:14pm

Hi,

I have tried this code:
What I have tried to do is ask the computer to find all words in my text which are longer than three letters, hence my finding=re.findall. Then I have attempted to ask the computer to tell me how many of each of these words there are.

file = open('file2.txt','r')
read = file.read()
finding = re.findall(r'\b\w{3,100}\b',read).count
#I do not think there will be any words with more than 100 letters so I 
#put that as my maximum
count = 0
read.split()
for word in finding:
    word.count()
print(word)

However, with this I get the error TypeError: 'builtin_function_or_method' object is not iterable. So then I tried putting the read.split() inside the call, as finding = re.findall(r'\b\w{3,100}\b',read.split()), but I got the error TypeError: expected string or bytes-like object.

Please can someone advise how to tweak this?

c-rob · May 1, 2024, 9:48am

What tutorial did you watch or read to learn about regex?

The re.findall in your program is not the right syntax. Did you read the docs at re — Regular expression operations — Python 3.11.8 documentation? What did you learn?

In the upper left of that page is a hamburger menu (3 horizontal lines) which is the TOC for that page so you can jump to different parts of the documentation.
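
Both errors in the post above come from how re.findall is used. It returns a list, so appending .count without calling it hands back the list's count method rather than a result, and it expects a single string, not the list that split() produces. A hedged sketch of a corrected call follows; the sample text is made up, and the quantifier {4,} is an assumption chosen to match "more than three letters".

```python
import re

# Hypothetical sample text standing in for the contents of file2.txt.
text = "find the word then find another word and keep testing"

# findall expects one string (not the list from split()) and returns a list.
# \w{4,} matches runs of four or more word characters.
found = re.findall(r"\b\w{4,}\b", text)

# list.count must be called with the item to count.
print(found.count("find"))

# One tally per distinct word, built from the same method.
freq = {word: found.count(word) for word in set(found)}
print(freq)
```

Calling found.count once per distinct word rescans the list each time, which is fine for a short text but slow for a long one.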