Find the frequency of words of more than three letters

I want to find the frequency of all the words in my text file that have more than three letters so I can make a distribution curve of them.
This is my code so far:

import re
for word in open('play (4).txt','r'):
    re.findall len(word>=3).count

Can someone look over it and advise me on where to go with it?

This is not valid Python code.

If you really want to use regular expressions, then my advice is to first remove all words of three letters or fewer, then count the rest:

word = re.sub(r"\b\w{1,3}\b", "", word)

Now you can count the rest :wink:
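A minimal sketch of the remove-then-count idea, in case it helps. The sample text is made up for illustration, and the pattern uses \w{1,3} so that words of one, two or three characters are all removed (a plain \w{3} would only remove words of exactly three). Note that \w also matches digits and underscores, and a word like "don't" would be treated as two short words.

```python
import re

# Hypothetical sample text standing in for the contents of the file.
text = "The cat sat on a very large mat and then slept soundly"

# Remove every whole word made of one to three word characters.
long_only = re.sub(r"\b\w{1,3}\b", "", text)

# split() with no arguments also discards the leftover whitespace.
long_words = long_only.split()

print(long_words)
print(len(long_words))
```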

Posting code that isn’t even Python doesn’t convince us that you’ve actually tried to do this before asking us. :confused:

dist = {}
with open('play (4).txt','r') as fp:
    for word in fp.read().split():
        if len(word) > 3:
            dist[len(word)] = 1 + dist.get(len(word), 0)

The result, dist, is a dictionary mapping each word length to the number of words of that length.
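If what is wanted is a count per word rather than per word length, one possible variation uses collections.Counter from the standard library. This is only a sketch: the sample text is invented, and in practice it would come from fp.read() as in the code above.

```python
from collections import Counter

# Hypothetical sample text; in practice this would come from fp.read().
text = "this that this other that this big cat"

# Count each word that is longer than three characters.
freq = Counter(word for word in text.split() if len(word) > 3)

# most_common() yields (word, count) pairs, most frequent first.
for word, count in freq.most_common():
    print(f"{word}: {count}")
```

Because split() only separates on whitespace, punctuation stays attached to words here, so "cat." and "cat" would be counted separately.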

Hi,

I just had an idea for code I could potentially run in Python. I have to do a few things before I can actually use it, but I wanted to be thinking ahead. I haven’t run the code I posted yet, but I thought I might be along the right lines.

For the code that you have given - would I then print(word)?

Are you required to use regex or can you use any other method?
You didn’t say in your first post. Your first post should contain your requirements.

I do not have to use regex. I have just been using it for a while, so I am becoming slightly more confident with it.

Here’s another solution that works for me. By looking at the different solutions you can get an idea how this works. BTW, I recommend you do a full Python tutorial.

import sys # Needed for sys.exit().
from os.path import exists
import re # For regex.

file1 = 'play (4).txt' # Get one file.
# See if the file actually exists. 
if not exists(file1):
    print(f"ERROR: File {file1} does not exist")
    sys.exit() # Exit program.

filein = open(file1, 'r')
textin = filein.read()
filein.close()
# Split on runs of non-word characters (\W+).
textlist = re.split(r'\W+', textin) # Turn our file into a list.
cnt = 0
for word in textlist: # Loop through each word.
    if len(word)>3: 
        cnt += 1
        
print(f"Words with at least 4 characters: {cnt}")

r'''This is my file contents:
This is the play 4 file [with brackets]. 
And with {braces}.
[Some more brackets].
'''

Thank you for sharing

  1. Why do you open the file for reading and then close it?
  2. Does re.split turn each word into a string, creating a list of all the strings/words in the document?
  3. Is the count basically so the computer counts every time it sees a string of more than 3/4/5 etc. letters?

Will this print each of the words with more than three letters in chronological order, like a dictionary? E.g. would it print:
the: 28
find: 22
testing: 20
etc.

Because we already have the data in the variable textin. The whole file contents are in textin as a single string, with lines separated by \n (when reading in text mode, Python translates your OS line ending, such as CRLF, to \n).

Yes, this is a regular expression split on non-word characters. Word characters are \w (letters, digits and the underscore); non-word characters are \W. This method does not include punctuation in your words, so you are only counting word characters.
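To make the split concrete, here is a small sketch that runs it on the sample file contents shown earlier in the thread. It uses \W+ so that each run of non-word characters acts as a single separator; the variable name long_words is just for illustration.

```python
import re

# Sample text copied from the file contents shown earlier in the thread.
textin = ("This is the play 4 file [with brackets]. \n"
          "And with {braces}.\n"
          "[Some more brackets].\n")

# \W+ treats each run of non-word characters as one separator.
textlist = re.split(r"\W+", textin)

# Keep only the words longer than three characters.
long_words = [word for word in textlist if len(word) > 3]

print(long_words)
print(len(long_words))
```

The brackets, braces and full stops disappear because they are non-word characters, and "4" is kept as a word by the split but is too short to be counted.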

Yes.

No, I just showed you some steps, I will let you do some research to do the rest. If you run the program you will see what it does.

Hi,

I have tried this code:
What I have tried to do is ask the computer to find all words in my text that are longer than three letters (hence finding = re.findall), and then tell me how many of each of these words there are.

file = open('file2.txt','r')
read = file.read()
finding = re.findall(r'\b\w{3,100}\b',read).count
#I do not think there will be any words with more than 100 letters so I 
#put that as my maximum
count = 0
read.split()
for word in finding:
    word.count()
print(word)

However, with this I get the error TypeError: 'builtin_function_or_method' object is not iterable. So then I tried passing read.split() to re.findall, as in finding = re.findall(r'\b\w{3,100}\b', read.split()), but I got the error TypeError: expected string or bytes-like object.

Please can someone advise how to tweak this?

What tutorial did you watch or read to learn about regex?

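The re.findall line in your program is not right: .count without parentheses refers to the method itself rather than a list, which is why the for loop reports that it is not iterable. Did you read the docs at re — Regular expression operations — Python 3.11.8 documentation? What did you learn?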

In the upper left of that page is a hamburger menu (3 horizontal lines) which is the TOC for that page so you can jump to different parts of the documentation.
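For comparison, here is one way the findall attempt could be reworked. It is a sketch under a few assumptions rather than the only correct form: the sample text is invented (in practice it would be read from file2.txt), the text is lowercased so that "Find" and "find" count as the same word, and {4,} is used to mean "four or more characters" with no upper limit needed.

```python
import re
from collections import Counter

# Hypothetical sample text; in practice, read it from 'file2.txt'.
text = "Find the word and find the other word, then find more."

# findall takes a single string and returns a list of all matches.
# Lowercasing first is an assumption, so "Find" and "find" are merged.
words = re.findall(r"\b\w{4,}\b", text.lower())

# Counter tallies how many times each word appears in the list.
counts = Counter(words)

for word, count in counts.most_common():
    print(f"{word}: {count}")
```

re.findall expects a string, which is why passing it the list returned by split() raises TypeError: expected string or bytes-like object.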