I want to find the frequency of all the words in my text file that are more than three letters long so I can make a distribution curve of them.
This is my code so far:
import re
for word in open('play (4).txt','r'):
re.findall len(word>=3).count
Can someone look over it and advise me for where to go with it?
I just had an idea for code I could potentially run in Python. I have to do a few things before I can actually use it, but I wanted to think ahead. I haven't run the code I posted yet, but thought I might be along the right lines.
For the code that you have given - would I then print(word)?
Here’s another solution that works for me. By looking at the different solutions you can get an idea how this works. BTW, I recommend you do a full Python tutorial.
import sys # Needed for sys.exit().
from os.path import exists
import re # For regex.
file1 = 'play (4).txt' # Get one file.
# See if the file actually exists.
if not exists(file1):
    print(f"ERROR: File {file1} does not exist")
    sys.exit() # Exit program.
filein = open(file1, 'r')
textin = filein.read()
filein.close()
# Split on non-word characters which is \W.
textlist = re.split(r'\W', textin) # Turn our file into a list. Raw string so \W is not treated as a string escape.
cnt = 0
for word in textlist: # Loop through each word.
    if len(word)>3:
        cnt += 1
print(f"Words with at least 4 ctrs: {cnt}")
r'''This is my file contents:
This is the play 4 file [with brackets].
And with {braces}.
[Some more brackets].
'''
Why do you open the file for reading then close it?
does the re.split, split each word into a string which therefore creates a list of all the strings/ words in the document?
Is the count basically so the computer counts every time it sees a string of more than 3/4/5 etc. letters?
Will this print each of the words with more than three letters in them in chronological order like a dictionary, e.g. would it print:
the: 28
find: 22
testing: 20
etc.
Because we already have the data in the variable textin. The whole file contents is in textin as a single string, with lines separated by \n (in text mode Python translates your OS's line endings, such as CRLF on Windows, to \n).
Yes, this is a regular expression split on non-word characters. Word characters are \w, non-word characters are \W. This method does not include punctuation in your words, so you are only counting word characters (letters, digits, and underscores).
Yes.
No, I just showed you some steps, I will let you do some research to do the rest. If you run the program you will see what it does.
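To make the answers above concrete, here is a minimal sketch of what re.split returns and how the length test behaves. The sample string is an assumption standing in for the file contents; it is not read from 'play (4).txt'.

```python
import re

# Sample text standing in for the file contents (an assumption for illustration).
textin = "This is the play 4 file [with brackets].\nAnd with {braces}.\n"

# Splitting on every single non-word character leaves an empty string
# wherever two separators sit next to each other, e.g. " [" or "}.".
textlist = re.split(r'\W', textin)
print(textlist)

# Empty strings have length 0, so the length test filters them out.
long_words = [word for word in textlist if len(word) > 3]
print(long_words)
print(len(long_words))
```

Note that the list contains empty strings and the digit "4" as well as the words, which is why the loop checks the length of each item before counting it.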
I have tried this code:
What I have tried to do is ask the computer to find all words in my text that are longer than three letters (hence finding = re.findall), and then tell me how many times each of those words appears.
file = open('file2.txt','r')
read = file.read()
finding = re.findall(r'\b\w{3,100}\b',read).count
#I do not think there will be any words with more than 100 letters so I
#put that as my maximum
count = 0
read.split()
for word in finding:
    word.count()
    print(word)
However, with this I get the error TypeError: 'builtin_function_or_method' object is not iterable. So then I tried putting read.split() above the findall line and using finding = re.findall(r'\b\w{3,100}\b', read.split()), but I got the error TypeError: expected string or bytes-like object.
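Both errors have a direct cause. Writing .count without parentheses stores the list's count method itself rather than a list, so the for loop has nothing to iterate over. And re.findall expects a single string, while read.split() returns a list. One possible way forward, as a sketch only, is to pass the whole string to findall and let collections.Counter do the tallying. The sample text and the names text, words, and frequencies are assumptions for illustration; they are not from the original file.

```python
import re
from collections import Counter

# Sample text standing in for file2.txt (an assumption for illustration).
text = "the cat and the dog find the ball and find the cat"

# findall() takes the whole string and returns a list of matches.
# {3,} means "three or more word characters"; use {4,} for more than three.
words = re.findall(r'\b\w{3,}\b', text.lower())

# Counter tallies how many times each word appears.
frequencies = Counter(words)

# most_common() yields (word, count) pairs, highest count first.
for word, count in frequencies.most_common():
    print(f"{word}: {count}")
```

With a real file, text would come from reading the file into a string first, as in the earlier reply.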
In the upper left of that page is a hamburger menu (3 horizontal lines) which is the TOC for that page so you can jump to different parts of the documentation.