Splitting sections in an imported text file

Icantcode · April 26, 2024, 12:07pm

Hi,

There are a few issues with my code here - please could someone look over it and help me find where I have made the mistake.

Here is what I have tried to do:

Import my text file, which is a play.
Then I want to get rid of all the stage directions, which in the play are enclosed in square brackets. I thought that if I used the split function with a full stop between the brackets, that would remove every section of the play in square brackets, regardless of its content.
Then I wanted to remove lines 5 - 28 of the play (because that is just a character list).

I would therefore be left with just the speech, in which I need to find the most frequently used words of more than three letters. That is the final thing I have tried to do.

barry-scott · April 26, 2024, 2:52pm

Please read the pinned topic that explains how to post code ("About the Python Help category"), then edit your post to replace the screenshot.

Icantcode · April 26, 2024, 3:15pm

Like this?:

import re
with open('play (2).txt','r') as a:
   b = a.read()
   c = re.split('[.]',b) 
   d = re.split(for line in b: '[5-28]',b )
   e = b.count(len(for word in b) >= 3)

barry-scott · April 26, 2024, 6:21pm

Yes that’s great. I can see the code clearly.

What do you expect this to do?
It looks very odd to me.

What do you expect this to do?
The for line in b: '[5-28]' will return the string “[5-28]” once for each line in the file.

kknechtel · April 26, 2024, 6:49pm

If it were valid syntax, sure. This isn’t a proper generator expression, if that’s what you had in mind.

barry-scott · April 26, 2024, 7:58pm

I ran the fragment and it did what I said. I must admit I was not sure what to expect.

Icantcode · April 27, 2024, 1:50pm

With c = re.split('[.]', b) I wanted to remove any text in the play that was enclosed in square brackets.

Then with re.split(for line in b: '[5-28]', b) I wanted to remove lines 5 - 28.

How would I do this instead?

barry-scott · April 27, 2024, 2:15pm

First a warning: regular expressions are very powerful, but they take time to learn and use.

You could loop over each line and use an re function to remove all the text enclosed in [ ]. See re — Regular expression operations — Python 3.8.18 documentation

I am guessing you do not know how to write the regular expressions that re uses. I think this does the match you describe: r'\[[^\]]+\]' says to match a [ followed by one or more characters that are not a ], and then a final ].
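
A minimal sketch of that loop, assuming no stage direction spans more than one line. The sample lines and the names `sample_lines` and `STAGE_DIRECTION` are made up for illustration and are not from the play file:

```python
import re

# Matches "[", then one or more characters that are not "]", then "]".
STAGE_DIRECTION = re.compile(r"\[[^\]]+\]")

# Hypothetical sample lines standing in for the play text.
sample_lines = [
    "ALICE: I shall wait [aside] though not for long.\n",
    "[Enter BOB]\n",
    "BOB: Good evening.\n",
]

# sub() returns a new string for each line; the originals are unchanged.
cleaned = [STAGE_DIRECTION.sub("", line) for line in sample_lines]

for line in cleaned:
    print(line, end="")
```

Removing a direction from the middle of a line leaves the spaces on both sides behind, which does not matter for counting words later.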

If you read the lines of the file into a list you can slice the list.

with open('play (2).txt','r') as f:
    lines = f.readlines()

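# remove lines 5-28 (counting from 1); list indices start at 0
# and the end of a slice is excluded, so the slice is 4:28
del lines[4:28]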
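
Because list indices start at 0 and a slice leaves out its end index, removing lines 5 to 28 as a text editor numbers them takes a little care. A small sketch with made-up numbered lines in place of the real file (the names `first` and `last` are only for illustration):

```python
# Hypothetical stand-in for f.readlines(): 30 numbered lines.
lines = [f"line {n}\n" for n in range(1, 31)]

first, last = 5, 28           # 1-based, inclusive line numbers to drop
del lines[first - 1:last]     # 0-based slice covering indices 4..27

print("".join(lines), end="")
```

Only lines 1 to 4 and 29 to 30 remain afterwards.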

Icantcode · April 27, 2024, 4:13pm

I tried this:

import re
with open('play (2).txt','r') as a:
    b = a.readlines()
    output = []
    for line in a:
        re.split('[.]',b)

but it didn’t change anything in my ‘play (2).txt’ file. It didn’t remove any of the text which was enclosed in square brackets throughout the file.

I also tried the del lines but I got an error?

jrivers · April 27, 2024, 5:01pm

It’d be better to use re.sub to remove text, rather than re.split. For writing the pattern:
Remember that [ and ] have a special meaning in regular expressions, so to match the literal [ and ], they’ll need to be escaped as \[ and \].
The . in the pattern matches a single character. To match any number of characters, you’ll need to put a * afterwards, like this: .*
Also note that the * is greedy, so it will match all the way to the last ] in the line. Instead of using . to match any character, it’s better to do like Barry Scott suggested above, and use [^]] to match any character except ].
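
The difference between the greedy pattern and the "anything except ]" pattern can be seen on a line with two bracketed sections. This is only a sketch on a made-up line:

```python
import re

# Hypothetical line containing two stage directions.
line = "[Aside] I will go. [Exit]"

# Greedy: .* runs on to the LAST "]" in the line, swallowing the speech too.
greedy = re.sub(r"\[.*\]", "", line)

# Negated class: [^\]]* cannot cross a "]", so each section ends at its own "]".
negated = re.sub(r"\[[^\]]*\]", "", line)

print(repr(greedy))
print(repr(negated))
```

With the greedy pattern the speech between the two directions is lost; with the negated class only the bracketed parts are removed.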

What Barry Scott suggested above should work. You say you got an error—what did the error message say?

I’d do this in two parts: first find all of the relevant words, then count them to get the most common ones. For finding the words, you can use regular expressions for that. To count them, there’s collections.Counter, which is meant just for that purpose.

re.split returns a list of strings. It doesn’t modify the original string. The original file won’t be changed unless you write to it. That said, if you only want to count the most frequent words, changing the text in memory is sufficient, and you can just print the counts to the terminal at the end.
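
A rough sketch of the two-part approach (find the words, then count them), assuming the stage directions are already gone. The sample text is invented, and the pattern treats a word as a run of plain letters, so a word such as "don't" would be split at the apostrophe:

```python
import re
from collections import Counter

# Hypothetical text with the stage directions already removed.
text = "The king is here. The king commands that they wait here."

# [A-Za-z]{4,} between word boundaries matches whole words of four or
# more letters, i.e. "more than three letters". lower() makes the count
# case-insensitive.
words = re.findall(r"\b[A-Za-z]{4,}\b", text.lower())

counts = Counter(words)
top = counts.most_common(2)   # the two most frequent (word, count) pairs
print(top)
```

Counter also makes the word-length distribution straightforward: Counter(len(w) for w in all_words), where all_words is a list of every word regardless of length.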

Icantcode · April 27, 2024, 5:23pm

Thank you for the responses!

I am just going to work on removing the text in square brackets first.
Will this:

import re
a = open('play (2).txt','r') 
b = a.readlines()
output = []
for line in a:
   re.sub('\[^\]',b) 
a.close()

c = open('play (2).txt','w')
d = c.write(output)
c.close()

change the text file so that all the text that exists in square brackets will be removed?

kknechtel · April 27, 2024, 5:31pm

Oh, I see what you mean. Yes, in isolation it would do that (rather, it would compute that string and discard the result). But it isn’t a valid expression, so it can’t be an argument for re.split.

kknechtel · April 27, 2024, 5:34pm

No.

First, if you want to write lines of output afterwards, you need to actually collect them first: output.append(re.sub('\[^\]',b)). This means to take the string that was given back to you from re.sub, and append it to the output list.

Second, you cannot .write a list to a file. If the list contains a separate string for each line that you want to write, you need to use .writelines instead.
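
Both points can be tried without touching the real file. The sketch below uses io.StringIO as an in-memory stand-in for a file opened for writing; the input lines are made up, and the three-argument re.sub call with this particular pattern is an assumption rather than the code from the post above:

```python
import io
import re

lines = ["[Enter] Hello.\n", "Goodbye. [Exit]\n"]   # hypothetical input

output = []
for line in lines:
    # re.sub returns a new string; append it so the result is not lost.
    output.append(re.sub(r"\[[^\]]*\]", "", line))

buffer = io.StringIO()        # behaves like a text file opened with 'w'
buffer.writelines(output)     # writelines accepts a list of strings
written = buffer.getvalue()

try:
    buffer.write(output)      # write() wants one string, not a list
    write_error = None
except TypeError as exc:
    write_error = exc

print(repr(written))
print(type(write_error).__name__)
```

writelines does not add newlines of its own, which is fine here because each line read from the file already ends with one.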

Icantcode · April 27, 2024, 6:01pm

Hi,
I tried writing this code but I got this error:
TypeError: sub() missing 1 required positional argument: 'string'

import re
a = open('play (2).txt','r') 
b = a.readlines()
c = output.append(re.sub('\[^\]',b))
a.close()

d = open('play (2).txt','w')
e = c.writelines(c)
d.close()

kknechtel · April 27, 2024, 7:56pm

Sorry, missed a spot. re.sub is for replacing whatever matches. The way that we remove what matches, is to replace it with an empty string. You have to tell it that replacement.

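The order is: the regex you’re using, then the replacement string, then the source string (where it looks for matches). So: re.sub(r'\[[^\]]*\]', '', line). Note that the source has to be a single string, such as one line, rather than the whole list from readlines, and that '\[^\]' on its own will not match a bracketed section.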
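
A small sketch of that argument order. It also shows that the third argument has to be a string, so a list returned by readlines needs to be handled one line at a time. The pattern and the sample lines are assumptions for illustration:

```python
import re

pattern = r"\[[^\]]*\]"       # assumed pattern for a "[...]" section
lines = ["[Aside] Soft you now.\n", "He is gone. [Exit]\n"]   # hypothetical

# re.sub(pattern, replacement, string): all three are required.
one_line = re.sub(pattern, "", lines[0])

# Passing the whole list as the source raises TypeError, because
# re.sub only searches a string (or bytes), never a list of them.
try:
    re.sub(pattern, "", lines)
    list_error = None
except TypeError as exc:
    list_error = exc

# Applying it to each element in turn works.
all_lines = [re.sub(pattern, "", line) for line in lines]
print(all_lines)
```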

Icantcode · April 28, 2024, 10:09am

I tried this - I think it has just removed everything in the txt file because I seem to get an error message of: TypeError: write() takes exactly one argument (0 given)

I wrote this for the code:

import re
a = open('play (2).txt','w')
d = a.write()
c = re.sub('\[^\]','',d)
a.close()

kknechtel · April 28, 2024, 12:38pm

Well, yes:

There is nothing between the (), therefore a.write received 0 arguments. You need to say what should be written to the file. (Hint: you should probably do this after the part that calculates the thing that you want to write.)

Icantcode · April 28, 2024, 3:04pm

Hi,

I’ve written this:

x = open('play (3).txt','w')
y = x.write(translate)
x.close()

#%%
#Which words of more than three letters are used most
#frequently? What is the distribution of word lengths overall?
import re
a = open('play (3).txt','w')
c = re.sub('\[^\]','',x)
a.write(c)
a.close()

but this gives an error of
TypeError: expected string or bytes-like object
and it also deletes every single part of the file rather than just the parts that are enclosed in the square brackets.

jrivers · April 28, 2024, 3:36pm

x = open('play (3).txt','w')

When you open a file for writing, it is truncated to zero length, so any content that already exists is erased. The file will be empty until you write something to it.

Reading from a file copies data from it, on the disk, to your program. Writing copies some bytes from your program to the disk. So, to modify the existing contents, you need to read from the file first, and write back to the file afterwards.
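
The read-first, write-afterwards order can be sketched as follows. To stay safe it works on a throwaway file in a temporary folder; the file name, its contents, and the pattern are all made up for the example:

```python
import os
import re
import tempfile

# Create a small practice file so the real play file is not touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "play_copy.txt")
with open(path, "w") as f:
    f.write("[Enter GHOST]\nGHOST: Mark me. [Exit]\n")

# 1. Read first: the text now lives in the program as one string.
with open(path, "r") as f:
    text = f.read()

# 2. Change the string in memory; the file on disk is still unchanged.
text = re.sub(r"\[[^\]]*\]", "", text)

# 3. Only now open for writing (which empties the file) and write back.
with open(path, "w") as f:
    f.write(text)

with open(path, "r") as f:
    result = f.read()
print(repr(result))
```

While experimenting, writing the result to a different file name than the one being read avoids losing the original if something goes wrong.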

Icantcode · April 28, 2024, 4:33pm

I tried this but got the same error? Sorry, I am so confused:

import re
a = open('play (3).txt','r')
d = a.read()
c = re.sub('\[^\]','',x)
print(c)
a.close()

q = open('play (3).txt','w')
w = q.write(c)
q.close()