Selecting text between tags

vgwosdz · March 16, 2022, 12:53pm

I am trying to write a simple program to extract text from an html page (for class, we are not allowed to use parsers).

This is my code:

*import re  # import regular expression module*

*f = open("dutch1.html", "r", encoding='utf8')  # open html file*
*f2 = open("dutch1_converted.txt", "w")  # open blank txt file to write cleaned html file to*

*f2.write = re.findall("<p>(.*?)</p>", str(f))  # select all text between html tags and save to txt file*

*f.close()*
*f2.close()*

The good thing is, it doesn’t give any errors. However, it doesn’t give any output either…

The idea is for the program to find all text between the tags and collect it in the second file.
Could someone point me in the right direction?

Thanks!

soil · March 16, 2022, 3:58pm

I think this version is closer to correct:

import re # import regular expression module 
f = open("dutch1.html", "r", encoding='utf8') # open html file 
f2 = open("dutch1_converted.txt", "w") # open blank txt file to write cleaned html file to 
for i in re.findall("<p>(.*?)</p>", str(f.read())):
    f2.write(i) # select all text between html tags and save to txt file
f.close() 
f2.close()

- “write” is a method, so you need to call it rather than assign to it.
- To read a file that was opened as f, you can use “f.read()”.
- Also, re.findall() returns a list of strings, not a string. (I am not sure whether write() accepts lists, so it is safer to use a for loop.)
Note: if the code crashes, please let me know. I don’t have my PC with me.

vgwosdz · March 16, 2022, 5:11pm

Thanks for your reply and explanation.

It doesn’t crash, but also gives no output. I checked and there is definitely data between the tags. It is just not recognised. Maybe I am doing something wrong in the way I am opening the origin file? Or the html format?

soil · March 16, 2022, 5:16pm

Maybe it’s because your file is an HTML file. You could try renaming it to dutch1.txt. I will check the documentation if that doesn’t work.

vgwosdz · March 16, 2022, 5:55pm

Nope… So this is what I have tried:

Saved the input file as txt: no success;
Simply opened the two files and wrote the text of the input file to the output file: I received a warning that the output file was incorrectly encoded as utf8, changed it, and this did exactly what I asked (obviously writing every line, not just the text between the tags);
Changed the encoding of the output file and tried the re.findall again: no output.

So I am assuming that re.findall is not returning any text. I know that using regular expressions for cleaning html files is not optimal, but as I am supposed to do this without a dedicated parser, I am at a loss as to what else to try.

CAM-Gerlach · March 16, 2022, 7:39pm

You might be interested in this Stack Overflow answer about how to parse [X]HTML with regex.

In general, when you’re doing this in the real world rather than a school assignment like this one, for non-trivial HTML parsing of documents you do not control, it is best to just use an actual HTML parser. Python has a simple one built in, html.parser, and there are many very popular third-party ones, such as BeautifulSoup.

However, for a very simple and controlled use case, regex may be sufficient. Given this appears to be a school assignment, and you are required to do so, I’m going to assume this is the case.

It would be very helpful to have an example of the file you are trying to parse. However, I’ll do my best to guess here.

Yup, you are correct. I’ll explain the problems line by line in the original code, and in the modified version @soil posted, how they contribute to the output you’re seeing, and how to fix it.

Everything looks okay here, you’ve opened f for reading, and importantly specified encoding="utf-8", which ensures your input file is read with the correct text encoding.

One note: Instead of using assignment statements and a manual f.close(), you should use a with block to take care of that for you automatically (including if your program crashes). So, for example:

with open("dutch1.html", "r", encoding="utf-8") as f:
    input_contents = f.read()

However, that’s not causing the immediate issue. See the Python tutorial for more information.

Here, as above, you’re doing what your comment describes, which is more or less correct. To note, you need to use encoding="utf-8" here too, or your text will be written in an OS-dependent encoding (which may cause issues trying to read it in other programs or machines). However, since you aren’t getting any output at all, this doesn’t appear to be your immediate issue. Also, like the above, I suggest moving this into a with block below your re.findall(), and writing to it in one go, like this:

with open("dutch1_converted.txt", "w", encoding="utf-8") as f2:
    f2.write(extracted_paragraph_text)

But again, this isn’t critical to your immediate problem, it will just avoid problems in the future.

Here’s where your problems are—this isn’t doing what you probably think it is, for multiple reasons:

As @soil implied, str(f) does not give you the file contents as a string; rather, it gives you information about the file object. To read the file contents from the file, you need to use f.read() (preferably inside a with block, as above).
As @soil also mentioned, what you’re doing above assigns the result of regex to the .write attribute (in this case, a method) of the file. What you want is to call the .write() method with the string you want to write, i.e. f2.write(your_output_string_here).
Furthermore, @soil was also correct in inferring that you need to pass a string to the .write() method, not a list of strings. You need to decide, based on the parameters of your assignment, if and how you want to separate the strings in your output file. The simplest solution is to just separate them with line breaks, which is fine if you don’t need to process them further individually (if you do, you’d want to consider a different separator that doesn’t appear in the data).
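To make the first point concrete, here is a minimal sketch of the difference between str(f) and f.read(). The temporary file and its contents are made up purely for illustration, and the exact description text may vary between Python versions:

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "w", encoding="utf-8") as out:
    out.write("<p>Hallo wereld</p>")

with open(path, "r", encoding="utf-8") as f:
    description = str(f)  # a description of the file object, not its contents
    contents = f.read()   # the actual text stored in the file

print(description)  # something like <_io.TextIOWrapper name='...' mode='r' encoding='utf-8'>
print(contents)
```

Since the description contains no <p> tags, a regex run against str(f) has nothing to match, which would explain the empty output.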

To do this, you can call the .join() method of strings with the separator string (e.g. "\n\n" for two line breaks) as the object you’re calling it on, and the list of strings you want to join into one as the argument (e.g. your_findall_output, assuming you’ve assigned the output of findall to a variable of that name). So, you’d have "\n\n".join(your_findall_output), which will return a single string with all of the paragraph blocks separated by two line breaks, which you can then pass to f2.write().
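As a small illustration of the join step, assuming a made-up list standing in for the real findall output:

```python
# Hypothetical findall output, standing in for the real matches.
your_findall_output = ["First paragraph.", "Second paragraph."]

# The separator is the string the method is called on; the list is the argument.
joined_output_string = "\n\n".join(your_findall_output)
print(joined_output_string)
```

The result is one string with a blank line between the two paragraphs, which is what .write() expects.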

Also, your regex will correctly match any <p> elements that have their opening tag, content and closing tag completely on one line, with no line breaks. However, because . does not match \n (line break), it will not match <p> blocks that have any line breaks anywhere in or between the opening and closing tags. So if, in your input file, you have:

<p>
  Some text.
</p>

or

<p>This is some text.
      This is some more text.</p>

...it will not match. You can resolve this by passing [`flags=re.DOTALL`](https://docs.python.org/3/library/re.html#re.DOTALL) to `re.findall()`, which makes `.` (dot) match all characters including `\n`.
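Here is a quick sketch of the effect of the flag, using a made-up input string rather than your actual file:

```python
import re

# Made-up input: a paragraph whose tags sit on separate lines.
html_input = "<p>\n  Some text.\n</p>"

without_flag = re.findall("<p>(.*?)</p>", html_input)
with_flag = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['\n  Some text.\n']
```

Note that the captured text keeps its surrounding whitespace and line breaks; whether to strip those depends on what the assignment asks for.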

(Also, FYI, the `?` in the regex is unnecessary, as `*` already means zero or more matches, but it is not incorrect.)

Putting all together, as a replacement for this line, you’d have something like:

html_input = f.read()
findall_matches = re.findall("<p>(.*)</p>", html_input, flags=re.DOTALL)
joined_output_string = "\n\n".join(findall_matches)
f2.write(joined_output_string)

or, all in one line (I separated it out to make the logic easy to read and follow):

f2.write("\n\n".join(re.findall("<p>(.*)</p>", f.read(), flags=re.S)))

This should fix your immediate problem, though I still strongly suggest using with blocks around your input and output as discussed above and using the correct encoding for the output file. If you still run into issues, a simple strategy to debug it is to run your code step by step and inspect the output, or split it up into smaller chunks and use print() calls to print the output of each step, to figure out exactly where you are running into issues.

CAM-Gerlach · March 16, 2022, 8:00pm

There’s no need to use a for loop here, as multiple writes are more complex and less efficient than a single write, and you’re not reading from the file line by line (nor can you, since <p> tags can span multiple lines). Instead, it is simpler and more efficient to join the list into a string first with the desired separator, and write that in one go. Using re.finditer() in place of re.findall() might be a little more memory-efficient, since it yields one match at a time, which you can then write immediately. But since you have to read the whole file into memory first anyway (and have both files open at once), and multiple writes mean additional overhead and less efficient buffering, plus extra complexity, it’s unlikely to be worth it.
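For reference, the re.finditer() variant would look roughly like this. This is only a sketch: the input string is made up, and an in-memory io.StringIO object stands in for the real output file so the example runs on its own.

```python
import io
import re

# Made-up input and an in-memory stand-in for the output file.
html_input = "<p>One</p>\n<p>Two</p>"
output_file = io.StringIO()

# finditer yields match objects one at a time instead of building a list.
for match in re.finditer("<p>(.*?)</p>", html_input, flags=re.DOTALL):
    output_file.write(match.group(1) + "\n\n")  # one write per paragraph

print(repr(output_file.getvalue()))
```

Unlike the join approach, this also leaves a trailing separator after the last paragraph, which is one more detail you would have to handle yourself.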

A few other specific comments on your approach here:

Your original comment didn’t have the original regex (which was likely the main issue leading to @vgwosdz not getting any output), but I see you’ve fixed that. However, it inherits the issues I describe in my other reply, so it may still not yield any output.
This doesn’t add line breaks between paragraph blocks, so the output will likely be mangled. As such, you’d want to add \n\n to the concatenated data (or better, just use the join approach I outline).
str() is redundant, as f.read() on a text-mode file will by definition return a string.
The loop variable name i can be confusing, since it looks like a counter but is actually a string match. You might want to name it something more descriptive, like paragraph_text.

I’m not sure I understand your thinking here. As with most if not all I/O functions in Python, the file extension doesn’t affect how the file is actually processed, any more than the rest of the file name does; you’re already telling Python how you want the file to be interpreted by the function you call to open and parse it. And in this case, open() simply treats it as an arbitrary text file, i.e. a string of characters in some text encoding, so at least to open(), the format makes no difference anyway.

vgwosdz · March 16, 2022, 8:20pm

Thank you both for the very elaborate explanations. I will adjust my file later tonight (and post an update). I do understand the way the functions work a lot better now.

Indeed, I did try with BeautifulSoup (to be able to write the rest of the code) and that does work a lot better, but a part of the assignment was to clean up the html file. No worries, I will not copy code line for line, and give credit where credit is due ;-).

soil · March 16, 2022, 8:21pm

Thank you for your explanations, @CAM-Gerlach. And sorry for pointing you in the wrong direction, @vgwosdz. [I think I shouldn’t reply to these kinds of things anymore 😅.]

vgwosdz · March 16, 2022, 8:34pm

No problem, I learned a lot from your explanations as well. And I am always grateful for feedback (best way to learn).

CAM-Gerlach · March 16, 2022, 8:47pm

I didn’t mean to scare you away, @soil. Your reply actually did identify the biggest issues that were causing @vgwosdz’s problem, and probably would have even more or less worked if not for the regex issues. It is often a good idea to make sure you understand what’s going on yourself before giving a definitive answer, but it’s usually not so bad to suggest a possible fix to an unanswered question so long as you’ve tested it and are clear that you’re not 100% sure that it is the best solution. Helping answer others’ questions is actually a great way to learn yourself, even if you don’t post your answer at first.

soil · March 16, 2022, 8:50pm

Okay, thanks for both your understanding and your advice 😁. I will keep answering, with more careful sentences and code, so please don’t worry.
Then, have a good night (or day, depending on your location :))

vgwosdz · March 17, 2022, 2:18pm

I adjusted my code accordingly. At first, I still had an issue, as the new file would contain everything between the first opening paragraph tag and the last closing one.
Luckily, I found in the documentation links that this could be solved by putting the ‘?’ back in the regex.

I ended up with this (I am still reading up on the ‘with’ block, and I don’t want to use code I don’t understand):

import re
f = open("dutch1.html", "r", encoding='utf8') # open html file
f2 = open("dutch1_converted.txt", "a+",encoding='utf8') # open blank txt file to write cleaned html file to

html_input = f.read()
findall_matches = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)
joined_output_string = "\n\n".join(findall_matches)
f2.write(joined_output_string)

f.close()
f2.close()

The end result is just what I was looking for. Thanks so much!

CAM-Gerlach · March 17, 2022, 4:36pm

Ah, that makes sense—I forgot about that greedy vs. non-greedy behavior when using DOTALL combined with non-multiline mode. You helped me learn/remember a thing!
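To spell out the difference, here is a minimal sketch with a made-up input string containing two paragraphs and some other markup in between:

```python
import re

# Made-up input with non-paragraph content between two <p> blocks.
html_input = "<p>One</p>\n<div>skip me</div>\n<p>Two</p>"

# Greedy: .* grabs as much as possible, up to the LAST </p>.
greedy = re.findall("<p>(.*)</p>", html_input, flags=re.DOTALL)
# Non-greedy: .*? grabs as little as possible, up to the NEXT </p>.
non_greedy = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)

print(greedy)      # ['One</p>\n<div>skip me</div>\n<p>Two']
print(non_greedy)  # ['One', 'Two']
```

Without DOTALL the greedy version can only run to the end of the current line, which is why the problem only showed up once the flag was added.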

That’s a very good thing, as that helps you learn—if you just copy and paste code without understanding it, you don’t really learn anything, and once you need to change something or re-use the same ideas elsewhere, you are in trouble.

Put simply (if not quite completely), a with block lets you acquire a resource (e.g. open a file) when entering the block, and then releases it (e.g. closes the file) automatically when exiting the block no matter what, even if there is an error and your code doesn’t progress normally. This means you don’t have to worry about always cleaning up after yourself with manual f.close() statements everywhere (which are easy to forget, and don’t run if there’s an error, at least without using a try-finally).

Right now, you have this:

f = open("dutch1.html", "r", encoding="utf-8")
html_input = f.read()
f.close()

To make sure the file is closed even if one of the lines raises an error, you could do this (a try-finally block, if you’re not familiar, runs the code in the try block, and then always runs the finally block, whether the try block exited normally or raised an error):

f = open("dutch1.html", "r", encoding="utf-8")  # open before try, so f always exists in finally
try:
    html_input = f.read()
finally:
    f.close()

Instead, you can do the same thing by opening the file in a with block and closing it automatically:

with open("dutch1.html", "r", encoding="utf-8") as f:
    html_input = f.read()

As a bonus, this also helps keep the time you have the file open as short as possible (currently, both files are kept open for the whole program above, when they are only needed for one line), which conserves system resources and reduces the chances of certain issues.
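Applied to the whole program, that could look something like the sketch below. The function name extract_paragraphs and the temporary sample file are my own inventions so that the example runs on its own, and it opens the output with "w" (overwrite) rather than "a+" (append), which is a choice you may or may not want:

```python
import os
import re
import tempfile


def extract_paragraphs(input_path, output_path):
    """Copy the text of every <p>...</p> block into a plain-text file."""
    # The input file is only open while it is being read.
    with open(input_path, "r", encoding="utf-8") as f:
        html_input = f.read()

    findall_matches = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)

    # The output file is only open while it is being written.
    with open(output_path, "w", encoding="utf-8") as f2:
        f2.write("\n\n".join(findall_matches))
    return findall_matches


# Made-up sample data so the sketch runs on its own.
folder = tempfile.mkdtemp()
source = os.path.join(folder, "dutch1.html")
target = os.path.join(folder, "dutch1_converted.txt")
with open(source, "w", encoding="utf-8") as sample:
    sample.write("<p>Goedemorgen</p>\n<p>Tot\nziens</p>")

paragraphs = extract_paragraphs(source, target)
print(paragraphs)
```

With your real files, you would call extract_paragraphs("dutch1.html", "dutch1_converted.txt") instead of creating the sample data.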

As a more advanced aside, with statements use what are called context managers, which do something when you enter them (in open’s case, open a file), and something when you exit them (for open, close the file). You can actually do a lot more with them than just opening files or other resources, but files are by far the most common place you’ll run into them as a beginner.
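If you are curious, here is a toy context manager that only records what happens, as a rough sketch of the enter/exit behavior. The name announce and the events list are made up for illustration; contextlib.contextmanager is part of the standard library.

```python
from contextlib import contextmanager

events = []  # records the order in which things happen


@contextmanager
def announce(name):
    events.append("enter " + name)      # runs when the with block is entered
    try:
        yield name                      # the body of the with block runs here
    finally:
        events.append("exit " + name)   # runs on exit, even after an error


try:
    with announce("demo") as value:
        events.append("inside " + value)
        raise ValueError("something went wrong")
except ValueError:
    events.append("error handled")

print(events)
# ['enter demo', 'inside demo', 'exit demo', 'error handled']
```

The exit step runs before the error reaches the except clause, which is the same guarantee open() gives you for closing the file.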

tjreedy · October 6, 2022, 4:47am

Beginning every line with ‘*’ is a syntax error; if you added the noise, don’t.
This statement just rebinds the write attribute to the list returned by findall; it writes nothing.