You might be interested in this Stack Overflow answer about how to parse [X]HTML with regex.
In general, when you're doing this in the real world rather than for a school assignment like this one, it is best to use an actual HTML parser for any non-trivial parsing of HTML documents you do not control. Python has a simple one built in, `html.parser`, and there are many very popular third-party ones, such as BeautifulSoup.
However, for a very simple and controlled use case, regex may be sufficient. Since this appears to be a school assignment that requires you to use regex, I'm going to assume that is the case.
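For comparison, here is roughly what the parser-based route could look like with the built-in `html.parser`. This is only a minimal sketch: the class name `ParagraphCollector` and the sample markup are made up for illustration, and it assumes `<p>` elements are not nested.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._in_p = False     # True while we are inside a <p> element
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._chunks).strip())
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here.
        if self._in_p:
            self._chunks.append(data)


# Made-up sample input, not the real assignment file.
sample = '<p class="intro">Hallo <b>wereld</b></p><div>skip</div><p>Tweede\nalinea</p>'
collector = ParagraphCollector()
collector.feed(sample)
collector.close()
paragraphs = collector.paragraphs
```

Unlike a plain `<p>` pattern, this also copes with attributes on the opening tag and with tags nested inside the paragraph, which is a large part of why a parser is the safer choice for HTML you do not control.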
It would be very helpful to have an example of the file you are trying to parse. However, I'll do my best to guess here.
Yup, you are correct. I'll go line by line through the problems in the original code and in the modified version @soil posted, explaining how they contribute to the output you're seeing and how to fix them.
Everything looks okay here: you've opened `f` for reading and, importantly, specified `encoding="utf-8"`, which ensures your input file is read with the correct text encoding.
One note: instead of using assignment statements and a manual `f.close()`, you should use a `with` block, which closes the file for you automatically (even if your program crashes). So, for example:
```python
with open("dutch1.html", "r", encoding="utf-8") as f:
    input_contents = f.read()
```
See the Python tutorial for more information. However, that's not causing the immediate issue.
Here, as above, you're doing what your comment describes, which is more or less correct. Note that you need to use `encoding="utf-8"` here too, or your text will be written in an OS-dependent encoding (which may cause issues when reading it in other programs or on other machines). However, since you aren't getting any output at all, this doesn't appear to be your immediate issue. Also, as above, I suggest moving this into a `with` block below your `re.findall()` call and writing to the file in one go, like this:
```python
with open("dutch1_converted.txt", "w", encoding="utf-8") as f2:
    f2.write(extracted_paragraph_text)
```
But again, this isn't critical to your immediate problem; it will just avoid problems in the future.
Here's where your problems are. This isn't doing what you probably think it is, for multiple reasons:
- As @soil implied, `str(f)` does not give you the file contents as a string; rather, it gives you information about the file object. To read the contents from the file, you need to use `f.read()` (preferably inside a `with` block, as above).
- As @soil also mentioned, what you're doing above assigns the result of the regex to the `.write` attribute (in this case, a method) of the file. What you want is to call the `.write()` method with the string you want to write, i.e. `f2.write(your_output_string_here)`.
- Furthermore, @soil was also correct in inferring that you need to pass a string to the `.write()` method, not a list of strings. You need to decide, based on the parameters of your assignment, if and how you want to separate the strings in your output file. The simplest solution is to separate them with line breaks, which is fine if you don't need to process them individually later (if you do, you'd want to consider a separator that doesn't appear in the data).
  To do this, call the `.join()` method on the separator string (e.g. `"\n\n"` for two line breaks), passing the list of strings you want to join as the argument (e.g. `your_findall_output`, assuming you've assigned the output of `findall` to a variable of that name). So, you'd have `"\n\n".join(your_findall_output)`, which returns a single string with all of the paragraph blocks separated by two line breaks, which you can then pass to `f2.write()`.
- Also, your regex will correctly match any `<p>` elements that have their opening tag, content and closing tag completely on one line, with no line breaks. However, because `.` does not match `\n` (line break), it will not match `<p>` blocks that have any line breaks anywhere in or between the opening and closing tags. So if, in your input file, you have:
  ```html
  <p>
  Some text.
  </p>
  ```
  or
  ```html
  <p>This is some text.
  This is some more text.</p>
  ```
  ...it will not match. You can resolve this by passing [`flags=re.DOTALL`](https://docs.python.org/3/library/re.html#re.DOTALL) to `re.findall()`, which makes `.` (dot) match all characters including `\n`.
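The points in the list above can be tried out in a few lines. This is just a sketch: `io.StringIO` objects stand in for the real input and output files, the sample HTML is invented, and the pattern uses the non-greedy `.*?` form.

```python
import io
import re

# Invented sample input; the real dutch1.html may look different.
sample_html = "<p>One line.</p>\n<p>Spans\ntwo lines.</p>\n"

f = io.StringIO(sample_html)   # stands in for the opened input file
as_repr = str(f)               # describes the object, e.g. "<_io.StringIO object at 0x...>"
contents = f.read()            # the actual text of the "file"

# Without re.DOTALL, "." stops at line breaks, so only the one-line paragraph matches.
single_line_only = re.findall(r"<p>(.*?)</p>", contents)

# With re.DOTALL, "." also matches "\n", so both paragraphs are found.
all_paragraphs = re.findall(r"<p>(.*?)</p>", contents, flags=re.DOTALL)

# write() needs a single string, so join the list of matches first.
out = io.StringIO()            # stands in for the output file
out.write("\n\n".join(all_paragraphs))
result = out.getvalue()
```

Printing `as_repr`, `single_line_only` and `all_paragraphs` side by side shows which of the issues applies to your own file.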
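(Also, FYI, a note on the `?` in the regex: if it directly follows the `*`, as in `.*?`, it is not a redundant "zero or one"; it makes the `*` non-greedy, so each match ends at the first `</p>` it reaches. You need that with `re.DOTALL`, because a greedy `.*` would run from the first `<p>` to the last `</p>` in the file and return everything in between as a single match.)
Putting it all together, as a replacement for this line, you'd have something like:
```python
html_input = f.read()
findall_matches = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)
joined_output_string = "\n\n".join(findall_matches)
f2.write(joined_output_string)
```
or, all in one line (I separated it out above to make the logic easy to read and follow; `re.S` is just the short name for `re.DOTALL`):
```python
f2.write("\n\n".join(re.findall("<p>(.*?)</p>", f.read(), flags=re.S)))
```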
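One thing worth checking once `re.DOTALL` is on is whether the quantifier is greedy. The small comparison below uses an invented two-paragraph string to show the difference between `.*` and `.*?`.

```python
import re

# Invented sample with two paragraphs on separate lines.
html = "<p>First.</p>\n<p>Second.</p>"

# Greedy: ".*" grabs as much as it can, so the match runs from the
# first <p> all the way to the last </p>.
greedy = re.findall(r"<p>(.*)</p>", html, flags=re.DOTALL)

# Non-greedy: ".*?" stops at the first </p> it can, giving one match per paragraph.
lazy = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)

print(greedy)  # ['First.</p>\n<p>Second.']
print(lazy)    # ['First.', 'Second.']
```

If your output file ends up containing stray `</p>` and `<p>` tags, a greedy pattern is the likely cause.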
This should fix your immediate problem, though I still strongly suggest using `with` blocks around your input and output as discussed above, and using the correct encoding for the output file. If you still run into issues, a simple debugging strategy is to run your code step by step and inspect the output, or to split it into smaller chunks and use `print()` calls to show the result of each step, so you can figure out exactly where things go wrong.
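For reference, the pieces could be assembled into one small function along these lines. Treat it as a sketch under assumptions rather than the required solution: the function name `extract_paragraphs` is made up, the pattern assumes plain `<p>` tags with no attributes, and the `print()` calls are only there as the debugging aid described above.

```python
import re


def extract_paragraphs(input_path, output_path):
    """Copy the text of every <p>...</p> block from input_path to output_path."""
    # The with block closes the file even if an exception is raised.
    with open(input_path, "r", encoding="utf-8") as f:
        html_input = f.read()
    print(f"Read {len(html_input)} characters")  # debug: did we read anything?

    # Non-greedy match; re.DOTALL lets "." cross line breaks.
    findall_matches = re.findall(r"<p>(.*?)</p>", html_input, flags=re.DOTALL)
    print(f"Found {len(findall_matches)} paragraphs")  # debug: did the regex match?

    # Write the joined result in one go, with an explicit encoding.
    with open(output_path, "w", encoding="utf-8") as f2:
        f2.write("\n\n".join(findall_matches))
    return findall_matches
```

You would call it as `extract_paragraphs("dutch1.html", "dutch1_converted.txt")`. If the second `print()` reports zero paragraphs, the regex (or the structure of the input file) is the place to look.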