You might be interested in this Stack Overflow answer about how to parse [X]HTML with regex.
In general, when you’re doing this in the real world rather than a school assignment like this one, for non-trivial HTML parsing of documents you do not control, it is best to just use an actual HTML parser. Python has a simple one built in, html.parser, and there are many very popular third-party ones, such as BeautifulSoup.
However, for a very simple and controlled use case, regex may be sufficient. Given this appears to be a school assignment, and you are required to do so, I’m going to assume this is the case.
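For comparison, here is a rough sketch of what the parser-based approach could look like with the built-in `html.parser` module. The class name `ParagraphCollector` and the sample input are my own inventions for illustration, and note that it collects only the text inside each `<p>` element (dropping any nested tags), which may or may not be what your assignment wants.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._current = None   # text chunks while inside a <p>, otherwise None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            self.paragraphs.append("".join(self._current))
            self._current = None

    def handle_data(self, data):
        # Only keep text that appears between <p> and </p>.
        if self._current is not None:
            self._current.append(data)


collector = ParagraphCollector()
# Made-up sample input; attributes and line breaks inside <p> are handled.
collector.feed('<p class="intro">Eerste alinea.</p>\n<p>Tweede\nalinea.</p>')
collector.close()
print(collector.paragraphs)
```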
It would be very helpful to have an example of the file you are trying to parse. However, I’ll do my best to guess here.
Yup, you are correct. I’ll explain the problems line by line in the original code, and in the modified version @soil posted, how they contribute to the output you’re seeing, and how to fix it.
Everything looks okay here: you’ve opened `f` for reading and, importantly, specified `encoding="utf-8"`, which ensures your input file is read with the correct text encoding.
One note: instead of using assignment statements and a manual `f.close()`, you should use a `with` block to take care of that for you automatically (including if your program crashes). So, for example:
with open("dutch1.html", "r", encoding="utf-8") as f:
    input_contents = f.read()
However, that’s not causing the immediate issue. See the Python tutorial for more information.
Here, as above, you’re doing what your comment describes, which is more or less correct. Note that you need to use `encoding="utf-8"` here too, or your text will be written in an OS-dependent encoding (which may cause issues when trying to read it in other programs or on other machines). However, since you aren’t getting any output at all, this doesn’t appear to be your immediate issue. Also, like the above, I suggest moving this into a `with` block below your `re.findall()`, and writing to it in one go, like this:
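with open("dutch1_converted.txt", "w", encoding="utf-8") as f2:
    f2.write(output_text)  # output_text is a placeholder name for the single string you want to write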
But again, this isn’t critical to your immediate problem, it will just avoid problems in the future.
Here’s where your problems are—this isn’t doing what you probably think it is, for multiple reasons:
As @soil implied, `str(f)` does not give you the file contents as a string; rather, it gives you information about the file object. To read the contents from the file, you need to use `f.read()` (preferably inside a `with` block, as above).
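To see the difference for yourself, here is a small sketch that compares the two. The sample file and its contents are made up for the demonstration and are written to a temporary directory, so nothing on your machine is touched.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.html")
    with open(sample_path, "w", encoding="utf-8") as out_file:
        out_file.write("<p>Hallo wereld</p>")

    with open(sample_path, "r", encoding="utf-8") as in_file:
        as_str = str(in_file)      # a description of the file object
        contents = in_file.read()  # the actual text stored in the file

print(as_str)    # something like <_io.TextIOWrapper name='...' mode='r' encoding='utf-8'>
print(contents)  # the text that was written above
```

Running a regex over `as_str` can never find your paragraphs, because the file’s text simply isn’t in it.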
As @soil also mentioned, what you’re doing above assigns the result of the regex search to the `.write` attribute (in this case, a method) of the file. What you want is to call the `.write()` method, passing the string you want to write as its argument.
Furthermore, @soil was also correct in inferring that you need to pass a string to the `.write()` method, not a list of strings. You need to decide, based on the parameters of your assignment, if and how you want to separate the strings in your output file. The simplest solution is to just separate them with line breaks, which is fine if you don’t need to process them further individually (if you do, you’d want to consider a different separator that doesn’t appear in the data).
To do this, you can call the `.join()` method of strings, with the separator string (e.g. `"\n\n"` for two line breaks) as the object you’re calling it on and the list of strings you want to join into one as the argument (e.g. `your_findall_output`, assuming you’ve assigned the output of `findall` to a variable of that name). So, you’d have `"\n\n".join(your_findall_output)`, which will return a single string with all of the paragraph blocks separated by two line breaks, which you can then pass to `.write()`.
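As a quick illustration of why the join step is needed, here is a minimal sketch. The list contents are invented, and `io.StringIO` is used as an in-memory stand-in for your output file so the example runs without creating anything on disk.

```python
import io

# Stand-in for the list that re.findall() returns (invented contents).
your_findall_output = ["Eerste alinea.", "Tweede alinea."]

buffer = io.StringIO()  # in-memory stand-in for the output file

# Passing the list itself is rejected, because .write() only accepts a string.
try:
    buffer.write(your_findall_output)
    list_was_accepted = True
except TypeError:
    list_was_accepted = False

# Joining first produces one string, which .write() accepts.
joined = "\n\n".join(your_findall_output)
buffer.write(joined)

print(list_was_accepted)
print(buffer.getvalue())
```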
Also, your regex will correctly match any `<p>` elements that have their opening tag, content and closing tag completely on one line, with no line breaks. However, because `.` does not match `\n` (line break), it will not match `<p>` blocks that have any line breaks anywhere in or between the opening and closing tags. So if, in your input file, you have:
<p>This is some text.
This is some more text.</p>
...it will not match. You can resolve this by passing [`flags=re.DOTALL`](https://docs.python.org/3/library/re.html#re.DOTALL) to `re.findall()`, which makes `.` (dot) match all characters including `\n`.
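(Also, FYI: a `?` placed after a whole group, as in `(.*)?`, is redundant but not incorrect, since `*` already means zero or more matches. Directly after the `*`, as in `.*?`, it does something different: it makes the match non-greedy, so each match stops at the first `</p>` rather than the last one. With `re.DOTALL`, that is what you want as soon as the file contains more than one `<p>` block.)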
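Here is a small sketch of how the flag changes the result, using a made-up input string with one two-line block and one single-line block. It also compares a non-greedy `.*?` with a greedy `.*`, since that difference becomes visible once `.` is allowed to cross line breaks.

```python
import re

# Invented sample input: the first <p> block spans two lines.
html_text = "<p>This is some text.\nThis is some more text.</p>\n<p>Second block.</p>"

# Without re.DOTALL, "." stops at "\n", so the two-line block is skipped.
without_dotall = re.findall("<p>(.*?)</p>", html_text)

# With re.DOTALL, "." also matches "\n", so both blocks are found.
with_dotall = re.findall("<p>(.*?)</p>", html_text, flags=re.DOTALL)

# A greedy ".*" with re.DOTALL runs on to the last "</p>" and returns one big match.
greedy_dotall = re.findall("<p>(.*)</p>", html_text, flags=re.DOTALL)

print(without_dotall)      # only the single-line block
print(with_dotall)         # both blocks, as separate matches
print(len(greedy_dotall))  # a single match covering everything
```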
Putting it all together, as a replacement for this line, you’d have something like:
html_input = f.read()
# .*? is non-greedy, so each match ends at the nearest </p> instead of the last one in the file
findall_matches = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)
joined_output_string = "\n\n".join(findall_matches)
f2.write(joined_output_string)
or, all in one line (I separated it out above to make the logic easy to read and follow; `re.S` is the short alias for `re.DOTALL`):
f2.write("\n\n".join(re.findall("<p>(.*?)</p>", f.read(), flags=re.S)))
This should fix your immediate problem, though I still strongly suggest using `with` blocks around your input and output as discussed above, and using the correct encoding for the output file. If you still run into issues, a simple strategy to debug it is to run your code step by step and inspect the output, or split it up into smaller chunks and use `print()` calls to print the output of each step, to figure out exactly where you are running into issues.
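For reference, here is one way the whole thing could look with `with` blocks, explicit encodings, and a `print()` after each step for debugging. Treat it as a sketch under assumptions rather than the definitive solution: the function name `extract_paragraphs` is my own, the sample HTML is invented since I haven’t seen your real file, and everything is written to a temporary directory so the example runs on its own.

```python
import os
import re
import tempfile


def extract_paragraphs(input_path, output_path):
    """Write the contents of every <p>...</p> block to a text file (hypothetical helper)."""
    with open(input_path, "r", encoding="utf-8") as f:
        html_input = f.read()
    print("Read", len(html_input), "characters")  # debug: did we read anything?

    # Non-greedy match, with "." allowed to cross line breaks.
    findall_matches = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)
    print("Found", len(findall_matches), "paragraph blocks")  # debug: did the regex match?

    joined_output_string = "\n\n".join(findall_matches)
    with open(output_path, "w", encoding="utf-8") as f2:
        f2.write(joined_output_string)
    return findall_matches


with tempfile.TemporaryDirectory() as tmp_dir:
    input_path = os.path.join(tmp_dir, "dutch1.html")
    output_path = os.path.join(tmp_dir, "dutch1_converted.txt")

    # Invented sample input standing in for the real file.
    with open(input_path, "w", encoding="utf-8") as sample:
        sample.write("<html><body>\n<p>Goedemorgen.\nHoe gaat het?</p>\n<p>Tot ziens.</p>\n</body></html>")

    matches = extract_paragraphs(input_path, output_path)

    with open(output_path, "r", encoding="utf-8") as result:
        written_text = result.read()

print(written_text)
```

If one of the debug prints shows zero characters or zero matches, you know which step to look at first.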