By Vicky via Discussions on Python.org at 25Mar2022 21:34:
Hi, I am trying to write a small program that will consecutively open
all the files with a .html extension in a location, do some work on
them and then save them using the same name but with .txt extension.
This is what I have. The code doesn’t give any errors but the new files are not being created, and I have the impression that nothing happens.
Not quite. I recommend you put in some print calls. If nothing else, a
print(file) at the top of the loop will tell you which files are being
processed (or at least, which are supposed to be processed, as we will
see below).
The parser function in itself works (I have tried it on a single separate file and it works fine).
Personally, I would make parser() work on a single file, and call it
inside a loop. It simplifies the function a lot.
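A minimal sketch of that shape, to show the idea rather than the finished program. The parse_all() driver and the temporary directory are hypothetical stand-ins, and the body of parser() is only a placeholder:

```python
import tempfile
from pathlib import Path

def parser(html_filename):
    """Handle exactly one file; the real work would go here."""
    print(html_filename)  # shows which file is being processed
    return html_filename.name

def parse_all(source_dir):
    """Hypothetical driver: call parser() once per .html file in source_dir."""
    return [parser(p) for p in sorted(Path(source_dir).glob('*.html'))]

# Try it on a throwaway directory holding two HTML files and one other file.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ('a.html', 'b.html', 'notes.txt'):
        (Path(tmpdir) / name).write_text('', encoding='utf8')
    processed = parse_all(tmpdir)
```

Only the two .html files reach parser(); notes.txt is skipped by the glob pattern.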
However, to your actual code:
import re  # the regular expressions module is imported
from pathlib import Path
import os
source_dir = Path('/VGw_LanguageDetector')
files = source_dir.glob('*.html')
def parser():
    for file in files:
        f = open('file', 'a+', encoding='utf8')
        f2 = open('filename.txt', 'a+', encoding='utf8')  # creates a new file to write the output to
The big problem here is that you’re opening fixed file names, not
names from the loop. This:
f = open('file', 'a+', encoding='utf8')
opens a file called 'file', and not the file named by your variable
file. The same for this:
f2 = open('filename.txt', 'a+', encoding='utf8')  # creates a new file to write the output to
which opens the filename 'filename.txt', not a filename derived from
your variable file.
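Note that glob() on a Path yields Path objects, not strings, so the output name is best derived with Path methods; a plain + between a Path and a str raises TypeError. A small sketch, using a hypothetical file name (page.html is not from your directory):

```python
from pathlib import Path

html_filename = Path('/VGw_LanguageDetector/page.html')  # hypothetical file name

# Replace the extension: page.html -> page.txt
text_filename = html_filename.with_suffix('.txt')

# Append an extension instead, via the string form: page.html -> page.html.txt
appended_filename = Path(str(html_filename) + '.txt')

print(text_filename.name)
print(appended_filename.name)
```

Since you want the same name but with a .txt extension, with_suffix() is the closer fit.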
Something like this:
f = open(file, 'a+', encoding='utf8')
f2 = open(file.with_suffix('.txt'), 'a+', encoding='utf8')  # creates a new file to write the output to
I would use clearer variable names, such as “html_filename” and
“text_filename”, for example:
for html_filename in files:
    text_filename = html_filename.with_suffix('.txt')
    f = open(html_filename, 'a+', encoding='utf8')
    f2 = open(text_filename, 'a+', encoding='utf8')  # creates a new file to write the output to
That is your primary problem. However, there are a number of other
things to remark upon:
f = open(html_filename, 'a+', encoding='utf8')
This opens the file for append, in read/write mode. You just want to
read it:
f = open(html_filename, 'r', encoding='utf8')
This:
f2 = open(text_filename, 'a+', encoding='utf8')  # creates a new file to write the output to
also opens for append, in read/write mode. Whereas I presume you just
want to write it:
f2 = open(text_filename, 'w', encoding='utf8')  # creates a new file to write the output to
But there’s an even better way: exclusive write. This does not prevent
other writers, it just fails if the file already exists, avoiding
accidents:
f2 = open(text_filename, 'x', encoding='utf8')  # creates a new file to write the output to
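To see what exclusive mode does in practice, here is a small sketch that opens the same name twice. The temporary directory and the name example.txt are assumptions made only for the demonstration:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    text_filename = Path(tmpdir) / 'example.txt'  # hypothetical output name

    # The first open succeeds because the file does not exist yet.
    with open(text_filename, 'x', encoding='utf8') as f2:
        f2.write('first run\n')

    # The second open fails rather than overwriting the existing file.
    try:
        with open(text_filename, 'x', encoding='utf8') as f2:
            f2.write('second run\n')
    except FileExistsError:
        second_open_failed = True
    else:
        second_open_failed = False

    contents = text_filename.read_text(encoding='utf8')

print(second_open_failed, repr(contents))
```

The output of the first run survives untouched. The flip side is that rerunning your program means removing the old .txt files first, or catching FileExistsError.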
I notice you do not close f. It will get closed when you open the next
file, as a side effect of f no longer referring to the previous open
file: that file’s reference count becomes 0 (in CPython), and therefore
the file gets closed and freed from memory. But it is better to close
things reliably and promptly. The standard idiom for this is like this:
with open(html_filename, 'r', encoding='utf8') as f:
    html_input = f.read()  # creates a string by reading the HTML file into the program
This opens the file for read, reads the text into html_input, and then
closes it as soon as you exit the with-statement.
findall_matches = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)  # selects only the readable text
This is neat and simple. If you find yourself doing more complex HTML
parsing I recommend looking at the BeautifulSoup package, named
beautifulsoup4 on PyPI.
Also, as general practice with regular expression strings, I recommend
using “raw strings”:
findall_matches = re.findall(r"<p>(.*?)</p>", html_input, flags=re.DOTALL)  # selects only the readable text
That opening r" makes the expression a “raw string”, where backslashes
(\) are not special. Since regular expressions are usually littered
with backslashes, such as \d to match a digit, this avoids a lot of
painful doubling of backslashes.
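A tiny illustration of the difference; the sample text "p1 and p22" is made up for the example:

```python
import re

# Without the r prefix, each backslash has to be doubled.
plain = "\\d+"
# With the r prefix, the backslash is written once, as the regex expects.
raw = r"\d+"

same_pattern = (plain == raw)
digits = re.findall(raw, "p1 and p22")

print(same_pattern, digits)
```

Both spellings produce the same two-character-plus pattern; the raw string is simply easier to read and harder to get wrong.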
joined_output_string = "\n".join(findall_matches)  # joins all the found text together
text = re.sub('<.*?>|" +"', "", joined_output_string)  # replaces HTML info and blanks with a space
I would defer the open of f2 until here, and again close it
immediately:
with open(text_filename, 'x', encoding='utf8') as f2:  # creates a new file to write the output to
    f2.write(text)  # writes out the .txt file
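Putting the pieces above together, one possible version looks like the sketch below. It is not the only way to arrange it: the temporary directory and sample.html are stand-ins for your real source_dir, and parser() returning the output name is my own addition for the demonstration.

```python
import re
import tempfile
from pathlib import Path

def parser(html_filename):
    """Extract the <p> text from one HTML file into a sibling .txt file."""
    html_filename = Path(html_filename)
    text_filename = html_filename.with_suffix('.txt')
    with open(html_filename, 'r', encoding='utf8') as f:
        html_input = f.read()
    findall_matches = re.findall(r"<p>(.*?)</p>", html_input, flags=re.DOTALL)
    joined_output_string = "\n".join(findall_matches)
    text = re.sub(r'<.*?>|" +"', "", joined_output_string)
    # 'x' raises FileExistsError if the .txt file already exists.
    with open(text_filename, 'x', encoding='utf8') as f2:
        f2.write(text)
    return text_filename

# Demonstration in a temporary directory (a stand-in for the real source_dir).
with tempfile.TemporaryDirectory() as tmpdir:
    source_dir = Path(tmpdir)
    (source_dir / 'sample.html').write_text(
        '<html><body><p>Hello <b>world</b></p>\n<p>Second</p></body></html>',
        encoding='utf8')
    for html_filename in source_dir.glob('*.html'):
        print(html_filename)  # shows which file is being processed
        text_filename = parser(html_filename)
        result = text_filename.read_text(encoding='utf8')

print(result)
```

For the real run, point source_dir at your own directory and drop the lines that create sample.html.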
I hope this helps,
Cameron Simpson cs@cskk.id.au