Write data to same filename with different extension

vgwosdz · March 25, 2022, 9:24pm

Hi, I am trying to write a small program that will consecutively open all the files with a .html extension in a location, do some work on them and then save them using the same name but with .txt extension.

This is what I have. The code doesn’t give any errors but the new files are not being created, and I have the impression that nothing happens.

The parser function in itself works (I have tried it on a single separate file and it works fine). The path also seems to be working as a single file (filename.txt) is created, but it is empty.
Can anyone point me to what I am missing?

import re  # the regular expressions module is imported.
from pathlib import Path
import os

source_dir = Path('/VGw_LanguageDetector')
files = source_dir.glob('*.html')

def parser():
    for file in files:
        f = open('file', 'a+', encoding='utf8')
        f2 = open('filename.txt', 'a+', encoding='utf8')  # creates a new file to write the output to

        html_input = f.read()  # creates a string by reading the HTML file into the program

        findall_matches = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)  # selects only the readable text
        joined_output_string = "\n".join(findall_matches)  # joins all the text that was found
        text = re.sub('<.*?>|" +"', "", joined_output_string)  # replaces HTML info and blanks with a space
        f2.write(text)  # writes out the txt file
        f2.close()

parser()

Thanks!

soil · March 25, 2022, 9:47pm

Hello, @vgwosdz. I think this is not exactly what you wanted:

f = open('file', 'a+', encoding='utf8')

This line creates a new file named “file”, because 'file' in quotes is a string, not a variable name. As I understand it, you want to open an existing file instead.
You can change that line as below:

f = open(file, 'a+', encoding='utf8')

And, you can also change this line:

f2 = open('filename.txt', 'a+', encoding='utf8')

Like this:

f2 = open(file[:-5]+".txt", 'a+', encoding='utf8')

Here, file[:-5] returns the name of the file without its extension, so this creates a file with the same name but a different extension. (“5” is the length of the string “.html”.)
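
A minimal sketch of the slicing idea, assuming the name is a plain string (the value "page.html" is made up for illustration). If the loop variable comes from Path.glob, it would need converting with str() first, since slicing is a string operation.

```python
# Hypothetical file name, used only for illustration.
html_name = "page.html"

# ".html" is 5 characters long, so [:-5] drops the extension.
txt_name = html_name[:-5] + ".txt"
print(txt_name)  # page.txt
```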

cameron · March 25, 2022, 10:01pm

By Vicky via Discussions on Python.org at 25Mar2022 21:34:

Hi, I am trying to write a small program that will consecutively open
all the files with a .html extension in a location, do some work on
them and then save them using the same name but with .txt extension.

This is what I have. The code doesn’t give any errors but the new files are not being created, and I have the impression that nothing happens.

Not quite. I recommend you put in some print calls. If nothing else, a
print(file) at the top of the loop will tell you which files are being
processed (or at least, are supposed to be being processed, as we will
see below).

The parser function in itself works (I have tried it on a single separate file and it works fine).

Personally, I would make parser() work on a single file, and call it
inside a loop. It simplifies the function a lot.
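
One way that restructuring might look, as a rough sketch: the function name parse_html_file, the sample file names and the temporary directory are all made up for illustration, and the tag-stripping step is left out to keep it short.

```python
import re
import tempfile
from pathlib import Path


def parse_html_file(html_path):
    """Hypothetical single-file parser: return the paragraph text of one HTML file."""
    with open(html_path, "r", encoding="utf8") as f:
        html_input = f.read()
    paragraphs = re.findall(r"<p>(.*?)</p>", html_input, flags=re.DOTALL)
    return "\n".join(paragraphs)


# Demo on a throwaway directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp)
    (source_dir / "a.html").write_text("<p>one</p><p>two</p>", encoding="utf8")
    (source_dir / "b.html").write_text("<p>three</p>", encoding="utf8")

    results = {}
    for html_path in sorted(source_dir.glob("*.html")):
        print(html_path)  # shows which file is being processed
        results[html_path.name] = parse_html_file(html_path)
```

The loop now only decides which files to process, and the function can be tested on one file at a time.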

However, to your actual code:

import re  # the regular expressions module is imported.
from pathlib import Path
import os

source_dir = Path('/VGw_LanguageDetector')
files = source_dir.glob('*.html')

def parser():
    for file in files:
        f = open('file', 'a+', encoding='utf8')
        f2 = open('filename.txt', 'a+', encoding='utf8')  # creates a new file to write the output to

The big problem here is that you’re opening fixed file names, not
names from the loop. This:

f = open('file', 'a+', encoding='utf8')

opens a file called 'file', and not the file named by your variable
file. The same for this:

f2 = open('filename.txt', 'a+', encoding='utf8')  # creates a new file to write the output to

which opens the filename 'filename.txt', not a filename derived from
your variable file.

Something like this:

f = open(file, 'a+', encoding='utf8')
f2 = open(file + '.txt', 'a+', encoding='utf8')  # creates a new file to write the output to

I would use clearer variable names, such as “html_filename” and
“text_filename”, for example:

for html_filename in files:
    text_filename = html_filename + '.txt'
    f = open(html_filename, 'a+', encoding='utf8')
    f2 = open(text_filename, 'a+', encoding='utf8')  # creates a new file to write the output to

That is your primary problem. However, there are a number of other
things to remark upon:

f = open(html_filename, 'a+', encoding='utf8')

This opens the file for append, in read/write mode. You just want to
read it:

f = open(html_filename, 'r', encoding='utf8')
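
A small sketch of why the mode matters here, using a throwaway temporary file (the file and its content are made up for illustration). In 'a+' mode the stream starts at the end of the file, so an immediate read() comes back empty, which would explain the empty output.

```python
import os
import tempfile

# Create a throwaway file with known content.
fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)
with open(path, "w", encoding="utf8") as f:
    f.write("<p>hello</p>")

# 'a+' positions the stream at the end of the file, so read() finds nothing.
with open(path, "a+", encoding="utf8") as f:
    read_in_append_mode = f.read()

# 'r' starts at the beginning and returns the content.
with open(path, "r", encoding="utf8") as f:
    read_in_read_mode = f.read()

os.remove(path)
print(repr(read_in_append_mode), repr(read_in_read_mode))
```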

This:

f2 = open(text_filename, 'a+', encoding='utf8')  # creates a new file to write the output to

also opens for append, in read/write mode, whereas I presume you just
want to write it:

f2 = open(text_filename, 'w', encoding='utf8')  # creates a new file to write the output to

But there’s an even better way: exclusive write. This does not prevent
other writers, it just fails if the file already exists, avoiding
accidents:

f2 = open(text_filename, 'x', encoding='utf8')  # creates a new file to write the output to
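
As a quick illustration of exclusive mode (the output name page.txt and the temporary directory are made up for this sketch), the first open succeeds and the second raises FileExistsError, leaving the existing content untouched:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    text_filename = os.path.join(tmp, "page.txt")  # hypothetical output name

    # First attempt: the file does not exist yet, so 'x' creates it.
    with open(text_filename, "x", encoding="utf8") as f2:
        f2.write("first run")

    # Second attempt: the file exists, so open() raises FileExistsError.
    try:
        with open(text_filename, "x", encoding="utf8") as f2:
            f2.write("second run")
        second_open_failed = False
    except FileExistsError:
        second_open_failed = True

    with open(text_filename, "r", encoding="utf8") as f:
        content = f.read()
```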

I notice you do not close f. In CPython it will get closed when you open
the next file, as a side effect of f no longer referring to the previous
open file: that file’s reference count becomes 0, and therefore the file
gets closed and freed from memory. But it is better to close
things reliably and promptly. The standard idiom for this is:

with open(html_filename, 'r', encoding='utf8') as f:
    html_input = f.read()  # creates a string by reading the HTML file into the program

This opens the file for read, reads the text into html_input, and then
closes it as soon as you exit the with-statement.
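
To see the closing behaviour for yourself, here is a tiny sketch on an empty throwaway file (the file is created only for this demonstration); f.closed reports whether the file object is still open.

```python
import os
import tempfile

# Empty temporary file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)

with open(path, "r", encoding="utf8") as f:
    html_input = f.read()
    open_inside = not f.closed  # still open inside the block

closed_after = f.closed  # closed once the block exits
os.remove(path)
```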

findall_matches = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)  # selects only the readable text

This is neat and simple. If you find yourself doing more complex HTML
parsing I recommend looking at the BeautifulSoup package, published on
PyPI as beautifulsoup4.

Also, as general practice with regular expression strings, I recommend
using “raw strings”:

findall_matches = re.findall(r"<p>(.*?)</p>", html_input, flags=re.DOTALL)  # selects only the readable text

That opening r" makes the expression a “raw string”, where backslashes
(\) are not special. Since regular expressions are usually littered
with backslashes, such as \d to match a digit, this avoids a lot of
painful doubling of backslashes.
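
A short comparison, using \d+ and a sample sentence purely as illustrations: both spellings below produce the same pattern, but the raw string reads the way the regular expression is meant.

```python
import re

# Both patterns are the same regular expression: one or more digits.
escaped = "\\d+"  # ordinary string: the backslash must be doubled
raw = r"\d+"      # raw string: written as the regex reads

same_pattern = escaped == raw
numbers = re.findall(raw, "pages 12 and 345")
```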

joined_output_string = "\n".join(findall_matches)  # joins all the text that was found
text = re.sub('<.*?>|" +"', "", joined_output_string)  # replaces HTML info and blanks with a space

I would defer opening f2 until here, and again close it
immediately:

with open(text_filename, 'x', encoding='utf8') as f2:  # creates a new file to write the output to
    f2.write(text)  # writes out the txt file

I hope this helps,
Cameron Simpson cs@cskk.id.au

CAM-Gerlach · March 25, 2022, 10:20pm

Thanks for providing a clear and complete description of your problem and of the expected and actual behavior you’re getting, and for including your nicely formatted code in a code block. This really helps us give you a better answer to your question.

Also, seeing as you speak Dutch, you have an advantage on all of us in understanding Python (and if you don’t believe me, import this)

As I was typing this answer, @soil posted one that correctly identified the main issue: you’re opening a file named file in the current directory and reading from that, rather than from the file you intend (named by the variable file), and [EDIT: I see @soil updated their comment to add this] appending each HTML file’s contents to a file named filename.txt in the current directory. In addition, @cameron's reply, also posted while I was writing this, already addressed some of the other points I was going to make: using the correct file mode for each operation (which would have let you catch both errors much more easily), ensuring your files are closed and deferring open until needed (using with blocks), and considering an actual HTML parser like BS4.

However, neither of their solutions for renaming the file in the write case actually works, since file (or html_filename, in @cameron's example) is not a string but a pathlib.Path object. (Even if it were a string, @soil's solution is rather hacky, difficult to read and potentially fragile, while @cameron's would not first remove the .html extension, as specified in your problem statement and expected output.)

This is a good thing (and you’re smart to use pathlib), since it allows for a simpler, easier and more robust solution. You can just use file.with_suffix('.txt') (or html_filename.with_suffix(".txt"), with @cameron's variable name) to get the same path with a .txt extension instead of .html, i.e.

    text_filename = html_filename.with_suffix('.txt')
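
A quick sketch of both points, using a made-up path. PurePosixPath is used here only so the printed result looks the same on any operating system; with_suffix behaves the same way on the Path objects that glob returns.

```python
from pathlib import PurePosixPath

# Hypothetical path, used only for illustration.
html_filename = PurePosixPath("/VGw_LanguageDetector/page.html")

text_filename = html_filename.with_suffix(".txt")
print(text_filename)  # /VGw_LanguageDetector/page.txt

# String concatenation does not work on path objects.
try:
    html_filename + ".txt"
    concat_failed = False
except TypeError:
    concat_failed = True
```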

Also, since you’re using pathlib, you can actually use a simpler approach to open, read/write and close the file all in one line, without having to use a with block:

for html_filename in files:
    html_input = html_filename.read_text(encoding="utf-8")
    # Process HTML input into `text_output`
    html_filename.with_suffix(".txt").write_text(text_output, encoding="utf-8")
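
Filling in the processing step, the whole loop might look roughly like the sketch below. The temporary directory and sample HTML are stand-ins for your real files, and the tag-stripping pattern is a simplified version of the one in your post, so treat it as an illustration rather than a drop-in replacement.

```python
import re
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp)  # stand-in for the real source directory
    (source_dir / "page.html").write_text(
        "<html><p>Hello <b>world</b></p><p>Bye</p></html>", encoding="utf-8"
    )

    for html_filename in source_dir.glob("*.html"):
        html_input = html_filename.read_text(encoding="utf-8")
        paragraphs = re.findall(r"<p>(.*?)</p>", html_input, flags=re.DOTALL)
        # Remove tags left inside the paragraphs (simplified pattern).
        text_output = re.sub(r"<.*?>", "", "\n".join(paragraphs))
        html_filename.with_suffix(".txt").write_text(text_output, encoding="utf-8")

    result = (source_dir / "page.txt").read_text(encoding="utf-8")
```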

vgwosdz · March 27, 2022, 6:43pm

Thanks all! It’s probably not perfect yet, but I think I understand the “with” block better now. And this part of the code is working like a charm. This is what I ended up with:

"""
This program reads in an HTML file, removes all the non-text elements and writes
the result out as a txt file.
"""

import re  # the regular expressions module is imported.
from pathlib import Path
import this

source_dir = Path('C:/Users/gwovi/PycharmProjects/VickyGwosdz_LanguageDetector')
files = source_dir.glob('*.html')

for file in files:
    with open(file.with_suffix('.html'), 'r', encoding='utf-8') as f:
        html_input = f.read()

    findall_matches = re.findall("<p>(.*?)</p>", html_input, flags=re.DOTALL)  # selects only the readable text
    joined_output_string = "\n".join(findall_matches)  # joins all the text that was found
    text = re.sub('<.*?>|" +"', "", joined_output_string)  # replaces HTML info and blanks with a space

    with open(file.with_suffix('.txt'), 'x', encoding='utf-8') as f2:
        f2.write(text)

CAM-Gerlach · March 27, 2022, 9:31pm

Great! Looks good overall. As mentioned above, now that you have a basic understanding of with blocks, and given your HTML files already have a .html suffix (as your glob ensures), you can replace

    with open(file.with_suffix('.html'), 'r', encoding='utf-8') as f:
        html_input = f.read()

with just

    html_input = file.read_text(encoding='utf-8')

which automatically opens the file, reads the text and closes it.

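Likewise, you can replace

    with open(file.with_suffix('.txt'), 'x', encoding='utf-8') as f2:
        f2.write(text)

with just

    file.with_suffix('.txt').write_text(text, encoding='utf-8')

Note that write_text overwrites an existing file, whereas 'x' mode refuses to.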

Also,

    import this

You might not mean to include this in your final code.

vgwosdz · March 28, 2022, 3:52am

I will leave it out, but for now it makes me smile every time I run a part of the code.