Renaming and saving text files based on their content

SamCem · February 22, 2024, 7:37pm

Hi,
I’ve got hundreds of .html files from a forum, and I want to rename and save them in the following way. Each file contains, among many other lines, the following line:

<td align="left" valign="middle" class="nav" width="100%"><span class="nav"><a href="index.php?sid=a6ddafec8f3ed8a0a7cf4f8bf8273cff" class="nav">ONE</a><a href="./index.php" class="nav">TWO</a>&nbsp;&raquo;&nbsp;<a href="./viewforum.php" class="nav">THREE</a><a href="./viewtopic.php" class="nav">FOUR</a></span></td>

My goals (THREE and FOUR are just placeholders here and differ in each file):

Rename each file to “FOUR.html”
Save the file in the directory “Forum/THREE”

What is a smart way to do this using Python on Windows 10?
Thank you very much!

elis.byberi · February 22, 2024, 7:46pm

The questions are:

How to rename a file in Python?
How to move a file in Python?

You will get plenty of search results if you try these queries in any search engine.
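
For reference, both questions can be answered with one standard-library call. This is a minimal sketch, not the only way to do it: `move_and_rename` is a hypothetical helper name, and refusing to overwrite an existing target is an assumption about what is wanted here.

```python
import shutil
from pathlib import Path


def move_and_rename(src: Path, dest_dir: Path, new_name: str) -> Path:
    """Move src into dest_dir under new_name and return the new path."""
    # Create the target directory (and any missing parents) if needed.
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / new_name
    # Assumption: silently overwriting an existing file is not wanted.
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    # shutil.move renames and relocates in a single step.
    shutil.move(str(src), str(target))
    return target
```

For example, `move_and_rename(Path("page1.html"), Path("Forum") / "THREE", "FOUR.html")` would leave the file at `Forum/THREE/FOUR.html` (the name `page1.html` is made up for illustration).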

Rosuav · February 22, 2024, 7:54pm

It sounds like what you’re trying to do is parse HTML files to look for key pieces of information? If that’s the case, I recommend BeautifulSoup - it’s the easiest way to navigate a puddle of tags and find something useful in them.
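
A rough sketch of what that could look like for the line in the question, assuming the `beautifulsoup4` package is installed. `extract_names` is a hypothetical helper, and it assumes the forum link is the one whose `href` contains `viewforum.php` and the topic link is the one containing `viewtopic.php`, as in the posted sample; real pages may differ.

```python
from bs4 import BeautifulSoup


def extract_names(html: str) -> tuple[str, str]:
    """Return (forum_name, topic_name) from the navigation links."""
    # "html.parser" is the parser bundled with Python; no extra install needed.
    soup = BeautifulSoup(html, "html.parser")
    forum = topic = None
    # Look at every <a class="nav"> and pick the two links by their href.
    for link in soup.find_all("a", class_="nav"):
        href = link.get("href", "")
        if "viewforum.php" in href:
            forum = link.get_text(strip=True)
        elif "viewtopic.php" in href:
            topic = link.get_text(strip=True)
    if forum is None or topic is None:
        raise ValueError("navigation links not found")
    return forum, topic
```

The two returned strings are then the directory name and the file name. Forum and topic titles can contain characters that Windows does not allow in file names (such as `?` or `:`), so they may need cleaning before use.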

elis.byberi · February 22, 2024, 8:56pm

I missed the HTML part of the question, as Rosuav pointed out.

You can even use regex to achieve what you want without parsing the HTML file.

Here is an example:


import re

html = '''
<td align="left" valign="middle" class="nav" width="100%">
    <span class="nav">
        <a href="index.php?sid=a6ddafec8f3ed8a0a7cf4f8bf8273cff" class="nav">ONE</a>
        <a href="./index.php" class="nav">TWO</a>&nbsp;&raquo;&nbsp;
        <a href="./viewforum.php" class="nav">THREE</a>
        <a href="./viewtopic.php" class="nav">FOUR</a>
    </span>
</td>
'''

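# Use regex to find the viewtopic <a> tag and capture its text up to the closing </a>.
# The dots are escaped so they match a literal "." rather than any character.
match = re.search(r'<a\s+href="\./viewtopic\.php"\s+class="nav">([^<]*)</a>', html)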

# Print the extracted text
if match:
    print(match.group(1))
else:
    print("Pattern not found.")
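
To apply the same idea to a whole folder, something like the following might work. It is only a sketch under several assumptions: `sort_files` and `safe_name` are hypothetical names, the patterns assume the links look like the sample line (with optional extra text in the `href`), files are read as UTF-8 with undecodable bytes replaced, and characters that Windows forbids in file names are swapped for underscores.

```python
import re
import shutil
from html import unescape
from pathlib import Path

# Patterns modelled on the sample line; other page layouts would need changes.
FORUM_RE = re.compile(r'<a\s+href="[^"]*viewforum\.php[^"]*"\s+class="nav">([^<]*)</a>')
TOPIC_RE = re.compile(r'<a\s+href="[^"]*viewtopic\.php[^"]*"\s+class="nav">([^<]*)</a>')
# Characters that Windows does not allow in file or directory names.
ILLEGAL = re.compile(r'[<>:"/\\|?*]')


def safe_name(text: str) -> str:
    """Decode HTML entities and replace characters Windows rejects."""
    cleaned = ILLEGAL.sub("_", unescape(text)).strip().rstrip(". ")
    return cleaned or "untitled"


def sort_files(source_dir: Path, forum_root: Path) -> list[Path]:
    """Move each *.html file in source_dir to forum_root/THREE/FOUR.html."""
    moved = []
    # sorted() builds the full list first, so moving files is safe while looping.
    for path in sorted(source_dir.glob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        forum = FORUM_RE.search(text)
        topic = TOPIC_RE.search(text)
        if not (forum and topic):
            print(f"Skipping {path.name}: pattern not found.")
            continue
        dest_dir = forum_root / safe_name(forum.group(1))
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / (safe_name(topic.group(1)) + ".html")
        if target.exists():
            # Two topics with the same title: leave the second file in place.
            print(f"Skipping {path.name}: {target} already exists.")
            continue
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Calling `sort_files(Path("downloads"), Path("Forum"))` would process every `.html` file directly inside `downloads` (a made-up folder name). Trying it on a copy of the files first is a sensible precaution.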

Rosuav · February 22, 2024, 9:44pm

I advise against that as a matter of course. BS4 isn’t hard to use; just do the job properly.

bschubert · February 23, 2024, 1:26am

Obligatory:

cameron · February 23, 2024, 2:50am

ObXKCD: 208: Regular Expressions - explain xkcd

But I agree, regexps are not for HTML (though I myself have gone down that path). They can be a quick’n’dirty way to scan a known page for expected content. Still, BS4 (beautifulsoup4 on PyPI) is easy to use and a FAR FAR better tool.