Renaming and saving text files based on their content

Hi,
I’ve got hundreds of .html files from a forum, and I want to rename and save them in the following way. Each file contains, among many other lines, the following line:

<td align="left" valign="middle" class="nav" width="100%"><span class="nav"><a href="index.php?sid=a6ddafec8f3ed8a0a7cf4f8bf8273cff" class="nav">ONE</a><a href="./index.php" class="nav">TWO</a>&nbsp;&raquo;&nbsp;<a href="./viewforum.php" class="nav">THREE</a><a href="./viewtopic.php" class="nav">FOUR</a></span></td>

My goals (THREE and FOUR are just placeholders here and differ in each file):

  1. Rename each file to “FOUR.html”
  2. Save the file in the directory “Forum/THREE”

How is this possible in a smart way using Python on Windows 10?
Thank you very much!

The questions are:

  1. How to rename a file in Python?
  2. How to move a file in Python?

You would get plenty of search results if you try these queries in any search engine.
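For the rename-and-move half, here is a minimal sketch using only the standard library. The helper names `safe_name` and `move_and_rename` are made up for illustration, and it assumes the forum name and topic title have already been extracted from the file. Since topic titles can contain characters that Windows does not allow in file names (such as `?` or `:`), the sketch replaces them with underscores; that replacement policy is my assumption, not something from the question.

```python
import re
import shutil
from pathlib import Path

# Characters that Windows does not allow in file or directory names.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(text: str) -> str:
    """Replace characters Windows rejects and trim trailing dots/spaces."""
    cleaned = _INVALID.sub("_", text).strip().rstrip(". ")
    return cleaned or "untitled"


def move_and_rename(src: Path, root: Path, subdir: str, title: str) -> Path:
    """Move src to root/subdir/title<ext>, creating the directory if needed."""
    target_dir = root / safe_name(subdir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (safe_name(title) + src.suffix)
    # Refuse to overwrite: two topics could share the same title.
    if target.exists():
        raise FileExistsError(target)
    shutil.move(str(src), str(target))
    return target
```

Calling `move_and_rename(Path("page1.html"), Path("Forum"), "THREE", "FOUR")` would then leave the file at `Forum/THREE/FOUR.html`. Raising on an existing target is a deliberate choice here, because with hundreds of files a duplicate title is quite possible.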

It sounds like what you’re trying to do is parse HTML files to look for key pieces of information? If that’s the case, I recommend BeautifulSoup - it’s the easiest way to navigate a puddle of tags and find something useful in them.
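As a rough sketch of what that could look like for the line in the question: the function name `forum_and_topic` is hypothetical, and it assumes that in the real files the forum link's `href` starts with `./viewforum.php` and the topic link's `href` starts with `./viewtopic.php`, as in the sample. It needs the `beautifulsoup4` package and uses Python's built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup


def forum_and_topic(html: str) -> tuple[str, str]:
    """Return (forum name, topic title) from the breadcrumb links."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS attribute selectors: href *starting with* the given prefix, so a
    # query string after the file name (if any) would still match.
    forum = soup.select_one('a.nav[href^="./viewforum.php"]')
    topic = soup.select_one('a.nav[href^="./viewtopic.php"]')
    if forum is None or topic is None:
        raise ValueError("breadcrumb links not found")
    return forum.get_text(strip=True), topic.get_text(strip=True)
```

To use it on a file, something like `forum_and_topic(path.read_text(encoding="utf-8", errors="replace"))` should work; the encoding is a guess, since older forum exports are not always UTF-8.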


I missed the HTML part of the question, as Rosuav pointed out.

You can even use regex to achieve what you want without parsing the HTML file.

Here is an example:

import re

html = '''
<td align="left" valign="middle" class="nav" width="100%">
    <span class="nav">
        <a href="index.php?sid=a6ddafec8f3ed8a0a7cf4f8bf8273cff" class="nav">ONE</a>
        <a href="./index.php" class="nav">TWO</a>&nbsp;&raquo;&nbsp;
        <a href="./viewforum.php" class="nav">THREE</a>
        <a href="./viewtopic.php" class="nav">FOUR</a>
    </span>
</td>
'''

# Use regex to find the viewtopic <a> tag and capture its text up to the next "<".
# The dots in the href are escaped so they match a literal "." only.
match = re.search(r'<a\s+href="\./viewtopic\.php"\s+class="nav">([^<]*)</', html)

# Print the extracted text
if match:
    print(match.group(1))
else:
    print("Pattern not found.")

I advise against that as a matter of course. BS4 isn’t hard to use; just do the job properly.


Obligatory:


ObXKCD: xkcd 208, “Regular Expressions” (explain xkcd)

But I agree, regexps are not for HTML (though I myself have gone down that path). They can be a quick’n’dirty way to scan a known page for expected content. Still, BS4 (beautifulsoup4 on PyPI) is easy to use and a FAR, FAR better tool.