Combining HTML file and associated folder into one directory

ericlindelldotnet · December 7, 2022, 7:06pm

POST AT DISCUSS.PYTHON.ORG
I have downloaded many webpages, each saved as an HTML file plus an associated folder. The folder has the same name as the HTML file, except with “_files” appended. For example,

sample.html
sample_files

I’d like to move both of these into a new folder named sample.

This would permanently associate the two objects by placing them in the same folder. A new folder should be created for each matching pair, named after the HTML file without its extension. If a folder contains only one of the two (the HTML file without its _files directory, or vice versa), no new directory should be created.

I have a bash script to put each html file in its own directory, but not the associated folder, as follows . . .

find . -type f -name '*.html' -exec sh -c '
  for f; do mkdir -p -- "${f%.*}" && mv -v -- "$f" "${f%.*}" ; done' _ {} +

I am learning Python and find bash more difficult, so I would prefer a Python solution (but either is acceptable).

I can use os.walk to go through the top-level folder containing all the other directories and HTML files. But I get mixed up when it comes to placing these two objects in a new parent folder. I seem to be operating on two levels at once.

This is the code that didn’t work.

#!/usr/bin/env python
import os
import string
import shutil
from os.path import splitext
from pathlib import Path

##=## RUN FROM FOLDER WITH ITEMS TO BE COMPRESSED ##=##
for root, dirs, files in os.walk('/Volumes/HighSierra/Users/ericlindell/Documents/testTTS-combine/'):

    # CHECK ALL FILES IN FOLDER FOR MP3 EXTENSION
    for checkFileHTML in files:
      
        # IF IT IS AN HTML FILE
        if checkFileHTML.endswith('.HTML'):
            HTMLFile = checkFileHTML
            HTMLFileNoExt = checkFileHTML[:-5]

            # ITERATE OVER FOLDERS
            for checkDir in dirs:

                # FIND DIR ENDING _files
                if checkDir.endswith('_files'):
                    checkDirAbbrev = checkDir[:-6]
                    if checkDirAbbrev == HTMLFileNoExt:

                        os.makedirs(checkDirAbbrev)

                        # MOVE ZIP INTO NEWDIR
                        filePath = os.path.join(root, checkHTML)
                        shutil.move(filePath, HTMLFileNoExt)

                        # MOVE HTML FILE INTO NEWDIR
                        filePath = os.path.join(root, checkDirAbbrev)
                        shutil.move(filePath, checkDirAbbrev)

Any help much appreciated !!

MRAB · December 7, 2022, 7:23pm

This example code might help:

from os import scandir

files = set()
folders = set()

for entry in scandir('/Volumes/HighSierra/Users/ericlindell/Documents/testTTS-combine/'):
    if entry.is_file() and entry.name.endswith('.HTML'):
        # Collect the name of the HTML file without the '.HTML'.
        files.add(entry.name.removesuffix('.HTML'))
    elif entry.is_dir() and entry.name.endswith('_files'):
        # Collect the name of the folder without the '_files'.
        folders.add(entry.name.removesuffix('_files'))

# Which names occur in both the set of files and the set of folders?
common = files & folders

for name in common:
    file_name = name + '.HTML'
    folder_name = name + '_files'
    print(f'Found a file called {file_name} and a folder called {folder_name}')

ericlindelldotnet · December 8, 2022, 1:29am

The use of sets here is very helpful. It simplifies things.
I am trying to loop recursively over the top-level folder.
I think os.walk may be what I need to use.
scandir appears to list only the top level of the directory.
Thanks for the suggestions !!

MRAB · December 8, 2022, 2:01am

A variation on that, which avoids having to reconstruct the file and folder names, is to use dicts:

from os import scandir

files = {}
folders = {}

for entry in scandir('/Volumes/HighSierra/Users/ericlindell/Documents/testTTS-combine/'):
    if entry.is_file() and entry.name.endswith('.HTML'):
        # Collect the file.
        files[entry.name.removesuffix('.HTML')] = entry.name
    elif entry.is_dir() and entry.name.endswith('_files'):
        # Collect the folder.
        folders[entry.name.removesuffix('_files')] = entry.name

# Which names occur in both the dict of files and the dict of folders?
common = files.keys() & folders.keys()

for name in common:
    print(f'Found a file called {files[name]} and a folder called {folders[name]}')

ericlindelldotnet · May 19, 2023, 9:51pm

This code is very helpful in identifying the html file and its associated folder. The original question asked how to create the new folder and then put them both inside of it.

My issue is one of levels. I don’t know how to operate on the current level and that of the parent folder at the same time.

What would be the final command in this script to accomplish this? Thanks

MRAB · May 20, 2023, 12:10am

Continuing on from my last post, try this (untested):

from os import mkdir, rename
from os.path import basename, join

parent_folder = '/Volumes/HighSierra/Users/ericlindell/Documents/testTTS-combine/'

for name in common:
    # Make the subfolder for the HTML file and associated folder.
    mkdir(join(parent_folder, name))

    # Move the HTML file into the subfolder.
    rename(files[name], join(parent_folder, basename(files[name])))

    # Move the associated folder into the subfolder.
    rename(folders[name], join(parent_folder, basename(folders[name])))

ericlindelldotnet · July 20, 2024, 4:23pm

Thanks again for a great reply, & sorry it’s taken me this long to respond.
I’ve tried this code and found that the comparison between HTML file and associated folder is not finding a match because the HTML is not being stripped, and neither is the _files at end of folder name.

I’ve also placed print statements in the code to try to locate the error. I’m a bit baffled by this line, which appears to be trying to remove the .HTML suffix but failing:
files[entry.name.removesuffix('.HTML')] = entry.name

It’s also possible I didn’t combine your two posts correctly into one script. This is what I have:

files = {}
folders = {}
for entry in scandir('/Users/ericlindell/testHtmlDirInSameFolder/HnDirs'):
    if entry.is_file() and entry.name.endswith('.html'):
        # Collect the file.
        files[entry.name.removesuffix('.HTML')] = entry.name
        filename = files[entry.name.removesuffix('.HTML')]
        print('filename is ', filename)
    elif entry.is_dir() and entry.name.endswith('_files'):
        # Collect the folder.
        folders[entry.name.removesuffix('_files')] = entry.name

# Which names occur in both the dict of files and the dict of folders?
common = files.keys() & folders.keys()

for name in common:
    print(f'Found a file called {files[name]} and a folder called {folders[name]}')

parent_folder = '/Users/ericlindell/testHtmlDirInSameFolder/'

for name in common: # Make subfolder for HTML file & folder, & move them in.
    mkdir(join(parent_folder, name))
    rename(files[name], join(parent_folder, basename(files[name])))
    rename(folders[name], join(parent_folder, basename(folders[name])))

thanks a million – it really helps !!

MRAB · July 20, 2024, 4:46pm

Your code has entry.name.endswith('.html'). Note the lowercase '.html'. That’s not the same as uppercase '.HTML'.
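
One way to sidestep the mismatch is to compare the extension case-insensitively, so the same name is stripped whether the file ends in .html or .HTML. A minimal sketch; html_stem is a hypothetical helper name, not something from the code above:

```python
from os.path import splitext


def html_stem(filename):
    """Return filename without its extension if the extension is .html
    in any letter case; otherwise return None."""
    stem, ext = splitext(filename)
    if ext.lower() == '.html':
        return stem
    return None


for candidate in ('LufthansaHeist.html', 'LufthansaHeist.HTML', 'notes.txt'):
    print(candidate, '->', html_stem(candidate))
```

The returned stem can then be used as the dict key in place of entry.name.removesuffix(...), which only strips a suffix that matches exactly, letter case included.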

ericlindelldotnet · July 20, 2024, 6:43pm

Thanks. I had actually fixed that earlier, but I didn’t reflect that change in my post. Sorry if I wasted your time.

This is the feedback I’m getting now with a source folder containing one html file and its associated folder. Both are recognized, and the folder is being created, but they’re not going inside the folder.

$ python3 putHtmlAndDirIntoSameFolder.py
entry is <DirEntry 'LufthansaHeist_files'>
entry is <DirEntry 'LufthansaHeist.html'>
entry is <DirEntry 'LufthansaHeist.html'>
Found a file called LufthansaHeist.html and a folder called LufthansaHeist_files
Traceback (most recent call last):
  File "/Users/ericlindell/testHtmlDirInSameFolder/putHtmlAndDirIntoSameFolder.py", line 32, in <module>
    rename(files[name], join(parent_folder, basename(files[name])))
FileNotFoundError: [Errno 2] No such file or directory: 'LufthansaHeist.html' -> '/Users/ericlindell/testHtmlDirInSameFolder/HnDirs/LufthansaHeist.html'

MRAB · July 20, 2024, 8:33pm

Read the final line of the traceback closely.

It’s trying to rename a file when it has only the filename, not the full path, so it’s looking in the current directory, wherever that happens to be. That’s because the dicts files and folders contain filenames (from entry.name), not full paths.
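
One way to act on that is to store entry.path in the dicts rather than entry.name, so every later move has the full location. A minimal sketch under that assumption; find_pairs and move_pairs are hypothetical names, and str.removesuffix needs Python 3.9 or later:

```python
import os
import shutil


def find_pairs(parent_folder):
    """Map each base name to (html_path, folder_path) for every
    name.html / name_files pair found directly in parent_folder."""
    files = {}
    folders = {}
    for entry in os.scandir(parent_folder):
        if entry.is_file() and entry.name.endswith('.html'):
            # entry.path is parent_folder joined with entry.name.
            files[entry.name.removesuffix('.html')] = entry.path
        elif entry.is_dir() and entry.name.endswith('_files'):
            folders[entry.name.removesuffix('_files')] = entry.path
    common = files.keys() & folders.keys()
    return {name: (files[name], folders[name]) for name in common}


def move_pairs(parent_folder):
    """Create parent_folder/name for each pair and move both items into it."""
    for name, (html_path, folder_path) in find_pairs(parent_folder).items():
        target = os.path.join(parent_folder, name)
        os.mkdir(target)  # raises FileExistsError if target already exists
        shutil.move(html_path, os.path.join(target, os.path.basename(html_path)))
        shutil.move(folder_path, os.path.join(target, os.path.basename(folder_path)))
```

Calling move_pairs with an absolute folder path keeps the result independent of the current working directory. An HTML file without a matching _files folder is left where it is.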

ericlindelldotnet · July 20, 2024, 11:59pm

This suggestion worked perfectly. I tested it with multiple HTMLs and associated folders.
Now, I’d like to make it work for a directory with nested subdirectories.
I thought of calling this script from a bash script that iterates through all subfolders, but I don’t know how to pass the subfolder to this python script.
I also thought of doing os.walk and passing each subdirectory to this script, but I’m not sure that will work. So many different levels at one time.

Thank you again for your kind and capable assistance.

My code now looks like . .

for entry in scandir('/Users/ericlindell/testHtmlDirInSameFolder/HnDirs'):
    if entry.is_file() and entry.name.endswith('.html'):
        files[entry.name.removesuffix('.html')] = entry.name
    elif entry.is_dir() and entry.name.endswith('_files'):
        folders[entry.name.removesuffix('_files')] = entry.name

# Which names occur in both dictionaries?
common = files.keys() & folders.keys()

for name in common:
    print(f'Found a file called {files[name]} and a folder called {folders[name]}')

parent_folder = '/Users/ericlindell/testHtmlDirInSameFolder/HnDirs/'

for name in common:
    # Make subfolder for HTML file & associated folder.
    mkdir(join(parent_folder, name))

    # Move the HTML file into the subfolder.
    rename(
      join(parent_folder, files[name]),
      join(parent_folder, name, files[name]))

    # Move the _files folder into the subfolder.
    rename(
      join(parent_folder, folders[name]),
      join(parent_folder, name, folders[name]))

MRAB · July 21, 2024, 12:39am

You need to think very carefully about what should happen if you iterate through nested folders.

You’re moving an HTML file and the associated folder into a subfolder and then iterating through subfolders, which will include the subfolder that you just created, so it’ll move them into a subfolder in the subfolder, and so on, forever, moving them deeper and deeper…

Maybe it should not move them if the folder contains only an HTML file and associated folder and also has the associated name, e.g. a folder is called “LufthansaHeist” and it contains only a file called “LufthansaHeist.html” and a folder called “LufthansaHeist_files”; just skip that folder, as it’s already how you want it.
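
That skip rule could be written as a small predicate that a recursive script calls before touching a folder. A minimal sketch, assuming lowercase .html names; is_already_combined is a hypothetical helper name:

```python
import os


def is_already_combined(folder):
    """Return True if folder contains exactly <name>.html and <name>_files,
    where <name> is the folder's own name."""
    name = os.path.basename(os.path.normpath(folder))
    html_name = name + '.html'
    files_name = name + '_files'
    # Require exactly these two entries and nothing else.
    if set(os.listdir(folder)) != {html_name, files_name}:
        return False
    return (os.path.isfile(os.path.join(folder, html_name))
            and os.path.isdir(os.path.join(folder, files_name)))
```

A recursive walk could call this on each folder and leave it alone when the result is True. The check is deliberately strict: any extra entry in the folder, including a hidden file, makes it return False.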

ericlindelldotnet · July 21, 2024, 2:27pm

That makes a lot of sense; I hadn’t thought of that. Some possible workarounds:

Perhaps I could edit the bash script above, which currently puts each HTML file, but not its folder, in its own directory.

Or I could put all the HTML files and their folders in one top-level folder, run my currently working script, then redistribute them to their original folders. Funny how complex these things can get.

Thank you again.

Here’s my current effort, nesting scandir inside os.walk with topdown=False. It fails on the line
for entry in scandir(dirname):
I don’t know how to tell scandir which folder to scan.

path = '/Users/ericlindell/htmlDirJoin/HnDirs/'
path = os.path.normpath(path)
# res = []
for root, dirs, files in os.walk("/Users/ericlindell/htmlDirJoin/HnDirs", topdown=False):
    for dirname in dirs:
      files.clear()
      folders.clear()
      for entry in scandir(dirname):
        if entry.is_file() and entry.name.endswith('.html'):
          files[entry.name.removesuffix('.html')] = entry.name
        elif entry.is_dir() and entry.name.endswith('_files'):
          folders[entry.name.removesuffix('_files')] = entry.name

      # Names common to both dicts (files & folders)
      common = files.keys() & folders.keys()
      for name in common:
        print(f'Found a file called {files[name]} and a folder called {folders[name]}')

      parent_folder = '/Users/ericlindell/htmlDirJoin/HnDirs/'

      for name in common:
        # Make subfolder for HTML file & its directory.
        mkdir(join(parent_folder, name))

        # Move the HTML file into the subfolder.
        rename(
          join(parent_folder, files[name]),
          join(parent_folder, name, files[name]))

        # Move the _files folder into the subfolder.
        rename(
          join(parent_folder, folders[name]),
          join(parent_folder, name, folders[name]))

MRAB · July 21, 2024, 3:10pm

You’re passing scandir only the name of the folder, but where is that folder? You need to give it the full path.
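
In an os.walk loop, the full path of each subfolder is root joined with the name from dirs. A minimal sketch of just that step; scan_subfolders is a hypothetical name. Note also that in the posted loop the name files is rebound by os.walk to a list of filenames, so it can no longer serve as the dict, which is why the sketch uses a different variable name there:

```python
import os


def scan_subfolders(top):
    """Yield (full_path, sorted entry names) for every subfolder below top,
    visiting deeper folders before their parents."""
    for root, dirs, _filenames in os.walk(top, topdown=False):
        for dirname in dirs:
            # dirname is only a bare name; join it with root to locate it.
            full_path = os.path.join(root, dirname)
            with os.scandir(full_path) as entries:
                yield full_path, sorted(entry.name for entry in entries)
```

The same full_path would also have to replace the hard-coded parent_folder when building the source and destination of each move, otherwise the moves still look in the top-level folder.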

ericlindelldotnet · July 21, 2024, 4:45pm

Matthew, big surprise.
I described the task to ChatGPT in plain English, and it provided code that worked perfectly the first time. This version traverses the entire folder structure bottom-up!!

Cheers! & thanks again for your helpful suggestions.

import os
import shutil

def combine_html_with_files(directory):
    for root, dirs, files in os.walk(directory, topdown=False):
        for filename in files:
            if filename.endswith('.html'):
                html_file = os.path.join(root, filename)
                folder_name = os.path.splitext(filename)[0] + "_files"
                folder_path = os.path.join(root, folder_name)
                if os.path.exists(folder_path) and os.path.isdir(folder_path):
                    # Create a new folder for each HTML file and its associated folder
                    new_folder_name = os.path.splitext(filename)[0]
                    new_folder_path = os.path.join(root, new_folder_name)
                    os.makedirs(new_folder_path, exist_ok=True)
                    
                    # Move the HTML file into the new folder
                    shutil.move(html_file, os.path.join(new_folder_path, filename))
                    
                    # Move the "_files" folder into the new folder
                    shutil.move(folder_path, os.path.join(new_folder_path, folder_name))
                    
                    print(f"Created folder: {new_folder_name}")
                    
                else:
                    print(f"Warning: Folder {folder_name} not found for {filename}")

if __name__ == "__main__":
    directory_to_process = "/Users/ericlindell/htmlDirJoin/HnDirs"
    combine_html_with_files(directory_to_process)