Python Learning Tips - Laser Focused on CSV Data Processing

Dear Community,

I’m posting here hoping to get the following from generous community members:

  • A list of concepts to learn (e.g., how to extract subset of data from filenames), and
  • The best online resources to learn those concepts.

However, at this point, I’d like to focus my learning strictly to what I will need to accomplish my immediate goal.

So I’m going to post my situation, a table that shows what the desired output should be, what I need in to be able to do in order to get unstuck, and my background. Hopefully with all that you provide the best advice that you can.

Note: I tried to use ChatGPT, Bing Chat, and Bard to help me write code snippets. I was initially successful in using those tools to help me write a basic function, but now that I want to do something a bit more complex, the generative AI tools are not providing useful code.

SITUATION:
This effort is part of a data migration / ETL project. My script has to read a CSV file, analyze data, and output another CSV file with the results of the analysis and the data categorized. Here is a table that shows:
a) The list of file names within the supplied CSV file.
b) The portion of the file name that is common to a given subset of files, labeled DB #. I would like to automate the extraction of this portion of the file name.
c) The results of the categorization that I want to automate. Currently, I’m only able to easily label as “audit trail” files that end with “audit.pdf”, as well as folder items.

Filename DB # File Category
DB-APP-1001201_audit.pdf DB-APP-1001201 audit trail
DB-APP-1001201_v1.0(1).pdf DB-APP-1001201 previous original
DB-APP-1001201_v1.0.pdf DB-APP-1001201 previous likeness
DB-APP-1001201_v2.0.msg DB-APP-1001201 previous original
DB-APP-1001201_v2.0.pdf DB-APP-1001201 previous likeness
DB-APP-1001201_v3.0.msg DB-APP-1001201 original
DB-APP-1001201_v3.0.pdf DB-APP-1001201 likeness
DB-APP-1001240_audit.pdf DB-APP-1001240 audit trail
DB-APP-1001240_v1.0(1).pdf DB-APP-1001240 previous original
DB-APP-1001240_v1.0.pdf DB-APP-1001240 previous likeness
DB-APP-1001240_v2.0.msg DB-APP-1001240 previous original
DB-APP-1001240_v2.0.pdf DB-APP-1001240 previous likeness
DB-APP-1001240_v3.0.msg DB-APP-1001240 original
DB-APP-1001240_v3.0.pdf DB-APP-1001240 likeness
DB-APP-1001286_audit.pdf DB-APP-1001286 audit trail
DB-APP-1001286_v1.0(1).pdf DB-APP-1001286 original
DB-APP-1001286_v1.0.pdf DB-APP-1001286 likeness

WHAT MY CODE NEEDS TO DO:
My initial need is to be able to also categorize these files as either original, likeness, previous original, previous likeness, etc. They cannot be categorized by just looking at the information in one file name. I need to consider the data that is part of all the files that belong to the same DB #.
The category depends on all the information that is stored after the DB # portion of the name (e.g., v1.0.pdf, v2.0.msg, v1.0(1).pdf, etc.). I need to be able compare all the files that belong to the same DB #, and then use that info to categorize. In this specific case, some rules are:

  • The lowest v numbers are previous files - if there’s a v1, v2, v3, then v1 and v2 are previous files, while the highest number is considered the latest or current file.
  • If there is a file that contains a number in parenthesis, it can be labeled an “original”.

I don't know how to parse out the different parts of the file name, nor iterate through just a subset, nor compare each subitem in that subset one to another.

Current code:
This is the main portion of the code I already have that does most of the work. The script does other stuff that’s not shown, like read some parameters from a config file, make a copy of the input CSV file, etc.

# Function to update the CSV file, save it, and close it
def update_csv_file(csv_file, new_data):
    with open(csv_file, 'w', newline='') as file:
        csv_writer = csv.writer(file)

        # Add the new column header "File Category"
        new_header = new_data[0]
        new_header.insert(new_header.index("Path") + 1, "File Category")
        csv_writer.writerow(new_header)

        for row in new_data[1:]:
            
            item_type = row[new_header.index("Item Type")]
            if item_type == "Folder":
                file_category = "Uncategorized Folder"
            else:
                file_name = row[new_header.index("Name")]
                if file_name.endswith("audit.pdf"):
                    file_category = "audit trail"
                else:
                    file_category = "Other File"
            row.insert(new_header.index("Path") + 1, file_category)
            csv_writer.writerow(row)

My Background:

  • I’ve just started to become familiar with Python. I’m using version 3.11.6 with VSCode on Windows 10. Hope is to be able to get better and better soon.
  • I’ve written code in the past - as needed to accomplish project goals - but it’s been many years since coding was a frequent part of my life. About 15+ years ago I used to write code 10% of the time and about 20 years ago I used to write code 50% of the time.
  • Some previous languages, that I’ve mostly forgotten how to use, are Java, VB, Pascal, and FORTRAN.

Thank you all in advance!!

JPG.

You can split a filename with the .split method filename.split("_"). That’ll give you the DB# and the remainder.

A useful string method is .endswith, and you can split just once on at the end with filename.rsplit(".", 1).

I’d also suggest judicious use of dict and maybe defaultdict to store and categorise the filenames. You can use one to group all the filenames by DB #, for example, and then process all those having the same DB #.

To parse those filenames you’d usually first separate the name from the extension, using either os.path.splitext or pathlib.Path:

>>> from pathlib import Path
>>> p = Path('DB-APP-1001201_v1.0(1).pdf')
>>> p.name
'DB-APP-1001201_v1.0(1).pdf')
>>> p.suffix
'.pdf'
>>> p.stem
'DB-APP-1001201_v1.0(1)'

Then, if the name is pretty standardized and simple, you could either parse is directly using for instance .split on strings, and iterating over the pieces:

>> p.stem.split('-')
['DB', 'APP', '1001201_v1.0(1)']

Or it might be easier to use a regex:

>>> import re
>>> pat = re.compile(r'^DB\-APP\-(?P<appid>\d+)_(?P<version>v\d+\.\d+)(?P<var>\(\d+\))?$')
>>> m = pat.match(p.stem)
>>> m['appid']
'1001201'
>>> m['version']
'v1.0'

Thank you Matt for your kind reply. Besides using Google searches, do you have any specific online resources where I could learn more about dict and defaultdict?

Hans, thank you so much for your kind and prompt reply. What is your top online resource (besides Google) that you would recommend for me to learn more about pathlib, re, the compile method, etc.?

The most reliable – and generally also pretty readable :slight_smile: – resource are the Python module docs at python.org.