Python Learning Tips - Laser Focused on CSV Data Processing

JPG · October 21, 2023, 10:30pm

Dear Community,

I’m posting here hoping to get the following from generous community members:

A list of concepts to learn (e.g., how to extract subset of data from filenames), and
The best online resources to learn those concepts.

However, at this point, I’d like to focus my learning strictly to what I will need to accomplish my immediate goal.

So I’m going to post my situation, a table that shows what the desired output should be, what I need in to be able to do in order to get unstuck, and my background. Hopefully with all that you provide the best advice that you can.

Note: I tried to use ChatGPT, Bing Chat, and Bard to help me write code snippets. I was initially successful in using those tools to help me write a basic function, but now that I want to do something a bit more complex, the generative AI tools are not providing useful code.

SITUATION:
This effort is part of a data migration / ETL project. My script has to read a CSV file, analyze data, and output another CSV file with the results of the analysis and the data categorized. Here is a table that shows:
a) The list of file names within the supplied CSV file.
b) The portion of the file name that is common to a given subset of files, labeled DB #. I would like to automate the extraction of this portion of the file name.
c) The results of the categorization that I want to automate. Currently, I’m only able to easily label as “audit trail” files that end with “audit.pdf”, as well as folder items.

Filename	DB #	File Category
DB-APP-1001201_audit.pdf	DB-APP-1001201	audit trail
DB-APP-1001201_v1.0(1).pdf	DB-APP-1001201	previous original
DB-APP-1001201_v1.0.pdf	DB-APP-1001201	previous likeness
DB-APP-1001201_v2.0.msg	DB-APP-1001201	previous original
DB-APP-1001201_v2.0.pdf	DB-APP-1001201	previous likeness
DB-APP-1001201_v3.0.msg	DB-APP-1001201	original
DB-APP-1001201_v3.0.pdf	DB-APP-1001201	likeness
DB-APP-1001240_audit.pdf	DB-APP-1001240	audit trail
DB-APP-1001240_v1.0(1).pdf	DB-APP-1001240	previous original
DB-APP-1001240_v1.0.pdf	DB-APP-1001240	previous likeness
DB-APP-1001240_v2.0.msg	DB-APP-1001240	previous original
DB-APP-1001240_v2.0.pdf	DB-APP-1001240	previous likeness
DB-APP-1001240_v3.0.msg	DB-APP-1001240	original
DB-APP-1001240_v3.0.pdf	DB-APP-1001240	likeness
DB-APP-1001286_audit.pdf	DB-APP-1001286	audit trail
DB-APP-1001286_v1.0(1).pdf	DB-APP-1001286	original
DB-APP-1001286_v1.0.pdf	DB-APP-1001286	likeness

WHAT MY CODE NEEDS TO DO:
My initial need is to be able to also categorize these files as either original, likeness, previous original, previous likeness, etc. They cannot be categorized by just looking at the information in one file name. I need to consider the data that is part of all the files that belong to the same DB #.
The category depends on all the information that is stored after the DB # portion of the name (e.g., v1.0.pdf, v2.0.msg, v1.0(1).pdf, etc.). I need to be able compare all the files that belong to the same DB #, and then use that info to categorize. In this specific case, some rules are:

The lowest v numbers are previous files - if there’s a v1, v2, v3, then v1 and v2 are previous files, while the highest number is considered the latest or current file.
If there is a file that contains a number in parenthesis, it can be labeled an “original”.

I don't know how to parse out the different parts of the file name, nor iterate through just a subset, nor compare each subitem in that subset one to another.

Current code:
This is the main portion of the code I already have that does most of the work. The script does other stuff that’s not shown, like read some parameters from a config file, make a copy of the input CSV file, etc.

# Function to update the CSV file, save it, and close it
def update_csv_file(csv_file, new_data):
    with open(csv_file, 'w', newline='') as file:
        csv_writer = csv.writer(file)

        # Add the new column header "File Category"
        new_header = new_data[0]
        new_header.insert(new_header.index("Path") + 1, "File Category")
        csv_writer.writerow(new_header)

        for row in new_data[1:]:
            
            item_type = row[new_header.index("Item Type")]
            if item_type == "Folder":
                file_category = "Uncategorized Folder"
            else:
                file_name = row[new_header.index("Name")]
                if file_name.endswith("audit.pdf"):
                    file_category = "audit trail"
                else:
                    file_category = "Other File"
            row.insert(new_header.index("Path") + 1, file_category)
            csv_writer.writerow(row)

My Background:

I’ve just started to become familiar with Python. I’m using version 3.11.6 with VSCode on Windows 10. Hope is to be able to get better and better soon.
I’ve written code in the past - as needed to accomplish project goals - but it’s been many years since coding was a frequent part of my life. About 15+ years ago I used to write code 10% of the time and about 20 years ago I used to write code 50% of the time.
Some previous languages, that I’ve mostly forgotten how to use, are Java, VB, Pascal, and FORTRAN.

Thank you all in advance!!

JPG.

MRAB · October 22, 2023, 12:42am

You can split a filename with the .split method filename.split("_"). That’ll give you the DB# and the remainder.

A useful string method is .endswith, and you can split just once on at the end with filename.rsplit(".", 1).

I’d also suggest judicious use of dict and maybe defaultdict to store and categorise the filenames. You can use one to group all the filenames by DB #, for example, and then process all those having the same DB #.

hansgeunsmeyer · October 22, 2023, 3:25pm

To parse those filenames you’d usually first separate the name from the extension, using either os.path.splitext or pathlib.Path:

>>> from pathlib import Path
>>> p = Path('DB-APP-1001201_v1.0(1).pdf')
>>> p.name
'DB-APP-1001201_v1.0(1).pdf')
>>> p.suffix
'.pdf'
>>> p.stem
'DB-APP-1001201_v1.0(1)'

Then, if the name is pretty standardized and simple, you could either parse is directly using for instance .split on strings, and iterating over the pieces:

>> p.stem.split('-')
['DB', 'APP', '1001201_v1.0(1)']

Or it might be easier to use a regex:

>>> import re
>>> pat = re.compile(r'^DB\-APP\-(?P<appid>\d+)_(?P<version>v\d+\.\d+)(?P<var>\(\d+\))?$')
>>> m = pat.match(p.stem)
>>> m['appid']
'1001201'
>>> m['version']
'v1.0'

JPG · October 25, 2023, 4:13am

Thank you Matt for your kind reply. Besides using Google searches, do you have any specific online resources where I could learn more about dict and defaultdict?

JPG · October 25, 2023, 4:21am

Hans, thank you so much for your kind and prompt reply. What is your top online resource (besides Google) that you would recommend for me to learn more about pathlib, re, the compile method, etc.?

hansgeunsmeyer · October 25, 2023, 1:19pm

The most reliable – and generally also pretty readable – resource are the Python module docs at python.org.

Topic		Replies	Views
Beginner help with tables Python Help	7	769	November 21, 2022
Parse CSV file and output to new file Python Help	2	378	December 20, 2022
Need some help writing a basic code Python Help	0	449	April 26, 2022
Pulling Data From Multiple Files Python Help	2	184	January 12, 2024
Various length dataframe to extract (or split) Python Help	3	312	May 11, 2022

Python Learning Tips - Laser Focused on CSV Data Processing

Related Topics