Dear Community,
I’m posting here hoping to get the following from generous community members:
- A list of concepts to learn (e.g., how to extract subset of data from filenames), and
- The best online resources to learn those concepts.
However, at this point, I’d like to focus my learning strictly to what I will need to accomplish my immediate goal.
So I’m going to post my situation, a table that shows what the desired output should be, what I need in to be able to do in order to get unstuck, and my background. Hopefully with all that you provide the best advice that you can.
Note: I tried to use ChatGPT, Bing Chat, and Bard to help me write code snippets. I was initially successful in using those tools to help me write a basic function, but now that I want to do something a bit more complex, the generative AI tools are not providing useful code.
SITUATION:
This effort is part of a data migration / ETL project. My script has to read a CSV file, analyze data, and output another CSV file with the results of the analysis and the data categorized. Here is a table that shows:
a) The list of file names within the supplied CSV file.
b) The portion of the file name that is common to a given subset of files, labeled DB #. I would like to automate the extraction of this portion of the file name.
c) The results of the categorization that I want to automate. Currently, I’m only able to easily label as “audit trail” files that end with “audit.pdf”, as well as folder items.
Filename | DB # | File Category |
---|---|---|
DB-APP-1001201_audit.pdf | DB-APP-1001201 | audit trail |
DB-APP-1001201_v1.0(1).pdf | DB-APP-1001201 | previous original |
DB-APP-1001201_v1.0.pdf | DB-APP-1001201 | previous likeness |
DB-APP-1001201_v2.0.msg | DB-APP-1001201 | previous original |
DB-APP-1001201_v2.0.pdf | DB-APP-1001201 | previous likeness |
DB-APP-1001201_v3.0.msg | DB-APP-1001201 | original |
DB-APP-1001201_v3.0.pdf | DB-APP-1001201 | likeness |
DB-APP-1001240_audit.pdf | DB-APP-1001240 | audit trail |
DB-APP-1001240_v1.0(1).pdf | DB-APP-1001240 | previous original |
DB-APP-1001240_v1.0.pdf | DB-APP-1001240 | previous likeness |
DB-APP-1001240_v2.0.msg | DB-APP-1001240 | previous original |
DB-APP-1001240_v2.0.pdf | DB-APP-1001240 | previous likeness |
DB-APP-1001240_v3.0.msg | DB-APP-1001240 | original |
DB-APP-1001240_v3.0.pdf | DB-APP-1001240 | likeness |
DB-APP-1001286_audit.pdf | DB-APP-1001286 | audit trail |
DB-APP-1001286_v1.0(1).pdf | DB-APP-1001286 | original |
DB-APP-1001286_v1.0.pdf | DB-APP-1001286 | likeness |
WHAT MY CODE NEEDS TO DO:
My initial need is to be able to also categorize these files as either original, likeness, previous original, previous likeness, etc. They cannot be categorized by just looking at the information in one file name. I need to consider the data that is part of all the files that belong to the same DB #.
The category depends on all the information that is stored after the DB # portion of the name (e.g., v1.0.pdf, v2.0.msg, v1.0(1).pdf, etc.). I need to be able compare all the files that belong to the same DB #, and then use that info to categorize. In this specific case, some rules are:
- The lowest v numbers are previous files - if there’s a v1, v2, v3, then v1 and v2 are previous files, while the highest number is considered the latest or current file.
- If there is a file that contains a number in parenthesis, it can be labeled an “original”.
I don't know how to parse out the different parts of the file name, nor iterate through just a subset, nor compare each subitem in that subset one to another
.
Current code:
This is the main portion of the code I already have that does most of the work. The script does other stuff that’s not shown, like read some parameters from a config file, make a copy of the input CSV file, etc.
# Function to update the CSV file, save it, and close it
def update_csv_file(csv_file, new_data):
with open(csv_file, 'w', newline='') as file:
csv_writer = csv.writer(file)
# Add the new column header "File Category"
new_header = new_data[0]
new_header.insert(new_header.index("Path") + 1, "File Category")
csv_writer.writerow(new_header)
for row in new_data[1:]:
item_type = row[new_header.index("Item Type")]
if item_type == "Folder":
file_category = "Uncategorized Folder"
else:
file_name = row[new_header.index("Name")]
if file_name.endswith("audit.pdf"):
file_category = "audit trail"
else:
file_category = "Other File"
row.insert(new_header.index("Path") + 1, file_category)
csv_writer.writerow(row)
My Background:
- I’ve just started to become familiar with Python. I’m using version 3.11.6 with VSCode on Windows 10. Hope is to be able to get better and better soon.
- I’ve written code in the past - as needed to accomplish project goals - but it’s been many years since coding was a frequent part of my life. About 15+ years ago I used to write code 10% of the time and about 20 years ago I used to write code 50% of the time.
- Some previous languages, that I’ve mostly forgotten how to use, are Java, VB, Pascal, and FORTRAN.
Thank you all in advance!!
JPG.