How to filter a CSV file?

thought · November 24, 2022, 2:22pm

I have a CSV file named film.csv. The column titles are as follows (with some example rows):

Year;Length;Title;Subject;Actor;Actress;Director;Popularity;Awards;*Image
1990;111;Tie Me Up! Tie Me Down!;Comedy;Banderas, Antonio;Abril, Victoria;Almodóvar, Pedro;68;No;NicholasCage.png
1991;113;High Heels;Comedy;Bosé, Miguel;Abril, Victoria;Almodóvar, Pedro;68;No;NicholasCage.png
1983;104;Dead Zone, The;Horror;Walken, Christopher;Adams, Brooke;Cronenberg, David;79;No;NicholasCage.png
1979;122;Cuba;Action;Connery, Sean;Adams, Brooke;Lester, Richard;6;No;seanConnery.png
1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;No;NicholasCage.png
1983;140;Octopussy;Action;Moore, Roger;Adams, Maud;Glen, John;68;No;NicholasCage.png

I need to parse this CSV with basic commands (not using Pandas).

How would I extract all movie titles where the actor's first name is Richard, the film was made before 1985, and Awards is Yes? (I have been able to get it to show a list where Awards == Yes, but not the rest.)
How can I count how many times any given actor appears in the list?

smontanaro · November 24, 2022, 3:02pm

Tristan, welcome to the wonderful world of Python. Assuming this is a homework assignment, I will avoid providing too much detail. Here are some hints.

Open your CSV file in the context of a with statement so it is automatically closed when you are finished with it.
Use the csv module to process the open file. Check the csv.DictReader class.
Loop over the films with a for loop.
As you iterate over the records, test each for your name and date constraints, saving those records, probably to a list.
Check the Actor and Actress fields, incrementing counts for each one you encounter. A dictionary will be helpful here.
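
The hints above could be sketched roughly as follows. This is only one possible shape, not a definitive solution: the helper names first_name and scan_films are made up for illustration, the sample rows are copied from the question and read through io.StringIO so the snippet runs on its own, and it assumes every row has a numeric Year. The sample contains no rows with Awards equal to Yes, so the demonstration filters on No instead.

```python
import csv
import io
from collections import Counter

# Sample rows copied from the question. A real script would instead use
# open('film.csv', newline='', encoding='utf-8') in the with statement.
SAMPLE = """Year;Length;Title;Subject;Actor;Actress;Director;Popularity;Awards;*Image
1983;104;Dead Zone, The;Horror;Walken, Christopher;Adams, Brooke;Cronenberg, David;79;No;NicholasCage.png
1979;122;Cuba;Action;Connery, Sean;Adams, Brooke;Lester, Richard;6;No;seanConnery.png
1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;No;NicholasCage.png
"""


def first_name(name):
    """Return the part after the comma in 'Last, First', or '' if there is no comma."""
    _, _, first = name.partition(',')
    return first.strip()


def scan_films(csv_file, wanted_first_name, before_year, awards):
    """Return (matching titles, Counter of Actor and Actress names)."""
    titles = []
    counts = Counter()
    # The file is semicolon-separated, so the delimiter must be given explicitly.
    for row in csv.DictReader(csv_file, delimiter=';'):
        counts[row['Actor']] += 1
        counts[row['Actress']] += 1
        if (int(row['Year']) < before_year
                and first_name(row['Actor']) == wanted_first_name
                and row['Awards'] == awards):
            titles.append(row['Title'])
    return titles, counts


with io.StringIO(SAMPLE) as f:
    titles, counts = scan_films(f, 'Richard', 1985, 'No')

print(titles)                   # ['Days of Heaven']
print(counts['Adams, Brooke'])  # 3
```

collections.Counter is a dictionary subclass, so it fits the "a dictionary will be helpful" hint while removing the need to check whether a name has been seen before.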

If you are completely new to Python, I suggest you take a spin through the tutorial if you haven’t already.

When you want to know how to use a particular module, search for something like python csv module in most any search engine. That will provide you a link to the relevant module documentation on docs.python.org as well as links to various tutorials people have written. Python is a pretty simple language to learn (compared to C++, for example). Once you’ve mastered the fundamentals, you will spend most of your time coming to grips with the rich set of standard and third-party modules and packages.

thought · November 24, 2022, 3:11pm

Hello, thank you for your reply.
I can ask a more precise question.
When I run:

filter = {}
lines = open('film.csv', 'r').readlines()
columns = lines[0].strip().split(';')

lines.pop(0)

for i in lines:
    x = i.strip().split(';')
    # Checking if the movie was made before 1985
    if int(x[columns.index('Year')]) < 1985:
        # Checking if the actor's first name is Richard
        if x[columns.index('Actor')].split(', ')[1] == 'Richard':
            # Checking if awards == Yes
            if x[columns.index('Awards')] == 'Yes':
                # Printing out the title of the movie
                print(x[columns.index('Title')])

I get an error for:

line 13, in <module> if x[columns.index('Actor')].split(', ')[1] == ('Richard'):
IndexError: list index out of range

How can I fix this?

MRAB · November 24, 2022, 5:52pm

Find out which part is raising the exception. Is it x[columns.index('Actor')] or is it split(', ')[1]? I suspect it’s the latter, due to a comma without a space after it. The data posted here looks OK and it works for me, so the problem is probably on one of the other lines in the file. Personally, I’d split only on a comma and then strip the whitespace afterwards, as a missing space after a comma tends to be less noticeable than a missing comma.
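
A minimal sketch of that idea, splitting on the comma alone and stripping afterwards. The function name split_name is hypothetical, and returning an empty first name when there is no comma is an assumption about what you would want in that case:

```python
def split_name(name):
    """Split 'Last, First' on the comma alone, then strip whitespace from each part."""
    parts = [part.strip() for part in name.split(',')]
    last = parts[0]
    # A name without a comma has no second part, so fall back to ''
    # instead of raising IndexError.
    first = parts[1] if len(parts) > 1 else ''
    return last, first


print(split_name('Gere, Richard'))  # ('Gere', 'Richard')
print(split_name('Gere,Richard'))   # ('Gere', 'Richard')
print(split_name('Madonna'))        # ('Madonna', '')
```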

tjreedy · November 25, 2022, 1:37am

An open file is a line iterator. Hence, you can replace lines 2-5 with

lines = open('film.csv')
columns = next(lines).strip().split(';')  # Pick off first line.

for i in lines: will continue iterating, starting with line 2 of the .csv.

For debugging, print each line to see which one causes the exception.
In Python 3.11, the traceback marks the offending index expression. This is one of the great new features of the release.
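
One way to do that kind of debugging without printing the whole file is to collect only the lines where the Actor lookup fails. This is a sketch under assumptions: find_bad_lines is a made-up helper, and the second data row in SAMPLE is invented purely to trigger the error (it is not from the real film.csv).

```python
import io

# The first data row is from the question; the second is a made-up row
# whose Actor field has no ", " in it, so split(', ')[1] fails.
SAMPLE = """Year;Length;Title;Subject;Actor;Actress;Director;Popularity;Awards;*Image
1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;No;NicholasCage.png
1980;90;Made-up Row;Drama;SingleName;;;0;No;none.png
"""


def find_bad_lines(lines):
    """Return (line number, text) for each line where the Actor lookup raises IndexError."""
    columns = next(lines).strip().split(';')  # Pick off the header line.
    actor_col = columns.index('Actor')
    bad = []
    # The header was line 1 of the file, so the data starts at line 2.
    for line_number, line in enumerate(lines, start=2):
        fields = line.strip().split(';')
        try:
            fields[actor_col].split(', ')[1]
        except IndexError:
            bad.append((line_number, line.rstrip('\n')))
    return bad


for line_number, text in find_bad_lines(io.StringIO(SAMPLE)):
    print(line_number, text)  # 3 1980;90;Made-up Row;Drama;SingleName;;;0;No;none.png
```

With the real file, passing open('film.csv') in place of io.StringIO(SAMPLE) should list the rows that need attention.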