Total no. of rows in a csv

Hi. I need help to understand this question.

“Find the number of distinct bookings from the given dataset in bookings.csv”. Does this have to be done with pandas? Please help. Thanks.

Did the instructor tell you to use Pandas? (How did you know that there is such a thing in the first place?) Does it say anything else in the assignment at all?

No sir, I don't know. This was an assignment in the final course project. I missed the previous 3 modules, which covered pandas, SQL and matplotlib. So my grade has gone down, and to improve it I have to do the project first, so I was just guessing. Could you please tell me how I should proceed? Thank you. (I am doing a course in Python from an institute called Learnbay.)

Well, how can anyone on the Internet help with that? We aren’t taking your course, so we don’t know what you were supposed to learn in those modules.

You don’t need pandas for this. Pandas does have a very convenient
read_csv function which reads a CSV file and returns a DataFrame:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv

You’ll see it has very many parameters, but aside from the CSV filename
itself they are nearly all optional - supply only those you need to make
the load work correctly.

The nice thing about a DataFrame is that it also has many methods,
which might make finding distinct bookings easier. On the other hand, it
has many methods, and finding what you want may take a while.

The other way is with the presupplied csv Python module:

You get a csv.reader instance and read all the rows from it.

The importance of the word “distinct” is that you only want to count rows as
different according to some criteria. For example, if the CSV is just an
accumulation of records, including revisions to older records, then
you’re in a sense only interested in the last row for each booking. The
bookings will be identified by a particular column or combination of
columns. Those columns comprise the key whose various values you want to
count.

Typically you would do this with a dict, storing each row in the
dictionary according to the key, perhaps keeping only the last.

But if all you need to do is to count the distinct keys, use a set,
and add just the key to the set. Then measure the length of the set when
finished.
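As a rough sketch of the above using only the csv module: the sample data and the booking_id column name here are assumptions made up for illustration, and the real key column (or columns) depends on what is actually in bookings.csv.

```python
import csv
import io

# Hypothetical sample data standing in for bookings.csv.
sample = io.StringIO(
    "booking_id,customer,status\n"
    "B1,alice,new\n"
    "B2,bob,new\n"
    "B1,alice,revised\n"
    "B3,carol,new\n"
)

reader = csv.reader(sample)
header = next(reader)                    # first row holds the column names
key_index = header.index("booking_id")   # assumed key column

distinct_keys = set()   # just the keys, enough for counting
last_row = {}           # key -> most recent row, if the rows themselves matter

for row in reader:
    key = row[key_index]
    distinct_keys.add(key)
    last_row[key] = row   # a later row replaces an earlier one

print(len(distinct_keys))   # number of distinct bookings
print(last_row["B1"])       # the last row seen for booking B1
```

With a real file you would replace the StringIO object with something like open('bookings.csv', newline='') and keep the rest the same.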

Cheers,
Cameron Simpson cs@cskk.id.au

Thank you for your help.

Hi, Mr. Cameron Sir,

I have done it this way.

import pandas as pd
data = pd.read_csv('C:\Bookings.csv')
#data.info()
a = data['booking_id'].nunique()
print('Unique Booking Id: ',a)

data1 = pd.read_csv('C:\Sessions.csv')
b = data1['session_id'].nunique()
c = data1['search_id'].nunique()
print('Unique Session Id: ',b,'    Unique Search Id:  ',c)

Can b and c be found using only one line of code?

Please help. Thank you.

Ah, the DataFrame and its many methods.

Well, there’s the trite:

 b = data1['session_id'].nunique(); c = data1['search_id'].nunique()

or:

 b, c = data1['session_id'].nunique(), data1['search_id'].nunique()

but they’re really no better. You’re only computing 2 values, there’s no
need for anything clever.

We almost never use the “statement; statement” syntax in Python, BTW.
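If a single expression is really wanted, one option is to call nunique() on a selection of columns instead of on each column separately. This is only a sketch, with made-up data standing in for Sessions.csv:

```python
import io

import pandas as pd

# Hypothetical stand-in for Sessions.csv.
sessions = pd.read_csv(io.StringIO(
    "session_id,search_id\n"
    "s1,q1\n"
    "s1,q2\n"
    "s2,q3\n"
    "s2,q3\n"
))

# DataFrame.nunique() returns a Series with one count per selected column.
counts = sessions[["session_id", "search_id"]].nunique()

# Unpacking the Series gives the individual counts, in column order.
n_unique_sessions, n_unique_searches = counts

print(counts)
```

Whether this is clearer than two separate assignments is a matter of taste; it mainly helps when there are many columns to count.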

Some random remarks:
data and data1 are not great names; consider bookings and
sessions as easier to remember.

The same with a, b, c: we’re often happier with wordy but
meaningful names like n_unique_bookings or things like that.

Because the backslash has special meaning in strings, for Windows paths
we often use a “raw string” like this:

 r'C:\Bookings.csv'

because in a raw string the backslash is not special. For your two
filenames you’re ok, but if you’d made a new.csv then this:

 'C:\new.csv'

actually contains a newline character, not the two characters \ and
n. Whereas:

 r'C:\new.csv'

does what you want.

You can also use UNIX style forward slashes in Windows paths:

 'C:/new.csv'

which sidesteps the backslash issue.
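A small check you can run to see the difference for yourself (new.csv is just the example name from above, not a real file):

```python
plain = 'C:\new.csv'     # \n here is read as a single newline character
raw = r'C:\new.csv'      # raw string: backslash and n stay as two characters
forward = 'C:/new.csv'   # forward slash: no escaping involved at all

print(len(plain), len(raw), len(forward))
print('\n' in plain)   # True: the path contains a newline
print('\n' in raw)     # False: the path is what you typed
print(repr(plain))     # repr() makes the hidden newline visible
```

Printing repr() of a path is a handy habit when a file mysteriously cannot be found.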

Cheers,
Cameron Simpson cs@cskk.id.au

Thank you so much.

Hello Cameron Sir,

I have a problem in pandas: merging two columns (from_city and to_city) to find the travel route, then finding customers who have travelled more than once (using groupby, I think), and finding the maximum frequency of each route (using mode(), I think). Can I mail you the dataset and question at cs@cskk.id.au, if it is okay with you? Thanks sir.