Selecting a column by integer in Pandas

art · May 5, 2023, 7:09am

Hello,

I am encountering an issue with selecting a column by integer in Pandas. The syntax used by the course instructor is in the code below. The data is provided. I see in a video that this syntax works for them. I’m not sure which version of Python or Pandas they are using. I’m hoping someone can help me understand why this is failing for me.

import pandas as pd

# read data from URL and store in a DataFrame
url = "https://github.com/chendaniely/pandas_for_everyone/blob/master/data/gapminder.tsv?raw=true"
df = pd.read_csv(url, sep='\t')

# show data from dataframe
print("# --- --- --- --- --- --- --- --- --- --- --- --- ")
print(df.head())
print("# --- --- --- --- --- --- --- --- --- --- --- --- ")

# show 3rd column of DataFrame
print(df[[2]])

# AFAICT, line above should be equivalent to line below 
# print(df[['year']])

Error Traceback

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 5876, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 5935, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index([2], dtype='int64')] are in the [columns]"

thanks for reading!

rob42 · May 5, 2023, 8:45am

I’m not sure why you have the code line print(df[[2]]), since there is no column named 2 in the data.

If you look at the head frame, you’ll see:

# --- --- --- --- --- --- --- --- --- --- --- --- 
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
# --- --- --- --- --- --- --- --- --- --- --- ---

So, you can use any of the column names, such as:
print(df[['year']])
print(df[['lifeExp']])

… or whatever column name is there.

Caveat: I’m not a Pandas user, so I may be wrong.

To add: Just an FYI, in case you don’t know:
If you want to see what version of Pandas you are using, just pop in the code line print(pd.__version__). I usually put that in at the beginning of a script, just after the import so that I can see the version of a package.
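A minimal sketch of that version check, placed right after the import. The variable name `version` is only for illustration.

```python
import pandas as pd

# Read the installed pandas version string right after the import
version = pd.__version__
print("pandas version:", version)
```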

abessman · May 5, 2023, 9:27am

These can’t be equivalent, because an integer such as 2 is itself a legal column name:

df = pd.DataFrame({1: [0, 1, 2], "year": [3, 4, 5]})
print(df)
## Output:
#    1  year
# 0  0     3
# 1  1     4
# 2  2     5
print(df[[1]])  # Which column?

Of course, the column named 1 is printed, not the column with index 1.

To access columns by index number, use .iloc.

df.iloc[:, 1]
## Output:
# 0    3
# 1    4
# 2    5
# Name: year, dtype: int64

Note also the difference between df[1] and df[[1]].

type(df[1])  # pandas.core.series.Series
type(df[[1]])  # pandas.core.frame.DataFrame
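Applied to the gapminder data from the original question, selecting the third column by position might look like the sketch below. The small frame is an assumption made for illustration: it is built by hand from the first three rows of the head() output shown earlier in the thread, so the example runs without downloading the file. The variable names are hypothetical.

```python
import pandas as pd

# Hand-built stand-in for the gapminder data (first three rows of the head() output)
df = pd.DataFrame({
    "country": ["Afghanistan"] * 3,
    "continent": ["Asia"] * 3,
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
    "pop": [8425333, 9240934, 10267083],
    "gdpPercap": [779.445314, 820.853030, 853.100710],
})

# Position-based: third column as a one-column DataFrame (like df[['year']])
third_as_frame = df.iloc[:, [2]]

# Position-based: third column as a Series (like df['year'])
third_as_series = df.iloc[:, 2]

# Alternative: look up the label at position 2, then select by that label
third_by_label = df[[df.columns[2]]]

print(third_as_frame)

# Label-based selection with an integer fails here: no column is named 2
try:
    df[[2]]
except KeyError as exc:
    print("KeyError:", exc)
```

The list form `[2]` keeps the result a DataFrame, while the bare `2` returns a Series, mirroring the df[[1]] versus df[1] distinction above.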

art · May 6, 2023, 2:49am

Hi Rob42!
That’s very helpful.
Thank you!

art · May 26, 2023, 10:44pm

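AB! Thanks for your illustrative reply. It seems that ambiguity is why selecting columns by integer position with df[[2]] no longer works in current pandas, and why .iloc is the explicit way to select by position. (As far as I can tell, .iloc itself has been available since version 0.11; what version 0.20 deprecated was the older .ix indexer.) I had encountered the older syntax in an old pandas course.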

Thanks for the input, y’all!