Warning when adding containers as attributes of dataframes?

acampove · September 3, 2024, 10:58pm

Dear Experts,

I am writing something like:

import pandas as pd


data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
df = pd.DataFrame(data)


my_list = [1, 2, 3, 4, 5]


df.my_list = my_list

and I get a warning:

UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

pandas seems to believe that I want to add a new column, when in reality I just want to attach a list to the dataframe. I do not want to just turn off the warnings. How do I tell pandas that I did not mean to add a column, but just to attach an object?

Cheers

cameron · September 3, 2024, 11:50pm

I am writing something like:

import pandas as pd
data = {
   'Name': ['Alice', 'Bob', 'Charlie'],
   'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
my_list = [1, 2, 3, 4, 5]
df.my_list = my_list

Columns are available as attributes, which is doubtless why this is
forbidden:
https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe-column-attribute-access-and-ipython-completion

You can probably subclass DataFrame and intercept this, or perhaps
better provide a metadata property on which you can stash additional
data like this. IIRC the pandas and numpy libraries try quite hard to
interoperate with compatible types.

I don’t see a pre-provided facility for attaching arbitrary related data
like your list example above.
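The subclass idea above could be sketched roughly as follows. This is only a minimal sketch: AnnotatedFrame and my_list are hypothetical names chosen for illustration, and it relies on the _metadata and _constructor hooks that the pandas documentation describes for subclassing. Names listed in _metadata are treated as ordinary attributes, so assigning a list to one should not trigger the column-creation warning.

```python
import warnings

import pandas as pd


class AnnotatedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that carries one extra attribute."""

    # Names listed in _metadata are handled as ordinary attributes,
    # not as attempts to create a column.
    _metadata = ["my_list"]

    @property
    def _constructor(self):
        # Make pandas operations return AnnotatedFrame instead of DataFrame.
        return AnnotatedFrame


df = AnnotatedFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# Record warnings so we can check that none mentions column creation.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df.my_list = [1, 2, 3, 4, 5]

column_warnings = [w for w in caught if "columns to be created" in str(w.message)]
print(len(column_warnings))
print(df.my_list)
print(list(df.columns))
```

The list stays out of the columns, and existing code that expects a DataFrame should keep working because AnnotatedFrame is one.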

avi.gross · September 4, 2024, 12:05am

Acampove,

You may want to read the web page the warning points you to, or other documentation. What you are asking pandas to do is a legal statement, but it is also a common mistake made when a user wants something else. It is only a WARNING.

Your dataframe called df should be looked at first, to see that it produces this:

>>> df
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

You have a dataframe with two columns and three rows.

You have a list with five numbers: my_list = [1, 2, 3, 4, 5]

What exactly do you want to do? In general, you can add a row to a dataframe or a column or you can index a dataframe to return selected parts. But how does five numbers have meaning here?

There are languages like R that create a new column and in an example like this, if I typed:

# Note this is NOT Python but R
df$NewCol <- list(11,12,13)

Then the result would be a third column containing those integers. The warning message guesses you are trying to do something like that and is telling you it is not how it is done.

I think it likely you want to do something very different, as you say you want to attach the list. Do you mean as an attribute with some name like “my_list”? Attributes can be attached to many objects in languages like R, and also, in a somewhat different way, in Python. Many Python objects can indeed have attributes added, and this is the way to add one:

df.instrument_name = 'Binky'

You can access it later as in:

>>> df.instrument_name
'Binky'

But if you check, what you did worked as described even with the warning.

>>> my_list = [1, 2, 3, 4, 5]
>>> df.my_list = my_list

>>> print(df.my_list)
[1, 2, 3, 4, 5]

So, you need to specify what you wanted to do. If it was to attach a list, you may sometimes get a warning but it works. But if you want anything else, please explain where you want the list to go or be used.
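To see when the warning actually appears, the two assignments above can be compared directly. This is a small sketch, and warnings_when_setting is a hypothetical helper written for this illustration. As far as I can tell, pandas only warns when the new attribute's value is list-like, which would explain why the string assignment stays silent while the list assignment does not.

```python
import warnings

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})


def warnings_when_setting(frame, name, value):
    """Set frame.<name> = value and return the text of any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        setattr(frame, name, value)
    return [str(w.message) for w in caught]


string_warnings = warnings_when_setting(df, "instrument_name", "Binky")
list_warnings = warnings_when_setting(df, "my_list", [1, 2, 3, 4, 5])

print(len(string_warnings))
print(len(list_warnings))
print(df.instrument_name, df.my_list)
```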

acampove · September 4, 2024, 12:27am

Hello @cameron

Thanks for your answer. I do not think a wrapper class is a good idea, because I might have to change things in too many places in the code. The metadata class though, seems like the right way.

Cheers.

acampove · September 4, 2024, 12:37am

Dear @avi.gross

I am trying to store some metadata, the list is not meant to be part of the stuff in the dataframe itself.

Yes, I want to do that with the list. However Pandas thinks I am trying to add another column to the dataframe.

avi.gross wrote:

But if you check, what you did worked as described even with the warning.
>>> my_list = [1, 2, 3, 4, 5]
>>> df.my_list = my_list

>>> print(df.my_list)
[1, 2, 3, 4, 5]
So, you need to specify what you wanted to do. If it was to attach a list, you may sometimes get a warning but it works. But if you want anything else, please explain where you want the list to go or be used.

Right, it does work. However, I would like not to have these warnings and to do things the way pandas accepts. Otherwise I will have to either:

Have a million warnings all over the code, hiding the warnings that actually matter.
Turn off all the warnings.

Neither of them is good or safe.

Cheers.

avi.gross · September 4, 2024, 12:37am

Actually, on re-reading the request I think now that the OP is saying they are doing exactly what they want by attaching a new variable to the dataframe object and their main concern is how to avoid seeing error messages that are actually somewhat misguided warnings.

A little search and experimentation suggests the following code did suppress UserWarning messages for me, albeit not just from Pandas and probably for the rest of the program:

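import warnings

import pandas as pd

# Ignore every UserWarning, from pandas or anything else, for the rest of the program
warnings.filterwarnings("ignore", category=UserWarning)

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
my_list = [1, 2, 3, 4, 5]

df = pd.DataFrame(data)
df.my_list = my_list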

Others may share better methods. One consideration here is that the assignment is outside normal Pandas functionality, so there is no obvious place for an option that tells Pandas you know what you are doing.

Others may well supply better answers.
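If suppression is used at all, it can be narrowed so that only this one message is ignored, and only around the assignment itself. The sketch below uses the standard library's warnings.catch_warnings and the message argument of warnings.filterwarnings; attach_quietly and COLUMN_WARNING are hypothetical names, and the message prefix is copied from the warning text quoted earlier in this thread, so it would need updating if pandas rewords it.

```python
import warnings

import pandas as pd

# Start of the pandas message quoted earlier; matched as a regex prefix.
COLUMN_WARNING = "Pandas doesn't allow columns to be created"


def attach_quietly(frame, name, value):
    """Set an attribute while ignoring only the column-creation warning."""
    with warnings.catch_warnings():
        # The filter is active only inside this block and is undone on exit.
        warnings.filterwarnings(
            "ignore", message=COLUMN_WARNING, category=UserWarning
        )
        setattr(frame, name, value)


df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})
attach_quietly(df, "my_list", [1, 2, 3, 4, 5])
print(df.my_list)
```

Other UserWarning messages, and the same warning raised elsewhere in the program, are still reported as usual.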

acampove · September 4, 2024, 12:41am

Hi, @avi.gross

Thanks for your answer. Yes, that’s one way, but as you said, it would hide everything, including warnings that I might actually need to see. So far, the best answer seems to be to create a metadata class that would not be seen as a container by pandas and that would hold the list, as @cameron suggested.

Cheers.

cameron · September 4, 2024, 12:43am

I do not think a wrapper class is a good idea, because I might have to
change things in too many places in the code.

I was thinking a direct subclass, not a wrapper. There might still be
code changes but hopefully they would be few.

The metadata class though, seems like the right way.

I was proposing that as a property provided by the direct subclass.
Just checking we’re on the same page there.

cameron · September 4, 2024, 12:53am

UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

I might have been misreading this as an exception.

On further reading, there’s an experimental DataFrame attribute
called attrs which you can set to a dict, which could then contain
your list?

See the examples in the docstring here:

https://github.com/pandas-dev/pandas/blob/d9cdd2ee5a58015ef6f4d15c7226110c9aab8140/pandas/core/generic.py#L364

        in the event that axes are refactored out of the Manager objects.
        """
        obj = cls.__new__(cls)
        NDFrame.__init__(obj, mgr)
        return obj

    # ----------------------------------------------------------------------
    # attrs and flags

    @property
    def attrs(self) -> dict[Hashable, Any]:
        """
        Dictionary of global attributes of this dataset.

        .. warning::

           attrs is experimental and may change without warning.

        See Also
        --------
        DataFrame.flags : Global flags applying to this object.

and the docs here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html#pandas-dataframe-attrs

avi.gross · September 4, 2024, 12:56am

Acampove,

I appreciate your further clarification. Cameron pointed out something that may explain it a bit: the implementation of pandas itself uses named attributes on the dataframe object. Sure enough, when I look at the object using dir(df), I see tons of dunder names, but others used by Pandas are the column names Age and Name, and others used internally start with a single underscore.

So what might happen if I overwrite, say, the Name column?

>>> df
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

>>> df.Age
0    25
1    30
2    35
Name: Age, dtype: int64

>>> df.Name
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

# Now, change it outside the expected way! 
>>> df.Name = ["wrong", "Wrong", "WRONG"]

>>> df
    Name  Age
0  wrong   25
1  Wrong   30
2  WRONG   35

It worked because I used the right number of items (3) but when I try four items, I get a ValueError.

What I think this suggests is that Pandas uses some well-known Python methods that intercept such assignments and do some validation, since a key property a dataframe guarantees is that all columns have the same length at the end of any operation.

And, since apparently it is possible to update a column using this method, when you try to add a currently unrelated variable, the software may think you were trying to update a name it uses and perhaps spelled it wrong.

So, my guess is now that Pandas internally has engineered this. There may be a setter hook (Python's __setattr__ method) that intercepts such assignments, redirects valid ones, and sometimes generates warnings or error messages.

I supplied one way to shut it up, but I can understand why you get a warning, just like some languages complain if you type:

if ( alpha = beta ) ...

The warning may be that you likely did not want to copy beta into alpha but perhaps wanted to test for equality with a double-equals symbol as in:

if ( alpha == beta ) ...

In brief, you did nothing wrong but it worries you might have. And, had you tried to update a name already in use, with something of the wrong length or maybe type, it would be good if it was caught.
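The kind of interception guessed at above can be illustrated with a toy class. To be clear, this is not how pandas is written: TinyFrame is a hypothetical stand-in that only shows how Python's __getattr__ and __setattr__ hooks let a class redirect assignments to existing columns, validate their length, and warn about list-like values assigned to unknown names.

```python
import warnings


class TinyFrame:
    """Toy stand-in for a dataframe; NOT the pandas implementation."""

    def __init__(self, columns):
        # Bypass our own __setattr__ while creating the column store.
        object.__setattr__(
            self, "_columns", {k: list(v) for k, v in columns.items()}
        )

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        columns = object.__getattribute__(self, "_columns")
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        columns = self._columns
        if name in columns:
            # Existing column: validate the length, then replace it.
            expected = len(columns[name])
            if len(value) != expected:
                raise ValueError(
                    f"expected {expected} values, got {len(value)}"
                )
            columns[name] = list(value)
            return
        if isinstance(value, (list, tuple)):
            # Unknown name with a list-like value: possibly a mistake.
            warnings.warn(
                f"{name!r} is not a column; storing it as a plain attribute",
                UserWarning,
                stacklevel=2,
            )
        object.__setattr__(self, name, value)


tf = TinyFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})
tf.Name = ["wrong", "Wrong", "WRONG"]  # right length: the column is replaced
print(tf.Name)

try:
    tf.Name = ["one", "too", "many", "items"]  # wrong length: rejected
except ValueError as exc:
    print("ValueError:", exc)
```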

acampove · September 4, 2024, 1:03am

Dear @cameron

Thanks for your answer. Yes, I totally missed that attrs feature, it seems to be new though. The nice thing about it is that it will be copied alongside the dataframe. With it:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
df = pd.DataFrame(data)


my_list = [1, 2, 3, 4, 5]

df.attrs['lst'] = my_list

in fact does not show any warning and I assume that this is meant to be the use case of this new attrs feature.

Cheers.
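The point about attrs being copied alongside the dataframe can be checked with a short sketch like the one below. It only exercises DataFrame.copy(); since attrs is documented as experimental, how it propagates through other operations may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})
df.attrs["lst"] = [1, 2, 3, 4, 5]

# The copy carries the attrs dictionary with it, and no column is added.
df_copy = df.copy()
print(df_copy.attrs)
print(list(df_copy.columns))
```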

avi.gross · September 4, 2024, 1:05am

I agree, Cameron, that a subclass would be a way to go: one in which some attribute name, such as Annotation, is already defined and can be modified, perhaps using a setter/getter if you want to get around warnings.

Another possibility if the annotation belongs to the class rather than an instance of the class, might be attaching it to the class.

I do see in some documentation that things added to a pandas object are not guaranteed to be copied when copies of it, or other objects derived from it, are made.

And, I hesitate to say this, but if Pandas is allowing columns containing other composite objects such as a list, then you could add a new column and store anything in any one or all rows as you wish. Weird design and perhaps thinking outside the box to put it inside the box!
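The copying caveat mentioned above is easy to observe. In this small sketch, a list attached as a plain attribute is present on the original dataframe but not on a copy made with DataFrame.copy(). The warning is silenced only inside the with block, to keep the demonstration output clean.

```python
import warnings

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

with warnings.catch_warnings():
    # Silence the column-creation warning for this demonstration only.
    warnings.simplefilter("ignore", UserWarning)
    df.my_list = [1, 2, 3, 4, 5]

df_copy = df.copy()

print(hasattr(df, "my_list"))       # attribute exists on the original
print(hasattr(df_copy, "my_list"))  # but was not carried over to the copy
```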

acampove · September 4, 2024, 1:10am

Hi @avi.gross

Yes, in summary, it seems we are not meant to use attributes that way with dataframes, because, as you said, pandas will interpret them as a column, through some mechanism. The Pandas people seem to have come up with an attrs feature to take care of this use case as @cameron mentioned above.

Actually, given that they already implemented attrs, it might be a better idea to make that warning an exception. That way the attribute would not be by accident made into a new column or even worse, replace the values of an existing column.

Cheers.

avi.gross · September 4, 2024, 1:14am

That is an interesting area, Cameron, and it worked when I tried it. Python has other dictionaries it uses such as to hold names and values of various internals and this simply adds a user-private dictionary anyone can use.

Using the same data as before, I added a name/value pair on a fresh dataframe, so that any warning would still be shown, since by default Python shows a repeated warning only once per code location.

>>> dfnew.attrs["my_list"] = my_list
>>> print(dfnew.attrs)
{'my_list': [1, 2, 3, 4, 5]}
>>> dfattr = pd.DataFrame(data)
>>> dfattr.attrs["my_list"] = my_list
>>> print(dfattr.attrs)
{'my_list': [1, 2, 3, 4, 5]}
>>> dfattr.attrs["my_list"]
[1, 2, 3, 4, 5]

So if using a recent version of pandas, this could be a useful way to safely store multiple kinds of metadata without warnings.

avi.gross · September 4, 2024, 1:28am

I apologize if some of my messages happen to duplicate what others said before I get around to reading them, and hopefully this message is not a duplicate.

Exceptions are meant for something serious enough to make the code stop, or to be caught and handled. Since most Python classes allow, or even encourage, adding additional fields without trying to intercept the attempt, I think it should not be an error globally.

And, in general, this is harmless as long as the programmer chooses names less likely to collide. Since the pandas dataframe does seem to intercept such attempts, if they wanted to, they could make it a more serious error along with a printed notice that they should be using attrs to add dictionary items instead.

The warning could indeed be modified to include that kind of notice, as well as to suggest the accepted way to update an existing column.

This discussion reminds me a bit of how some browsers want to carefully allow some kinds of local storage perhaps just for one session, by supplying a designated object that can have entries like this added to it such as from JavaScript. If this feature of pandas remains, it provides a similar way to store data safely in a dataframe.

I think I can now safely exit this topic.