Why doesn't this replacement change absolutely all entries meeting criteria?

at1with0 · May 1, 2024, 10:07pm

I tried this on a dataframe df that has one column containing the strings “A” and “B”. I want to change B to 1 and A to 0. Here is how I tried:

# Define the mapping dictionary
mapping = {'A': 0, 'B': 1}

# Replace values in the entire DataFrame using the mapping dictionary
df.replace(mapping, inplace=True)

print(df)

But when I look at the entries in that (after exporting to a csv so I can read it), it almost does the job. However, the “B” in the first row of the only column with strings in it is left unchanged. The rest of the column does adhere to the mapping I defined. I would like to have it where it changes all instances of B to 1 (and A to 0). What’s going on?

kknechtel · May 1, 2024, 10:10pm

Please show us a complete example so that we can reproduce (and properly understand) the problem. This would include:

an example of the initial DataFrame (ideally, some simple code to create a small DataFrame that exhibits the problem)
what exact result you get for that input
what exact result you expect for that input

at1with0 · May 1, 2024, 10:39pm

That’s probably beyond the scope of my question which is a project I’m working on. But pretend that the dataframe is smaller than the one involved in my project:

df = pd.DataFrame([[0.6, 'B'], [0.4, 'A']], columns=['Column1', 'Column2'])

Now when I run the code in the first post, I don’t know what happens with this dataframe but in the one I am working on, all the A’s and B’s correctly get switched to 0’s and 1’s except the B in the first row. There is one column of strings; “A” or “B”. Every entry gets switched to 0 or 1 correctly but not the first B in the first row. As far as I can tell, every entry that was “A” is now 0 and “B” switches to 1.

Except the first entry in that column…It doesn’t get switched.

edit: so the switching works here but doesn’t work for my dataframe. It comes from a csv having roughly 35,000 entries and I’m not sure if I can post something like that. Every letter is switched to 0 or 1 via the rules given in the initial post but the first entry does not get switched; only the rest get switched.

kknechtel · May 1, 2024, 11:03pm

Fortunately, there’s an easy way to find out: by trying it.

>>> import pandas as pd
>>> df = pd.DataFrame([[0.6, 'B'], [0.4, 'A']], columns=['Column1', 'Column2'])
>>> print(df)
   Column1 Column2
0      0.6       B
1      0.4       A
>>> mapping = {'A': 0, 'B': 1}
>>> df.replace(mapping, inplace=True)
>>> print(df)
   Column1  Column2
0      0.6        1
1      0.4        0

It looks to me like this is the correct result according to your intended logic. Yes?

So, if we get the right result with a test DataFrame but not with the actual DataFrame, then there must be something special about the actual DataFrame that causes the problem. Perhaps you can figure this out by trying to shrink it down. For example, if you just remove the first row (since that seems to be where the problem occurs), do you still see the problem in the modified result? What if you try using only the first half of the rows, or only the second half? What if you try using only the first row?

My guess is: in that column with strings, in the first row in the data, where you think you have a "B", you don’t actually have a "B". Maybe you have "B ". Maybe you have "B\n". Maybe you have a B followed by some non-printable control character. Maybe you have another Unicode character that looks like (depending on the font) a B, such as В (Cyrilic) or Β (Greek) or Ᏼ (Cherokee) or ᗷ (Canadian Aboriginal Syllabics - by the way, this one and the Cherokee one actually represent completely different sounds and are unrelated to our concept of “B”) or ℬ (as far as Unicode is concerned, you can have script letters that are actually different characters, rather than being the same thing in a different font - even though it might be rendered using a substituted script font) or Ｂ (“Fullwidth” capital letter sometimes used when writing European languages embedded within Chinese, Japanese or Korean text).

I could go on. Text is complicated.

at1with0 · May 1, 2024, 11:12pm

I did try it; see edit. I didn’t see the relevance on trying code I knew would work in a situation where I knew it would work and, sure enough, it did work in that special case. The csv imported into a dataframe has roughly 600 rows and 60 columns.

And when I tried it on my dataframe, it left the B in the first row of the only column having strings unchanged. I just don’t know what else to try, given your guess. Even if I have "B " or “B\n” should not the replace method using the mapping I defined still work??

kknechtel · May 1, 2024, 11:14pm

No; the keys in the mapping are only tested for exact matches, not searched for as substrings. Doing that would be much slower, and it would also defeat the purpose of using a dictionary. Aside from that, it would be completely ambiguous. What if you had "BB" for example - what should that get mapped to? How about "AB"?

at1with0 · May 1, 2024, 11:16pm

In my dataset the only strings are A and B; so no need to worry about the cases you mentioned.

kknechtel · May 1, 2024, 11:17pm

Well, Pandas clearly disagrees; so you’ll need to do something to assess what’s there accurately, and decide how it should be handled.

at1with0 · May 1, 2024, 11:18pm

In other words, you don’t know. Of course, I can get into the csv and manually change the B to a 1 but I was hoping for an automated way to do this.

kknechtel · May 1, 2024, 11:32pm

Of course I don’t know; it’s quite impossible for me to know what data you actually have, because it isn’t available for me to look at. I suggested several possibilities, and pointed at the first obvious place to look in order to investigate those possibilities.

If the question is “how do I make it handle the data I actually have?”, that will depend on what the issue is in the data. For example, you could try to strip whitespace first, or try to filter out non-alphabetic characters, or try to add more hard-coded replacements to the dictionary.

MRAB · May 2, 2024, 12:09am

Try printing out the problem entry of the dataframe with the ascii function.

For example:

>>> df = pd.DataFrame([[0.6, 'В'], [0.4, 'A']], columns=['Column1', 'Column2'])
>>> df
   Column1 Column2
0      0.6       В
1      0.4       A
>>> # Looks like a 'B'.
>>> df.iloc[0]['Column2']
'В'
>>> # Still looks like a 'B'.
>>> ascii(df.iloc[0]['Column2'])
"'\\u0412'"
>>> # Definitely not a 'B'.

Rosuav · May 2, 2024, 12:12am

Maybe you have a B preceded by some invisible character, like a zero-width joiner U+FEFF.