Coverting data type of a df column from obj to int

dani_macu · May 3, 2022, 8:18am

Hi, I have an assignment due next week which I am very stuck on and I am hoping for guidance. I am stuck on preprocessing data so if I can’t clear this up I will not progress far with this assignment.

I have a column in my data frame app_df[‘Size’] which is an obj data type. The brief is:

Size column has sizes in Kb as well as Mb. To analyze, you’ll need to convert these to numeric.
Extract the numeric value from the column
Multiply the value by 1,000, if size is mentioned in Mb

I have tried
app_df[‘Size’] = app_df[‘Size’].astype(‘int’)

and get the following error which I don’t understand. I can’t get the data into the right format I won’t be able to extract the numerical data to be able to be able to perform the right mathematical operations?

ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_22780/4109999247.py in
----> 1 app_df[‘Size’] = app_df[‘Size’].astype(int)

~\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
5813 else:
5814 # else, only a single dtype is given
→ 5815 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
5816 return self._constructor(new_data).finalize(self, method=“astype”)
5817

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
416
417 def astype(self: T, dtype, copy: bool = False, errors: str = “raise”) → T:
→ 418 return self.apply(“astype”, dtype=dtype, copy=copy, errors=errors)
419
420 def convert(

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
325 applied = b.apply(f, **kwargs)
326 else:
→ 327 applied = getattr(b, f)(**kwargs)
328 except (TypeError, NotImplementedError):
329 if not ignore_failures:

~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
589 values = self.values
590
→ 591 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
592
593 new_values = maybe_coerce_values(new_values)

~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_array_safe(values, dtype, copy, errors)
1307
1308 try:
→ 1309 new_values = astype_array(values, dtype, copy=copy)
1310 except (ValueError, TypeError):
1311 # e.g. astype_nansafe can fail on object-dtype of strings

~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_array(values, dtype, copy)
1255
1256 else:
→ 1257 values = astype_nansafe(values, dtype, copy=copy)
1258
1259 # in pandas we don’t store numpy str dtypes, so convert to object

~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
1172 # work around NumPy brokenness, #1987
1173 if np.issubdtype(dtype.type, np.integer):
→ 1174 return lib.astype_intsafe(arr, dtype)
1175
1176 # if we have a datetime/timedelta array of objects

~\anaconda3\lib\site-packages\pandas_libs\lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: ‘19M’

vainaixr · May 3, 2022, 8:35am

you are doing something like this,

x = '19M'
int(x)

which would give the error,

ValueError: invalid literal for int() with base 10: '19M'

need to convert it,

int(x.split('M')[0])

gives,

if all the entries have a similar pattern like,

'19M'
'20M'

then the above technique would work,
if the entries have different patterns, like,

'19M'
'18K'
'21G'

then would have to do something like this,

y = ''
for i in x:
  try:
    y += str(int(i))
  except:
    pass

which would give,

'19'
'18'
'21'

for the above mentioned values of x

dani_macu · May 3, 2022, 8:37am

Thank you for your reply. Would I have to convert the colum data type to an integer before I do this? That is what I have been trying and I am very stuck. How do I run that for loop on the entire column? I need to keep the letter value (m or K) so that I know what I need to multiply by

vainaixr · May 3, 2022, 8:43am

you could keep the letter value also,
I have not used pandas that much, so, dont really know if there is a way to handle the Kb/Mb case.
but, something like this,

y = ''
z = ''
for i in x:
  try:
    y += str(int(i))
  except:
    z += i

if z == 'M':
  y = int(y)*pow(2, 20)
elif z == 'K':
  y = int(y)*pow(2, 10)

could work.
some more tweaks could be made,

x = '19M'
y, z = [''] * 2
for i in x:
  try:
    y += str(int(i))
  except ValueError:
    z += i

y = int(y)

match z:
  case 'M': y *= pow(2, 20)
  case 'K': y *= pow(2, 10)

I do not think any data type conversion is required.
need to access all entries of a particular column one by one, and then apply the transform described above.
if your dataframe is something like this,

import pandas as pd
x = pd.DataFrame({'Size': ['19M', '20K', '30K']})

then

new_size_list = []
for sz in x['Size']:
  y, z = [''] * 2
  for i in sz: 
    try:
      y += str(int(i))
    except ValueError:
      z += i
  y = int(y)
  if z == 'M': y *= pow(2, 20)
  elif z == 'K': y *= pow(2, 10)
  new_size_list.append(y)
x['Size'] = new_size_list

would probably be your transformed column.

cameron · May 3, 2022, 9:16am

By Danielle Catherine MacDonald via Discussions on Python.org at 03May2022 08:28:

Hi, I have an assignment due next week which I am very stuck on and I
am hoping for guidance. I am stuck on preprocessing data so if I can’t
clear this up I will not progress far with this assignment.

I have a column in my data frame app_df[‘Size’] which is an obj data type. The brief is:

Size column has sizes in Kb as well as Mb. To analyze, you’ll need to convert these to numeric.

Extract the numeric value from the column

Multiply the value by 1,000, if size is mentioned in Mb

I have tried
app_df[‘Size’] = app_df[‘Size’].astype(‘int’)

It sounds to me like your ‘Size’ column contains strings with unit
suffxes; your traceback shows:

~\anaconda3\lib\site-packages\pandas\_libs\lib.pyx in pandas._libs.lib.astype_intsafe()
ValueError: invalid literal for int() with base 10: '19M'

The string '19' is a valid int() argument, but '19M' is not due to
the 'M'.

The docs for astype() are here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

It only accepts a very limited set of types, and is really aimed at
converting ints to floats etc - low level conversions of things
which are already well understood and compatible.

I think what you need to do is write a function of your own to take one
of these strings and return the corresponding int value. As the brief
suggests, break the string into the integer (digits) part and the unit
('M').

If you can break the string into 2 pieces:

'19', 'M'

then you can deal with them separately. Suppose you’ve got those into
variables named int_part and unit_part: then you can go:

int_value = int(int_part)

to get a 19 and look up the unit string to find out what to multiply
by.

As far as breaking up the string goes, fortunately you probably have a
fixed number of units to handle. You could make a mapping (a dict) of
these:

unit_scale = {
    'K': 1000,
    'M': 1000000,
}

One thing that is nice is that you can also check for only those units
when converting the string: if it isn’t there then you can raise a
ValueError about it. And if you know this fixed set of suffixes you
can use those to break up the string.

For a given suffix, you can use the str.endswith method to test if it
ends with that suffix. Hardwired example:

s = '19M'
suffix = 'M'
if s.endswith(suffix):
    # it ends with this suffix

Now you know the unit suffix and can look it up in the unit_scale
mapping. Better, you can get the suffixes to check from the keys of
unit_scale, too, as unit_scale.keys(). And because you know the
suffix, you know its length, and because you know that you can get the
part of the string before it, which is what you want for int_part.

Strings are sequences. So this expression s[:-2] is all except the
last 2 characters in s. You can use that to compute int_part.

So, let’s suppose you figure out those steps. With a function like this:

def convert_size(size_s):
    ... break string into int_part and unit_part
    ... convert the int_part to an int
    ... multiply by the scale implied by the unit_part
    return int_value

You can make a new column in your DataFrame just by assigning a list to
it, like this:

>>> import pandas
>>> df=pandas.DataFrame({'a':[1,2,3],'b':['19M','29G','7K']})
>>> df
   a    b
0  1  19M
1  2  29G
2  3   7K
>>> df['c']=[ len(b_value) for b_value in df['b'] ]
>>> df
   a    b  c
0  1  19M  3
1  2  29G  3
2  3   7K  2

If you replace len(b_value) with your convert_size(size_value) you
should be able to make the desired column.

Hopefully that gives enough pieces to make a start on this, and perhaps
to complete it.

Cheers,
Cameron Simpson cs@cskk.id.au

dani_macu · May 3, 2022, 9:34am

Thank you so much for your reply; I’ve had such a headache trying to work this out. You are right about the values in the size column, they are for mobile phone app sizes; 413 unique values, most int with the suffix M, some with K and there is also a string value which says “Size differs”.

How I was trying to work it through was
That I needed to write a for loop which would:
Check each row in the size column
Split the values so that the integers and units were separate and able to be identified by index position
Write and if condition so that if index -1 (as the units are at the end of the object) was == M then multiply by 1000 and save this data in a new colum called “Converted_Size” as an interger data type.
After this drop all rows with the string only value in the size column.

Am I way off track? All I have been able to do for this project thus far is create the data frame by importing the csv and dropping the null values. This is the next step in the data processing and I just can’t get it quite straight in my head.

cameron · May 4, 2022, 12:15am

By Danielle Catherine MacDonald via Discussions on Python.org at 03May2022 09:44:

Cameron Simpson:

ValueError about it. And if you know this fixed set of suffixes you

Thank you so much for your reply; I’ve had such a headache trying to work this out. You are right about the values in the size column, they are for mobile phone app sizes; 413 unique values, most int with the suffix M, some with K and there is also a string value which says “Size differs”.

How I was trying to work it through was
That I needed to write a for loop which would:
Check each row in the size column
Split the values so that the integers and units were separate and able to be identified by index position
Write and if condition so that if index -1 (as the units are at the end of the object) was == M then multiply by 1000 and save this data in a new colum called “Converted_Size” as an interger data type.
After this drop all rows with the string only value in the size column.

Am I way off track?

No, your logic is sound.

All I have been able to do for this project thus far is create the data
frame by importing the csv and dropping the null values. This is the
next step in the data processing and I just can’t get it quite straight
in my head.

Often the trick is to write the pieces individually, then combine them
later. And DataFrames make the “loop” part easier.

Since you’re using DataFrames I’d suggest my formulation of:

df["Converted_Size"] = [ convert_size(size_s) for size_s in df["Size"] ]

That, all on its own, runs:

the loop per value
making the new column

There’s no need to drop the "Size" column unless you want to.

Then you just write a convert_size function which takes a single value
from the "Size" column and returns the value for the “Converted_Size”`
column. So just doing your (copied from above):

Split the values so that the integers and units were separate and able
to be identified by index position
Write and if condition so that if index -1 (as the units are at the end
of the object) was == M then multiply by 1000 and save this data in a
new colum called “Converted_Size” as an interger data type

This lets you concentrate on just converting a single string into the
appropriate integer value in isolation.

It also makes it much easier to test. At the interactive prompt you can
just go:

>>> convert_size('19M')

to see what it does and what it returns.

To avoid having it explode in your face with bad values, you could have
it return float('nan') for invalid size strings; that’s a special
float “not a number” value. It’s core purpose it to let you run
calculations without worrying about invalid results until later (for
example dividing by zero, etc). The basic effect is that any operation
with an NaN always returns another NaN, so that you know at the end
that the result is invalid, but your operation doesn’t raise an
exception in the middle. Which is good for bulk operations.

However, if you get an invalid string, you should print() a warning
just before returning the NaN so that you know that your function
failed to handle one of the strings. Include the offending string in the
warning so that you can see what it was - it may just be a suffix you’ve
missed, or some other legitimate string. Eg:

def convert_size(size_s):
    if ... split up string, recognise suffixes etc ...
        .......
        return int_value
    # we didn't handle the string and return int_value
    print("unhandled size string:", repr(size_s))
    return float('NaN')

You should find it easy enough to test size_s[-1] for your suffix. My
.endswith suggestion is a little more general - it would work with
multicharacter suffixes like 'MB', but it sounds like you do not have
those.

Cheers,
Cameron Simpson cs@cskk.id.au

dani_macu · May 5, 2022, 8:03am

THanks so much for your assitance with this. In the end I created a new coloumn for the size unit by extracting the string variable. Then I split out the string from the original column to just leave the value data. Then I wrote an if condition which mutliplied the appropriate values in the size column if the unit value in the new column was M. Then I dropped the new column I made when all the calculations were done and I didn’t need it any more. Thanks for helping me think through the logic.

cameron · May 5, 2022, 8:33am

By Danielle Catherine MacDonald via Discussions on Python.org at 05May2022 08:14:

THanks so much for your assitance with this. In the end I created a new
coloumn for the size unit by extracting the string variable. Then I
split out the string from the original column to just leave the value
data. Then I wrote an if condition which mutliplied the appropriate
values in the size column if the unit value in the new column was M.
Then I dropped the new column I made when all the calculations were
done and I didn’t need it any more. Thanks for helping me think through
the logic.

Glad to hear it. - Cameron Simpson cs@cskk.id.au

Isha_sh · January 10, 2023, 9:30am

Hi Danielle,

Can you share with me this project in jupyter python file format. As I have to take help from your project. It would be really great if you can help me with this.

Thanks!