Code Optimization Help


I hope this is the correct board, and I apologize if it is not. I was hoping to get some help optimizing this portion of code:

SortedList = []
for j in range(Size):
   XValues = [DF[(DF['Label'] == i) & (DF['X'] == j)]['Y'].mean() for i in Labels]
   SortedList.append([x for y, x in sorted(zip(XValues, Labels)) if y != 0.])

DF is a pandas dataframe and Labels is an external list. It is very slow since Size is in the 1000s. Any help would be appreciated. Thank you!

python-list might be a good place to ask, but a quick pointer is to avoid creating so many lists if you can help it. Assuming this is Python 3, XValues, the result of sorted(), and the final list you append to SortedList all hold copies of the same data, so you’re duplicating it three times per loop iteration. If you make XValues a generator expression and filter with either filter() or another generator expression, the only concrete list you need is the one you get by calling sorted() last.

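One thing to watch in the condition: & is the right operator here, not and. DF['Label'] == i is a pandas Series, and combining two Series with and raises a ValueError because the truth value of a Series is ambiguous. Keep the parentheses around each comparison, since & binds more tightly than ==.

And can y be anything but a float? If not, then you don’t need y != 0. and can test y directly, since 0.0 is the only falsy float.

import operator

SortedList = []
for j in range(Size):
   # Generator expression: each mean is computed lazily, one per label
   XValues = (DF[(DF['Label'] == i) & (DF['X'] == j)]['Y'].mean() for i in Labels)
   # Keep only the (mean, label) pairs whose mean (item 0) is non-zero
   filtered_list = filter(operator.itemgetter(0), zip(XValues, Labels))
   SortedList.append([x for y, x in sorted(filtered_list)])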
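To see the generator-plus-filter pipeline in isolation, here is a minimal sketch without pandas. The means list is made up for illustration and stands in for the per-label means of a single j; only Labels is a name taken from the question.

```python
import operator

Labels = ["a", "b", "c", "d"]
means = [2.5, 0.0, 1.0, 4.0]  # hypothetical per-label means for one value of j

# Generator expression: no intermediate list is built here
x_values = (m for m in means)

# filter() is lazy too; itemgetter(0) picks the mean out of each
# (mean, label) pair, so pairs with a mean of 0.0 are dropped
non_zero = filter(operator.itemgetter(0), zip(x_values, Labels))

# sorted() is the first step that materializes a list
ordered = [label for _, label in sorted(non_zero)]
print(ordered)
```

Label "b" is dropped because its mean is zero, and the rest come out in order of increasing mean.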

Seems to me you could do most of the computation via:

DF.groupby(['Label', 'X']).Y.mean()
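A rough sketch of how the groupby result could be turned back into SortedList, assuming the column names from the question and made-up toy data. One difference to be aware of: the original loop yields a NaN mean for a (Label, X) pair with no rows and keeps that label, while groupby simply has no entry for such pairs, so they are left out here.

```python
import pandas as pd

# Toy data; column names follow the question, the values are invented
DF = pd.DataFrame({
    "Label": ["a", "b", "c", "a", "b", "c", "a"],
    "X":     [0, 0, 0, 1, 1, 1, 1],
    "Y":     [3.0, 1.0, 0.0, 2.0, 5.0, 4.0, 4.0],
})
Labels = ["a", "b", "c"]
Size = 3  # X == 2 never occurs, to show the empty case

# One pass over the frame: mean of Y for every (X, Label) pair present
means = DF.groupby(["X", "Label"])["Y"].mean().reset_index()

# Restrict to the requested labels and drop zero means, like `y != 0.`
means = means[means["Label"].isin(Labels) & (means["Y"] != 0.0)]

# Order by mean within each X (ties broken by label, as tuple sorting does)
means = means.sort_values(["X", "Y", "Label"])

# Collect the ordered labels per X, then line them up with range(Size)
by_x = means.groupby("X")["Label"].apply(list)
SortedList = [by_x.get(j, []) for j in range(Size)]
print(SortedList)
```

The point of the design is that the frame is scanned once by groupby instead of once per (Label, X) pair, which is where the original loop spends most of its time.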