I would like to calculate the Gini index in Python for categorical variables. I found this code, which I might be able to adapt:
    import numpy as np

    def gini(x):
        # (Warning: This is a concise implementation, but it is O(n**2)
        # in time and memory, where n = len(x). *Don't* pass in huge
        # samples!)
        # Mean absolute difference
        mad = np.abs(np.subtract.outer(x, x)).mean()
        # Relative mean absolute difference
        rmad = mad / np.mean(x)
        # Gini coefficient
        g = 0.5 * rmad
        return g
I have data on the areas visited by each person, for example:
- person 1: [zone2, zone4, zone5, zone2, zone2]
- person 2: [zone1, zone5, zone4, zone1, zone1, zone3]
- person 3: [zone3, zone3, zone3, zone1, zone3]
I want to measure how dispersed each person is across the areas they visit, as a value from 0 to 1, where I assume 0 means the least dispersion. For example, person 3 should come out as less dispersed than person 1. I think the Gini index captures this, but the function above expects numeric values and my variables (zones) are categorical.
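To make the goal concrete, here is a minimal sketch of one candidate measure I am considering, the Gini impurity (1 minus the sum of squared zone proportions), which works directly on category labels. This is a different quantity from the Gini coefficient computed by the function above, and I am not sure it is the right choice. The function name `gini_impurity` and the `visits` dictionary are my own, made up for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of category labels: 1 - sum(p_i ** 2).

    Returns 0.0 when every label is the same; the maximum for k distinct
    labels is 1 - 1/k, so the value always stays below 1.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)  # number of visits per zone
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Example data from the question (zone names used as plain strings).
visits = {
    "person 1": ["zone2", "zone4", "zone5", "zone2", "zone2"],
    "person 2": ["zone1", "zone5", "zone4", "zone1", "zone1", "zone3"],
    "person 3": ["zone3", "zone3", "zone3", "zone1", "zone3"],
}

for person, zones in visits.items():
    print(person, round(gini_impurity(zones), 3))
```

With this measure person 3 scores lower than person 1, which matches the ordering I expect, but the upper bound depends on the number of distinct zones, so it would need rescaling (for example dividing by 1 - 1/k) to span the full 0 to 1 range.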
Do you know how I can solve this?