np.percentile returns wrong value?

AceTylerCholine · July 13, 2023, 5:27pm

After getting an unexpected value from scipy.stats.iqr I discovered the error was coming from np.percentile. Here’s an example of my issue:

x=[1,1,1,2,2,2,3,3,3,4]

There are 10 values in x, so the 25th percentile should be the mean of the 2nd and 3rd values, both of which are ‘1’, so the result should be 1.

But np.percentile(x, 25) returns 1.5

I get that Python starts counting at 0, but when using percentile it shouldn’t just ignore the first value in the list.

Presumably, this relatively basic and common function of NumPy in 2023 doesn’t have an “Error”, but I truly feel like it’s returning the wrong value. Am I crazy? Can someone explain this issue to me?

jamestwebber · July 13, 2023, 5:39pm

What version of numpy is this? I get 1.25.

It turns out “percentile” doesn’t have a strict definition. The docs for numpy.percentile lay out the many different methods you can use.

import numpy as np

options = [
    'inverted_cdf',
    'averaged_inverted_cdf',
    'closest_observation',
    'interpolated_inverted_cdf',
    'hazen',
    'weibull',
    'linear',
    'median_unbiased',
    'normal_unbiased'
]

x = [1,1,1,2,2,2,3,3,3,4]

for opt in options:
    print(np.percentile(x, 25, method=opt))

This outputs the following:

1
1.0
1
1.0
1.0
1.0
1.25 # this one is the default :D
1.0
1.0

kknechtel · July 14, 2023, 12:27am

From the documentation:

Given a vector V of length n, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V… This function is the same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.

It goes into detail after that, but the application is clear. Since there are 10 elements in your array, it takes 9 steps through the array to get from the minimum value to the maximum value. 25 percent of 9 is 2.25, so it conceptually starts at index 0 and takes 2.25 steps from there, landing between elements 2 and 3 (in zero-based indexing).
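
To make those steps concrete, here is a rough sketch of the arithmetic in plain Python. The helper name linear_percentile is made up for illustration and is not part of NumPy. It only mirrors the "q/100 of the way from minimum to maximum" description quoted above; it is not NumPy's actual implementation.

```python
import math

def linear_percentile(values, q):
    """Hypothetical helper: percentile by linear interpolation between sorted neighbours."""
    data = sorted(values)
    # Position q/100 of the way from index 0 to index n - 1.
    pos = (len(data) - 1) * q / 100
    lower = math.floor(pos)
    upper = min(lower + 1, len(data) - 1)  # stay in bounds when q = 100
    frac = pos - lower
    # Move the fractional part of a step from the lower neighbour towards the upper one.
    return data[lower] + frac * (data[upper] - data[lower])

x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4]
# Position is 9 * 0.25 = 2.25, a quarter of the way from x[2] = 1 to x[3] = 2.
print(linear_percentile(x, 25))
```

For this x the sketch prints 1.25, the same value reported above for the default method.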