Statistics.stdev() does not agree with my own standard deviation calculation

Hello all,

Some might argue this is more of a maths question, but I feel like the answer is something Python-related, possibly something to do with Python floats - though I’m not sure.

I’ve written a small script that generates a set of ten numbers between one and ten inclusive and then calculates the standard deviation using an algorithm that I just looked up on Wikipedia. Then, finally, it calculates the standard deviation of the set using statistics.stdev() and compares the two.

Both calculated values are sort of in the same ballpark, but also miles off depending on how you look at it, and I can’t work out why. If someone could weigh in on this I’d appreciate it greatly. Here comes the script…

from math import sqrt
from random import randrange
from statistics import mean, stdev

values = []
values_mean = None
deviations = []
deviations_squared = []
variance = None
standard_deviation = None

# generate ten random values
for _ in range(10):
    values.append(randrange(1, 11))

# calculate the mean of the random values
values_mean = mean(values)

# calculate the deviation from the mean for each value
for value in values:
    deviations.append(values_mean - value)

# build a list of squared deviations
for deviation in deviations:
    deviations_squared.append(deviation**2)

# calculate the mean of the squared deviations (the "variance")
variance = mean(deviations_squared)

# calculate the standard deviation
standard_deviation = sqrt(variance)

print('the manually calculated standard deviation is >', standard_deviation)
print('the standard deviation according to stdev() is >', stdev(values))

Not at my PC to run it, but your script certainly looks correct.

How much difference are we talking about? 1.0, 0.1, 0.001? Depending on what you’re seeing I’d imagine it could be accrued floating point error, but only if the difference is small.

It’s due to “Bessel’s correction”.

If you’re calculating the standard deviation of a complete population, then you divide the sum of the squared deviations by the size of the population:

variance = sum(deviations_squared) / len(deviations_squared)

But if you’re calculating the standard deviation of a sample from a larger population, then you divide the sum of the squared deviations by one less than the size of the sample:

variance = sum(deviations_squared) / (len(deviations_squared) - 1)

This is because, with only a sample, you might be “missing” some of the true variance.
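
A concrete example may help; here’s a minimal sketch with made-up numbers showing the two divisors side by side:

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data
m = sum(values) / len(values)      # the mean, 5.0
deviations_squared = [(m - v)**2 for v in values]

print(sum(deviations_squared) / len(values))        # population variance: 4.0
print(sum(deviations_squared) / (len(values) - 1))  # sample variance: 4.571428...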

3 Likes

Thanks for the speedy response, I added a loop to run the script x1000 and collected the difference between the two calculations for each iteration, then calculated the mean. It came to 2.434051111955767, so not insignificant :thinking:

It could be, but I think there is something else going on. I modified the script so that instead of calculating the mean of the squared deviations via statistics.mean(), it calculates the sum of the squared deviations and then divides it by len(deviations) - 1.

When the algorithm is run in a loop x1000 and the difference between the two calculated stddevs is collected for each iteration, the mean average of all the differences is still coming out as 2.5368491195361726, or thereabouts, which is actually a greater difference than before, so I can only assume that statistics.stdev() assumes its argument is a complete group and not a subset from a larger group?

Your result matches statistics.pstdev(values) so look at the docs for the difference between pstdev() and stdev().
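
For example, a quick sketch:

from statistics import pstdev, stdev

values = [1, 2, 3, 4, 5]
print(pstdev(values))  # 1.4142135623730951 (divides by n)
print(stdev(values))   # 1.5811388300841898 (divides by n - 1)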

3 Likes

To synthesize:

Your script computes the variance as mean(deviations_squared). This means (pun intended) that the sum of the squared deviations is calculated, and then divided by 10 (because there are 10 values).

pstdev does the same calculation that you are doing: the sum of the squared deviations is divided by the number of values (10), in order to compute the population variance; then the square root is taken to get the population standard deviation.

However, stdev computes a sample standard deviation, which divides by 9 (one less than the number of values) instead. It assumes that the input values are drawn from a larger population, rather than representing a complete population. In these circumstances, Bessel’s correction applies, as Matthew describes in detail.

Thus, the answers differ, not by an offset, but by a constant factor of sqrt(10/9) (for an input of 10 values).
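
You can check that factor directly with a quick sketch (any ten values will do):

from math import sqrt
from statistics import pstdev, stdev

values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
print(stdev(values) / pstdev(values))  # 1.0540925533894598...
print(sqrt(10 / 9))                    # the same, up to float rounding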

1 Like

Thanks @steven.rumbalski and @kknechtel. I’ve substituted .pstdev() for .stdev() and the algorithm in the script now calculates the same value as the function call. I literally know nothing about it, but it would seem more appropriate for .stdev() to calculate the population standard deviation, and then to have a function named something like .sstdev() that calculates a sample standard deviation. Perhaps it’s because calculating a sample stdev is a more common operation?

One other interesting thing: as I’ve mentioned above, I encapsulated the stdev calculation in a loop that runs 1000 times; at the end of each iteration it calculates the difference between the stdev calculated by the algorithm and that calculated by .pstdev(), and stores the difference for that iteration in a list. Remember, the values for which the stdev is calculated are randomly generated for each iteration of the loop, but if the algorithm I wrote is correct, the calculated difference should be 0 each time.

In practice however, the difference is not always 0. When the algorithm is run x1000, on a small handful of the loop’s iterations, the algo and .pstdev() disagree and the difference is some very long scientific number.

Not sure what’s going on there; perhaps some technical, low-level thingy? I’ve included the final version of the script below.

from math import sqrt
from random import randrange
from statistics import mean, pstdev

diffs = []

for _ in range(1000):
    values = []
    values_mean = None
    deviations = []
    deviations_squared = []
    variance = None
    standard_deviation = None

    # generate ten random values
    for _ in range(10):
        values.append(randrange(1, 11))

    # calculate the mean of the random values
    values_mean = mean(values)

    # calculate the deviation from the mean for each value
    for value in values:
        deviations.append(values_mean - value)

    # build a list of squared deviations
    for deviation in deviations:
        deviations_squared.append(deviation**2)

    # calculate the mean of the squared deviations
    variance = mean(deviations_squared)

    # calculate the standard deviation
    standard_deviation_algo = sqrt(variance)

    # calculate the standard deviation using .pstdev
    standard_deviation_func = pstdev(values)

    diffs.append(abs(standard_deviation_algo - standard_deviation_func))

print(diffs)

I tried it, and the “very long scientific number” is always of order 1e-16 — this is a very small number, of order the precision of a 64-bit floating point number. For a computation like this, the details of the order of operations actually matter, and the difference in ordering between your calculation and the statistics module affects the result.
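
A tiny illustration of that sensitivity:

print(0.1 + 0.2 + 0.3)  # 0.6000000000000001
print(0.3 + 0.2 + 0.1)  # 0.6

Same mathematical sum, different floats; differences of this size are exactly what you are seeing in diffs.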

2 Likes

By which you mean a number that is very close to zero, displayed with a lot of decimal digits in a rather obnoxious way.

This is the standard issue of floating-point imprecision. Please read:

https://0.30000000000000004.com/
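
The site’s name is itself the classic example:

print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False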

2 Likes

Welcome to the wonderful world of float calculations!

It’s been covered, but this is why you might want to use math.isclose() to determine if you got the same result, rather than ==.
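
For example, a sketch applying it to the comparison in your loop:

from math import isclose, sqrt
from statistics import pstdev

values = [3, 1, 4, 1, 5, 9, 2, 6]  # example data
m = sum(values) / len(values)
manual = sqrt(sum((m - v)**2 for v in values) / len(values))

print(manual == pstdev(values))         # may be False due to rounding
print(isclose(manual, pstdev(values)))  # True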

The algorithm you wrote is “approximately correct”, but each floating point step is subject to round-off error. In contrast, the code in the statistics module computes exact values at each step, including a correctly rounded square root of an exact ratio at the end.

Here is an edit to your code that will reproduce the standard library results as you expected:

from math import sqrt
from random import randrange
from statistics import mean, pstdev
from fractions import Fraction                         # <-- Edited

diffs = []

for _ in range(1000):
    values = []
    values_mean = None
    deviations = []
    deviations_squared = []
    variance = None
    standard_deviation = None

    # generate ten random values
    for _ in range(10):
        values.append(Fraction(randrange(1, 11)))      # <-- Edited

    # calculate the mean of the random values
    values_mean = mean(values)

    # calculate the deviation from the mean for each value
    for value in values:
        deviations.append(values_mean - value)

    # build a list of squared deviations
    for deviation in deviations:
        deviations_squared.append(deviation**2)

    # calculate the mean of the squared deviations
    variance = mean(deviations_squared)

    # calculate the standard deviation
    h = sqrt(variance)                                 # <-- Edited
    x = variance - Fraction(h) ** 2                    # <-- Edited
    standard_deviation_algo = float(h + x / (2.0 * h)) # <-- Edited

    # calculate the standard deviation using .pstdev
    standard_deviation_func = pstdev(values)

    diffs.append(abs(standard_deviation_algo - standard_deviation_func))

print(diffs)

FWIW, the statistics module docs have a link to the pure Python source code, so it is easy to investigate how everything works.
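
If you want to poke around from the REPL, something like this works too, since the module is pure Python:

import inspect
import statistics

print(inspect.getsource(statistics.pstdev))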

3 Likes