An average function builtin?

All those fields work best with vectorized versions of the various data analysis algorithms, which can be found in data analysis modules.

The standard library version is there for educational purposes and as a convenience for ad hoc scripting rather than being relied on for any kind of serious number crunching activity.

So campaigning for a built-in simply isn’t going to get you anywhere (I doubt you could even find a sponsor for such a PEP, let alone get the PEP accepted).

Even a C accelerator module might be judged to be too much maintenance hassle for not enough benefit, but it’s an idea that would at least be given some consideration rather than being dismissed outright.


Not enough tests. statistics.mean uses math.fsum, while you are adding terms left to right (associating from the left). You can produce examples in which these two will differ significantly.

If you want to guarantee the same results, you will have to use or imitate fsum.


Maybe try with [float(2**53), float(1.0), float(1.0)]. I haven’t tried myself, but

from math import fsum

fsum([float(2**53), float(1.0), float(1.0)]) / 3.0
# gives 3002399751580331.5

while

((float(2**53) + float(1.0)) + float(1.0))/3
# gives 3002399751580330.5

If you read my code, you'll see I implemented a Kahan summation system; it works fine.
Edit: I really don’t care about getting a builtin anymore; I learned quite a lot along the way, and I might just close this post. Perhaps I’ll change this to something about adding a C accelerator to the statistics library.


No you did not. Did you look up what Kahan summation is before making this claim? Or did you just expect us to not actually read your code? Or is there some other branch where you implemented it correctly that we can’t see?

No, it does not. It uses an even better, but significantly slower, sum implementation that has the benefit of being 100% accurate no matter the input types.
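For readers following along, this is roughly what Kahan (compensated) summation looks like in pure Python. It is just the textbook algorithm, not the code from the statistics module or from the linked branch, and the helper name kahan_mean is purely for illustration:

def kahan_mean(values):
    total = 0.0
    compensation = 0.0                   # low-order bits lost in earlier additions
    count = 0
    for x in values:
        count += 1
        y = float(x) - compensation
        t = total + y                    # low-order bits of y may be lost here...
        compensation = (t - total) - y   # ...and are recovered here
        total = t
    return total / count

kahan_mean([float(2**53), 1.0, 1.0])
# gives 3002399751580331.5, matching the fsum-based result above

For this particular input the compensation term recovers the two 1.0s that plain left-to-right addition drops, so the result happens to match fsum; in general Kahan summation only bounds the rounding error and is not always bit-for-bit identical to fsum.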


I apologize for the confusion; it seems my push failed. I will push soon.
Please note, I probably won’t make many more changes, as I know this won’t get merged. Like I said above, I have given up on this.

Edit 1: Below is my result:

>>> avg([float(2**53), float(1.0), float(1.0)])
3002399751580331.5

Where are you pushing? Can you send a link?

It seems as if people are discussing code that they can see, but I don’t think I have seen any…

Here you go: GitHub - zykron1/cpython at fix-issue-123658

Here it is with the simulation: GitHub - zykron1/cpython at fix-issue-123658

Ah, okay. That converts to float (double) and uses what looks like Kahan summation.

The statistics module is different because it handles Decimal and Fraction as well:

In [5]: from fractions import Fraction as F

In [6]: import statistics

In [7]: print(statistics.mean([F(1,3), F(1,7)]))
5/21
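For contrast, and continuing the session above (so statistics and F are already imported), statistics.fmean converts its inputs to float before averaging, which is roughly what a float-only builtin or C accelerator would do, so the exact Fraction result is lost:

print(statistics.fmean([F(1,3), F(1,7)]))
# prints roughly 0.238095238095238 -- a plain float; the exact 5/21 is gone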


You make a good point, Alyssa.

I remember when the language S first came out at Bell Labs (it partially evolved into an open-source language called R that I often use), and the name implied Statistical computing. (R is a sort of joke on the name S.)

The point was that for serious statistics work, you wanted a language with good data types and vectorized functions that allowed fairly rapid calculations and an easy way to describe complex operations such as multiplying matrices.

That was not the main goal of Python, where the original idea was to base much of the language on the more flexible concept of a list, plus a few other data structures that can also contain many different kinds of things and thus are not as efficient to store or manipulate.

The main way for Python to compete is to create additional modules, or collections of modules, that use a somewhat different concept and environment within that zone. NumPy and its cousin pandas are two of many such examples. Many problems start by taking items in a Python program that are stored in lists, dictionaries, and various other collections or iterables and copying them into NumPy arrays or pandas DataFrames. Once there, in an environment largely implemented in compiled library code, you can write what is still Python code but does many things very fast, i.e. vectorized. When done, you may optionally convert the results back into more familiar Python objects, but I suspect in more and more places that is not necessary, as NumPy-style objects are now written to blend in all over the place.
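As a tiny sketch of that workflow (assuming NumPy is installed; the numbers are made up):

import numpy as np

data = [12.5, 7.25, 3.0, 9.75]      # ordinary Python list
arr = np.asarray(data)              # copy once into a NumPy array
print(arr.mean(), arr.std())        # vectorized reductions run in compiled code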

There are other worlds peripheral to Python, such as the various modules that make graphs, scientific/statistical libraries such as sklearn, and a slew of machine-learning and AI tools starting with TensorFlow, with layer upon layer built on top of it (and of course NumPy), such as Keras.

When I use such tools, I see them as an extension of python that adds functionality that is not built-in but is made possible by an architecture that allows optional expansion.

But although some see taking a mean as a totally common need, others see areas where taking means is part of a larger grouping of code, including working on massive amounts of data. As @ncoghlan points out:

The standard library version is there for educational purposes and as a convenience for ad hoc scripting rather than being relied on for any kind of serious number crunching activity.

Much of Python is that way. And it is not a bad thing to then, only when needed, extend python for your purposes.

This reminds me a bit of an amusing little book about Lisp [1] from years ago that showed a recursive function to decide if A is greater than B. I will spare you the code, except to say that if you asked whether a billion is greater than a trillion (assuming you are not in a region like Britain), the method is to subtract one from each and call yourself until one or the other number reaches 0. At that point, about a billion deep in the stack unless you have tail recursion implemented, you decide the trillion is bigger!
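For the curious, the spared code looks roughly like this in Python (greater_than is just an illustrative name, not anything from the book):

def greater_than(a, b):
    # "Count down both" comparison in the style of The Little LISPer.
    if a == 0:
        return False      # a ran out first (or both did), so a is not greater
    if b == 0:
        return True       # b ran out first, so a is greater
    return greater_than(a - 1, b - 1)

# greater_than(10**9, 10**12) would need about a billion nested calls;
# CPython raises RecursionError long before reaching an answer.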

But is that really a good solution? It is sort of elegant and in principle works for really big numbers on a near-infinite Turing machine. But on many computers, which hold a number in, say, 64 bits, there are fast instructions to determine whether the contents of register A are less than, greater than, or equal to the contents of register B, in constant time.

The point is that a toy version of LISP that uses recursion for everything is not necessarily the best tool for many problems, but it is certainly fine for teaching ways of approaching problems, especially small ones. Python is way more than that LISP was, but it shares some aspects, IMO.


  1. The Little LISPer Google Books ↩︎