Originally started on GitHub: math.average function proposal · Issue #123658 · python/cpython · GitHub
I would like to propose a new builtin named `avg` that averages an iterable. I have made the code for this; however, I am aware that this is a potentially controversial request. Perhaps this should be a part of the statistics module? You can find a PR for this here: gh-123658 Added a built-in average function by zykron1 · Pull Request #123659 · python/cpython · GitHub
There’s already `statistics.mean()`, so is this really just about making that a built-in function? Personally I don’t think it’s used often enough to justify it becoming a built-in.
I recognize your point that `statistics.mean()` is already accessible. But there are a few reasons why it could be advantageous to have average as a built-in function:
Convenience: You don’t have to import extra modules to access built-in functionalities. This can make code simpler, especially for novices or short scripts where it’s preferable to have as few imports as possible.
Consistency: Python already ships built-in functions for frequent operations like sum, min, and max. Adding average would make this list consistent with other basic operations, which may improve the intuitiveness of Python’s standard library.
Performance: Because a built-in function is implemented at a lower level, it is usually faster. Benchmarks, for example, have indicated that using `statistics.mean()`, which is implemented in pure Python, can be up to 12 times slower than a built-in average function. When average computations are made frequently or on huge datasets, this speed increase may be substantial. (A rough benchmark sketch follows this list.)
Standard Library Philosophy: Python places a lot of emphasis on minimising boilerplate code and making the language easy to use. Providing `avg` as a built-in supports this idea, eliminating the need to write extra code for a common operation.
Although statistics.mean() is a helpful function, adding avg as a built-in could improve the functionality, speed, and accessibility of this fundamental statistical operation.
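For what it’s worth, claims like the 12x figure above are easy to check yourself. A minimal `timeit` sketch, assuming a plain list of ints and comparing only the pure-Python options that exist today (numbers will vary by machine and Python version):

```python
# Rough micro-benchmark of the existing ways to compute a mean.
import timeit

setup = "import statistics; data = list(range(10_000))"
for stmt in ("statistics.mean(data)",
             "statistics.fmean(data)",
             "sum(data) / len(data)"):
    best = min(timeit.repeat(stmt, setup=setup, number=100, repeat=5))
    print(f"{stmt:28s} {best:.4f}s per 100 calls")
```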
Priorities. Is having `statistics.mean()` as a builtin function more useful than having `functools.reduce()` as a builtin function?
Have you read PEP 450 – Adding A Statistics Module To The Standard Library | peps.python.org?
Correctness over speed. It is easier to speed up a correct but slow function than to correct a fast but buggy one.
There would be many hurdles:
- Do you handle all the special cases like NaN etc. the same way?
- Have you tried improving the performance of `statistics` without adding builtins and without changing the behavior of such special cases?
- Did you compare to `statistics.fmean`?
- Or to `sum(x)/len(x)`?
- Why `avg` instead of `average` or `mean`?
- …
Be careful with this one, as it assumes you can iterate over it and get its length separately (or iterate over it twice and get the same number of elements), which isn’t true of arbitrary iterables. But it does show why this isn’t nearly as important as the others, since for anything where that IS true (most likely a list), it’s by far the simplest solution.
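To make that caveat concrete: a single-pass mean that accepts arbitrary iterables needs nothing more than a running total and a count. A minimal sketch (`iter_mean` is a hypothetical name, not anything in the stdlib):

```python
def iter_mean(iterable):
    """Mean of any iterable in a single pass; raises on empty input."""
    total = 0
    count = 0
    for value in iterable:
        total += value
        count += 1
    if count == 0:
        raise ValueError("iter_mean() arg is an empty iterable")
    return total / count

# Works where sum(x) / len(x) cannot: generators have no len().
print(iter_mean(i * i for i in range(10)))  # 28.5
```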
I have been looking into similar things for a while now.
I have been trying to work on special cases similar to this one, but, same as you, encountered a lot of resistance, and most of it is reasonable.
See couple of my PRs that were somewhat similar to yours that were rejected:
- gh-122586: itertools.count.(peek, consume) by dg-pb · Pull Request #122588 · python/cpython · GitHub
- gh-120478: `itertools.ilen` addition by dg-pb · Pull Request #120483 · python/cpython · GitHub
Numeric, iterator, and numeric-iterator operations in Python could use some serious work, and there is already enough content to draw from to come up with more radical advancements. Thus, introducing more complexity with special cases that can already be achieved (even if via slower methods) is probably not a good way forward.
E.g. `sum` already has a performant and flexible approach to reduction via `add`.
I think what would be more useful is development of the C API to introduce an intermediate iterator protocol that allows re-using correct and performant accumulators, reducers, and key-predicate-signalling at the C level.
Ideally, if that existed, this specific case could re-use the same (correct and fast) accumulation as `builtins.sum` with a minimal amount of effort.
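In pure-Python terms, “re-using the same accumulation as `builtins.sum`” might look something like the sketch below. The actual proposal is about C-level protocols, so this is only an illustration of the idea, with invented names:

```python
def mean_via_sum(iterable):
    """Let builtins.sum do the (correct, fast) reduction while we side-count."""
    count = 0

    def counting(it):
        nonlocal count
        for x in it:
            count += 1
            yield x

    total = sum(counting(iterable))
    if count == 0:
        raise ValueError("mean of an empty iterable")
    return total / count

print(mean_via_sum(range(1, 5)))  # (1 + 2 + 3 + 4) / 4 = 2.5
```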
Have a look at conclusion of Add `index: bool` keyword to `min`, `max`, `sorted` - #91 by dg-pb. This is where I am currently at with this.
And I don’t think `mean` is used often enough to be added to builtins. `statistics.mean` is a very reasonable place for it IMO.
The recurring theme is that nothing written in plain Python is fast enough. Have you considered using Rust?
When speed is paramount, use numpy. That is faster than a Python built-in function could ever be.
It does a different thing.
No, the recurring theme is that people are interested in speeding up Python primitives where possible. Because why wouldn’t we do that? But most code written in Python is fast enough[1]. And most code that isn’t, can be made fast enough by language-agnostic optimisations like caching, coding to use the hardware effectively, and similar techniques. It’s a tiny minority for which optimising language primitives is critical - it’s often nice to have, but that’s entirely different.
[1] Otherwise, why would so many people write code in Python?
My interpretation of the situation is very much different.
Em… Yes. Have you?
You mentioned prioritizing correctness over speed. My implementation achieves both: it’s faster and returns the exact same result, because it’s written in C as a built-in. While `fmean` is faster in some specific cases, it doesn’t consistently outperform a built-in.
You also suggested using `sum(x) / len(x)`. While that’s close in speed to my implementation, it’s worth noting that `len(x)` will return 0 for an empty iterable, leading to a division by zero error; something my approach handles more gracefully.
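For reference, this is how the existing options behave on empty input in a recent CPython (the exact error text may vary by version):

```
>>> import statistics
>>> statistics.mean([])
Traceback (most recent call last):
  ...
statistics.StatisticsError: mean requires at least one data point
>>> sum([]) / len([])
Traceback (most recent call last):
  ...
ZeroDivisionError: division by zero
```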
As for why `avg` instead of `average` or `mean`, I’d counter with: why `len` instead of `length`? Simplicity and brevity are often valued in Python’s naming conventions, and `avg` aligns with that.
Be careful using the word “exact” here. Does your implementation always return the exact same result as the `statistics.mean` function for all possible inputs?
What does it do in this case:
```
>>> import statistics
>>> statistics.mean([1e100, 3, -1e100])
1.0
```
Yes, I tested with numbers in scientific notation. It does in fact work perfectly fine. Currently the code for this lives on my fork; however, I will not attempt a merge until there is enough support.
```
Python 3.14.0a0 (heads/fix-issue-123658-dirty:e5477316b4, Sep 4 2024, 16:43:22) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> avg([1e100, 3, -1e100])
1.0
```
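For context, the reason this input is tricky: with naive float summation, `1e100 + 3` rounds straight back to `1e100`, so the big terms cancel to `0.0` and the `3` is lost. `math.fsum`, which `statistics.fmean` builds on in current CPython, tracks partial sums exactly and avoids this:

```
>>> xs = [1e100, 3, -1e100]
>>> sum(xs) / len(xs)   # the 3 is lost to rounding, then the 1e100s cancel
0.0
>>> import math
>>> math.fsum(xs) / len(xs)
1.0
```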
Adding an optional C accelerator module for `statistics` would be a much less controversial change than proposing a new built-in.
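That is the pattern several stdlib modules (e.g. `heapq`, `datetime`) already use: a pure-Python implementation that an optional C module silently overrides. A minimal sketch of the convention, with a hypothetical accelerator module name:

```python
# Sketch of a statistics-style module with an optional C accelerator.

def mean(data):
    """Pure-Python fallback; kept as the readable reference implementation."""
    data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    return sum(data) / len(data)

try:
    # Hypothetical C accelerator; if importable, its faster mean wins.
    from _fast_statistics import mean
except ImportError:
    pass
```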
While a C accelerator module would indeed be beneficial, integrating `avg` as a built-in function has a broader impact. Built-ins are widely accessible, encourage best practices, and improve performance across all applications without additional dependencies. Given how commonly averages are calculated in fields like data science, finance, and engineering, having a fast, built-in `avg` function could significantly streamline code and reduce cognitive load. Not to mention that an import statement can come with some additional bloat. In the end, this is all just an effort to make Python better and cleaner.
Some things have to be imported. Even in simple cases, very few Python scripts can manage without importing anything. To you it might seem like `mean` is a commonly used function, but I very rarely use it, so to me it is not.
The argument for average to be a builtin is very unlikely to be successful so I suggest giving up on that to focus on the other parts.
Ahmed,
I sympathize with the many wishes to include all kinds of commonly used functionality in as convenient a place as possible.
But Python is growing and changing; some think way too much or too fast, and it has become bloated in an amazing number of ways.
If we add your version of the “mean” under any name, then someone will want the harmonic mean, and someone will want a function to do standard deviation or kurtosis or a trimmed mean where you specify what kind of outliers to remove and it can be endless.
It can make more sense to gather lots of statistical functions into one place under some module name like stats, or throw them into the math module, and so on. You can often end up finding multiple versions, such as in numpy, too. It then does become necessary for the user either to use qualified notation to specify which one they want, or to do an import and perhaps rename it to whatever they want.
As for speed, others have commented on this and I find it unsettling. It is true that Python is one of many interpreted languages where a purist, using mainly functions built on calling other functions all written in the higher-level language, be it Python or R, will find things significantly less efficient. I have seen people write shell scripts that read in data line by line and call a pipeline of a dozen separate programs to process each line. That can take a long time, while replacing it with a single program, like an AWK script that runs once as a single process, shortened it from hours to seconds. But that too was largely interpreted; obviously a decently written program using SQL on a database, or a highly efficient compiled language like C++ or a more modern one, could complete almost as soon as it starts.
If certain kinds of optimized speed considerations dominate, maybe interpreted languages are not ideal. But what we now have is a bit of a Frankenstein monster where more and more pieces are grafted in, not just from C or C++ but also libraries from Fortran or Rust. There is nothing necessarily wrong with that, albeit it may remove some flexibility.
I have often wanted to use some code, such as a function, as a model I can study and adapt. Often I find a few dozen lines written that I can look at, see how they did something, and borrow it into something I create, or make my own modified copy where I add some features not in the original.
But, more and more often, I end up seeing that it is “.internal”, as in it just calls a compiled library function. It is now a black box. Yes, with some work I can find the code in some other language, and perhaps borrow parts to make my own version that I then have to figure out how to link in, and so on. Most people will just give up!
This is not to say that you cannot make a compiled bit in C++ that supports six dozen scenarios just in case, so that perhaps I just need to read the manual page, find out how to ask nicely, and get what I want. But with a goal of efficiency, you would probably find it easier to make the function compile into a small amount of memory and run fast.
As just an example of what I mean, here is an outline. I often have a function I call that I would like to do a bit more processing before returning a result: in the mean example, one that removes what I consider not available (NA), or things like Inf, before calculating the mean. Or I may want it to first trim away outliers or do some kind of rounding. One way to do it is to add additional optional arguments that the function itself would ignore but pass along, as with `...` in R, to other functions it calls which do know what to do with them, perhaps ones I wrote. A sketch of that pattern in Python follows.
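In Python, that pass-along pattern is typically spelled `**kwargs`. A hypothetical sketch (none of these names exist in any library):

```python
import math

def trimmed_mean(data, trim=0.0, clean=None, **clean_kwargs):
    """Average `data`, optionally cleaning it first and trimming extremes.

    Extra keyword arguments are forwarded untouched to the user-supplied
    `clean` callable, mirroring the `...` pass-through described above.
    """
    if clean is not None:
        data = clean(data, **clean_kwargs)
    data = sorted(data)
    k = int(len(data) * trim)          # number of values to drop at each end
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)

def drop_non_finite(xs, keep_inf=False):
    return [x for x in xs if math.isfinite(x) or (keep_inf and math.isinf(x))]

print(trimmed_mean([1.0, 2.0, float("nan"), 3.0], clean=drop_non_finite))  # 2.0
print(trimmed_mean([1, 2, 3, 4, 100], trim=0.2))  # drops 1 and 100 -> 3.0
```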
Your proposed function would not likely do the things I mention. That is not necessarily bad, but if you also look at Python as a sort of interactive teaching tool, …
But my personal view is that when Python was created, it made decisions that later users did not appreciate. Lists are nice abstract structures that can do anything, but doing serious arithmetical operations on them is a pain. A pretty concept like a list of lists to represent a matrix, let alone deeper nested list structures to represent 3-D and 4-D matrices, where nothing seriously checks data integrity, is not really ideal. Vectors of a sort, as in other languages, have advantages, especially in speed. So do dataframe objects and more. Hence, to do serious computing of some kinds, some have left Python entirely for languages designed differently, such as R, or have had to add modules like numpy and pandas that extend the language. But note that as useful as these are, and as heavily used as they are by other packages that do statistics or “AI” and so on, they are not in the Python core.
Would you argue your proposal will bring more bang for the buck than if something like that became standard?
I actually recently needed to locate functions to do the mean and standard deviation and so on, and a brief search told me what to import and use. Many languages are now designed in modular fashion, and often the core is mainly a bootstrap for loading what is actually needed in your program.
Perhaps there is an intermediate idea here. As an example, in R, many programs used a growing set of packages that came to be called the tidyverse. Any one program might start by including one such package after another, even if hardly anyone ever used them all in the same program. So someone set up a package you could load, called tidyverse, which did little more than load a whole bunch of the commonly used packages as a bundle. Your program got a tad simpler, and you no longer knew or cared which function came from which package; but if you used other packages, you still needed to add them one at a time.
So, can Python have a similar concept? Can you start with a relatively small and sparse core and then pick one or a few add-in clusters of modules? Right now, most versions of Python simply load such a batch whether you need it or not, and there is contention over who gets to be in whatever “core” Python means. But could you easily just load “statistics_group” or “text_processing_group”, so that a few lines made your own “core”, and then the discussion could change to lobbying for your favorite functionality to be included in that group?
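In Python terms, such a bundle is just a module that re-exports others. A tiny hypothetical sketch (the module name is invented):

```python
# statistics_group.py: a hypothetical bundle in the tidyverse style,
# pulling commonly paired modules in with one import.
import math
import random
import statistics

__all__ = ["math", "random", "statistics"]
```

A script would then start with `from statistics_group import *` and have the whole cluster available at once.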
Please note the above comments are not against adding things or changing them for better versions. Some things can and should be done to keep Python competitive and usable by many people. We just cannot put everything imaginable in, let alone near, the core.
There isn’t a compelling reason for this to be a built-in.
Also, adding this name is sure to conflict with code out there already. I mean, it may work, but I have code with `avg` variables that would then shadow a built-in.
If you want these things available for interpreter usage, import them or add something to import them in PYTHONSTARTUP.
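For example (the file path and chosen names here are just illustrative):

```python
# ~/.pythonstartup.py: run automatically at REPL start when the shell
# exports PYTHONSTARTUP=~/.pythonstartup.py
from statistics import mean, fmean, median

avg = mean  # spell it however you like in your own sessions
```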