Improve the numpy API to make it more flexible, compatible, object-oriented

jimy-byerley · September 26, 2020, 9:51am

Hello everyone.
I am writing this because even though I’m using numpy and scipy a lot in my programs, I don’t find them very satisfying in every use case.

So I started thinking about ways to make it better. I wrote this manifesto.

To start thinking about the API I asked the following question: what, fundamentally, is a numpy array?
Answer: a buffer.
It’s just an object that stores numbers in an optimized layout so that compiled operations can work on it, which speeds up computation and keeps storage compact.

That was the guiding idea.

The manifesto is more complete than this post, but here are the features the proposed API should add:

extensible (list-like) optimized arrays
stacking arrays without copying
arrays not owning their data
  so that any object supporting the buffer protocol can be used as an array, without any copy
array element types (dtype) can be any user-defined class
  including Python classes or compiled library types (think of https://github.com/Zuzu-Typ/PyGLM)
multiple array types
  with common functions plus additional methods specialized for each particular array type
object-oriented and subclassable types
possibility to add custom optimized operations
  so if you want to rewrite some operations like array.add or some math functions using a JIT compiler or a C module, you can
non-buffer arrays, like databases from files
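
To make the "no copy" items above concrete, here is a minimal sketch using only the standard library. It is not the proposed API, just an illustration of what the buffer protocol already allows: a memoryview shares memory with its owner instead of copying it. (numpy offers something comparable through np.frombuffer.) The variable names are made up for the example.

```python
from array import array

# Any object exposing the buffer protocol can be viewed without copying.
source = array('d', [1.0, 2.0, 3.0, 4.0])
view = memoryview(source)      # no copy: shares memory with `source`
tail = view[2:]                # slicing a memoryview does not copy either

tail[0] = 30.0                 # write through the view...
print(source.tolist())         # ...and the owner sees the change
print(tail.obj is source)      # the view only references its owner
```

The view never owns the data: it keeps a reference to `source`, which stays alive as long as any view points to it.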

Participation, criticism and ideas are welcome.

The main motivation for this is to try to provide a more convenient and powerful numpy to the community.

What do you think of such an API?
Do you folks have ideas about what to add to this API? (It can be just function name changes, or fundamental design concerns.)
Is anyone interested in implementing this together with me in the future?

lrjball · September 26, 2020, 1:51pm

There is a numpy mailing list, which would be the best place to direct this sort of idea. If it gains traction you can work on a NumPy Enhancement Proposal (NEP). See here for more details: https://numpy.org/neps/nep-0000

jimy-byerley · September 26, 2020, 2:20pm

Oh, thanks for the advice! I hadn’t noticed that numpy uses a system similar to PEPs.

The thing is, this new API is largely incompatible with the existing structures in numpy. For instance, dtype is intended to serve the same purpose, but the features the API proposes for it make it impossible to implement as an extension of the current numpy. The same goes for extensible arrays.
This API is mostly made of breaking changes.

I guess it is still a good idea to submit it to them

PythonCHB · September 29, 2020, 3:41pm

You really need to go to the numpy community, do some more research, and then go from there. But a couple comments:
numpy arrays are more than an API: the implementation is very much built in, and changing it would make for massive incompatibilities. I like to think of numpy arrays as two things:

A wrapper around a block of data (as you say, a buffer) – more specifically, “strided” data. It can be used as a way to interact with code written in other languages: C, C++, Fortran, more recently Julia, Rust, …
A Python nd-array object – this is the API that you see from Python.
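
For readers unfamiliar with "strided" data, here is a small sketch using only the standard library (memoryview and struct, not numpy itself). It assumes a row-major 2x3 layout of 8-byte doubles; the names are chosen for the example. The strides say how many bytes to step along each axis to reach the next element.

```python
import struct

raw = bytearray(6 * 8)                          # room for six 8-byte doubles
struct.pack_into('6d', raw, 0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)

grid = memoryview(raw).cast('d', shape=(2, 3))  # 2x3 view over `raw`, no copy
print(grid.shape, grid.strides)

# Locating element [1, 2] by hand: byte offset = sum(index * stride).
row, col = 1, 2
offset = row * grid.strides[0] + col * grid.strides[1]
value_via_strides = struct.unpack_from('d', raw, offset)[0]
print(value_via_strides == grid[row, col])
```

Compiled extensions rely on exactly this description (pointer, shape, strides, element format), which is why changing the underlying representation breaks them.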

If you change the Python API much, then it will no longer be compatible with mountains of Python code. If you change the underlying representation it will be incompatible with mountains of extension code. So the numpy community has a challenge when trying to move the library forward!

(note: you could make a new Python ndarray object, and if it uses the enhanced buffer protocol, you could at least get access to many compiled extensions)

As to your specific ideas: Some are already done, some are being worked on, and some are essentially impossible without a massive (breaking) restructuring.

jimy-byerley · September 30, 2020, 3:45pm

Thanks for the answer and for reading my specific ideas!

I totally agree with your description of the two sides of numpy (internal structure and Python API). And of course some of my ideas are simply incompatible with the current internal structure of an ndarray.

But I think that if some people are interested in this project, we can work together to create an alternative that could become the way to go for new programs. Of course the current numpy will be around for a long time, as it’s widely adopted.
This API and structure change is so big that it’s pointless to try to move numpy to it. Better to build something new alongside it.

jimy-byerley · September 30, 2020, 3:51pm

I think that, fundamentally, buffers of arbitrary elements (like structured numpy arrays) and ndarrays for maths are different problems.
They shouldn’t be handled by the same class.

That may be what made numpy grow in complexity over time, and what now makes it difficult for some extensions to handle completely (I’m thinking of Rust) and hard to extend and subclass.

jimy-byerley · April 14, 2021, 9:04pm

Some news about this idea:
I recently took the time to make a proof of concept of a few of the features I proposed in the first manifesto. (I also plan to use it in a project of mine soon.)

Here is what it looks like with Cython: arrex
Cython is definitely very convenient for writing such an extension, which mixes Python types and memory management code.

It gives an idea of what the following features can look like:

A dynamically sized array that can share its buffer with views or through the buffer protocol, without ever freezing the array size
  (usually, exposing the buffer requires a guarantee that the array will not reallocate; that guarantee is kept here while still allowing the array to resize)
Custom dtypes that are completely independent of any spec or dependency of the module providing the arrays
  despite the flexibility of the dtype, array access is ~3x faster than numpy (depending on the platform and the presence of specific optimizations)
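
To illustrate the usual constraint mentioned in the first item, here is how a standard-library bytearray behaves (this shows plain CPython behaviour, not arrex): as long as a view is exported, the owner refuses to resize.

```python
data = bytearray(b"abcd")
view = memoryview(data)        # exporting the buffer pins the size of `data`

try:
    data.extend(b"ef")         # resizing while a view exists is refused
    resized = True
except BufferError:
    resized = False
print(resized)

view.release()                 # once the export is gone...
data.extend(b"ef")             # ...resizing works again
print(data)
```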

The implementation details (that could be criticized by some):

The array itself does not own its data; instead it references a bytes object that owns the data. When the array needs to grow its capacity, a new bytes object is created. That way, the old one is only released when nothing references it anymore.
The dtype objects must be byte-copyable (except the PyObject* header, of course). At insertion/retrieval, the content of the object beyond the header is byte-copied between the object and the buffer. To declare a type as a new dtype, the user must make sure that the type is an extension type and is byte-copyable.
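
A rough pure-Python sketch of the first point, under stated assumptions: this is not the arrex code (which is written in Cython), the class name GrowableArray is hypothetical, a bytearray stands in for the bytes object because pure Python cannot write into bytes, and struct packing stands in for the byte-copy of extension-type objects.

```python
import struct

class GrowableArray:
    """Hypothetical sketch: fixed-size records in a replaceable buffer."""

    def __init__(self, fmt):
        self._struct = struct.Struct(fmt)                # layout of one element
        self._owner = bytearray(self._struct.size * 4)   # object owning the memory
        self._len = 0

    def __len__(self):
        return self._len

    def append(self, *fields):
        size = self._struct.size
        if (self._len + 1) * size > len(self._owner):
            # Grow by allocating a NEW owner. The old one is never resized, so
            # existing views stay valid and keep it alive until they are dropped.
            grown = bytearray(2 * len(self._owner))
            grown[:self._len * size] = self._owner[:self._len * size]
            self._owner = grown
        # Byte-copy the element into the buffer.
        self._struct.pack_into(self._owner, self._len * size, *fields)
        self._len += 1

    def __getitem__(self, index):
        if not 0 <= index < self._len:
            raise IndexError(index)
        return self._struct.unpack_from(self._owner, index * self._struct.size)

    def view(self):
        # Zero-copy view of the used part of the current owner.
        return memoryview(self._owner)[:self._len * self._struct.size]


points = GrowableArray('<3f')          # each element: three 32-bit floats
for i in range(4):
    points.append(float(i), float(i), float(i))

old = points.view()                    # references the first owner
points.append(9.0, 9.0, 9.0)           # reallocates, no BufferError raised
print(len(points), points[4], old.nbytes)
```

One consequence of this design worth noting: after a reallocation, a view taken earlier still points at the old buffer, so later writes to the array are not visible through it.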

Well, for now I’m not sure I will have time to make a more complete version of it. At least some of the most important features I proposed are here …
But if anyone has time and interest to spend on it, I would be very happy to hear your thoughts on the way it is built.

pitrou · April 15, 2021, 9:37pm

You may be interested in the Python array API standard (2021.01-DRAFT documentation).