Improve the numpy API to make it more flexible, compatible and object-oriented

Hello everyone.
I'm writing this because even though I use numpy and scipy a lot in my programs, I don't find them satisfying in every use case.

So I started thinking about ways to make it better. I wrote this manifest.

To start thinking about the API I asked myself the following question: what, fundamentally, is a numpy array?
Answer: a buffer.
It's just an object that stores numbers in an optimized layout on which compiled operations can run, which speeds up computation and keeps storage compact.

This was the guiding idea.

The manifest is more complete than this post, but here are the features the proposed API should add:

  • extensible (list-like) optimized arrays

  • stacking arrays without copying

  • arrays not owning the data
    allowing any object with the buffer protocol to be used as an array, without any copy

  • array element type (dtype) can be any user-defined class
    including a Python class or a type from a compiled library (think of https://github.com/Zuzu-Typ/PyGLM)

  • multiple array types
    with common functions and additional methods specialized for the particular array

  • object oriented and subclassable types

  • possibility to add custom optimized operations
    so if you want to rewrite some operations like array.add or some math functions using a JIT compiler or a C-module, you can

  • non-buffer arrays, like databases from files
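Some of these points can already be approximated with the existing numpy. As a minimal sketch of the "arrays not owning the data" item, assuming plain numpy and a `bytearray` as the buffer owner (the variable names are only illustrative, this is not the proposed API):

```python
import numpy as np

# Any object exposing the buffer protocol can back an array; here a bytearray.
raw = bytearray(4 * 8)  # room for four float64 values, zero-initialised

# frombuffer wraps the existing memory instead of copying it.
view = np.frombuffer(raw, dtype=np.float64)
view[:] = [1.0, 2.0, 3.0, 4.0]

# The array does not own its data: the writes above landed in the bytearray.
owns = view.flags.owndata
shares = np.shares_memory(view, np.frombuffer(raw, dtype=np.uint8))
```

Here `owns` reports whether numpy allocated the memory itself, and `shares` checks that a second wrapper over `raw` overlaps the first one.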

Participation, criticism and ideas are welcome.

The main motivation for this is to try to provide a more convenient and powerful numpy to the community :slight_smile:

What do you think of such an API?
Do you folks have ideas about what to add to this API? (They can be simple function name changes or fundamental design concerns.)
Is anyone interested in implementing this together with me in the future?

There is a numpy mailing list, which would be the best place to direct this sort of idea. If it gains traction you can work on a NumPy Enhancement Proposal (NEP). See here for more details: https://numpy.org/neps/nep-0000

Oh :open_mouth: thanks for the advice! I hadn't noticed that numpy uses a system similar to PEPs.

The thing is, this new API is largely incompatible with the existing structures in numpy. For instance, dtype is intended to serve the same purpose, but the features the API proposes for it make it impossible to implement as an extension of the current numpy. The same goes for extensible arrays.
This API is mostly made of breaking changes.

I guess it is still a good idea to submit it to them :slight_smile:

You really need to go to the numpy community, do some more research, and then go from there. But a couple comments:
numpy arrays are more than an API: the implementation is very much built in, and changing it would make for massive incompatibilities. I like to think of numpy arrays as two things:

  1. A wrapper around a block of data (as you say, a buffer) – more specifically, “strided” data. It can be used as a way to interact with code written in other languages: C, C++, Fortran, and more recently Julia, Rust, …

  2. A Python nd-array object – this is the API that you see from Python.

If you change the Python API much, then it will no longer be compatible with mountains of Python code. If you change the underlying representation it will be incompatible with mountains of extension code. So the numpy community has a challenge when trying to move the library forward!

(note: you could make a new Python ndarray object, and if it uses the enhanced buffer protocol, you could at least get access to many compiled extensions)
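To make the "strided block of data" and buffer protocol points concrete, here is a small sketch using only standard numpy attributes and the built-in `memoryview`; the array contents are arbitrary:

```python
import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)

# Strides: how many bytes to step to reach the next element along each axis.
row_major_strides = a.strides   # 16 bytes per row, 4 bytes per column

# A transposed view reuses the same block of data; only the strides change.
t = a.T
same_memory = np.shares_memory(a, t)

# The buffer protocol hands the same shape/stride description to any consumer,
# which is what lets compiled extensions read the data without copying.
m = memoryview(a)
```

A new array type that exports its memory the same way would be visible to such consumers through `m.shape`, `m.strides` and `m.itemsize`, independently of numpy's own Python API.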

As to your specific ideas: Some are already done, some are being worked on, and some are essentially impossible without a massive (breaking) restructuring.

Thanks for the answer and for reading through my specific ideas!

I totally agree with your description of the two sides of numpy (internal structure and Python API). And of course some of my ideas are simply incompatible with the current internal structure of an ndarray. :slight_smile:

But I think that if some people are interested in this project, we can work together to create an alternative that could become the way to go for new programs. Of course the current numpy will last a long time, as it's widely adopted.
This API and structure change is so big that it's pointless to try to move numpy to it. Better to build something new alongside it.

I think that, fundamentally, buffers of arbitrary elements (like structured numpy arrays) and ndarrays for maths are different problems.
They shouldn't be handled by the same class.

That may be what made numpy grow in complexity over time, and what now makes it difficult for some extensions to support completely (I'm thinking of Rust) and awkward to extend and subclass :thinking:

Some news about this idea:
I recently took the time to make a proof of concept of a few of the features I proposed in the first manifest. :sunglasses: (I also plan to use it in a project of mine soon.)

Here is what it looks like with Cython: arrex
Cython is definitely very convenient for writing such an extension, which mixes Python types and memory management code.

It gives an idea of what the following features could look like:

  • A dynamically sized array that can share its buffer with views or through the buffer protocol without ever freezing the array size
    (Usually, exposing the buffer requires guaranteeing that the array will not reallocate. That guarantee is kept here, while still allowing the array to resize.)

  • Custom dtypes that are completely independent of any spec or dependency of the module providing the arrays

  • Despite the flexibility of the dtype, the array access is ~3x faster than numpy (depending on the platform and the presence of specific optimizations)
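For context on the first point, this is the usual behaviour with a built-in resizable container: a `bytearray` refuses to change size while its buffer is exported. The snippet uses only built-ins and shows the constraint, not how arrex works:

```python
buf = bytearray(b"abcd")
export = memoryview(buf)      # a consumer now holds a pointer into buf

try:
    buf.extend(b"efgh")       # growing could reallocate under the consumer
    resized = True
except BufferError:
    resized = False           # Python refuses while an export exists

export.release()
buf.extend(b"efgh")           # allowed again once no export remains
```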

The implementation details (that could be criticized by some):

  • The array itself does not own its data; instead it references a bytes object that owns the data. When the array needs to grow its capacity, a new bytes object is created. That way, the old one is only released when nothing references it anymore.

  • The dtype objects must be byte-copyable (except the PyObject* header of course). At insertion/retrieval, the content of the object beyond the header is byte-copied between the object and the buffer. In order to declare a type as a new dtype, the user must make sure that the type is an extension type and is byte-copyable.
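The first implementation detail can be sketched in pure Python. This is only an illustration of the idea, not the arrex code: the class `GrowableArray` and its methods are hypothetical, a `bytearray` of fixed length stands in for the bytes object (pure Python cannot write into `bytes`), and `struct` stands in for a byte-copyable dtype.

```python
import struct

class GrowableArray:
    """Hypothetical sketch: the owner buffer is replaced, never resized."""

    def __init__(self, fmt="d", capacity=4):
        self._item = struct.Struct(fmt)          # stand-in for a dtype
        self._owner = bytearray(max(1, capacity) * self._item.size)
        self._len = 0

    def __len__(self):
        return self._len

    def append(self, value):
        used = self._len * self._item.size
        if used + self._item.size > len(self._owner):
            # Allocate a larger owner and copy; the old owner is left as-is,
            # so views already exported from it stay valid.
            new_owner = bytearray(2 * len(self._owner))
            new_owner[:used] = self._owner[:used]
            self._owner = new_owner
        self._item.pack_into(self._owner, used, value)
        self._len += 1

    def __getitem__(self, index):
        if not 0 <= index < self._len:
            raise IndexError(index)
        return self._item.unpack_from(self._owner, index * self._item.size)[0]

    def export(self):
        # The view references the current owner and keeps it alive.
        return memoryview(self._owner)[: self._len * self._item.size]


arr = GrowableArray(capacity=2)
arr.append(1.0)
arr.append(2.0)
old_view = arr.export()   # exported while the capacity is full
arr.append(3.0)           # grows by swapping owners, no BufferError
```

The trade-off in this sketch is that `old_view` keeps pointing at the previous owner, so it no longer reflects later writes to the array; the old buffer is freed once the last view on it goes away.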

Well, for now I'm not sure I will have time to make a more complete version of it. At least some of the most important features I proposed are here …
But if anyone has the time and interest to spend on it, I would be very happy to hear your thoughts on the way it is built :slight_smile:

You may be interested in the Python array API standard (2021.01-DRAFT documentation).