2D NumPy array of objects vs. 2D Python list efficiency

pauljurczak · March 9, 2023, 10:15pm

I have a 2D list a of varying shape NumPy arrays and equivalent NumPy array of objects b = np.array(a, dtype='object'). It is inconvenient to slice a, e.g. a[:][1] is not equivalent to b[:, 1], can’t be expressed by slicing and requires a list comprehension [e[1] for e in a[:]] instead.

Are there any inefficiencies of using b instead of a, e.g. copying on construction of b? I can’t find references to memory layout and representation details in NumPy documentation, due to “object” being too generic for search engines.

franklinvp · March 10, 2023, 4:54pm

You can read about the internal representation of np.array here.

Since you are concerned about efficiency and certain slices, note particularly Point 8 about telling it whether to store your values in row or column major order, the parameter order of the constructor

kknechtel · March 10, 2023, 11:18pm

Conceptually, a Numpy array with dtype=object will store pointers (of the usual pointer size on the current platform, so normally 64 bits nowadays) that point at the actual underlying Python object representations (i.e., the same sort of pointer as PyObject * when writing C code, using the API for working with Python code within C). The same holds if the pointed-at objects are themselves Numpy arrays - since those are still Python objects. A list of lists (not really a “2d list”, but emulating that) will do essentially the same thing, but in multiple levels. A Python list’s underlying memory will store pointers to other Python objects, regardless of the object type, list size or anything else. So the list of lists stores pointers to lists, which store pointers to the “varying shape NumPy arrays”.

Essentially, a Numpy array of objects works similarly to a native Python list, except that

it keeps track of multiple array dimensions (the array shape), and uses this information to implement more sophisticated indexing;
it will always represent a rectangular shape (when emulating “2d” with a list of lists, each contained list has its own memory tracking overhead, and can vary in length independently - this is the price for more flexibility);
it will have a different strategy for reallocating memory when it grows (Numpy arrays are not really meant to be concatenated sequentially).

Aside from that, it’s hard to say anything useful without seeing an actual use case and a specific performance concern.

pauljurczak · March 11, 2023, 12:18am

Thank you for detailed description. So far, my use case is not performance critical. I was mainly looking for more convenient notation and making sure there are no obvious gotchas. My original “2D” list of lists is rectangular and doesn’t require any changes after creation, so it looks like a great fit for Numpy array representation.

kknechtel · March 11, 2023, 1:16am

The next question, of course, is: why don’t the contained Numpy arrays have matching shapes (dimensions)? That would potentially allow for just using a single, higher-dimensional Numpy array, which is much more in-line with how Numpy is intended to work (and how it’s able to deliver performance improvements).

pauljurczak · March 11, 2023, 1:42am

I wish they had, but it’s a non-negotiable nature of a problem at hand.