numpy array access

Hello everybody,
I have

import numpy as np

N = 1000
A = np.zeros(N, dtype=float)
B = np.zeros(N, dtype=float)
C = np.zeros(N, dtype=float)

Fill B, C with some data.
What is the difference between

A = B+C

and

A[:] = B[:]+C[:]

Might there be performance or space allocation issues for large N?

A = B+C makes a new array holding the sums and binds the name A to it. The array previously named A is no longer reachable through that name; if it has no other references, it will be garbage-collected.

Slicing B and C in NumPy creates view objects, not separate arrays. This differs from built-in lists, where slicing creates a new list. A view conceptually represents a sub-section of the original array without copying any data; when you index (or re-slice) it, the corresponding position in the original array is computed from an offset and strides. So there is no copying, but operations on views can be slightly slower.
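
A quick way to see the view/copy difference (a minimal sketch; the variable names are just for illustration):

import numpy as np

B = np.arange(10)
v = B[2:5]          # a view: no data is copied
print(v.base is B)  # True: v borrows B's buffer
v[0] = 99
print(B[2])         # 99: writing through the view changes B

lst = list(range(10))
s = lst[2:5]        # a new list: the slice is copied out
s[0] = 99
print(lst[2])       # 2: the original list is unchanged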

The slice assignment to A means that the data in the existing A array is overwritten with the result of the calculation. No new array or view is created for A, but the values have to be copied across.
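
The two forms can be told apart with id() (a small sketch; the values themselves don't matter):

import numpy as np

B, C = np.ones(5), np.ones(5)

A = np.zeros(5)
before = id(A)
A = B + C               # rebinds A to a freshly allocated result array
print(id(A) == before)  # False

A = np.zeros(5)
before = id(A)
A[:] = B + C            # writes the result into A's existing buffer
print(id(A) == before)  # True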

The best way to understand the performance characteristics (both space and time) is to test it.

That said: for the problem as described, the code that makes the most sense is A = B + C, but without first creating an empty A (as it is useless here).
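
That is, a minimal sketch (with placeholder data):

import numpy as np

N = 1000
B = np.linspace(0.0, 1.0, N)  # placeholder data
C = np.linspace(1.0, 2.0, N)  # placeholder data
A = B + C                     # A is allocated right here; no zeros() needed first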

4 Likes

Thanks, I learned something new.
Another aspect: A = B+C is part of an iteration loop and might be repeated several times, let's say m times.
Do I understand correctly that A = B+C will produce garbage for A m times, i.e. the allocated memory might blow up for really large N and m?
OK, maybe one could use Fortran for that, but Fortran is unhandy.

Predicting performance is hard. Always test your conjectures; it appears you guessed wrong here. The code below is for a Jupyter notebook:

import numpy as np
rng = np.random.default_rng(123)

n = 16_000_000
A = rng.random(n)
B = rng.random(n)
C = np.empty_like(A) 

%timeit -n 10 C = A * B
%timeit -n 10 C[:] = A * B

It prints on my computer (the numbers will vary slightly from run to run):

57.7 ms ± 566 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
79.4 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For more complicated array expressions you may want to check the NumExpr 2.0 User Guide.
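
A minimal sketch, assuming numexpr is installed (ne.evaluate compiles the whole expression and can write into a pre-allocated output):

import numpy as np
import numexpr as ne

A = np.random.rand(1_000_000)
B = np.random.rand(1_000_000)
C = np.empty_like(A)

ne.evaluate("2*A + 3*B", out=C)  # one pass over the data, no NumPy temporaries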

1 Like

For both C = A * B and C[:] = A * B, the product of A and B is calculated and a new array is constructed. A faster approach is to use NumPy's in-place operations (assuming the array C can be pre-allocated). Benchmark:

import numpy as np
rng = np.random.default_rng(123)

n = 16_000_000
A = rng.random(n)
B = rng.random(n)
C = np.empty_like(A) 

%timeit C = A * B
%timeit C[:] = A * B 
%timeit np.multiply(A, B, C)

results in:

35.7 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
47.7 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
27.7 ms ± 92.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If overwriting the contents of A is possible, then A *= B is also a fast option (about 21 ms).
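
That variant would be benchmarked the same way (the 21 ms figure is from the post above; your numbers will differ):

%timeit A *= B  # in-place multiply: equivalent to np.multiply(A, B, out=A)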

5 Likes

You are right, of course. A * B is equivalent to np.multiply(A, B), where the output array is freshly allocated. Thus C[:] = A * B does not avoid the allocation, as @uhcdd suspected, and is in fact slower because the computed values have to be copied into C. To avoid creating the temporary, the form np.multiply(A, B, C) should be used.
By the way, the original post had +, not *, but the same argument applies.
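
For the addition case, a minimal sketch of the no-temporary form (out= is the keyword spelling of the third positional argument):

import numpy as np

B = np.random.rand(1_000_000)
C = np.random.rand(1_000_000)
A = np.empty_like(B)

np.add(B, C, out=A)  # computes B + C directly into A's buffer, no temporary array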

Hi all,

I happened to need to write a program that receives data chunks and assembles them into an array of some length. Along the way, I tested how much faster NumPy slicing is than list slicing.

The test functions are as follows:

import numpy as np

N = 256  # one chunk: N 4-byte integers = 1 kiB
M = 100  # number of chunks => total size M * 4N bytes = M kiB

## data chunks
tpl = tuple(range(N))
lst = list(range(N))
arr = np.arange(N)

def test_list1():
    # grow a list chunk by chunk via += (equivalent to list.extend)
    data = []
    for i in range(M):
        data += lst

def test_list2():
    # pre-allocate a list, then fill it chunk by chunk via slice assignment
    _lst = [0] * N * M
    j = 0
    for i in range(M):
        _lst[j:j+N] = lst
        j += N

def test_array():
    # pre-allocate an ndarray, then fill it chunk by chunk via slice assignment
    _arr = np.zeros(N * M, dtype=int)
    j = 0
    for i in range(M):
        _arr[j:j+N] = arr
        j += N

The benchmark:

from timeit import timeit
from matplotlib import pyplot as plt

X = np.arange(100, 2_001, 100)  # chunk counts M to test
n = 1   # timeit number per measurement
k = 40  # trials per M; the median over trials is plotted

def _fcalc(j):
    global M  # the test functions above read M as a global
    data = []
    for x in X:
        M = int(x)
        t1 = timeit(test_list1, number=n) / n
        t2 = timeit(test_list2, number=n) / n
        t3 = timeit(test_array, number=n) / n
        data.append([t1, t2, t3])
        print(f"{j}/{k}, {M=:,d}", end='\r')
    print()
    return data

plt.rcParams["axes.prop_cycle"] = plt.cycler("color", ["blue", "orange", "green",])

yy = []
for j in range(k):  # run k trials
    y = _fcalc(j)
    yy.append(y)
    plt.plot(X, y, '.', alpha=0.1)
yy = np.array(yy)
Y = np.median(yy, axis=0)

plt.plot(X, Y, label=["list1", "list2", "array"])
plt.xlabel('M [kiB]')
plt.ylabel('Time [s]')
plt.legend()
plt.grid(1)
plt.show()

Results:

  1. “list1” (blue line) shows the speed of test_list1(), which builds an M kiB array via += (a list.extend-style call).
  2. “list2” (orange) is test_list2(), which uses list slice assignment.
  3. “array” (green) is test_array(), which uses ndarray slice assignment.

NumPy slicing is normally about twice as fast as list slicing, since it copies raw values into a contiguous buffer rather than Python object references.
Interestingly, below a certain size (in this case about 400 kiB), the list.extend approach is even faster than list slicing and almost the same as ndarray slicing. It's probably an L2 cache effect, but I'm not a computer expert, so I don't know.
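
For what it's worth, a small sketch of the underlying difference (my reading of it, not a profiled explanation): ndarray slice assignment copies raw values into the target buffer, while list slice assignment copies object references:

import numpy as np

src = np.arange(256)
dst = np.zeros(256, dtype=int)
dst[0:256] = src                   # copies the raw integer values into dst's buffer
print(np.shares_memory(dst, src))  # False: dst owns its own copy of the data

lst = list(range(256))
buf = [0] * 256
buf[0:256] = lst                   # copies references to the same int objects
print(buf[0] is lst[0])            # True: both lists point at the same object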

Tested with the CPU:

Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz (core x 2)
L1 cache: 128 KB
L2 cache: 512 KB
1 Like