numpy array access

Hello everybody,
I have

import numpy as np

N = 1000
A = np.zeros(N, dtype=float)
B = np.zeros(N, dtype=float)
C = np.zeros(N, dtype=float)

Fill B, C with some data.
What is the difference between

A = B+C

and

A[:] = B[:]+C[:]

Might there be performance or space allocation issues for large N?

A = B+C makes a new array holding the sums and binds the name A to it. The array previously named A is no longer reachable through that name; if it has no other references, it will be garbage-collected.

Slicing B and C in NumPy creates view objects, not separate arrays. This differs from built-in lists, where slicing creates a new list. A view conceptually represents a sub-section of the original array without copying any data; when you index (or re-slice) it, the corresponding position in the original array is computed from an offset and strides. So there is no copying, but operations on views can be slightly slower.
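
A quick way to see the view/copy difference (a minimal sketch; the variable names are just for illustration):

import numpy as np

B = np.arange(10)
v = B[2:5]          # a view: no data is copied
print(v.base is B)  # True: v borrows B's buffer
v[0] = 99
print(B[2])         # 99: writing through the view changes B

lst = list(range(10))
s = lst[2:5]        # a new list: the slice is copied out
s[0] = 99
print(lst[2])       # 2: the original list is unchanged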

The slice assignment to A means that the data in the existing A array is overwritten with the result of the calculation. No new array or view is created for A, but the values have to be copied across.
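
The two forms can be told apart with id() (a small sketch; the values themselves don't matter):

import numpy as np

B, C = np.ones(5), np.ones(5)

A = np.zeros(5)
before = id(A)
A = B + C               # rebinds A to a freshly allocated result array
print(id(A) == before)  # False

A = np.zeros(5)
before = id(A)
A[:] = B + C            # writes the result into A's existing buffer
print(id(A) == before)  # True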

The best way to understand the performance characteristics (both space and time) is to test it.

That said: for the problem as described, the code that makes the most sense is A = B + C, but without first creating an empty A (as it is useless here).
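
That is, a minimal sketch (with placeholder data):

import numpy as np

N = 1000
B = np.linspace(0.0, 1.0, N)  # placeholder data
C = np.linspace(1.0, 2.0, N)  # placeholder data
A = B + C                     # A is allocated right here; no zeros() needed first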

4 Likes

Thanks, I learned something new.
Another aspect: A = B+C is part of an iteration loop and might be repeated several times, let's say m times.
Do I understand correctly that A = B+C will produce garbage for A m times, i.e. the allocated memory might blow up for really large N and m?
OK, maybe one could use Fortran for that, but Fortran is unhandy.

Predicting performance is hard. Always test your conjectures; it appears you guessed wrong here. The code below is for a Jupyter notebook:

import numpy as np
rng = np.random.default_rng(123)

n = 16_000_000
A = rng.random(n)
B = rng.random(n)
C = np.empty_like(A) 

%timeit -n 10 C = A * B
%timeit -n 10 C[:] = A * B

It prints on my computer (the numbers will vary slightly from run to run):

57.7 ms ± 566 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
79.4 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For more complicated array expressions you may want to check the NumExpr 2.0 User Guide.
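
A minimal sketch, assuming numexpr is installed (ne.evaluate compiles the whole expression and can write into a pre-allocated output):

import numpy as np
import numexpr as ne

A = np.random.rand(1_000_000)
B = np.random.rand(1_000_000)
C = np.empty_like(A)

ne.evaluate("2*A + 3*B", out=C)  # one pass over the data, no NumPy temporaries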

1 Like

For both C = A * B and C[:] = A * B, the product of A and B is calculated and a new array is constructed. A faster approach is to use NumPy's in-place operations (assuming the array C can be pre-allocated). Benchmark:

import numpy as np
rng = np.random.default_rng(123)

n = 16_000_000
A = rng.random(n)
B = rng.random(n)
C = np.empty_like(A) 

%timeit C = A * B
%timeit C[:] = A * B 
%timeit np.multiply(A, B, C)

results in:

35.7 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
47.7 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
27.7 ms ± 92.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If overwriting the contents of A is possible, then A *= B is also a fast option (about 21 ms).
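
That variant would be benchmarked the same way (the 21 ms figure is from the post above; your numbers will differ):

%timeit A *= B  # in-place multiply: equivalent to np.multiply(A, B, out=A)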

5 Likes

You are right, of course. A * B is equivalent to np.multiply(A, B), where the output array is freshly allocated. Thus C[:] = A * B does not avoid the allocation, as @uhcdd suspected, and is in fact slower because the computed values have to be copied into C. To avoid creating the temporary, the form np.multiply(A, B, C) should be used.
By the way, the original post had +, not *, but the same argument applies.
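
For the addition case, a minimal sketch of the no-temporary form (out= is the keyword spelling of the third positional argument):

import numpy as np

B = np.random.rand(1_000_000)
C = np.random.rand(1_000_000)
A = np.empty_like(B)

np.add(B, C, out=A)  # computes B + C directly into A's buffer, no temporary array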

Hi all,

I happened to need to write a program that receives data chunks and assembles them into an array of some length. Along the way, I tested how much faster NumPy slicing is than list slicing.

The test functions are as follows:

import numpy as np

N = 256  # one chunk: N 4-byte integers = 1 kiB
M = 100  # number of chunks => total size M * 4N bytes = M kiB

## data chunks
tpl = tuple(range(N))
lst = list(range(N))
arr = np.arange(N)

def test_list1():
    # grow a list chunk by chunk via += (equivalent to list.extend)
    data = []
    for i in range(M):
        data += lst

def test_list2():
    # pre-allocate a list, then fill it chunk by chunk via slice assignment
    _lst = [0] * N * M
    j = 0
    for i in range(M):
        _lst[j:j+N] = lst
        j += N

def test_array():
    # pre-allocate an ndarray, then fill it chunk by chunk via slice assignment
    _arr = np.zeros(N * M, dtype=int)
    j = 0
    for i in range(M):
        _arr[j:j+N] = arr
        j += N

The benchmark:

from timeit import timeit
from matplotlib import pyplot as plt

X = np.arange(100, 2_001, 100)  # chunk counts M to test
n = 1   # timeit number per measurement
k = 40  # trials per M; the median over trials is plotted

def _fcalc(j):
    global M  # the test functions above read M as a global
    data = []
    for x in X:
        M = int(x)
        t1 = timeit(test_list1, number=n) / n
        t2 = timeit(test_list2, number=n) / n
        t3 = timeit(test_array, number=n) / n
        data.append([t1, t2, t3])
        print(f"{j}/{k}, {M=:,d}", end='\r')
    print()
    return data

plt.rcParams["axes.prop_cycle"] = plt.cycler("color", ["blue", "orange", "green",])

yy = []
for j in range(k):  # run k trials
    y = _fcalc(j)
    yy.append(y)
    plt.plot(X, y, '.', alpha=0.1)
yy = np.array(yy)
Y = np.median(yy, axis=0)

plt.plot(X, Y, label=["list1", "list2", "array"])
plt.xlabel('M [kiB]')
plt.ylabel('Time [s]')
plt.legend()
plt.grid(1)
plt.show()

Results:

  1. “list1” (blue line) shows the speed of test_list1(), which builds an M kiB array via += (a list.extend-style call).
  2. “list2” (orange) is test_list2(), which uses list slice assignment.
  3. “array” (green) is test_array(), which uses ndarray slice assignment.

NumPy slicing is normally about twice as fast as list slicing, since it copies raw values into a contiguous buffer rather than Python object references.
Interestingly, below a certain size (in this case about 400 kiB), the list.extend approach is even faster than list slicing and almost the same as ndarray slicing. It's probably an L2 cache effect, but I'm not a computer expert, so I don't know.
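
For what it's worth, a small sketch of the underlying difference (my reading of it, not a profiled explanation): ndarray slice assignment copies raw values into the target buffer, while list slice assignment copies object references:

import numpy as np

src = np.arange(256)
dst = np.zeros(256, dtype=int)
dst[0:256] = src                   # copies the raw integer values into dst's buffer
print(np.shares_memory(dst, src))  # False: dst owns its own copy of the data

lst = list(range(256))
buf = [0] * 256
buf[0:256] = lst                   # copies references to the same int objects
print(buf[0] is lst[0])            # True: both lists point at the same object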

Tested with the CPU:

Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz (core x 2)
L1 cache: 128 KB
L2 cache: 512 KB
1 Like