Vectorizing a function

I have a numpy ndarray:

import numpy as np

tst = np.arange(5,35,5)

tst is: array([ 5, 10, 15, 20, 25, 30])

I would like to create a new ndarray that is just like ‘tst’, except that each element is flanked by a certain number of consecutive integers. For example, if I wanted a flank of 1, then the new array would be:

array([ 4,5,6, 9,10,11, 14,15,16, 19,20,21, 24,25,26, 29,30,31])

I can create a function that will do this for a single number:

def flank(mdpt, lngth):
    # consecutive integers from mdpt - lngth to mdpt + lngth, inclusive
    lw = mdpt - lngth
    hgh = mdpt + lngth
    return np.arange(lw, hgh + 1)
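
For example:

flank(5, 1)
# array([4, 5, 6])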

But I can’t figure out how to vectorize this, so I can give it a vector (like ‘tst’) and an integer for the flank length (1, in the case above), and it will return the appropriate array. I have tried ‘np.frompyfunc’ and ‘np.vectorize’. From what I have read, ‘np.vectorize’ may not be a good choice, since I want this to work efficiently with very large vectors.

I would appreciate any suggestions.

Thank you.

Eric

I figured out a solution, though I don’t know whether it’s any more efficient than simply using a loop:

np.concatenate([flank(x,1) for x in tst])

Eric

Alright Eric, I haven’t tested whether this is faster, but it does the job without a for loop, using only numpy. I’m guessing it’s pretty fast, assuming you have sufficient memory for the preallocation and tiling. There may be a better function than numpy.matlib.repmat; I used it because most of my work at this level (matrices/arrays) was done in Matlab, so it’s what I know. I usually live in pandas land, using DataFrames.

The as_strided function might not be necessary either; there is a big warning about it in the numpy docs. I used it because sliding windows came to mind when I first read your problem. Be careful with it: the strides argument is measured in bytes, so if you use a smaller int type you have to change it. For int64 it defaults to (8,), meaning steps of 8 bytes.
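
To make the bytes point concrete, here is a minimal sketch with an int32 buffer, where every stride entry has to be a multiple of 4 instead of 8:

import numpy as np
from numpy.lib.stride_tricks import as_strided

buf = np.zeros(6, dtype=np.int32)   # 4-byte elements
# strides are in bytes: one column over is 4 bytes, one row down is 3*4 = 12
view = as_strided(buf, shape=(2, 3), strides=(12, 4))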

Also, np.vectorize is a misnomer. It depends how you define “vectorize”: operations that map onto SIMD instructions in the CPU, or a single call doing multiple things (a lot of people consider the latter vectorizing). Either way, np.vectorize is pretty much just a wrapper around a for loop, if I remember my reading correctly. It’s supposedly not fast at all, similar to using pandas.DataFrame.apply in certain situations.
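
For what it’s worth, here is a minimal sketch of coaxing np.vectorize into this job with a gufunc-style signature; it runs, but it still calls the Python function once per element:

import numpy as np

def flank(mdpt, lngth):
    return np.arange(mdpt - lngth, mdpt + lngth + 1)

# the signature tells vectorize that each pair of scalars maps to a 1-D array
vflank = np.vectorize(flank, signature='(),()->(n)')

tst = np.arange(5, 35, 5)
vflank(tst, 1).ravel()
# array([ 4,  5,  6,  9, 10, 11, 14, 15, 16, 19, 20, 21, 24, 25, 26, 29, 30, 31])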

EDIT: You could certainly do this with just tiling and reshaping, no weird view tricks; see flank2 below. I don’t know which is faster, and I don’t feel like profiling it, sorry. It would probably be a good exercise for you, though; it’s really the only way to know. You should compare your approach with mine and any others.

import numpy as np
from numpy.lib.stride_tricks import as_strided
from numpy.matlib import repmat

def flank(arr, lngth):
    window_size = 1 + 2*lngth            # midpoint plus lngth on each side
    num_points = len(arr) * window_size
    # preallocate the flat output buffer
    prealloc = np.zeros((num_points,), dtype=np.int64)
    # tile each midpoint across a full row of width window_size
    tiled = repmat(arr.reshape(-1, 1), 1, window_size)
    # 2-D windowed view onto the flat buffer
    view = as_strided(prealloc, shape=(len(arr), window_size), writeable=True)
    view[:, :] = tiled + np.arange(-lngth, lngth + 1)
    return prealloc

tst = np.arange(5, 35, 5)
print(flank(tst, 1))

def flank2(arr, lngth):
    window_size = 1 + 2*lngth
    # tile each midpoint across a row, add the offsets, flatten row by row
    tiled = repmat(arr.reshape(-1, 1), 1, window_size)
    return (tiled + np.arange(-lngth, lngth + 1)).reshape(-1)
print(flank2(tst, 1))
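
Both prints produce the same flanked array:

[ 4  5  6  9 10 11 14 15 16 19 20 21 24 25 26 29 30 31]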

Hi John,

Thanks so much! This is both helpful and interesting. It looks like your second answer is the winner - almost 3x as fast as either your first answer or my answer (whose function name I changed to “flank3”):

In [10]: %timeit answer1 = flank(tst, 1)
23.4 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit answer2 = flank2(tst, 1)
9 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit answer3 = np.concatenate([flank3(x,2) for x in tst])
23.5 µs ± 57 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I really appreciate the help.

Best wishes,

Eric

Very interesting. Thinking about it more, I’m not surprised my first implementation is slower. It’s probably the overhead of preallocating the zeros and creating the windowed view; it has to be, since flank does everything flank2 does and more. Thanks for profiling it.

If you REALLY wanted to make this fast, you could probably tweak flank2 slightly and compile it with numba.njit (no-Python just-in-time compilation), which would make it blazingly fast. Just note that numba does not support all of numpy; if anything in flank2 is unsupported, my guess would be repmat.

This SO post describes what looks like a good alternative to repmat.
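
For example, np.tile does the same job as repmat without the numpy.matlib import (a sketch; I haven’t checked whether numba supports np.tile either):

import numpy as np

col = np.arange(5, 35, 5).reshape(-1, 1)   # column of midpoints
np.tile(col, (1, 3))                       # equivalent to repmat(col, 1, 3)
# array([[ 5,  5,  5],
#        [10, 10, 10],
#        [15, 15, 15],
#        [20, 20, 20],
#        [25, 25, 25],
#        [30, 30, 30]])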

One more variation which uses basic numpy broadcasting:

def flank4(arr, length):
    # row of offsets [-length, ..., length] against a column of midpoints;
    # broadcasting expands the sum to shape (len(arr), 2*length + 1)
    deltas = np.arange(-length, length + 1).reshape(1, -1)
    arr = arr.reshape(-1, 1)
    return (arr + deltas).flatten()
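
With the test array from above it returns the expected result:

tst = np.arange(5, 35, 5)
flank4(tst, 1)
# array([ 4,  5,  6,  9, 10, 11, 14, 15, 16, 19, 20, 21, 24, 25, 26, 29, 30, 31])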

On my laptop this gives the following results:

In [1]: %timeit answer1 = flank(tst, 1)
21.4 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [2]: %timeit answer1 = flank2(tst, 1)
7.73 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [3]: %timeit answer1 = flank4(tst, 1)
4.57 µs ± 19.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Oh, this broadcasting trick is a good one. It saves the repmat call and its allocation, which saves us more time.

When I went from Matlab to Python, I got to sidestep numpy almost completely because pandas fit my needs so well. As a result I never really got to finesse my numpy skills, but I used to do tricks like this in Matlab all the time; there it’s called binary singleton expansion. Thanks for sharing!

This is also completely numba-compatible, I believe!

EDIT:

It takes two %timeit calls because of the JIT compilation, but see the results: the second call cuts the time in half.

In [1]: import numpy as np

In [2]: from numba import njit

In [3]: @njit
   ...: def flank4(arr, length):
   ...:     deltas = np.arange(-length, length + 1).reshape(1, -1)
   ...:     arr = arr.reshape(-1, 1)
   ...:     return (arr + deltas).flatten()
   ...:

In [4]: tst = np.arange(5,35,5)

In [5]: %timeit answer1 = flank4(tst, 1)
The slowest run took 10.36 times longer than the fastest. This could mean that an intermediate result is being cached.
20 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit answer1 = flank4(tst, 1)
2.95 µs ± 99 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Good point about numba!

I was curious whether the numba-compiled function could be made even faster by reducing the allocations further.

I changed the code so that a single empty array is allocated once, at the beginning:

import numba
import numpy as np

@numba.njit
def flank5(arr, length):
    n_deltas = 2 * length + 1
    # one allocation up front for the whole flat result
    result = np.empty(n_deltas * arr.shape[0], dtype=np.int64)

    # write each window of consecutive integers directly into place
    for i in range(arr.shape[0]):
        for j in range(n_deltas):
            result[i * n_deltas + j] = arr[i] - length + j
    return result
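
As with the njitted flank4, the first call pays the compilation cost, so a warm-up call is worth doing before timing:

tst = np.arange(5, 35, 5)
flank5(tst, 1)   # first call triggers compilation
flank5(tst, 1)   # subsequent calls run the compiled code
# array([ 4,  5,  6,  9, 10, 11, 14, 15, 16, 19, 20, 21, 24, 25, 26, 29, 30, 31])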

Unfortunately the code becomes (in my opinion) a bit less readable this way.

I get the following benchmarks (where flank4 is your njitted function):

In [1]: %%timeit flank4(tst, 1)
   ...: answer4 = flank4(tst, 1)
   ...: 
   ...: 
1.5 µs ± 7.82 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [2]: %%timeit flank5(tst, 1)
   ...: answer5 = flank5(tst, 1)
   ...: 
   ...: 
1.1 µs ± 9.01 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Yeah, I think it’s pretty typical: when we try to get as fast as possible, we lose readability and flexibility.

It’s always a trade-off we have to account for.
