`builtins.map` performance improvement for one iterable

I have tested map1, which behaves the same as map but has a more efficient implementation for the case of a single iterable. I also implemented mapg = Custom(map1, map), which selects which map function to use based on the number of arguments.

import functools as ftl

import nda  # custom array module with a numpy-like API
from hello import ilen
from map1 import map1
from custom import Custom

mapg = Custom(map1, map)
def test_map1(a):
    ilen(map1(bool, a))

def test_mapg(a):
    ilen(mapg(bool, a))

def test_map(a):
    ilen(map(bool, a))

ns = nda.ary([1, 5, 10, 50, 100, 500, 1000, 5_000, 10_000, 100_000])
rngs = [range(i) for i in ns]
lambdas = [
    [ftl.partial(test_map1, a),
     ftl.partial(test_mapg, a),
     ftl.partial(test_map, a)] for a in rngs
]

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃       5 repeats, 10,000 times           ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Units: ns      map1      mapg       map ┃
┃           ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃         1 ┃     241       250       247 ┃
┃         5 ┃     290       292       301 ┃
┃        10 ┃     355       368       377 ┃
┃        50 ┃     887       898       934 ┃
┃       100 ┃    1518      1545      1666 ┃
┃       500 ┃    9680      9619     10319 ┃
┃      1000 ┃   21492     21935     23451 ┃
┃      5000 ┃  117899    118968    126211 ┃
┃     10000 ┃  243840    237693    255482 ┃
┃    100000 ┃ 2427835   2439256   2490276 ┃
┗━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Gain over plain map, as a percentage and as ns saved per item:

       | map1 % mapg % map1 ns/iter mapg ns/iter
-------+----------------------------------------
     1 |    2.4   -1.3          5.9         -3.1
     5 |    3.5    2.8          2.1          1.7
    10 |    5.9    2.5          2.2          1.0
    50 |    5.1    3.9          1.0          0.7
   100 |    8.9    7.2          1.5          1.2
   500 |    6.2    6.8          1.3          1.4
  1000 |    8.4    6.5          2.0          1.5
  5000 |    6.6    5.7          1.7          1.4
 10000 |    4.6    7.0          1.2          1.8
100000 |    2.5    2.0          0.6          0.5

The biggest performance gain is close to 10%, for iterables of size around 100.

mapg = Custom(map1, map) does the selection at a cost of about 10 ns, so the same dispatch could be implemented efficiently enough to benefit existing code as well.

Although the performance benefit isn’t very big, it is non-trivial for sizes of 10-100, which is approximately the range I was trying to address in Builtins.any performance - #29 by dgrigonis, where I was looking for ways to make functional tools competitive with a simple loop for short iterables.

This is one of two improvements that I came up with while looking at this, and it is the simpler one.

The map function is used extensively, so the performance benefit potentially amounts to something a bit more significant. Also, most use cases pass a single iterable.
CPython:

zg '\Wmap\([^,]*,[^,]*\)' | wc -l
     452
zg '\Wmap\([^,]*,[^,]*,[^,]\)' | wc -l
      15
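The one-comma vs. two-comma distinction those patterns rely on can be illustrated with Python's re module (my quick demonstration; these are rough heuristics, not a parser):

```python
import re

# map(func, iterable) -- exactly one comma inside the parentheses
one_iter = re.compile(r'\Wmap\([^,]*,[^,]*\)')
# map(func, it1, it2) -- two commas
two_iter = re.compile(r'\Wmap\([^,]*,[^,]*,[^,]*\)')

samples = [
    " map(bool, xs)",
    " map(add, xs, ys)",
    " list(map(str, range(3)))",
]
for s in samples:
    kind = ("2 iterables" if two_iter.search(s)
            else "1 iterable" if one_iter.search(s)
            else "no match")
    print(s.strip(), "->", kind)
```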

Github:

565k files \Wmap\([\w\s]*,[\w\s]*\)
14k files  \Wmap\([\w\s]*,[\w\s]*,[\w\s]*\)

Does this sound like something that would be worthwhile?


Also, the same approach could be applied to the filter function, which currently only accepts a single iterable and could be efficiently extended to many.
Github:

319k \Wfilter\(\w*lambda

and many of those cases could make use of either more arguments or a more flexible functools.partial
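For instance (my illustration, not from the thread), a common `filter(lambda ...)` pattern and a `partial`-based equivalent:

```python
from functools import partial
import operator

xs = [-2, -1, 0, 1, 2, 3]

# Typical lambda form:
with_lambda = list(filter(lambda x: x > 0, xs))

# partial(operator.lt, 0)(x) evaluates operator.lt(0, x), i.e. 0 < x,
# so this filters for positive values without a Python-level lambda:
with_partial = list(filter(partial(operator.lt, 0), xs))

assert with_lambda == with_partial == [1, 2, 3]
```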

How are hello, map1, custom and nda defined?

For hello.ilen implementations see:

For map1.map1:

It is the same as map, but implemented for one iterable with the slightly more efficient PyObject_CallOneArg. Its __next__ is simply:

static PyObject *
map1_next(map1object *lz)
{
    /* Advance the single underlying iterator directly. */
    PyObject *it = lz->it;
    PyObject *item = Py_TYPE(it)->tp_iternext(it);
    if (item == NULL) {
        /* Exhausted, or an error is set; either way propagate NULL. */
        return NULL;
    }
    PyObject *result = PyObject_CallOneArg(lz->func, item);
    Py_DECREF(item);
    return result;
}
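Semantically, map1 is just map restricted to a single iterable. A pure-Python model (mine, for illustration) makes the intended equivalence easy to check:

```python
def map1_py(func, iterable):
    # Pure-Python model of map1: apply func lazily to one iterable.
    for item in iterable:
        yield func(item)

assert list(map1_py(bool, range(5))) == list(map(bool, range(5)))
assert list(map1_py(str.upper, ["a", "b"])) == ["A", "B"]
```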

custom.Custom is a simple class with __call__:

static PyObject *
Custom_callvector(CustomObject *self, PyObject *const *args,
                  size_t nargsf, PyObject *kwnames)
{
    /* Dispatch on the number of positional arguments:
       exactly two (func + one iterable) -> first, otherwise -> last. */
    if (PyVectorcall_NARGS(nargsf) == 2) {
        return PyObject_VectorcallDict(self->first, args, nargsf, NULL);
    }
    else {
        return PyObject_VectorcallDict(self->last, args, nargsf, NULL);
    }
}
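In pure Python, the dispatch that Custom performs would look roughly like this (a sketch of the semantics only; the real thing is a C vectorcall, which is why its overhead is only about 10 ns):

```python
class CustomPy:
    """Call `first` when given exactly two positional args, else `last`."""

    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __call__(self, *args, **kwargs):
        if len(args) == 2:      # e.g. map(func, single_iterable)
            return self.first(*args, **kwargs)
        return self.last(*args, **kwargs)

# With the C extension this would be Custom(map1, map); here both
# branches fall back to the builtin map just to show the dispatch.
mapg = CustomPy(map, map)
assert list(mapg(bool, range(3))) == [False, True, True]
assert list(mapg(max, [1, 4], [3, 2])) == [3, 4]
```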

I just picked the template for it from "2. Defining Extension Types: Tutorial" in the Python 3.12.3 documentation.

nda is just an array object with the same API as numpy; nda.ary == numpy.asarray.

Sounds reasonable to optimize map for the overwhelmingly most common use case of a single iterable.

How about a fast path for a single iterable? Would that have less overhead?

Could you be a bit more specific? Constant/function names/signatures would help me understand what exactly you are referring to. Best would be code with an indication of where to put it. :slight_smile: I don’t know a lot, so it is very likely that I don’t even know what you are referring to.

We could avoid the stack creation and vector call here in the case of a single iterator, I think:

const Py_ssize_t niters = PyTuple_GET_SIZE(lz->iters);
if (niters == 1) {
    PyObject *it = PyTuple_GET_ITEM(lz->iters, 0);
    PyObject *val = Py_TYPE(it)->tp_iternext(it);
    if (val != NULL) {
        result = PyObject_CallOneArg(lz->func, val);
        Py_DECREF(val);  /* release the item once func has consumed it */
    }
    goto exit;
}
// Slow path

If this improves performance, I think this has a higher chance of getting accepted.

This was the first thing that I tried, and it didn’t seem to work well. However, since you mentioned it again: I was using %timeit magics and list instead of ilen, and these were distorting the results. I will double-check it.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃            5 repeats, 10,000 times                ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Units: ns      map1      mapg      map2       map ┃
┃           ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃         1 ┃     232       239       252       240 ┃
┃         5 ┃     293       308       325       297 ┃
┃        10 ┃     357       370       393       379 ┃
┃        50 ┃     865       870       998       954 ┃
┃       100 ┃    1542      1532      1825      1798 ┃
┃       500 ┃    9465      9540     10172     10218 ┃
┃      1000 ┃   21588     21233     23345     22270 ┃
┃      5000 ┃  115924    118096    122446    121526 ┃
┃     10000 ┃  237003    247886    258268    246955 ┃
┃    100000 ┃ 2399945   2371219   2531544   2433305 ┃
┗━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Gain over plain map, as a percentage and as ns saved per item:

       | map1 % mapg % map2 % map1 ns/iter mapg ns/iter map2 ns/iter
-------+------------------------------------------------------------
     1 |    3.4    0.5   -4.9          8.3          1.2        -11.8
     5 |    1.2   -4.0   -9.6          0.7         -2.4         -5.7
    10 |    5.7    2.3   -3.8          2.2          0.9         -1.5
    50 |    9.4    8.8   -4.6          1.8          1.7         -0.9
   100 |   14.2   14.8   -1.5          2.6          2.7         -0.3
   500 |    7.4    6.6    0.4          1.5          1.4          0.1
  1000 |    3.1    4.7   -4.8          0.7          1.0         -1.1
  5000 |    4.6    2.8   -0.8          1.1          0.7         -0.2
 10000 |    4.0   -0.4   -4.6          1.0         -0.1         -1.1
100000 |    1.4    2.6   -4.0          0.3          0.6         -1.0

It doesn’t work well. It ends up even slower than normal map. I don’t know why; do you have any ideas?

Are you testing with multiple builds? You can use worktrees on Windows (see the Git bootcamp and cheat sheet). Probably unrelated.

PyObject_CallOneArg() is just a wrapper around _PyObject_VectorcallTstate(). So that has to be the issue:

Maybe this? At least it shouldn’t be slower.

if (niters == 1) {
    PyObject *it = PyTuple_GET_ITEM(lz->iters, 0);
    PyObject *val = Py_TYPE(it)->tp_iternext(it);
    if (val == NULL) {
        return NULL;
    }
    PyObject *args[1];
    args[0] = val;
    result = _PyObject_VectorcallTstate(tstate, lz->func, args, 1, NULL);
    Py_DECREF(val);
    return result;
}
// Slow path

Good to know. This explains why PyObject_CallOneArg is so fast: _PyObject_VectorcallTstate is fast, and PyObject_CallOneArg is a thin wrapper that sets it up well for one argument.

However, I don’t think this is an issue. After all, I am using PyObject_CallOneArg in map1 and it performs well.

The differences between map1 and map2 for one iterable are:

  1. The conditional statement, whose complex branch is never taken for one iterable.
  2. In map1, lz->it holds the Iterator itself rather than a tuple[Iterator].
  3. The additional const Py_ssize_t niters = PyTuple_GET_SIZE(lz->iters);
  4. PyObject *it = PyTuple_GET_ITEM(lz->iters, 0);

I wonder if it has anything to do with the else branch which is never entered.

I was mostly wondering why it’s slower than map(). After all, it does more than just call _PyObject_VectorcallTstate().

You did remove the stack creation, right?

So, map1 is trimmed to the bone, with everything unnecessary removed.

map2 has a big if statement in map_next:
http://tpcg.io/_ANM2M4

You could also try to add an lz->iter attribute to the map object and use that if there’s only one iterable.

Same results. Strangely, the difference is most pronounced for size 100.

The speed suffers greatly in both cases: one iterable and more than one.

It may just be a compiler artifact. You add a few lines of code, the compiler hits some threshold, and it optimizes other lines differently. You add a fast path to gain 10% in one special case; three months later someone optimizes another part of the code, and it makes your fast path slower than the general path. And it can be very different on another platform or with another version of the compiler.


It does seem to be this way. I tested some more, and performance increases as I keep removing code that is not executed.


Is this about map(f, it), which can instead be written as (f(x) for x in it)?
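For a single iterable, the two forms are indeed interchangeable in meaning; the difference discussed in this thread is purely the per-item call overhead:

```python
xs = range(5)

# Eager comparison: map vs. an equivalent comprehension.
assert list(map(bool, xs)) == [bool(x) for x in xs]

# Both are also lazy iterators that can be consumed on demand.
assert list(map(bool, xs)) == list(bool(x) for x in xs)
```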