`builtins.map` performance improvement for one iterable

I have tested map1, which behaves the same as map but has a more efficient implementation for the case of a single iterable. I also implemented mapg = Custom(map1, map), which selects which map function to use based on the number of arguments.

import functools as ftl

import nda  # custom array module with a numpy-like API
from hello import ilen
from map1 import map1
from custom import Custom

mapg = Custom(map1, map)
def test_map1(a):
    ilen(map1(bool, a))

def test_mapg(a):
    ilen(mapg(bool, a))

def test_map(a):
    ilen(map(bool, a))

ns = nda.ary([1, 5, 10, 50, 100, 500, 1000, 5_000, 10_000, 100_000])
rngs = [range(i) for i in ns]
lambdas = [
    [ftl.partial(test_map1, a),
     ftl.partial(test_mapg, a),
     ftl.partial(test_map, a)] for a in rngs
]

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃       5 repeats, 10,000 times           ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Units: ns      map1      mapg       map ┃
┃           ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃         1 ┃     241       250       247 ┃
┃         5 ┃     290       292       301 ┃
┃        10 ┃     355       368       377 ┃
┃        50 ┃     887       898       934 ┃
┃       100 ┃    1518      1545      1666 ┃
┃       500 ┃    9680      9619     10319 ┃
┃      1000 ┃   21492     21935     23451 ┃
┃      5000 ┃  117899    118968    126211 ┃
┃     10000 ┃  243840    237693    255482 ┃
┃    100000 ┃ 2427835   2439256   2490276 ┃
┗━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Gain over plain map, as a percentage and as ns saved per item:

       | map1 % mapg % map1 ns/iter mapg ns/iter
-------+----------------------------------------
     1 |    2.4   -1.3          5.9         -3.1
     5 |    3.5    2.8          2.1          1.7
    10 |    5.9    2.5          2.2          1.0
    50 |    5.1    3.9          1.0          0.7
   100 |    8.9    7.2          1.5          1.2
   500 |    6.2    6.8          1.3          1.4
  1000 |    8.4    6.5          2.0          1.5
  5000 |    6.6    5.7          1.7          1.4
 10000 |    4.6    7.0          1.2          1.8
100000 |    2.5    2.0          0.6          0.5

The biggest performance gain is close to 10%, for iterables of size around 100.

mapg = Custom(map1, map) does the selection at a cost of about 10 ns, so the same dispatch could be implemented efficiently enough to benefit existing code as well.

Although the performance benefit isn’t very big, it is non-trivial for sizes of 10-100, which is approximately the range I was trying to address in Builtins.any performance - #29 by dgrigonis, where I was looking for ways to make functional tools competitive with a simple loop for short iterables.

This is one of two improvements that I came up with while looking at this, and it is the simpler one.

The map function is used extensively, so the performance benefit potentially amounts to something a bit more significant. Also, most use cases pass a single iterable.
CPython:

zg '\Wmap\([^,]*,[^,]*\)' | wc -l
     452
zg '\Wmap\([^,]*,[^,]*,[^,]\)' | wc -l
      15
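The one-comma vs. two-comma distinction those patterns rely on can be illustrated with Python's re module (my quick demonstration; these are rough heuristics, not a parser):

```python
import re

# map(func, iterable) -- exactly one comma inside the parentheses
one_iter = re.compile(r'\Wmap\([^,]*,[^,]*\)')
# map(func, it1, it2) -- two commas
two_iter = re.compile(r'\Wmap\([^,]*,[^,]*,[^,]*\)')

samples = [
    " map(bool, xs)",
    " map(add, xs, ys)",
    " list(map(str, range(3)))",
]
for s in samples:
    kind = ("2 iterables" if two_iter.search(s)
            else "1 iterable" if one_iter.search(s)
            else "no match")
    print(s.strip(), "->", kind)
```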

Github:

565k files \Wmap\([\w\s]*,[\w\s]*\)
14k files  \Wmap\([\w\s]*,[\w\s]*,[\w\s]*\)

Does this sound like something that would be worthwhile?


Also, the same approach could be applied to the filter function, which currently only accepts a single iterable and could be efficiently extended to many.
Github:

319k \Wfilter\(\w*lambda

and many of those cases could make use of either more arguments or a more flexible functools.partial
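For instance (my illustration, not from the thread), a common `filter(lambda ...)` pattern and a `partial`-based equivalent:

```python
from functools import partial
import operator

xs = [-2, -1, 0, 1, 2, 3]

# Typical lambda form:
with_lambda = list(filter(lambda x: x > 0, xs))

# partial(operator.lt, 0)(x) evaluates operator.lt(0, x), i.e. 0 < x,
# so this filters for positive values without a Python-level lambda:
with_partial = list(filter(partial(operator.lt, 0), xs))

assert with_lambda == with_partial == [1, 2, 3]
```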

How are hello, map1, custom and nda defined?

For hello.ilen implementations see:

For map1.map1:

It is the same as map, but implemented for one iterable with the slightly more efficient PyObject_CallOneArg. Its __next__ is simply:

static PyObject *
map1_next(map1object *lz)
{
    /* Advance the single underlying iterator directly. */
    PyObject *it = lz->it;
    PyObject *item = Py_TYPE(it)->tp_iternext(it);
    if (item == NULL) {
        /* Exhausted, or an error is set; either way propagate NULL. */
        return NULL;
    }
    PyObject *result = PyObject_CallOneArg(lz->func, item);
    Py_DECREF(item);
    return result;
}
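Semantically, map1 is just map restricted to a single iterable. A pure-Python model (mine, for illustration) makes the intended equivalence easy to check:

```python
def map1_py(func, iterable):
    # Pure-Python model of map1: apply func lazily to one iterable.
    for item in iterable:
        yield func(item)

assert list(map1_py(bool, range(5))) == list(map(bool, range(5)))
assert list(map1_py(str.upper, ["a", "b"])) == ["A", "B"]
```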

custom.Custom is a simple class with __call__:

static PyObject *
Custom_callvector(CustomObject *self, PyObject *const *args,
                  size_t nargsf, PyObject *kwnames)
{
    /* Dispatch on the number of positional arguments:
       exactly two (func + one iterable) -> first, otherwise -> last. */
    if (PyVectorcall_NARGS(nargsf) == 2) {
        return PyObject_VectorcallDict(self->first, args, nargsf, NULL);
    }
    else {
        return PyObject_VectorcallDict(self->last, args, nargsf, NULL);
    }
}
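In pure Python, the dispatch that Custom performs would look roughly like this (a sketch of the semantics only; the real thing is a C vectorcall, which is why its overhead is only about 10 ns):

```python
class CustomPy:
    """Call `first` when given exactly two positional args, else `last`."""

    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __call__(self, *args, **kwargs):
        if len(args) == 2:      # e.g. map(func, single_iterable)
            return self.first(*args, **kwargs)
        return self.last(*args, **kwargs)

# With the C extension this would be Custom(map1, map); here both
# branches fall back to the builtin map just to show the dispatch.
mapg = CustomPy(map, map)
assert list(mapg(bool, range(3))) == [False, True, True]
assert list(mapg(max, [1, 4], [3, 2])) == [3, 4]
```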

I just picked the template for it from "2. Defining Extension Types: Tutorial" in the Python 3.12.3 documentation.

nda is just an array object with the same API as numpy; nda.ary == numpy.asarray.

Sounds reasonable to optimize map for the overwhelmingly most common use case of a single iterable.

How about a fast path for a single iterable? Would that have less overhead?

Could you be a bit more specific? Constant/function names/signatures would help me understand what exactly you are referring to. Best would be code with an indication of where to put it. :slight_smile: I don’t know a lot, so it is very likely that I don’t even know what you are referring to.

We could avoid the stack creation and vector call here in the case of a single iterator, I think:

const Py_ssize_t niters = PyTuple_GET_SIZE(lz->iters);
if (niters == 1) {
    PyObject *it = PyTuple_GET_ITEM(lz->iters, 0);
    PyObject *val = Py_TYPE(it)->tp_iternext(it);
    if (val != NULL) {
        result = PyObject_CallOneArg(lz->func, val);
        Py_DECREF(val);  /* release the item once func has consumed it */
    }
    goto exit;
}
// Slow path

If this improves performance, I think this has a higher chance of getting accepted.

This was the first thing that I tried, and it didn’t seem to work well. However, since you mentioned it again: I was using %timeit magics and list instead of ilen, and these were distorting the results. I will double-check it.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃            5 repeats, 10,000 times                ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Units: ns      map1      mapg      map2       map ┃
┃           ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃         1 ┃     232       239       252       240 ┃
┃         5 ┃     293       308       325       297 ┃
┃        10 ┃     357       370       393       379 ┃
┃        50 ┃     865       870       998       954 ┃
┃       100 ┃    1542      1532      1825      1798 ┃
┃       500 ┃    9465      9540     10172     10218 ┃
┃      1000 ┃   21588     21233     23345     22270 ┃
┃      5000 ┃  115924    118096    122446    121526 ┃
┃     10000 ┃  237003    247886    258268    246955 ┃
┃    100000 ┃ 2399945   2371219   2531544   2433305 ┃
┗━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Gain over plain map, as a percentage and as ns saved per item:

       | map1 % mapg % map2 % map1 ns/iter mapg ns/iter map2 ns/iter
-------+------------------------------------------------------------
     1 |    3.4    0.5   -4.9          8.3          1.2        -11.8
     5 |    1.2   -4.0   -9.6          0.7         -2.4         -5.7
    10 |    5.7    2.3   -3.8          2.2          0.9         -1.5
    50 |    9.4    8.8   -4.6          1.8          1.7         -0.9
   100 |   14.2   14.8   -1.5          2.6          2.7         -0.3
   500 |    7.4    6.6    0.4          1.5          1.4          0.1
  1000 |    3.1    4.7   -4.8          0.7          1.0         -1.1
  5000 |    4.6    2.8   -0.8          1.1          0.7         -0.2
 10000 |    4.0   -0.4   -4.6          1.0         -0.1         -1.1
100000 |    1.4    2.6   -4.0          0.3          0.6         -1.0

It doesn’t work well. It ends up even slower than normal map. I don’t know why; do you have any ideas?

Are you testing with multiple builds? You can use worktrees on Windows (see the Git bootcamp and cheat sheet). Probably unrelated.

PyObject_CallOneArg() is just a wrapper around _PyObject_VectorcallTstate(). So that has to be the issue:

Maybe this? At least it shouldn’t be slower.

if (niters == 1) {
    PyObject *it = PyTuple_GET_ITEM(lz->iters, 0);
    PyObject *val = Py_TYPE(it)->tp_iternext(it);
    if (val == NULL) {
        return NULL;
    }
    PyObject *args[1];
    args[0] = val;
    result = _PyObject_VectorcallTstate(tstate, lz->func, args, 1, NULL);
    Py_DECREF(val);
    return result;
}
// Slow path

Good to know. This explains why PyObject_CallOneArg is so fast: _PyObject_VectorcallTstate is fast, and PyObject_CallOneArg is a thin wrapper that sets it up well for one argument.

However, I don’t think this is an issue. After all, I am using PyObject_CallOneArg in map1 and it performs well.

The differences between map1 and map2 for one iterable are:

  1. The conditional statement, whose complex branch is never taken for one iterable.
  2. In map1, lz->it holds the Iterator itself rather than a tuple[Iterator].
  3. The additional const Py_ssize_t niters = PyTuple_GET_SIZE(lz->iters);
  4. PyObject *it = PyTuple_GET_ITEM(lz->iters, 0);

I wonder if it has anything to do with the else branch which is never entered.

I was mostly wondering why it’s slower than map(). After all, it does more than just call _PyObject_VectorcallTstate().

You did remove the stack creation, right?

So, map1 is trimmed to the bone, with everything unnecessary removed.

map2 has a big if statement in map_next:
http://tpcg.io/_ANM2M4

You could also try to add an lz->iter attribute to the map object and use that if there’s only one iterable.

Same results. Strangely, the difference is most pronounced for size 100.

The speed suffers greatly in both cases: one iterable and more than one.

It may just be a compiler artifact. You add a few lines of code, the compiler hits some threshold, and it optimizes other lines differently. You add a fast path to gain 10% in one special case; three months later someone optimizes another part of the code, and it makes your fast path slower than the general path. And it can be very different on another platform or with another version of the compiler.


It does seem to be this way. I tested some more, and performance increases as I keep removing code that is not executed.


Is this about map(f, it), which can instead be written as (f(x) for x in it)?
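For a single iterable, the two forms are indeed interchangeable in meaning; the difference discussed in this thread is purely the per-item call overhead:

```python
xs = range(5)

# Eager comparison: map vs. an equivalent comprehension.
assert list(map(bool, xs)) == [bool(x) for x in xs]

# Both are also lazy iterators that can be consumed on demand.
assert list(map(bool, xs)) == list(bool(x) for x in xs)
```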