Memoryview.index() 860x slower than bytes.index()

I’ve found that memoryview.index() is significantly slower than bytes.index(). In the small, reproducible example below, if you change data = bytes(obj) to data = obj, it runs 860 times slower on my machine. It’s counterintuitive that copying a memoryview into a bytes object makes the code faster. Is it possible to have memoryview.index() use memchr when possible?

import time


def main(obj):
    data = bytes(obj)  # removing bytes() => 860x slower !!!
    for _ in range(10_000):
        assert data.index(10, 100) == len(data) - 1


if __name__ == '__main__':
    data = memoryview(bytearray(64 * 1024))
    data[-1] = 10
    start = time.perf_counter()
    main(data)
    stop = time.perf_counter()
    print(f'time = {stop-start:.3f}')
3 Likes

Looking at the code, this definitely looks possible, but it needs someone to implement it.

The existing memoryview index code restricts itself to one-dimensional memoryviews, which are fairly similar to bytes (cpython/Objects/memoryobject.c at e90061f5f13ff8ad43cfed0ca724bef42609fe20 · python/cpython · GitHub)

bytes uses a CPython-internal library known as “stringlib”, which has lots of optimized code working on blocks of machine bytes (https://github.com/python/cpython/blob/main/Objects/bytes_methods.c#L522-L535). Using stringlib for 1-d memoryviews with 1-byte elements seems like a good optimization.

I’d be curious what a profiler says is taking the extra time; I’m wondering if there are some simple tweaks that would get a lot better performance even in the multi-byte memoryview element case.

cc: @picnixz and @Jelle who added memoryview.index in 3.14 (PR that added: GH-125446)

3 Likes

Sometimes profile results can be surprising, but looking at the code, the orders-of-magnitude timing difference is hardly surprising. It turns every byte, one by one, into a heap-allocated Python object just to compare it with another Python object and then deallocate it. If you wanted to make this faster, you would convert the input to a C type at the start and then do all the comparisons with C types. That would not be difficult, but it needs more code to handle the different possible input types and element sizes.

It is acknowledged in the PR that it is “probably a bit slower”, which seems rather understated, but the choice is understandable if speed was not a priority.

2 Likes

stringlib won’t be of much interest here. That’s aiming at reducing the worst case of searching for a string of length m in a string of length n from O(m*n) to O(m+n), with a best case as low as O(n/m). Lots and lots of delicate work. We’re only looking for a single element, and a “one at a time” compare is already theoretically O()-optimal.

As @oscarbenjamin said, memoryviews suffer the costs of “hyper-generalization”. A memoryview abstracts away the base memory address, the item size, the data type, and even the distance between slice elements (strides can even be negative!).

All of that gets re-deduced on every element access, and, adding insult to injury, funnels each compare through the hyper-general PyObject_RichCompareBool(), which requires building Python objects from the raw memory chunks, and re-deducing some of that stuff another time (like what the conceptual type is, to find the right comparison instructions to execute, and to enforce that the only kind of comparison .index() cares about is __eq__).

Tons of redundant and theoretically unnecessary work.
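You can see the object churn from pure Python: every subscript on a “B”-format view hands back a freshly built Python int, and it’s those ints that get compared.

```python
mv = memoryview(bytearray(b'ab'))

# Each access re-deduces the format and builds a new Python int object:
print(type(mv[0]))    # <class 'int'>
print(mv[0] == 97)    # True - the compare goes through rich comparison

# Iterating does the same, once per element:
print(list(mv))       # [97, 98]
```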

Much of that was already in play with the simpler array.array.index(), which knows in advance that the only kind of stride is one element. So try

    data = array.array('B', obj)

and you’ll find that’s also much slower, but not as bad as memoryview.
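A rough micro-benchmark sketch to see the relative costs for yourself (exact ratios vary by machine, and memoryview.index() needs Python 3.14+, so it’s guarded here):

```python
import array
import timeit

buf = bytearray(64 * 1024)
buf[-1] = 10   # the only non-zero byte, at the very end

candidates = {
    'bytes': bytes(buf),
    'array.array': array.array('B', buf),
}
if hasattr(memoryview(b''), 'index'):   # memoryview.index() is new in 3.14
    candidates['memoryview'] = memoryview(buf)

for name, data in candidates.items():
    t = timeit.timeit(lambda: data.index(10), number=100)
    print(f'{name:12s} {t:.4f}s')
```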

There is “a solution”, but doubt anyone will endure the pain to implement it: write different C code for every kind of underlying data type, accessing the raw memory directly via correspondingly typed C variables, and using native C == for comparison. If the code ever needs to call PyObject_RichCompareBool(), it will never be speed-competitive.

Partly voice of experience there: Python’s list.sort() got very much faster some years ago for some type-homogenous lists, when an enterprising high school student undertook to replace (when possible) all-purpose PyObject_RichCompareBool() calls with type-specific custom comparison code hiding in the bowels of listobject.c. For example,

/* Float compare: compare any two floats. */
static int
unsafe_float_compare(PyObject *v, PyObject *w, MergeState *ms)
{
    int res;

    /* Modified from Objects/floatobject.c:float_richcompare, assuming: */
    assert(Py_IS_TYPE(v, &PyFloat_Type));
    assert(Py_IS_TYPE(w, &PyFloat_Type));

    res = PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w);
    assert(res == PyObject_RichCompareBool(v, w, Py_LT));
    return res;
}

In a list of all floats, that’s what’s executed to compare. PyObject_RichCompareBool() can be called, but only in a DEBUG build. Else it’s just a few C instructions. PyFloat_AS_DOUBLE() is trivial, just casting its argument to PyFloatObject* and then accessing the ob_fval member, which holds the native raw machine bytes. So it amounts to two memory loads of C doubles and one native double compare.

In theory, similarly huge savings could be baked in for all types corresponding to native C scalar types.

It’s real work, though.

4 Likes

It could also be done just for the most common data type as a fast path.

3 Likes

Which I have no opinions about, so will leave it to someone who cares enough to do the work :wink:.

I would guess B is the most common flavor of view, and, indeed, is the flavor a raw memoryview() call returns (getting others requires a .cast()). And it’s the flavor the OP used.
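For reference, a quick check of the default flavor, and what a .cast() gives you (using struct.calcsize rather than assuming a 4-byte C unsigned int):

```python
import struct

mv = memoryview(bytearray(16))
print(mv.format, mv.itemsize)   # B 1 - the flavor a raw memoryview() call returns

# Any other element type takes an explicit cast:
mv_u32 = mv.cast('I')           # reinterpret the same buffer as C unsigned ints
print(mv_u32.format, mv_u32.itemsize, len(mv_u32))
```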

But is it worth the bother and churn? Unknown to me. I’ve never used .index() on a memoryview, or an array.array. I’ve never used .count() on them either, which suffers similar sloth for the same reasons. Is there a reason to believe that .index() is more important than .count()? Etc etc. Nothing is simple :frowning:.

I would say it’s certainly worth implementing a fast path for bytes for index(), but I do process a lot of ASCII data :slight_smile: My current workaround of wrapping a memoryview in bytes() before calling index() is fine for now, but I’ve created a todo item for myself to educate myself on the CPython codebase to the end of possibly implementing this later.

I’ve implemented a particular high performance file processing app in Ruby, Racket, C++, and finally Go (which equalled the C++ speed). I’m currently implementing it in Python, and my initial results are very encouraging. Minimizing allocation and copying has been super important, so multiprocessing.shared_memory with custom buffered file I/O and a better architecture (which Python allowed me to arrive at more easily via experimentation) has been performing well. Getting as close to memcpy and memchr as possible is super helpful for performance.

I don’t expect to match the Go speed exactly, but it’s looking like I’ll get close enough to be able to remove Go from the picture, which is a win.

3 Likes

If you’re feeling adventurous, you could use Python’s ctypes module to call your platform’s C memchr() directly. This is low-level hackery, and it’s very easy to cause segfaults, etc. If you already know the starting address, it should be more-or-less straightforward (although ctypes has a steep learning curve). If the buffer’s starting address is hiding under a memoryview, though, I’m not sure I believe what chatbots tell me about how ctypes can be used to pull back that curtain too.

Too much of a bottomless pit for me to dive into now, but someone else might enjoy it :smiley:.

1 Like

Easier than I thought! Under the covers (at the C level), all these things (bytearrays, array.arrays, memoryviews, mmap’ed memory, …) use the C-level “buffer protocol”. That’s not directly exposed at the Python level, though. ctypes gets at it via

base_address = ctypes.addressof(ctypes.c_char.from_buffer(obj))

where obj is any Python object whose type participates in the C-level buffer protocol.

But there’s a catch! ctypes refuses to return the address if the object marks its buffer as “read-only”. You get a TypeError if you try. So, e.g., you can use this with a bytearray object, but not with a bytes object.
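A minimal demonstration of that refusal:

```python
import ctypes

writable = bytearray(b'abc')
shared = ctypes.c_char.from_buffer(writable)   # keep this alive while using the address
addr = ctypes.addressof(shared)
print(hex(addr))

try:
    ctypes.c_char.from_buffer(b'abc')          # bytes marks its buffer read-only
except TypeError as exc:
    print('refused:', exc)
```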

The rest is just “vanilla” code to teach ctypes about memchr()’s signature. I can testify that ChatGPT-5 got all those details right too, and suggested code that worked on the first try. Plus it made sense to me :smile:.

Very much faster than using the exposed .index() methods of “B” format buffers with unit stride, but still slower than, e.g., bytearray’s type-specialized .index(), which is entirely in C (you call ctypes with Python-level code, which adds significant overheads of its own).

1 Like

Uh… Something is supposed to keep ownership of the underlying Py_buffer so that the memory isn’t reclaimed while it’s still in use. But this snippet only returns a bare address.

So you can get an invalid pointer if the underlying object gets mutated (or even garbage-collected):

>>> b = bytearray(b"123")
>>> base_address = ctypes.addressof(ctypes.c_char.from_buffer(b))
>>> b += b"4567"
>>> base_address == ctypes.addressof(ctypes.c_char.from_buffer(b))
False
2 Likes

Yes. As I said,

ctypes is for consenting adults. In context, the address would be obtained in a function that then immediately passes it to libc’s memchr(). If, e.g., the argument is a memoryview that’s already been .release()'ed, trying to do from_buffer() raises an exception:

ValueError: operation forbidden on released memoryview object

There are insecurities remaining due to other threads possibly making the address invalid between the Python code obtaining the address and later passing it to libc.memchr, but that’s just a fact of life common to all code that gets an address from ctypes by any means. You use it at your own risk.

ctypes also offers .from_buffer_copy(), which makes its own physical copy of the buffer, and works for read-only buffers too. That’s as safe as can be, but at potentially high cost.
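For instance, from_buffer_copy() happily accepts a read-only bytes object, at the cost of the copy (note the c_char array of the right length, so the whole buffer is copied, not just one byte):

```python
import ctypes

data = bytearray(b'hello')
copied = (ctypes.c_char * len(data)).from_buffer_copy(data)
data[0] = ord('j')       # mutating the original...
print(copied.raw)        # b'hello' - ...doesn't touch the private copy
print(bytes(data))       # b'jello'

# And it works on read-only buffers, where from_buffer() raises TypeError:
ro = (ctypes.c_char * 3).from_buffer_copy(b'abc')
print(ro.raw)            # b'abc'
```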

Pick your poison.

I’ll add that ctypes.c_char.from_buffer(obj) is not just a consumer of, but a participant in, the buffer protocol. It too holds ownership of the shared buffer so long as the instance of ctypes.c_char it creates is reachable.

What it actually does under the covers appears to be this: it creates its own memoryview of the argument, and that’s what holds the reference. So the underlying buffer won’t become invalid so long as the return value is alive.

ctypes.addressof() then retrieves the base buffer address from ctypes’s internal memoryview.

But once the code that created the from_buffer() result goes out of scope, all bets are off. ctypes’s internal memoryview is released when the returned value is reclaimed, and the address can become gibberish then.

So in the context of a single function that obtains and uses the address, and then goes away, it’s as safe as any other code that creates its own memoryview of the argument. It’s certainly unsafe to use the address beyond that function’s lifetime.

1 Like

Which means the original code should be rewritten, from

base_address = ctypes.addressof(ctypes.c_char.from_buffer(obj))

to

mybuf = ctypes.c_char.from_buffer(obj)
base_address = ctypes.addressof(mybuf)

That keeps the “hidden” reference to the shared buffer alive until mybuf goes out of scope. In the original, refcounting reclaims the object as soon as the call to .addressof() completes, and other threads can sneak in then and potentially destroy the buffer.

1 Like

Ok, so what you want is something like:

>>> libc = ctypes.CDLL('libc.so.6')
>>> libc.memchr.argtypes = (ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t)
>>> libc.memchr.restype = ctypes.c_void_p
>>> def fast_buffer_index(buf, byte):
...     with memoryview(buf) as m:
...         addr = ctypes.addressof(ctypes.c_char.from_buffer(m))
...         found = libc.memchr(addr, byte, len(m))
...         if found is None:  # NULL
...             raise ValueError("value not found")
...         return found - addr
...         
>>> b = bytearray(b"123")
>>> fast_buffer_index(b, ord(b"2"))
1
>>> fast_buffer_index(b, ord(b"4"))
Traceback (most recent call last):
  File "<python-input-59>", line 1, in <module>
    fast_buffer_index(b, ord(b"4"))
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "<python-input-50>", line 6, in fast_buffer_index
    raise ValueError("value not found")
ValueError: value not found

But the ctypes call overhead is really horrendous until you reach very large string sizes. You definitely want a C (or Rust :wink: ) version of this instead.

2 Likes

Bingo! I didn’t realize memoryview() can be used as a context manager - nice!

I’ll leave the rest for chatbots to answer (they appear to be very knowledgeable in this area):

  • Actual .index() methods support optional start and end arguments too.
  • How to get at the C library varies across platforms; e.g., on Windows, it’s ctypes.CDLL("msvcrt.dll").
  • Very similar things can be done to get at C’s memmove() and memcpy().

YMMV. It’s over 50x faster on my box on the OP’s original code (using data = obj). Still “a lot” slower than using data = bytes(obj), though.
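Sketching the first two bullets (a rough POSIX-and-Windows sketch; buffer_index is my made-up name, it assumes 0 <= start <= end <= len(buf), and it handles writable buffers only, since from_buffer() refuses read-only ones):

```python
import ctypes
import sys

# Load the platform C library; CDLL(None) exposes the process's libc on POSIX.
libc = ctypes.CDLL('msvcrt') if sys.platform == 'win32' else ctypes.CDLL(None)
libc.memchr.argtypes = (ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t)
libc.memchr.restype = ctypes.c_void_p

def buffer_index(buf, byte, start=0, end=None):
    # Hypothetical helper; assumes 0 <= start <= end <= len(buf).
    with memoryview(buf) as m:
        if end is None:
            end = len(m)
        shared = ctypes.c_char.from_buffer(m)   # keeps the buffer pinned
        addr = ctypes.addressof(shared)
        found = libc.memchr(addr + start, byte, end - start)
        del shared   # drop the buffer export so the view can be released
    if found is None:                           # NULL: not present
        raise ValueError('value not found')
    return found - addr

b = bytearray(b'hello world')
print(buffer_index(b, ord('o')))       # 4
print(buffer_index(b, ord('o'), 5))    # 7
```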

1 Like

I really appreciate y’all digging into this - I’ve learned a lot :slight_smile:

2 Likes

But I may have mentioned that ctypes has a steep learning curve :wink:. So many moving pieces.

If you try that naively, you’ll find that your fast_buffer_index() can only find the first byte in the buffer. What happened to the rest? It wasn’t copied to begin with :frowning:

To get the whole buffer copied requires changing:

        addr = ctypes.addressof(ctypes.c_char.from_buffer(m))

to:

        buffer_proto = ctypes.c_char * len(m)
        addr = ctypes.addressof(buffer_proto.from_buffer_copy(m))

instead.

You’re in a maze of twisty little passages, all alike :wink:.

And I have to take that back. It slipped my mind that, for objects haystack of types bytes and bytearray, haystack.index(needle) works to search for multi-byte, bytes-like needle objects too. So all the machinery in stringlib is relevant. Very little of it is actually useful when searching for a single byte, though.

But it is used to great effect in tough cases, like:

>>> x = b'x' * 1_000_000_000
>>> xy = x[: 1_000_000] + b'y'
>>> x.index(xy)
ValueError: subsection not found

That takes a few seconds, but not weeks! :smile:. “The obvious” algorithm does take an enormously long time to fail to match:

  • Does it match starting at index 0?
    – No, a million bytes match, but not the final “y”.
  • Does it match starting at index 1?
    – No, a million bytes match, but not the final “y”.
  • Does it match starting at index 2?
    – No, a million bytes match, but not the final “y”.

On and on, close to a billion times, doing about a quadrillion compares in all. stringlib is smart enough to do it at least a million times faster.

You could do that yourself by thinking about it, and realizing that “the trick” is to search for “y” first. Which works great in this specific case, but doesn’t generalize in any obvious way. stringlib uses a mix of clever approaches that always finish in linear time.
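For contrast, here’s a toy version of “the obvious” algorithm the bullets above walk through (my own illustration, not how stringlib actually works):

```python
def naive_index(haystack, needle):
    # Textbook O(len(haystack) * len(needle)) search, for illustration only.
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        if haystack[i:i + m] == needle:
            return i
    raise ValueError('subsection not found')

print(naive_index(b'abcabc', b'cab'))   # 2
```

Run it on a scaled-down version of the pathological case above (say a 100 KB haystack of b'x' and a needle ending in b'y') and the quadratic behavior shows up quickly.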

1 Like

To be fair, I designed the implementation of index() to be generic on purpose because there was no indication whether this interface benefits from having fast paths. I left future optimizations to “future me” because it wasn’t worth the code complexity.

Unknown to me. I’ve never used .index() on a memoryview, or an array.array. I’ve never used .count() on them either, which suffers similar sloth for the same reasons. Is there a reason to believe that .index() is more important than .count()?

Note that index() and count() were only introduced in 3.14 because of the Sequence API contract that we needed to respect. We don’t have an explicit __contains__, but we support x in mv because it falls back to __iter__ (or __getitem__, I don’t remember which one is used here), so we are kinda half-supporting the contract as well (mv.__contains__ will raise). But remember that memoryview(bytes(...)) is an iterable of int, not bytes, so x in mv works if x is an int here.
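Concretely:

```python
mv = memoryview(b'spam')

print(ord('p') in mv)   # True: elements compare as ints via the iteration fallback
print(b'p' in mv)       # False: a one-byte bytes object never equals an int

# No explicit __contains__ is defined on the type:
print(hasattr(mv, '__contains__'))   # False
```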

6 Likes

Thanks for that! Both for implementing it, and for disclosing the thinking. I agree you did The Best Thing™ at the time.

While true of memoryview(), that’s not true of array.array. Arrays have supported them since Python 2.2 (in 2001!). And for a very similar reason: not because there was user demand, but for “theoretical purity”.

The difference is nobody used them :wink:, so nobody cared that they were slow. Now, a quarter century later, it’s “an issue”.

array.array does have __contains__, but makes no effort to make it fast. It uses __getitem__ in a loop, and funnels every compare through PyObject_RichCompareBool(). It too could be made 100s of times faster.

Thinking out loud: if we want to “do something” about this, a general but simple function taking start and end pointers (into a buffer of raw bytes), item size (in bytes), stride between items, and the starting address of the bytes to look for, could use memcmp() in a loop to implement index(), and be called by any object supporting the buffer interface. With some limitations; e.g., searching for math.nan “should” always fail (“theoretical purity” again :frowning:), but comparing raw bytes doesn’t know about special cases. When it applies, though (“almost always”), it would be very much faster.
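A ctypes prototype of that idea (my sketch, not a proposal for the real C code; raw_index is a made-up name, it handles writable 1-D views only, and, as noted, NaN-style equality rules are deliberately ignored):

```python
import array
import ctypes
import sys

libc = ctypes.CDLL('msvcrt') if sys.platform == 'win32' else ctypes.CDLL(None)
libc.memcmp.argtypes = (ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t)
libc.memcmp.restype = ctypes.c_int

def raw_index(view, item_bytes):
    # Compare raw item bytes with memcmp; no per-element Python objects.
    itemsize = view.itemsize
    assert len(item_bytes) == itemsize
    shared = ctypes.c_char.from_buffer(view)    # pins the buffer
    base = ctypes.addressof(shared)
    needle = ctypes.create_string_buffer(item_bytes, itemsize)
    naddr = ctypes.addressof(needle)
    stride = view.strides[0]                    # honors the view's stride
    for i in range(len(view)):
        if libc.memcmp(base + i * stride, naddr, itemsize) == 0:
            return i
    raise ValueError('not found')

a = array.array('I', [7, 8, 9])
target = (8).to_bytes(a.itemsize, sys.byteorder)
print(raw_index(memoryview(a), target))   # 1
```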

2 Likes