stringlib won’t be of much interest here. That’s aiming at reducing the worst case of searching for a string of length m in a string of length n from O(m*n) to O(m+n), with best case as low as O(n/m). Lots and lots of delicate work. We’re only looking for a single element, and “one at a time” compare is already theoretically O()-optimal.
As @oscarbenjamin said, memoryviews suffer the costs of “hyper-generalization”. A memoryview abstracts away the base memory address, the item size, the data type, and even the distance between slice elements (strides can even be negative!).
All of that gets re-deduced on every element access. Adding insult to injury, each compare is funneled through the hyper-general PyObject_RichCompareBool(), which requires building Python objects from the raw memory chunks, and re-deducing some of that stuff yet again (like what the conceptual type is, to find the right comparison instructions to execute, and to enforce that the only kind of comparison .index() cares about is __eq__).
Tons of redundant and theoretically unnecessary work.
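A small illustration of that generality (the buffer and names here are just for the demo): a sliced view carries its own strides, which can be negative, and every subscript has to rebuild a fresh Python object from the underlying raw memory:

```python
import array

# A typed buffer of C doubles, viewed through a memoryview.
buf = array.array('d', [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
mv = memoryview(buf)

# A strided slice: every access must honor the (negative!) stride.
rev = mv[::-2]                        # views items 5.0, 3.0, 1.0
assert rev.strides == (-2 * mv.itemsize,)
assert rev[0] == 5.0

# Each subscript materializes a brand-new Python float from raw bytes.
assert mv[2] == 2.0
assert type(mv[2]) is float
```

None of that bookkeeping exists for bytes.index(), which knows its base address, item size (1), and stride (1) at compile time.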
Much of that was already in play with the simpler array.array.index(), which knows in advance that the only kind of stride is one element. So try
```python
data = array.array('B', obj)
```
and you’ll find that’s also much slower, but not as bad as memoryview.
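For concreteness, here’s a rough timing sketch (sizes and names are arbitrary) of the different .index() paths. bytes.index() is a native memchr-style scan; array.array.index() loops in C but still boxes each element for comparison; memoryview.index() (a recent addition, guarded below so the sketch runs on older Pythons too) pays the full re-deduction cost:

```python
import array
from timeit import timeit

# Arbitrary demo setup: ~1 MB of zero bytes with the target at the very end.
obj = bytes(1_000_000) + b"\xff"
data = array.array('B', obj)          # same bytes, typed-array path
target = 0xFF

# All paths agree on the answer...
assert obj.index(target) == data.index(target) == 1_000_000
if hasattr(memoryview, "index"):      # memoryview.index() is a recent addition
    assert memoryview(obj).index(target) == 1_000_000

# ...but not on the cost.
t_bytes = timeit(lambda: obj.index(target), number=20)
t_array = timeit(lambda: data.index(target), number=20)
print(f"bytes.index: {t_bytes:.4f}s  array.index: {t_array:.4f}s")
```

On a typical build the array path is orders of magnitude slower than the bytes path, with memoryview slower still.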
There is “a solution”, but I doubt anyone will endure the pain to implement it: write different C code for every kind of underlying data type, accessing the raw memory directly via correspondingly typed C variables, and using native C == for comparison. If the code ever needs to call PyObject_RichCompareBool(), it will never be speed-competitive.
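Short of writing that C, a user-level trick on current interpreters exploits the fact that bytes.index() already is exactly such type-specialized native code. For itemsize-1 views only, one full copy plus a native scan generally beats the per-element path by a wide margin (a hypothetical helper, not anything the stdlib provides):

```python
def index_u8(view: memoryview, value: int) -> int:
    """Hypothetical workaround for byte-sized items: copy the view once
    into bytes, then let bytes.index() do a native scan."""
    if view.itemsize != 1:
        raise TypeError("only itemsize-1 views supported")
    # .tobytes() handles non-contiguous views too.
    return view.tobytes().index(value)

# Demo: target byte buried after ~500 KB of zeros.
raw = bytes(500_000) + b"\x7f" + bytes(500_000)
assert index_u8(memoryview(raw), 0x7F) == 500_000
```

That only helps for bytes-like items, of course; there’s no comparable escape hatch for, say, a view of C doubles.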
Partly voice of experience there: Python’s list.sort() got very much faster some years ago for some type-homogeneous lists, when an enterprising high school student undertook to replace (when possible) all-purpose PyObject_RichCompareBool() calls with type-specific custom comparison code hiding in the bowels of listobject.c. For example,
```c
/* Float compare: compare any two floats. */
static int
unsafe_float_compare(PyObject *v, PyObject *w, MergeState *ms)
{
    int res;

    /* Modified from Objects/floatobject.c:float_richcompare, assuming: */
    assert(Py_IS_TYPE(v, &PyFloat_Type));
    assert(Py_IS_TYPE(w, &PyFloat_Type));

    res = PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w);
    assert(res == PyObject_RichCompareBool(v, w, Py_LT));
    return res;
}
```
In a list of all floats, that’s what’s executed to compare. PyObject_RichCompareBool() can be called, but only in a DEBUG build (via the asserts). Else it’s just a few C instructions. PyFloat_AS_DOUBLE() is trivial, just casting its argument to PyFloatObject* and then accessing the ob_fval member, which holds the native raw machine bytes. So it amounts to two memory loads of C doubles, and one native double compare.
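The effect of that specialization is visible from pure Python. The pre-sort type scan is all-or-nothing: every key must be exactly a float, so a single off-type element (here an int — my example, not from the original code) sends the whole sort down the generic PyObject_RichCompareBool() path:

```python
import random
from timeit import repeat

random.seed(0)
floats = [random.random() for _ in range(100_000)]
mixed = floats[:-1] + [0]             # one int disables the float fast path

# min-of-repeats to damp timing noise; number=1 since sorted() copies anyway.
t_fast = min(repeat(lambda: sorted(floats), number=1, repeat=5))
t_slow = min(repeat(lambda: sorted(mixed), number=1, repeat=5))
print(f"all floats: {t_fast:.4f}s  one int: {t_slow:.4f}s")
```

The all-float sort typically runs substantially faster, despite the lists being element-for-element nearly identical.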
In theory, similarly huge savings could be baked in for all types corresponding to native C scalar types.
It’s real work, though.