I posted recently about adding a map_binary function which could be used in combination with operator.eq.

Reactions to that were mixed: most people didn’t see the point of it, and several suggested it might be better to just provide an implementation of all_equal directly.

That does make a lot of sense, and I have done some initial work on an implementation. Initial testing shows that a C implementation of all_equal runs in a bit over half the time of the fastest known Python implementation, which is based on a short-circuiting use of itertools.groupby.
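For concreteness, the short-circuiting groupby approach looks something like this (a sketch of the well-known recipe; the exact version I benchmarked against may differ in detail):

```python
from itertools import groupby

def all_equal(iterable):
    # groupby collapses consecutive runs of equal elements, so an
    # all-equal iterable produces at most one group. Asking for a
    # second group makes this short-circuit: the second next() stops
    # scanning as soon as a differing element is found.
    g = groupby(iterable)
    return next(g, True) and not next(g, False)
```

Note that it still has to scan the whole input when the elements really are all equal; the short-circuit only helps when a mismatch exists.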

There is a Python implementation of an all_equal function in more-itertools. The fact that it exists indicates there are developers out there already making use of such a function.

The problem with it is that it is noticeably slower than a C implementation. This is expected, because the existing Python implementation in more-itertools chains together several Python function calls. The current implementation also does not use short-circuit return logic, which means it would be expected to perform poorly on large input. There is an alternative short-circuiting version out there too, but I don’t think that version is the one used by the more-itertools module. (Please do correct me if my research into this is wrong.)

If you do an internet search for “python test if all elements are equal” or “python all equal”, you will find numerous forum threads where people are asking how to implement an all_equal function, because Python doesn’t currently provide one as part of the standard library.

A very broad range of possible implementations shows up, many of which only work under certain conditions (for example, leveraging numpy). It is also not obvious which of these solutions provide generally good performance on a wide range of input, which is important to know. Some solutions are more readable but perform surprisingly poorly. Others are cryptic, and it is not at all obvious from a first inspection how they work.

None of the above is good, because we are asking developers not only to spend time searching for a solution, but to inspect many possible implementations and somehow pick a “good” one. This is not easy to do.

In my opinion we should just provide the right tool for the job from the start; then future users will not spend large amounts of their time working out how to get this tool into their codebase.

It would be good to get some further feedback on this.

  • If you agree, please comment “I agree” or otherwise indicate approval with an upvote.
  • If you do not agree, please comment “I do not agree”.
  • Better yet, explain your objections (presumably if you do not agree, you have some reason to object to the proposal).

I don’t expect this to be a particularly interesting proposal to most people. If you happen to have worked on a project before where you would have found it useful to have an all_equal function, then you probably agree it would be good to add one.

If you haven’t worked on such a project, I would assume you are indifferent.

I can’t think of any disadvantages to doing so.

That’s not true.

Code (including tests) needs to be written, reviewed, documented, maintained, makes the codebase larger and thus more complicated, may introduce bugs, takes time to compile, tests take time to run. And unlike a recipe, an official function can’t easily be changed anymore, due to people relying on it (not that I think it would get changed, but that’s a general concern).

To me, it feels too narrow. On GitHub, you wrote all_equal […] really is just for a single purpose, like all and any. That’s a ridiculous comparison. Those two are very useful in many cases. That’s why they’re not just in the standard library but even among the built-ins.

Similar sentiment from your earlier topic:


Code (including tests) needs to be written, reviewed, documented, maintained, makes the codebase larger and thus more complicated, may introduce bugs, takes time to compile, tests take time to run. And unlike a recipe, an official function can’t easily be changed anymore, due to people relying on it (not that I think it would get changed, but that’s a general concern).

If these were legitimate considerations then no code would be written, ever.

Yes, adding new features requires someone to do the work. In this case someone has already volunteered to do the work (me) - I’m not expecting anyone else to do it.

I was (jokingly) going to add the following comment on the above post, but decided to remove it because I thought people might interpret it as facetiousness.

There is one minor disadvantage. The Python compile time will increase marginally, but on the other hand any increase in compile time will be offset by your next CPU upgrade, probably by many orders of magnitude.

  • Review can, by definition, not be done by just you, so at least one core dev and probably two or so more people also have to invest time.
  • Do you have a history of contributions on the CPython repo? If not, it is a bit hard to trust that you are making a commitment for the next few years (maintenance, not just implementation).
  • A larger, more complicated codebase and slower tests are completely unavoidable; no amount of commitment on your side can remove these downsides, and they are forever costs, not one-time costs.
  • Also, CPython defines the stdlib that is expected in other implementations. So when those catch up, they also need to do work to add this function.

Dismissing these concerns as “illegitimate” is also going to annoy people. Yes, these are very strong counter-arguments to adding anything to the stdlib, and most proposals for additions don’t clear them. You have to show that the value of this function outweighs the added costs of implementing it.


I understand your concerns. For such a simple algorithm I wouldn’t anticipate much maintenance being required. (Unless there are major changes.)

Let me address your points:

  • Sure, someone will have to review it, but this is quite a small investment of time.

  • Let me put it another way: if the main objection is that someone will have to do a code review, then how would anything ever be contributed to the project? It doesn’t make sense. You could apply this argument to literally anything. Why bother to build the Golden Gate Bridge? Someone will have to review the designs for safety, after all.

  • You are correct to note that I do not have any contributions to the Python repo. I would like to contribute to it long term, but one has to start somewhere. I picked something which (to me at least) was relatively low hanging fruit.

  • It was low hanging fruit because I knew what the function should do and guessed that somewhere nearby I could find some similar code and adapt it. With a bit of background reading of the developer docs I managed to put something together.

  • If the argument about marginally increased compile times were legitimate, then we would all still be writing assembler-level code, presumably.

  • I don’t know how long it takes the compiler to build the code and run the additional tests. I would guess a few microseconds. It probably takes longer to print the test output to the terminal than it does to actually run.

  • I think on the PR or somewhere else the issue of where this code should live has already been addressed. I agree the itertools module is probably a better place (so that it has to be imported) rather than being a builtin.

Finally, there’s no need to be annoyed - someone is offering to contribute something useful.

I agree with your point about value. What you are describing is a “cost benefit analysis”. However, there is no defined procedure for how to measure this. (How would you even make such a measurement? The question doesn’t really make much sense.) I’m not sure, therefore, what anyone is supposed to conclude from this.

What I can say about it is that (as mentioned previously) if you do an internet search for “Python all equal” you will find numerous posts made by people asking how to implement such an operation in Python. This does at least suggest there is demand for such a feature, and that it isn’t just one person (me) who wanted to add it.

Nonsense. Code where the advantages outweigh the disadvantages does make it. I explicitly replied to your “I can’t think of any disadvantages”, mentioning some. Which I’ve seen mentioned by core devs again and again in the last few years. It’s not a good sign how dismissive you are about the costs you want to cause for other people.

I think I’ll just quote this

It’s not a good sign how dismissive you are

and then point out your last post starts with


If asking someone to help with a code review is “too much of a burden” to ask, then what hope is there for anything to get done?

Please, be sensible here. It is obviously not the case that “a code review” is a blocker to something being done.

If you don’t see the difference, that’s another bad sign.


If you don’t see the difference, that’s another bad sign.

I have absolutely no idea what you are trying to say here. Difference between what? A bad sign? What do you mean by this?

There is a Python implementation of an all_equal function in more-itertools. The fact that it exists indicates there are developers out there already making use of such a function.

For what it’s worth, this was added to more_itertools because it was in the itertools documentation’s recipes section. That is, it wasn’t added in response to user needs or grassroots suggestions.

Per GitHub search, it does probably get more usage than the median function in the library.

For what it’s worth, I needed this today. I wrote

def all_same(m):
    m0 = m[0]
    return all(m0 == mm for mm in m[1:])

in a matter of seconds. I didn’t need anything more complicated (I knew my input was a list) and using a stdlib function or looking up a recipe would have slowed me down.

A little later, I realised I needed to normalise the values.

def all_same(m):
    m0 = normalise(m[0])
    return all(m0 == normalise(mm) for mm in m[1:])

Trivial to do, and no need to normalise any more values than needed.

I’m -1 on this proposal, it simply doesn’t provide enough benefit to justify it.


I did some further testing a few days ago and there’s a factor of 2 in runtime performance to be had here.

It will depend on the type of m and how __eq__ is implemented of course.

There’s also the added benefit of coding style, which I have mentioned previously. However this may be a matter of opinion to some.

I’m not really sure what the point of your comment was. You’ve made your point already, enough times.

Yes, you have come up with a trivial solution. I would guess the runtime performance and memory usage is pretty terrible.

Actually, you can perhaps clarify that for me. Is this for statement evaluated as an iterator, or does it create an array in memory of length len(m) - 1? I want to know if it is O(1) in memory or O(n).

It’s not a for statement, or any statement at all, but a generator expression. (Or the for clause of one, if you specifically meant that part.)

The m[1:] makes a copy of the entire list minus one element, taking linear space. (Just like you did twice in your all(map(operator.eq, iterable[1:], iterable[:-1])).)

The generator expression gives you an iterator that only takes O(1) memory. So with for mm in m, the whole thing would only take O(1) memory (plus however much the normalise calls take, of course).
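To make that concrete, a version of the earlier snippet that avoids the m[1:] copy could look like this (a sketch; the normalise parameter here is a stand-in, defaulting to the identity):

```python
def all_same(m, normalise=lambda x: x):
    # iter() gives a lazy iterator instead of the m[1:] slice,
    # so no copy of the list is made
    it = iter(m)
    try:
        first = normalise(next(it))
    except StopIteration:
        return True  # empty input: vacuously all equal
    # generator expression: O(1) extra memory, short-circuits on
    # the first mismatch
    return all(first == normalise(x) for x in it)
```

As a side effect it also works on arbitrary iterables, not just lists, since it never indexes or slices.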

So, to summarize.

The memory complexity and runtime performance of this

def all_same(m):
    m0 = normalise(m[0])
    return all(m0 == normalise(mm) for mm in m[1:])

is O(n), unless I misunderstand what you have just written.

Your response is not clearly expressed. For example,

How am I supposed to read this? It’s not a complete sentence. Are you correcting me on my terminology here, or are you telling me something about the memory complexity?

Please at least write full sentences so I can understand what is being written.

Back to the point in hand about memory complexity:

Imagine performing this operation on a gigabyte-long list of ints. How well do you think that is going to work? That’s why there’s a factor of 2 in performance, I guess.

Either way, the factor of 2 is real. It doesn’t matter if it comes from

  • Not copying an array (Python list)
  • Evaluating the equals operator without having to move into and back out of Python code from the C level

Why do you think this is my suggestion?

I’m very confused. Nowhere have I suggested adding this code to the Python codebase. So I am confused as to why you say “Just like you did twice”.

I suspect you haven’t read my previous posts properly, otherwise I don’t understand how you came to this conclusion.

Well don’t ignore the context, i.e., what I replied to. You said “Is this for statement […]” and I told you it isn’t one and what it actually is, so you can look it up if you want to.

And nowhere did I say you have.

Because that’s your code and when iterable is a list, then both the iterable[1:] and the iterable[:-1] make a copy of the entire list minus one element, both taking linear space.

Link doesn’t work, can you try again please? It’s just showing this entire thread, not (as presumably you intended) a specific post.

Are you sure? For me it goes to a different thread, and that code is in the first post there.

Oh! I misread the URL and thought it was going to a specific post within it. (And you’re right, I honestly didn’t notice that it was a different thread, although the two are basically doing the same job anyway.) My bad!

Ah, I see why you are confused. You did not read the full question.

Near the bottom of the post:

There was confusion about this on the linked post. I later clarified:

Part of the problem was that the post was moved and merged into some other post which had nothing to do with my proposal and was a discussion on documentation.

To clarify the point so that everyone can understand:

  • My original proposal was to add a version of map which works with binary operators. map_binary or some other name.
  • Please don’t misunderstand: map works with binary operators if you provide two iterables. This is not the same problem. I am proposing something which takes a single iterable.
  • I suggested this as an intermediate step to solving the all_equal problem.
  • The current suggestions to implement all_equal are either slow, or they use some form of prematurely optimized “hack” resulting in unreadable code. (The point of Python is it is supposed to be less complicated than something like C++, not more.)
  • I suggested adding this because it is more general than implementing all_equal directly.
  • However, it comes with disadvantages. One is that although binary_map is obvious, it cannot be extended further in an obvious way. How would you implement a version for a function of three arguments? It isn’t clear how the arguments should align over the input elements from the iterable: should you advance the iterable by 1 step each time, or by 2? In the first case, adjacent windows overlap by two elements; in the second, by only one. The fact that this is complicated to explain perhaps suggests the idea is already too complicated.
  • It’s also not obvious what other use cases you might have. For example, if you want to add all elements with operator.add, there is already a sum function. Similarly, there is a product function which you could also implement with binary_map plus operator.mul. Then for other binary operators, it’s not clear what the result of mapping operator.sub or operator.truediv might even mean. Hopefully it’s clear what I’m describing here, but I suspect it isn’t; the fact that it is much easier to explain with a diagram suggests that this is probably not a good idea.
  • I then went away to attempt to implement all_equal directly, without map_binary. I did this because it became obvious to me that all_equal would be easier to implement directly.
  • The difference in performance is a factor of 2 for iterables containing objects which have simple implementations of __eq__.
  • It makes essentially no difference on a list of very long strings, because most of the runtime is spent evaluating __eq__ and not in the iterating part.