Popping sets and understanding why

When analyzing different methods for getting the element out of a single-element set, I found that a for loop is faster than .pop(), which surprised me a lot.

# ======================================================================
# RANKED BY MEDIAN TIME (fastest first):
# ======================================================================
# 1. for loop             0.05512s  (18.37ns/op, 1.00x)
# 2. value, = s           0.06293s  (20.98ns/op, 1.14x)
# 3. next(iter(s))        0.07988s  (26.63ns/op, 1.45x)
# 4. s.pop()              0.08603s  (28.68ns/op, 1.56x)
# 5. list(s)[0]           0.13469s  (44.90ns/op, 2.44x)

This is my method for establishing the evidence:

import statistics, timeit

setup = "s = {42}"

def bench(stmt, setup=setup, number=3_000_000, repeat=7):
    times = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    return {
        "median": statistics.median(times),
        "stdev": statistics.pstdev(times),
        "runs": times,
    }

results = {
    "next(iter(s))": bench("next(iter(s))"),
    "s.pop()": bench("s = {42}; s.pop()", setup=""),  # Recreate set each time
    "value, = s": bench("value, = s"),
    "list(s)[0]": bench("list(s)[0]"),
    "for loop": bench("for x in s: break"),
}

for k, v in results.items():
    print(f"{k:20} {v}")

# Calculate and show rankings
print("\n" + "="*70)
print("RANKED BY MEDIAN TIME (fastest first):")
print("="*70)
sorted_results = sorted(results.items(), key=lambda x: x[1]['median'])
fastest = sorted_results[0][1]['median']

for i, (name, stats) in enumerate(sorted_results, 1):
    relative = stats['median'] / fastest
    ns_per_op = (stats['median'] / 3_000_000) * 1_000_000_000
    print(f"{i}. {name:20} {stats['median']:.5f}s  ({ns_per_op:.2f}ns/op, {relative:.2f}x)")


# next(iter(s))        {'median': 0.07987911600048392, 'stdev': 0.0015487941622256854, 'runs': [0.08058188100039843, 0.07924388599985832, 0.08089582999946288, 0.08170180900015112, 0.07987911600048392, 0.07793333700010407, 0.07700227099940093]}
# s.pop()              {'median': 0.0860316629996305, 'stdev': 0.0003872496000016712, 'runs': [0.08623365399944305, 0.08636853200005135, 0.085637518999647, 0.0860316629996305, 0.08564598999964801, 0.08562781299951894, 0.0866860940004699]}
# value, = s           {'median': 0.06292977600060112, 'stdev': 0.0001934032472859946, 'runs': [0.06281804099944566, 0.06321052299972507, 0.062730442999964, 0.06291875400074787, 0.06293551399994612, 0.0633145620004143, 0.06292977600060112]}
# list(s)[0]           {'median': 0.1346858329998213, 'stdev': 0.0014844358476449996, 'runs': [0.13523242300016136, 0.1346858329998213, 0.1344134809996831, 0.13504255899988493, 0.13422213600006216, 0.13427748399953998, 0.13876609099952475]}
# for loop             {'median': 0.05512175100011518, 'stdev': 0.00020771769034806938, 'runs': [0.05493225000009261, 0.05492972699994425, 0.05530007499964995, 0.05525388000023668, 0.05512175100011518, 0.0555668480001259, 0.05510627800049406]}

To understand why, I took the following approach using dis:

import dis

print("="*70)
print("BYTECODE ANALYSIS OF SET ELEMENT EXTRACTION METHODS")
print("="*70)

# Method 1: for loop
print("\n1. FOR LOOP: for x in s: break")
print("-"*70)
def for_loop():
    s = {42}
    for x in s:
        break
    return x
dis.dis(for_loop)

# Method 2: unpacking
print("\n2. UNPACKING: value, = s")
print("-"*70)
def unpacking():
    s = {42}
    value, = s
    return value
dis.dis(unpacking)

# Method 3: next(iter(s))
print("\n3. NEXT(ITER(S)): next(iter(s))")
print("-"*70)
def next_iter():
    s = {42}
    return next(iter(s))
dis.dis(next_iter)

# Method 4: s.pop()
print("\n4. S.POP(): s.pop()")
print("-"*70)
def pop_method():
    s = {42}
    return s.pop()
dis.dis(pop_method)

# Method 5: list(s)[0]
print("\n5. LIST(S)[0]: list(s)[0]")
print("-"*70)
def list_index():
    s = {42}
    return list(s)[0]
dis.dis(list_index)

print("\n" + "="*70)
print("INSTRUCTION COUNT SUMMARY")
print("="*70)

methods = [
    ("for loop", for_loop),
    ("unpacking", unpacking),
    ("next(iter(s))", next_iter),
    ("s.pop()", pop_method),
    ("list(s)[0]", list_index),
]

for name, func in methods:
    # Count instructions (excluding the final RETURN_VALUE and setup)
    instructions = list(dis.get_instructions(func))
    # Filter out RESUME and RETURN_CONST/RETURN_VALUE
    core_instructions = [i for i in instructions if i.opname not in ('RESUME', 'RETURN_VALUE', 'RETURN_CONST')]
    print(f"{name:20} {len(core_instructions):2} instructions")

and get the following result:

# 1. FOR LOOP: for x in s: break
# ----------------------------------------------------------------------
#  65           0 RESUME                   0

#  66           2 LOAD_CONST               1 (42)
#               4 BUILD_SET                1
#               6 STORE_FAST               0 (s)

#  67           8 LOAD_FAST                0 (s)
#              10 GET_ITER
#              12 FOR_ITER                 4 (to 24)
#              16 STORE_FAST               1 (x)

#  68          18 POP_TOP

#  69          20 LOAD_FAST                1 (x)
#              22 RETURN_VALUE

#  67     >>   24 END_FOR

#  69          26 LOAD_FAST_CHECK          1 (x)
#              28 RETURN_VALUE

# 2. UNPACKING: value, = s
# ----------------------------------------------------------------------
#  75           0 RESUME                   0

#  76           2 LOAD_CONST               1 (42)
#               4 BUILD_SET                1
#               6 STORE_FAST               0 (s)

#  77           8 LOAD_FAST                0 (s)
#              10 UNPACK_SEQUENCE          1
#              14 STORE_FAST               1 (value)

#  78          16 LOAD_FAST                1 (value)
#              18 RETURN_VALUE

# 3. NEXT(ITER(S)): next(iter(s))
# ----------------------------------------------------------------------
#  84           0 RESUME                   0

#  85           2 LOAD_CONST               1 (42)
#               4 BUILD_SET                1
#               6 STORE_FAST               0 (s)

#  86           8 LOAD_GLOBAL              1 (NULL + next)
#              18 LOAD_GLOBAL              3 (NULL + iter)
#              28 LOAD_FAST                0 (s)
#              30 CALL                     1
#              38 CALL                     1
#              46 RETURN_VALUE

# 4. S.POP(): s.pop()
# ----------------------------------------------------------------------
#  92           0 RESUME                   0

#  93           2 LOAD_CONST               1 (42)
#               4 BUILD_SET                1
#               6 STORE_FAST               0 (s)

#  94           8 LOAD_FAST                0 (s)
#              10 LOAD_ATTR                1 (NULL|self + pop)
#              30 CALL                     0
#              38 RETURN_VALUE

# 5. LIST(S)[0]: list(s)[0]
# ----------------------------------------------------------------------
# 100           0 RESUME                   0

# 101           2 LOAD_CONST               1 (42)
#               4 BUILD_SET                1
#               6 STORE_FAST               0 (s)

# 102           8 LOAD_GLOBAL              1 (NULL + list)
#              18 LOAD_FAST                0 (s)
#              20 CALL                     1
#              28 LOAD_CONST               2 (0)
#              30 BINARY_SUBSCR
#              34 RETURN_VALUE

# ======================================================================
# INSTRUCTION COUNT SUMMARY
# ======================================================================
# for loop             11 instructions
# unpacking             7 instructions
# next(iter(s))         8 instructions
# s.pop()               6 instructions
# list(s)[0]            8 instructions


My interpretation of dis is:

  1. For loop is fastest because:
  • GET_ITER and FOR_ITER are direct bytecode instructions (no function calls)
  • Implemented in highly optimized C code
  • No global lookups needed
  2. Unpacking is a close second because:
  • UNPACK_SEQUENCE is a single specialized bytecode instruction
  • No global lookups or function calls
  • Direct operation at the bytecode level
  3. next(iter(s)) is slower because:
  • LOAD_GLOBAL twice (expensive dictionary lookups in the builtins namespace)
  • Two CALL operations (function call overhead)
  • The actual work is fast, but the setup is expensive
  4. list(s)[0] is slowest because:
  • Creates an entire list object
  • Global lookup + function call + indexing operation
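
One way to probe the LOAD_GLOBAL point is to pre-bind next and iter to names that timeit inlines as locals of the timing function; if the global lookups matter, the gap should narrow. A quick sketch (my own addition; results will vary by machine):

```python
# Probing the LOAD_GLOBAL hypothesis: names assigned in timeit's setup
# become locals of the generated timing function (LOAD_FAST), while bare
# next/iter are looked up via globals/builtins (LOAD_GLOBAL).
import timeit

number = 1_000_000
with_globals = timeit.timeit("next(iter(s))", setup="s = {42}", number=number)
with_locals = timeit.timeit(
    "nxt(itr(s))",
    setup="s = {42}; nxt = next; itr = iter",  # pre-bound, so LOAD_FAST
    number=number,
)
print(f"globals: {with_globals:.4f}s  locals: {with_locals:.4f}s")
```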

Is this interpretation correct?

Under the hood, iteration just creates a pointer, whereas popping mutates the data structure.

The cost of creating the set is included in the timed statement for the .pop() case (but not for the others), so it is not a fair test:
"s.pop()": bench("s = {42}; s.pop()", setup=""),  # Recreate set each time
It is interesting to see how the one-off overheads of each technique compare, though, even if they are all trivially small and not worth further optimisation.

For non-trivial numbers of items, I have found dict.fromkeys(items) to often be faster than sets, performance-wise. It's ordered too, these days.
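
A minimal sketch of that point (my own example): dict.fromkeys deduplicates like a set but preserves first-occurrence order, and membership tests work the same way.

```python
# dict.fromkeys as an ordered "set": keys are unique and keep insertion order
items = ["b", "a", "c", "a"]
d = dict.fromkeys(items)     # values are all None; only the keys matter
print(list(d))               # ['b', 'a', 'c'] -- duplicates dropped, order kept
print("a" in d)              # True: O(1) membership, just like a set
```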

That’s really not what you found, because of those set recreations.

Mine, creating 1000 sets and measuring the handling of all of them, times are per set:

  6.5 ± 0.0 ns  baseline
 23.4 ± 0.2 ns  pop2
 23.6 ± 0.1 ns  pop1
 42.8 ± 0.3 ns  forbreak

Python: 3.13.0 (main, Nov  9 2025, 11:53:23) [GCC 15.2.1 20250813]
benchmark script
def baseline(ss):
    for s in ss:
        pass

def forbreak(ss):
    for s in ss:
        for x in s:
            break

def pop1(ss):
    for s in ss:
        x = s.pop()

def pop2(ss):
    for s in ss:
        s.pop()

funcs = [baseline, forbreak, pop1, pop2]

from timeit import timeit
from statistics import mean, stdev
import sys
import random

times = {f: [] for f in funcs}
def stats(f):
    ts = [t * 1e9 for t in sorted(times[f])[:5]]
    return f'{mean(ts):5.1f} ± {stdev(ts):3.1f} ns '
for _ in range(100):
    random.shuffle(funcs)
    for f in funcs:
        n = 1000
        ss = [{42} for _ in range(n)]
        t = timeit(lambda: f(ss), number=1) / n
        times[f].append(t)
for f in sorted(funcs, key=stats):
    print(stats(f), f.__name__)

print('\nPython:', sys.version)

Are you talking about a recent optimization? At least until recently, it created an iterator object.

Of course! I need more coffee.

import statistics, timeit

print("="*70)
print("ISOLATING s.pop() performance by measuring copy overhead")
print("="*70)

# Measure just the copy operation
copy_times = timeit.repeat(
    "s_local = s.copy()",
    globals={'s': {42}},
    number=3_000_000,
    repeat=7
)

# Measure copy + pop
copy_and_pop_times = timeit.repeat(
    "s_local = s.copy(); s_local.pop()",
    globals={'s': {42}},
    number=3_000_000,
    repeat=7
)

# Measure pop with the set recreated each iteration (for reference;
# this includes the set-creation cost, so it overstates pop itself)
pop_with_recreation_times = timeit.repeat(
    "s_local = {42}; s_local.pop()",
    number=3_000_000,
    repeat=7
)

copy_median = statistics.median(copy_times)
copy_and_pop_median = statistics.median(copy_and_pop_times)
pop_with_recreation_median = statistics.median(pop_with_recreation_times)

# Calculate isolated pop time
isolated_pop_median = copy_and_pop_median - copy_median

copy_ns = (copy_median / 3_000_000) * 1e9
copy_and_pop_ns = (copy_and_pop_median / 3_000_000) * 1e9
isolated_pop_ns = (isolated_pop_median / 3_000_000) * 1e9
pop_with_recreation_ns = (pop_with_recreation_median / 3_000_000) * 1e9

print(f"\nCopy only:           {copy_median:.5f}s ({copy_ns:.2f}ns/op)")
print(f"Copy + pop:          {copy_and_pop_median:.5f}s ({copy_and_pop_ns:.2f}ns/op)")
print(f"Pop with recreation: {pop_with_recreation_median:.5f}s ({pop_with_recreation_ns:.2f}ns/op)")
print(f"\nIsolated pop (copy+pop - copy): {isolated_pop_median:.5f}s ({isolated_pop_ns:.2f}ns/op)")

print("\n" + "="*70)
print("COMPARISON WITH OTHER METHODS:")
print("="*70)

# Re-run the other methods for comparison
other_results = {}

other_results["value, = s"] = timeit.repeat(
    "value, = s",
    globals={'s': {42}},
    number=3_000_000,
    repeat=7
)

other_results["next(iter(s))"] = timeit.repeat(
    "next(iter(s))",
    globals={'s': {42}},
    number=3_000_000,
    repeat=7
)

# Add our isolated pop result
all_results = {
    "s.pop() (isolated)": isolated_pop_median,
    "value, = s": statistics.median(other_results["value, = s"]),
    "next(iter(s))": statistics.median(other_results["next(iter(s))"]),
}

sorted_results = sorted(all_results.items(), key=lambda x: x[1])
fastest = sorted_results[0][1]

for i, (name, median) in enumerate(sorted_results, 1):
    relative = median / fastest
    ns_per_op = (median / 3_000_000) * 1e9
    print(f"{i}. {name:25} {median:.5f}s ({ns_per_op:.2f}ns/op, {relative:.2f}x)")

ISOLATING s.pop() performance by measuring copy overhead

Copy only: 0.08456s (28.19ns/op)
Copy + pop: 0.09873s (32.91ns/op)
Pop with recreation: 0.08814s (29.38ns/op)

Isolated pop (copy+pop - copy): 0.01416s (4.72ns/op)

COMPARISON WITH OTHER METHODS:

  1. s.pop() (isolated) 0.01416s (4.72ns/op, 1.00x)
  2. value, = s 0.06878s (22.93ns/op, 4.86x)
  3. next(iter(s)) 0.08242s (27.47ns/op, 5.82x)

Is this fair?

PRO:

  • Isolates the actual pop() operation
  • Removes the copy overhead we added for testing
  • Shows true performance of just pop()

CON:

  • Subtracting measurements can amplify noise/variance
  • The copy operation itself might have caching effects
  • Small measurement errors become magnified
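
The noise-amplification point can be made concrete: for independent measurements, variances add under subtraction, so the difference is noisier than either input. A sketch with made-up stdev values (hypothetical numbers, not taken from the runs above):

```python
# Error propagation for a difference of two independent measurements:
# stdev(a - b) = sqrt(stdev(a)**2 + stdev(b)**2)
import math

copy_stdev = 0.0015       # hypothetical stdev of the copy-only timing (s)
copy_pop_stdev = 0.0015   # hypothetical stdev of the copy+pop timing (s)
diff_stdev = math.sqrt(copy_stdev**2 + copy_pop_stdev**2)
print(f"{diff_stdev:.5f}")   # 0.00212 -- ~40% noisier than either input
```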

Result: Isolated pop is 4.72ns/op

The isolated pop() time is FASTER than unpacking. Now everything makes sense.

Duh.

Just conceptually, a pointer is how I think of the underlying C implementation of an iterator. Whatever it is, the point is that incrementing an iterator, and especially ++ing a pointer in C, is a darned sight faster than a Python mutation.

In CPython an iterator stores a pointer, but the iterator has to be a list_iterator object with a pointer to its type and a reference count, just so that it can then store the pointer to the list object and the counter of where the iterator is in the list. All that adds up to 32 bytes and needs to be allocated on the heap.
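
That overhead is visible from Python (my own check; the exact reported size is CPython- and build-specific):

```python
import sys

it = iter([42])
print(type(it).__name__)   # list_iterator
# The struct itself is 32 bytes on 64-bit builds (header + list pointer +
# position counter); getsizeof may report slightly more due to GC overhead.
print(sys.getsizeof(it))
```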

Yeah, that looks better, assuming you get stable times. I don’t: I ran it a few times and the “Isolated pop” time varied from 9 ns to 30 ns. (And your 4.72 ns seems suspiciously fast to me, but I might just not be used to a fast computer.)

Thanks - I’ve wondered about that. Once the list_iterator etc. has all been created, is the iteration itself simply a matter of bumping the counter? I suppose if all the hash table stuff has already been arranged too, popping need not be so expensive either.

Notably, it needs to be allocated, but thanks to free lists, that allocation can be fast. Particularly when there are lots of the same type of iterator being made and discarded, those allocations basically amount to “here, reuse this one”.
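
A way to glimpse that recycling (my own sketch; CPython-specific and not guaranteed, since address reuse depends on the allocator):

```python
# Discard an iterator, then make another of the same type: the freed block
# is commonly handed straight back, so the addresses often match on CPython.
a = iter([1, 2, 3])
addr = id(a)
del a                  # refcount hits zero; the 32-byte block is freed
b = iter([4, 5, 6])
print(id(b) == addr)   # usually True on CPython, thanks to block reuse
```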

There is more to it. The code implementing the for loop has a pointer to an iterator object. It calls that object’s __next__ method, but to do that it has to follow a few pointers to get to the pointer to the function that implements that method (I’m not sure if it can cache this pointer for the duration of the loop). That function is then called for each item, and each time it is called it has to check the counter against the length of the list (this would require an atomic lock in free-threading); then it can return a pointer to the next item, but it needs to increment the reference count on that item before returning it.
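
Those steps can be mimicked at the Python level (a conceptual sketch of what the loop machinery does, not the actual interpreter code):

```python
# Hand-rolled version of `for x in s`: fetch the iterator, resolve __next__
# once, then call it per item until StopIteration signals exhaustion.
s = {42}
it = iter(s)                 # what GET_ITER does: s.__iter__()
nxt = type(it).__next__      # the slot the loop invokes repeatedly
collected = []
while True:
    try:
        x = nxt(it)          # what FOR_ITER does: one __next__ call per item
    except StopIteration:    # raised when the counter reaches the end
        break
    collected.append(x)
print(collected)             # [42]
```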

Here is the actual code for the C function that implements list_iterator.__next__:
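
In rough Python terms, what that C function has to do looks like this (my own paraphrase, not the actual listiter_next source):

```python
# Rough Python-level paraphrase (my own) of the work listiter_next performs;
# the real thing is C operating directly on PyListIterObject fields.
class ListIter:
    def __init__(self, seq):
        self.seq = seq          # it_seq: pointer to the list
        self.index = 0          # it_index: current position

    def __next__(self):
        # bounds check against the (possibly changed) list length
        if self.seq is None or self.index >= len(self.seq):
            self.seq = None     # drop the list reference once exhausted
            raise StopIteration
        item = self.seq[self.index]  # fetch (the C code also INCREFs here)
        self.index += 1
        return item

    def __iter__(self):
        return self

print(list(ListIter([1, 2, 3])))    # [1, 2, 3]
```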

@oscarbenjamin I do think I’ve seen an optimization at least being discussed, where a for statement, given a list, does not create and use an iterator object but instead implements the list iteration itself. Not sure where/when I saw it and whether it has been done.

Here’s the PR for that; it’s been implemented for 3.15. From 3.12+, lists are specialised, so they still make the iterator object but can directly inline the __next__ implementation.

2 Likes