There is a better way — a free Python book on ECS + EBP

This started as a problem with extending an existing OOP application. The domain model didn’t fit the problem any longer. A serious refactoring was required. It bothered me.

Why did I write OOP in the first place? Why did I spend so much time migrating CRUD? All those abstractions weren’t free. Where did I go wrong?

It is very hard to go up against mainstream ideas. Every textbook is OOP + CRUD, and so every coding agent wants to write OOP + CRUD. But the pointer chasing fights raw performance, and the rigidity of the domain model makes development slow.

After twenty years I’ve had enough. I committed to Entity-Component-Systems (ECS) and Existence-Based Processing (EBP). This book is the lessons I learned.

You don’t need OOP. You need systems.
You don’t need abstractions. You need code that’s friendly to the metal.

The Python edition (sister Rust edition exists too) is here:

Forty-four sections. Every claim is backed by a measurement you can run on your own laptop in under a minute. Work in progress. No paywall.

I’m not sure this is the right forum for it, but welcome any questions. I just wanted to drop this and say: There is a better way.

(please help me to tag it where it belongs).

Looks interesting Bjorn - thanks.
Did you find the effort required for ECS was worthwhile in Python, without the explicit control over how objects are stored in memory that seem to facilitate many of its performance advantages?
Does anything make it a still useful paradigm, nonetheless?

Short answer: yes — and since the question is really about effort, three reasons, roughly in the order they paid off.

First, it’s cheaper to extend and to test, which is the part that has nothing to do with speed. Once you stop modelling entities as objects and start modelling them as columns, a new feature is an addition rather than a restructuring — a new column, or a new presence table plus the system that drives it, bolted on without touching what’s already there. Existence-based processing makes that nearly free at runtime too: a system iterates only the entities that currently have the component, so an absent state is an empty loop, not a branch you pay for on every instance. The OOP equivalent — new field, new branch, new subclass — tends to mean re-engineering the model rather than patching it. Testing falls out of the same shape: a system is a function over arrays, tested in isolation (or against a protocol) with no object graph to assemble first. The effort is front-loaded into the SoA mindset; after that the marginal cost per feature drops instead of compounding.

Second, the memory advantage justifies itself, especially at scale. Typed columns run 4–5× smaller than the equivalent object graph, and narrower dtypes widen the gap (a list of ints is 5–40× a numpy column depending on width). Past a million rows it’s usually footprint, not CPU, that’s the ceiling — the difference between a working set that stays in RAM/cache and one that doesn’t.

Third, the performance — with the caveat your question points straight at. You’re right that Python won’t hand you struct layout, but it does hand you the part that matters here, spelled numpy/array.array: a typed column is contiguous and you choose the width. What isn’t free is the inner loop. §3 pulls three flips apart on one dataset — AoS→SoA is ~10× on a column sum from access pattern alone, before any numpy; list→array.array is smaller but slower to sum, because each int gets unboxed before the add; array.array→np.sum is ~30× on the same bytes once the loop runs in C. So the speed is real as long as the systems stay array-shaped enough to keep the hot loop out of the interpreter — and the book says plainly you don’t get Rust’s numbers. I put it third because it’s the most conditional of the three.

Receipts on your own hardware: uv run code/measurement/empty_tables.py (read the 0% row) and aos_vs_soa_footprint.py.

This is the best place.

People need a language that is human-friendly. Python was designed for this. And this is good enough for CPUs that mostly sit waiting for user input. Python was also designed to access fast metal-friendly code. I believe its first killer app, around 1995, was Numerical Python, one of NumPy’s predecessors, which gave friendly batch AND INTERACTIVE access to state-of-the-art Fortran routines, such as LINPACK, and easy connection to followup processing, such as plotting. Scientific and other heavy-duty number processing has continued to be a major reason for Python’s success.

If this has been buried in some quarters by OOP over-hype, I seem to have missed it. And it is unfortunate. OOP is fine for tkinter UIs (I work on IDLE). But I also helped my daughter to 100s of hours of simulation with numpy for her PhD thesis.

While Python intentionally hides cache hits versus misses, some of the code has been tuned with that in mind. An example is A deque, for instance, is a doubly-linked list of arrays sized for

I think most readers will recognize OOP, but CRUD (create-read-update-delete), ECS (entity-component-systems), and EBP (existence-based processing) were new to me. Putting unknown acronyms in the book title may push away people who might otherwise read it. I’m reading it as ‘supercharged python+numpy’, but I don’t have a suggestion better than that.

I believe somewhere I saw range of int32 given as ±2**1 rather than **31.

The small int cache now goes up to 1024.

The type in the DAG is too small to read, even enlarged 300% on a 4k screen.

Terry — thank you, especially for the three corrections.

  • int32 range — typo, fixed (±2¹ → ±2³¹).
  • DAG — fixed. It’s now a full-resolution image you can open and zoom, instead of being scaled down to fit the text column.
  • small-int cache — I can’t reproduce 1024. On 3.14.5 and on main it’s still [-5, 256] (int('257') is int('257') is False; _PY_NSMALLPOSINTS = 257). Are you on a 3.15 build, or a different cache? You’d know the internals better than I would, and I want the footnote right.

On OOP, I think we’re agreeing. Objects cost little when the program mostly sits waiting for the user — your world, and a fair description of most desktop UI. The bill comes due when the budget is tight, or when the domain model stops fitting and you’re staring at a refactor. That second case is what pushed me to write this — not “OOP is wrong,” just that it isn’t free, and the price stays invisible until it isn’t.

The title’s a fair hit. The acronyms are the three things the book builds before it names them, but a reader can’t tell that from the cover. Sitting with it for now.

Yeah, acronyms like these can be tricky. I know CRUD from writing REST APIs, and I know ECS from my background in game programming, but even with my familiarity with the terms it took a moment to parse. Perhaps because I associate them with specific domains? EBP was also unfamiliar.

Increased upper limit

m1 = 1024
m2  = 1024
m1 is m2
False  

True in released 3.15.0b1 and 3.16.0a0 compiled a week ago. (On Windows, but should not matter). I saw the discussion of what the new upper limit should be. Higher was considered and rejected.

Terry — thank you, this is exactly the kind of correction the book should absorb. The notion turned the example into its own lesson: The upper limit moved (256 → 1024 in 3.15.0b1 and newer) while the book was being written, which is precisely why the chapter tells readers to treat the small-int cache as an implementation detail they can’t lean on rather than to memorise the number. Hard to ask for a cleaner demonstration.

I’ve added a line to §2, “Numbers and how they fit,” right under the small-int cache note, crediting your report:

Q: Would you be okay being quoted there by name, with a link to your post? Glad to reword or drop the attribution if you’d prefer it left off.

And — compiling 3.16.0a0 a week ago is bleeding edge. Respect.

Attribution by name is fine.

Historical note: the question of whether 2d arrays should be stored by rows or columns goes back to at least the 1970s. Fortran stored matrices by columns, C by rows. This was not terribly important, except when LINPACK and other FORTRAN packages were translated to C. For data tables with possibly heterogeneous records/rows, the issue is more acute, as your book discusses. BMD/BMDP (in Fortran ) from UCLA may have been the first statistical package. I believe SPSS used some of its (open-source) code. SAS and others came later. The nasty truth is than neither storage/access pattern is optimal for all uses.

Did you imply that numpy has a store-by-rows option, and that Pandas uses it?

“Existence-based processing” as the expansion of EBP is extremely niche. Where did you get it? It does not appear in the most inclusive site I found, https://www.allacronyms.com/EBP, with about 170 entries (sort a-z and go to end). I might call it subset or iterable processing, or ‘active subset processing’ (ASP), where the active subset is an overt collection rather that being indicated by a flag attribute. Any ID collection could work. Whether an NDArray is the best implementation is a separate question.

FWIW, I was also lost by the acronyms in the introduction and the equally opaque terms they expand to. I had absolutely no idea what the book was about until I got to page three. Something about not using classes which normally means using functional programing instead. (Interesting after that though. Very glad I kept reading. :slightly_smiling_face:)

Attribution noted — thank you, I’ll credit you by name.

The Fortran-by-columns / C-by-rows lineage is exactly the right frame, and the LINPACK-translation pain is the historical version of the cache-line argument the book makes at small scale. “Neither storage/access pattern is optimal for all uses” is the book’s whole thesis in one sentence — the chapters pick the columnar (SoA) layout because the simulator’s hot loops read one field across many entities, and say plainly that a row-at-a-time workload would choose the opposite. BMDP/SPSS/SAS I didn’t know; I’ll chase that lineage.

On numpy/pandas: I didn’t mean to imply pandas rides numpy’s row option — if I gave that impression I was sloppy. numpy defaults to row-major (C order) and offers order='F' for column-major, so it has both. pandas is the other way: it’s column-oriented under the hood (same-dtype columns consolidated so each column is contiguous; Arrow-backed columnar in 2.0+). That’s exactly why the book’s move — drop pandas, hold explicit numpy columns — is a continuation of column thinking, not a reversal of it.

EBP: the concept is Richard Fabian’s, from Data-Oriented Design (he has a section literally titled “Existence Based Processing” link). The three-letter EBP is my shorthand, which is why an acronym index wouldn’t surface it — the spelled-out term is the searchable one. Your “active subset processing” is a fair description and I like that it names the mechanism (an overt ID collection, not a flag) without prejudging the implementation — which is the same separation you draw at the end. I’ll keep Fabian’s term for continuity with that literature, but I’ll make the “presence is the membership, the container is a choice” point more explicit, because that’s the part that generalises beyond NDArrays.

Numpy recently (not sure when) added the float16 type for arrays. Understanding numpy.float16 type (4 examples) - Sling Academy has example code, but omitted the time comparison output for example 4. Modern NVidea CUDA graphic cards compute with these directly. I believe numpy converts to float32 and back on current cpu float-processors. The missing results would show if half the cache line reads compensates for the conversions.

Terry,
I don’t see a question mark, but I presume that the question is about what the results would look like?
I didn’t know, so I measured. Here is what I found.

Sum, elementwise multiply, and dot, across float16 / float32 / float64, at five sizes spanning L1 (16 KB) to deep RAM (128 MB at FP64), on two x86 boxes — a 2015 Intel Broadwell NUC (i3-5010U) and an AMD Zen 4 dev machine (Ryzen 9 270). The code is at the bottom of this post.

At N = 33M (Broadwell / Zen 4, ns per element):

  ┌─────────────┬─────────────┬─────────────┬────────────┐
  │     op      │     f16     │     f32     │  f16/f32   │
  ├─────────────┼─────────────┼─────────────┼────────────┤
  │ sum(a)      │ 15.7 / 6.1  │ 0.54 / 0.12 │ 29× / 53×  │
  ├─────────────┼─────────────┼─────────────┼────────────┤
  │ a * b → out │ 139 / 60    │ 1.5 / 0.34  │ 92× / 177× │
  ├─────────────┼─────────────┼─────────────┼────────────┤
  │ a @ b       │ 31.8 / 12.6 │ 0.79 / 0.17 │ 40× / 74×  │
  └─────────────┴─────────────┴─────────────┴────────────┘

On these CPUs the bandwidth saving doesn’t compensate. The diagnostic is that f16 per-element time is flat across the full sweep (4 KB → 128 MB working sets) — the loop is compute-bound, the f16↔f32 conversion is so expensive that cache distance is invisible, and the halved bandwidth never gets exercised because bandwidth isn’t the bottleneck. The f64/f32 column meanwhile sits near 2× at large N, a clean bandwidth-bound regime with no conversion penalty. So the cache hierarchy is doing what we’d expect; it’s specifically the f16 path that’s broken on these chips.

What’s missing on both is the AVX-512 FP16 extension (avx512_fp16 in /proc/cpuinfo). The Broadwell predates AVX-512 entirely. The Zen 4 has wide AVX-512, including AVX-512 BF16, but not AVX-512 FP16 — so it’s specifically the half-precision arithmetic extension that’s absent, not AVX-512 in general. NumPy’s f16 paths fall back to scalar emulation on both.

GPU side, I find no evidence to the contrary. NVIDIA datasheets:

  • V100 (2017, first Tensor Cores): FP16 Tensor 125 TFLOPS vs FP32 15.7 → ~8×
  • A100 (2020): FP16 Tensor 312 vs FP32 19.5 → ~16×
  • H100 (2022): FP16 Tensor 1,979 (with 2:1 sparsity) vs FP32 67 → ~30× sparse / ~15× dense

NVIDIA marks the lower-precision Tensor numbers “* with sparsity” — half the headline unless the weights are 2:4 structured-sparse. Canonical reference for the technique appears to be Micikevicius et al., “Mixed Precision Training” (ICLR 2018, arXiv:1710.03740) — FP32 master weights plus loss scaling.

One sub-fact worth flagging: Intel AVX-512 FP16 shipped in Sapphire Rapids (2023). As far as I can tell, NumPy has wired the new instructions into the sort path (via x86-simd-sort, NumPy 2.3+) but not into .sum(), *, or @. So even on a Sapphire Rapids Xeon the numbers above would likely look much the same for general arithmetic. I haven’t measured this; I don’t have access to a chip with avx512_fp16, but a row from there would be the most interesting next datum.

Sources:

Code: float_widths.py

# /// script
# requires-python = ">=3.11"
# dependencies = ["numpy"]
# ///
"""
exhibit — float16 vs float32 vs float64: does half the footprint pay for itself?

float16 is a 2-byte IEEE 754 half. CUDA hardware runs it natively; most CPUs
do not. On a typical x86 / ARM core without AVX-512 FP16, numpy implements
float16 ops by widening each element to float32, computing, and (for stores)
narrowing back. So the trade-off per element is:

    + half the bytes through the cache hierarchy
    - two extra conversions wrapped around the arithmetic

The trade-off only resolves with a measurement. The crossover (if any) is
size-dependent: when the working set is in L1 the conversion penalty
dominates because cache bandwidth is not the bottleneck; once the working
set spills to RAM the bandwidth saving may overtake the conversion cost.

Three operations probe different ratios of bytes-to-arithmetic:

    sum     — one read per element, one accumulate. Bandwidth-bound when
              the array doesn't fit in cache.
    a*b→c   — two reads, one write, one multiply per element. Maximum
              bandwidth pressure.
    a @ b   — two reads, one multiply-add per element. Same memory traffic
              as a*b→c but the FMA is fused; arithmetic intensity matters
              more here.

Five sizes span L1 (~32 KB) to deep RAM (~128 MB at float64). Subprocess
per (size, dtype) so allocations don't interact.

Run:
    uv run code/measurement/float_widths.py
"""

import gc
import multiprocessing as mp
import time


# Element counts. At float32: 4_096 = 16 KB (L1), 65_536 = 256 KB (L2),
# 1_048_576 = 4 MB (L3 border), 8_388_608 = 32 MB (RAM), 33_554_432 = 128 MB
# (deep RAM). Double those byte counts for float64, halve for float16.
SIZES = [4_096, 65_536, 1_048_576, 8_388_608, 33_554_432]
DTYPES = ["float16", "float32", "float64"]

WARMUP_REPS = 1
MEASURE_REPS = 5


def time_call(fn):
    for _ in range(WARMUP_REPS):
        fn()
    best = float("inf")
    for _ in range(MEASURE_REPS):
        gc.collect()
        t0 = time.perf_counter()
        fn()
        t1 = time.perf_counter()
        best = min(best, t1 - t0)
    return best


def measure(n, dtype_name, q):
    import numpy as np
    rng = np.random.default_rng(seed=0xC0FFEE)
    # Scale to keep sums and dot products inside float16's representable
    # range (max ~65504) across all N. With values in [0, 2**-16), the
    # largest sum at N=33M is ~512 and the largest dot is ~8 — well inside
    # range, well above f16's subnormal threshold (~6e-5).
    scale = np.float32(2.0 ** -16)
    a = (rng.random(n, dtype=np.float32) * scale).astype(dtype_name, copy=False)
    b = (rng.random(n, dtype=np.float32) * scale).astype(dtype_name, copy=False)
    out = np.empty_like(a)

    t_sum = time_call(lambda: float(a.sum()))
    t_mul = time_call(lambda: np.multiply(a, b, out=out))
    t_dot = time_call(lambda: float(a @ b))

    q.put({
        "n": n,
        "dtype": dtype_name,
        "sum_ns": t_sum / n * 1e9,
        "mul_ns": t_mul / n * 1e9,
        "dot_ns": t_dot / n * 1e9,
    })


def run_grid():
    results = {}
    for n in SIZES:
        for dtype in DTYPES:
            q = mp.Queue()
            p = mp.Process(target=measure, args=(n, dtype, q))
            p.start()
            r = q.get()
            p.join()
            results[(n, dtype)] = r
    return results


def print_table(results, op_key, op_label):
    print(f"{op_label}  (ns per element, lower is faster)")
    header = (f"{'N':>12}  {'float16':>10}  {'float32':>10}  {'float64':>10}  "
              f"{'f16/f32':>9}  {'f64/f32':>9}")
    print(header)
    print("-" * len(header))
    for n in SIZES:
        r16 = results[(n, "float16")][op_key]
        r32 = results[(n, "float32")][op_key]
        r64 = results[(n, "float64")][op_key]
        ratio_16 = r16 / r32
        ratio_64 = r64 / r32
        print(f"{n:>12,}  "
              f"{r16:>9.3f}   "
              f"{r32:>9.3f}   "
              f"{r64:>9.3f}   "
              f"{ratio_16:>8.2f}×  "
              f"{ratio_64:>8.2f}×")
    print()


def main():
    print(__doc__.strip().splitlines()[0])
    print()
    print(f"Sizes (elements):   {SIZES}")
    print(f"Each cell:          best of {MEASURE_REPS} runs after {WARMUP_REPS} warmup")
    print()

    results = run_grid()

    print_table(results, "sum_ns", "sum(a)")
    print_table(results, "mul_ns", "a * b -> out (elementwise)")
    print_table(results, "dot_ns", "a @ b (dot product)")

    print("Read the columns:")
    print("  f16/f32 < 1: float16 is winning — halved bytes beat the conversion.")
    print("  f16/f32 > 1: conversion overhead dominates — the array fits well")
    print("               enough in cache that bandwidth savings don't pay.")
    print("  f64/f32 is the 'pure width' reference: no conversion penalty, just")
    print("               twice the bytes. If f64/f32 < 2 the op is partly")
    print("               compute-bound; if ~2 it is bandwidth-bound.")


if __name__ == "__main__":
    main()

Wow. Quite a response to a half question (‘answer if you feel like it’). Anyway, CPU F16 currently has substantial time cost. Thanks for the answer.

I have the impression that large neural-network and language model and PyTorck for AI – all best done with CUDA, use BF16 with full F32 range and less than the half precision. So it does not surprise me that Zen4 has that. Maybe F16 is not yet used enough (chicken and egg problem) or the Zen4 die does not have quite enough room.