Greetings all, I’d like to present lafleur, a CPython JIT fuzzer. We’d love and welcome any kind of collaboration, from questions to vague ideas to working code to taking part in a research project. Please start discussions, file issues, or propose pull requests if you want to contribute.
Summary
What
lafleur is a specialized fuzzer designed to find crashes in CPython’s experimental JIT, having found 4 JIT crashes so far.
How
It works by mutating code samples, reading the JIT’s debug output while running the new code, and monitoring the process to detect crashes. If a mutated sample generates interesting output, it’s saved to be further mutated later.
Why
Finding crashes in CPython’s JIT is a way of making it more robust and correct in corner cases.
History
lafleur descends from fusil and has been in active development for about 2 months.
Intro to lafleur
CPython’s JIT compiler represents an exciting experiment in improving Python performance, but like all complex optimizing compilers, it’s susceptible to subtle bugs that only surface under specific conditions. Over the past two months, we’ve been developing lafleur, a specialized fuzzer designed to hunt these elusive JIT bugs; it has already found some crashes.
lafleur is a feedback-driven, evolutionary fuzzer for the CPython JIT compiler. Unlike fusil and traditional fuzzers that generate code randomly, lafleur uses a coverage-guided, success-rewarding approach: it executes test cases, observes their effect on the JIT’s behavior by analyzing verbose trace logs, and uses that feedback to guide its mutations, so it should become progressively smarter at finding interesting code paths over time.
Being coverage-guided means that lafleur decides whether a test case it generated is interesting by checking the JIT debug logs for new behavior after the mutation is applied. It considers both novel micro-ops (UOPs) and novel edges between pairs of UOPs as interesting. Each of those can either be globally novel (never seen before in any session in a given run) or relatively novel (never seen before in a test case’s lineage).
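To make that concrete, here’s a minimal sketch of how a sequence of UOP names parsed from the JIT’s verbose log could be checked for novel edges. This is not lafleur’s actual code; the function, data structures, and UOP names are illustrative:

```python
def is_interesting(uops, global_edges, lineage_edges):
    """Return True if this trace contains a UOP pair never seen globally
    or never seen in this test case's lineage (illustrative only)."""
    edges = set(zip(uops, uops[1:]))      # consecutive UOP pairs
    novel = (edges - global_edges) or (edges - lineage_edges)
    global_edges |= edges                  # remember what we've now seen
    lineage_edges |= edges
    return bool(novel)

# Example trace parsed out of a JIT debug log (names for illustration only):
trace = ["_LOAD_FAST", "_LOAD_CONST", "_BINARY_OP", "_STORE_FAST"]
print(is_interesting(trace, set(), set()))   # True: everything is new
```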
The fuzzer works on a corpus of files containing code that will be mutated. It’s evolutionary in the sense that new interesting test cases are added to this corpus to be further mutated, and the selection of which file to mutate is influenced by a score representing its success metrics, including how much it has contributed to increased coverage. It randomly alternates between breadth-first exploration (once an interesting mutation is found, select another parent to mutate) and depth-first exploration (once an interesting mutation is found, select the child as the new parent to mutate).
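As a rough illustration of that scheduling (lafleur’s real scoring and selection are richer, and these names are made up), parent selection could look something like this:

```python
import random

def pick_parent(corpus_scores, last_child=None, depth_first_prob=0.5):
    """Pick the next file to mutate: sometimes keep drilling into the latest
    interesting child (depth-first), otherwise pick a corpus file weighted
    by its success score (breadth-first). Illustrative only."""
    if last_child is not None and random.random() < depth_first_prob:
        return last_child                            # depth-first: mutate the child again
    files = list(corpus_scores)
    weights = [corpus_scores[f] for f in files]      # score-weighted choice
    return random.choices(files, weights=weights, k=1)[0]

print(pick_parent({"1.py": 3.0, "2.py": 0.5, "7.py": 1.2}, last_child="child_42.py"))
```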
The mutations it applies are AST-based, so the mutated code is almost always syntactically valid. The selection of which mutations to apply considers the success rates of each mutation in the current fuzzing run, with a decaying factor so new successes count more than old successes. Most of these mutations were modeled after techniques that have a chance of fooling the JIT into problematic behavior.
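For example, a mutator in this style can be written as an `ast.NodeTransformer`. The one below is a made-up illustration of the mechanism, not one of lafleur’s actual mutators:

```python
import ast
import random

class SwapComparisons(ast.NodeTransformer):
    """Randomly replace comparison operators, keeping the tree syntactically valid."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        if random.random() < 0.5:
            node.ops = [random.choice([ast.Lt(), ast.Gt(), ast.Eq()])
                        for _ in node.ops]
        return node

tree = ast.parse("flag = int_v1 > int_v2")
mutated = ast.fix_missing_locations(SwapComparisons().visit(tree))
print(ast.unparse(mutated))   # e.g. "flag = int_v1 < int_v2"
```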
lafleur has its origins in the fusil project created by Victor Stinner, where it was initially a feedback-driven component of that fuzzer. fusil is also able to generate and mutate code targeting the CPython JIT, besides fuzzing all of CPython and other projects with random function and method calls. The usual way of creating a corpus of seed files to mutate is using fusil to generate them.
Some terminology
- Session: a collection of fuzzing script executions targeting a single mutation.
- Run: a collection of sessions, corresponding to an invocation of a `lafleur` instance.
- Campaign: a collection of runs, including results from many `lafleur` instances.
- Parent: a code sample chosen for mutation.
- Child: a mutated code sample.
- Micro-ops (UOPs): the low-level operations that the JIT works with internally (e.g., `_LOAD_ATTR`, `_STORE_FAST`).
- Edges: control flow between consecutive UOPs (e.g., `_LOAD_ATTR -> _STORE_ATTR`), providing context about operation sequences.
- Coverage: our measure, based on edges and UOPs, of how thoroughly we’ve explored the JIT’s internal state space.
- Corpus: the collection of “interesting” test cases that have discovered new JIT state (that is, increased coverage).
- Instance: a long-lived `lafleur` process that mutates code, selects parents, monitors fuzzing, and spawns short-lived fuzzing processes.
Finding crashes in CPython’s JIT
Fuzzing a robust JIT
One thing we always keep in mind is that we’re looking for bugs in a new, robust, and relatively small codebase. This creates its own challenges and, in comparison to fuzzing CPython with fusil, causes a lower success rate.
Besides the high code quality, fuzzing a JIT is hard because you’re not (just) looking for interesting code paths, but have to find interesting combinations of operations that might break expectations and invariants, overwhelm the JIT’s monitoring, cause trouble in optimizations and deoptimizations, and reach all sorts of problematic states. We’re talking about a truly huge JIT state space stemming from a small codebase, as opposed to a small surface area stemming from a large codebase when fuzzing CPython with fusil.
Running lafleur so far
I’ve been running a lafleur campaign on two personal computers and three free cloud instances during the two months it’s been in development. As can be seen in the Challenges and opportunities section, this is probably far too little computing power for successful fuzzing campaigns.
Currently lafleur lacks many automation and quality-of-life features, so running it is an artisanal process that requires manually starting instances and stopping them when a run gets too long or uses too much disk space and memory. I run several parallel instances on each machine, but lafleur doesn’t actually support parallelism: the instances simply run independently from different root directories. Watching for and filtering hits (cases that cause a crash), deduplicating them (identifying which cases cover the same issue), and reducing test cases are all manual processes. Any help in making these processes more automated and reproducible would be highly valuable and would make it easier for other people to run lafleur and search for JIT bugs.
Instructions about how to run lafleur can be found in the README.
Issues found
lafleur has found 4 JIT crashes so far:
- Issue #136996: “JIT: `executor->vm_data.valid` assertion failure in `unlink_executor`”, in which a wrong assertion about freeing the exit trace executor is triggered.
- Issue #137007: “JIT: assertion failure in `_PyObject_GC_UNTRACK`”, in which deallocating the executor when it’s not yet tracked by the GC causes a segfault or an abort.
- Issue #137728: “Assertion failure or `SystemError` in `_PyEval_EvalFrameDefault` in a JIT build”, in which too many local variables would lead to an abort or a `SystemError`.
- Issue #137762: “Assertion failure in `optimize_uops` in a JIT build”, in which, similarly to the previous issue, too many local variables of specific types would lead to an abort or a segfault.
I consider these bugs to be mostly extreme edge cases that aren’t highly valuable in themselves, but fixing them does improve the JIT’s robustness and correctness. Hence, finding them moves us toward lafleur’s goals.
Challenges and opportunities
Low hit rate
For two of the four issues found so far, the hit rate has been very low, meaning we don’t find the same issues repeatedly. In fact, one of them was never found again, while the other was only found twice. This indicates that our coverage of JIT state space is low, and hence that increasing our current computing resources may lead to better results.
Community involvement could make a huge difference here. Distributed fuzzing across multiple contributors’ machines would dramatically expand our search coverage.
Resource usage
lafleur needs quite a bit of RAM (over 2GB per instance after a couple of hours) and disk space (several GB after a couple of hours) for long runs. This comes from the need to keep coverage data in memory and on disk, the growth of the corpus and interesting results, and logs. So far, I’ve been restarting the runs from zero when the memory and data grow too large, which might mean we’re losing interesting code samples that would result in crashes.
There’s room for many optimizations, including better corpus management, coverage data compression, and smarter state persistence, all of which could make lafleur much more efficient.
Slow execution
A single lafleur instance runs at less than one session per second on average. This slow pace of mutating the corpus and executing the resulting scripts makes covering the JIT’s state space harder.
How You Can Help
We’re actively seeking collaboration in several specific areas and welcome any kind of contributions:
New Mutation Strategies
Our AST-based mutator library is the heart of lafleur’s effectiveness. We’re particularly interested in:
- Mutators targeting UOPs we don’t currently cover or only cover in a restricted set of edges.
- Strategies for exercising the JIT’s function inlining logic.
- New patterns that reproduce known JIT stressing behavior (we have a set of them in fusil).
Coverage Signal Enhancement
Help us improve our feedback mechanisms by:
- Identifying new JIT log messages that indicate interesting states.
- Developing better heuristics for parent test case selection.
- Contributing ideas for alternative coverage metrics.
Performance & Scale
Assist in making lafleur more efficient:
- Ideas for automated distributed fuzzing across multiple machines.
- Memory usage optimizations for long-running campaigns.
- Strategies for corpus management, refinement and deduplication.
Analysis & Tooling
Help us build better ways of understanding the fuzzing behavior:
- Post-processing tools for analyzing fuzzing results.
- Better visualization of coverage data and campaign progress.
- Polling of runs from different instances for global campaign analysis.
- Tools for describing, comparing and ranking runs and campaigns.
Quality of life improvements
Features that will make lafleur easier to run and develop:
- Create and improve CI, including linting and requiring updates to the changelog.
- Create tests and improve the testing system.
- Add support for a configuration file.
- Instance management tools to help running larger campaigns.
- Automate and improve patching of CPython sources to make triggering JIT bugs easier.
Research opportunities
Join us in describing, improving and experimenting with lafleur:
- Identify and publish novel fuzzing approaches already integrated into `lafleur`.
- Suggest and/or implement improvements based on fuzzing literature.
- Create and publish novel fuzzing features to add to `lafleur`.
Increasing computing resources
Help us run more lafleur instances or find more computing resources:
- Make it easier to run `lafleur` by increasing automation and improving QoL features.
- Run `lafleur` instances on computing resources at your disposal.
- Help find available computing resources in academia, companies, the cloud, etc.
- Create a research project based on `lafleur` to make it eligible for free cloud research credits.
Improving documentation
Assist in adding, improving and publishing documentation:
- Develop broader, better-structured user and usage documentation.
- Document how to build seed samples by hand.
- Keep developer documentation up-to-date with latest developments.
- Publish documentation somewhere, both to make it more findable and easier to navigate.
- Apply the Diátaxis approach to improve the documentation.
Getting Started: Check out our Developer Documentation Index and contribution guide. Even if you’re new to fuzzing, there are good first issues available!
Ideas for the future
Comparative timing fuzzing
If a code sample runs slower under the JIT than without it, it’s an interesting bug to flag. Given that lafleur can be made to optimize for different kinds of signals, instead of randomly generating code and benchmarking it, we can select code that already displays the desired behavior (slower under the JIT) to further mutate, increasing the chance of finding pathological cases.
The basic idea is easy to implement: we generate a single source containing many loops of a code sample, then run it with JIT enabled and compare the timings to a run with JIT disabled.
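A sketch of that comparison might look like the following, assuming a JIT-enabled CPython build that honors the PYTHON_JIT environment variable; the sample path, threshold, and repeat count are illustrative, not part of any real harness:

```python
import os
import statistics
import subprocess
import sys
import time

def median_runtime(sample_path, jit_on, repeats=5):
    """Run a generated sample several times and return its median wall time."""
    env = dict(os.environ, PYTHON_JIT="1" if jit_on else "0")
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([sys.executable, sample_path], env=env, check=True)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

jit_time = median_runtime("child_123.py", jit_on=True)     # hypothetical sample
nojit_time = median_runtime("child_123.py", jit_on=False)
if jit_time > 1.2 * nojit_time:   # arbitrary threshold, for illustration
    print("slower under the JIT: keep this sample for further mutation")
```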
The hard part is gathering a reliable, statistically valid signal from timing runs. First off, we must account for machine state: performing one run when the machine is idle and the other when it’s busy will degrade or invalidate the timing signal. Then, we need to gather enough timing samples to make our data robust. We also need ways to assess the quality of the collected data.
Due to the benchmarking complexity involved, we’d like to request help from those with experience in benchmarking and its tooling to guide us in implementing this approach.
Characterization by retrospective campaigns
The current fuzzing campaign has been run with a changing codebase, varying computing resources, and different CPython versions, without a fixed pattern. This makes it hard to characterize lafleur’s behavior, effectiveness, efficiency, etc.
We plan to run controlled, instrumented campaigns using different lafleur versions and specified amounts of computing resources, targeting specific CPython versions known to contain bugs that lafleur can find. If this turns into a bona fide research project, even synthetic bugs (that is, arbitrarily added issues) could be used to model fuzzing behavior and performance.
This experiment will be important for assessing how resource usage and new features correlate with fuzzing success, allowing us to better understand how to improve lafleur to attain better results.
The gathered data will also allow us to implement and improve our run and campaign analysis and visualization framework, resulting in better tools for our real fuzzing campaigns.
A closer look
lafleur works on specially formatted code samples, adding and mutating code as it goes. Here’s an example of a seed file’s contents:
```python
from sys import stderr  # used by the harness below to report exceptions

int_v1 = 600627076259156299
int_v2 = 52508104421
list_v3 = ["\uDC80",
           False,
           Exception('fuzzer_generated_exception'),
           -997.726]
float_v4 = -5.1

# This harness function contains the synthesized code.
def uop_harness_f1():
    # --- Core Pattern ---
    int_v1 > int_v2
    int_v2 > int_v2
    for res_foriterlist in list_v3:
        pass
    for res_foriterlist in list_v3:
        pass
    float_v4 * float_v4
    float_v4 * float_v4

# Execute the harness in a JIT-warming hot loop.
for i_f1 in range(300):
    try:
        uop_harness_f1()
    except Exception as e:
        print(f"EXCEPTION: {e.__class__.__name__}: {e}", file=stderr)
        break
```
Running lafleur will create a directory structure containing test cases, data, logs and other artifacts, like the one shown below:
```
run_root
|-- corpus
|   `-- jit_interesting_tests        # Our corpus of interesting files
|       |-- 1.py                     # Seed file, usually generated by fusil
|       |-- 2.py                     # Seed or mutated file, depending on the chosen number of seeds
|       |-- 3.py
|       `-- [...]
|-- coverage
|   |-- coverage_state.pkl           # Pickled coverage data, global (all seen edges and UOPs) and per file
|   `-- mutator_scores.json          # Detailed record of scores and numbers of attempts per mutator
|-- crashes                          # Records test cases that ended in a crash
|   |-- crash_retcode_child_1510_20_1.log    # Output from running the case that crashed
|   |-- crash_retcode_child_1510_20_1.py     # Case that crashed
|   `-- [...]
|-- divergences                      # Holds output from differential execution mode, currently broken
|-- logs
|   |-- deep_fuzzer_run_2025-08-16T11-23-07.897399.log       # Complete output from the fuzzing run
|   |-- mutator_effectiveness.jsonl                          # Snapshot of current mutator effectiveness
|   |-- timeseries_2025-08-16T14-23-07.907531Z00-00.jsonl    # JSONL file holding snapshots of run stats
|   `-- [...]
|-- timeouts                         # Records test cases that ended in a timeout
|   |-- timeout_1095_1_child_1095_1_1.log    # Output from running the case that timed out
|   |-- timeout_1095_1_child_1095_1_1.py     # Case that timed out
|   `-- [...]
`-- tmp_fuzz_run                     # Temporarily holds generated child source and log files
    `-- [...]
```
For a detailed breakdown of the state files and output directories, see the State and Data Formats documentation.
Further information
Documentation
We try to provide a lot of developer documentation for lafleur, including how to contribute to it.
That’s about it. Sorry for the huge post.
Thanks to Ken Jin and Brandt Bucher for giving feedback on an earlier version of this post.
And thank you for reading this far!
As a thank you note, here’s the origin of the name:
lafleur grew from a fuzzer named fusil (meaning rifle), and when spinning off the project I wanted a name related to fusil, but avoiding another gun reference. The image of protesters putting flowers in rifles came to mind, and I searched for “fleur fusil”, which uncovered the French expression la fleur au fusil. This expression comes from some young French soldiers going to war with flowers in their rifles and means going into something with enthusiasm and optimism, but with naivete, being unprepared for and completely unaware of the complexities involved. And I was like “hey, that’s just how I went into this project!”, so lafleur it was.
Oh, and I sometimes post about lafleur on social media; keep an eye out if you want to follow its development.