Introducing `lafleur`, a CPython JIT fuzzer

Greetings all, I’d like to present lafleur, a CPython JIT fuzzer. We’d love and welcome any kind of collaboration, from questions to vague ideas to working code to taking part in a research project. Please start discussions, file issues, or propose pull requests if you want to contribute.

Summary

What

lafleur is a specialized fuzzer designed to find crashes in CPython’s experimental JIT, having found 4 JIT crashes so far.

How

It works by mutating code samples, running the new code while reading JIT debug information, and monitoring the process to detect crashes. If the new mutated sample generates interesting output, it’s saved to be further mutated later.

Why

Finding crashes in CPython’s JIT is a way of making it more robust and correct in corner cases.

History

lafleur descends from fusil and has been in active development for about 2 months.

Intro to lafleur

CPython’s JIT compiler represents an exciting experiment in improving Python performance, but like all complex optimizing compilers, it’s susceptible to subtle bugs that only surface under specific conditions. Over the past two months, we’ve been developing lafleur, a specialized fuzzer designed specifically to hunt these elusive JIT bugs, which has already found some crashes.

lafleur is a feedback-driven, evolutionary fuzzer for the CPython JIT compiler. That means that, unlike fusil and traditional fuzzers that generate code randomly, lafleur uses a coverage-guided, success-rewarding approach: it executes test cases, observes their effect on the JIT’s behavior by analyzing verbose trace logs, and uses that feedback to guide its mutations, theoretically becoming progressively smarter at finding interesting code paths over time.

Being coverage-guided means that lafleur decides whether a test case it generated is interesting by checking the JIT debug logs for new behavior after the mutation is applied. It considers both novel micro-ops (UOPs) and novel edges between pairs of UOPs as interesting. Each of those can either be globally novel (never seen before in any session in a given run) or relatively novel (never seen before in a test case’s lineage).
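
As a rough illustration of that novelty check, here is a simplified, hypothetical sketch (not lafleur’s actual implementation; the regex and log format are assumptions):

import re

# Toy sketch: extract UOP names from a verbose JIT trace log and flag a child
# as interesting if it exercises UOPs or UOP->UOP edges we have never seen.
UOP_RE = re.compile(r"\b(_[A-Z_]+)\b")  # matches names like _LOAD_ATTR

def is_interesting(log_text, seen_uops, seen_edges):
    uops = UOP_RE.findall(log_text)
    edges = set(zip(uops, uops[1:]))  # consecutive UOP pairs
    novel = bool(set(uops) - seen_uops) or bool(edges - seen_edges)
    seen_uops.update(uops)
    seen_edges.update(edges)
    return novel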

The fuzzer works on a corpus of files containing code that will be mutated. It’s evolutionary in the sense that new interesting test cases are added to this corpus to be further mutated, and the selection of which file to mutate next is influenced by a score representing its success metrics, including how much it has contributed to increased coverage. It randomly alternates between breadth-first exploration (once an interesting mutation is found, select another parent to mutate) and depth-first exploration (once an interesting mutation is found, select the new child as the next parent to mutate).
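
A minimal sketch of that alternation (the helper name and the 50/50 split are assumptions for illustration, not lafleur’s actual policy):

import random

def next_parent(corpus_scores, last_interesting_child=None):
    # Depth-first: keep mutating the child we just added to the corpus.
    if last_interesting_child is not None and random.random() < 0.5:
        return last_interesting_child
    # Breadth-first: pick another parent, weighted by its success score.
    parents = list(corpus_scores)
    weights = [corpus_scores[p] for p in parents]
    return random.choices(parents, weights=weights, k=1)[0]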

The mutations it applies are AST-based, so the mutated code is almost always syntactically valid. The selection of which mutations to apply considers the success rates of each mutation in the current fuzzing run, with a decaying factor so new successes count more than old successes. Most of these mutations were modeled after techniques that have a chance of fooling the JIT into problematic behavior.
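
To make that concrete, here is a toy AST mutator and a decaying, success-weighted mutator selection; the names, decay value and scoring scheme are illustrative assumptions, not lafleur’s real mutators:

import ast
import random

class SwapComparisons(ast.NodeTransformer):
    """Toy mutator: flip < and > so comparisons take the other branch."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        flipped = {ast.Lt: ast.Gt, ast.Gt: ast.Lt}
        node.ops = [flipped.get(type(op), type(op))() for op in node.ops]
        return node

def pick_mutator(mutators, scores, decay=0.99):
    # Decay every score so recent successes count more than old ones, then
    # choose a mutator with probability proportional to its decayed score.
    for name in scores:
        scores[name] *= decay
    names = list(mutators)
    weights = [scores.get(name, 1.0) for name in names]
    return mutators[random.choices(names, weights=weights, k=1)[0]]

print(ast.unparse(SwapComparisons().visit(ast.parse("if x < y: pass"))))
# prints the mutated source, with the comparison flipped to "if x > y:"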

lafleur has its origins in the fusil project created by Victor Stinner, where it was initially a feedback-driven component of that fuzzer. fusil is also able to generate and mutate code targeting the CPython JIT, besides fuzzing all of CPython and other projects with random function and method calls. The usual way of creating a corpus of seed files to mutate is using fusil to generate them.

Some terminology

  • Session: a collection of fuzzing script executions targeting a single mutation.
  • Run: a collection of sessions, corresponding to an invocation of a lafleur instance.
  • Campaign: a collection of runs, including results from many lafleur instances.
  • Parent: a code sample chosen for mutation.
  • Child: a mutated code sample.
  • Micro-ops (UOPs): the low-level operations that the JIT works with internally (e.g., _LOAD_ATTR, _STORE_FAST).
  • Edges: control flow between consecutive UOPs (e.g., _LOAD_ATTR->_STORE_ATTR), providing context about operation sequences.
  • Coverage: our measure, based on edges and UOPs, of how thoroughly we’ve explored the JIT’s internal state space.
  • Corpus: the collection of “interesting” test cases that have discovered new JIT state (that is, increased coverage).
  • Instance: a long-lived lafleur process that mutates code, selects parents, monitors fuzzing, and spawns short-lived fuzzing processes.

Finding crashes in CPython’s JIT

Fuzzing a robust JIT

One thing we always keep in mind is that we’re looking for bugs in a new, robust, and relatively small codebase. This creates its own challenges and leads to a lower success rate than fuzzing all of CPython with fusil.

Besides the high code quality, fuzzing a JIT is hard because you’re not (just) looking for interesting code paths, but have to find interesting combinations of operations that might break expectations and invariants, overwhelm the JIT’s monitoring, cause trouble in optimizations and deoptimizations, and reach all sorts of problematic states. We’re talking about a truly huge JIT state space stemming from a small codebase, as opposed to a small surface area stemming from a large codebase when fuzzing CPython with fusil.

Running lafleur so far

I’ve been running a lafleur campaign on two personal computers and three free cloud instances during the two months it’s been in development. As can be seen in the Challenges and opportunities section below, this is probably far too little computing power for a successful fuzzing campaign.

Currently lafleur lacks many automation and quality-of-life features, so running it is an artisanal process: instances must be started manually and stopped by hand when a run gets too long or starts using too much disk space and memory. I run several parallel instances on each machine, but lafleur doesn’t actually support parallelism: the instances simply run independently from different root directories. Watching for and filtering hits (cases that cause a crash), deduplicating them (identifying which cases cover the same issue), and reducing test cases are all manual processes. Any help in making these processes more automated and reproducible would be highly valuable and would make it easier for other people to run lafleur and search for JIT bugs.

Instructions about how to run lafleur can be found in the README.

Issues found

lafleur has found 4 JIT crashes so far:

  • Issue #136996: “JIT: executor->vm_data.valid assertion failure in unlink_executor”, in which a wrong assertion about freeing the exit trace executor is triggered.
  • Issue #137007: “JIT: assertion failure in _PyObject_GC_UNTRACK”, in which deallocating the executor when it’s not yet tracked by GC causes a segfault or an abort.
  • Issue #137728: “Assertion failure or SystemError in _PyEval_EvalFrameDefault in a JIT build”, in which too many local variables would lead to an abort or a SystemError.
  • Issue #137762: “Assertion failure in optimize_uops in a JIT build”, in which, similarly to the previous issue, too many local variables of specific types would lead to an abort or a segfault.

I consider these bugs to be mostly extreme edge cases that aren’t highly valuable in themselves, but fixing them does improve the JIT’s robustness and correctness. Hence, finding them moves us toward lafleur’s goals.

Challenges and opportunities

Low hit rate

For two of the four issues found so far, the hit rate has been very low, meaning we don’t find the same issues repeatedly. In fact, one of them was never found again, while the other was only found twice. This indicates that our coverage of JIT state space is low, and hence that increasing our current computing resources may lead to better results.

Community involvement could make a huge difference here. Distributed fuzzing across multiple contributors’ machines would dramatically expand our search coverage.

Resource usage

lafleur needs quite a bit of RAM (over 2GB per instance after a couple of hours) and disk space (several GB after a couple of hours) for long runs. This comes from the need to keep coverage data in memory and on disk, the growth of the corpus and interesting results, and logs. So far, I’ve been restarting the runs from zero when the memory and data grow too large, which might mean we’re losing interesting code samples that would result in crashes.

There’s room for many optimizations, including better corpus management, coverage data compression, and smarter state persistence, all of which could make lafleur much more efficient.

Slow execution

A single lafleur instance averages less than one session per second. This slow pace of mutating the corpus and executing the resulting scripts makes covering the JIT’s state space that much harder.

How You Can Help

We’re actively seeking collaboration in several specific areas and welcome any kind of contributions:

New Mutation Strategies

Our AST-based mutator library is the heart of lafleur’s effectiveness. We’re particularly interested in:

  • Mutators targeting UOPs we don’t currently cover or only cover in a restricted set of edges.
  • Strategies for exercising the JIT’s function inlining logic.
  • New patterns that reproduce known JIT stressing behavior (we have a set of them in fusil).

Coverage Signal Enhancement

Help us improve our feedback mechanisms by:

  • Identifying new JIT log messages that indicate interesting states.
  • Developing better heuristics for parent test case selection.
  • Contributing ideas for alternative coverage metrics.

Performance & Scale

Assist in making lafleur more efficient:

  • Ideas for automated distributed fuzzing across multiple machines.
  • Memory usage optimizations for long-running campaigns.
  • Strategies for corpus management, refinement and deduplication.

Analysis & Tooling

Help us build better ways of understanding the fuzzing behavior:

  • Post-processing tools for analyzing fuzzing results.
  • Better visualization of coverage data and campaign progress.
  • Polling of runs from different instances for global campaign analysis.
  • Tools for describing, comparing and ranking runs and campaigns.

Quality of life improvements

Features that will make lafleur easier to run and develop:

  • Create and improve CI, including linting and requiring updates to the changelog.
  • Create tests and improve the testing system.
  • Add support for a configuration file.
  • Instance management tools to help run larger campaigns.
  • Automate and improve patching of CPython sources to make triggering JIT bugs easier.

Research opportunities

Join us in describing, improving and experimenting with lafleur:

  • Identify and publish novel fuzzing approaches already integrated into lafleur.
  • Suggest and/or implement improvements based on fuzzing literature.
  • Create and publish novel fuzzing features to add to lafleur.

Increasing computing resources

Help us run more lafleur instances or find more computing resources:

  • Make it easier to run lafleur by increasing automation and improving QoL features.
  • Run lafleur instances on computing resources at your disposal.
  • Help find available computing resources in academia, companies, the cloud, etc.
  • Create a research project based on lafleur to make it eligible for free cloud research credits.

Improving documentation

Assist in adding, improving and publishing documentation:

  • Develop broader, better-structured user/usage documentation.
  • Document how to build seed samples by hand.
  • Keep developer documentation up-to-date with latest developments.
  • Publish documentation somewhere, both to make it more findable and easier to navigate.
  • Apply the Diátaxis approach to improve the documentation.

Getting Started: Check out our Developer Documentation Index and contribution guide. Even if you’re new to fuzzing, there are good first issues available!

Ideas for the future

Comparative timing fuzzing

If a code sample runs slower under the JIT than without it, that’s an interesting bug to flag. Since lafleur can be made to optimize for different kinds of signals, instead of randomly generating code and benchmarking it, we can select code that already displays the desired behavior (running slower under the JIT) for further mutation, increasing the chance of finding pathological cases.

The basic idea is easy to implement: we generate a single source containing many loops of a code sample, then run it with JIT enabled and compare the timings to a run with JIT disabled.
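
A minimal sketch of that comparison, assuming a CPython build where the PYTHON_JIT environment variable toggles the JIT (the function names and number of repeats are illustrative, not part of lafleur):

import os
import subprocess
import sys
import time

def run_once(path, jit):
    # Assumes PYTHON_JIT=0/1 disables/enables the JIT in the target build.
    env = {**os.environ, "PYTHON_JIT": "1" if jit else "0"}
    start = time.perf_counter()
    subprocess.run([sys.executable, path], env=env,
                   capture_output=True, check=False)
    return time.perf_counter() - start

def jit_slowdown(path, repeats=5):
    with_jit = min(run_once(path, True) for _ in range(repeats))
    without_jit = min(run_once(path, False) for _ in range(repeats))
    return with_jit / without_jit  # > 1.0 means slower under the JIT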

The hard part is gathering a reliable, statistically valid signal from timing runs. First off, we must account for machine state: performing one run when the machine is idle and the other when it’s busy will degrade or invalidate the timing signal. Then, we need to gather enough timing samples to make our data robust. We also need ways to assess the quality of the collected data.

Due to the benchmarking complexity involved, we’d like to request help from those with experience in benchmarking and its tooling to guide us in implementing this approach.

Characterization by retrospective campaigns

The current fuzzing campaign has been run with a changing code base, variable computing resources and over different CPython versions without a fixed pattern. This makes it hard to characterize lafleur’s behavior, effectiveness, efficiency, etc.

We plan to run controlled, instrumented campaigns using different lafleur versions and varying, well-specified amounts of computing resources, targeting specific CPython versions known to contain bugs that lafleur can find. If this turns into a bona fide research project, even synthetic bugs (that is, arbitrarily added issues) can be used to model fuzzing behavior and performance.

This experiment will be important for assessing how resource usage and new features correlate with fuzzing success, allowing us to better understand how to improve lafleur to attain better results.

The gathered data will also allow us to implement and improve our run and campaign analysis and visualization framework, resulting in better tools for our real fuzzing campaigns.

A closer look

lafleur works on specially formatted code samples, adding and mutating code as it goes. Here’s an example of the content of a seed file:

from sys import stderr

int_v1 = 600627076259156299

int_v2 = 52508104421

list_v3 = ["\uDC80",
 False,
 Exception('fuzzer_generated_exception'),
 -997.726]

float_v4 = -5.1

# This harness function contains the synthesized code.
def uop_harness_f1():
    # --- Core Pattern ---
    int_v1 > int_v2
    int_v2 > int_v2
    for res_foriterlist in list_v3:
        pass
    for res_foriterlist in list_v3:
        pass
    float_v4 * float_v4
    float_v4 * float_v4

# Execute the harness in a JIT-warming hot loop.
for i_f1 in range(300):
    try:
        uop_harness_f1()
    except Exception as e:
        print(f"EXCEPTION: {e.__class__.__name__}: {e}", file=stderr)
        break

Running lafleur will create a directory structure containing test cases, data, logs and other artifacts, like the one shown below:

run_root
|-- corpus
|   `-- jit_interesting_tests  # Our corpus of interesting files
|       |-- 1.py  # Seed file, usually generated by fusil
|       |-- 2.py  # Seed or mutated file, depending on chosen number of seeds
|       |-- 3.py
|       `-- [...]
|-- coverage
|   |-- coverage_state.pkl  # Pickled coverage data, global (all seen edges and UOPs) and per file
|   `-- mutator_scores.json  # Detailed record of scores and numbers of attempts per mutator
|-- crashes  # Records test cases that ended in a crash
|   |-- crash_retcode_child_1510_20_1.log  # Output from running the case that crashed
|   |-- crash_retcode_child_1510_20_1.py  # Case that crashed
|   `-- [...]
|-- divergences  # Holds output from differential execution mode, currently broken
|-- logs
|   |-- deep_fuzzer_run_2025-08-16T11-23-07.897399.log  # Complete output from fuzzing run
|   |-- mutator_effectiveness.jsonl  # Snapshot of current mutator effectiveness
|   |-- timeseries_2025-08-16T14-23-07.907531Z00-00.jsonl  # JSONL file holding snapshots of run stats
|   `-- [...]
|-- timeouts  # Records test cases that ended in a timeout
|   |-- timeout_1095_1_child_1095_1_1.log  # Output from running the case that timed out
|   |-- timeout_1095_1_child_1095_1_1.py  # Case that timed out
|   `-- [...]
`-- tmp_fuzz_run  # Temporarily holds generated child source and log files
    `-- [...]

For a detailed breakdown of the state files and output directories, see the State and Data Formats documentation.

Further information

Documentation

We try to provide a lot of developer documentation for lafleur, including how to contribute to it.


That’s about it. Sorry for the huge post.

Thanks Ken Jin and Brandt Bucher for giving feedback on an earlier version of this post.

And thank you for reading this far! :slight_smile: As a thank you note, here’s the origin of the name:

lafleur grew from a fuzzer named fusil (meaning rifle), and when spinning off the project I wanted a name related to fusil, but avoiding another gun reference. The image of protesters putting flowers in rifles came to mind, and I searched for “fleur fusil”, which uncovered the French expression la fleur au fusil. This expression comes from some young French soldiers going to war with flowers in their rifles and means going into something with enthusiasm and optimism, but with naivete, being unprepared for and completely unaware of the complexities involved. And I was like “hey, that’s just how I went into this project!”, so lafleur it was.

Oh, and I sometimes post about lafleur on social media; keep an eye on it if you want to follow the development.


Great work!

All the bugs you found were assertion failures. Do you also look for cases where the behavior differs between JIT and non-JIT runs? I can imagine running a code sample with and without the JIT and asserting that the behavior is identical.


Differential testing of the JIT sounds nice. We technically already do that by running the JIT on the entire test suite.

Also I think Daniel is being humble regarding the importance of some of the bugs they found :). For example,

Issue #137728: “Assertion failure or SystemError in _PyEval_EvalFrameDefault in a JIT build”, in which too many local variables would lead to an abort or a SystemError.
Issue #137762: “Assertion failure in optimize_uops in a JIT build”, in which, similarly to the previous issue, too many local variables of specific types would lead to an abort or a segfault. 

Those are pretty important. Those asserts would translate to actual segfaults in release builds. They meant the JIT would crash on every “big”-ish function that has many locals.


Are there tools to manage the distribution of these sorts of tasks across contributed resources? I recall SETI@home (many years ago) and something more recently from Berkeley. Could such a tool be developed relatively easily within the Python dev community to yield a secure task distributor for these sorts of jobs?


We have a differential mode that works like you suggest, running code a single time as control and in a hot loop as the JITted sample, then comparing the locals() at the end of execution.

However, it isn’t run constantly, so it broke when we started having functions and classes in both code blocks, as these don’t compare equal. It’s just a matter of fixing the comparison function.
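
For what it’s worth, one possible shape for that fix (a sketch, not the actual comparison code) is to reduce values to a comparable form before checking:

def comparable(value):
    # Functions and classes defined inside the samples never compare equal
    # across runs, so reduce them to a stable stand-in; fall back to repr()
    # for everything else.
    if callable(value) or isinstance(value, type):
        return ("callable", getattr(value, "__name__", "?"))
    try:
        return repr(value)
    except Exception:
        return "<unreprable>"

def locals_match(control, jitted):
    keys = control.keys() | jitted.keys()
    return all(comparable(control.get(k)) == comparable(jitted.get(k))
               for k in keys)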

Running with the JIT enabled and disabled would be a more robust design. We don’t have that yet, but it aligns with what we need for the differential timing checks.

That’s a great idea. I also remember SETI@home, distributed.net, and the Great Internet Mersenne Prime Search, all projects I contributed measly cycles to back in the day.

The Berkeley tool you remember is BOINC, and I believe it can indeed be used for creating a real distributed fuzzing system.

Both using BOINC and developing a similar Python-focused tool are worthwhile and would help with other kinds of fuzzing. Maybe even with reproducing test failures or reported issues in different OSes and configurations.

But it seems out of my reach for now: I’m focusing on getting lafleur more efficient and effective (vertical scaling) instead, which is within my current available time and resources. Distributed horizontal scaling is an important goal, but a whole project in itself. I sure could help such an effort, but won’t be able to spearhead it at least for a while.

Thank you @devdanzin for this cool project. I’m currently working on something similar with pysource-codegen, but I think with a different approach. Let me know if you have any ideas on how I can help you.


Thank you very much for your offer @15r10nk, and thanks for your projects too!

I had taken a look at both pysource-codegen and pysource-minimize a while ago (before working on lafleur) and found both very interesting and likely to help with fuzzing, but not a good match for fusil, as it only needed/used interesting fuzzing vectors, not random code.

Now that we need seed files with random code for lafleur, I think pysource-codegen may be an excellent optional substitute for test case generation, taking the place of fusil or enhancing the code it generates, and offering much greater diversity.

I think with a bit of AST-based massaging (like scanning the generated code and assigning something to each named variable used, something like we do in fusil), some of the generated code may happen to be both syntactically and semantically valid, which would be enough to use it as seed code for fuzzing.
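
A very rough sketch of that kind of massaging (a hypothetical helper that ignores scoping and imports; not fusil’s actual code):

import ast
import builtins

def bind_free_names(source, placeholder="object()"):
    # Find names that are loaded but never assigned in the generated code and
    # prepend a simple assignment for each, so the sample has a chance of
    # running without a NameError.
    tree = ast.parse(source)
    loaded, stored = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            (loaded if isinstance(node.ctx, ast.Load) else stored).add(node.id)
    free = sorted(loaded - stored - set(dir(builtins)))
    prelude = "".join(f"{name} = {placeholder}\n" for name in free)
    return prelude + source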

I’ll try to write a PoC sometime soon and report back, that way I can get some ideas about what, if anything, we’d need to make that work. Thanks again!


Good news for you: I already started working on something like this in the last few days. I put try: ... except: pass around every statement and use a special Generic class for all used variables. It doesn’t work perfectly and is still a WIP, but I’ve already found 2 bugs with it.

The open problem is still generating expressions that can be executed without raising exceptions. I have some ideas, but I don’t know how well they will work out. Here is my current code.


Wow, that’s wonderful! Really great results too!

I will try to integrate this with lafleur. Hopefully I can contribute something to your project.

Here’s how fusil generates valid code: fusil/fusil/python/jit/ast_pattern_generator.py at main · devdanzin/fusil · GitHub. It’s much less versatile than your code generation, but some of the tricks (like scanning and initializing the variables, or using interesting objects in place of your Generic class) might be useful.

Have you thought about simply generating small code blocks, compiling them and checking whether they run without raising an exception? Then you could stitch them together.