Feedback on the recent fusil fuzzing campaign of CPython

Greetings all,

For the last six months, I’ve been fuzzing CPython using a tool called fusil, created by @vstinner about 20 years ago. This campaign resulted in 52 issues filed, which corresponds to a little less than 30% of all crash issues created in that six-month period.

Now that this effort is coming to a (temporary!) end, I’m writing a technical report about it. For that report, core developers’ feedback would be very important to assess this campaign’s:

  • usefulness (e.g. did it help CPython development in any tangible way?);
  • cost/benefit (e.g. was it worth doing, and are the results compatible with the resources used?);
  • impact (e.g. were any significant issues fixed?);
  • quality of reports (e.g. is it a problem to file issues before the cause has been understood and diagnosed?);
  • disruption of normal development flow (e.g. how bad is it to have a constant trickle of issues being filed?);
  • suggestions for any future efforts (e.g. file a single issue with all findings and let developers create issues from that?);
  • approach to getting help triaging found issues (e.g. was it a nuisance to have constant questions in the community Python Discord about what a given crash meant?).

That is, any positive or negative feedback would be very welcome. Suggestions for improvements and constructive criticism would be wonderful, but even if all you have to contribute is something like “I don’t think it really helps”, “never heard of it”, or “I liked some issues”, that would still be valuable.

This thread is for free-form feedback, to gather different opinions. If you want your response to be quotable in the report, please indicate that quoting with attribution (or anonymously, if you’d rather) is fine. Depending on response/interest, I might create a poll too.

I’d like to thank @vstinner, @ZeroIntensity, @Jelle, @picnixz, @sobolevn, @kumaraditya303, and everybody else who helped triage, diagnose and fix the issues.

Thank you for your time!
Daniel

P.S.: As a thank you for reading this far, here’s a draft visualization of the temporal pattern of issue filing, in which it’s clear that success is clustered. The reason is that, after a while, all of fusil’s known tricks have paid off and stop uncovering new crashes, until a new feature is added and it starts finding new hits again.


23 Likes

Thanks a lot for your work! Lots of great bug reports!

5 Likes

Fusil has been very helpful! In particular, it’s really good at:

  • Finding free-threaded races in the standard library (or core, in some cases).
  • Catching rare edge-cases that we didn’t test for. This is especially helpful when the crash is a new regression.

I think it was definitely worth doing, and I’d like to see more of it!

Some big issues were definitely found through fuzzing. A good example is gh-126366, which revealed a 3.14 regression where an __iter__ that raised an exception would crash the interpreter when used with yield from. __iter__ doesn’t commonly raise exceptions, and even less so with yield from, so there was a very real chance that it could’ve slipped through the betas and landed in 3.14.0.

I think that the reports were generally fine. In fact, sometimes it’s actually less helpful to see reports with all sorts of confusing analysis on top of them (without a linked PR, that is).

No issue there either.

I think the most useful thing that could be done is to make sure that the reproducer is as small as it can be. There were a few times when the original repro used something in the standard library, but the bug turned out to be in builtins or an extension module.

IMO, that approach was the best way to do it. I don’t think people commonly get notifications from single messages there, so it’s a good way to get some solid triaging done without flooding inboxes.

3 Likes

First of all, this is really impressive work. I didn’t know people were fuzzing the JIT (I knew you were using fusil on free-threading, but didn’t know there was a whole matrix of architectures).

One question out of curiosity that’s not really addressed in the report: Why/how is fusil so effective? There have been other projects fuzzing CPython and I don’t recall them having this much success.

1 Like

Thank you! It has only found one JIT bug so far; getting it to properly exercise the JIT and related areas is something I’d love help with.

It’s all thanks to Victor’s vision and execution of a great framework for creating fuzzers. Fusil’s original design for the Python fuzzer found at least 5 release blockers back in 2008, and many more Python issues.

There were many other projects where fusil was effective.

However, once you fuzz a lot, new interesting issues become harder to find. And so fusil went dormant for a long time, during which CPython came to accumulate more issues of the kind fusil is good for finding.

Then I got fusil back into action and it performed very well, again. I blame Victor :slight_smile:.

6 Likes

Thank you very very much, a lot of incredibly valuable points and considerations!

I’ll try to teach fusil new tricks so we can have more fun with it :smiley:

Thank you!

Thanks for your interest in fuzzing the JIT!

One way to stress-test the JIT a bit harder is to lower the warmup thresholds before you build (they’re hardcoded in the source). There are a few knobs that are currently tuned for performance, but can be cranked in one direction or another to be more aggressive.

Maybe try:

  • Decreasing JUMP_BACKWARD_INITIAL_VALUE, JUMP_BACKWARD_INITIAL_BACKOFF, SIDE_EXIT_INITIAL_VALUE, and SIDE_EXIT_INITIAL_BACKOFF. These control the “warmup” period before code is compiled. Maybe set them to something more like 63, 6, 63, and 6 respectively, to compile any loop or side exit that runs 64 times.
  • Increasing MAX_CHAIN_DEPTH, which allows the JIT to compile many versions of highly polymorphic code. Careful though, this needs to fit in a small-ish bitfield.
  • Increasing UOP_MAX_TRACE_LENGTH and TRACE_STACK_SIZE will allow the JIT to handle larger regions of code and more inlined calls, respectively. Careful, these both control stack allocations, so don’t go too crazy.
  • MAX_ABSTRACT_INTERP_SIZE will allow the JIT’s optimizer to do more stuff with these longer traces. Again, this controls a stack allocation, so be careful.
  • Changing JIT_CLEANUP_THRESHOLD: lower values will throw away more cold code, higher values will keep it around longer.

One key thing to keep in mind is that, right now, the JIT only kicks in for loopy code. One way to force more code to be compiled is to just wrap the fuzzer’s code in a loop. You can confirm that JIT compilation is happening by setting the PYTHON_LLTRACE environment variable on debug builds. A value of 1 will print a line whenever the JIT tries to compile something, while higher values will result in more verbose output about what is being compiled and how it’s being optimized.
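
For example, here’s a minimal, untested sketch of that “wrap it in a loop” idea (my own illustration, not part of fusil; GENERATED_SNIPPET, run_hot, and the warmup count are made up for the example):

    import textwrap

    # A hand-written snippet stands in for whatever the fuzzer generates.
    GENERATED_SNIPPET = textwrap.dedent("""
        total = 0
        for i in range(100):
            total += i * i
    """)

    def run_hot(snippet, warmup=5000):
        # Compile once so the same code object is re-executed; its loops can
        # then accumulate enough "warmth" to cross the JIT's thresholds.
        code = compile(snippet, "<fuzz>", "exec")
        for _ in range(warmup):
            exec(code, {})

    # On a debug JIT build, run this with PYTHON_LLTRACE=1 to see a line
    # printed whenever the JIT tries to compile something.
    run_hot(GENERATED_SNIPPET)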

If you want even more details, here’s a non-exhaustive list of the sorts of code patterns that the JIT tries to optimize most aggressively (I already had it lying around for another group that’s fuzzing the JIT):

  • Attribute loads:
    • x.<a>, where:
      • <a> is an instance attribute, a class attribute, a property, or a method.
    • <a>.x, where:
      • <a> is a module or a Python class.
  • Building collections:
    • [<a, ...>], (<a, ...>), or {<a, ...>}, where:
      • <a> is a comma-separated listing of items.
    • {<a, ...>}, where:
      • <a> is a comma-separated listing of <key>: <value> pairs.
    • f"{x}{y}" or w[x:y:z].
  • Calls:
    • <a>(), where:
      • <a> is a Python class or a Python function.
    • <a>(x), where:
      • <a> is a Python class, a Python function, the type builtin, the str builtin, or the tuple builtin.
    • <a>(<b, ...>), where:
      • <a> is a Python class or a Python function.
      • <b> is a comma-separated listing of items and/or <name>=<value> pairs.
    • x.<a>(), where:
      • <a> is a Python function.
    • x.<a>(<b, ...>), where:
      • <a> is a Python function.
      • <b> is a comma-separated listing of items and/or <name>=<value> pairs.
    • <a>.append(x), where:
      • <a>'s type is list.
  • Containment checks:
    • x in <a> or x not in <a>, where:
      • <a>'s type is dict or set.
  • Iteration:
    • for x in <a> or x for y in <a>, where:
      • <a>'s type is generator, list, range, or tuple.
  • Math:
    • <a> <op> <b>, where:
      • <op> is +, +=, ==, or !=.
      • <a>'s type is float, int, or str.
      • <b>'s type is the same as <a>.
    • <a> <op> <b>, where:
      • <op> is -, -=, *, *=, <, <=, >, or >=.
      • <a>'s type is float or int.
      • <b>'s type is the same as <a>.
  • Names:
    • <a>, where:
      • <a> is looked up from either the global or builtin namespace.
  • Returns:
    • return, yield, return x, yield x, or yield from x.
  • Subscripts:
    • <a>[<b>], where:
      • <a>'s type is str or tuple.
      • <b>'s type is int.
    • <a>[x], where:
      • <a>'s type is a Python class defining __getitem__.
  • Truth tests:
    • not <a>, if <a>, if not <a>, elif <a>, elif not <a>, while <a>, while not <a>, x if <a> else y, x if not <a> else y, <a> and x, not <a> and x, <a> or x, or not <a> or x, where:
      • <a>'s type is bool, int, list, str, or None.
    • x is None, x is not None, if x is None, if x is not None, elif x is None, elif x is not None, while x is None, while x is not None, x if y is None else z, x if y is not None else z, x is None and y, x is not None and y, x is None or y, or x is not None or y.
  • Unpacking collections:
    • <a, ...> = <b>, where:
      • <a> is a comma-separated listing of names.
      • <b>'s type is list or tuple.

So, for example, if i < 42: yield str(foo)[0] would exercise math (on ints), truth tests (on a bool), calls (of str), subscripts (of str), and returns.
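
To string several of those patterns together, something like this untested sketch (the Point and hot_loop names are just made up for illustration) wraps them in loops so the JIT will actually consider compiling them:

    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y

        def norm2(self):
            # Math on ints via instance attribute loads.
            return self.x * self.x + self.y * self.y

    def hot_loop(n):
        seen = {0, 1}                            # building a set
        pairs = [(i, i + 1) for i in range(n)]   # iterating a range, building a list of tuples
        total = 0
        for a, b in pairs:                       # iteration + unpacking tuples
            p = Point(a, b)                      # calling a Python class
            total += p.norm2()                   # attribute load + method call
            if a in seen:                        # containment check on a set
                continue
            text = f"{a}{b}"                     # f-string
            if text[0] == "1":                   # subscript of str + comparison + truth test
                total -= 1                       # math on ints
        return total                             # return

    result = 0
    for _ in range(1_000):                       # outer loop so everything gets hot
        result += hot_loop(100)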

A lot of edge cases and previous bugs that may be interesting are tested in Lib/test/test_capi/test_opt.py, too.

Hopefully this helps!

4 Likes

Great work and interesting results and report. Thanks!

1 Like

Wow, thank you very much! This answers so many questions I’ve wanted to ask for so long!

It will take some time before I’m able to incorporate these patterns into the code fusil generates, but it’s much clearer how to tackle stressing the JIT now. I’ll do my best to get it running early in the beta cycle.

2 Likes

Thank you for the feedback and for helping fix issues!

1 Like

This is interesting work. I’m currently a Master’s student working with my professor and a PhD student on fuzzing CPython, mainly the JIT compiler, using AFL++. It would be cool if we could connect and hopefully work together. Feel free to send me an email at hsahib@uci.edu if you are interested.

2 Likes

Greetings,

The technical report on fuzzing CPython with fusil is nearly finished; thank you all for the contributions!

One obvious bit of missing information is: how many of the issues were trivial, unimportant, useful, important, severe, critical etc. (real qualitative buckets TBD)?

Does anyone here have the interest (and free time) to go through the reported issues and classify them by relevance/severity/importance/any-better-name-for-impact? Do you know anyone who would be able and willing to do that?

Another kind of help I’d appreciate is if you could read/skim the report and offer any suggestions, criticisms, notes about missing information, or heckling you can. It’s a dry, monotone text, so I understand if you’re not available to suffer through it :smile:.

And as a final note, I don’t intend to publish the report and/or data beyond leaving them in a public GitHub repository. But if you want to create anything based on it, be it a blog post, a publication or a resentful rant, I’d be thrilled and happy to help :smile:.

Edit: oh, also, as a final final note, if you’d like to answer the questions from the first post in this thread, your input can still be incorporated into the report and will be appreciated.

Daniel

P.S.: As a thank you for reading this far, here’s an excerpt of the data presentation on issues and PRs (that “unknown” should be categorized soon), and I take this opportunity to again thank all those involved in this effort.

Kind                   Number of Issues
Segfault/Crash         23
Abort/AssertionError   22
SystemError             2
Fatal Python Error      2
Unknown                 3
Total                  52

Even though abort issues only affect debug builds directly, in many cases they point to causes that would also create problems in release builds. Segfault issues sound more serious, but some were very shallow crashes in seldom-used corners of CPython’s standard library.

Configuration    Number of Issues
Debug            19
Free-Threaded    18
Release          14
JIT               1
Total            52

The high number of issues resulting in aborts, and the fact that most segfaults also reproduce on them, make debug builds the most fruitful configuration, followed by free-threaded builds.

GitHub User       Issues Involved
vstinner          14
sobolevn          12
ZeroIntensity      9
picnixz            4
markshannon        3
kumaraditya303     3
tom-pytel          2
erlend-aasland     1
skirpichev         1
ritvikpasham       1
Fidget-Spinner     1
JelleZijlstra      1
LindaSummer        1
freakboy3742       1
devdanzin          1
gaogaotiantian     1
tomasr8            1
dura0ok            1

A significant number of CPython core developers and contributors engaged with the issues identified by this fuzzing campaign. In total, 18 unique developers were recorded as authoring pull requests to address these findings. Many other core developers and contributors helped triage, discuss and fix the issues, but this data isn’t recorded in the results.
The developer involved in addressing the highest number of distinct issues was Victor Stinner (vstinner), who contributed to fixing 14 of the reported bugs. Nikita Sobolev (sobolevn) was the next most involved, with contributions to 12 different issues. This broad participation underscores the community’s collaborative effort in improving CPython’s robustness.

The 52 reported issues resulted in considerable development activity and a total of 98 distinct pull requests (PRs) aimed at addressing these defects. This effort involved 18 unique developers who were listed as authors on these PRs.

Regarding individual contributions, Victor Stinner (vstinner) was associated with 41 of these PRs, representing involvement in approximately 41.8% of the total PRs stemming from this fuzzing campaign. Nikita Sobolev (sobolevn) also played a significant role, with involvement in 26 PRs (approximately 26.5%). Table Y provides a breakdown of PR associations per author.

GitHub User          PRs Associated With   Involvement Share (of 98 PRs)
vstinner             41                    41.8%
sobolevn             26                    26.5%
ZeroIntensity        16                    16.3%
picnixz               7                     7.1%
kumaraditya303        5                     5.1%
erlend-aasland        3                     3.1%
markshannon           3                     3.1%
ritvikpasham          3                     3.1%
JelleZijlstra         2                     2.0%
devdanzin             2                     2.0%
gaogaotiantian        2                     2.0%
tom-pytel             2                     2.0%
Fidget-Spinner        1                     1.0%
LindaSummer           1                     1.0%
dura0ok               1                     1.0%
freakboy3742          1                     1.0%
skirpichev            1                     1.0%
tomasr8               1                     1.0%
Total Unique PRs     98
Total Associations  118

Oh boy, I just saw that the number of unique PR authors appears in two paragraphs. Adding another todo to the list :sweat_smile:

8 Likes