Report of buildbots and GitHub Actions status: September 2023

Hi,

Over the last few weeks, I fixed a bunch of unstable tests and test failures. Right now, there are only 3 failing buildbot workers on the main branch (trust me, that’s low!):

Screenshot 2023-09-14 at 10-50-06 Python Release Status

It’s rare, so I took a screenshot! :slight_smile: Over the last month, there were always 10 to 15 failing buildbot workers, which made buildbots way less efficient at detecting new regressions.

FreeBSD got a lot of love: many tests failing on FreeBSD have been fixed, and a new buildbot worker and a new GitHub Actions job (Cirrus CI FreeBSD) were added.

I fixed the super annoying test_concurrent_futures “hang” failure in test_deadlock: it was a real bug in multiprocessing on Windows. This bug blocked merging pull requests, since it became very likely to fail on the GitHub Actions Windows x64 job (fail, then fail again when re-run in verbose mode in a fresh process!).

I fixed test_cppext (which tests the C API with a C++ compiler) by omitting the -std=c11 option when running the C++ compiler.
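The idea can be sketched as filtering C-only flags out of the flags passed to the C++ compiler. This is an illustrative sketch, not the actual test_cppext implementation; the function name is made up:

```python
def strip_c_only_flags(cflags: str) -> str:
    """Drop C-only -std flags (like -std=c11) that C++ compilers reject,
    while keeping -std=c++* flags."""
    return " ".join(
        flag for flag in cflags.split()
        if not flag.startswith("-std=c") or flag.startswith("-std=c++")
    )

print(strip_c_only_flags("-O2 -std=c11 -Wall"))  # -O2 -Wall
```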

I fixed all known test_gdb failures, mostly by skipping tests when gdb is known to be unable to retrieve the traceback or information needed by the test. Python doesn’t control how much the C compiler optimizes the code, and gdb is likely to fail to retrieve information if the code is optimized at any level higher than no optimization at all (-O0).
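A skip guard along these lines can detect an optimized build from the compiler flags Python was built with. This is a simplified sketch, similar in spirit to test.support.python_is_optimized(); the class name is illustrative:

```python
import sysconfig
import unittest

def python_is_optimized() -> bool:
    # Heuristic: look at the last -O flag used to compile Python.
    cflags = sysconfig.get_config_var("PY_CFLAGS") or ""
    final_opt = ""
    for opt in cflags.split():
        if opt.startswith("-O"):
            final_opt = opt
    return final_opt not in ("", "-O0", "-Og")

@unittest.skipIf(python_is_optimized(),
                 "gdb may fail to read frames from optimized code")
class GdbFrameTests(unittest.TestCase):
    ...
```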

As usual, The Night’s Watch is looking for volunteers to help fix the remaining unstable tests and the new, funny CI issues that keep coming in. Contact me if you’re interested.



I reworked (lib)regrtest, the Python test runner, to make the code easier to maintain, and I made a few enhancements:

  • Failing tests are now re-run in a fresh process. This helps make tests more deterministic. Since some tests (like test_sys) pass in this case, I added --fail-rerun, which makes regrtest exit with exit code 5 when a test failed and then passed on re-run. This option is now used on buildbots (on the main branch) to mark such a build as “warnings” (orange).

  • Before, sometimes no tests were run when a failing test was re-run, yet the build was declared a success even though no tests actually ran! I fixed this bug.

  • random.seed(random_seed) is now called before running each test file when the -r option is used, to make tests more deterministic (easier to reproduce).

  • When the -j option is used (run tests in multiple worker processes), regrtest no longer spawns more workers than the number of tests to execute, to avoid wasting resources.
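The enhancements above can be sketched as follows. This is an illustrative simplification, not actual libregrtest code; the function names and the EXITCODE_RERUN_FAIL constant name are made up (only the exit code value 5 comes from the text):

```python
import random
import subprocess
import sys

EXITCODE_RERUN_FAIL = 5  # exit code used with --fail-rerun

def rerun_failed_tests(failed: list[str], fail_rerun: bool) -> int:
    """Re-run each failed test file in a fresh process."""
    still_failing = []
    for test in failed:
        # A fresh process avoids state leaked by earlier tests.
        proc = subprocess.run([sys.executable, "-m", "test", "-v", test])
        if proc.returncode != 0:
            still_failing.append(test)
    if still_failing:
        return 1
    # All re-run tests passed: report the "warnings" status if requested.
    return EXITCODE_RERUN_FAIL if fail_rerun else 0

def worker_count(requested: int, num_tests: int) -> int:
    # Don't spawn more workers than there are tests to run.
    return min(requested, num_tests)

def setup_test(random_seed: int) -> None:
    # With -r, reseed before each test file so runs are reproducible.
    random.seed(random_seed)
```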


Here are my commits related to tests since August 16th.

Fix tests:

  • e35c722d22 gh-106659: Fix test_embed.test_forced_io_encoding() on Windows (#108010)
  • 531930f47f Fix test_generators: save/restore warnings filters (#108246)
  • 58f9c63500 Fix test_faulthandler for sanitizers (#108245)
  • 9173b2bbe1 gh-105776: Fix test_cppext when CC contains -std=c11 option (#108343)
  • fa6933e035 gh-107211: Fix test_peg_generator (#108435)
  • 83e191ba76 test_sys: remove debug print() (#108642)
  • f59c66e8c8 gh-108297: Remove test_crashers (#108690)
  • 23f54c1200 Make test_fcntl quiet (#108758)
  • cd2ef21b07 gh-108962: Skip test_tempfile.test_flags() if not supported (#108964)
  • fbce43a251 gh-91960: Skip test_gdb if gdb cannot retrive Python frames (#108999)
  • 8ff1142578 gh-108851: Fix tomllib recursion tests (#108853)
  • a52a350977 gh-109015: Add test.support.socket_helper.tcp_blackhole() (#109016)
  • 5b7303e265 gh-109162: Refactor Regrtest.main() (#109163)
  • ac8409b38b gh-109162: Regrtest copies ‘ns’ attributes (#109168)
  • 2fafc3d5c6 gh-108996: Skip broken test_msvcrt for now (#109169)
  • cbb3a6f8ad gh-109237: Fix test_site for non-ASCII working directory (#109238)
  • e55aab9578 gh-109230: test_pyexpat no longer depends on the current directory (#109233)
  • a9b1f84790 gh-107219: Fix concurrent.futures terminate_broken() (#109244)
  • 517cd82ea7 gh-108987: Fix _thread.start_new_thread() race condition (#109135)
  • 09ea4b8706 gh-109295: Clean up multiprocessing in test_asyncio and test_compileall (#109298)
  • 7dedfd36dc gh-109295: Fix test_os.test_access_denied() for TEMP=cwd (#109299)
  • 9363769161 gh-109295: Skip test_generated_cases if different mount drives (#109308)
  • 388d91cd47 gh-109357: Fix test_monitoring.test_gh108976() (#109358)
  • 44d9a71ea2 gh-104736: Fix test_gdb tests on ppc64le with clang (#109360)

Enhance tests:

  • a541e01537 gh-90791: Enable test___all__ on ASAN build (#108286)
  • 174e9da083 gh-108388: regrtest splits test_asyncio package (#108393)
  • aa9a359ca2 gh-108388: Split test_multiprocessing_spawn (#108396)
  • aa6f787faa gh-108388: Convert test_concurrent_futures to package (#108401)

Misc test changes:

  • 7a6cc3eb66 test_peg_generator and test_freeze require cpu (#108386)
  • 4f9b706c6f gh-108794: doctest counts skipped tests (#108795)

Big work on regrtest, refactoring, new features:

  • d4e534cbb3 regrtest computes statistics (#108793)
  • 31c2945f14 gh-108834: regrtest reruns failed tests in subprocesses (#108839)
  • 1170d5a292 gh-108834: regrtest --fail-rerun exits with code 5 (#108896)
  • 489ca0acf0 gh-109162: Refactor Regrtest.action_run_tests() (#109170)
  • a56c928756 gh-109162: Refactor libregrtest WorkerJob (#109171)
  • e9e2ca7a7b gh-109162: Refactor libregrtest.runtest (#109172)
  • e21c89f984 gh-109162: Refactor libregrtest.RunTests (#109177)
  • 24fa8f2046 gh-109162: libregrtest: fix _decode_worker_job() (#109202)
  • 0c0f254230 gh-109162: libregrtest: remove WorkerJob class (#109204)
  • 0553fdfe30 gh-109162: Refactor libregrtest.runtest_mp (#109205)
  • a341750078 gh-109162: Refactor libregrtest.Regrtest (#109206)
  • db5bfe91f8 gh-109162: libregrtest: add TestResults class (#109208)
  • 0eab2427b1 gh-109162: libregrtest: add Logger class (#109212)
  • a939b65aa6 gh-109162: libregrtest: add worker.py (#109229)
  • 1ec45378e9 gh-109162: libregrtest: add single.py and result.py (#109243)
  • 0b6b05391b gh-109162: libregrtest: fix Logger (#109246)
  • 0c139b5f2f gh-109162: libregrtest: rename runtest_mp.py to run_workers.py (#109248)
  • 7aa8fcc8eb gh-109162: libregrtest: use relative imports (#109250)
  • c439f6a72d gh-109162: libregrtest: move code around (#109253)
  • de5f8f7d13 gh-109276: libregrtest: use separated file for JSON (#109277)
  • 4e77645986 gh-109276: libregrtest only checks saved_test_environment() once (#109278)
  • a84cb74d42 gh-109276: libregrtest calls random.seed() before each test (#109279)
  • 8c813faf86 gh-109276: libregrtest: limit number workers (#109288)
  • d13f782a18 gh-109276: libregrtest: fix worker working dir (#109313)
  • 75cdd9a904 gh-109276: libregrtest: WASM use filename for JSON (#109340)
  • 715f663258 gh-109276: libregrtest: WASM use stdout for JSON (#109355)
  • b544c2b135 gh-109276: libregrtest: fix work dir on WASI (#109356)

Enhance test.pythoninfo, collect more data:

  • babdced23f test.pythoninfo logs freedesktop_os_release() (#109057)
  • df4f0fe203 gh-109276: Complete test.pythoninfo (#109312)
  • d12b3e3152 gh-109276: test.pythoninfo gets more test.support data (#109337)

Victor

Night gathers, and now my watch begins. It shall not end until my death.

23 Likes

You may want to have a look at the list of open issues about failing and unstable tests. Some examples:

3 Likes

Thank you, Victor! This is important work.

3 Likes

If you are at the Core Dev sprint, I would be game to get more involved with this.

1 Like

I recently set up a new buildbot worker (Ubuntu 22.04). Initially it was consistently failing on test_gdb, with errors related to a failure to get frames after setting a breakpoint, or something of that sort.
For some reason, this was happening only with the buildbot user that was auto-created by the Ubuntu buildbot installer. I switched the worker over to a “regular” user and it was resolved.
Is this an issue we want to fix? (I didn’t investigate further why the buildbot user ran into these issues.)

On my side, I would like to take a break from test_gdb; I’m kind of fed up :rofl:. If you consider it an issue that should be fixed, and you can even propose a fix, please go ahead. I didn’t understand the “user auto-created” part; you should explain it in the issue.

For comparison, a few hours later, 10 buildbot workers are failing…

Perhaps that user had some compiler-specific environment variables set, such as CFLAGS? That could have led to Python being compiled with some optimizations enabled.

I forgot to say, but thank you for this :slight_smile:

1 Like

Obviously with multiprocessing, there are always new issues. @storchaka is working on Unexpected traceback output in test_concurrent_futures and crash · Issue #109370 · python/cpython · GitHub and gh-109370: Support closing Connection and PipeConnection from other thread by serhiy-storchaka · Pull Request #109397 · python/cpython · GitHub to fix more cases related to this fix.

I made some changes to unify how tests are run on buildbots, on GitHub Actions, and in make test and make buildbottest.

I added --fast-ci and --slow-ci options, which set:

  • Python options: -u -W default -bb -E
  • regrtest options: -j0 --randomize --fail-env-changed --fail-rerun --rerun --slowest --verbose3 --nowindows
  • test resources: all,-cpu for fast, all for slow
  • timeout: 10 min for fast, 20 min for slow
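The option sets above can be summarized in a small table-building sketch. The values are taken from the lists above; the function name is illustrative, and this is not the actual libregrtest implementation:

```python
def ci_options(fast: bool) -> dict:
    """Options implied by --fast-ci (fast=True) or --slow-ci (fast=False)."""
    return {
        "python_opts": ["-u", "-W", "default", "-bb", "-E"],
        "regrtest_opts": ["-j0", "--randomize", "--fail-env-changed",
                          "--fail-rerun", "--rerun", "--slowest",
                          "--verbose3", "--nowindows"],
        "resources": "all,-cpu" if fast else "all",
        "timeout_minutes": 10 if fast else 20,
    }
```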

I removed the now-redundant options.

You can run tests with:

  • With --fast-ci:

    • ./python -m test --fast-ci (...)
    • make test
    • make test TESTOPTS="..." TESTTIMEOUT=seconds TESTPYTHONOPTS="..."
  • With --slow-ci:

    • ./python -m test --slow-ci (...)
    • make buildbottest
    • make buildbottest TESTOPTS="..." TESTTIMEOUT=seconds TESTPYTHONOPTS="..."

Over the last few days, I fixed many tests with the help of others, but there are still many unstable tests.

3 Likes

Recently, I enabled --fail-rerun on buildbots and GitHub Actions jobs: any test failure now marks the whole build as a failure. Previously, if a test failed (FAILURE) and then passed (SUCCESS) when re-run in verbose mode, the build was marked as a SUCCESS (later, on buildbots, I marked such builds as WARNINGS using --fail-rerun).

The change was motivated by the fact that each failed test file is now re-run in a fresh process, which makes it more likely to pass on re-run. So there was a higher risk of missing most unstable tests.

With --fail-rerun, I discovered many unstable tests. Like, really, a lot! I fixed, I don’t know, maybe 20 to 50 tests, and others helped me fix many unstable tests and review my changes (thanks!). Every time I fixed one unstable test, buildbots and GHA failed with 2 to 5 “new” unstable tests. Not really “new”: I had already seen many of them over the last 5 years, but I ignored them since they failed rarely, and the whole build was marked as “success”, so it wasn’t a big deal.

Screenshot of a mostly happy buildbot, only 3 failures:

Screenshot 2023-09-27 at 15-16-51 Python Release Status

Well, the work is not done; there are still unstable tests. But the number is decreasing, “slowly”.

10 Likes

Victor, this might be some of the most impactful work for our own sanity that anyone has contributed recently. Bravo!

10 Likes

Aaaaaaaand, here you have:

Screenshot 2023-09-28 at 15-57-13 Python Release Status

The test suite passes on all buildbots. OK, so now the buildbots will become useful again for detecting regressions, instead of just reporting flaky tests.

I would prefer to keep --fail-rerun on CIs: mark a whole build as failed if a test failed, even if the test passes when re-run in verbose mode in a fresh process. In the past, we tolerated flaky tests, and over the years we accumulated many unstable tests. If you were lucky, most of the time you would never notice, unless you read every CI build log carefully. Sometimes a CI job turned red, but well, just re-scheduling a new job would turn it green again.

--fail-rerun is not the default: by default, with --rerun, we tolerate tests that fail and then pass. For example, if a Linux distribution runs the Python test suite, it’s OK to have some remaining flaky tests (fail then pass). It’s not the role of the Linux distribution to fix Python (though it should report issues upstream). For me, it’s the job of the Python CI to detect unstable tests.

8 Likes

You’re welcome.

Well, while most bugs were fixed in tests, I also fixed some legit bugs in the stdlib. Examples:

Tests are good at uncovering corner cases and tricky timing issues. I fixed a bunch of tests which failed on Windows. While Windows and FreeBSD are good at triggering some bugs, usually, if I insist, I can reproduce the issues on other platforms; it’s just that the bugs are harder to reproduce on operating systems like Linux because of “different timings”.

5 Likes