Why does CPython use -O3?

Recently I found out that CPython uses -O3 optimization by default. I wonder if there is any reason for this.
This level of optimization allows experimental optimizations that tend to make the binary too huge for CPU cache and therefore slower in many cases (loop unrolling, etc.).
I suggest switching back to -O2, which contains only well-tested optimizations.
GCC developers move optimizations from -O3 to -O2 anyway if they decide that they are generally beneficial.

Some reading on this topic: https://wiki.gentoo.org/wiki/GCC_optimization#-O

Would you compare pyperformance with -O2 and -O3?
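For reference, such a comparison could look roughly like this (a sketch assuming two separate builds and pyperformance installed; the file names are illustrative):

```shell
# Hypothetical workflow: benchmark one CPython tree built with -O2 and
# one built with -O3, writing results to JSON files.
./python-o2 -m pyperformance run -o o2.json
./python-o3 -m pyperformance run -o o3.json

# pyperf's compare_to reports the differences; -G groups results into
# faster/slower, and --min-speed=2 hides differences below 2% as noise.
python3 -m pyperf compare_to o2.json o3.json -G --min-speed=2
```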

This is the thing I was looking for.
I will try and test it on different machines and report the results.
Thank you.

Hi Marcel,

Victor Stinner might have some suggestions for you. I wouldn’t be surprised if he has already run benchmarks comparing -O2 to -O3. Please read this page from the pyperformance docs:

https://pyperformance.readthedocs.io/usage.html#how-to-get-stable-benchmarks

I think you must compile with the LTO/PGO options enabled. Performance without them is significantly worse.
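For a release-like build, that means roughly the following (these configure flags exist in CPython; --enable-optimizations turns on PGO, and the PGO step makes the build take considerably longer):

```shell
# Configure a release-style CPython build:
#   --enable-optimizations  -> profile-guided optimization (PGO)
#   --with-lto              -> link-time optimization (LTO)
./configure --enable-optimizations --with-lto

# Build in parallel; the PGO training run happens during make.
make -j"$(nproc)"
```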

Well, Marcel is on my team; we already discussed -O3 and I have no idea why -O3 is used instead of -O2 :slight_smile:

It seems like -O2 was replaced with -O3 with no explanation in 2001:

commit 9f71582c14b9651cad329da83d284a6a3773d309
Author: Fred Drake <fdrake@acm.org>
Date:   Wed Jul 11 06:27:00 2001 +0000

    Check for --with-pydebug earlier, and record the result.
    
    When setting up the basic OPT value for GCC, only use optimization if
    not using debugging mode.
    
    Fix a typo in a comment in the IPv6 check.

(...)
-           OPT="-O2 -Wall -Wstrict-prototypes";;
+           OPT="-O3 -Wall -Wstrict-prototypes";;
(...)

-O3 can be found in documentation on building Python on NeXT, written in 1996:

commit 566b35f1c367939dc25bd19bba3f65a348e730a0
Author: Guido van Rossum <guido@python.org>
Date:   Fri Sep 6 16:13:30 1996 +0000

    NEXT shared libs instructions

(...)
+make "OPT=-O3 -fschedule-insns2 -ObjC -arch m68k -arch i486"

Recent changes in .travis.yml:

commit 21cfae107e410bf4b0ab3c142ca4449bc33290f5
Author: Inada Naoki <songofacandy@gmail.com>
Date:   Fri Jun 28 02:05:37 2019 +0900

    bpo-30345: travis: use -Og with --with-pydebug (GH-14423)

diff --git a/.travis.yml b/.travis.yml
index 3e2cbb6fb0..cff401f5bc 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -166,7 +166,8 @@ install:
 
 # Travis provides only 2 cores, so don't overdo the parallelism and waste memory.
 before_script:
-  - ./configure --with-pydebug
+  # -Og is much faster than -O0
+  - CFLAGS="${CFLAGS} -Og" ./configure --with-pydebug

and

commit 667eaffb4e5d03bf8129773f79649c3befaa5b1a
Author: Jeroen Demeyer <J.Demeyer@UGent.be>
Date:   Thu Jun 27 13:17:44 2019 +0200

    bpo-33926: enable GDB tests on Travis CI (GH-14395)

diff --git a/.travis.yml b/.travis.yml
index addff77334..3e2cbb6fb0 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -14,8 +14,7 @@ env:
     - OPENSSL=1.1.1c
     - OPENSSL_DIR="$HOME/multissl/openssl/${OPENSSL}"
     - PATH="${OPENSSL_DIR}/bin:$PATH"
-    # Use -O3 because we don't use debugger on Travis-CI
-    - CFLAGS="-I${OPENSSL_DIR}/include -O3"
+    - CFLAGS="-I${OPENSSL_DIR}/include"
     - LDFLAGS="-L${OPENSSL_DIR}/lib"
     # Set rpath with env var instead of -Wl,-rpath linker flag
     # OpenSSL ignores LDFLAGS when linking bin/openssl
(...)

and

commit 8ff53564730f7ba503f5ad94418c309c48e8516d
Author: INADA Naoki <methane@users.noreply.github.com>
Date:   Sat Feb 10 20:35:17 2018 +0900

    travis: Use -O3 option (GH-5599)
    
    We don't use debugger on Travis.

diff --git a/.travis.yml b/.travis.yml
index 98d6b9a97b..dd0688717c 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -15,7 +15,8 @@ env:
     - OPENSSL=1.1.0g
     - OPENSSL_DIR="$HOME/multissl/openssl/${OPENSSL}"
     - PATH="${OPENSSL_DIR}/bin:$PATH"
-    - CFLAGS="-I${OPENSSL_DIR}/include"
+    # Use -O3 because we don't use debugger on Travis-CI
+    - CFLAGS="-I${OPENSSL_DIR}/include -O3"
     - LDFLAGS="-L${OPENSSL_DIR}/lib"
     # Set rpath with env var instead of -Wl,-rpath linker flag
     # OpenSSL ignores LDFLAGS when linking bin/openssl

In 2008, a user reported a crash on ARM when building with -O3, whereas it was fine with -O2:
https://bugs.python.org/issue4594#msg77315

Well, now we have ARM buildbots and they are fine with -O3.

I have to admit I never knew this about -O3 (that it’s unsafe).

It’s “unsafe” in the sense that it is more likely to hit compiler bugs and it’s more likely to do the wrong thing with code not strictly following standards.

But if there are no bugs in the code nor in the compiler, there is no problem using -O3. Note that GCC also has -Ofast which drops standards compliance, allowing even more optimizations.

Ubuntu 19.04, without PGO nor LTO:

$ ./python -m pyperf compare_to o2_nt.json o3_nt.json  -G --min-speed=2
Slower (2):
- unpickle_list: 5.51 us +- 0.04 us -> 5.95 us +- 0.14 us: 1.08x slower (+8%)
- genshi_text: 38.2 ms +- 0.5 ms -> 39.8 ms +- 0.3 ms: 1.04x slower (+4%)

Faster (30):
- nbody: 195 ms +- 1 ms -> 179 ms +- 2 ms: 1.09x faster (-8%)
- scimark_monte_carlo: 151 ms +- 2 ms -> 141 ms +- 3 ms: 1.07x faster (-7%)
- scimark_lu: 232 ms +- 1 ms -> 217 ms +- 4 ms: 1.07x faster (-7%)
- chaos: 170 ms +- 1 ms -> 160 ms +- 1 ms: 1.06x faster (-6%)
- logging_silent: 272 ns +- 7 ns -> 258 ns +- 4 ns: 1.06x faster (-5%)
- float: 164 ms +- 1 ms -> 156 ms +- 1 ms: 1.05x faster (-5%)
- raytrace: 773 ms +- 22 ms -> 734 ms +- 17 ms: 1.05x faster (-5%)
- unpickle: 20.5 us +- 1.3 us -> 19.6 us +- 0.4 us: 1.05x faster (-4%)
- richards: 106 ms +- 1 ms -> 102 ms +- 3 ms: 1.04x faster (-4%)
- json_loads: 39.8 us +- 2.6 us -> 38.2 us +- 0.7 us: 1.04x faster (-4%)
- nqueens: 138 ms +- 1 ms -> 133 ms +- 1 ms: 1.04x faster (-4%)
- regex_compile: 246 ms +- 6 ms -> 237 ms +- 4 ms: 1.04x faster (-4%)
- json_dumps: 17.6 ms +- 0.6 ms -> 16.9 ms +- 0.2 ms: 1.04x faster (-4%)
- scimark_sor: 283 ms +- 2 ms -> 272 ms +- 3 ms: 1.04x faster (-4%)
- unpack_sequence: 64.6 ns +- 0.6 ns -> 62.3 ns +- 0.9 ns: 1.04x faster (-4%)
- scimark_fft: 518 ms +- 3 ms -> 502 ms +- 2 ms: 1.03x faster (-3%)
- deltablue: 10.7 ms +- 0.1 ms -> 10.4 ms +- 0.2 ms: 1.03x faster (-3%)
- xml_etree_generate: 133 ms +- 1 ms -> 129 ms +- 2 ms: 1.03x faster (-3%)
- unpickle_pure_python: 470 us +- 3 us -> 456 us +- 4 us: 1.03x faster (-3%)
- logging_simple: 13.3 us +- 0.2 us -> 12.9 us +- 0.2 us: 1.03x faster (-3%)
- xml_etree_process: 105 ms +- 1 ms -> 102 ms +- 1 ms: 1.03x faster (-3%)
- mako: 23.8 ms +- 0.1 ms -> 23.2 ms +- 0.1 ms: 1.03x faster (-3%)
- pickle_pure_python: 697 us +- 39 us -> 680 us +- 47 us: 1.03x faster (-3%)
- logging_format: 14.8 us +- 0.2 us -> 14.4 us +- 0.3 us: 1.02x faster (-2%)
- hexiom: 14.2 ms +- 0.0 ms -> 13.8 ms +- 0.1 ms: 1.02x faster (-2%)
- xml_etree_iterparse: 133 ms +- 1 ms -> 130 ms +- 2 ms: 1.02x faster (-2%)
- regex_v8: 28.3 ms +- 0.2 ms -> 27.6 ms +- 0.3 ms: 1.02x faster (-2%)
- regex_dna: 218 ms +- 2 ms -> 213 ms +- 1 ms: 1.02x faster (-2%)
- telco: 9.29 ms +- 0.25 ms -> 9.09 ms +- 0.16 ms: 1.02x faster (-2%)
- go: 363 ms +- 3 ms -> 355 ms +- 4 ms: 1.02x faster (-2%)

Benchmark hidden because not significant (28):  2to3, chameleon, crypto_pyaes, django_template, dulwich_log, fannkuch, genshi_xml, html5lib, meteor_contest, pathlib, pickle, pickle_dict, pickle_list, pidigits, python_startup, python_startup_no_site, regex_effbot, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tornado_http, xml_etree_parse

@methane This shows -O3 faster than -O2, right?

I have never read that before (“experimental optimizations that tend to make the binary too huge for CPU cache and therefore slower in many cases”). I don’t think that’s the case. -O3 optimizations are not really experimental; they are just potentially costly.

Yes, -O3 is faster on 30 benchmarks.

And this is binary size comparison:

$ ll python-o*
-rwxr-xr-x 1 inada-n inada-n 16640128 Jul  2 20:14 python-o2*
-rwxr-xr-x 1 inada-n inada-n 19870480 Jul  2 17:05 python-o3*

$ size python-o2 python-o3
   text    data     bss     dec     hex filename
2891752  244600  133056 3269408  31e320 python-o2
3255084  244304  133056 3632444  376d3c python-o3

I am still planning to run the benchmarks.
This is not a high priority for me and I don’t think anyone should put too much effort into it, as no real issues seem to emerge from -O3 and so far it indeed seems faster.
If you have spare CPU cycles on your machines, feel free to post your results for -O3 and -O2 with PGO and LTO enabled, since Python is distributed with these flags enabled. Hardware info is also welcome ($ lscpu, # dmidecode --type 17).
My machine is currently never idle, so the benchmark results could be affected.

That’s a real issue. I recall working on some code which was fastest when compiled with -Os (faster than -O2 and -O3).

You left out the word “experimental” that I was responding to.

I recall working on some code which was fastest when compiled with -Os (faster than -O2 and -O3).

Optimizations can do better on some programs and worse on others. Here we are focusing on CPython.

I left out the word “experimental” precisely because I wanted to comment on the size only.

It was not clear to me that you were replying only to the “experimental” part. In that case, I agree with you :slight_smile: