Recently I found out that CPython is built with -O3 by default. I wonder if there is any reason for this.
This level of optimization allows experimental optimizations that tend to make the binary too huge for CPU cache and therefore slower in many cases (unrolling of for loops, etc.).
I suggest switching back to -O2, where all the tested optimizations are.
GCC developers move optimizations from -O3 to -O2 anyway if they decide that they are generally beneficial.
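For anyone who wants to check which level their own interpreter was built with, here is a minimal sketch using the standard `sysconfig` module. The helper names (`extract_opt_levels`, `build_opt_levels`) are made up for illustration, and the variables it reads (`OPT`, `CFLAGS`, `PY_CFLAGS`) are only recorded on Makefile-based builds, so expect empty results on Windows.

```python
import re
import sysconfig

# Matches GCC/Clang-style optimization flags such as -O2, -O3, -Og, -Os, -Ofast.
_OPT_FLAG = re.compile(r"(?<!\S)-O(?:[0-3sgz]|fast)?(?!\S)")


def extract_opt_levels(flags):
    """Return every -O* flag found in a compiler flag string, in order."""
    return _OPT_FLAG.findall(flags or "")


def build_opt_levels():
    """Collect the -O* flags recorded for the running interpreter's build."""
    found = {}
    for name in ("OPT", "CFLAGS", "PY_CFLAGS"):
        # get_config_var returns None when the variable was not recorded.
        value = sysconfig.get_config_var(name)
        found[name] = extract_opt_levels(value)
    return found


if __name__ == "__main__":
    for name, levels in build_opt_levels().items():
        print(f"{name}: {levels}")
```

If several -O flags show up in one variable, GCC uses the last one on the command line, so that is the one to look at.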
Victor Stinner might have some suggestions for you. I wouldn’t be surprised if he has already run benchmarks comparing -O2 to -O3. Please read this page from the pyperformance docs:
Well, Marcel is on my team. We already discussed -O3, and I have no idea why -O3 is used instead of -O2.
It seems like -O2 was replaced with -O3 with no explanation in 2001:
commit 9f71582c14b9651cad329da83d284a6a3773d309
Author: Fred Drake <fdrake@acm.org>
Date: Wed Jul 11 06:27:00 2001 +0000
Check for --with-pydebug earlier, and record the result.
When setting up the basic OPT value for GCC, only use optimization if
not using debugging mode.
Fix a typo in a comment in the IPv6 check.
(...)
- OPT="-O2 -Wall -Wstrict-prototypes";;
+ OPT="-O3 -Wall -Wstrict-prototypes";;
(...)
-O3 can also be found in the documentation for building Python on NeXT, written in 1996:
commit 21cfae107e410bf4b0ab3c142ca4449bc33290f5
Author: Inada Naoki <songofacandy@gmail.com>
Date: Fri Jun 28 02:05:37 2019 +0900
bpo-30345: travis: use -Og with --with-pydebug (GH-14423)
diff --git a/.travis.yml b/.travis.yml
index 3e2cbb6fb0..cff401f5bc 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -166,7 +166,8 @@ install:
# Travis provides only 2 cores, so don't overdo the parallelism and waste memory.
before_script:
- - ./configure --with-pydebug
+ # -Og is much faster than -O0
+ - CFLAGS="${CFLAGS} -Og" ./configure --with-pydebug
and
commit 667eaffb4e5d03bf8129773f79649c3befaa5b1a
Author: Jeroen Demeyer <J.Demeyer@UGent.be>
Date: Thu Jun 27 13:17:44 2019 +0200
bpo-33926: enable GDB tests on Travis CI (GH-14395)
diff --git a/.travis.yml b/.travis.yml
index addff77334..3e2cbb6fb0 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -14,8 +14,7 @@ env:
- OPENSSL=1.1.1c
- OPENSSL_DIR="$HOME/multissl/openssl/${OPENSSL}"
- PATH="${OPENSSL_DIR}/bin:$PATH"
- # Use -O3 because we don't use debugger on Travis-CI
- - CFLAGS="-I${OPENSSL_DIR}/include -O3"
+ - CFLAGS="-I${OPENSSL_DIR}/include"
- LDFLAGS="-L${OPENSSL_DIR}/lib"
# Set rpath with env var instead of -Wl,-rpath linker flag
# OpenSSL ignores LDFLAGS when linking bin/openssl
(...)
and
commit 8ff53564730f7ba503f5ad94418c309c48e8516d
Author: INADA Naoki <methane@users.noreply.github.com>
Date: Sat Feb 10 20:35:17 2018 +0900
travis: Use -O3 option (GH-5599)
We don't use debugger on Travis.
diff --git a/.travis.yml b/.travis.yml
index 98d6b9a97b..dd0688717c 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -15,7 +15,8 @@ env:
- OPENSSL=1.1.0g
- OPENSSL_DIR="$HOME/multissl/openssl/${OPENSSL}"
- PATH="${OPENSSL_DIR}/bin:$PATH"
- - CFLAGS="-I${OPENSSL_DIR}/include"
+ # Use -O3 because we don't use debugger on Travis-CI
+ - CFLAGS="-I${OPENSSL_DIR}/include -O3"
- LDFLAGS="-L${OPENSSL_DIR}/lib"
# Set rpath with env var instead of -Wl,-rpath linker flag
# OpenSSL ignores LDFLAGS when linking bin/openssl
It’s “unsafe” in the sense that it is more likely to hit compiler bugs and it’s more likely to do the wrong thing with code not strictly following standards.
But if there are no bugs in the code or in the compiler, there is no problem using -O3. Note that GCC also has -Ofast, which drops standards compliance, allowing even more optimizations.
I have never read that before (“experimental optimizations that tend to make the binary too huge for CPU cache and therefore slower in many cases”). I don’t think that’s the case. -O3 optimizations are not really experimental, they are just potentially costly.
I have yet to run the benchmarks.
This is not a high priority for me and I don’t think anyone should really be putting too much effort into it, as no real issues seem to emerge from -O3 and so far it indeed seems faster.
If you have spare CPU cycles on your machines, feel free to post your results for -O3 and -O2 with PGO and LTO enabled, as Python is distributed with these flags enabled. Hardware info is also welcome ($ lscpu, # dmidecode --type 17).
My machine is currently never idle, so the benchmarks could be affected.
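If you do post numbers, a per-benchmark ratio plus one overall figure is easier to read than two raw lists. Below is a rough sketch of how that could be computed. It assumes you have already extracted the mean time per benchmark (in seconds) for each build into plain dicts; reading the benchmark tool's result files is left out, and `compare_builds` is just a name picked for this example.

```python
from statistics import geometric_mean


def compare_builds(baseline, candidate):
    """Compare per-benchmark mean timings (seconds) of two builds.

    Returns (ratios, overall). ratios maps each benchmark present in both
    inputs to baseline_time / candidate_time, so a value above 1.0 means
    the candidate build is faster. overall is the geometric mean of the
    ratios.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no benchmarks in common")
    ratios = {name: baseline[name] / candidate[name] for name in shared}
    return ratios, geometric_mean(list(ratios.values()))
```

Calling it as `compare_builds(o2_means, o3_means)` gives ratios above 1.0 wherever -O3 is faster. The geometric mean is used for the overall figure because it treats a 2x speedup and a 2x slowdown symmetrically, which an arithmetic mean of ratios does not.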