Moving expat version (or symbol) checks from buildtime to runtime

Hello.

In Fedora and RHEL, we have recently been bitten by a problem of the pyexpat module being linked to the libexpat.so.1 library as we build Python --with-system-expat.

When Python is built with expat version 2.7.2 or newer, it uses the XML_SetAllocTrackerActivationThreshold symbol added in that expat version. The usage in pyexpat was added in Expose Expat XML attack protection APIs · Issue #90949 · python/cpython · GitHub

However, as the libexpat.so.1 “system” library is a standalone thing, it may happen that the user then installs this Python with expat 2.7.1 and the thing blows up on import time:

>>> import pyexpat
Traceback...
ImportError: ...pyexpat.cpython-315-x86_64-linux-gnu.so: undefined symbol: XML_SetAllocTrackerActivationThreshold 

This is not the first time we have encountered this problem; in the past, it also occurred with expat 2.6.0 and another symbol. The problem boils down to the fact that the check for expat version happens on buildttime, but is not checked at runtime at all.

In Fedora, we have hotpatched this situation for now by adding a runtime dependency on an expat version that is >= the expat version which was used during the build. This is a bit too strict, because when we build Python with expat 2.7.3, it will require expat >= 2.7.3, despite it really only needing >= 2.7.2. But we decided it was better to be safe than sorry.

An alternative approach, which we used in RHEL, is to hardcode the required expat version (to >= 2.7.2) and naïvly parse the pyexpat sources for all the relevant #if XML_COMBINED_VERSION >= ... or #if XML_COMBINED_VERSION > ... conditionals, find the highest satisfiable version, and assert that our hard-coded required minimal expat version matches. This reduces the needlessly harsh expat dependency constraint, but might turn out to be fragile if the XML_COMBINED_VERSION conditionals ever start looking differently than expected. It is also only feasible because RHEL is a slow-moving target.


We would very much like to avoid this in the future and were wondering if it was possible to move the checks to runtime instead.

@fweimer suggested that it should be possible to make the symbol references weak to solve this:

Another possibility would be to make the symbol reference weak in Python, and use the functionality only if it is available in the installed version of Expat.

You can think of it as a slightly more type-safe version of dlsym.

You’d write

#pragma weak XML_SetAllocTrackerActivationThreshold

and then before calling the function, you’d check

if (XML_SetAllocTrackerActivationThreshold != NULL) {
   // Use new functionality here.
} else {
  // Other code, presumably with vulnerability.
}

I assume the two code paths already exist because Python can still build with the old version of Expat. In a sense, it’s just a matter of moving the check from compile to run time.

At the same time, Gordon Messmer thinks we should ask expat to use versioned symbols, which is something we can turn into an automatic runtime dependency in RPM packages. We’d like to explore both options, so here I am.

Is moving the checks to runtime something Python would be inclined to accept? Of course, I am willing to work on this myself.

cc @picnixz

1 Like

I do not think Python should do this. A built binary always needs to depend on a version of the library >= to the version it was compiled against as a rule of thumb unless the library itself was specifically designed for use otherwise. This is not unique to Python.

4 Likes

You are distributing Python, and you are the one choosing to use the system lib, so you need to make sure it is not used with an incompatible library.

That said, I think making optional functions weak references is something we should be doing more. It relaxes the minimun version of the libraries we link against, and gives us much better control of the error handling — we can fail with helpful information instead of having the user enconter linker errors.

We discussed this recently with regards to relocatable builds, if I am not mistaken, and the overall consensus seems to be in favor of using weak symbols to detect optional features at runtime, instead of them using compile time conditionals.

1 Like

The proposal here is to move a compile time check to runtime, so the code already has some kind of knowledge of the feature being optional. I think this is generally a good change, but it always depends on each particular case.

In this specific case, I think the proposal is reasonable, however Fedora should be building against the oldest library version made available to the user. New features would still be available via runtime checks, if we happen to get a newer version of the library, but I agree with you that distributors should not be building with newer versions than the one the user might have.

1 Like

Does not we already do something like this for many optional Posix functions (like statat()) on macOS?

Oh I think I read this as being about a function only available in a recent enough version rather than a function that is optionally available in libexpat.

Regardless, from a distro shared library packaging sense, I’d think you would need to track differently shaped builds of the same library that include or exclude optional functions from being present as being effectively different libraries? that feels gross.

FWIW In practice, this seems to be unique to how pyexpat is written. Other libraries used dynamically from Python seem to either bump the soname when they add new symbols, or use versioned symbols, or Python does not use build time checks to use new symbols. Or, the danger exists there with all the other libraries, but we have not yet encountered it in reality.

I agree 100 %. We do. But I do not like the solutions we have and I am looking for a better one.

In this case, it is about “function only available in a recent expat” but yes, you are not incorrect about the other case (which is not the case of expat).

We already do that in Python packaging though, with manylinux, etc. so it’s not that strange, and if downstream distributors are able to properly manage this in their systems, I don’t really think is problematic.

Yeah, that’s perfectly fine. Re-reading my reply I noticed it might seem like I am being dismissive, while that was not my intention.

1 Like

This is the other solution, on libexpat side: Use versioned symbols where supported · Issue #1129 · libexpat/libexpat · GitHub

If that works, no change on Python side would be needed. FWIW I think using versioned symbols in expat is a better solution to our problem.

2 Likes

@fweimer suggested that it should be possible to make the symbol references weak to solve this:

https://sourceware.org/bugzilla/show_bug.cgi?id=24718#c19

suggests to revisit whether VER_FLG_WEAK version references should support optional dependencies—e.g., an executable with a weak reference to foo links against libA.so.1 (which provides foo@@v1), but at runtime runs with a libA.so.1 that lacks the version definition for v1.

I’ve proposed LLVM linker (lld) patch [ELF] Set vna_flags to VER_FLG_WEAK if all references are weak by MaskRay · Pull Request #176673 · llvm/llvm-project · GitHub and made a comment the GNU ld feature request (24718).

If the linker features are implemented, there should be no action item on Python side (assuming the XML_SetAllocTrackerActivationThreshold reference is already weak)

1 Like

The reference is not already weak, and my opinion is that weak references are not a very good idea.

When weak references are used, there are probably three code paths, and one of them is extremely difficult to test (at least in Fedora).

The first code path exists in binaries that are compiled in an environment where the feature (XML_SetAllocTrackerActivationThreshold) is not present. The feature will not be used by the resulting binary, whether it is present in later runtimes or not. This code path will be tested by the project’s test suite following the build.

The second code path exists in binaries that are compiled in an environment where the feature is present. The resulting binary will require the feature in the runtime. Because the feature is present in the build environment, this code path will also be tested by the project’s test suite following the built.

Weak references require an additional code path, where the binary checks the symbol reference at runtime and either uses the feature or handles the failure. The failure handling often means a duplicate of whatever code path exists to handle the “feature not present during build” code path (and the inherent risk that missing-at-build-time and missing-at-run-time behaviors drift away from each other), but it’s difficult to test because the environment has the feature during the build and that code path only executes in an environment that doesn’t have the feature. So this arrangement usually ends up shipping a code path that doesn’t get tested, and that’s bad.