The next manylinux specification

pitrou · March 25, 2019, 3:05pm

It’s more complicated than that. The way Tensorflow built their “manylinux” wheels made Python crash when loaded side-by-side with other manylinux wheels, due to discrepancies at the C++ ABI level (this is a summary; I’m not sure anyone understands exactly what happened, and I’m one of the people who looked into it).

It seems the client applies a relaxed heuristic (check for glibc version), and that’s fine. But the spec is more precise than that and enumerates explicitly which library versions can be assumed.

pf_moore · March 25, 2019, 3:45pm

So it’s not simply tensorflow violating the spec that causes the problem, but tensorflow in combination with something else that’s spec-compliant that causes the issue? And tensorflow on its own is fine?

Ouch. That sounds really nasty. I’ll duck out at this point, because clearly my simplistic viewpoint isn’t going to help

pitrou · March 25, 2019, 3:59pm

I’m not sure that Tensorflow on its own is fine. The Tensorflow wheels may not work on an old system (for example, with an old glibc). But the failure mode on a recent system in combination with other wheels is definitely nastier.

takluyver · March 25, 2019, 5:40pm

My understanding - someone correct me if I’m wrong - is that the client would be responsible for either parsing the version fields out of the tag and comparing them, or generating a list of every possible tag it supports. I don’t think anyone is suggesting that we publish wheels with a massive sequence of manylinux_glibc_2_x tags.

So this mechanism would have to generate all the possible tags for each glibc version. We might take 2.12 (manylinux2010) as a starting point, and the current version is 2.29. So that would be 18 tags, adding about two per year. It looks like the code generates the product of that with the ABI tags.

This doesn’t seem like impractical numbers on the face of it, but it depends how the list of supported tags is used; there may be things which are fine with ~10 tags but become unwieldy with ~100.
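To put rough numbers on that, here is a minimal sketch of the "generate every tag" approach. The function name `glibc_platform_tags` and the short list of interpreter/ABI pairs are made up for illustration, and the `manylinux_glibc_2_x` spelling is only the one used in this thread, not a settled format.

```python
from itertools import product


def glibc_platform_tags(system_minor, baseline_minor=12, arch="x86_64"):
    """Return the platform tags a glibc 2.<system_minor> system would accept.

    Tags are ordered newest first, down to the assumed baseline (2.12).
    """
    return [
        f"manylinux_glibc_2_{minor}_{arch}"
        for minor in range(system_minor, baseline_minor - 1, -1)
    ]


# glibc 2.12 through 2.29 inclusive.
platform_tags = glibc_platform_tags(29)
print(len(platform_tags))  # 18

# Hypothetical interpreter/ABI pairs; a real client would compute its own.
interpreter_abi_pairs = [("cp37", "cp37m"), ("cp37", "abi3"), ("cp37", "none")]

# The full list is the product of the ABI pairs and the platform tags.
all_tags = [
    (python, abi, plat)
    for (python, abi), plat in product(interpreter_abi_pairs, platform_tags)
]
print(len(all_tags))  # 54
```

With three ABI pairs this is 54 entries for one architecture, so the total grows with both the glibc range and the number of ABI combinations.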

brettcannon · March 25, 2019, 6:46pm

FYI my proposed code for packaging creates a sorted list of all possible combinations. That way determining compatible wheels is just a matter of iterating through that list and finding the first wheel for a package that’s compatible.
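As a rough illustration of that lookup (this is not the actual proposed code for packaging; `pick_wheel` and the sample filenames are hypothetical), the selection can be sketched as a walk over the priority-ordered tag list:

```python
def pick_wheel(supported_tags, available_wheels):
    """Return the filename of the best available wheel, or None.

    supported_tags: list of (python, abi, platform) tuples, most preferred first.
    available_wheels: dict mapping wheel filename -> set of tag tuples it carries.
    """
    for tag in supported_tags:
        for filename, wheel_tags in available_wheels.items():
            if tag in wheel_tags:
                return filename
    return None


supported = [
    ("cp37", "cp37m", "manylinux2010_x86_64"),
    ("cp37", "cp37m", "manylinux1_x86_64"),
    ("py3", "none", "any"),
]
wheels = {
    "demo-1.0-py3-none-any.whl": {("py3", "none", "any")},
    "demo-1.0-cp37-cp37m-manylinux1_x86_64.whl": {
        ("cp37", "cp37m", "manylinux1_x86_64")
    },
}
print(pick_wheel(supported, wheels))
# demo-1.0-cp37-cp37m-manylinux1_x86_64.whl
```

Because the outer loop runs over the supported tags, the cost of the lookup scales with the length of that list, which is why the number of generated tags matters.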

njs · March 25, 2019, 9:09pm

This sounds like some bug that’s orthogonal to all the wheel tagging issues. There’s no reason “discrepancies at the C++ ABI level” should cause unrelated wheels to interfere with each other. Is there a bug report or any public analysis?

We already went through this exact discussion on macOS :-). It used to be that pip would only install a macosx_10_6_intel wheel on systems where python was built with the 10.6 SDK, exactly. This led to unwieldy wheel names with lots of dots in them. The fix was to change things so if you’re on macOS 10.12, then pip will install wheels tagged macosx_10_1_intel, macosx_10_2_intel, …, macosx_10_12_intel:

https://github.com/pypa/pip/blob/master/src/pip/_internal/pep425tags.py#L319-L328

We should treat perennial manylinux the same way. If you have a system with glibc 2.12, then you can install manylinux_glibc_2_1_x86_64, …, manylinux_glibc_2_12_x86_64.
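A minimal sketch of that rule, done by parsing the tag rather than enumerating every accepted tag. The function name and the regular expression are assumptions for illustration; the tag spelling is the one used in this post.

```python
import re

# Assumed tag layout: manylinux_glibc_<major>_<minor>_<arch>
_TAG_RE = re.compile(r"^manylinux_glibc_(\d+)_(\d+)_(.+)$")


def wheel_tag_is_installable(tag, system_glibc, system_arch):
    """Return True if a wheel with this platform tag may be installed.

    system_glibc is a (major, minor) tuple of ints, e.g. (2, 12).
    """
    match = _TAG_RE.match(tag)
    if match is None:
        return False
    required = (int(match.group(1)), int(match.group(2)))
    # Compare as integer tuples so that 2.5 sorts before 2.12.
    return match.group(3) == system_arch and required <= system_glibc


print(wheel_tag_is_installable("manylinux_glibc_2_1_x86_64", (2, 12), "x86_64"))   # True
print(wheel_tag_is_installable("manylinux_glibc_2_12_x86_64", (2, 12), "x86_64"))  # True
print(wheel_tag_is_installable("manylinux_glibc_2_13_x86_64", (2, 12), "x86_64"))  # False
```

The integer comparison matters: comparing the version parts as strings would put "12" before "5".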

pitrou · March 25, 2019, 9:53pm

There have been several manifestations of this over time, despite trying various (brave and/or desperate) workarounds.

I think the closest to an analysis can be found here: https://github.com/apache/arrow/pull/2096, though it’s more speculation. The root cause seems to be that gcc implements C++ thread-safe static singletons (as per the C++ reference: “If multiple threads attempt to initialize the same static local variable concurrently, the initialization occurs exactly once”) using library calls that involve libstdc++. If two libraries linked with different libstdc++ versions, or linked in different ways, are loaded in the same process, you can get a crash (not sure exactly why, there can be several explanations).

The RH devtoolset which is generally used when building manylinux wheels can link some libstdc++ symbols statically if they aren’t available in the baseline C++ library (see https://github.com/apache/arrow/pull/2096 for one example of issues this can cause), which would happen when using some C++11 functionality which needs library support… such as thread-safe static singletons. If Tensorflow uses a different toolchain for building, it may produce shared libraries which cannot safely be loaded side-by-side with devtoolset-produced shared libraries. And Tensorflow cannot use the devtoolset because it needs a newer system to build AFAIU.

You wouldn’t guess such issues can exist before you encounter them.

jjhelmus · March 26, 2019, 3:18am

Would a perennial manylinux which specifies only the glibc version be sufficient for binary compatibility between different Linux distributions? The manylinux1 and manylinux2010 tags specify symbol version bounds for two other libraries, libstdc++ (CXXABI and GLIBCXX) and libgcc_s (GCC). Linux distributions can and do use the same glibc version but different versions of gcc. Binaries produced using the distribution with a newer gcc may contain symbols from libstdc++ and libgcc_s that would not be present in these libraries on the distribution with the older gcc.

For a specific example, Debian 8 (Jessie) ships with glibc 2.19 and gcc 4.9.2. Wheels produced by a system running this distribution would be labeled with the manylinux_glibc_2_19_x86_64 tag and the gcc toolchain would include symbols with maximum versions of CXXABI_1.3.8 and GLIBCXX_3.4.20. Ubuntu 14.04 (Trusty) also ships with glibc 2.19 but the gcc version is 4.8.2 which contains maximum version symbols of CXXABI_1.3.7 and GLIBCXX_3.4.18. Therefore a wheel produced on a Debian 8 system, properly labeled with the glibc version would not work on a Ubuntu 14.04 system despite matching the glibc version. Among just Fedora, Ubuntu and Debian there are a number of other examples where a given glibc version is shipped with different versions of gcc.
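The mismatch can be sketched with the numbers above. The helper names and the `wheel_needs` list are hypothetical; the list assumes a wheel that happens to use the newest symbol versions available on its Debian 8 build host.

```python
def parse_symbol_version(name):
    """Split 'GLIBCXX_3.4.20' into ('GLIBCXX', (3, 4, 20))."""
    prefix, _, version = name.rpartition("_")
    return prefix, tuple(int(part) for part in version.split("."))


def missing_symbol_versions(required, provided_max):
    """Return the required symbol versions that the target system lacks."""
    missing = []
    for name in required:
        prefix, version = parse_symbol_version(name)
        limit = provided_max.get(prefix)
        if limit is None or version > limit:
            missing.append(name)
    return missing


# Maximum symbol versions as quoted in this post.
DEBIAN_8 = {"CXXABI": (1, 3, 8), "GLIBCXX": (3, 4, 20)}
UBUNTU_14_04 = {"CXXABI": (1, 3, 7), "GLIBCXX": (3, 4, 18)}

# Assumption: the wheel references the newest versions its build host offers.
wheel_needs = ["CXXABI_1.3.8", "GLIBCXX_3.4.20"]

print(missing_symbol_versions(wheel_needs, DEBIAN_8))      # []
print(missing_symbol_versions(wheel_needs, UBUNTU_14_04))  # ['CXXABI_1.3.8', 'GLIBCXX_3.4.20']
```

Both systems report glibc 2.19, so a check on the glibc version alone would accept the wheel on either of them.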

From the discussion so far, it seems as if this limitation may be known and that the glibc version is being used only as heuristic with the actual specification being encoded in the reference build environment and/or auditwheel. If the glibc version is being used only as a heuristic, perhaps it would be better to use a different name so that users are not confused by this apparent mismatch.

njs · March 26, 2019, 4:23am

…Yeah, I read all that and I’m still baffled at how a definition of std::call_once in one ELF namespace is affecting another namespace entirely. In any case, it sounds like this is effectively some kind of toolchain bug, and not something that we can solve with compatibility tags, so we should probably move the discussion elsewhere.

Right, the idea is that manylinux_glibc_2_12_x86_64 doesn’t mean “this was built on some system that happened to have glibc 2.12”, it means “we believe that if you have a system with glibc 2.12 on x86-64, then this wheel will work”. It’s a claim about compatibility, not a record of the build environment.

That’s a fair point. Up above @ncoghlan suggested having a range of broadly-compatible-linux-environments named like manylinux_${compatibility rule}_${some kind of version info}. Another approach would be to decide that manylinux is the name for “broadly compatible with mainstream glibc-based distros” (that’s pretty much what it means now!), and use other tags for other compatibility heuristics. So e.g. you might have:

manylinux_12_x86_64 (= installs on systems with glibc 2.12+)
vfxlinux_cy2019_x86_64 (= installs on systems that are believed to implement the VFX Reference Platform Calendar Year 2019 compatibility profile)
alpinelinux_somekindofversioninfo_x86_64 (… you get the idea)

It’s basically a question of taste, but since we’ve had a few rounds of confusion over the string manylinux_glibc then maybe it’s better to avoid it.
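One way to picture "one tag family per heuristic" is a small dispatch table keyed on the tag prefix. Everything here is hypothetical: the `RULES` table, the layout of the `system` dict, and the idea that a VFX profile would be detected as a simple set membership are all assumptions made for the sketch.

```python
import re


def _glibc_rule(version_info, system):
    # "manylinux_12" is read as "glibc 2.12 or newer".
    if not version_info.isdigit():
        return False
    glibc = system.get("glibc")
    return glibc is not None and glibc >= (2, int(version_info))


def _profile_rule(key):
    # Build a rule that checks membership in a set of named profiles.
    def rule(version_info, system):
        return version_info in system.get(key, ())
    return rule


# Hypothetical registry: tag family -> compatibility heuristic.
RULES = {
    "manylinux": _glibc_rule,
    "vfxlinux": _profile_rule("vfx_profiles"),
}

_TAG_RE = re.compile(r"^([a-z]+)_([A-Za-z0-9]+)_(.+)$")


def is_compatible(tag, system):
    match = _TAG_RE.match(tag)
    if match is None:
        return False
    family, version_info, arch = match.groups()
    rule = RULES.get(family)
    if rule is None or arch != system["arch"]:
        return False
    return rule(version_info, system)


system = {"arch": "x86_64", "glibc": (2, 17), "vfx_profiles": {"cy2019"}}
print(is_compatible("manylinux_12_x86_64", system))     # True
print(is_compatible("manylinux_19_x86_64", system))     # False
print(is_compatible("vfxlinux_cy2019_x86_64", system))  # True
```

A family with no registered rule, such as the alpinelinux example, is simply rejected, so new heuristics can be added without touching the existing ones.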

pitrou · March 26, 2019, 10:01am

Such as “manylinux2014”?

pitrou · March 26, 2019, 10:02am

Not entirely agreed, since the only occurrence of this is when mixing manylinux-compliant wheels and non-manylinux-compliant wheels. In any case, yes, it’s more of a datapoint in the larger discussion.

njs · March 26, 2019, 2:59pm

What I mean is, there are really two problems here: (1) their wheels are labeled as being built on centos5-era ABIs, but this is not true, (2) their wheels have some mysterious issue that causes crashes in other code.

We can solve (1) by writing a PEP to add more fine grained labels. But a PEP can’t do much about (2). Slapping a manylinux2014 or manylinux_glibc_2_17 tag on their wheels won’t fix crashes. For that we need to figure out why the mechanisms that normally isolate different wheels from each other are failing, and fix it.

(Well, it’s possible that the problem is essentially a bug in the old centos5-based manylinux1 toolchain, in which case it might disappear once everyone switches to newer toolchains. But we don’t know that.)

pf_moore · March 26, 2019, 3:31pm

To be accurate, we can “solve” (1) by enforcing more strictly the rule that people shouldn’t publish wheels with inaccurate tags. Unfortunately, that will not solve the underlying issue, which is (or at least appears to me to be) that there are currently no defined tags that the tensorflow project can legitimately use to publish their wheels. That is the issue that can be solved by adding extra labels.

As a process point, I think it’s important that we don’t implicitly sanction publishing wheels with inaccurate tags.

gpshead · March 26, 2019, 8:57pm

If we’re going to make that claim, we should actively prevent wheels with inaccurate tags from being publishable. Analyze and reject from within twine, and again analyze and reject at upload time on warehouse to avoid people who have hacked up twine to circumvent the checks. Otherwise all we’re doing is making a statement that says, “oh, BTW, don’t do this unintended thing that works for the majority of your desired users when there is literally no other option”.

pf_moore · March 26, 2019, 10:37pm

In principle, I agree. My understanding is that auditwheel is too heavyweight to do that easily.

What concerns me most, though, is that no-one seems to be pushing for the incorrectly tagged wheels to be removed from PyPI. In particular, I think the manylinux people should be more concerned about the impact this might have on the credibility of manylinux as a way of ensuring compatibility. Sure “practicality beats purity” and all that, but if we, as the authors of the standards, aren’t trying to make sure they are followed, what’s the point?

Anyway, this is off-topic. Let’s kill the digression here, and let this topic return to discussing how to define the next manylinux spec. If anyone wants to discuss the process of enforcing tag standards, they should start a separate topic (and to be honest, I’ve said all I want to say on the matter, so I won’t be participating).

dustin · March 26, 2019, 10:56pm

If anyone wants to discuss enforcing these standards, the right place to do it would be on the issue I linked to in the OP: “Run auditwheel on new manylinux uploads, reject if it fails”.

dustin · March 28, 2019, 7:40pm

We haven’t had any additional comments here in a few days, so I wanted to attempt to summarize the relevant discussion and talk about next steps:

(Please feel free to correct me if I’m missing, misconstruing or misunderstanding anything that was said)

“Perennial manylinux”:

- There is some confusion here about how only specifying the libc provider / libc version determines valid libraries to link against
- Saying “if it works for your users, it’s valid” might be too vague, and we would still need to explicitly define install-time checks to determine compatibility
- Generally it would be “if auditwheel (or another tool which does publication-time checking) allows it”, but we would need to point to such a tool in the specification
- If the glibc provider/version is only being used as a heuristic, it’s confusing to use it in the tag name
- It might be better to instead make individual “heuristics” which can explicitly specify symbol version bounds for relevant libraries
- This sounds a lot like manylinux!
- Additional heuristics would also allow us to define distro/VFX-specific heuristics
- It seems like folks are on board with allowing additional heuristics if possible

manylinux2014:

- Generally seems like folks value the detailed specification for binary compatibility such a proposal would provide
- There don’t seem to be any complaints about adding the additional PowerPC/ARM architectures that CentOS 7 supports.
  - Unclear whether the proposed ARM architectures would be compatible w/ modern Raspbian

A combined proposal

If we can accept that manylinux is the “de-facto” glibc heuristic, perhaps we can continue to use this, with some additional flexibility, while remaining open to defining additional heuristics in the future.

@njs suggested a tag like:

manylinux_12_x86_64 (= installs on systems with glibc 2.12+)

This allows us to let the glibc version “slide” along the manylinux specification, but still has the issue of vaguely-specified symbol version bounds for relevant libraries.

What if we used a heuristic like:

manylinux2014_2_12_x86_64 (= installs on systems with glibc 2.12+, backed by a manylinux2014 spec)

We could explicitly specify some things (CXXABI, GLIBCXX, GCC, and externally provided libraries) in a manylinux2014 PEP, while still using the glibc as the “heuristic version identifier”. This means we don’t need a new PEP to use a different glibc version.

At a certain point in time, the glibc version will become too new compared to (CXXABI, GLIBCXX, GCC, etc) and we would still need to write a new manylinuxNNNN specification.

Would this work? Does it make sense? Would it be a good balance between the two ideas?

njs · March 28, 2019, 9:14pm

This is really the worst of both worlds IMO. The goal isn’t to let people use arbitrary glibc versions; the goal is to let us migrate to new broadly-useful manylinux profiles without having to write new PEPs or add more hard-coded logic to pip, like how it works on Windows/macOS. We don’t care about the glibc version per se. We only care about it because it’s (1) a convenient shorthand to refer to a given “era” of linux distros, (2) it’s one that pip can mechanically compute on any system.

We can bikeshed over the exact spelling of the tag. The only technical requirement to achieve our goal is that the tag should be a fixed string with slots to put in glibc version and platform. And we can bikeshed over exactly what the PEP says, e.g. whether it defines auditwheel as “normative” or “not normative, but strongly recommended”. Those don’t really matter. But there shouldn’t be any 2014 anywhere, that’s just making life difficult for ourselves and doesn’t provide any benefits.
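To make the "fixed string with slots" idea concrete, here is a small sketch. `TAG_TEMPLATE` uses one of the spellings floated in this thread and is only a placeholder, and `detect_glibc_version` is a hypothetical helper built on `os.confstr`, which is one possible way to ask the running system for its glibc version; it returns `None` where that query is unavailable.

```python
import os

# Placeholder spelling; the exact tag string is still up for discussion.
TAG_TEMPLATE = "manylinux_glibc_{major}_{minor}_{arch}"


def detect_glibc_version():
    """Return the running glibc version as (major, minor), or None."""
    try:
        value = os.confstr("CS_GNU_LIBC_VERSION")  # e.g. "glibc 2.17"
    except (AttributeError, ValueError, OSError):
        # Not available on this platform or this libc.
        return None
    if not value:
        return None
    name, _, version = value.partition(" ")
    if name != "glibc":
        return None
    parts = version.split(".")
    try:
        return int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return None


def platform_tag(glibc_version, arch):
    """Fill the two slots of the template: glibc version and platform."""
    major, minor = glibc_version
    return TAG_TEMPLATE.format(major=major, minor=minor, arch=arch)


print(platform_tag((2, 12), "x86_64"))  # manylinux_glibc_2_12_x86_64
print(detect_glibc_version())           # depends on the machine running this
```

Nothing in this sketch has to change when a new glibc release appears, which is the point of avoiding a year in the tag name.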

dustin · March 28, 2019, 9:52pm

OK, fair enough. But how else can we strike a balance between having this flexibility, and also satisfying maintainers’ desire to determine if they can build compatible distributions without just saying “auditwheel will tell you if you did it right or not”?

This sounds like we’re just pushing the “specification” into auditwheel, where it would be a) more difficult for interested parties to determine what the rules are, exactly, and b) more work for auditwheel maintainers (who are few and far between). It just wouldn’t require a PEP.

I didn’t see this question get answered specifically. Somewhere, whether it’s in a PEP or not, we need to specify bounds for these libraries, right? With “perennial manylinux”, at what point would we update them, and how? Or are they just completely unspecified, and we just let incompatibilities like this happen:

njs · March 28, 2019, 10:36pm

Everything I’m proposing is about standardizing the things we’re already doing, and that we know work. Auditwheel is already the most accurate, useful, and up-to-date record of how to build compatible wheels on Linux, because it’s what maintainers actually use. I’m pretty sure there have been fixes to the spec in auditwheel that we forgot to merge back into PEP 513, or where the PEP 513 update came later as an afterthought.

In practice, the specification already lives in auditwheel, so there’s no additional maintenance burden. It’s true that reading the rules would require reading some code, but in practice (a) the code is guaranteed to be accurate and up to date, unlike the PEP, (b) you’re still better off than on Windows or macOS, where the rules aren’t written down at all…

A manylinux_glibc_2_19_x86_64 tag would mean that the wheel is compatible with all mainstream distros using glibc 2.19+ on x86-64.

Ubuntu 14.04 is a mainstream distro using glibc 2.19+ on x86-64.

Therefore, to get a manylinux_glibc_2_19_x86_64 tag, you need to be compatible with Ubuntu 14.04.

Therefore, a random wheel built on Debian 8 probably does not qualify for the manylinux_glibc_2_19_x86_64 tag.

A random wheel built on Ubuntu 14.04 might qualify for the tag, but to be certain we’d have to first do some research to check whether there were any distros shipping glibc 2.19 with even older versions of gcc. And of course we’d also need to vendor any necessary libraries, etc.
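That chain of reasoning amounts to taking, for each library, the lowest maximum symbol version among all distros the tag claims to cover. A minimal sketch, assuming a hypothetical `DISTROS` table that contains only the two distros and the version numbers quoted earlier in this thread (a real table would need the extra research mentioned above):

```python
# Only the two data points from this thread; not a complete survey.
DISTROS = {
    "Debian 8": {"glibc": (2, 19), "CXXABI": (1, 3, 8), "GLIBCXX": (3, 4, 20)},
    "Ubuntu 14.04": {"glibc": (2, 19), "CXXABI": (1, 3, 7), "GLIBCXX": (3, 4, 18)},
}


def symbol_ceiling(glibc_floor, distros, libraries=("CXXABI", "GLIBCXX")):
    """Newest symbol versions a wheel may use for a given glibc floor.

    Every distro with glibc >= glibc_floor must be able to load the wheel,
    so the oldest library among them sets the limit.
    """
    eligible = [d for d in distros.values() if d["glibc"] >= glibc_floor]
    if not eligible:
        return {}
    return {lib: min(d[lib] for d in eligible) for lib in libraries}


print(symbol_ceiling((2, 19), DISTROS))
# {'CXXABI': (1, 3, 7), 'GLIBCXX': (3, 4, 18)}
```

With this data the limit for a glibc 2.19 tag is set by Ubuntu 14.04, which is why a wheel that picks up Debian 8's newer libstdc++ symbols would not qualify.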