Crashes combining TensorFlow with other packages - who can reproduce?

One of the issues that keeps coming up in the future-of-manylinux discussion is a poorly-understood crash that sometimes happens when more than one binary wheel that uses C++ is loaded into the same interpreter. All I know for sure about it is that TensorFlow, which does include a lot of complex C++, has been reported to be one of the wheels involved, and that someone observed the machine-code call stack at the time of the crash to involve std::call_once (which suggests a problem deep in the guts of the C++ runtime).

The TensorFlow maintainers and I are trying to investigate this crash. It would really help if we had a reproducible test case, or at least some more information from someone who’s observed the crash themselves: which modules besides TensorFlow were involved, the order in which they were loaded, whether the crash happened immediately upon loading the modules or whether some more application code was executed first (and if so, what was it?), the identity and age of the base Linux distribution in use, and anything else that might be relevant.
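For anyone who wants to report the details asked for above, here is a minimal sketch of a script that collects them. The function name `collect_environment` and the default package list are my own choices for illustration, and `importlib.metadata` assumes Python 3.8 or later.

```python
import platform
import sys
from importlib import metadata


def collect_environment(packages=("tensorflow", "pyarrow")):
    """Gather basic facts about the interpreter and platform into a dict."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # libc_ver() inspects the interpreter binary; it may return
        # empty strings on systems that do not use glibc.
        "libc": " ".join(platform.libc_ver()).strip(),
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info["packages"][name] = None  # not installed
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

It only reads metadata and does not import the packages themselves, so it should be safe to run even in an environment where the crash occurs.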

If you have this information or you know where I can get it, I would love to hear from you.

@pitrou may have more information; in “The next manylinux specification” he said:

It’s more complicated than that. The way Tensorflow built their “manylinux” wheels made Python crash when loaded side-by-side with other manylinux wheels, due to discrepancies at the C++ ABI level (this is a summary; I’m not sure anyone understands exactly what happened, and I’m part of the people who looked into it).

I don’t know if the crashes still exist currently. On the Arrow side of things, we have stopped looking at this, given that it’s an issue with how Tensorflow builds non-compliant manylinux1 wheels.

I still see the segfault:

>>> import pyarrow
>>> import tensorflow
Segmentation fault (core dumped)

TensorFlow will not build manylinux1 wheels; manylinux2010 is the target.
I checked with Uwe L. Korn on this issue, and he mentioned that pyarrow will be manylinux2010-compliant soon. In that case, I think this segfault will be resolved.
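Since the order of imports may matter, one way to test this without taking down your own session is to run each order in a fresh interpreter and look at how the child process ended. This is only a sketch; `try_import_order` and `describe_returncode` are names I made up, and the signal interpretation assumes a POSIX system.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by a signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


def try_import_order(modules, python=sys.executable):
    """Import `modules` in order in a fresh interpreter; return its return code."""
    code = "\n".join(f"import {name}" for name in modules)
    result = subprocess.run([python, "-c", code], capture_output=True)
    return result.returncode


if __name__ == "__main__":
    for order in (["pyarrow", "tensorflow"], ["tensorflow", "pyarrow"]):
        rc = try_import_order(order)
        print(" -> ".join(order), ":", describe_returncode(rc))
```

A child killed by SIGSEGV corresponds to the "Segmentation fault" message in the session above. An exit status of 1 usually just means one of the modules is not installed.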


These are from my notes:
https://issues.apache.org/jira/browse/ARROW-3466,
https://issues.apache.org/jira/browse/ARROW-2657 and
https://issues.apache.org/jira/browse/ARROW-5130 for past issues.
The resolution was to make the linker script of libarrow.so quite strict: https://github.com/pitrou/arrow/blob/master/cpp/src/arrow/symbols.map

Thanks, I can reproduce this crash on my computer with pyarrow-0.14.1 and tensorflow-1.14.0 (both tagged manylinux1).

We started producing manylinux2010 wheels for PyArrow but pulled them at the last minute, as they failed to load (I don’t remember the details, but IIRC auditwheel repair is buggy).

Just to let you know, TF is not manylinux1 compliant. auditwheel will fail.

Last time I tried, it failed to repair, and auditwheel reported that pyarrow was linux_x86_64 and not manylinux1.

$ auditwheel show /input/wheelhouse/pyarrow-0.13.0-cp36-cp36m-manylinux1_x86_64.whl

pyarrow-0.13.0-cp36-cp36m-manylinux1_x86_64.whl is consistent with the
following platform tag: "linux_x86_64".

That’s weird, 'cause we precisely use auditwheel to produce manylinux1 wheels…

The latest PyArrow wheels seem fine (at least according to auditwheel…):

$ auditwheel show pyarrow-0.14.1-cp37-cp37m-manylinux1_x86_64.whl 

pyarrow-0.14.1-cp37-cp37m-manylinux1_x86_64.whl is consistent with the
following platform tag: "manylinux1_x86_64".
[....]
$ auditwheel show pyarrow-0.14.1-cp37-cp37m-manylinux2010_x86_64.whl 

pyarrow-0.14.1-cp37-cp37m-manylinux2010_x86_64.whl is consistent with
the following platform tag: "manylinux2010_x86_64".
[....]
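As a cross-check on the auditwheel output above: `auditwheel show` reports the platform tag the wheel's contents are consistent with, while the wheel itself also records its declared tags in its `.dist-info/WHEEL` file. Here is a minimal sketch for reading the declared tags so the two can be compared; the helper name `declared_wheel_tags` is hypothetical.

```python
import zipfile


def declared_wheel_tags(wheel_path):
    """Return the `Tag:` values recorded in a wheel's .dist-info/WHEEL file."""
    # A wheel is a zip archive; wheel_path may be a path or a file object.
    with zipfile.ZipFile(wheel_path) as wheel:
        for name in wheel.namelist():
            if name.endswith(".dist-info/WHEEL"):
                text = wheel.read(name).decode("utf-8")
                return [
                    line.split(":", 1)[1].strip()
                    for line in text.splitlines()
                    if line.startswith("Tag:")
                ]
    return []  # no WHEEL metadata file found
```

If the declared tag says manylinux1 but auditwheel only reports linux_x86_64, the wheel is claiming more compatibility than its binaries support.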

Here are the manylinux2010 wheels of TensorFlow, if you need to test:
https://tensorflow.pypi.thoth-station.ninja/index/manylinux2010/jemalloc/simple/tensorflow/

I observe the same crash with this tensorflow wheel.

I don’t see it on CentOS 6. Which OS are you on?

> root@7768fc1175ca:/workspace# python
> Python 3.6.3 (default, May 14 2019, 16:15:26)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> root@7768fc1175ca:/workspace# gcc -v
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-7/root/usr/libexec/gcc/x86_64-redhat-linux/7/lto-wrapper
> Target: x86_64-redhat-linux
> Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-7/root/usr --mandir=/opt/rh/devtoolset-7/root/usr/share/man --infodir=/opt/rh/devtoolset-7/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-plugin --with-linker-hash-style=gnu --enable-initfini-array --with-default-libstdcxx-abi=gcc4-compatible --with-isl=/builddir/build/BUILD/gcc-7.3.1-20180303/obj-x86_64-redhat-linux/isl-install --enable-libmpx --with-mpc=/builddir/build/BUILD/gcc-7.3.1-20180303/obj-x86_64-redhat-linux/mpc-install --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
> Thread model: posix
> gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)
> root@7768fc1175ca:/workspace# python
> Python 3.6.3 (default, May 14 2019, 16:15:26)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow
> >>> import tensorflow
> >>>
> >>>
> root@7768fc1175ca:/workspace# python
> Python 3.6.3 (default, May 14 2019, 16:15:26)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import tensorflow
> >>> import pyarrow
> >>>
> >>>

I use Debian unstable (kernel 4.19, GNU libc 2.28) and I tested with Python 3.7. I wonder if there might have been some change to the dynamic linker in between CentOS 6 and now.

It definitely depends on the system version of libstdc++ and perhaps glibc.
For example I can reproduce on Ubuntu 18.04:

$ python -c "import pyarrow, tensorflow"
Segmentation fault

gdb backtrace:

#0  0x000000000000003f in ?? ()
#1  0x00007ffff77d4827 in __pthread_once_slow (once_control=0x7fffb98fcd28 <tensorflow::port::(anonymous namespace)::cpuid_once_flag>, 
    init_routine=0x7fffc67ff830 <__once_proxy>) at pthread_once.c:116
#2  0x00007fffb8f440ca in void std::call_once<void (&)()>(std::once_flag&, void (&)()) ()
   from /home/antoine/t/venv/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#3  0x00007fffb8f4410e in tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) ()
   from /home/antoine/t/venv/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#4  0x00007fffb8655bd5 in _GLOBAL__sub_I_cpu_feature_guard.cc ()
   from /home/antoine/t/venv/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#5  0x00007ffff7de5733 in call_init (env=0xb7c690, argv=0x7fffffffd8f8, argc=3, l=<optimized out>) at dl-init.c:72
#6  _dl_init (main_map=main_map@entry=0xfd3ff0, argc=3, argv=0x7fffffffd8f8, env=0xb7c690) at dl-init.c:119
#7  0x00007ffff7dea1ff in dl_open_worker (a=a@entry=0x7fffffff7dc0) at dl-open.c:522
#8  0x00007ffff7b4b2df in __GI__dl_catch_exception (exception=0x7fffffff7da0, operate=0x7ffff7de9dc0 <dl_open_worker>, args=0x7fffffff7dc0)
    at dl-error-skeleton.c:196
#9  0x00007ffff7de97ca in _dl_open (file=0x7fffc5738e20 "/home/antoine/t/venv/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so", 
    mode=-2147483646, caller_dlopen=0x641d20 <_PyImport_FindSharedFuncptr+112>, nsid=<optimized out>, argc=3, argv=<optimized out>, env=0xb7c690) at dl-open.c:605
#10 0x00007ffff75c1f96 in dlopen_doit (a=a@entry=0x7fffffff7ff0) at dlopen.c:66
#11 0x00007ffff7b4b2df in __GI__dl_catch_exception (exception=exception@entry=0x7fffffff7f90, operate=0x7ffff75c1f40 <dlopen_doit>, args=0x7fffffff7ff0)
    at dl-error-skeleton.c:196
#12 0x00007ffff7b4b36f in __GI__dl_catch_error (objname=0xb80140, errstring=0xb80148, mallocedp=0xb80138, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:215
#13 0x00007ffff75c2735 in _dlerror_run (operate=operate@entry=0x7ffff75c1f40 <dlopen_doit>, args=args@entry=0x7fffffff7ff0) at dlerror.c:162
#14 0x00007ffff75c2051 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#15 0x0000000000641d20 in _PyImport_FindSharedFuncptr ()

As you can see, __pthread_once_slow (a private glibc function) is involved in the implementation of the C++ function std::call_once, which itself is used (sometimes? always? depending on libstdc++ version?) for the implementation of static local variables (which feature thread-safe initialization, see “Static local variables” in https://en.cppreference.com/w/cpp/language/storage_duration).
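Given the suggestion earlier in the thread that the crash depends on the system libstdc++, it may help to check which copy of the C++ runtime is actually mapped into the process after importing just one of the modules. A minimal, Linux-only sketch that reads `/proc/self/maps` follows; the helper name `loaded_libraries` is made up for this example.

```python
def loaded_libraries(substring, maps_path="/proc/self/maps"):
    """Return sorted paths of mapped files whose path contains `substring`."""
    paths = set()
    with open(maps_path) as maps:
        for line in maps:
            fields = line.split(None, 5)
            # The pathname is the optional sixth field of each mapping line.
            if len(fields) == 6 and substring in fields[5]:
                paths.add(fields[5].strip())
    return sorted(paths)
```

For example, running `import pyarrow` and then `print(loaded_libraries("libstdc++"))` should show the shared libstdc++ in use, if there is one. As far as I know, a wheel that links parts of the C++ runtime statically will not show up this way, so an empty or short list does not rule out a mismatch.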