Crashes combining TensorFlow with other packages - who can reproduce?

zwol · July 23, 2019, 4:11pm

One of the issues that keeps coming up in the future-of-manylinux discussion is a poorly-understood crash that sometimes happens when more than one binary wheel that uses C++ is loaded into the same interpreter. All I know for sure about it is that TensorFlow, which does include a lot of complex C++, has been reported to be one of the wheels involved, and that someone observed the machine-code call stack at the time of the crash to involve std::call_once (which suggests a problem deep in the guts of the C++ runtime).

The TensorFlow maintainers and I are trying to investigate this crash. It would really help if we had a reproducible test case, or at least some more information from someone who’s observed the crash themselves: which modules besides TensorFlow were involved, the order in which they were loaded, whether the crash happened immediately upon loading the modules or whether some more application code was executed first (and if so, what was it?), the identity and age of the base Linux distribution in use, and anything else that might be relevant.

If you have this information or you know where I can get it, I would love to hear from you.
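For anyone who wants to report the details asked for above, here is a minimal sketch of a script that gathers the interpreter and base-system information in one place. The function name `collect_environment_report` and the set of fields are assumptions chosen for illustration; only standard-library calls are used.

```python
import platform
import sys


def collect_environment_report():
    """Return a dict describing the interpreter and the base system."""
    libc_name, libc_version = platform.libc_ver()
    report = {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        # libc_ver() returns empty strings when it cannot detect the C library
        "libc": f"{libc_name} {libc_version}".strip() or "unknown",
    }
    # freedesktop_os_release() exists only on Python 3.10+ and raises
    # OSError when no os-release file is present.
    try:
        os_release = platform.freedesktop_os_release()
        report["distribution"] = os_release.get("PRETTY_NAME", "unknown")
    except (AttributeError, OSError):
        report["distribution"] = "unknown"
    return report


if __name__ == "__main__":
    for key, value in collect_environment_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines next to the list of imported modules and their import order should cover most of what is being asked for.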

dustin · July 23, 2019, 4:31pm

@pitrou may have more information, in “The next manylinux specification” he said:

It’s more complicated than that. The way Tensorflow built their “manylinux” wheels made Python crash when loaded side-by-side with other manylinux wheels, due to discrepancies at the C++ ABI level (this is a summary; I’m not sure anyone understands exactly what happened, and I’m one of the people who looked into it).

pitrou · July 23, 2019, 4:43pm

I don’t know if the crashes still exist currently. On the Arrow side of things, we have stopped looking at this, given that it’s an issue with how Tensorflow builds non-compliant manylinux1 wheels.

sub-mod · July 23, 2019, 4:54pm

I still see the segfault:

>>> import pyarrow
>>> import tensorflow
Segmentation fault (core dumped)

Tensorflow will not build manylinux1 wheels; manylinux2010 is the target.
I checked with Uwe L. Korn on this issue and he mentioned that pyarrow will be manylinux2010 compliant soon. In that case I think this segfault issue will be resolved.
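Since the crash kills the interpreter, it can help to try each import order in a fresh child process and record how that process ended. The following is a minimal sketch under that assumption; `try_import_order` and `try_all_orders` are hypothetical helper names, not part of any package mentioned in this thread.

```python
import itertools
import signal
import subprocess
import sys


def try_import_order(modules, python=sys.executable, timeout=300):
    """Import `modules` in the given order in a fresh interpreter.

    Returns "ok", "exit code N", or "killed by SIGNAME".
    """
    code = "; ".join(f"import {name}" for name in modules)
    proc = subprocess.run(
        [python, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            return f"killed by {signal.Signals(-proc.returncode).name}"
        except ValueError:
            return f"killed by signal {-proc.returncode}"
    if proc.returncode != 0:
        return f"exit code {proc.returncode}"
    return "ok"


def try_all_orders(modules):
    """Map every permutation of `modules` to the outcome of importing it."""
    return {
        order: try_import_order(order)
        for order in itertools.permutations(modules)
    }
```

Calling `try_all_orders(["pyarrow", "tensorflow"])` in an environment where both wheels are installed would report a result such as "killed by SIGSEGV" for any order that crashes, without taking down the parent interpreter.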

sub-mod · July 23, 2019, 4:57pm

These are from my notes:
https://issues.apache.org/jira/browse/ARROW-3466,
https://issues.apache.org/jira/browse/ARROW-2657 and
https://issues.apache.org/jira/browse/ARROW-5130 for past issues.
The resolution was to make the linker script of libarrow.so quite strict: https://github.com/pitrou/arrow/blob/master/cpp/src/arrow/symbols.map

zwol · July 23, 2019, 5:03pm

Thanks, I can reproduce this crash on my computer with pyarrow-0.14.1 and tensorflow-1.14.0 (both tagged manylinux1).

pitrou · July 23, 2019, 5:13pm

We started producing manylinux2010 wheels for PyArrow but pulled them at the last minute, as they failed loading (I don’t remember the details, but IIRC auditwheel repair is buggy).

sub-mod · July 23, 2019, 5:15pm

Just to let you know, TF is not manylinux1 compliant. auditwheel will fail on it.

sub-mod · July 23, 2019, 5:18pm

Last time I tried, it failed to repair, and auditwheel reported that pyarrow was linux_x86_64 and not manylinux1.

$ auditwheel show /input/wheelhouse/pyarrow-0.13.0-cp36-cp36m-manylinux1_x86_64.whl

pyarrow-0.13.0-cp36-cp36m-manylinux1_x86_64.whl is consistent with the
following platform tag: "linux_x86_64".

pitrou · July 23, 2019, 5:30pm

That’s weird, 'cause we precisely use auditwheel to produce manylinux1 wheels…

pitrou · July 23, 2019, 5:33pm

The latest PyArrow wheels seem fine (at least according to auditwheel…):

$ auditwheel show pyarrow-0.14.1-cp37-cp37m-manylinux1_x86_64.whl 

pyarrow-0.14.1-cp37-cp37m-manylinux1_x86_64.whl is consistent with the
following platform tag: "manylinux1_x86_64".
[....]
$ auditwheel show pyarrow-0.14.1-cp37-cp37m-manylinux2010_x86_64.whl 

pyarrow-0.14.1-cp37-cp37m-manylinux2010_x86_64.whl is consistent with
the following platform tag: "manylinux2010_x86_64".
[....]
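The two auditwheel reports in this thread disagree about whether the tag in a wheel's filename matches what the wheel actually contains, so it may be useful to pull out the tag the filename claims and compare it by hand with what `auditwheel show` prints. This is a minimal sketch that relies only on the wheel filename convention (name-version[-build]-python-abi-platform.whl); `claimed_platform_tags` is a hypothetical helper, and it does not inspect the wheel's contents the way auditwheel does.

```python
def claimed_platform_tags(wheel_filename):
    """Return the platform tags that a wheel's filename claims.

    A compressed tag set such as "manylinux1_x86_64.manylinux2010_x86_64"
    is split into its individual tags.
    """
    suffix = ".whl"
    if not wheel_filename.endswith(suffix):
        raise ValueError(f"not a wheel filename: {wheel_filename!r}")
    parts = wheel_filename[: -len(suffix)].split("-")
    # Five components, or six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {wheel_filename!r}")
    return parts[-1].split(".")


if __name__ == "__main__":
    name = "pyarrow-0.14.1-cp37-cp37m-manylinux1_x86_64.whl"
    print(claimed_platform_tags(name))
```

If the tag printed here differs from the one auditwheel reports for the same file, the wheel's contents do not meet the policy its name advertises.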

sub-mod · July 23, 2019, 6:52pm

Here are the manylinux2010 wheels of tensorflow, if you need to test:
https://tensorflow.pypi.thoth-station.ninja/index/manylinux2010/jemalloc/simple/tensorflow/

zwol · July 23, 2019, 7:15pm

I observe the same crash with this tensorflow wheel.

sub-mod · July 23, 2019, 7:43pm

I don’t see it on CentOS 6. Which OS are you on?

> root@7768fc1175ca:/workspace# python
> Python 3.6.3 (default, May 14 2019, 16:15:26)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> root@7768fc1175ca:/workspace# gcc -v
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-7/root/usr/libexec/gcc/x86_64-redhat-linux/7/lto-wrapper
> Target: x86_64-redhat-linux
> Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-7/root/usr --mandir=/opt/rh/devtoolset-7/root/usr/share/man --infodir=/opt/rh/devtoolset-7/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-plugin --with-linker-hash-style=gnu --enable-initfini-array --with-default-libstdcxx-abi=gcc4-compatible --with-isl=/builddir/build/BUILD/gcc-7.3.1-20180303/obj-x86_64-redhat-linux/isl-install --enable-libmpx --with-mpc=/builddir/build/BUILD/gcc-7.3.1-20180303/obj-x86_64-redhat-linux/mpc-install --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
> Thread model: posix
> gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)
> root@7768fc1175ca:/workspace# python
> Python 3.6.3 (default, May 14 2019, 16:15:26)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow
> >>> import tensorflow
> >>>
> >>>
> root@7768fc1175ca:/workspace# python
> Python 3.6.3 (default, May 14 2019, 16:15:26)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import tensorflow
> >>> import pyarrow
> >>>
> >>>

zwol · July 25, 2019, 1:49pm

I use Debian unstable (kernel 4.19, GNU libc 2.28) and I tested with Python 3.7. I wonder if there might have been some change to the dynamic linker in between CentOS 6 and now.

pitrou · July 25, 2019, 2:55pm

It definitely depends on the system version of libstdc++ and perhaps glibc.
For example I can reproduce on Ubuntu 18.04:

$ python -c "import pyarrow, tensorflow"
Segmentation fault

gdb backtrace:

#0  0x000000000000003f in ?? ()
#1  0x00007ffff77d4827 in __pthread_once_slow (once_control=0x7fffb98fcd28 <tensorflow::port::(anonymous namespace)::cpuid_once_flag>, 
    init_routine=0x7fffc67ff830 <__once_proxy>) at pthread_once.c:116
#2  0x00007fffb8f440ca in void std::call_once<void (&)()>(std::once_flag&, void (&)()) ()
   from /home/antoine/t/venv/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#3  0x00007fffb8f4410e in tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) ()
   from /home/antoine/t/venv/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#4  0x00007fffb8655bd5 in _GLOBAL__sub_I_cpu_feature_guard.cc ()
   from /home/antoine/t/venv/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#5  0x00007ffff7de5733 in call_init (env=0xb7c690, argv=0x7fffffffd8f8, argc=3, l=<optimized out>) at dl-init.c:72
#6  _dl_init (main_map=main_map@entry=0xfd3ff0, argc=3, argv=0x7fffffffd8f8, env=0xb7c690) at dl-init.c:119
#7  0x00007ffff7dea1ff in dl_open_worker (a=a@entry=0x7fffffff7dc0) at dl-open.c:522
#8  0x00007ffff7b4b2df in __GI__dl_catch_exception (exception=0x7fffffff7da0, operate=0x7ffff7de9dc0 <dl_open_worker>, args=0x7fffffff7dc0)
    at dl-error-skeleton.c:196
#9  0x00007ffff7de97ca in _dl_open (file=0x7fffc5738e20 "/home/antoine/t/venv/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so", 
    mode=-2147483646, caller_dlopen=0x641d20 <_PyImport_FindSharedFuncptr+112>, nsid=<optimized out>, argc=3, argv=<optimized out>, env=0xb7c690) at dl-open.c:605
#10 0x00007ffff75c1f96 in dlopen_doit (a=a@entry=0x7fffffff7ff0) at dlopen.c:66
#11 0x00007ffff7b4b2df in __GI__dl_catch_exception (exception=exception@entry=0x7fffffff7f90, operate=0x7ffff75c1f40 <dlopen_doit>, args=0x7fffffff7ff0)
    at dl-error-skeleton.c:196
#12 0x00007ffff7b4b36f in __GI__dl_catch_error (objname=0xb80140, errstring=0xb80148, mallocedp=0xb80138, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:215
#13 0x00007ffff75c2735 in _dlerror_run (operate=operate@entry=0x7ffff75c1f40 <dlopen_doit>, args=args@entry=0x7fffffff7ff0) at dlerror.c:162
#14 0x00007ffff75c2051 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#15 0x0000000000641d20 in _PyImport_FindSharedFuncptr ()

As you can see __pthread_once_slow (a private glibc function) is involved in the implementation of the C++ function std::call_once, which itself is used (sometimes? always? depending on libstdc++ version?) for the implementation of static local variables (which feature thread-safe initialization, see “Static local variables” in https://en.cppreference.com/w/cpp/language/storage_duration).
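Because the behaviour seems to depend on which libstdc++ and glibc end up in the process, one way to compare a crashing system with a working one is to list the runtime libraries the interpreter has actually mapped. The sketch below reads the Linux-specific /proc/self/maps file; the function names and the default list of name fragments are assumptions made for illustration.

```python
import os


def shared_objects_from_maps(
    maps_text, needles=("libstdc++", "libc.so", "libc-", "libpthread")
):
    """Return sorted unique file paths from /proc/<pid>/maps-style text
    whose basename contains any of `needles`."""
    paths = set()
    for line in maps_text.splitlines():
        # File-backed mappings have six fields; the last one is the path.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if any(n in os.path.basename(path) for n in needles):
                paths.add(path)
    return sorted(paths)


def loaded_runtime_libraries():
    """List the C and C++ runtime libraries mapped into this process (Linux only)."""
    with open("/proc/self/maps") as maps_file:
        return shared_objects_from_maps(maps_file.read())
```

Calling `loaded_runtime_libraries()` after a successful `import pyarrow` (and before `import tensorflow`) would show which libstdc++ and libc files are already in the process at the point where the second import runs its static initializers.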