How to use pip install to build some scientific packages from sources with custom build arguments

Hi,
I'm new to Python and Docker and learning a lot these days. I'm working on dockerizing my Python app and I want to reduce its size: due to some scientific dependencies the image is huge. I read about the possibility of building core scientific libraries like numpy, scipy, pandas, etc. from source with custom build arguments, e.g. stripping symbols and debug information, or selecting the BLAS/LAPACK library (OpenBLAS is small vs. MKL, which is huge). Please see here for a reference.

Anyway, I am struggling to do this through pip, the Python package manager.
So far I have tried installing some build dependencies, for instance (numpy and scipy deps):

build-essential
cmake
ninja-build
gfortran
pkg-config
python-dev
libopenblas-dev
liblapack-dev
cython3
patchelf
libatlas-base-dev
libffi-dev

and then tried to use pip as follows (building, for instance, just numpy from source):

pip install --prefix=${VIRTUAL_ENV} --no-cache-dir --use-pep517 --check-build-dependencies --no-binary numpy --global-option=build_ext --global-option="-g0" --global-option="-Wl,--strip-all" --requirement dependencies.txt

I am struggling to understand:

  • The difference between the --compile and the --no-binary options of pip install. Which should I use?
  • The use and implications of the --no-build-isolation option; should I use it in my case?
  • How to drop the --global-option flag (deprecated, as I understand) and still pass custom build arguments. Should I use --config-settings instead? Where can I find real-world examples of using it in different ways?
  • Can I do the above per package, e.g. custom options for building numpy and different options for building scipy?
  • How can I speed up the build? Should I investigate the -j4 option, for instance?

Faced with this problem, I would not use pip at all.

I would write scripts to fetch the source code of the packages you want to compile and use their build instructions to build them as you need.

You would build the dependencies and then supply them to the higher level packages as input.

What OS are you working on?

Also, does the size of the docker image really matter vs. the time you will invest in creating a smaller version?

3 Likes

Exactly – pip is not a build tool – it is a package manager that is overloaded to provide an interface to build tools, for an easier one-stop-shopping experience. In your case, you want to call the build tool directly (which I think is still a heavily patched setuptools/distutils for numpy / scipy, but I may be wrong). You probably want to build wheels and install those in your docker image.
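
Roughly, building the wheels separately and then installing them could look like a two-stage Dockerfile along these lines (an untested sketch; the base image and apt package names are just placeholders):

# build stage: has the compilers and -dev headers, and only produces wheels
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential gfortran pkg-config ninja-build libopenblas-dev
COPY requirements.txt .
RUN pip wheel --wheel-dir /wheels --no-binary numpy,scipy -r requirements.txt

# runtime stage: no toolchain, just the prebuilt wheels
FROM python:3.11-slim
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*.whl && rm -rf /wheels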

Also – you might want to give conda a look-see – if the packages in conda-forge aren’t quite what you need, you could rebuild just a couple.

Also – scipy is pretty darn big – and you are probably only using a small bit of it, but it’s hard to know exactly which bits – I’ve tried to split it up in the past and it hasn’t been worth it :slight_smile:

1 Like

FYI, as I understand it, SciPy has already switched and NumPy is switching to the new standards-based meson-python build backend, and numpy.distutils is deprecated.

1 Like

Let me first point out that the article you’re linking to is from 2018. A lot has changed in the meantime. In general:

  • Wheels on PyPI are already stripped and do not contain debug symbols. If they do, that’d be a bug, so if you see this for any of the packages you are trying to build here, please open an issue on the project’s issue tracker.
  • Wheels do not include MKL, so at best you can get rid of a duplicate copy of libopenblas and libgfortran - perhaps shaving off 40 MB from your final Docker image.
  • If you really need small images, then build against musl rather than glibc, e.g. by using Alpine Linux as a base image.
  • Use --no-binary, which is what forces pip to build from the sdist; --compile is about something else entirely (it only controls whether installed .py files are byte-compiled to .pyc).
  • Yes, you should use --no-build-isolation here. The dependencies in pyproject.toml contain numpy == pins that are specific to building binaries for redistribution on PyPI (handwaving, there’s a bit more to that - not too relevant here). If you are building your whole stack from source and then deploy that, you want to build against the packages in that stack. With build isolation, you’d instead be pulling different versions down from PyPI.
  • Yes, you should use --config-settings. Its interface is unfortunately pretty cumbersome, and there isn’t a canonical place with good examples AFAIK. Here is one for SciPy: http://scipy.github.io/devdocs/dev/contributor/meson_advanced.html#select-a-different-blas-or-lapack-library (see also the sketch after this list). I’ll note that as of today, using python -m build to build wheels, and then pip to install them, is nicer (this will be improved in an upcoming pip release).
  • You can specify different options per package. It’s anyway a good idea to build packages one by one in this case.
  • You can pass -j4 in --config-settings, indeed. For scipy this isn’t needed; the Meson build uses all available cores by default. For anything still using setup.py-based builds, you need to control it manually.
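
To make the --config-settings bullet concrete, here is a rough sketch for a meson-python-based SciPy build (the -Dblas/-Dlapack options come from the scipy docs linked above; treat the exact spellings as something to verify against your scipy, meson-python and pip versions):

# from an unpacked scipy source tree, with the build requirements
# (numpy, meson-python, cython, pybind11, pythran, ninja) already
# installed, since --no-isolation stops them being fetched from PyPI
python -m build --wheel --no-isolation \
    -Csetup-args=-Dblas=openblas \
    -Csetup-args=-Dlapack=openblas \
    -Ccompile-args=-j4
pip install dist/scipy-*.whl

# roughly the same through pip; note that repeating a --config-settings
# key only works on recent pip versions
pip install scipy --no-binary scipy --no-build-isolation \
    --config-settings=setup-args=-Dblas=openblas \
    --config-settings=compile-args=-j4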

It may, unfortunately. For example, AWS Lambda has size constraints that can cause your application to fail to upload. The last time I tried was a few years ago, but even then numpy + scipy + pandas was already very close to the limit.

This advice is incorrect. You should be using a build frontend, so either build or pip. Invoking setup.py (or meson, or …) usually still works but may cause some subtle problems, like missing .dist-info metadata, which can then cause, for example, importlib.resources to not work. There is no reason to avoid pip or build here - it’s just a way to say “build me a wheel”.

Let me also point out that when you’re building a conda package, that will do the exact same thing - invoke pip under the hood (look at the average conda-forge recipe to confirm that).

4 Likes

Also, this is only peripherally related, but I spent > 5 minutes figuring out where this happens (sharing for posterity): conda-build sets the appropriate environment variable so that pip doesn’t use build isolation, i.e. it uses the conda environment that the recipe provisions.

I am working with a Debian slim docker image at the moment. I will consider switching to Alpine Linux later on if that is doable with little effort, or alternatively building a distroless image and transferring only the needed packages, e.g. see here for a reference - although that approach does not seem mature enough for production yet.
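
If I do go down the Alpine route, I expect the build stage would start roughly like this (untested, and the apk package names are only my guess):

FROM python:3.11-alpine
# build-base pulls in gcc, musl-dev, make, etc.
RUN apk add --no-cache build-base gfortran openblas-dev pkgconf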

The time spent probably outweighs the results, but I’m learning these technologies, so it’s nice to dig a bit deeper, and the know-how is easily transferable to many applications. Also note that some cloud providers have size limits (e.g. AWS Lambda functions) and/or charge for every pull/push to a private image registry, so minimizing the size can impact your deployment/running costs.

Would you mind elaborating a bit more on splitting scipy up and giving me some reference on how to do it? Thanks!

I wish I could, but I don’t think there is any such reference. Way back when, there was some discussion of breaking scipy up into sub-packages, but for the most part:

  • The user experience for most people is much better if a single-step install (and import) of scipy gets you a whole bunch of stuff – that’s kind of what scipy is for – after all, we already have numpy as a core package.

  • While quite a bit of scipy is optional, there is also a fair bit of inter-dependence – so it’s hard to know where to draw the line.

So you kind of need to do it by hand – hand-build (copying and pasting from scipy) a package that has only what you need, and keep adding stuff until it works :frowning:

In my case, I only ended up doing this when I literally needed just a couple of functions from scipy.special, for instance.

1 Like

Calling the build tool directly is fine if the build tool says you can do it. Some do say that (e.g. flit), some say not to do it (e.g. setuptools, and apparently meson?)

I deliberately built pymsbuild to be invoked directly for developer tasks that are more complex than “just turn these sources into a wheel”.

But I think the advice everyone was trying to get at here is that the packages probably need to be built independently. That is, when there’s a chain of native dependencies, build each one directly and keep building on top of those, rather than trying to grab the end of the chain and expecting pip to sort it out.

Importantly, right now if you’re using --config-settings there’s really no good way to make sure your settings only apply to the package you care about (and not its dependencies, and potentially not even build dependencies if they have to be built). So you may have to understand the dependency chain first and use a separate build for each one.
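
As a sketch of what “a separate build for each one” might look like in practice (the version numbers and the BLAS setting are only placeholders):

# fetch the sdists without building anything yet
pip download --no-binary=:all: --no-deps numpy==1.24.3 scipy==1.10.1
tar xzf numpy-1.24.3.tar.gz && tar xzf scipy-1.10.1.tar.gz

# build numpy on its own first (its build requirements have to be
# installed already, because of --no-build-isolation), then install it
pip wheel ./numpy-1.24.3 --no-deps --no-build-isolation -w wheels
pip install wheels/numpy-*.whl

# now build scipy against that numpy, with scipy-only settings
pip wheel ./scipy-1.10.1 --no-deps --no-build-isolation -w wheels \
    --config-settings=setup-args=-Dblas=openblas
pip install wheels/scipy-*.whl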

1 Like

It’s fine for developer tasks to call meson directly, as you mentioned (and setup.py too for that matter - but you need to know what you’re doing exactly at that point, and what you’re missing out on). But that didn’t really seem to be the question here. Meson doesn’t have an opinion - it does have a very good CLI that’s meant to be used, but to get .dist-info you need meson-python which does not have a CLI and can only be used via pip/build.

2 Likes

I am responding here and not directly to you because your discussion is beyond my expertise.
Here are some personal observations:

  1. Using the command:

--check-build-dependencies

causes an error dump, as it detects unresolvable conflicts for me; see for example:

ERROR: Some build dependencies for scipy from … conflict with the backend dependencies: numpy==1.24.2 is incompatible with numpy==1.19.5; … , pybind11==2.10.3 is incompatible with pybind11==2.10.1.

If, as I believe, I must choose all the package versions myself in order to have a compatible stack, I will not force this check, since I don’t fully understand its implications.

  2. System packages and pip packages are different. I found myself having to install

Cython

with pip even though I had installed

cython3

from apt-get previously.

  3. Some dependencies are not straightforward to fix: a missing package

mesonpy

was fixed by installing a package with a different name,

meson-python

Also, installing both scipy and scikit-learn in one go fails because scikit-learn does not seem to recognize the build order (?) (scipy was already processed but not found); I don’t know whether that is because scikit-learn needs scipy to be installed already.

Anyway, given these problems and the time spent on them, I guess this is a rabbit hole for me, considering also my lack of expertise.

At the moment I am trying to build like this:

# Create virtual environment without bootstrapped pip
#!!! TODO: check if setuptools is bootstrapped, if yes then should be deleted to optimize image size
# https://docs.python.org/3/library/venv.html
RUN python -m venv --without-pip ${VIRTUAL_ENV}
# tools needed to build requirements from source:
# https://docs.scipy.org/doc//scipy-1.4.1/reference/building/linux.html
# https://numpy.org/doc/stable/user/building.html
# https://numpy.org/install/
# https://packages.debian.org/source/stable/cython
RUN set -eux \
    && buildScientificPackagesDeps=' \
            build-essential \
            cmake \
            ninja-build \
            gfortran \
            pkg-config \
            python-dev \
            libopenblas-dev \
            liblapack-dev \
            #cython3 \
            #patchelf \
            autoconf \
            automake \
            libatlas-base-dev \
            # TODO: check if python-ply is needed
            python-ply \
            libffi-dev \
        ' \
    && apt-get update \
    && apt-get install -y --no-install-recommends $buildScientificPackagesDeps
# Install dependencies list
# --prefix
#       used to install inside the virtual environment path
# --use-pep517 --check-build-dependencies --no-build-isolation
#       used to solve https://github.com/pypa/pip/issues/8559
#       "# DEPRECATION: psycopg2 is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed"
# --compile --global-option=build_ext --global-option=-g0 --global-option=-Wl
#       used to pass flags to C compiler and compile to bytecode from source, see:
#       https://towardsdatascience.com/how-to-shrink-numpy-scipy-pandas-and-matplotlib-for-your-data-product-4ec8d7e86ee4
#       https://blog.mapbox.com/aws-lambda-python-magic-e0f6a407ffc6
# 
# https://pip.pypa.io/en/stable/cli/pip_install/#options
RUN pip install --upgrade --no-cache-dir pip wheel setuptools Cython meson-python pythran pybind11 \
    && pip install --prefix=${VIRTUAL_ENV} --no-cache-dir --use-pep517 --no-build-isolation \
        #--check-build-dependencies \
        --requirement dependencies.txt \
        # https://discuss.python.org/t/how-to-use-pip-install-to-build-some-scientific-packages-from-sources-with-custom-build-arguments/
        # https://github.com/pypa/pip/issues/11325
        --no-binary numpy,scipy,pandas --config-settings="build_ext=-j4" \
    && pip cache purge

I will report the results when the build ends. Feel free to share your opinions, thanks.

Results:

  • Building numpy, scipy, and pandas with --no-build-isolation (as in the above script) did not reduce the size of the image; I saved just 1 MB. I could try a few other tests if suggestions are provided (they are welcome :slightly_smiling_face:)

  • Educating myself about Python bytecode, I found it useful to add --no-compile alongside --no-cache-dir in the pip install commands. This trick saved me around 180 MB, which is huge. The downside is that the .py files have to be compiled to .pyc at run time for execution. I saw that the official Python docker image also strips out .pyc files, so I guess the size benefit outweighs the performance hit, if there is any. I still have to check whether importing only the needed functions from modules is a best practice in this sense, so that only what is actually needed gets compiled to bytecode.
    EDIT: actually the image does not work, failing to import the numpy module. :face_with_hand_over_mouth:
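
(A side note to the bytecode point above: rather than re-enabling --compile, one thing I might try is letting the interpreter write its bytecode cache to a writable path at run time, which Python 3.8+ supports, e.g. with this in the final image:)

# keep site-packages free of .pyc files; the interpreter caches bytecode
# under this writable prefix at run time instead
ENV PYTHONPYCACHEPREFIX=/tmp/pycache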

Arrggh! These issues are being actively talked about in other threads, but the short version is: don’t use the system Python, at least not without virtual environments.

Note that conda makes a point of installing python packages in a pip-compatible way just for this reason.

Also – if you are doing custom builds, I would probably use --no-deps anyway.
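
For example, something like this when installing your hand-built wheels (paths are only illustrative):

# install only the wheels you built yourself; --no-deps keeps pip from
# pulling replacement copies of the dependencies down from PyPI
pip install --no-cache-dir --no-deps wheels/numpy-*.whl wheels/scipy-*.whl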

1 Like

Great thread, thanks to all for the constructive discussion. I’ve upgraded my build Dockerfile for SciPy and NumPy, which was modelled on the same article (the one from 5 years ago), targeting Amazon Linux 2 (a “layer” for a packaged AWS Lambda microservice). Building from source lets it fit within the AWS Lambda size constraints, which I appreciate is a bit of a misuse of the purpose, but you’ve got to do what you’ve got to do to deploy! :smiley_cat:

There was a bit of a cascade of build dependency requirements from the recent version updates (cmake, gcc, etc.), so a significant portion of the dependencies had to be built from source in the Docker image.

I hope nobody minds if I share this here, where it might be visible to others facing a similar challenge, or for comparison against the solution above. There were a few other places describing this problem (e.g. here) which trailed off without a clear indication of whether they solved it or not.

I haven’t found the magic combination of options that works yet, but at least I think you’ve set me on the right path now, thanks.

Basic setup and Yum packages installed before build:

FROM mlupin/docker-lambda:python3.10-build AS build

USER root

WORKDIR /var/task

# https://towardsdatascience.com/how-to-shrink-numpy-scipy-pandas-and-matplotlib-for-your-data-product-4ec8d7e86ee4
ENV CFLAGS "-g0 -Wl,--strip-all -DNDEBUG -Os -I/usr/include -I/usr/local/include -L/usr/lib64 -L/usr/local/lib64 -L/usr/lib -L/usr/local/lib"

RUN yum install -y wget curl git nasm openblas-devel.x86_64 lapack-devel.x86_64 python-dev file-devel make Cython libgfortran10.x86_64 openssl-devel

# Build and install CMake from source
WORKDIR /tmp

ENV CMAKE_VERSION=3.26.4

# Download and install CMake
RUN wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz
RUN tar -xvzf cmake-${CMAKE_VERSION}.tar.gz
RUN cd cmake-${CMAKE_VERSION} && ./bootstrap && make -j4 && make install

# Clean up temporary files
RUN rm -rf /tmp/cmake-${CMAKE_VERSION}
RUN rm /tmp/cmake-${CMAKE_VERSION}.tar.gz

WORKDIR /var/task
RUN pip install --upgrade pip

RUN pip --version

# Specify the version to use for numpy and scipy
ENV NUMPY_VERSION=1.24.3
ENV SCIPY_VERSION=1.10.1

# Download the numpy source distribution
RUN pip download --no-binary=:all: numpy==$NUMPY_VERSION

# Upgrade GCC to version 8 for SciPy Meson build system
RUN wget https://ftp.gnu.org/gnu/gcc/gcc-8.4.0/gcc-8.4.0.tar.gz && \
    tar xf gcc-8.4.0.tar.gz && \
    rm gcc-8.4.0.tar.gz && \
    cd gcc-8.4.0 && \
    ./contrib/download_prerequisites && \
    mkdir build && \
    cd build && \
    ../configure --disable-multilib && \
    make -j$(nproc) && \
    make install && \
    cd / && \
    rm -rf gcc-8.4.0

# Set environment variables
ENV CC=/usr/local/bin/gcc
ENV CXX=/usr/local/bin/g++
ENV FC=/usr/local/bin/gfortran

# Verify GCC version
RUN gcc --version
RUN /usr/local/bin/gfortran --version

# Extract the numpy package and build the wheel
RUN pip install Cython
RUN ls && tar xzf numpy-$NUMPY_VERSION.tar.gz
RUN ls && cd numpy-$NUMPY_VERSION && python setup.py bdist_wheel build_ext -j 4

ENV BUILT_NUMPY_WHEEL=numpy-$NUMPY_VERSION/dist/numpy-$NUMPY_VERSION-*.whl

RUN ls $BUILT_NUMPY_WHEEL

NumPy and SciPy build (for simplicity I installed a NumPy wheel from PyPI at the same version as the one built from source, purely as a build dependency for SciPy):

# Install NumPy from PyPI at the same version (it's a SciPy build dependency); the built wheel is installed later
RUN pip install numpy==$NUMPY_VERSION
RUN python -c "import numpy"

# Install build dependencies for the SciPy wheel
RUN pip install pybind11 pythran

# Extract the SciPy package and build the wheel
# RUN wget https://github.com/scipy/scipy/archive/refs/tags/v$SCIPY_VERSION.tar.gz -O scipy-$SCIPY_VERSION.tar.gz
RUN git clone --recursive https://github.com/scipy/scipy.git scipy-$SCIPY_VERSION && \
    cd scipy-$SCIPY_VERSION && \
    git checkout v$SCIPY_VERSION && \
    git submodule update --init

RUN cd scipy-$SCIPY_VERSION && python setup.py bdist_wheel build_ext -j 4

ENV BUILT_SCIPY_WHEEL=scipy-$SCIPY_VERSION/dist/SciPy-*.whl
RUN ls $BUILT_SCIPY_WHEEL

# Install the wheels with pip
# (Note: previously this used --compile but now we already did the wheel compilation)
RUN pip install --no-compile --no-cache-dir \
  -t /var/task/np_scipy_layer/python \
  $BUILT_NUMPY_WHEEL \
  $BUILT_SCIPY_WHEEL

RUN ls /var/task/np_scipy_layer/python

# Clean up the sdists and wheels
RUN rm numpy-$NUMPY_VERSION.tar.gz
RUN rm -r numpy-$NUMPY_VERSION scipy-$SCIPY_VERSION

# Uninstall non-built numpy after building the SciPy wheel
RUN pip uninstall numpy -y

RUN cp /var/task/libav/avprobe /var/task/np_scipy_layer/ \
    && cp /var/task/libav/avconv /var/task/np_scipy_layer/

RUN cp /usr/lib64/libblas.so.3.4.2 /var/task/np_scipy_layer/lib/libblas.so.3 \
    && cp /usr/lib64/libgfortran.so.4.0.0 /var/task/np_scipy_layer/lib/libgfortran.so.4 \
    && cp /usr/lib64/libgfortran.so.5.0.0 /var/task/np_scipy_layer/lib/libgfortran.so.5 \
    && cp /usr/lib64/liblapack.so.3.4.2 /var/task/np_scipy_layer/lib/liblapack.so.3 \
    && cp /usr/lib64/libquadmath.so.0.0.0 /var/task/np_scipy_layer/lib/libquadmath.so.0 \
    && cp /usr/lib64/libmagic.so.1.0.0 /var/task/np_scipy_layer/lib/libmagic.so.1 \
    && cp /usr/local/lib/libmp3lame*.so* /var/task/np_scipy_layer/lib \
    && cd /var/task/np_scipy_layer  \
    && zip -j9 np_scipy_layer.zip /var/task/np_scipy_layer/avconv \
    && zip -j9 np_scipy_layer.zip /var/task/np_scipy_layer/avprobe \
    && zip -r9 np_scipy_layer.zip magic  \
    && zip -r9 np_scipy_layer.zip python  \
    && zip -r9 np_scipy_layer.zip lib