How to use pip install to build some scientific packages from sources with custom build arguments

I’m new to Python and Docker and learning a lot these days. I’m working on dockerizing my Python app and I wanted to reduce its size. Due to some scientific dependencies the image is huge. I read about the possibility of building core scientific libraries like numpy, scipy, pandas, etc. from source with custom build arguments, e.g. by stripping symbols and debug information or by selecting the BLAS/LAPACK implementation, e.g. OpenBLAS (small) vs MKL (huge). Please see here for a reference.

Anyway I am struggling to do this through the pip python package manager.
To date, what I tried is installing some building dependencies, for instance (numpy and scipy deps):


And then tried to use pip as follows (for instance, just numpy from source):

pip install --prefix=${VIRTUAL_ENV} --no-cache-dir --use-pep517 --check-build-dependencies --no-binary numpy --global-option=build_ext --global-option="-g0" --global-option="-Wl,--strip-all" --requirement dependencies.txt

I am struggling to understand:

  • The difference between the --compile and the --no-binary options of pip install. Which should I use?
  • The use and implications of the option --no-build-isolation; should I use it in my case?
  • How to drop the option --global-option (deprecated, as I understand) while still passing custom build arguments. Should I use the option --config-settings? Where can I find real-world examples of using it in different ways?
  • Can I do the above per package, e.g. custom options for building numpy and different options for building scipy?
  • How can I speed up the build? Should I investigate the -j4 option, for instance?

Faced with this problem I would not use pip at all.

I would write scripts that take the source code of the packages you want to compile and use their build instructions to build them as you need.

You would build the dependencies and then supply them to the higher level packages as input.

What OS are you working on?

Also does the size of the docker image really matter vs. the time you will invest in creating a smaller version?


Exactly – pip is not a build tool – it is a package manager that is overloaded to provide an interface to build tools, for an easier one-stop-shopping experience. In your case, you want to call the build tool directly (which I think is still a heavily patched setuptools/distutils for numpy / scipy, but I may be wrong). You probably want to build wheels and install those in your docker image.

Also – you might want to give conda a look-see – if the packages in conda-forge aren’t quite what you need, you could rebuild just a couple.

Also – scipy is pretty darn big – and you are probably only using a small bit of it, but it’s hard to know exactly which bits – I’ve tried to split it up in the past and it hasn’t been worth it :slight_smile:


FYI, nowadays, as I understand it, SciPy has switched and NumPy is switching to the new standards-based meson-python build backend, and numpy.distutils is deprecated.


Let me first point out that the article you’re linking is from 2018. A lot has changed in the meantime. In general:

  • Wheels on PyPI are already stripped and do not contain debug symbols. If they do, that’d be a bug, so if you see this for any of the packages you are trying to build here, please open an issue on the project’s issue tracker.
  • Wheels do not include MKL, so at best you can get rid of a duplicate copy of libopenblas and libgfortran - perhaps shaving off 40 MB from your final Docker image.
  • If you really need small images, then build against musl rather than glibc - so use Alpine Linux as a base, for example.
  • Use --no-binary; --compile only controls whether .py files are byte-compiled to .pyc at install time.
  • Yes, you should use --no-build-isolation here. The dependencies in pyproject.toml contain numpy == pins that are specific to building binaries for redistribution on PyPI (handwaving, there’s a bit more to that - not too relevant here). If you are building your whole stack from source and then deploy that, you want to build against the packages in that stack. With build isolation, you’d instead be pulling different versions down from PyPI.
  • Yes, you should use --config-settings. Its interface is unfortunately pretty cumbersome, and there isn’t a canonical place with good examples AFAIK. Here is one for SciPy: Advanced Meson build topics — SciPy v1.11.0.dev0+1679.edf236d Manual. I’ll note that as of today, using python -m build to build wheels and then pip to install them is nicer (pip still has some rough edges here that will be solved in a coming pip release).
  • You can specify different options per package. It’s anyway a good idea to build packages one by one in this case.
  • You can indeed pass -j4 in --config-settings. For scipy this isn’t needed; the Meson build will use all available cores by default. For anything still using `setup.py`-based builds, you need to control it manually.
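Concretely, the per-package approach above might look like the following sketch. The `setup-args`/`compile-args` keys assume the meson-python backend, and `-Dblas=openblas` is a SciPy Meson option; check the docs for the versions you are actually building:

```shell
# Build and install one package at a time, each with its own settings.
# --no-build-isolation builds against the already-installed stack instead
# of pulling pinned build dependencies down from PyPI.
pip install --no-binary numpy --no-build-isolation numpy
pip install --no-binary scipy --no-build-isolation \
    --config-settings=setup-args="-Dblas=openblas" \
    --config-settings=compile-args="-j4" \
    scipy
```

With python -m build the equivalent is roughly `python -m build --wheel -Csetup-args="-Dblas=openblas" .` from the source tree, followed by a plain `pip install` of the resulting wheel.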

It may, unfortunately. For example, AWS Lambda has size constraints that can cause your application to fail to upload. Last time I tried was a few years ago, but even then numpy + scipy + pandas was already very close to the limit.

This advice is incorrect. You should be using a build frontend, so either build or pip. Invoking `setup.py` (or meson, or …) directly usually still works, but it may cause some subtle problems, like missing .dist-info metadata, which can then cause, for example, importlib.resources to not work. There is no reason to avoid pip or build here - either is just a way to say “build me a wheel”.
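As a sketch, the frontend route looks like this (paths illustrative; with --no-isolation you must have the build dependencies installed already):

```shell
# Ask the project's own build backend for a wheel via the pypa/build
# frontend, then install the result; .dist-info metadata is generated
# correctly, unlike a direct backend invocation.
python -m pip install build
python -m build --wheel --no-isolation .   # wheel lands in dist/
python -m pip install dist/*.whl
```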

Let me also point out that when you’re building a conda package, that will do the exact same thing - invoke pip under the hood (look at the average conda-forge recipe to confirm that).


Also, this is only peripherally related, but I spent > 5 minutes figuring out where this happens (sharing for posterity): conda-build sets the appropriate environment variable so that pip doesn’t use build isolation, i.e. it uses the conda environment that the recipe provisions.
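For reference, pip maps its long options to environment variables of the form PIP_<OPTION_NAME>, so an orchestrator can flip this without touching the command line (hedged: pip’s parsing of boolean-ish values here has known quirks, so prefer an unambiguous value like "1"):

```shell
# Every pip invocation in this shell now behaves as if
# --no-build-isolation had been passed explicitly.
export PIP_NO_BUILD_ISOLATION=1
pip install --no-binary numpy numpy
```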

I am working with a Debian slim docker image at the moment. I will consider switching to Alpine Linux later on if that is possible with little effort, or also consider building a distroless image and transferring only the needed packages, e.g. see here for a reference - though that approach does not seem mature enough for production yet.
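A middle ground that works on Debian slim is a multi-stage build: do all compilation in a builder stage and copy only the virtual environment into the final image. A sketch (base tags, paths, and the runtime BLAS package name are assumptions to adapt):

```dockerfile
# --- builder stage: carries compilers, headers, build tools ---
FROM python:3.10-slim AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential gfortran libopenblas-dev
RUN python -m venv /opt/venv
RUN /opt/venv/bin/pip install --no-cache-dir --no-binary numpy numpy

# --- final stage: only runtime libraries plus the venv ---
FROM python:3.10-slim
RUN apt-get update && apt-get install -y --no-install-recommends libopenblas0 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=build /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
```

The toolchain (easily several hundred MB) never reaches the final image, regardless of how the packages themselves were built.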

The time spent exceeds the payoff, but I’m learning these technologies, so it’s nice to dig in a bit and learn more. The know-how transfers easily to many applications. Also note that some cloud providers have size limits (e.g. AWS Lambda functions) and/or you pay for every pull/push to a private cloud image registry, so minimizing the size can impact your deployment/running costs.

Would you mind elaborating a bit more on splitting scipy up and giving me some reference on how to do it? Thanks!

I wish I could, but I don’t think there is any such reference. Way back when, there was some discussion of breaking scipy up into sub-packages, but for the most part:

  • The user experience for most people is much better if a single-step install (and import) of scipy gets you a whole bunch of stuff – that’s kinda what scipy is for – after all, we have numpy as the core package already.

  • while quite a bit of scipy is optional, there is also a fair bit of inter-dependence – so it’s hard to know where to draw the line.

So you kinda need to do it by hand – hand-build (copying and pasting from scipy) a package that has only what you need, and keep adding stuff until it works :frowning:

In my case, I only ended up doing this when I literally needed just a couple of functions in scipy.special, for instance.


Calling the build tool directly is fine if the build tool says you can do it. Some do say that (e.g. flit), some say not to do it (e.g. setuptools, and apparently meson?)

I deliberately built pymsbuild to be invoked directly for developer tasks that are more complex than “just turn these sources into a wheel”.

But I think the advice everyone was trying to get at here is that the packages probably need to be built independently. That is, when there’s a chain of native dependencies, build each one directly and keep building on top of those, rather than trying to grab the end of the chain and expecting pip to sort it out.

Importantly, right now if you’re using --config-settings there’s really no good way to make sure your settings only apply to the package you care about (and not its dependencies, and potentially not even build dependencies if they have to be built). So you may have to understand the dependency chain first and use a separate build for each one.
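In practice, that means building wheels bottom-up with --no-deps, so each package’s settings apply only to itself. A sketch (the `-Dblas=openblas` value assumes SciPy’s Meson build options and is illustrative):

```shell
# Build each link of the dependency chain into a local wheel, install
# it, then move up; --no-deps stops pip from touching anything else.
pip wheel --no-binary numpy --no-build-isolation --no-deps -w wheels/ numpy
pip install wheels/numpy-*.whl
pip wheel --no-binary scipy --no-build-isolation --no-deps -w wheels/ \
    --config-settings=setup-args="-Dblas=openblas" scipy
pip install wheels/scipy-*.whl
```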


It’s fine for developer tasks to call meson directly, as you mentioned (and `setup.py` too, for that matter - but you need to know exactly what you’re doing at that point, and what you’re missing out on). But that didn’t really seem to be the question here. Meson doesn’t have an opinion - it does have a very good CLI that’s meant to be used, but to get .dist-info you need meson-python, which does not have a CLI and can only be used via pip/build.


I am responding here rather than directly to you because your discussion is beyond my expertise.
Here are some personal observations:

  1. Using the command:


causes a dump as it detects unresolvable conflicts for me, see for example:

ERROR: Some build dependencies for scipy from … conflict with the backend dependencies: numpy==1.24.2 is incompatible with numpy==1.19.5; … , pybind11==2.10.3 is incompatible with pybind11==2.10.1.

If, as I believe, I must choose all package versions in order to have a compatible stack, I will not force this check since I am unable to understand the implications.

  2. System packages and pip packages are different. I found myself having to install


with pip even though I had installed


from apt-get previously.

  3. Some dependencies are not straightforwardly fixable: a missing package


was fixed by installing a package with different name


or installing both scipy and scikit-learn fails because scikit-learn does not recognize the build-chain order (?) (scipy was already processed but not found): I don’t know whether that is because scikit-learn needs scipy to be installed already.

Anyway, given these problems and the time spent, I guess this is a rabbit hole for me, considering also my lack of competence.

At the moment I am trying to build like this:

# Create virtual environment without bootstrapped pip
#!!! TODO: check if setuptools is bootstrapped, if yes then should be deleted to optimize image size
RUN python -m venv --without-pip ${VIRTUAL_ENV}
# tools needed to build requirements from source:
RUN set -eux \
    && buildScientificPackagesDeps=' \
            build-essential \
            cmake \
            ninja-build \
            gfortran \
            pkg-config \
            python-dev \
            libopenblas-dev \
            liblapack-dev \
            #cython3 \
            #patchelf \
            autoconf \
            automake \
            libatlas-base-dev \
            # TODO: check if python-ply is needed
            python-ply \
            libffi-dev \
        ' \
    && apt-get update \
    && apt-get install -y --no-install-recommends $buildScientificPackagesDeps
# Install dependencies list
# --prefix
#       used to install inside the virtual environment path
# --use-pep517 --check-build-dependencies --no-build-isolation
#       used to solve
#       "DEPRECATION: psycopg2 is being installed using the legacy ' install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed"
# --compile --global-option=build_ext --global-option=-g0 --global-option=-Wl,--strip-all
#       used to pass flags to the C compiler and compile to bytecode from source, see:
RUN pip install --upgrade --no-cache-dir pip wheel setuptools Cython meson-python pythran pybind11 \
    && pip install --prefix=${VIRTUAL_ENV} --no-cache-dir --use-pep517 --no-build-isolation \
        #--check-build-dependencies \
        --requirement dependencies.txt \
        --no-binary numpy,scipy,pandas --config-settings="build_ext=-j4" \
    && pip cache purge

I will report the results when the build ends. Feel free to share your opinions, thanks.


  • building numpy, scipy, and pandas with --no-build-isolation (as in the above script) did not reduce the size of the image; I saved just 1 MB. I could try a few other tests if suggestions are provided (they are welcome :slightly_smiling_face:)

  • Educating myself about Python bytecode, I found it useful to add --no-compile alongside --no-cache-dir in my pip install commands. This trick saved me around 180 MB, which is huge. The downside is that the .py files have to be compiled to .pyc at run-time on first import. I saw the official Python docker image also strips off .pyc files; I guess the size benefit outweighs the performance hit, if there is any. I have to check whether importing only the needed functions from modules is a best practice in this sense, in order to compile into bytecode only what is actually needed.
    EDIT: actually the image does not work, failing to import the numpy module. :face_with_hand_over_mouth:

Arrggh! These issues are being actively talked about in other threads, but the short version is: don’t use the system Python, at least not without virtual environments.

Note that conda makes a point of installing python packages in a pip-compatible way just for this reason.

Also – if you are doing custom builds, I would probably use --no-deps anyway.


Great thread, thanks to all for the constructive discussion. I’ve upgraded my build Dockerfile for SciPy and NumPy, which was modelled on the same article (the one from 5 years ago), targeting Amazon Linux 2 (a “layer” for a packaged AWS Lambda microservice). Building from source enables it to fit within the AWS Lambda size constraints, which I appreciate is a bit of a misuse of the purpose, but you’ve got to do what you’ve got to do to deploy! :smiley_cat:

There was a bit of a cascade of build-dependency requirements with recent version updates (cmake, gcc, etc.), so a significant portion of the dependencies had to be built from source in the Docker image.

I hope nobody minds if I share this here, where it might be visible to others facing a similar challenge, or for comparison against the solution above. There were a few other places describing this problem (e.g. here) which trailed off without a clear indication of whether they solved it or not.

I haven’t found the magic combination of options that works yet, but at least I think you’ve set me on the right path now, thanks.

Basic setup and Yum packages installed before build:

FROM mlupin/docker-lambda:python3.10-build AS build

USER root

WORKDIR /var/task

ENV CFLAGS "-g0 -Wl,--strip-all -DNDEBUG -Os -I/usr/include -I/usr/local/include -L/usr/lib64 -L/usr/local/lib64 -L/usr/lib -L/usr/local/lib"

RUN yum install -y wget curl git nasm openblas-devel.x86_64 lapack-devel.x86_64 python-dev file-devel make Cython libgfortran10.x86_64 openssl-devel

# Download and install CMake
RUN wget${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz
RUN tar -xvzf cmake-${CMAKE_VERSION}.tar.gz
RUN cd cmake-${CMAKE_VERSION} && ./bootstrap && make -j4 && make install

# Clean up temporary files
RUN rm -rf /tmp/cmake-${CMAKE_VERSION}
RUN rm /tmp/cmake-${CMAKE_VERSION}.tar.gz

WORKDIR /var/task
RUN pip install --upgrade pip

RUN pip --version

# Specify the version to use for numpy and scipy

# Download numpy and scipy source distributions
RUN pip download --no-binary=:all: numpy==$NUMPY_VERSION

# Upgrade GCC to version 8 for SciPy Meson build system
RUN wget && \
    tar xf gcc-8.4.0.tar.gz && \
    rm gcc-8.4.0.tar.gz && \
    cd gcc-8.4.0 && \
    ./contrib/download_prerequisites && \
    mkdir build && \
    cd build && \
    ../configure --disable-multilib && \
    make -j$(nproc) && \
    make install && \
    cd / && \
    rm -rf gcc-8.4.0

# Set environment variables
ENV CC=/usr/local/bin/gcc
ENV CXX=/usr/local/bin/g++
ENV FC=/usr/local/bin/gfortran

# Verify GCC version
RUN gcc --version
RUN /usr/local/bin/gfortran --version

# Extract the numpy package and build the wheel
RUN pip install Cython
RUN ls && tar xzf numpy-$NUMPY_VERSION.tar.gz
RUN ls && cd numpy-$NUMPY_VERSION && python bdist_wheel build_ext -j 4



NumPy and SciPy build (for simplicity I installed a wheel with the same version of NumPy as I was building from source, the wheel being purely for building SciPy)

# Don't install NumPy from the built wheel but use same version (it's a SciPy dependency)
RUN pip install numpy==$NUMPY_VERSION
RUN python -c "import numpy"

# Install build dependencies for the SciPy wheel
RUN pip install pybind11 pythran

# Extract the SciPy package and build the wheel
# RUN wget$SCIPY_VERSION.tar.gz -O scipy-$SCIPY_VERSION.tar.gz
RUN git clone --recursive scipy-$SCIPY_VERSION && \
    cd scipy-$SCIPY_VERSION && \
    git checkout v$SCIPY_VERSION && \
    git submodule update --init

RUN cd scipy-$SCIPY_VERSION && python bdist_wheel build_ext -j 4


# Install the wheels with pip
# (Note: previously this used --compile but now we already did the wheel compilation)
RUN pip install --no-compile --no-cache-dir \
  -t /var/task/np_scipy_layer/python \

RUN ls /var/task/np_scipy_layer/python

# Clean up the sdists and wheels
RUN rm numpy-$NUMPY_VERSION.tar.gz

# Uninstall non-built numpy after building the SciPy wheel
RUN pip uninstall numpy -y

RUN cp /var/task/libav/avprobe /var/task/np_scipy_layer/ \
    && cp /var/task/libav/avconv /var/task/np_scipy_layer/

RUN cp /usr/lib64/ /var/task/np_scipy_layer/lib/ \
    && cp /usr/lib64/ /var/task/np_scipy_layer/lib/ \
    && cp /usr/lib64/ /var/task/np_scipy_layer/lib/ \
    && cp /usr/lib64/ /var/task/np_scipy_layer/lib/ \
    && cp /usr/lib64/ /var/task/np_scipy_layer/lib/ \
    && cp /usr/lib64/ /var/task/np_scipy_layer/lib/ \
    && cp /usr/local/lib/libmp3lame*.so* /var/task/np_scipy_layer/lib \
    && cd /var/task/np_scipy_layer  \
    && zip -j9 /var/task/np_scipy_layer/avconv \
    && zip -j9 /var/task/np_scipy_layer/avprobe \
    && zip -r9 magic  \
    && zip -r9 python  \
    && zip -r9 lib