PEP 518 and editable mode: don't install already-satisfied dependencies?

NicolasHug · February 4, 2020, 7:33pm

With a pyproject.toml file, pip will by default install the build dependencies in a temporary directory.

Is it conceivable to not install a dependency in the temp dir in case that dependency is already satisfied?

Rationale:

We maintain a package (scikit-learn) that has Cython and numpy as build dependencies. Cython compiles .pyx files to .c files, which are then compiled to Python extensions. Some of the generated .c files depend on some numpy C headers.

We’re considering adding a pyproject.toml file.

However, this causes numpy to be installed in the temporary env every time pip install -e . is run. As a result, Cython believes that the .pyx files that depend on numpy need to be re-compiled, since numpy itself is considered “new”. This means that every time we use pip install -e ., lots of our Cython files are re-compiled even though none of them actually changed.

I believe that if numpy wasn’t installed in the temp directory, we wouldn’t have this issue. Pip could just be using the already-installed numpy in the env (as long as it satisfies the constraints in the pyproject.toml).

We can also just use --no-build-isolation, but I am concerned that adding the .toml file changes pip’s default behavior.

ogrisel · February 6, 2020, 12:33pm

A solution would be to have pip keep its build venv around to reuse it later instead of creating a new temp venv and deleting it each time pip install -e . is called.

pf_moore · February 6, 2020, 3:47pm

In theory, I like this idea. In practice, however, how would you ensure that the build environment is still appropriate? If a new version of setuptools is released, then the build environment¹ would need to be rebuilt (or at least updated²) to get the new version of setuptools. But how would you know that, short of doing a lot of the work that’s involved in creating the build environment anyway?

pip caches downloaded files, and wheels that it builds, so we already avoid duplicated work in those areas.

It may be possible to do something here - it’s just that it’s quite difficult to pin down the details of how we’d do it in practice. A PR implementing something like this idea - or even just a sketch in pseudocode of the actual changes that are being proposed in pip - would be a great place to start a design discussion. But something as broad and unspecific as “keep build environments around for reuse” is probably too general to get very far by itself.

¹ To be precise here, pip’s build environments aren’t actual virtual environments. They are lightweight directories containing only the requested build requirements. This has pros and cons when it comes to this proposal…
² One downside of the environments not being full venvs is that they don’t easily support upgrading packages (there’s no need if they are single-use).
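
To make the objection above concrete, here is a minimal sketch of one naive way a reusable build environment could be keyed: by hashing the declared build requirements. This is not something pip does; build_env_key and the requirement list are hypothetical and purely illustrative.

```python
import hashlib


def build_env_key(requirements):
    """Hypothetical cache key: a hash of the sorted requirement strings."""
    normalized = sorted(r.strip() for r in requirements)
    return hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()


# Illustrative build requirements, not taken from any real project.
requires = ["setuptools", "wheel", "Cython", "numpy"]

key_today = build_env_key(requires)

# Suppose a new setuptools is released on PyPI. The requirement strings
# are unchanged, so the key is unchanged and a cached environment keyed
# this way would be reused with the older setuptools still inside it.
key_after_new_release = build_env_key(requires)

# Only an edit to the declared requirements produces a different key.
key_other = build_env_key(["setuptools", "wheel"])
```

The key only changes when the declared requirements change, so detecting a newly released version would still need something close to a full dependency resolution, which is most of the work of creating the environment.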

uranusjr · February 7, 2020, 9:09am

I put together a tool called Setl a while ago to fix some of my workflow problems around using Setuptools in a pyproject.toml project, and one of the features I implemented was to reuse the environment during builds to avoid re-installing build requirements. My usage is likely far less sophisticated than scikit-learn, so there should be many rough edges to deal with, but I’d be extremely glad (honoured!) if there’s interest to help me improve it.

Regarding pip install -e though, this kind of feels like an A-B problem to me. Why does Cython think the Numpy files changed? pip currently sets mtime for wheel contents (if possible); what can it do to convince Cython the Numpy copies are the same?

ogrisel · February 7, 2020, 10:00pm

Why does Cython think the Numpy files changed? pip currently sets mtime for wheel contents (if possible); what can it do to convince Cython the Numpy copies are the same?

Here are the details: scikit-learn/scikit-learn pull request #16244, “[MRG] BLD Specify build time dependencies via pyproject.toml” by jeremiedbb.

Our build depends on files such as site-packages/Cython/Includes/libc/math.pxd and site-packages/Cython/Includes/numpy/__init__.pxd and Cython is a build dependency of scikit-learn (hence only installed in the tmp env when using pyproject.toml).

Thanks for setl, it looks interesting, but we would rather avoid asking our users to install yet another tool. If pip supported incremental building (and a -j 4 build option for multi-core parallel builds), that would be perfect.

chrahunt · February 8, 2020, 4:20am

As-is I don’t see how pip could support incremental building without there being pip-specific behavior that a build backend would have to rely on (which we want to avoid). It could be possible with something like the persistent cache directory proposed in Proposal: Adding a persistent cache directory to PEP 517 hooks.

ogrisel · February 9, 2020, 2:28pm

You mean the isolated environment folder created by pip would be placed under the persistent cache folder instead of TMPDIR? Why not.

But to clarify, the problem does not seem to be “copying the repository into a temporary directory” (I don’t think that happens when running pip install in editable mode).

The root of the problem is that the site-packages folder that holds the build dependencies specified in the pyproject.toml file is reinstalled each time. This triggers the mtime-based dependency tracking of the Cython compiler and therefore prevents incremental building in editable mode.
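
As a rough illustration of the mtime-based check described above, here is a minimal sketch in the style of make. It is not Cython's actual implementation; needs_rebuild, touch, and the file names are hypothetical stand-ins.

```python
import os
import tempfile


def needs_rebuild(output, dependencies):
    """Return True if output is missing or older than any dependency."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(dep) > out_mtime for dep in dependencies)


def touch(path, mtime):
    """Create path if needed and set its access/modification time."""
    with open(path, "a"):
        pass
    os.utime(path, (mtime, mtime))


with tempfile.TemporaryDirectory() as tmp:
    pyx = os.path.join(tmp, "module.pyx")
    generated_c = os.path.join(tmp, "module.c")
    # Stand-in for a header shipped by a build dependency.
    header = os.path.join(tmp, "dependency_header.pxd")

    touch(pyx, 1000)
    touch(header, 1000)
    touch(generated_c, 2000)  # generated after its inputs
    up_to_date = needs_rebuild(generated_c, [pyx, header])

    # Simulate the build dependency being reinstalled with a fresh mtime.
    touch(header, 3000)
    after_reinstall = needs_rebuild(generated_c, [pyx, header])
```

In this sketch the .pyx file never changes, yet the second check reports that a rebuild is needed because the header's mtime moved forward.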

uranusjr · February 9, 2020, 3:41pm

This is the part I don’t follow. It seems to me that pip copies metadata from the wheel to the installed files, so their mtime should be the same across each installation. Why would this trigger the mtime detection?
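
The behaviour being described can be sketched as follows. This is an assumption about the general technique (copy the file, then copy its timestamps), not pip's actual code; install_file and the file names are hypothetical.

```python
import os
import shutil
import tempfile


def install_file(srcfile, destfile):
    """Copy file contents, then carry over the source's atime and mtime."""
    shutil.copyfile(srcfile, destfile)
    st = os.stat(srcfile)
    os.utime(destfile, (st.st_atime, st.st_mtime))


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "unpacked_header.pxd")
    with open(src, "w") as f:
        f.write("# header contents\n")
    os.utime(src, (1000, 1000))

    first = os.path.join(tmp, "install_1.pxd")
    second = os.path.join(tmp, "install_2.pxd")
    install_file(src, first)
    install_file(src, second)
    mtime_first = os.path.getmtime(first)
    mtime_second = os.path.getmtime(second)

    # If the source file itself is re-created with a new mtime before
    # each install, the copied mtime changes with it.
    os.utime(src, (5000, 5000))
    third = os.path.join(tmp, "install_3.pxd")
    install_file(src, third)
    mtime_third = os.path.getmtime(third)
```

The preserved mtime is only as stable as the mtime of the source file: two installs from the same unpacked file agree, but an install from a freshly re-created source does not.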

ogrisel · February 9, 2020, 3:48pm

I’m not sure that srcfile has the right metadata here. Maybe each call to pip install -e . resets the mtime of srcfile. I don’t have time to investigate at the moment though.