Providing a way to specify how to run tests (and docs?)

Maybe there was a communication misunderstanding with the expression: “virtual environment accessible to the test code”.

Let’s say you install tox/nox in a virtual environment, e.g. /tmp/.venv.

When you run the tests, the code will not have access to any package installed inside /tmp/.venv.

The test code will have access to the packages installed in /<project-root>/.tox/<testenv-name>. You will not be able to import tox when writing a test case…

The direct test dependencies and the task runner in this case are two distinct types of dependencies, installed in two completely different virtual environments.

We can make an analogy with the build process:

A backend can be seen as a dependency of a project. The project will need the backend installed somewhere to be built. However, during runtime, the backend is irrelevant.

It is the same for the task runner and the tests. You need the task runner to orchestrate the test, but you don’t need it when the test code is running.

Since setuptools introduced setup.cfg people didn’t need anything fancier than ConfigParser to be able to read extras_require, but some people still decided to use requirements.txt. I have been using extras_require in setup.cfg to store test dependencies because I like the convenience of centralising them in a single file. But I agree with Paul that these dependencies are not really metadata, they don’t allow my packages to expose extra features to my end users[1], so it is perfectly natural that some developers choose to not mix those two things.

  1. Some people might argue that “being able to run the tests” is an extra feature for the end user, but that is debatable, especially considering that it is very common for a wheel to not include tests. ↩︎
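To make the ConfigParser point above concrete, here is a minimal sketch (the setup.cfg content is invented for illustration) of reading a test extra’s dependencies with nothing fancier than the stdlib:

```python
# Minimal sketch: reading extras_require from a (made-up) setup.cfg
# using only the stdlib ConfigParser.
import configparser

SETUP_CFG = """\
[options.extras_require]
test =
    pytest
    pytest-cov
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)
# Multi-line values come back as one newline-separated string.
test_deps = parser.get("options.extras_require", "test").strip().splitlines()
print(test_deps)  # ['pytest', 'pytest-cov']
```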

Just to be clear, is this mainly for making repackaging easier for Debian, Conda, Nix, etc.?

Another use case is if you want to run the tests of known downstream libraries as part of your CI to make sure you aren’t accidentally breaking compatibility.


This is a good point, and IMO it actually speaks to the same underlying distinction that @brettcannon alludes to above:

There are really two separate layers (i.e. concerns) here—the testing tool (and its invocation), which collects, runs and reports the results of the project’s test suite, and the task runner, which is responsible for setting up environments and executing arbitrary project-defined tasks within them (tests, docs, linting, etc).

Simply providing a standard means of declaring the dependencies of and invocation for the project’s testing tool (and possibly also the docs builder), which downstream tooling could be responsible for installing and calling, would seem adequate to meet most of the immediate need here while being relatively straightforward to implement—either following the form of @steve.dower 's suggestion (or paring it down further, just defining the invocation and using standardized extras names to single-source the dependencies for each).

This more or less follows the model of PEP 518 in defining the build entrypoint and the dependencies it needs (though in this case, it is a single hook rather than several), and makes setting up and invoking the callee in a Python environment with the indicated dependencies the responsibility of the caller, with the callee being responsible for the rest.
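For illustration only, such a table could mirror the shape of `[build-system]`; the `[test-system]` name and the backend key below are entirely hypothetical, and nothing like this is standardized:

```toml
# Hypothetical sketch only -- no such table exists in any standard.
[test-system]
requires = ["pytest>=7"]       # installed by the caller, like build-system.requires
test-backend = "some_module"   # hypothetical: object exposing the single invocation hook
```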

However, without clearly defined requirements and guidance, I could foresee two distinctly different usage patterns which could significantly complicate things for callers. The most obvious approach would be for projects to specify their test tool and test-specific deps as test requirements, and the test command as the invocation. Meanwhile, I could see others specifying simply their task runner as a dependency and its main/test entrypoint as the invocation, deferring the actual test environment creation and setup to it, outside the direct control of the caller.

Both provide value, but they solve somewhat different problems and target different layers in the stack, and I worry that unless we either define both, or are very clear about which one is expected, the result for callers will be worse than either, since the caller will have no reliable way of anticipating which they’re calling, much less requesting one or the other. I would advise either providing both, or making it very clear it is intended for the former, since that seems to be what downstreams want here. But I’d like to hear more from repackagers about that.

There’s another alternative—allow defining arbitrary tasks in a standardized format in pyproject.toml, with a handful of standardized names (test, docs, lint, etc), each with its own dependencies, invocation and perhaps other configuration—essentially a generalization of task runner configuration, like PEP 621 is for metadata (which would be nice in theory, but perhaps too limited in practice given the diversity of tools and approaches).

There appears to be some interest in standardizing this sort of thing, here and certainly elsewhere, as a modern, standardized replacement for the various built-in and custom distutils/setuptools commands, but the scope is much more expansive and I worry it would get bogged down in complexity. Still, it might be something to at least keep in mind in terms of leaving the door open to a future proposal when designing this one.

This is a good analogy, though arguably, a task runner is in some ways more analogous to a frontend, as it orchestrates the environment creation for the backend, the project’s actual testing system; and the two are only loosely coupled—the tests, like a build system, can be invoked with a modified frontend configuration, a different frontend or even directly, depending on the needs of the caller.

For example, upstream projects may use tox, nox, etc. during development, whereas repackagers may use tox with a custom plugin to skip dep management, their own env setup tooling or simply invoke the test tool directly, just like upstream developers typically use pip for installation but downstream distros really need something simpler and more customizable like installer.

Yes, but I think @brettcannon 's point is that it is an unstandardized bespoke format tied to one tool, which downstream tools cannot rely on being present or canonical, unlike pyproject.toml.

Just like any other packaging standard, nothing that is decided, specified and implemented as a result of this discussion would require you as a package author to adopt a certain method of specifying your test, etc. dependencies or invocation. Likewise, no one is required to fill out the various core metadata fields, specify declarative metadata in the PEP 621 [project] table, declare their build backend and dependencies in a PEP 518 build-system table, etc—or even use a task runner framework, write and run a test suite, or document how to build, install and run the code.

However, like providing and encouraging a standard, interoperable, tool-independent way of specifying test, docs, etc. invocation and dependencies, these things all help other people, tools and ecosystems use, distribute and contribute back to a given project, ultimately benefiting everybody. Of course, there is a cost-benefit trade-off, as it requires some amount of effort on the part of the package author, but I don’t see how at least offering and encouraging a standardized mechanism for this is such a bad thing.

That will depend on what you mean by being present or canonical… Let’s consider the following:

  1. If you decide to specify test requirements as the test extra, whoever is interested in running your test suite can create a virtual environment and install yourpkg[test]. This is accessible independently of backend or configuration file. It is not bespoke to a tool. The key point here is that the interface for consuming extras is stable, backend independent and standardised.

  2. If a tool just wants to “inspect” which dependencies are included in the test extra, the only reliable standardised way that covers all cases and all backends is still, even nowadays, to read the core metadata. That was possible before pyproject.toml came into the picture.
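As a sketch of that core-metadata route (the metadata content below is invented; real tools should use the packaging library to evaluate markers properly):

```python
# Sketch: extracting a "test" extra's deps from core metadata (PKG-INFO /
# METADATA), which is an RFC 822-style format readable with the email module.
from email.parser import Parser

PKG_INFO = """\
Metadata-Version: 2.1
Name: yourpkg
Version: 1.0
Provides-Extra: test
Requires-Dist: requests
Requires-Dist: pytest ; extra == 'test'
Requires-Dist: pytest-cov ; extra == 'test'
"""

msg = Parser().parsestr(PKG_INFO)
# Naive marker matching for illustration; use packaging.markers in real code.
test_deps = [
    line.split(";")[0].strip()
    for line in msg.get_all("Requires-Dist", [])
    if "extra == 'test'" in line
]
print(test_deps)  # ['pytest', 'pytest-cov']
```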

I don’t think the decision of using extras for test dependencies heavily depends on the configuration format… As Paul previously mentioned, some developers simply don’t like the idea at the conceptual level (and that is fair).

People that like the idea of using extras, could have been doing it since before pyproject.toml (I definitely was).

Repackagers and other downstreams (the main case we’re discussing here) aren’t going to want to create (and often, their policies would not allow creating) a separate virtual environment and use pip to install packages from the internet into it, just so they can introspect it to determine what gets installed and figure out how to replicate it in their own environment. Of course, they can build the package at least to an sdist and then extract the extra’s dependencies from the core metadata in the PKG-INFO, but that’s option 2.

Previously, yes, the only reliable standardized way to get an extra’s deps was to actually build the package and inspect it. But thanks in part to your work implementing PEP 621 support for Setuptools, tools now have a second way: reading them from project.optional-dependencies in pyproject.toml, which is much easier and cheaper, as they can be read statically from a file rather than having to build the whole project, unpack it and then parse them out of the appropriate RFC 822 headers.

Sure, fair, but I believe @brettcannon 's point was that it is now much more accessible and practical for downstream consumers, thanks in no small part to your hard work. :smile:

Thanks @CAM-Gerlach, all restrictions you mentioned exist and shape the workflow of repackagers/other downstreams. However, I don’t think they affect anything I was discussing before (specifically, the fact that the interface for consuming extras exists, is stable and is backend independent).

If repackagers/other downstreams want to use extra dependencies to run the tests, at that point in time they already need the package to be built (and thus have access to the core metadata, as in (2)). Moreover, doesn’t it also mean that they will have to obtain these test dependencies somehow and install them anyway? It doesn’t matter if the dependencies come from PyPI or if they use their own installer/repositories instead of pip.

Going back to your original point, tools cannot rely on project.optional-dependencies being present… Even if the chosen backend does support PEP 621, there is always the possibility that it is specified as dynamic. The only truly universal way of inspecting the extras is via core metadata.


Other stuff that you most likely also need to support for a minimal viable product before running the test:

  • Discovering the defined targets (and selecting the ones to run)
  • Altering the current working directory
  • Setting environment variables
  • Passing through environment variables (which ones to always pass and which ones to remove)

And then the not-must-have, but probably nice-to-have, concepts for a more robust/powerful usage:

  • per target temporary folder
  • setup/teardown commands
  • environment reuse between runs
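A rough caller-side sketch of the must-have mechanics above (the target table, the pass-through allow-list and the TARGET_TMPDIR variable are all invented for illustration):

```python
# Sketch: run one named target with a controlled cwd, an allow-listed
# environment and a per-target temporary folder.
import os
import subprocess
import sys
import tempfile

# Hypothetical target table; a real tool would discover these from config.
TARGETS = {"py-unit": [sys.executable, "-c", "print('running unit tests')"]}
PASS_ENV = {"PATH", "HOME", "LANG", "SYSTEMROOT"}  # host vars to pass through

def run_target(name, cwd="."):
    env = {k: v for k, v in os.environ.items() if k in PASS_ENV}
    with tempfile.TemporaryDirectory(prefix=name + "-") as tmp:
        env["TARGET_TMPDIR"] = tmp  # invented variable: per-target scratch space
        return subprocess.run(TARGETS[name], cwd=cwd, env=env).returncode

print(run_target("py-unit"))  # 0 on success
```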

Most Django projects do, because the Django version is included in the target name. Similarly, many people like having with-coverage and without-coverage variants (often with a -cov suffix). I’ve seen a few projects that also separate unit and integration tests into separate targets, so you have a quick test env and a slower but more robust one.

There’s a plan to add that. The interface hasn’t been groomed and implemented just yet, but it might be a reality next year. (PS: also, tox is always all lowercase.)

This is likely the easiest path ahead. It likely has to specify just the dependencies and default target(s) to call for OS repackagers. E.g. it could specify that for OS repackaging the style checks are not needed, so only the py target is called (and the target can be interpreted by the tool). Something like:

    requires = ["tox>=4"]
    test-target = ["py-unit", "py-integration"]

This does imply that we need test runners to only support a PEP-517 style API that can take the target list. Alternatively, we could make the interface CLI bound:

    requires = ["tox>=4"]
    target = ["tox", "-e", "py-unit", "py-integration"]

I prefer, though, the PEP 517-style interface, because we can then add a get_valid_targets endpoint that could return not just test targets but lint targets too.
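For illustration, such a PEP 517-style test interface might look roughly like this; every hook name here is hypothetical and nothing like it has been standardized:

```python
# Hypothetical "test backend" hooks, by analogy with PEP 517 build hooks.
def get_valid_targets():
    """Return the targets a caller may ask for, tests and otherwise."""
    return {"test": ["py-unit", "py-integration"], "lint": ["style"]}

def run_targets(targets, config_settings=None):
    """Run the given targets; return True only if all of them are known/passed."""
    # A real backend would dispatch to tox/nox/etc. here; this sketch only
    # validates the requested names against the declared targets.
    valid = get_valid_targets()
    known = valid["test"] + valid["lint"]
    return all(target in known for target in targets)

print(run_targets(["py-unit", "py-integration"]))  # True
```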


I just want to emphasize, again, that there are really two different layers in the stack intersecting here that need to be distinguished—the testing tool (pytest, unittest, etc.) and the task runner (tox, nox, etc.)—and if a careful distinction is not made, this proposed functionality may not be very usable for either most upstream developers or most downstream distros and repackagers.

And it’s worth keeping in mind that the primary motivation here—as it says on the tin—is to provide a standardized way to specify how to invoke the project’s test suite, particularly for downstreams that need to run the project’s tests in their own environment, not necessarily to standardize task runner configuration (which certainly has distinct value too).

This proposed syntax would seem to require the project to use a Python task runner (never mind one compatible with the new hooks it would need) and to work on the basis of targets, which already excludes the unfortunately large majority of Python projects.[1] This creates a much larger barrier to entry for projects to adopt it, particularly when many maintainers are already hesitant to bundle tests and expose the appropriate config, given it’s something that benefits downstream packagers, users and tools more than their own development.

Moreover, such an approach would seem to require additional complexity to actually solve the motivating problems for downstreams:

  • There would need to be a hook to invoke a task without (at least Python) environment isolation, in the current environment, so that downstreams could actually test the project as packaged for their distribution.
  • There would need to be sufficient standardization of at least the core tasks that downstreams need (tests, docs, etc) so they can be consistently programmatically invoked.
  • Both downstream tooling and task runners would need to implement support for these interfaces, and project authors adopt compatible versions (or switch task runner).

All of the latter are likely doable, but they add complexity to this approach, while only working for the relatively small fraction of upstream projects that use a Python task runner, rather than providing a generalizable solution.

This syntax is closer to what others were suggesting and would bypass many of the problems above (though it is still framed in terms of tasks and targets without any clear indication of what the target actually does, which may simply be a copy/paste oversight). However, it runs into the issue discussed above—there’s conflation between two levels of the stack, where either a testing tool or a task runner may be invoked here.

This has significant implications for what actually gets tested, and where, when it comes to downstreams, who want to test the installed package in their environment (and as others mention, it needs to be clearly defined that this should only run the project’s actual tests, not linting checks), so it’s unclear if this would actually be useful as-is without being more strictly defined.

That said, I’m far from the expert like you all and still giving it more thought myself about what specifically to propose; I like the flexibility, extensibility and DRYness of defining a more generalized task-based interface and configuration rather than specific ones for tests, docs, and whatever else is needed, but on the other hand, that dramatically increases the scope and scale of this effort well beyond the original motivation, while being less well-suited for or requiring additional complexity to actually fulfill such, and I fear it may result in the great being the enemy of the good.

  1. Based on the disappointing fact that only 7% of Python developers in the Python Developers Survey 2020 used Tox for any of their projects, compared to around 50% usage for Pytest and 30% usage for unittest (for reference, it was a multiple-selection question; 63% of participants used some form of testing, and no other test/task runners were mentioned, with “other” at 1%, presumably including Nox) ↩︎


It’s already the case with package build backend and frontends, so if anything we would just remain consistent and not deviate.

Task runners can/should provide a mode to run in the host environment; e.g. for tox, see GitHub - fedora-python/tox-current-env: tox plugin to run tests in current Python environment, which makes this a non-problem.

It would be fairly trivial to standardize those by defining your test and docs targets for the task runner:

    requires = ["tox>=4"]
    targets.test = ["py-unit", "py-integration"]
    targets.docs = ["docs"]

This would be fairly simple, though, on both ends. For task runners we only really need tox, nox and pyinvoke. There’s precedent here: adopting PEP 517 was fairly easy for flit/setuptools. They just had to expose what they already did under a new common API.

Hence why I don’t like it that much. If the user sets up a testing (or documentation) tool here, it can easily fail downstream or on another machine, because you never addressed all the other factors at play here:

I think it’s similarly important that whatever we come up with can live together with task runners and not cannibalize them. It would be a bad place to end up in where some of your test setup/teardown logic is in the tasks section and the rest in nox/tox configuration files (ini, toml or Python file).


Sorry for the delay!

IMO this would indeed be useful, but not that much, and I am not convinced it is worth the trouble for everyone.

I think distros like Fedora would benefit the most, so please take their feedback into account.

I don’t think anyone is finding it hard to understand that there are differing opinions. Personally, my questions are around understanding and not stating that you or anybody else is doing anything wrong or poorly.

But we are an association of people who exist to help standardize stuff to make it easier for things to work together. Now most of our standards are optional, and I don’t view this entire topic as any different, so no one is being told they have to do something. But I think we are discussing whether there is a pattern here of people wanting a way to write down how to run their test suite, maybe build their documentation. And if so can we agree on something that covers the 80% case for those folks where it makes sense?

Yes, I think this covers well where I’m coming from. There are plenty of tools that potentially want a way to execute your test suite directly and know the results without all the nice extras that nox/tox provide.

To be even more concrete, VS Code needs access to the command used to run pytest. Why? Because we need to have pytest tell us what tests there are to populate the test explorer. Right now we either have to hope the command is nothing more than pytest or that folks fill in VS Code-specific settings to tell us what flags to pass. We could try to read your tox.ini file, or your file, or some other bespoke way of specifying your tests, but not everyone uses those tools (goes back to Paul’s point about not requiring folks to follow a specific workflow). But if we can give a carrot for getting help to Linux distros like Fedora (which also help test CPython prior to every release), potentially make a single command specification work across tools, etc., then I think it’s worth having this discussion.

From a Fedora perspective, you’re right. But from a VS Code perspective where I may want to help you install your dev dependencies, expecting core metadata via PKG-INFO or METADATA isn’t reasonable to assume.


I completely agree with Bernát, perhaps in part due to both of us maintaining such a tool.

I think what most posters in this conversation are missing is that tox, hatch, nox, etc. should not be thought of as task runners but rather as environment managers, which is totally different and far more complex.

Through that lens, what seems to be happening here is distributions like Fedora & Conda want a universal way to map the config of such managers (since most projects use one) to their own build system’s format.

As such, I’m quite against standardization on this one. Perhaps tox and the like could offer a command that outputs the JSON config of the default or base environment that distributions could consume and translate to their liking.

Right, they both create environments (environment manager) and run user-configured tasks within those environments (task runner). Your comments raise the important point, which mine did not make clear, that it is primarily the former (not the latter) that makes them conceptually distinct from test runners, places them at a higher layer of the “stack” and makes them behave fundamentally differently.

And indeed, it seems to be this environment management functionality that is the main potential pitfall for the primary consumers of the metadata proposed here—distros, packaging ecosystems and other downstreams who want to test the package in their respective real-world environments, not the isolated one produced by the upstream’s particular tooling—unless there is a standardized way for consumers to determine that a test command will trigger it, and to signal that it should be bypassed.

I may certainly be wrong, but that’s not the impression I got from what downstream people and others have shared here and in many previous discussions, and that’s not how I would describe my needs in my (limited) role as a Conda-Forge package maintainer.

The primary thing package consumers seem to want is a standardized way of being able to programmatically extract the dependencies of and invoke the project’s test suite (and, to a somewhat lesser extent, build the project’s docs), as a modern, tool-independent replacement for the deprecated and to-be-removed test and build_sphinx/upload_docs commands, bespoke hacks or manual guesswork (see the comments on those issues for an example of some of those requests). Any further environment setup is the responsibility of the consuming tooling.

Given the challenges, it may simply not be practical to standardize this in a way that is useful to enough of the downstream ecosystem, for enough of the upstream ecosystem, to justify the effort and complexity. Even so, I don’t think we should flatly reject exploring possible approaches before having fully considered the problem space and potential solutions to the issues raised with the current proposals.

Could you share a source on that? As I cited above, per the official 2020 Python Developer survey, only 7% of Python developers used Tox in any of their projects, whereas nearly 70% used a testing framework of some kind.

My point is that we shouldn’t require upstream projects to adopt a whole environment management and task running tool just to be able to expose their test dependencies and invocation in a standardized, programmatically accessible way, or adoption is likely to be very limited, and mostly concentrated in the projects that are already (per @encukou) the least effort for downstreams to test.

This sounds to me like you want to standardize the test runners’ interface, not define generic task runners. How do you get from how to run the test suite to what tests would be run? E.g.:

    pytest \
      --cov "{envsitepackagesdir}/tox" \
      --cov-config "{toxinidir}/tox.ini" \
      --junitxml {toxworkdir}/junit.{envname}.xml

For pytest you can hardcode some flags in there, but what if the user uses another test runner? Or, more importantly, how do you know which flags are needed for running the test suite (e.g. coverage flags) vs which ones influence the test discovery (which is what you really care about, I feel)?

I think for task runners it’s critical to define not just how to run the task but what environment it should be run in. tox e.g. might use this information to create a runtime environment to run the task in, while downstream build systems (like Fedora/Debian) might use it to make sure their current environment setup satisfies it and fail hard otherwise. See for example this feature addition by the Fedora team to achieve this: Ability to disable provisioning · Issue #1921 · tox-dev/tox · GitHub
I don’t think it’s that downstream people don’t care about the environment the task needs to run in; rather, they’d use it as a check rather than a setup.

You’re comparing apples to oranges here. The survey wasn’t made among maintainers of projects that get repackaged downstream. I’m pretty sure that among projects that end up being repackaged by various distributions the number is more like 70% using tox, nox, hatch or escons, and 95% having test suites. E.g. the data science community (included in that survey) tends to not write tests due to the explorative nature of their tasks.

We can have a default task runner that basically implements your built-in assumption on how the environment is set up (pip install the project, and pip install -r requirements.txt if one is present; run the task with cwd set to the project root and inherit all env-vars from the host), but I truly believe that defining how to call targets without specifying what environment to run them in will have very limited benefit.


What you actually need, surely, is a way to ask the project “what tests are there?” After all, the project might not even be using pytest.

I’m not trying to be difficult, just trying to pin down the actual requirements (something I’ve been trying to get clarity on for ages here). It sounds like your real requirement is “have a way to get a machine readable list of tests from the project” rather than “know how to run the tests (without any requirements on what, if any, output is produced)”. This may well be different from the requirements redistributors have.

We have a lot of this sort of problem with pip as a build frontend, where the PEP 517 interface is fairly tightly specified, but doesn’t allow any sort of introspection of backend output for error or progress reporting, for instance. It sounds like it would be good not to repeat that mistake here.
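For what it’s worth, the stdlib can already produce that kind of machine-readable list for unittest-style suites; a self-contained sketch (the throwaway test file is invented here):

```python
# Sketch: enumerating tests as machine-readable IDs with stdlib unittest
# discovery, without running them.
import pathlib
import tempfile
import unittest

def iter_tests(suite):
    # Flatten the nested TestSuite structure into individual test cases.
    for item in suite:
        if isinstance(item, unittest.TestSuite):
            yield from iter_tests(item)
        else:
            yield item

with tempfile.TemporaryDirectory() as src:
    pathlib.Path(src, "test_example.py").write_text(
        "import unittest\n"
        "class TestThing(unittest.TestCase):\n"
        "    def test_works(self):\n"
        "        pass\n"
    )
    suite = unittest.TestLoader().discover(src)
    names = [t.id() for t in iter_tests(suite)]
print(names)
```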


I asked that already once and was swiftly shot down, so I’m explicitly not asking that.

Right now we run pytest for you, so we can inject the appropriate flags for our discovery (which will be a pytest plug-in when we rewrite that bit of code). See Testing Python in Visual Studio Code for the configuration, but we have you specify what to add on after pytest, not the full command to run.

Then VS Code won’t support it. Not much I can do about that without an API and, as I said, I got shot down for that idea already. Plus, let’s be honest, I can support the vast majority of people bothering with testing by setting up pytest, and a bit more with unittest. But basically it’s my problem to worry about that detail.

Because most people don’t set up their tests to also run coverage with us. My bet is most people make coverage a separate task for faster test execution in the REPL developer loop they are typically in and not something they run quickly and frequently.

As mentioned above, pytest didn’t like that idea.

To be frank, there are not enough folks who fall outside of pytest and unittest for me to worry about that case by trying harder to get a common test API. Only so much time in the day. :wink:

I would say that’s ideal, but not a requirement because …

… I can at least work with this scenario. Pragmatically, we could check the command for pytest and then do the right thing in our case (same goes for unittest).

Considering pytest supports unittest, why not always just try assuming pytest and bail out otherwise? :thinking: If those are the two main types you want to support and nothing else.

“Almost all unittest features are supported”. The only limitation I’ve run into in practice was subtests. unittest.TestCase Support — pytest documentation