Structured, Exchangeable lock file format (requirements.txt 2.0?)

Yes, it would. But it could always do that as well, by simply recording a list of hashes that were used; it just needs to be done per platform.

To summarize what I would like to see in a lock file (a rough sketch follows the list):

  • For each artifact:
    – url(s) so it is possible to fetch
    – name of the artifact (basename of the url); that allows us to identify the type of artifact, and to provide the artifact ourselves when we are not able to fetch it
    – hash
  • a list of the hashes that were actually used when generating the lock file, so reproducibility can be achieved.
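To make that concrete, here is a rough sketch of the shape I have in mind, written as Python data purely for illustration; every field name and value is a placeholder, not a proposal:

```python
# Illustrative only: one locked dependency, with enough information to fetch
# (or substitute) each artifact without extra API calls.
lock_entry = {
    "name": "requests",        # project name (placeholder example)
    "version": "2.22.0",
    "artifacts": [
        {
            # url(s) so it is possible to fetch the artifact
            "urls": ["https://files.pythonhosted.org/packages/.../requests-2.22.0-py2.py3-none-any.whl"],
            # basename of the url: identifies the artifact type (wheel vs sdist)
            # and lets us provide the file ourselves when it cannot be fetched
            "filename": "requests-2.22.0-py2.py3-none-any.whl",
            "hash": "sha256:<placeholder>",
        },
    ],
}

# Hashes that were actually used when the lock file was generated, recorded
# per platform, so the exact same artifacts can be picked again later.
used_hashes = {
    "cp38-manylinux1_x86_64": ["sha256:<placeholder>"],
}
```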

Now, there is still the issue of platform. The hashes that were actually used are those for a specific platform (and Python version, but since we're considering applications rather than development environments, the Python version is locked anyway and so is not relevant). What now, if we want to support multiple platforms? We could have a list per platform, and lock per platform. That means multiple runs, and the risk of unnecessarily ending up with different versions of dependencies across platforms due to a change of state in between. This is a big issue that has also been blocking us from further automating our package set creation in Nixpkgs.

And this is probably the paradigm mismatch that brews all the disagreement. Python aims to be cross-platform, so each Python package theoretically should run anywhere the interpreter runs. It's not possible to lock per platform, since the list of platforms is effectively unbounded in practice. Python packaging therefore does it the other way around: wheels are tagged by their features (what ABI they use, what platform API they expect, etc.), and a platform chooses the best one for itself. This is entirely different from how system packaging works, where you define a platform first, and then tag packages to show compatibility.

There is no right or wrong here, they are just different ways to position a package manager. But Python packaging does not aim to be system-centric, so a lock format designed mainly to serve Python packaging would be difficult to fit into a system-centric packaging situation. At least not like this.


I opened https://github.com/pypa/warehouse/issues/5947 specifically to help standardize on the name for things like this. If we could do that, then Nix could get a convention for their desire to have a link from projects back to their repositories.

Recently a tool called poetry2nix was developed to use a Poetry project with the Nix package manager. Its basic use is extremely simple: just point it at your source folder that has a pyproject.toml and the lock file, run nix-build, and your project along with its dependencies will be built.

Now, there is some room for improvement though. For example, the way artifacts are chosen (remember, I wrote that for a certain dependency at a certain version there can be multiple valid artifacts) could be made apparent to the user, so it is clear what was selected.

A more pressing issue is actually that the URLs are not recorded. What is done now is to simply take the project name and version number, and then use PyPI's predictable URLs to fetch. Unfortunately, they are not entirely predictable; specifically, the names in them are not normalized (https://github.com/pypa/warehouse/issues/2537).
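To illustrate, here is a sketch of that URL-guessing approach (my own reconstruction, not poetry2nix's actual code), and why un-normalized names break it:

```python
# Illustrative only: the "predictable" source-URL pattern some tools guess.
def guess_sdist_url(name, version):
    # The filename embeds the name exactly as the project uploaded it, which
    # is not necessarily the normalized (lowercased, dash-folded) name, so
    # guessing from the normalized name can simply 404 (see warehouse#2537).
    return (
        "https://files.pythonhosted.org/packages/source/"
        f"{name[0]}/{name}/{name}-{version}.tar.gz"
    )

guess_sdist_url("django", "3.0.3")  # may fail: the sdist is named "Django-3.0.3.tar.gz"
guess_sdist_url("Django", "3.0.3")  # works only if you know the un-normalized name
```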

Having predictable URLs, or recording URLs, is I think a must for a lock file. API calls cannot always be made; Nix cannot make these calls because they are impure and affect reproducibility. (Yes, I am repeating myself here :slight_smile:)

Any chance you might return to this? Hope the fires didn’t last all year. :wink:

A couple of us most definitely want to see this happen, so it has not fallen off of everyone’s radar (I for one will make sure this gets resolved eventually).


Ok, so my $.02 about some high-level conceptualizations. (Amongst other things, I make PursuedPyBear, and we’re looking at briefcase to enable the creation of end-user distributables. I also have made plenty of services and created many deployment pipelines.)

I’m sorry this is so long. I’m not sure there’s a good tl;dr.

I’m going to be summarizing/restating a lot of stuff, partly to set up my conceptualization, partly so that if my assumptions are wrong I can be refuted at that level, and partly so that we’re all on the same page.

There are a lot of tools and a lot of uses for the dependency-locked environment workflow/concept/etc. The goal of this thread/effort is to standardize the format of the lockfile (and intrinsically some of the concepts) to improve interoperability–"pipenv creates a lockfile that heroku uses to deploy" kind of stuff. It would also be good if a set of libraries to work with this were created, so that new tools and workflows can be built and meaningful innovation can happen. (It's much easier to make new kinds of bikes if you don't have to re-engineer the bolt every time.)

I’m going to define some terms, and they’re going to wildly conflict with already-overloaded terms. Hard problems and all.

Packages are things that can be installed into an actualized environment. They have versions, dependencies (both hard and optional), artifacts used in the actual installation, restrictions on when those artifacts can be used, etc. Note that different artifacts can have different lists of dependencies. (This maps to how sdists and wheels work. "Installation restrictions" just means stuff like platform, Python version, Python ABI, etc.) Note that for this discussion I only care about runtime requirements, not build requirements.

Conceptual environments are ideas like “production” or “debugging” or “building” or “formatting”, or briefcase’s “mac”, “windows”, “android”. They have a specification and lock data.

A specification is a list of “top-level” packages that a human specifies, with version requirements. This is where a person says “Give me flask, sqlalchemy with postgresql support, and flask-redis of version 4.3.*”.

Lock data is computed given a specification and package data. It describes the total set of packages that need to be installed to satisfy a specification: All the packages specified are installed, including dependencies recursively, and nothing has unmet dependencies.

(Note that lock data is only valid for a given artifact restriction context: because artifacts have installation restrictions, and different artifacts of the same package/version can have different dependency lists, the lock solution is only valid for a given restriction context.)

Conceptual environments can be actualized with the assistance of a tool. To be extra clear, actualization is turning the lock data into something that actually exists on the system and is usable. I'm deliberately being vague about it because pipenv and briefcase have pretty different ideas, but it generally means something to the effect of "download, unpack, and possibly build all packages in the lock data".

A few times I refer to a project. This is the set of source code, metadata, etc. that does something. Some projects produce packages. Some produce end-user deliverables. Some are deployed to a server as a network service. I think I generally say things like "package produced by the project" to mean "this project, but interpreted as a package". I don't mean to imply that you must build this project into artifacts to do anything useful with it.
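To make this vocabulary concrete, here is a minimal sketch of the concepts as Python dataclasses; all names and fields are mine and purely illustrative, not a proposed schema:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Artifact:
    filename: str            # e.g. a wheel or sdist name
    urls: list[str]
    hash: str
    restrictions: str        # installation restrictions: platform, Python version, ABI, ...
    dependencies: list[str]  # may differ between artifacts of the same package/version

@dataclass
class Package:
    name: str
    version: str
    artifacts: list[Artifact]

@dataclass
class ConceptualEnvironment:
    name: str                 # "production", "debugging", "mac", ...
    specification: list[str]  # top-level, human-written requirements
    # Lock data is only valid for a given restriction context (platform etc.),
    # so here it is keyed by that context.
    lock_data: dict[str, list[Package]] = field(default_factory=dict)
```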

So, how do I see these concepts mapping to existing technologies?

Actualized environments would be things like venv, conda, containers, or who knows what else. Most of the existing stuff is about development and deployment environments, but these can also be actualized into portable(ish) environments (buildpacks, briefcase, or my own poetry2container).

Specifications are pretty varied–things like requirements-formatted data, setuptools specifications (setup.py or setup.cfg), poetry, pipenv, briefcase, tox I think, and probably more. There exist PEPs to specify some common syntaxes, but there’s still variation. Also note that a specification can be composited from multiple sources–something poetry-like might look at both the installation requirements of the package produced by a project as well as some variant of “development” requirements.

I mentioned package data when talking about producing the lock data. This would be the total set of available packages and their metadata, usually from a package registry. For Python+PyPI (I don't know Conda), some of the necessary data is available from the PyPI index API, but some of it can only be found inside artifacts. (Good news: both sdists and wheels provide all the necessary metadata without having to actually build/install anything, so you could produce lock data for py3.8+windows from py3.6+linux. Bad news: you'll have to download all the artifacts of interest even if you never use them.)
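As a sketch of that "metadata without building" point: a wheel's dependency list can be read straight out of the archive with only the standard library (the filename below is just an example):

```python
# Illustrative sketch: reading Requires-Dist out of a wheel without installing it.
import zipfile
from email.parser import Parser

def wheel_requires(path):
    with zipfile.ZipFile(path) as wheel:
        metadata_name = next(
            name for name in wheel.namelist() if name.endswith(".dist-info/METADATA")
        )
        metadata = Parser().parsestr(wheel.read(metadata_name).decode())
    # Requires-Dist entries can carry environment markers (platform, Python
    # version, extras), which is why the resulting lock data still depends on
    # a restriction context.
    return metadata.get_all("Requires-Dist") or []

# e.g. wheel_requires("requests-2.22.0-py2.py3-none-any.whl")
```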

The big thing this thread is trying to nail down is the lock data schema and the format in which it is serialized to disk.

A conceptual environment doesn’t really exist–it’s an idea, created from and providing context for specifications and lock data. Briefcase might define one for each distribution target. Poetry might have prod and dev environments for the current system.

Note that it’s the restriction context (platform) that some current tools fall over on; for example, pipenv has (had) bugs to the effect of “projects locked on linux fail to install on mac”. Lock data only makes sense given that context, because you need that context to select dependencies. And because Python is cross-platform and lock data is shared among all developers in a project (even if they’re on different platforms), any solution needs to account for this.
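For example, using the packaging library, the same requirement can apply on one platform and not another, which is exactly why the restriction context matters:

```python
# Sketch of why lock data needs a restriction context: environment markers
# make a dependency apply on one platform but not another.
from packaging.requirements import Requirement

req = Requirement('colorama>=0.4; sys_platform == "win32"')

linux_env = {"sys_platform": "linux", "python_version": "3.8"}
windows_env = {"sys_platform": "win32", "python_version": "3.8"}

print(req.marker.evaluate(environment=linux_env))    # False: not needed on Linux
print(req.marker.evaluate(environment=windows_env))  # True: must be locked for Windows
```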

Note that in this, I’m not defining that actualized environments are mutable or immutable, or that lock data describes the totality of what is installed in an environment. I think these are tool-specific decisions best left to the tool. (Although maybe providing utilities to make this easier would be appreciated.)

Universal, however, is the flow of specification -> lock data -> actualization.

Ok, so how would a tool use this workflow?

Ok, let’s take xot. It’s basically tox, but I get to define how it works so I don’t conflict with how the real tox actually works. xot is a testing (task) framework that lets you run validation things in a variety of configurations.

We'll map those configurations to environments. Some have unique specifications (like a lint task) and some have shared specifications (like a testsuite matrix), but each one has a specification, which implicitly includes the dependencies of the package this project produces. Note that conceptual environments include the artifact restriction context (platform), so we'll have to account for that. Let's say that xot does the work and produces lock data for the major platforms (linux/mac/windows) for each of its configurations. So each configuration would map to 3x conceptual environments (only some of which are actualized on any given system).

So the workflow is such: The developer writes a file describing the configurations and the commands to run under each configuration. The developer then runs something like xot lock, which resolves the specifications defined in the xot file into lock data for the platforms they care about. (xot may also choose to actualize the environments proactively, so they're ready to use.) They can then use xot go to actually run the validation stuffz. It'll use the environments (actualizing from the lock data if needed) to run the tasks from the xot file and report the results to the user.

Both the xot file and the xot lock data would be added to VCS and shared amongst developers.
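As a purely hypothetical sketch, xot's locking step might boil down to producing one lock result per (configuration, platform) pair; everything below, including the resolver stub, is invented for illustration:

```python
# Hypothetical: each configuration maps to several conceptual environments,
# one per platform, and each gets its own lock data.
PLATFORMS = ["linux", "mac", "windows"]

def resolve(specification, platform):
    # Stand-in for a real resolver; would return the full locked package set
    # that is valid for this platform (the restriction context).
    return {"specification": specification, "platform": platform}

def lock_all(configurations):
    """configurations maps a name like "lint" or "tests" to its specification."""
    lock_data = {}
    for config_name, specification in configurations.items():
        for platform in PLATFORMS:
            lock_data[(config_name, platform)] = resolve(specification, platform)
    return lock_data

# Both the configurations and the resulting lock data would live in VCS.
lock_data = lock_all({"lint": ["flake8"], "tests": ["pytest", "."]})
```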

I think this was a bit more than $.02, but that was a lot of ground to cover. Hopefully, I actually contributed something.

That’s already being done across the entire packaging ecosystem.

Potentially standardizing this is being discussed elsewhere ATM to see if consensus can be reached.


For those interested, I wrote down what I currently have in mind for the lock file, including things I learned from Pipfile.lock, poetry.lock, and various thoughts from this thread:

I'm quite sure this is not perfect, I may have missed something (even from this thread), and there are still holes in the proposal that need to be plugged, but I'm quite happy with the general structure of things I've come up with. Hopefully this can be the basis of a PEP if you (everyone in this thread and more!) could correct me and extend the idea :pray:

From an initial reading, this looks pretty reasonable.

I am not particularly in favor of having a standard lock file, even though I understand why people might want it.

To me, the lock file is something that should be tool-specific. The reason is that each tool has different needs, which depend on its purpose. An example of that is Poetry: Poetry needs to lock for any platform and any Python version, so various versions of the same package might be present in the lock file; it's at installation time that Poetry decides which version should be installed, updated or removed (by resolving using the lock file and what is currently installed). However, a tool like pip (or Pipenv for that matter) does not necessarily need that, because the locked packages are only guaranteed for a specific system.
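To sketch that install-time selection (illustrative structure only, not Poetry's actual lock format): assume the lock data can contain several versions of the same hypothetical package, each guarded by an environment marker, and the installer picks the applicable ones for the current environment.

```python
from packaging.markers import Marker

# Two locked versions of the same (made-up) package; which one applies
# depends on the environment doing the installation.
locked = [
    {"name": "somepkg", "version": "1.5.0", "marker": Marker('python_version < "3.8"')},
    {"name": "somepkg", "version": "2.0.0", "marker": Marker('python_version >= "3.8"')},
]

def select(entries, environment):
    # The installer evaluates each marker against the current environment
    # and installs only the entries that apply.
    return [e for e in entries if e["marker"].evaluate(environment=environment)]

print(select(locked, {"python_version": "3.6"}))  # -> the 1.5.0 entry
print(select(locked, {"python_version": "3.9"}))  # -> the 2.0.0 entry
```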

So, unless we want to standardize a lock file that is system-agnostic (which could be problematic due to the need for dependency resolution before installing), I think we should not try to do this, since coming to an agreement might prove difficult.

What’s important at the moment is the standardization of the project metadata, which is currently in progress, rather than the standardization of the lock file, which, in my opinion, should be an implementation detail.


I’ll just second this sentiment. Spack has lockfiles – they contain what we call “concrete” specs for every package installed in an environment. That includes version, compiler, compiler flags, build configuration, dependency information, the OS and microarchitecture we built for, and specifics about how each spec was resolved, for every package in the environment. What might be one package to pip is many different builds and configurations to Spack. See slides 2 and 7 in this FOSDEM presentation.

So, it’s not clear how we can agree on a format for this when different package managers have widely different package and dependency models – by design.

On the other end of the spectrum, I am strongly in favor of a standard lock file, but maybe not for a reason that has already been mentioned here (although I haven’t read this entire thread in detail).

I’m in favor of it because the proliferation of many tool-specific lock files is a burden on platforms and tools that offer Python runtimes. For example, if you’re a platform like Heroku, or a runtime builder like buildpacks, every time a new tool and its corresponding method for specifying dependencies comes along, you get users asking for bespoke support for their tool.

For these providers, when comparing Python to some other languages, the number of tools they're being asked to support here looks insane, and the effect is that none of them get supported aside from requirements.txt.


Probably should have gone back and read this thread in detail before posting. Someone has already mentioned this, and it was me. :man_facepalming:

So wouldn't the solution be to settle on a single tool (two at most) to manage Python projects, as is the case for other languages? Because, for me, this is where the difference comes from: almost all languages have their tool of choice.

  • Rust: Cargo
  • Javascript: NPM/Yarn
  • PHP: Composer
  • Ruby: Bundler (soon possibly gel)
  • Dart: Pub
  • Elixir: Mix
  • Julia: Pkg

Those are just the ones I have off the top of my head.

Python is the exception here and I think we are taking the problem in the wrong direction.


In an ideal world, yes, but I think we're already past the point of no return here. The number of users that each individual tool has is non-trivial. I think it's far more likely that we'll find a way to standardize a lock file than to agree on a single tool…

I suspect that standardising a lock file format would be easier/quicker than convincing everyone to use a single tool here. (I presume you’d be less comfortable with your own proposal if I told you that we were going to settle on pipenv? :wink:)

But joking aside, yes you do have a point that Python has a somewhat-unique problem in not having a single project management tool of choice. (Although not completely unique - Java and C# among others have no universal tool either as far as I know).

Having various tools have their own native lockfile format, but a mechanism for generating a standard format for deployment, seems like the standard sort of solution for this type of problem. Why would it be unacceptable here? (Tools that wanted to avoid the “generate standard format” step could support the standard natively, but there’s no requirement to do so).


And yet, I think people can be willing to move if there is enough incentive to do so. I have seen it first hand with people moving from pipenv to Poetry. And if we tell them that it's the path forward, this might ease the transition. And yes, I must admit it's not trivial at all, given the various workflows people have today because of the lack of real consensus up until now, but that's something that's worth the effort.

And we somewhat have one in the form of requirements.txt, don’t we?

Introducing yet another format can lead to even more fragmentation than we already have and it would not solve anything.

I feel we (as the two camps) are talking past each other here. It is most definitely not my intention to force all package managers related to Python to use the same lock file format. Spack (for example) definitely should not use the exact same lock file format as pip; this would be a terrible thing to do in a lot of ways. But that's not the proposal here. This is probably my fault; I've been calling the idea a "lock file" (and even named the repo as such), and that likely makes people start on the wrong track right from the beginning. (I'm avoiding the term in this post from now on; hopefully this helps avoid the wrong preconception.)

To me, a unified dependency description format solves the too-many-formats problem from a different direction. Instead of starting from a package manager and looking at what the format should do, it starts from the common scenario of "a project runs on Python and needs to install third-party packages into site-packages" and works its way back to include all the information required to make this viable. This lines up with the problem @dustin is trying to address: all of the package managers can solve this problem, and they all solve it the same way (by installing stuff into site-packages), but they all describe that common solution in vastly different ways. This is also the reason why the proposed structure is very flexible, with open fields everywhere; package managers can add whatever they want in their lock files to make things work, but as long as the project is still within the "install stuff into site-packages" boundary, that part can be fulfilled by another package manager. (And if it goes out of bounds, you'll need specialised tools anyway.)

I think Python is "hurt" by its incredible interoperability here. All the tools listed here are language-specific package managers, but it is incredibly common for people to build a project that leverages components that are not entirely Python. In an ideal world we could all use one package manager everywhere, but it would take incredible design flexibility and development power to even sniff at that goal. So in practice tools generally do certain things better by trading away the ability to do other things cleanly (or at all). The result is that groups of people favour tools that are good at different things. Which is still a good thing (since we are unable to build one thing to fit all needs), but that doesn't mean we can't find common ground on the common things.


I could get behind a lockfile format with this scope, as at least it would tell me what I need to know to manage things in site-packages. In Spack, we would probably use this to establish constraints and conflicts for pure Python packages that we might want to link into a single environment. A spec would certainly help.

I start to worry when we start talking about native components, as this is exactly where we’re trying to provide way more metadata, and where our install model is quite different from wheels. But, if it accomplishes the goal of standardizing all the info I need to reproduce a python/wheel deployment, a standard would be useful because I could at least read the existing format and understand it.

So basically Spack wouldn’t write these lockfiles but maybe we could get some benefit from reading them and including the result in some superset of what is standardized in Python.