PEP 621: how to specify dependencies?

pf_moore · July 19, 2020, 9:34am

Note that on Windows (which this example seems to be targeted at) single quotes don’t work in cmd, and PowerShell’s handling of them is somewhat weird. Trying the second example with Cygwin’s echo command in PowerShell Core:

>D:\Utils\Cygwin64\bin\echo pip install 'win32{version=">= 1.0",extras=["all"], environment="os_name == \"nt\""}'
pip install win32{version=>= 1.0,extras=[all], environment=os_name == \nt"}

(And calling echo via another “wrapper” that I typically use to run Cygwin commands, further mangles the quoting!)

In general, I take the view that any command line syntax that needs quotes to be passed to the underlying executable is almost certainly going to create huge pain for at least some group of users.
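The mangling above happens in the shell layer, before the executable ever sees its arguments. As a minimal sketch (the requirement string is made up for illustration), passing the argument as a list from Python involves no shell at all, so the quotes arrive intact:

```python
import subprocess
import sys

# Hypothetical requirement string containing spaces and double quotes.
spec = 'win32>=1.0; os_name == "nt"'

# Child process that simply echoes back its first argument.
child = [sys.executable, "-c", "import sys; print(sys.argv[1])", spec]

# A list of arguments without shell=True bypasses cmd/PowerShell/bash quoting rules.
result = subprocess.run(child, capture_output=True, text=True, check=True)
received = result.stdout.strip()

print(received == spec)
```

This does not help anyone typing the command by hand, which is the case the post is concerned with; it only shows where the quoting problem lives.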

ofek · July 19, 2020, 3:33pm

I think we are all blocked here without a concrete answer.

Yes, as a Windows user, that causes much pain.

pganssle · July 20, 2020, 3:06pm

My perspective is that I remember being new to Python and I never had problems with the way dependencies are specified. I also remember being new to cargo and I still have trouble writing anything even mildly complicated, because I can’t really remember the names of the keys.

I also think you underestimate how complicated it is to “know TOML” compared to knowing PEP 508. If we’re going with the criticism that “careless typos that look like they should work are evidence that it’s hard to write a specifier”, my original example of “TOML is more complicated than it looks” actually has a “gotcha” example in it that I didn’t intend. If I re-factor it to be a single line so it’s “valid”:

name = { url = "https://foo", extras = [fred, bar], python = "2.7"}

Notice that I’ve put extras=[fred, bar] and not extras=["fred", "bar"]. I also imagine that some people will think that the right syntax is extras="fred, bar" or extras = "[fred, bar]", by analogy to PEP 508.

I will concede that one thing that using TOML has going for it is that if you are first coming across a somewhat complicated string specified in TOML, you will probably be able to google what the meanings of the unfamiliar fields are (assuming there are good documents out there explaining it).

That said, it’s only a benefit when these things show up in TOML files (we still need PEP 508 in a number of other situations), so you will have to learn both things, and we could fairly easily write a document that breaks down the components of a PEP 508 string that would be decently google-able. We’ll need to do that anyway for the TOML syntax (since people will need to be able to look up what goes in the keys), and if it’s a good idea as a companion to PEP 621 it’s even more of a good idea to do it for the PEP 508 syntax because of all the places you’ll see these dependencies show up unrelated to PEP 621.
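To give an idea of what such a breakdown could look like, here is a deliberately simplified sketch that splits a PEP 508 string into its named parts. The regular expression and the `explain` helper are illustrative assumptions, not the real grammar: URL requirements and validation are ignored, and real tools should rely on a full parser such as the `packaging` library.

```python
import re

# Simplified: handles "name[extras] specifiers; marker" only.
_PATTERN = re.compile(
    r"""
    ^\s*(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)   # distribution name
    \s*(?:\[(?P<extras>[^\]]*)\])?             # optional [extra1, extra2]
    \s*(?P<specifier>[^;]*?)                   # optional version specifiers
    \s*(?:;\s*(?P<marker>.+?))?\s*$            # optional environment marker
    """,
    re.VERBOSE,
)


def explain(requirement: str) -> dict:
    """Return the components of a simple PEP 508 string under descriptive keys."""
    match = _PATTERN.match(requirement)
    if match is None:
        raise ValueError(f"not a simple PEP 508 string: {requirement!r}")
    extras = match["extras"] or ""
    return {
        "name": match["name"],
        "extras": [e.strip() for e in extras.split(",") if e.strip()],
        "specifier": match["specifier"].strip(),
        "marker": match["marker"],
    }


parts = explain('docker[ssh] >= 4.2.2, < 5; python_version < "3.5"')
for key, value in parts.items():
    print(f"{key}: {value!r}")
```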

pganssle · July 20, 2020, 3:14pm

Unfortunately, I think this is a situation where compromise is worse than either outcome individually. It virtually guarantees that people will need to learn and regularly use two formats, and they won’t understand when you can use one and not the other (we definitely won’t be able to get universal adoption of the TOML format in all tools that deal with dependencies). It also means anyone working with the “either one works” format will need to have two separate dependency parsers lying around.

I also don’t like the compromise format proposed above, because it has none of the advantages of PEP 508 (everyone already knows how to use it, parsers for it exist already, easily copy-pasted) but essentially none of the advantages of using a TOML-based table (named keys that make on-boarding with the format easier, you can use a TOML parser to extract most of the syntactic information). I think we should either stick with PEP 508 or use a rich format.

ofek · July 21, 2020, 11:38pm

In case anyone missed it, here’s a decent comparison of the extant formats https://github.com/uranusjr/packaging-metadata-comparisons/blob/master/topics/dependency-entries.md

Out of curiosity, what’s the usual process for deciding on things which do not have consensus?

Here are my final thoughts:

I implemented the Flit/setuptools/legacy, Poetry, and Nick compromise formats in my Hatch rewrite over the weekend. Generally, the legacy way of defining a sequence of PEP 508 specifiers seems easiest for beginners, and is the most concise because it lives right under [project] like the majority of other fields. I therefore agree with @pganssle that we should use that way. The only downside I see (which I will have to work around somehow) is the lack of a standard way to define local editable installations. This is where the expanded table shines: tools can add arbitrary non-standardized fields for more features.
@brettcannon Please please please do not take this out of the PEP! I get the sense that this discussion could potentially go on for years and that would not be good for the community. Anything is better than uncertainty.

xafer · July 22, 2020, 7:46am

FWIW I’m also in favor of sticking with the existing PEP 508 format and not adding a second one.

In 90% (made up statistics) of cases, users will simply use the my-pkg==1.2.3 construct, which is quite comprehensible and compact vs the alternative (if I’m not mistaken) my-pkg = { version = "==1.2.3"} (or maybe my-pkg = { version = "1.2.3"} would work?) which IMHO does not add much in readability.

In other cases, new users are likely to refer themselves to some documentation in both cases so I’m not sure what a second format would add.

pf_moore · July 22, 2020, 8:35am

I’ve just done a very quick skim of the discussion, and based on a headcount of who prefers what, I see a fairly strong preference for PEP 508 style.

There are a couple of people expressing a strong preference for an “expanded” TOML-style format, and people noting the explicit choice of pipenv and poetry to take that route, but based on pure numbers, PEP 508 feels like it’s the preferred option. Preference for PEP 508 seems milder but wider-spread.

I’m not trying to push for a “decision based on numbers” here, but I thought it was worth a review as we seem to be reaching a point where we’re just revisiting the same ground, and at some point we need to accept that we have to make a decision, even if not everyone agrees with it.

PS If anyone wants to fact-check me, please do so. I did only skim the thread quickly, so I may have missed nuances or cases where people changed their minds (although I tried to cater for that by basing my count on later posts ahead of earlier ones).

sdispater · July 22, 2020, 11:02am

Getting input only from people who are already involved in Python packaging is not the right approach, I think.

Gathering input from real users would be more useful.

PEP-508 always felt like a good format for machines (barely) but never for humans, really. It’s awkward to write, awkward to parse programmatically and, honestly, awkward to read at first glance. Yes, when you are accustomed to it it becomes easier but that’s a low bar for user experience.

Regarding the copy-paste to CLI: I think it’s bad practice and should not be encouraged, so, to me, this is not a strong argument for PEP-508 strings. Having a different DSL for the CLI is not a problem and should be expected for complex cases. For instance, here are the accepted formats in Poetry:

- A single name (requests)
- A name and a constraint (requests@^2.23.0)
- A git url (git+https://github.com/python-poetry/poetry.git)
- A git url with a revision (git+https://github.com/python-poetry/poetry.git#develop)
- A file path (../my-package/my-package.whl)
- A directory (../my-package/)
- A url (https://example.com/packages/my-package-0.1.0.tar.gz)

Also, I would like to reiterate that readability is important, which, in my opinion, PEP-508 does not provide especially if you start having a reasonable number of dependencies.

I’ll take Poetry’s own dependencies as an example:

cachecontrol = { version = "^0.12.4", extras = ["filecache"] }
cachy = "^0.3.0"
cleo = "^0.8.1"
clikit = "^0.6.2"
crashtest = { version = "^0.3.0", python = "^3.6" }
functools32 = { version = "^3.2.3", python = "~2.7" }
futures = { version = "^3.3.0", python = "~2.7" }
glob2 = { version = "^0.6", python = "~2.7" }
html5lib = "^1.0"
importlib-metadata = {version = "^1.6.0", python = "<3.8"}
keyring = [
    { version = "^18.0.1", python = "~2.7" },
    { version = "^20.0.1", python = "~3.5" },
    { version = "^21.2.0", python = "^3.6" }
]
packaging = "^20.4"
pathlib2 = { version = "^2.3", python = "~2.7" }
pexpect = "^4.7.0"
pkginfo = "^1.4"
poetry-core = "^1.0.0a8"
requests = "^2.18"
requests-toolbelt = "^0.8.0"
shellingham = "^1.1"
subprocess32 = { version = "^3.5", python = "~2.7" }
tomlkit = "^0.5.11"
typing = { version = "^3.6", python = "~2.7" }
virtualenv = { version = "^20.0.26" }

which would translate to:

cachecontrol[filecache]~=0.12.4
cachy~=0.3.0
cleo~=0.8.1
clikit~=0.6.2
crashtest~=0.3.0; python_version ~= "3.6"
functools32~=3.2.3; python_version ~= "2.7"
futures~=3.3.0; python_version ~= "2.7"
glob2~=0.6.0; python_version ~= "2.7"
html5lib~=1.0.0
importlib-metadata~=1.6.0; python_version < "3.8"
keyring~=18.0.1; python_version ~= "2.7"
keyring~=20.0.1; python_version ~= "3.5.0"
keyring~=21.2.0; python_version ~= "3.6"
packaging~=20.4
pathlib2~=2.3.0; python_version ~= "2.7"
pexpect~=4.7
pkginfo~=1.4
poetry-core~=1.0.0a8
requests~=2.18
requests-toolbelt~=0.8.0
shellingham~=1.1
subprocess32~=3.5; python_version ~= "2.7"
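The mapping from `^` to `~=` used in this list is approximate. As a sketch, assuming Poetry's documented caret semantics (the left-most non-zero component must not change), a caret constraint can be expanded into an explicit range; the `caret_to_pep440` helper is hypothetical and only handles purely numeric versions:

```python
def caret_to_pep440(constraint: str) -> str:
    """Expand '^X.Y.Z' into an explicit '>=lower,<upper' range (numeric versions only)."""
    if not constraint.startswith("^"):
        raise ValueError("expected a caret constraint such as '^1.2.3'")
    version = constraint[1:]
    parts = [int(piece) for piece in version.split(".")]
    # The upper bound bumps the left-most non-zero component;
    # if every component is zero, bump the last one that was written.
    index = next((i for i, value in enumerate(parts) if value != 0), len(parts) - 1)
    upper = parts[:index] + [parts[index] + 1]
    return f">={version},<{'.'.join(map(str, upper))}"


for example in ("^0.12.4", "^1.0", "^20.4"):
    print(example, "->", caret_to_pep440(example))
```

Under these assumptions `^0.12.4` and `~=0.12.4` admit the same versions, but `^1.0` allows anything below 2, whereas the PEP 440 compatible-release form `~=1.0.0` stops below 1.1.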

At first glance the PEP-508 is harder to read by far. I know some will say the opposite because they are accustomed to it.

I’ll also note that no user of Poetry has complained about the chosen format.

One other argument in favor of a TOML exploded representation is validation: with PEP-508 you need to implement or embed a complete parser for the specification, while with a TOML format you can just use a more standard way to validate your data, like a JSON schema or directly in TOML if schema validators ever make it to the TOML specification.

There is a reason I made the choices I made in Poetry and it’s user experience. I didn’t make them because I felt like it but because making intuitive tools and specification is above all else for me and PEP-508 was not providing that.

Now, if we ever settle on PEP-508 for this new standard Poetry will have no choice but to follow it, unfortunately. That would be a shame because introducing a new standard is the chance to learn from past mistakes and improve on it.

bernatgabor · July 22, 2020, 11:48am

At first glance, to me personally, both seem equally hard to read. This is expected, as you have a long dependency list with various conditions in it. For me, being copyable to the CLI is a nice bonus feature of PEP 508. At the end of the day, though, both have an onboarding step involved and are harder or easier to read in various cases. I feel like you’d have a better case if you had raised this before PEP 508. However, I’d rather keep what we have than introduce confusion by having competing specifications. IMHO if we go down the table route, beginner users would be hurt just as much by having to shuffle between two existing ways as they would be helped by a potentially better format, because the transition period will be long and there’s a lot of popular material out there detailing what PEP 508 does rather than a new standard.

dstufft · July 22, 2020, 12:57pm

I’d just point out that both these cases are roughly the same. If you’re including a JSON schema parser or you’re using packaging to validate a PEP 508 string, in either case you’re using some existing library to validate. The JSON schema option would require some additional complexity that’s already handled by the packaging lib (for instance, the packaging lib validates that the version number is an actual valid version, to do that with JSON schema would involve rewriting the decently large regex from packaging, or just passing the value into packaging to validate it).
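A small sketch of that gap, using plain Python checks as a stand-in for a schema validator. The allowed key set and the `check_shape` helper are assumptions for illustration only:

```python
def check_shape(entry: dict) -> list[str]:
    """Structural checks only, roughly what a schema validator could express."""
    problems = []
    allowed = {"version": str, "extras": list, "python": str}  # hypothetical key set
    for key, value in entry.items():
        if key not in allowed:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, allowed[key]):
            problems.append(f"{key} should be {allowed[key].__name__}")
    return problems


good = {"version": "^0.12.4", "extras": ["filecache"]}
typo = {"version": "^0.12.4", "extra": ["filecache"]}
nonsense = {"version": "not a version"}

for label, entry in (("good", good), ("typo", typo), ("nonsense", nonsense)):
    print(label, check_shape(entry))
```

The misspelled key is caught, but the meaningless version string passes, because checking it needs knowledge of the version grammar that a purely structural check does not have.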

I don’t think that either option is a pure win or a pure loss in terms of user experience. Obviously everyone has their own opinion, but having to use a different format on the CLI versus in the file is a worse user experience than being able to use a unified syntax in both places. Having some form of a “standard” structured data instead of a string DSL is a better user experience for complex cases than having everyone learn a custom DSL that collapses that down into a much more terse syntax. In the cases of simple dependencies, they’re both roughly the same.

If there was a clear winner in terms of user experience, you’d see people generally latching onto that one. Instead, like many things, what you have is a trade off between optimizing for better user experience in one area, at the expense of user experience in another area. Different people will think that optimizing for a different area of user experience is more important, and that’s OK.

Personally, I find being able to use the same format everywhere as a solid win, so I’m moderately in favor of keeping with PEP 508. That being said, I think it’s also valid for us to say we want to optimize for another aspect of UX, I just don’t buy this idea that one provides a good user experience and one doesn’t.

pf_moore · July 22, 2020, 1:29pm

True. This is an ongoing problem with packaging discussions.

It would, but it’s typically been near-impossible to do. I’d love to get such input (in an unbiased form - I’d be cautious about anecdotal information from groups that potentially have a level of bias, such as experienced developers, or users of a particular tool, etc). But I don’t want to block progress on chasing after an unattainable ideal.

That’s a matter of opinion, as I’m sure you’ll agree.

Agreed, but it’s also not a strong argument for TOML-style syntax. Any format that includes quotes is IMO a potential usability problem, given the quirks of quoting in different shells. The best I can say here is that of the two proposed formats, PEP 508 uses fewer special characters. But I don’t feel that either proposal can reasonably claim to be ideal for use on the command line. Inventing an additional, CLI-only, format may be even worse, though. Having said all that, this discussion is not about CLI usage, “can it be copied to the CLI” is an incidental question at best.

Correct. I find the PEP 508 version far easier to read. So there’s little point to the comparison, as all it does is confirm that people have different views.

I’m not aware of pip/setuptools users complaining about PEP 508 format, either. But I haven’t investigated much, so if we really care about that sort of data, we should do a proper survey.

True. But unless we get 100% consensus, it is always going to be the case that some tools will need to change to conform to the standard. I sympathise with the fact that Poetry deliberately chose to use something other than PEP 508, and so this would feel like a regression, but so far the arguments that convinced you to take the route you did with Poetry either haven’t been explained clearly enough in this discussion, or haven’t been sufficiently convincing to change people’s minds.

I think this is the key here. We don’t all agree on what to optimise for, so we keep circling round the same arguments - but no-one is convinced, because we all have different goals.

To give a personal perspective, I consider it “obvious” that the following are the highest priorities:

- Simple cases can be written with minimal syntax overhead.
- Knowledge of how to write dependencies in pyproject.toml should be transferrable to other situations (CLI usage, other configuration files, metadata, …) and vice versa.

Conversely, I don’t put any priority on (for example):

- Making it easy to specify complex cases.
- Similarity to other language ecosystems.
- What libraries tools need to use to parse the data.

There are plenty of other aspects that come somewhere in the middle (teachability, discoverability, for example).

ofek · July 22, 2020, 2:29pm

For a real example, the dependency declaration in https://github.com/docker/compose/blob/789bfb0e8b2e61f15f423d371508b698c64b057f/setup.py#L28-L61 would become:

[project]
...
dependencies = [
  'cached-property >= 1.2.0, < 2',
  'distro >= 1.5.0, < 2',
  'docker[ssh] >= 4.2.2, < 5',
  'dockerpty >= 0.4.1, < 1',
  'docopt >= 0.6.1, < 1',
  'jsonschema >= 2.5.1, < 4',
  'PyYAML >= 3.10, < 6',
  'python-dotenv >= 0.13.0, < 1',
  'requests >= 2.20.0, < 3',
  'texttable >= 0.9.0, < 2',
  'websocket-client >= 0.32.0, < 1',

  # Conditional
  'backports.shutil_get_terminal_size == 1.0.0; python_version < "3.3"',
  'backports.ssl_match_hostname >= 3.5, < 4; python_version < "3.5"',
  'colorama >= 0.4, < 1; sys_platform == "win32"',
  'enum34 >= 1.0.4, < 2; python_version < "3.4"',
  'ipaddress >= 1.0.16, < 2; python_version < "3.3"',
  'subprocess32 >= 3.5.4, < 4; python_version < "3.2"',
]

[project.optional-dependencies]
socks = [ 'PySocks >= 1.5.6, != 1.5.7, < 2' ]
tests = [
  'ddt >= 1.2.2, < 2',
  'pytest < 6',
  'mock >= 1.0.1, < 4; python_version < "3.4"',
]

Note: it would also be cool if we allowed multiline literal strings in addition to arrays to get rid of quoting and commas. Basically, the same parsing for files read by pip with the -r flag.

ofek · July 22, 2020, 3:00pm

I do very much like @sdispater’s idea to have a real poll en masse. It may sound odd, but I think the best option we have is for one of us to write a comment here with a simple example of both approaches, and then whoever here has, let’s say, more than 500 followers on Twitter does a poll that links to the comment.

We could have real data by next week.

sinoroc · July 22, 2020, 3:13pm

For what it’s worth: I have been following the Python packaging topics on Stack Overflow for a bit now, and I can’t remember people having issues with the PEP 508/PEP 440 notations themselves. The cases I remember were about people being confused about which notation to use because of the deprecation of setuptools dependency_links (that was a bit rough). Once pointed to the right document, people seemed to be satisfied (or at least they went quiet). And I haven’t seen such a question in months (of course I haven’t seen every question).

A wider, more public poll would be welcome.

For the simple cases (which might be the most common cases: just the name, plus sometimes a pinned version or a range), PEP 508 feels more readable to me. My gut feeling is that when people need more than that, then they are somewhat experienced enough that they don’t get scared by such notation. A pleasant user experience is important, but we are talking about people who at the very least managed to write the very Python code they are trying to package, so I believe it’s not a big ask.

But obviously I would feel bad if poetry (and others) had to give up on their notation. I would vote for a reasonable hybrid notation (I think there were some suggestions here), but I am aware it would make for a more complex specification and more complex implementations. I would encourage the proponents of a TOML notation to come up with more suggestions.

Maybe something like that:

dependencies = [
  'A [one, two] ~= 1.2.3 ; python_version < "2.7"',
  { name = 'B [one, two] ~= 1.2.3 ; python_version < "2.7"' },
  { name = 'C [one, two] ~= 1.2.3', markers = 'python_version < "2.7"' },
  { name = 'D [one, two]', version = '~= 1.2.3', markers = 'python_version < "2.7"' },
  { name = 'E', extras = [ 'one', 'two' ], version = '~= 1.2.3', markers = 'python_version < "2.7"' },
]

Basically parse name as PEP 508 first and then everything that comes after replaces (no questions asked, conflicts are user’s fault) what’s already in the Requirement object. I believe it’s quite close to poetry’s notation.
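The "parse first, then override" rule can be sketched as follows. The regular expression is a deliberately rough stand-in for a real PEP 508 parser, and `parse_simple` and `normalize` are hypothetical helper names:

```python
import re

_SIMPLE = re.compile(
    r"^\s*(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)\s*"
    r"(?:\[(?P<extras>[^\]]*)\])?\s*"
    r"(?P<version>[^;]*?)\s*"
    r"(?:;\s*(?P<markers>.+?))?\s*$"
)


def parse_simple(text: str) -> dict:
    """Very rough PEP 508 split; a real tool would use a full parser."""
    match = _SIMPLE.match(text)
    if match is None:
        raise ValueError(f"cannot parse {text!r}")
    extras = [e.strip() for e in (match["extras"] or "").split(",") if e.strip()]
    return {
        "name": match["name"],
        "extras": extras,
        "version": match["version"].strip(),
        "markers": match["markers"] or "",
    }


def normalize(entry) -> dict:
    """Parse 'name' as PEP 508, then let explicit keys override the result."""
    if isinstance(entry, str):
        return parse_simple(entry)
    result = parse_simple(entry["name"])
    for key in ("extras", "version", "markers"):
        if key in entry:
            result[key] = entry[key]  # explicit key wins, no conflict checks
    return result


entries = [
    'A [one, two] ~= 1.2.3 ; python_version < "2.7"',
    {"name": "C [one, two] ~= 1.2.3", "markers": 'python_version < "2.7"'},
    {"name": "E", "extras": ["one", "two"], "version": "~= 1.2.3",
     "markers": 'python_version < "2.7"'},
]
normalized = [normalize(entry) for entry in entries]
for item in normalized:
    print(item)
```

All three spellings end up as the same structure apart from the name, which is the point of the proposal; the cost is that every consumer needs both the string parser and the override logic.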

bernatgabor · July 22, 2020, 3:20pm

I feel like I’m going against the current, however I don’t think it would be beneficial, because as I said above the drawback of having a migration path and supporting two ways of doing things for at least the next 3 years (but more likely 5), in my opinion, outweighs any potential benefits we might get with the table format. And this is assuming the table format is easier to read, which we can’t really all agree on.

In the best case scenario we’ll end up in a place where these dependencies are a bit easier to read/write. However, to get there we’d have to support both in the mid-term across various packaging tools and provide assistance for people mixing up the two (not to mention the time they themselves would waste when they inevitably use one format over another where only one of them is allowed). IMHO spending both maintainers’ and new users’ time on this is not the most effective usage of our (very limited) resources.

For example, an editable mode for PEP-517 is a much more important topic if we have availability.

sdispater · July 22, 2020, 3:27pm

They are not actually: one is a “standard” that is specific to Python while the other is a more generic standard that spans multiple languages and tools.

So, you agree that the TOML approach has more advantages than PEP-508 from a metadata file standpoint?

Because you know the specification. What about new users or occasional Python developers?

And yet, when I see the number of setup.py files that get it wrong (or use programmatic checks instead), I feel that it might not be the clearest specification.

I chose it because of the following main reasons:

- Readability (debatable)
- Discoverability: it’s easier to find if a dependency exists in a dict than in a list
- Explicitness
- Programmatic manipulation: You can’t easily manipulate PEP-508 strings to change parts of it compared to TOML elements.
- Consistency with what exists in other popular languages. This was the principal factor that led me to this decision.

bernatgabor · July 22, 2020, 3:39pm

Nothing stops you from loading that list into a dictionary once you read the file. And they’re equal.
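As a sketch of that extra step, using only the standard library: the `project_name` helper is hypothetical, the name extraction is simplified, and the name folding follows the PEP 503 normalization rule. Values are lists because the same project can appear several times with different markers.

```python
import re

dependencies = [
    "cached-property >= 1.2.0, < 2",
    "docker[ssh] >= 4.2.2, < 5",
    'colorama >= 0.4, < 1; sys_platform == "win32"',
]


def project_name(requirement: str) -> str:
    """Leading distribution name, lower-cased with runs of -, _ and . folded to '-'."""
    raw = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement).group(1)
    return re.sub(r"[-_.]+", "-", raw).lower()


# Map each normalized name to every requirement line that mentions it.
by_name = {}
for requirement in dependencies:
    by_name.setdefault(project_name(requirement), []).append(requirement)

print("docker" in by_name)
print(by_name["cached-property"])
```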

Not sure what part of it you consider to be more explicit?

Why? What’s wrong with https://packaging.pypa.io/en/latest/requirements? You update the property, and then call str on it?

The biggest problem here is that to get where other languages are would be a lot of pain. So the question is are we prepared to hurt a lot in the next few years just to be on par with other languages? And while getting there also losing some current features (e.g. copy-paste-ability of specifications).

sdispater · July 22, 2020, 4:18pm

That’s an extra step just to circumvent an issue that could be solved at the specification level.

You have the name of the elements specified directly in the file (like extras) which helps make it more self documenting.

That’s an extra dependency you need to have while you could rely on the fact that any TOML parser will return native types that are easily manipulable.

So, we just give up and don’t try to improve on what we have? Shouldn’t this be a goal in itself? To provide a user/developer experience that is on par with what other languages have?

That being said, one reason other languages were able to pull this off is because they mostly have one or two tools of reference, instead of several tools like we have in Python, so making a transition like this is easier.

xafer · July 22, 2020, 4:30pm

I think the point of Donald was:

- one is a string following a single specification: PEP-508,
- the other is a string following & combining several specifications: TOML & PEP-508.

ofek · July 22, 2020, 4:39pm

If we find out that most people outside of our bubble prefer the TOML way, then the answer should be a resounding “yes” from all of us here.

That’s a good thing! There should be a core dependency parser to avoid duplicate work by setuptools, Poetry, Hatch, Flit, etc.