Exactly one of the command items MUST include a {} placeholder, which will be replaced by the mapped package identifier(s). The install command SHOULD support the placeholder being replaced by multiple identifiers, query MUST only receive a single identifier per command.
For consistency, I think this should say “specifier” rather than “identifier”.
Since the placeholder may expand into multiple tokens, it wouldn’t make sense for the item to contain any other text, so suggest changing “include a {} placeholder” to “be a {} placeholder”. Or maybe, for future extensibility, the placeholder should be given a name like {specifiers}.
exact_version MUST be None or a list of strings that describe the syntax used for specifiers that only express exact version constraints
What’s the distinction between exact_version and equal, and how does this relate to the DepURL? The conda example uses == for exact_version and = for equal, but conda defines = as something which doesn’t correspond directly to any PEP 440 operator.
the keys equal, greater_than, greater_than_equal, less_than, less_than_equal, and not_equal
not_equal is not one of the operators allowed in PEP 725; I guess this is an oversight.
For consistency, I think this should say “specifier” rather than “identifier”.
Noted, will update.
Since the placeholder may expand into multiple tokens, it wouldn’t make sense for the item to contain any other text, so suggest changing “include a {} placeholder” to “be a {} placeholder”. Or maybe, for future extensibility, the placeholder should be given a name like {specifiers}.
The choice of “include” there is intentional. In some package managers you may need some boilerplate around the placeholder of the specifier. More specifically, cases like --package={}. That said, it feels like this can be done in the specifier_syntax dict too, so I’ll double check whether we can remove it.
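To make the "include" reading concrete, here is a minimal sketch of how a tool might expand the placeholder. The function name `expand_command` and the command `tool add` are hypothetical, and repeating the boilerplate once per specifier is an assumption, not something the PEP text states.

```python
def expand_command(command, specifiers):
    """Expand the single "{}" item of `command` with the given specifiers.

    Assumption: an item such as "--package={}" is repeated once per
    specifier; a bare "{}" item simply becomes one token per specifier.
    """
    placeholder_items = [item for item in command if "{}" in item]
    if len(placeholder_items) != 1:
        raise ValueError("exactly one command item must contain '{}'")

    expanded = []
    for item in command:
        if "{}" in item:
            # One output token per specifier, keeping surrounding boilerplate.
            expanded.extend(item.replace("{}", spec) for spec in specifiers)
        else:
            expanded.append(item)
    return expanded


if __name__ == "__main__":
    print(expand_command(["conda", "install", "{}"], ["numpy>=2", "zlib"]))
    print(expand_command(["tool", "add", "--package={}"], ["a", "b"]))
```

With a bare `{}` item the two readings ("include" and "be") behave the same; they only differ when the item carries extra text.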
What’s the distinction between exact_version and equal, and how does this relate to the DepURL? The conda example uses == for exact_version and = for equal, but conda defines = as something which doesn’t correspond directly to any PEP 440 operator.
exact_version is a literal string equality, just like PEP 440 ===. However, some package managers don’t have the notion of operators, so we need to provide this subset of functionality separately too.
equal is PEP 440 ==, which has dual behaviour: literal version matching, and fuzzy matching (if requested with a .* suffix). The closest mapping between PEP 440 and conda is the one described in the example (conda’s = is always PEP 440 ==x.y.*).
not_equal is not one of the operators allowed in PEP 725; I guess this is an oversight.
It is not included in PEP 725 because some package managers do not support it, and we want to avoid the presence of incompatible DepURLs for now. However the specifier_syntax dict does specify all operators in case that changes in the future. I understand this is confusing though, so I’d be happy to clarify further if needed.
I’m also wondering how much of this discussion is taking into account the fact that we’re rapidly moving toward AI-assisted development, where agents will be reading, writing, and acting on this metadata. How should this new context inform our design choices here? Things like clear check/install commands, rationale fields, predictable schemas, explicit platform distinctions, idempotent actions, and perhaps even more precise ways of describing the full environment a package needs, including system settings or environment variables, so agents can reconstruct it safely and reliably.
@jaimergp @rgommers Late to the party, but as it happens there is an effort to actually create a registry of PURLs, supported in part by NLnet (NLnet; C/C++ Package Registry), with a focus on C/C++ code, that would likely mesh well with this PEP. I think this is going to morph into something a bit more generic than strictly C/C++, to address the problem of identifying packages that are not in a registry, like GNU libc, the GNU GCC libstdc++, OpenSSL, Boost, zlib, and similar.
Eventually (and even more so since CVE.org has merged PURL support in their schema 5.2) there is a need to have PURLs for packages that do not live in a registry, and this is a discussion I entertained specifically with Linux maintainers.
I would like to take the opportunity to thank everyone who took the time to take a look, read through the text and provide incredibly insightful comments that have helped improve this proposal so much. I’m beyond grateful!
Thanks, I wasn’t aware of the === operator. But it isn’t listed in PEP 725, and the current spec describes it as “heavily discouraged”. And if its main distinction from == is whether it supports the .* suffix, that suffix isn’t mentioned in PEP 725 either, and I don’t suppose it would be easy to map to different package managers if it was.
But the actual mapping we’re doing is in the opposite direction. If we convert a PEP 440 dependency of ==x.y into conda’s =x.y, then that would allow the installation of x.y.99, which isn’t correct according to PEP 440 semantics. Conda’s ==x.y seems like a closer match.
In which case, there wouldn’t be any need for an exact_version key, because package managers that don’t support operators, but do support exact versions, can express that using a version_ranges which only contains equals.
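The mismatch described above can be illustrated with a toy comparison. This is only a sketch: it handles purely numeric release segments, and it models conda’s = as the prefix match described earlier in this thread (equivalent to PEP 440 ==x.y.*), which is an assumption taken from the discussion rather than from conda’s own documentation. All function names are hypothetical.

```python
def _release(version):
    """Parse a purely numeric dotted version such as "1.2.99" into a tuple."""
    return tuple(int(part) for part in version.split("."))


def _pad(release, length):
    """Pad a release tuple with trailing zeros up to `length` segments."""
    return release + (0,) * (length - len(release))


def pep440_equal(candidate, spec):
    """Simplified PEP 440 `==`: zero-padded equality, or prefix match for ".*"."""
    cand = _release(candidate)
    if spec.endswith(".*"):
        prefix = _release(spec[:-2])
        return _pad(cand, len(prefix))[: len(prefix)] == prefix
    wanted = _release(spec)
    length = max(len(cand), len(wanted))
    return _pad(cand, length) == _pad(wanted, length)


def conda_fuzzy_equal(candidate, spec):
    """conda `=x.y`, modelled here as PEP 440 `==x.y.*` (assumption)."""
    return pep440_equal(candidate, spec + ".*")


if __name__ == "__main__":
    # Translating PEP 440 "==1.2" into conda "=1.2" widens the match.
    print(pep440_equal("1.2.99", "1.2"))
    print(conda_fuzzy_equal("1.2.99", "1.2"))
```

Under these assumptions, 1.2.99 is rejected by PEP 440 ==1.2 but accepted by the fuzzy form, which is the widening that the translation would introduce.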
There’s some inconsistency between the structure of the top-level tables:
definitions and mappings are arrays within which each item has an id key.
package_managers is also an array, but each item has a name key.
ecosystems is a dict instead.
I’d suggest regularizing all of these to dicts. This would enforce uniqueness of keys, and allow for faster startup time when reading what could be some quite large files, since the data would come out of the JSON decoder already in a form suitable for lookup.
Each schema has a schema_version, but this isn’t defined anywhere. It would seem reasonable to follow the same rules as Python’s Metadata-Version, with major and minor versions so the consumer can detect whether the format has changed in an incompatible way.
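As a sketch of what such a check could look like, assuming schema_version is a "major.minor" string (the current text does not define its format) and borrowing the Metadata-Version convention of failing on a newer major version and warning on a newer minor one. The constant and function names are hypothetical.

```python
import warnings

# Highest schema version this hypothetical consumer understands.
SUPPORTED = (1, 0)


def check_schema_version(value):
    """Return the parsed (major, minor) tuple, or raise if it is too new.

    Assumption: `value` looks like "1" or "1.2".
    """
    major_text, _, minor_text = value.partition(".")
    version = (int(major_text), int(minor_text) if minor_text else 0)

    if version[0] > SUPPORTED[0]:
        # A newer major version may have changed incompatibly: refuse to read.
        raise ValueError(f"unsupported schema_version {value!r}")
    if version > SUPPORTED:
        # Same major, newer minor: readable, but fields may be unknown.
        warnings.warn(f"schema_version {value!r} is newer than supported")
    return version
```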
It does seem a shame to diverge from the PURL ecosystem when it’s just getting started. I wonder if there are other ways to achieve these two goals. For example:
Virtual packages could use namespaces of the generic type. The PURL spec doesn’t define any generic namespaces, but it doesn’t forbid them as it does with conda. So we’re free to use them however we want.
Version constraints could be written immediately after the PURL, just as we’ve already allowed environment markers. This would effectively give an external dependency specifier the same syntax as a Python dependency specifier, with the PURL taking the place of the package name.
Because = is allowed in PURLs, a space would probably be required between the PURL and the constraints.
We could allow a single version in the PURL itself, or a set of PEP 440 constraints after the PURL, but not both.
Thanks @mhsmith, I’ll take this into consideration for the current revision in the coming days! I’ll reply here to the last points for now.
I agree that the need for dep:-variants is a pity. I am willing to explore ways around the limitations we found in PURL. The generic namespace can work. We would “claim” the compiler and interface namespaces (but only as namespaces, not as names). Perhaps @pombredanne is also willing to bless these two generic namespaces.
About the space separators, though, I don’t dislike it, but I see two potential problems:
The space might be easy to miss and some folks may be tempted to not use it, especially given how the space in Python requirements is kind of optional (e.g. numpy>=2 and numpy >=2 are both valid).
The percent-encoding rules disallow literal spaces in a PURL, so a space has to be percent-encoded. However, spaces are allowed as input, and packageurl-python will happily parse one as part of the name. Not sure if this introduces ambiguity as to what constitutes the PURL part and what constitutes our version specifier.
The same is true of environment markers with a semicolon, so I guess some kind of pre-parsing to separate the PURL from the rest was always going to be required.
Instead, what if we break the need for string-only identifiers, and use structured objects? This is valid TOML:
[external]
build-requires = [
# simple PURLs, no version, can use a string
"pkg:generic/compiler/c",
# for deps with versions, use an object
{purl = "pkg:generic/compiler/cxx", version = ">=21"},
# add environment markers too, should these two be equivalent?
"pkg:generic/compiler/go; sys_platform == 'darwin'",
{purl = "pkg:generic/compiler/go", environment_marker = "sys_platform == 'darwin'"},
]
I like structured objects like this a lot, it removes all the ambiguity, but at the cost of verbosity. It also challenges the familiarity of strings in the dependencies table.
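On the "should these two be equivalent?" question, one way to picture it is a consumer that normalizes both forms into the same structure after the TOML has been parsed. This is only a sketch: `normalize_entry` is a hypothetical helper, the keys are the ones used in the snippet above, and it relies on `;` never appearing inside a PURL.

```python
def normalize_entry(entry):
    """Turn a string or table entry into one common dict shape."""
    if isinstance(entry, str):
        # String form: an optional environment marker follows the first ";".
        purl, _, marker = entry.partition(";")
        return {
            "purl": purl.strip(),
            "version": None,
            "environment_marker": marker.strip() or None,
        }
    # Table form: missing keys are treated as "not specified".
    return {
        "purl": entry["purl"],
        "version": entry.get("version"),
        "environment_marker": entry.get("environment_marker"),
    }


if __name__ == "__main__":
    as_string = "pkg:generic/compiler/go; sys_platform == 'darwin'"
    as_table = {
        "purl": "pkg:generic/compiler/go",
        "environment_marker": "sys_platform == 'darwin'",
    }
    print(normalize_entry(as_string) == normalize_entry(as_table))
```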
I’ve had another look through the PURL spec, and I think it’s rigorous enough that we can make the space optional and still keep the parsing simple. Specifically:
The syntax of an external dependency specifier is:
purl wsp* versionspec? wsp* quoted_marker?
The purl is terminated by any of the following strings, none of which are valid in a PURL:
==, >, <, ;
The remainder of the specifier is then parsed for versionspec and quoted_marker according to the same syntax as a Python dependency specifier, except for the limited set of version operators. If the purl already contains a version, then versionspec is not allowed.
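A minimal sketch of that pre-parsing step, to show that the split stays simple. The function name is hypothetical, the versionspec and marker are returned as raw strings rather than parsed, and treating any `@` in the PURL part as "a version is present" is a simplifying assumption.

```python
import re

# The purl ends at the first of these strings, none of which is valid in a PURL.
_TERMINATOR = re.compile(r"==|>|<|;")


def split_external_specifier(text):
    """Split `purl wsp* versionspec? wsp* quoted_marker?` into three parts."""
    match = _TERMINATOR.search(text)
    if match is None:
        purl, rest = text, ""
    else:
        purl, rest = text[: match.start()], text[match.start():]
    purl = purl.strip()

    # Whatever follows the first ";" is the marker; before it, the versionspec.
    versionspec, _, marker = rest.partition(";")
    versionspec = versionspec.strip()
    marker = marker.strip()

    if versionspec and "@" in purl:
        # Either the PURL version syntax or the Python syntax, not both.
        raise ValueError("version given both in the PURL and as a versionspec")
    return purl, versionspec or None, marker or None
```

Note that a single `=` inside PURL qualifiers does not end the purl here, since only `==` is treated as a terminator.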
mappings are arrays because there can be more than one possible way to map a DepURL to package names/specifiers. If we want a dict there, then we need to make the value an array of dictionaries, nesting the schema even further. I’m ok with that but I found the list of dicts approach simpler.
Thanks, I’d also like to hear other opinions on this question. For clarity, my proposal is that all of these examples would be valid except for pkg:generic/ninja@>=2.0, because you can use either the PURL version syntax or the Python syntax, but not a mixture.
OK, my main concern here is performance, since these lists could contain 10,000 items or more. I guess Python’s JSON reader is optimized enough that actually reading the file wouldn’t be a problem, but I’d like to see some numbers for how long it would take to do a typical lookup with a linear search in Python code.
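A small script along these lines could produce those numbers. It is a sketch with synthetic data: the `specs` key and the generated ids are made up for illustration, and the index maps each id to a list so that several mappings per id remain possible.

```python
import timeit


def make_mappings(count):
    """Build a synthetic list-of-dicts mapping table (made-up ids and keys)."""
    return [
        {"id": f"pkg:generic/example-{i}", "specs": [f"example-{i}"]}
        for i in range(count)
    ]


def linear_lookup(mappings, dep_id):
    """Scan the whole list, collecting every entry with a matching id."""
    return [entry for entry in mappings if entry["id"] == dep_id]


def build_index(mappings):
    """Group entries by id once, so later lookups are dict accesses."""
    index = {}
    for entry in mappings:
        index.setdefault(entry["id"], []).append(entry)
    return index


if __name__ == "__main__":
    mappings = make_mappings(10_000)
    target = mappings[-1]["id"]  # worst case for the linear scan
    index = build_index(mappings)

    runs = 100
    linear = timeit.timeit(lambda: linear_lookup(mappings, target), number=runs)
    build = timeit.timeit(lambda: build_index(mappings), number=runs)
    indexed = timeit.timeit(lambda: index.get(target, []), number=runs)
    print(f"linear scan:   {linear / runs * 1e6:.1f} µs per lookup")
    print(f"index build:   {build / runs * 1e6:.1f} µs (one-off)")
    print(f"indexed lookup: {indexed / runs * 1e6:.3f} µs per lookup")
```

The one-off index build is the cost of keeping the list-of-dicts file format while still getting dict-speed lookups afterwards.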
One could even argue that it is indeed valid, because the version may be empty. The spec is not super clear (it hints that it should not be empty, but doesn’t forbid it, in my opinion). The Python parser doesn’t complain either: