PEP 440: Total Character Length Limit for Version Scheme

As described in No Constraint on Version Names Can Cause Issues · Issue #12483 · pypi/warehouse · GitHub, there is no limit on the total character length of a version identifier. This can cause problems, whether through deliberate abuse or by accident, when a version is long enough to hit file name or file path length limits on the filesystem.

So far this only seems to have been encountered within projects which are mirroring PyPI ("File name too long" error on centos7.2 · Issue #1200 · pypa/bandersnatch · GitHub, Enhancement: Filter packages with very long versions · Issue #1228 · pypa/bandersnatch · GitHub, Hello, my PyPI package doesn’t seem to have been synced for a long time, could you please take a look? Thanks · Issue #1538 · tuna/issues · GitHub), likely because no major projects have versions long enough to cause an issue.

Out of curiosity I dug into this a bit with Google BigQuery. For all packages in the-psf.pypi.distribution_metadata, the summary of version string lengths (in characters) is:

count    7.902727e+06
mean     6.598360e+00
std      3.552846e+00
min      1.000000e+00
25%      5.000000e+00
50%      5.000000e+00
75%      7.000000e+00
99%      2.200000e+01
max      2.350000e+02

Out of 7,902,727 published package versions there are:

  • 337,624 over 16 characters
  • 697 over 32 characters
  • 407 over 64 characters

It’s kind of surprising to me that hundreds of releases have such long versions. Either way, overall 99.991% of versions are 32 characters or shorter.

There’s actually a discussion about this on semver (What should the size of numeric identifiers be? · Issue #304 · semver/semver · GitHub), but I don’t think any limit was set in the specification. Practically there is a limit, though: major/minor/patch get parsed as integers and JS’ max safe integer is 9007199254740991 (16 digits), so the max string length for major.minor.patch is 50 characters for node js.
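For anyone who wants to check the 50 figure, here is the arithmetic as a small sketch. It only covers the major.minor.patch core and ignores pre-release and build metadata, which is my own simplifying assumption.

```python
# Largest integer a JavaScript Number can represent exactly (2**53 - 1).
MAX_SAFE_INTEGER = 2**53 - 1

# Number of decimal digits in the largest safe integer.
digits_per_component = len(str(MAX_SAFE_INTEGER))

# major.minor.patch: three numeric components plus two separating dots.
max_core_length = 3 * digits_per_component + 2

print(MAX_SAFE_INTEGER)      # 9007199254740991
print(digits_per_component)  # 16
print(max_core_length)       # 50
```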

PEP 440 says that “the versioning specification may be updated with clarifications without requiring a new PEP or a change to the metadata version”. IMO adding a ‘sensible’ (whatever that may be) limit to the total character length of a version would fit into this.


Edit: poll to keep track of limits people are happy with

Character Length Limit for Versions
  • 32
  • 50
  • 64
  • other
  • none

+1 on limiting version length.

Searching my database of the data from the JSON API, many of the longer versions look invalid (for example, 0.9dev-BZR-r1-panta-elasticworld.org-20110427165731-j00nsiss2af57yhu for softwarefabrica-django-director).

I get only two valid versions over 64 characters long, and 107 invalid versions over 64 characters long. I’m not sure why my numbers are different from yours.

In practice, I don’t think that adding a length restriction in PEP 440 will make a significant difference. To deal with issues like this, at some point we’re just going to have to bite the bullet and drop invalid projects from PyPI (at which point maybe adding a limit would be worthwhile).

Frankly, though, for now I’d suggest that someone just reach out to the author of uselesscapitalquiz (the project with a version number of pi to 217 decimals), and ask them to choose a more reasonable version number.

Paul mostly beat me to it, but here’s the full list of versions > 32 characters on PyPI right now: https://gist.github.com/di/b7ed90e661b7820aa51613034bb25ab7 for anyone curious.

64 seems reasonable then?

I’m not too sure either. I used BigQuery with the PSF PyPI distribution metadata dataset, selected the version column, and took its CHAR_LENGTH.

When I opened an issue on the warehouse repo to discuss this, @EWDurbin mentioned that PEP 440 should probably specify a limit before PyPI enforces one, which is sensible. Plus, if this is added to the PEP, then tools that build or publish packages (Poetry, Flit, GitHub PyPI Publish action, etc…) can be updated to enforce it on their end as well.

This would only prevent more packages with unreasonable versions being added, not deal with the ones that are already there, but it’s a start.

I can contact him.

I edited the original post and added a poll. It’s multi-choice, so people can pick whatever they’re happy with.

I voted, but on reflection a crucial piece of information is what length would fix the “file name too long” issue that triggered this? As it’s a file name issue, would a stupidly long project name have the same effect?

Ah, I was focused on the version since that caused these issues, didn’t think of dealing with the real issue which is the full file name.

The length of the wheel filename would have to be under 255 characters, so {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl is the real limit.

Longest combination I could think of would be cp311-cp32dmu-manylinux2010.whl which is 31 characters, leaving ~220 for the name, version, and build tags.

If there’s a limit of 64 for the name and for the version then that would leave ~90 characters spare for the build tag and for any additional platform tags.
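To make the budget above concrete, here is a rough sketch of the arithmetic. The 255 limit and the 64/64 caps are just the numbers floated in this thread, and `remaining_budget` is a made-up helper, not anything from an existing tool.

```python
FILENAME_LIMIT = 255  # per-filename limit assumed in this thread


def remaining_budget(name_limit, version_limit, tag_suffix, limit=FILENAME_LIMIT):
    """Characters left over for an optional build tag and any extra tags.

    Assumed layout: {name}-{version}-{tag_suffix}, where tag_suffix already
    holds the python/abi/platform tags and the ".whl" extension.
    """
    separators = 2  # the hyphen after the name and the one after the version
    used = name_limit + version_limit + separators + len(tag_suffix)
    return limit - used


suffix = "cp311-cp32dmu-manylinux2010.whl"
print(len(suffix))                       # 31
print(remaining_budget(64, 64, suffix))  # 94
```

A build tag would also need its own hyphen, so the usable space is slightly smaller than the number printed.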

So you’re right, there should be a length limit for both version and project name. Alternatively a total wheel name limit could be mentioned as part of the binary distribution format specification (Binary distribution format — Python Packaging User Guide), since that’s what really leads to issues.

You are aware that tags can be multiple individual tags separated by dots? For example, numpy-1.23.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl has two platform tags. And MacOS binaries can have ridiculous numbers of tags - for example, this is a genuine wheel on PyPI: rgf_python-3.6.0-py2.py3-none-macosx_10_6_x86_64.macosx_10_7_x86_64.macosx_10_8_x86_64.macosx_10_9_x86_64.macosx_10_10_x86_64.macosx_10_11_x86_64.macosx_10_12_x86_64.macosx_10_13_x86_64.macosx_10_14_x86_64.whl
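As an illustration of how those dotted tag sets inflate the filename, here is a simplified sketch that splits the two filenames quoted above. `split_wheel_filename` is a hypothetical helper written for this post; it assumes a well-formed name and does no validation, so it is not a replacement for a real parser.

```python
def split_wheel_filename(filename):
    """Split a wheel filename into its hyphen-separated components.

    Simplified: assumes the name is well formed and ends in ".whl".
    Each tag field may hold several tags separated by dots.
    """
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    if len(parts) == 6:
        # The optional build tag is present.
        name, version, build, py, abi, plat = parts
    else:
        name, version, py, abi, plat = parts
        build = None
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": py.split("."),
        "abi": abi.split("."),
        "platform": plat.split("."),
    }


numpy_wheel = "numpy-1.23.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
rgf_wheel = (
    "rgf_python-3.6.0-py2.py3-none-"
    "macosx_10_6_x86_64.macosx_10_7_x86_64.macosx_10_8_x86_64."
    "macosx_10_9_x86_64.macosx_10_10_x86_64.macosx_10_11_x86_64."
    "macosx_10_12_x86_64.macosx_10_13_x86_64.macosx_10_14_x86_64.whl"
)

numpy_info = split_wheel_filename(numpy_wheel)
rgf_info = split_wheel_filename(rgf_wheel)

print(numpy_info["platform"])     # ['manylinux_2_17_x86_64', 'manylinux2014_x86_64']
print(len(rgf_info["platform"]))  # 9
print(len(rgf_wheel))             # total filename length in characters
```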

Note also that the filename of the wheel for uselesscapitalquiz is only 253 characters long, so your limit of 255 wouldn’t help here. I’m now inclined to suggest that this is more of an OS limitation for certain index servers than something we should be fixing by trying to put limits on values in the standards.

Unless we have a proposal for how we’d publish entirely legitimate wheels whose filenames exceed our chosen limit, I don’t think a filename limit is viable. For a concrete example, if we chose a limit of 200 characters, how would we support that rgf_python distribution[1]?

One solution may be to work out a better tagging scheme for MacOS (which is the main culprit here), but that’s a much bigger piece of work.


  1. If we choose a limit of 255 characters, we’re only delaying the problem until macos 10.15 comes out… ↩︎

Possibly relevant: Windows still has a hard limit of 260 characters for file names (more precisely, any individual segment of the path), even though the total path length can exceed that. 255 is a reasonable limit to allow space for the extension and null terminator that someone will no doubt need to fit into their 260 char buffer without breaking…

Wow! I had no idea, but then again I do very little packaging. Do people really read the chain of version info, or is that left to machines?

I will toss out an ill-considered idea which I anticipate will be shot down immediately. Would it be possible to compress that long string of platforms then convert back to ASCII? When testing for compatibility you’d have to decompress, but it would still give a bit more breathing room in filenames.
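To make the idea concrete, here is a rough sketch of what I mean. The choice of zlib plus base32 is entirely my own assumption (base32 avoids the `-`, `.` and `_` characters that already have meaning in wheel filenames), and `pack_tags` / `unpack_tags` are made-up names. Nothing like this exists in any packaging standard.

```python
import base64
import zlib


def pack_tags(tags):
    """Compress a dotted tag string and encode it with a filename-safe alphabet."""
    raw = zlib.compress(tags.encode("ascii"), 9)
    # Base32 output uses only A-Z and 2-7; strip the "=" padding.
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def unpack_tags(packed):
    """Reverse pack_tags: restore the padding, decode, and decompress."""
    padded = packed + "=" * (-len(packed) % 8)
    return zlib.decompress(base64.b32decode(padded)).decode("ascii")


platform_tags = ".".join(f"macosx_10_{minor}_x86_64" for minor in range(6, 15))
packed = pack_tags(platform_tags)

# Print both lengths so they can be compared.
print(len(platform_tags), len(packed))
print(unpack_tags(packed) == platform_tags)  # True
```

An installer would have to run something like `unpack_tags` on every candidate filename before it could check compatibility.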

It’s meant for machines, but people do peruse the files sometimes (e.g. I look quickly at the Python version to see if a wheel is available for the latest Python release).

I don’t think it would be worth the overhead.

The primary reason to put tags in wheel file names is to make it easier for installers to scan through a list and locate compatible wheels. Making that more complex would slow down code in the hot path for installers when installing wheels directly from an index.

There are two hard problems in computer science…

In fact, it came out 3 years ago, and there are 3 versions since plus now universal2 and arm64 architectures to worry about…

I do that as well, FWIW

Are the platform tags specifically important to be legible? Perhaps if a filename would be greater than 260 characters, the platform tag string could be compressed first. We should come up with a compression scheme then.

I can’t imagine decompressing tags would be anywhere near comparable to extracting metadata or downloading in terms of compute time.

So it sounds like:

The focus on PEP 440/versions is kind of a red herring: the actual problem is that there’s an artifact on PyPI with a 253 character filename, and when bandersnatch is running it creates temporary files whose names incorporate the artifact filename + 10 extra characters. Put these things together and you have 263 characters. The host running bandersnatch had a 255 character limit for filenames, so this didn’t work.

It seems hard to create a general solution to this problem, because we don’t know how much filename expansion an arbitrary processing pipeline is going to add, and different systems have different filename limits anyway. Also, I think Windows still has a 260 character limit on paths, i.e. filename + the directories it’s under, unless you do some special non-default stuff to disable it, so that’s very easy to hit, even if your tool doesn’t mess with the artifact’s filename itself.

But skimming through Comparison of file systems - Wikipedia it does seem like 255 characters is a pretty common limit, and almost no-one goes above that.

So my suggestion:

  • Don’t change PEP 440
  • PyPI should definitely reject filenames >255 characters.
  • PyPI should probably also reject filenames that are “too close” to 255 characters, with the understanding that no limit is going to be good enough to guarantee things work in all cases, so we’re just handwaving. Maybe 200 characters?
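A minimal sketch of what such a check could look like, using the numbers from this post. The names `HARD_LIMIT`, `SOFT_LIMIT`, `check_filename` and `tool_overhead` are made up for illustration; this is not how PyPI or bandersnatch actually implement anything.

```python
HARD_LIMIT = 255  # common filesystem limit on a single filename
SOFT_LIMIT = 200  # the hand-waved "too close" threshold suggested above


def check_filename(filename, tool_overhead=0):
    """Classify a filename by length.

    tool_overhead models extra characters a downstream tool adds,
    e.g. a mirror's temporary-file suffix.
    """
    effective = len(filename) + tool_overhead
    if effective > HARD_LIMIT:
        return "reject"
    if effective > SOFT_LIMIT:
        return "too close"
    return "ok"


# Stand-in for a 253-character artifact filename.
long_name = "x" * 249 + ".whl"

print(check_filename(long_name))                    # too close
print(check_filename(long_name, tool_overhead=10))  # reject
print(check_filename("numpy-1.23.4-cp38-cp38-manylinux2014_x86_64.whl"))  # ok
```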

Those wild macOS wheel tags were a temporary workaround for the period when pip didn’t understand macOS versioning properly. That huge string in the rgf_python wheel is equivalent to macosx_10_6_x86_64, and these days everything understands that. So it’s not really an issue anymore.

Technically no, although people do still read them quickly for various reasons.