The complication is having to resort to and maintain dreadful hacks like
if (ROOT / 'PKG-INFO').exists():
in order for an “sdist build” to use the pre-packaged JS resources that are exactly the ones we build and publish separately, rather than trying to build them directly from TypeScript sources.
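For readers who haven’t seen this pattern, a minimal sketch of what such a check might look like (the ROOT constant and the build_js flag are illustrative, not Bokeh’s actual code). PKG-INFO only exists inside an unpacked sdist, so its presence is used to tell “building from an sdist” apart from “building from a VCS checkout”:

```python
from pathlib import Path

# Illustrative sketch only; not Bokeh's actual setup.py.
ROOT = Path(__file__).parent

if (ROOT / "PKG-INFO").exists():
    # PKG-INFO is written into sdists by the build tooling, so we are
    # building from an unpacked sdist: use the pre-packaged JS resources.
    build_js = False
else:
    # Building from a VCS checkout: compile the JS from TypeScript sources.
    build_js = True
```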
I had a look at the structure in Bokeh, and I think this isn’t how it’s supposed to be structured (disclaimer: this is my first look at it, so I may be missing some of the subtleties). What you have is two packages:
BokehJS, a JavaScript package
Bokeh, a Python package which depends on finding BokehJS at runtime
Right now you build BokehJS from within your setup.py, but there is no cross-talk between the JS and Python packages beyond finding a single path to the installed JS package. Hence, that JS package build probably should not live inside setup.py at all (pretty much 100% of your setup.py content is about JS, not Python). You could have:
A single pure Python package with completely static metadata and no setup.py at all. The sdist would have no hacks, and would match the code in VCS.
A JS package build done through either a separate script with the Python code you have inside setup.py now, or whatever the JS-native way is to invoke that build (nicer for JS contributors I suspect).
A configuration mechanism to point the installed Python package to the JS package. Defaulting to bokeh.server.static if the user doesn’t provide a configuration, but overridable for development builds or custom builds. Anything from an env var to an entrypoint could work. Or maybe even that isn’t needed, since the path of the JS package build doesn’t seem to be exported from setup.py - the .py code to locate BokehJS may simply check if the default local bokehjs/build/... path exists, and point to the static server otherwise.
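As a rough illustration of that last point, here is a hedged sketch of how the lookup could work; the BOKEHJS_DIR environment variable and the bokehjs/build path are assumptions for illustration, not Bokeh’s actual code:

```python
import os
from pathlib import Path


def find_bokehjs_dir() -> Path:
    """Locate the BokehJS resources, assuming a hypothetical layout.

    Priority: explicit env var override, then a local development build,
    then the static resources shipped with the installed Python package.
    """
    # 1. Explicit override, e.g. for custom or development builds
    #    (env var name is an assumption).
    override = os.environ.get("BOKEHJS_DIR")
    if override:
        return Path(override)

    # 2. A local development build next to the source tree (path is illustrative).
    local_build = Path(__file__).parent.parent / "bokehjs" / "build"
    if local_build.exists():
        return local_build

    # 3. Fall back to the resources installed with the Python package.
    return Path(__file__).parent / "server" / "static"
```

The point is that none of this needs to live in setup.py; it is a runtime lookup inside the Python package.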
I think this isn’t how it’s supposed to be structured.
Hi @rgommers, things are the way they are after ~12 years for a variety of reasons and history, and I don’t believe what you have described would work for our needs. But I also don’t really want to derail this thread, so please stop by our GH discussions if you’d like to get into the details.
That’s absolutely fair, and if the process you have is working for you, then I think that’s fine. But this thread is discussing what people would like in terms of best practices, and “distributions find it useful to have docs and tests in sdists” is a perfectly reasonable thing to ask for in a discussion of what constitutes best practice. Individual projects don’t have to follow the recommendations, but in the absence of specific reasons, they should be a good starting point.
@rgommers’ suggestion is useful in that it explains a way to structure a project that builds non-Python artefacts while keeping the Python package pure, and still conforms to the recommendations. It may not work for you, but I still think it’s good general advice. Agreed that the project tracker is the best place to discuss the details, though, so thanks for taking that part of the discussion offline.
Of course, whether anyone ends up collecting the comments in this thread into any sort of recommendation document, or it just ends up here for people to find in the course of searches, is a whole different question. But I think the original post here highlights that such a document would be useful to have.
Bringing things back to the topic of the discussion, though: as stated in this thread’s title, the OP’s question was specifically about whether and why uploading an sdist alongside the wheel is considered a best practice in the context of a pure Python project, i.e. the case where offering an sdist has the least obvious immediate benefit, but also where doing so takes the least effort (in fact, it actually takes more effort not to when using the standard workflows and tools).
For a complex, non-pure-Python project, the value proposition is a little different on both fronts. There’s substantially more benefit to providing an sdist for both users and redistributors, given the wheel isn’t suitable in many more cases; there’s also more potential complexity, though typically far less of it provided the project is well structured, takes advantage of modern tooling and follows current best practices.
I certainly understand that it can sometimes be quite difficult to move toward this when burdened by 12 years of legacy code and decisions made in a time before the modern packaging landscape existed, and you of course have your reasons for that. Given that, as you say, those reasons are quite specific to your particular project’s unconventional structure (particularly where sdists are concerned), I also agree with you that that sort of discussion is best had on your issue tracker or elsewhere, as it’s getting a little off-topic for this thread.
Sorry I was unclear; I was intending to say that I agree with you about that:
but it seems I put way too much junk (the [...] elided portion in the quote above) between the “I agree” part and that phrase, which obscured it. Ironically, I had edited it several times to try to communicate more clearly that I was agreeing with you there and echoing your previous comment to that effect, but what’s clear (heh) is that I still didn’t do a good job of it, sorry; I’ll give it another shot.
Thanks everyone for the replies. I didn’t realize the tradeoffs involved in deciding between VCS source checkouts or GitHub-generated tarballs vs. sdists for redistributors.
sphinx-theme-builder is a very good example of how the sdist → wheel step can involve complexity.
Yup, I know. But wheels don’t include .pyc files. Actually, my first thought when I heard about wheels was “surely the main distinction between wheels and sdists is that an sdist is just the source and wheels contain pycs”, but that’s wrong.
Absolutely agreed.
Also, it might be worth it to hide more of the complexity from newcomers. To take an analogy, for x in lst: in Python creates an iterator that yields elements through a __next__ method eventually raising StopIteration, and there is huge complexity behind the simple expression a.b (__getattr__, __get__, the descriptor protocol, bound/static/class methods, etc.), but luckily no beginner has to learn about StopIteration when they first encounter for loops or about __getattr__ when they write their first class.
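(For the curious, the machinery hidden behind a simple for loop looks roughly like this; the point is precisely that nobody needs to see it when they start out:)

```python
lst = [1, 2, 3]

# What `for x in lst: print(x)` does under the hood, roughly:
it = iter(lst)            # calls lst.__iter__()
while True:
    try:
        x = next(it)      # calls it.__next__()
    except StopIteration: # raised when the iterator is exhausted
        break
    print(x)              # the loop body
```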
Right now, the fact that python -m build creates a directory with a .tar.gz file and a .whl file feels like “wires sticking around” to me. Of course, what sdists and wheels are exactly is clear to everyone in this thread, but a beginner to packaging is bound to wonder what’s going on for there to be two different files, both pip install-able.
From my point of view, the newcomer ideally doesn’t ever see the contents of the dist/ folder. To be fair, both PDM and Flit have a $tool publish command that builds the package and sends it to PyPI (or whatever index) at the same time. However, they do leave dist/ in the project tree. Hatch’s publish command doesn’t do the sdist/wheel generation step. And obviously, python -m build + twine, as shown in the Python packaging tutorial, gives you a dist/.
IMHO, it would be a worthwhile enhancement to give those tools a way to publish to PyPI without changing anything in the local tree.
Yeah, an “explanation”-type document on packaging.python.org could be a good fit. Of course, someone has to actually put it together…
It didn’t use to be the case, but nowadays isn’t it already essentially true that beginners don’t actually need to understand the difference, assuming they’re actually following the workflow recommended in the packaging guide or using any modern integrated tool?
Pretty much every backend (even Setuptools mostly by default now, or more completely with setuptools-scm) handles automatically including the “right” files in sdists and wheels, and I’m not sure if any backend has a non-advanced setting that makes a distinction between the two (even advanced settings are pretty sparse). So I’m not sure how a beginner would encounter the difference here unless they went looking.
When building with the baseline workflow, the default python -m build automatically builds both with no further user intervention. Likewise, twine upload dist/* just uploads whatever is in the dist/ directory (perhaps a small improvement would be assuming that directory by default), without the user having to care what’s actually in there unless they really want to know.
When using an integrated packaging tool, this is even more abstracted, typically involving just one or two commands to build the project and upload it. For a more advanced non-pure-Python project, which a beginner is unlikely to start with, they’ll need to set up cibuildwheel, but the default workflow they provide handles both the wheels and the sdists so even then, at least starting out users don’t really need to think too much about this.
Well, it’s gotta put them somewhere that will be easy for twine to find, no? Unless you combine the functionality of build and twine into a single tool, which is already what the integrated packaging tools do (and much more). And this isn’t something the user has to know; they’d have to go looking for the files and deliberately want to learn how things work under the hood.
With build and pdm build, you can specify an output directory of your choice that can be anywhere outside the working tree, inside the system temp directory, etc.; with twine at least you can then pass that path in (though I’m not sure about pdm publish). flit deliberately omits those options, as it’s designed to be as dead simple as possible and “just work”. If tools were to switch en masse to a different standard out-of-tree location such that users (eventually) wouldn’t have to care again, it would require a PEP, a ton of bikeshedding, consensus and implementation in the relevant tools, and at least in the short term it could non-trivially degrade UX if some tools recognize/use the new location but others don’t.
Yes. But how do they know that they don’t need to care? That’s the problem I’m talking about.
Speaking for myself, if a tool leaves generated files in my tree, I’m going to assume that these files are for me to care about somehow. If I see that I have to upload a .whl file, my next Web search is going to be “what is a whl file”.
Oh, I’m not suggesting that the behavior of $tool build should be changed. Only that $tool publish, which controls both the producer and the consumer of the build step, could generate the result in some temp dir rather than in the project tree. This is just like how pip install project-that-doesnt-provide-wheels builds a wheel for you but doesn’t show it to you.
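To make that concrete, here is a rough sketch of what such a publish step could do internally; this is not any existing tool’s implementation, just an illustration driving python -m build and twine under the hood:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def publish(project_dir: str = ".") -> None:
    """Illustrative only: build into a temp dir and upload, leaving
    no dist/ folder behind in the project tree."""
    with tempfile.TemporaryDirectory() as outdir:
        # Build the sdist and wheel outside the project tree.
        subprocess.run(
            [sys.executable, "-m", "build", "--outdir", outdir, project_dir],
            check=True,
        )
        # Upload whatever was produced; the files exist only for the
        # duration of the upload and the user never sees them.
        artifacts = [str(p) for p in Path(outdir).iterdir()]
        subprocess.run(
            [sys.executable, "-m", "twine", "upload", *artifacts],
            check=True,
        )
```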
Well, if you’re following the official PyPA packaging tutorial, it does explain it for you as soon as it is introduced. Maybe it would be helpful to have some examples of issues/posts where users were confused about that—perhaps you’re overestimating the curiosity of the typical user? And I guess you could say the exact same thing of the build directory that at least setuptools generates (not sure about other tools); I can’t say I’ve seen any new users poking around too much in there, asking questions and getting confused.
In that case, maybe open an issue with one or more of those projects and see what feedback you get? Or we could ping Frosty, Thomas, Ofek, etc. over here.
Hmm, interesting possibility. My first thought is to wonder if people used to Python packaging would be surprised that flit publish didn’t leave files in the dist/ folder. Although it’s not hard to run flit build as well if you want those files.
I know I would be surprised. It would also mean that when I tell twine or some other command (e.g. gh release upload) where the files are, I’d now have to copy that path from stdout rather than defaulting to dist/, which I can easily do in a release pipeline.
I think it would be surprising, but I don’t think that means it’s a bad thing? It’s surprising not because it’s wrong but because it’s different, and different could be better (or it could be worse!).
If you were to do it, I’d suggest maybe a --keep-dist-files flag, unless there’s a way to publish existing artifacts (unless flit produces a byte-for-byte identical copy every time?). The reason I add that caveat is that flit build is fine, but if you can’t upload those existing artifacts then it’s not really the same outcome.
To be clear, the proposal is that only flit publish, which does the upload for you, would change. flit build would still put files in dist/, so if you prefer to use twine to do the upload, nothing would change.
If you use flit publish to release to PyPI and then also e.g. upload the packages to GitHub, you would need some change; the easiest option would be to invoke flit build as well.
It quite possibly does make byte-for-byte identical files. We only specifically aim for this if SOURCE_DATE_EPOCH is set, but running it on the same source files on the same machine, it might well work even without that. Maybe I’ll look into strengthening this.
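A quick way to check that, as a sketch: it builds twice via python -m build with SOURCE_DATE_EPOCH pinned to an arbitrary fixed timestamp, and assumes the project builds cleanly from the current directory.

```python
import hashlib
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def sdist_digest(outdir: str) -> str:
    """Build an sdist into outdir and return its SHA-256 digest."""
    env = {**os.environ, "SOURCE_DATE_EPOCH": "315532800"}  # fixed timestamp
    subprocess.run(
        [sys.executable, "-m", "build", "--sdist", "--outdir", outdir, "."],
        check=True,
        env=env,
    )
    artifact = next(Path(outdir).glob("*.tar.gz"))
    return hashlib.sha256(artifact.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    # Two independent builds of the same source should match byte-for-byte
    # if the build is reproducible.
    print(sdist_digest(a) == sdist_digest(b))
```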
(Judging from this thread and Should sdists include docs and tests?, whether sdists should be treated as “source checkouts” and thus include docs and tests seems to be a fairly controversial topic, so I didn’t attempt to formulate a recommendation about that.)
Regarding the use of sdists for repackaging, it would be nice to have some community doc which agrees on how we expect repackaging to happen. I’ve been told that Arch now prefers repo sources over sdists for the AUR, as sdists frequently omit testing requirements.
I have some packages which include CI jobs for
build sdist
unpack sdist
run tests from sdist
Because it’s so easy to break this if you aren’t checking it.
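A minimal sketch of what such a check can look like as a plain script (the pytest invocation and the paths are assumptions about a typical layout, and the test dependencies are assumed to be installed already):

```python
import subprocess
import sys
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # 1. Build the sdist.
    subprocess.run(
        [sys.executable, "-m", "build", "--sdist", "--outdir", tmp, "."],
        check=True,
    )
    sdist = next(Path(tmp).glob("*.tar.gz"))

    # 2. Unpack it.
    with tarfile.open(sdist) as tf:
        tf.extractall(tmp)
    unpacked = next(p for p in Path(tmp).iterdir() if p.is_dir())

    # 3. Run the test suite from inside the unpacked sdist
    #    (pytest is just an example of the test command).
    subprocess.run([sys.executable, "-m", "pytest"], cwd=unpacked, check=True)
```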
Maybe that should be recommended? Or maybe it should not be, and distros should be advised to use the source and not to use sdists?
It would be great to have some cleaner cut expectations about this interaction.
I think the biggest problem is that there’s no consensus on what should be included in the first place. Everything needed to build the project is the minimum, obviously. Everything needed to run the tests (locally) is often requested. Everything needed to build the documentation is another common request. But going beyond that, there are things like CI configuration files (e.g., the .github directory), VCS configuration (e.g., .gitignore), and editor config (like .vscode).
Personally, I think that “everything in VCS” is a bad choice, because (a) it’s easy to get that by downloading from VCS (obviously!), and (b) it includes a bunch of stuff related to the maintainers’ workflow that probably won’t be of interest or value to consumers of the sdist (like CI, VCS, and editor configuration). It’s convenient as a default for backends because it’s easy to define and implement, but we should base our recommendations on what’s best for the user, not on what’s easiest for us to implement.
I don’t think we need a PEP here, but I do think we need to get consensus and document the appropriate advice in the packaging guide.
One problem with all this, of course, is that identifying “what is needed to run the tests” (or to build the docs, or whatever) is impossible without some standards on how test suites or doc builds should be laid out and how they should be executed. That’s something that frankly I don’t think is in scope for the packaging community, except in the trivial sense that everything is packaging if you’re trying to publish your project. So I suspect the best we can do is to make broad statements like “sdists should include everything needed to run your test suite locally”, combined with some examples of one or more common layouts. We’ll get people complaining “the documentation isn’t clear enough, there are too many options” again, but IMO we have to draw the line somewhere. Otherwise we’ll have to make an “official recommendation” between tox, nox, hatch, etc. for running your tests. And then we’ll never make any progress.