I am working on a set of related command-line programs, and I don’t understand how best to structure them into Python packages and projects. These commands do individual related steps of a workflow. They share data structures and functionality.
The obvious way to structure their source code is as one project in one repository. I can register a console_scripts entry for each program. Shared modules are in one src/ tree. Each program can easily import any of them. However, then I have only one pyproject.toml. I can have only one project name, one issue date, one version number, which all the programs must share.
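For illustration, with one pyproject.toml the shared metadata and the per-program entry points would look roughly like this (the names and version here are made up):

[project]
name = "myprograms"        # the one project name every command shares
version = "1.2.0"          # the one version number every command shares

[project.scripts]
program1 = "myprograms.program1.cli:main"   # one console_scripts entry per command
program2 = "myprograms.program2.cli:main"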
If I want to allow separate project names, issue dates, and version numbers, it seems I would need to make them multiple projects. Each would have its own repository, its own pyproject.toml, its own src/ tree. I would have to structure the shared modules as an importable package in a project of its own, or maybe as a git submodule which each program’s repository includes. That seems like more complexity and repeating myself.
Are there other models which I am missing? Something which lets each program have its own version number, but which lets them share code?
I am hoping someone can point me to something clearly better which I am failing to imagine. Thank you!
There’s no requirement for the root of a package folder to also be the root of the Git repository, so my thoughts here are to have all programs in one repository containing multiple packages. The repository root will contain a directory for each package, with each of those folders containing all of the project-specific stuff typically found at the root of a repo (such as pyproject.toml). Your CI script or other build process can cd into each folder and build the package, then cd into the next folder and so on.
Shared code can be its own package (and associated directory) on which the programs using it can declare a dependency. The PyPI description should just say that it is an internal package not meant to be installed alone.
So your repository can look something like this:
myprograms
|
+-- common/ <-- shared files go in this project
| |
| +-- src/
| +-- pyproject.toml
|
+-- program1/
| |
| +-- src/
| +-- pyproject.toml
|
+-- program2/
| |
| +-- src/
| +-- pyproject.toml
|
+-- .gitignore
+-- .gitlab-ci.yml (or the CI file for your choice of CI service)
+-- LICENSE.txt
+-- README.md
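For example, program1’s pyproject.toml could declare its dependency on the shared package roughly like this (this assumes the common/ project is published under the name myprograms-common; all names and versions below are placeholders):

[project]
name = "myprograms-program1"
version = "0.3.0"
dependencies = [
    "myprograms-common>=1.0",   # the shared package built from the common/ directory
]

[project.scripts]
program1 = "program1.cli:main"   # console entry point for this program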
If you’re going to version your programs separately, I suggest they go into different repositories, otherwise making releases gets a bit complicated when you look at the Git history. I do have a little bias against monorepos, however.
Like Laurie, I think the clean approach is to have separate repositories for the tools that are released separately, plus one more repository for the shared code.
To minimize the repeated work when creating the individual repositories, you can use templates. One popular tool for this is Cookiecutter.
By Laurie O via Discussions on Python.org at 17Jul2022 01:43:
If you’re going to version your programs separately, I suggest they go into different repositories, otherwise making releases gets a bit complicated when you look at the Git history. I do have a little bias against monorepos, however.
By contrast, I love the monorepo! At least for interrelated things.
Nothing stops you having multiple projects and releases inside a monorepo. That’s how my personal projects are handled.
Do you really want distinct releases per command? If each is complex, that makes sense. If they are smallish and interrelated, maybe you just want to release the lot as one project with one release number. I’m thinking about command A being dependent on the “current” revision of command “B”: if you always release as a single thing that’s always in sync, because A and B come out together; if you release them individually you may want to include versioned dependencies, i.e. command “B” requires at least revision A3 of command “A”.
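In pyproject.toml terms that kind of constraint is just a versioned dependency; a minimal sketch, with hypothetical package names and version numbers:

[project]
name = "command-b"
version = "1.4.0"
dependencies = [
    "command-a>=3.0",   # command “B” needs at least this release of command “A”
]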
Obviously this is up to you, but there’s a complexity tradeoff.
It is helpful to have it pointed out that there can be multiple src/ trees and pyproject.toml files in a single repository. I don’t think I recall seeing that possibility discussed in the Python Packaging User Guide.
The debate between monorepos and multirepos is interesting. I see the advantages to both choices.
It is early days on my project, and I expect it will always be a small project, so I am starting it out as a monorepo with a single pyproject.toml file. However, I now have some conceptual models for directions in which it can grow.
I’m not a big fan of monorepos and don’t really use them (much), but in general the biggest challenge of managing lots of different repos is not creating their infra but maintaining them: updating common portions of the readme, the contributing guide, other meta files, gitignore, gitattributes, GitHub Actions workflows, other CI config, test config, linter config, tooling config, packaging config, etc., etc.
I’ve experimented with several tools for handling this, but they’ve all had caveats for the use cases I run into, and they also require a lot of work to create the cookiecutters, as seemingly all of the ones out there are out of date or have some issue or another. I’ve even had plans to create my own tool wrapping pre-commit and cookiecutter, but that seems an impossible dream given the time involved…
I would be curious to see if anyone can share a link to a working monorepo setup, one that uses setuptools and has survived the test of time.
I am less worried about managing a changelog for each package and more about not ending up wasting precious time debugging bugs in all the surrounding tooling, which might get confused by such a layout.
Still, having a single linter configuration could be seen as a nice feature when using a monorepo.
I have no idea how much has been overridden these days; I suspect a lot. And there are dedicated people whose job is to keep it all running, but that’s inevitable at this scale.
The subrepos are in the sdk directory. I don’t believe they’ve switched to pyproject.toml throughout the source tree yet, but they may be generating them for releases (and if not, that’s likely where they’ll start, and I’d better go poke them to get on it before the legacy behaviour in pip stops working).
Monorepos are great as long as you can cut a single release per repo. Commit history isn’t the only thing that is made more complex by trying to version multiple things independently out of a single repo. You will waste a lot of time in a lot of places: configuring your CI builds to ignore changes in certain paths, dealing with having some builds related to toolA and some related to toolB… and god help you if you ever want to do something like put a badge on the CI job that shows which version was deployed. It’s doable, sure. Maybe none of it is even that difficult in isolation and nothing ever goes wrong. But at the end of the day the aggregate will not be trivial, and that’s time that could be better spent on something more useful, like staring blankly into space.
But anyway, there is likely no good reason NOT to simply cut a new release whenever either changes, and always keep the version numbers in sync. It sidesteps most problems at the cost of maybe having users download a single python dependency unnecessarily because it’s a release with no changes. I’ve done both plenty of times, and can say that I have encountered no memorable problems with “mono repo, everything released every build w/ the same version number”, and many with “mono repo which is essentially multiple repos in a single source tree” (and I’ve watched other teams struggle with it repeatedly because everyone needs to learn their own lessons I guess).
Here’s a test: are you directly importing anything from one module to another, expecting to make changes to both which must line up for the project to work? Are consumers generally or always going to require both of them, e.g., one is a library used by the ‘more public’ module? Then mono repo good, but you should be releasing them as a unit anyway or you’re asking for trouble.
If you expect to build, release, and version them independently, and toolA needs to support multiple versions of toolB w/ backward compatibility guarantees, or consumers of the project could be expected to use one but not the other, those are clearly two separate projects and should not be in a repo together.
Note: Something to keep in mind is my work tends to be on internal tooling where build times for each component are roughly the same, I can parallelize them, compute is essentially free & infinite for these purposes, and bandwidth for my users is also basically free. I wouldn’t recommend this for multiple compiled components, one of which takes 3 hours to build and is 3gb and the other takes 3s and is 3mb, and I’m shipping the artifacts to Abu Dhabi every time. Use judgement. Things don’t magically get less complex as time goes on, so you definitely do not want to start off complex.