Best practices for including non-executable binaries in a library

TL;DR: What is the best way to include data binaries (specifically, .npz sparse matrices) in a Python library publishing workflow?

I have an unusual packaging situation for my scientific Python library, and I want to make sure I’m following best practices. My library is for scientific simulations and involves a lot of linear algebra. Performance is greatly enhanced by including partial solutions as sparse matrices. These are created with the pydata sparse library and saved as .npz files. The files are stored in a bin folder inside my library (nmrsim):

src
├── nmrsim
│   ├── __init__.py
│   ├── bin
│   │   ├── Lproduct1.npz
│   │   ├── Lproduct10.npz
│   │   ├── {etcetera}
│   │   ├── __init__.py
│   ├── module.py
│   ├── another_module.py
│   ├── {etc.}

These files add about 1.3 MB to the size of the installation.

My current solution feels very wrong: the bin folder and contents are committed to GitHub, and a GitHub Actions workflow is used to publish to PyPI. Adding binaries to version control is not a good practice, but it is working.

A better solution might be to have a script create this folder and its contents as part of the GitHub Actions workflow. However, the calculations are computationally intensive, and I don’t think a GitHub-hosted runner would have the CPU/memory to generate them. Even if it did, it would add a lot of time to the build.

Another possibility would be making their installation optional and having a script create these files on the user’s machine. An added benefit would be no “your .npz header looks funny” warnings if there’s a non-breaking difference between the user’s environment and the environment that created the .npz files. However:

  • that would require extra steps from a user, and a fair amount of time to run the computations.
  • this step would not be obvious if my library was installed as a dependency of someone else’s project.
  • I’ve already built boolean flags into the code (e.g. SPARSE, CACHE) that provide fall-back options if the acceleration files are missing, but I don’t know the best practice for letting users change such settings (e.g. set SPARSE = False for their installation); a sketch of what I mean follows this list.
  • Scripts that add files to a user’s system are fraught with peril. I do have tests to cover this, but it still feels a bit nerve-wracking.
  • Finally, I can imagine scenarios (e.g. PyScript/Pyodide WASM web apps) where a no-fuss, pure-Python installation would be highly desirable or even required.
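(For concreteness, here is a minimal sketch of what those flags could look like; the _config module name and the environment-variable override are purely illustrative, not what nmrsim currently ships.)

```python
# nmrsim/_config.py  (illustrative module name, not the actual nmrsim layout)
"""Feature flags with environment-variable overrides.

SPARSE: use the precomputed sparse matrices if the .npz files are present.
CACHE:  keep computed operators in memory between calls.
"""
import os


def _env_flag(name, default):
    # e.g. NMRSIM_SPARSE=0 turns the sparse acceleration off for this installation
    value = os.environ.get(name)
    if value is None:
        return default
    return value.lower() not in ("0", "false", "no")


SPARSE = _env_flag("NMRSIM_SPARSE", True)
CACHE = _env_flag("NMRSIM_CACHE", True)
```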

So: what would be the accepted, “industry standard” way to create and include these files in my library?

If you’ve read this far, thanks for reading my first post. I hope it’s an appropriate question for the forum.

bin is traditionally for executable files.

Your .npz files are data, and I would expect them to be in an appropriately named data folder.

Agreed. Make them package data and put them alongside the .py file that reads them.

First, just to make sure I understand “alongside” correctly: simply renaming the bin folder to e.g. data, with functions in nmrsim.qm accessing this data?

src
├── nmrsim
│   ├── __init__.py
│   ├── data
│   │   ├── Lproduct1.npz
│   │   ├── Lproduct10.npz
│   │   ├── {etcetera}
│   │   ├── __init__.py
│   ├── module.py
│   ├── another_module.py
│   ├── qm.py
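For example, a minimal sketch of how qm.py might read them as package data (load_partial_solution is a made-up helper name; importlib.resources.files/as_file need Python 3.9+ or the importlib_resources backport):

```python
# Sketch of reading the .npz files as package data from nmrsim/data/
from importlib import resources

import sparse  # pydata/sparse


def load_partial_solution(nspins):
    """Load a precomputed partial-solution matrix shipped inside the package."""
    ref = resources.files("nmrsim") / "data" / f"Lproduct{nspins}.npz"
    with resources.as_file(ref) as path:  # yields a real filesystem path, even from a zip
        return sparse.load_npz(str(path))
```

The files would also need to be declared as package data (include_package_data plus MANIFEST.in, or [tool.setuptools.package-data]) so that setuptools copies them into the wheel.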

The remaining question is where the binary data would be stored or generated, and then accessed by CI such as GitHub Actions. If it really is OK to include this data folder (33 files, 1.3 MB total) in the GitHub repository, then problem solved, but this feels like “GitHub: You’re Doing It Wrong”.

Are these .npz files exactly the same for all environments, all Python interpreter versions, all operating systems, and so on?

To be pedantic, you probably mean “git: You’re Doing It Wrong”. It does not seem to me like there is anything specific to GitHub (or GitLab or Bitbucket).

1.3 MB isn’t much, so yes this seems perfectly fine. It’s how we do it for SciPy, also for some .npz files. If your data files get significantly larger, you may want to add an API to download them dynamically only when needed - scikit-learn, scikit-image, and scipy.datasets all have such APIs. Scikit-learn has a custom downloader IIRC, and scikit-image and SciPy both use Pooch.
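For reference, a rough sketch of the Pooch pattern those projects use (the base_url, the sha256 value, and the helper name here are all placeholders):

```python
import pooch
import sparse

# Placeholder base_url and registry: real values would point at a GitHub release,
# an S3 bucket, etc., with the actual SHA256 hashes of the files.
FETCHER = pooch.create(
    path=pooch.os_cache("nmrsim"),
    base_url="https://example.com/nmrsim-data/",
    registry={
        "Lproduct1.npz": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
    },
)


def load_partial_solution(nspins):
    # Downloads on first use, then reuses the locally cached copy.
    fname = FETCHER.fetch(f"Lproduct{nspins}.npz")
    return sparse.load_npz(fname)
```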

Almost. I haven’t pinned down the source yet, but in some situations a warning printed to stdout would say that the .npz header looked off and to check that it isn’t corrupted, but it still works. It works across macOS/Linux/Windows, Python 3.6 through 3.10, and different CPUs, but my suspicion is that different NumPy versions might have slightly different conventions.

It’s GitHub-specific in the sense that I am using GitHub Actions for CI, including publishing to PyPI, and GitHub has to have access to the binary data. If I were doing this manually, I could just have a dataset on my laptop, not added to git, and build from there.

If they are not exactly the same for all environments, then I would not add those files to the source code repository, nor the source distributions (sdist). And I would try to get them generated and added to platform-specific wheels by the CI pipeline (GitHub Actions, GitLab CI, Bitbucket pipelines).

If they are always the same for all environments, then I would try to get the CI pipeline to generate them and add them to the sdist and the wheel(s). If those .npz files are somewhat text-based and diff-able, then I would maybe consider adding them to the source code repository (via some kind of pre-commit hook or something to the same effect) so that the generation in the CI pipeline can be skipped, but that would not necessarily be my preference.
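One way to wire the generation into the build itself, so the CI pipeline only has to run a normal wheel build, would be a custom build_py command; tools.generate_matrices here is a hypothetical stand-in for whatever script actually produces the files:

```python
# setup.py (sketch)
from setuptools import setup
from setuptools.command.build_py import build_py


class BuildWithData(build_py):
    """Generate the .npz files before the normal build copies package data."""

    def run(self):
        # Hypothetical generation script; imported lazily so plain metadata
        # operations don't trigger the expensive computation.
        from tools.generate_matrices import generate_all_matrices
        generate_all_matrices("src/nmrsim/data")
        super().run()


setup(cmdclass={"build_py": BuildWithData})
```

Whether the resulting wheels should then be platform-specific or a single pure-Python wheel depends on whether the generated files really differ across platforms.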

But if those files change depending on the NumPy version, then it seems to me like there is no other choice but to have them be generated on the user’s machine, is there?

This is general, personal advice. And as I now understand from the other participants, many other projects already deal with such .npz files and have probably established best practices for this, so you should probably follow those.

I had started working along those lines. Then, cibuildwheel errored with “Build failed because a pure python wheel was generated” and I had an epistemological crisis (“What is pure, and what isn’t?”) :rofl:

I will try to give this another go, since on paper it seems like the best approach. However, computing an 11 x 11 x 211 x 211 matrix taxed my MacBook Pro and my gaming PC, and they refused to do n = 12 instead of 11. So I’m anticipating that a CI cloud runner will fare worse. I’ll study further what other scientific libraries do, but I imagine the big ones have legions of contributors who can custom-build things on different architectures.

If you’re happy with using your users’ resources, you could compute them on first use (and save them to disk).

Otherwise, I would host these files in some cloud object storage (AWS S3, Azure Storage, etc.) and have them downloaded on first use (and saved to disk).

The downside is that these wouldn’t be cleaned up when you uninstall the package, so you would have to document files left behind.
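A sketch of that first-use pattern with a predictable per-user cache location (platformdirs would be an extra dependency, and compute_matrix is a hypothetical stand-in that could just as well be a download call):

```python
import pathlib

import platformdirs
import sparse

CACHE_DIR = pathlib.Path(platformdirs.user_cache_dir("nmrsim"))


def get_partial_solution(nspins):
    """Return the cached matrix, computing and saving it on first use."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = CACHE_DIR / f"Lproduct{nspins}.npz"
    if not target.exists():
        matrix = compute_matrix(nspins)  # hypothetical expensive computation (or a download)
        sparse.save_npz(str(target), matrix)
    return sparse.load_npz(str(target))
```

At least the cache path is predictable, so the documentation can tell users exactly what to delete after uninstalling.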

Yes, that is the issue that always gets mentioned when it comes to putting files in user directories. When it’s just some configuration files, it’s okay, but a cache… it can take quite a bit of disk space.

I actually have this coded already as a “fall back and punt” option, for when the user has the requirements installed but the library can’t find the files. So that’s definitely an option, if the user is OK with a slow first use. For education, though, I’m looking at Binder-launched Jupyter notebooks and WASM web apps (via Pyodide/PyScript) as tutorials. If the student has to watch a spinning wheel for 10 minutes, they’re likely to move on to something else.

How large would the files be if saved with numpy.savetxt? Would it make sense to store those text files in Git? Would converting them from that stored text format to .npz in CI be faster than regenerating them?

According to the documentation, the .npz format is portable between machine architectures and independent of the NumPy version.
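Since these are pydata sparse matrices rather than dense arrays, numpy.savetxt wouldn’t apply directly; assuming COO format, a text round-trip would mean saving the coordinates and values separately, something like this sketch (helper names are illustrative):

```python
import numpy as np
import sparse


def save_text(matrix, prefix):
    # COO coordinates are an (ndim, nnz) integer array; the values are a 1-D array.
    np.savetxt(f"{prefix}_coords.txt", matrix.coords, fmt="%d")
    np.savetxt(f"{prefix}_data.txt", matrix.data)


def load_text(prefix, shape):
    coords = np.loadtxt(f"{prefix}_coords.txt", dtype=np.intp)
    data = np.loadtxt(f"{prefix}_data.txt")
    return sparse.COO(coords, data, shape=shape)
```

The text files would be larger than the compressed .npz, but they diff in git, and converting them back in CI should be far cheaper than regenerating the matrices.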

There are also scenarios that expect to install everything and then either disconnect the internet or make all the packages read-only, so if you’re going to do anything on first use, be very clear about it and provide a way to trigger it directly.

Still, I’d prefer essential data to be in the wheel such that a wheel cache can then be installed without network access and everything works. If that means your wheel is going to exceed PyPI’s limits, then you’ve probably got to just document that it’s downloaded or generated later.

As for git, enable git-lfs (large file storage) and don’t worry about large files :slight_smile:

I’d have to unravel a lot of what happens behind the scenes, because it is the pydata sparse library saving the files, which are sparse matrices. Their documentation says that their save_npz files are not compatible with SciPy’s save_npz, even though both use the .npz format. They also say their binary format is not stable, which may also be the source of those warnings I mentioned earlier. I could test it and see if things break.

I’d hate to see the git diff if the 700 KB data file ever changed, though :smiley:

Back in the day, before wheels, with plain old setup.py, I would have suggested that you build the files on installation rather than on first use – then you get them built on your user’s machine, with the right versions of everything. Now, I’m not so sure.

Is this a pure-Python package? That is, no compiled extensions? If so, I think you could still distribute it as a source dist only and have the files built on install.

(though I’ve lost track – pip now builds a wheel, and then installs that, so maybe not?)

Otherwise, you could tell your users to run a “setup” script by hand after installing – as a user that has permission to write into the directory the package was installed into.
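A rough sketch of what such a hand-run script could look like, writing into the installed package’s data directory (generate_matrix and the range of sizes are hypothetical, and this assumes a regular filesystem install that the user can write to):

```python
# build_nmrsim_data.py -- run by hand after installing nmrsim (sketch)
from importlib.resources import files

import sparse


def main():
    data_dir = files("nmrsim") / "data"  # resolves to the installed package directory
    for n in range(1, 12):  # illustrative range of matrix sizes
        matrix = generate_matrix(n)  # hypothetical expensive computation
        sparse.save_npz(str(data_dir / f"Lproduct{n}.npz"), matrix)


if __name__ == "__main__":
    main()
```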

BTW: not putting binary files in git is a recommendation, not a hard requirement. As a rule, if they are not too huge and don’t change often, it’s fine. Though maybe you are on the edge of “too huge” here – do they change often?

No compiled extensions, for now at least.

I’ve deliberately avoided changing them. They were uploaded to GitHub once in 2019 and they still work.