How to include pickled files in installation

In order to speed up my package I use pickle for some core functionalities. These pickle files should be installed along with the package. What is the best way to achieve this?

The simplest way is:

  1. Install dependencies (e.g. via a requirements.txt)
  2. Run the script which pickles the files
  3. Run the installation script and specify the pickled files as data dependencies

This works well for wheels but is not ideal for users who want to install the code from source.

An alternative is:

  1. Install the package without the pickle files
  2. Generate the pickle files the first time they are needed

This causes problems for users who need to install the package system wide as sudo privileges are then needed to run the script.
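A minimal sketch of the second option, assuming the package falls back to a per-user cache directory when its own directory is not writable. `build_headers`, `load_or_create` and the file name `headers.pkl` are hypothetical names used only for illustration:

```python
import pickle
from pathlib import Path


def build_headers():
    """Hypothetical stand-in for the expensive computation being cached."""
    return {"answer": 42}


def load_or_create(package_dir, user_cache_dir, name="headers.pkl"):
    """Return the cached object, generating the pickle on first use."""
    candidates = [Path(package_dir) / name, Path(user_cache_dir) / name]

    # Prefer a pickle that already exists, wherever it lives.
    for path in candidates:
        if path.is_file():
            with path.open("rb") as f:
                return pickle.load(f)

    data = build_headers()

    # Try each location in turn. For a system-wide install the first write
    # usually fails with PermissionError, so the user cache is used instead.
    for path in candidates:
        try:
            path.parent.mkdir(parents=True, exist_ok=True)
            with path.open("wb") as f:
                pickle.dump(data, f, protocol=4)
            break
        except OSError:
            continue

    return data
```

With a fallback like this, no sudo privileges are needed at first use. Since the pickles depend on the package version, including the version in the file name or directory would be one way to avoid loading a stale cache.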

Is there a standard way to handle this use case? I imagine it could be done with a post-install script but I am struggling to find how to modify the various installation files to call such a script using the new packaging paradigms.

Are the pickled files dependent on the environment (Python interpreter implementation, Python interpreter version, operating system, CPU bitness)? Or are they always the same for all environments?

I believe they are the same for all environments. There may be a dependence on the Python version, as the implementation of pickle sometimes changes between versions.
There is also a dependence on the version of my package, which is why we do not save the pickled files in the repository.

If the answer would be different with an environmental dependency, I would also be interested to hear it, as such a dependency may be added in the future.

I do not recall ever having used pickle, so I do not know much about it. But from some quick research, pickle seems to have protocol versions. As long as all the Python interpreter versions your library supports also support the same pickle protocol version, it should be fine to generate the pickle files yourself (or have your build system or CI/CD pipeline generate them) and include them in the sdists and wheel files.

So then you would recommend the first option? Where users wanting to install from source have to run 3 independent steps?
Or simply that we change our workflow to include these files in our repository. You have no suggestions for how to make this part of the installation script?

If you want to have them autogenerated by packaging tools, you can make them generated by your build backend when building a wheel. For example, see Reference - Hatch for hatchling.
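As a rough sketch of what such a build hook might look like with hatchling's custom build hook interface (not necessarily the solution the original poster ended up with): the file below assumes a flat layout with `my_package/` at the project root, and assumes that `pickle_creation_function` (the name used in the `setup.py` elsewhere in this thread) writes `*.pkl` files inside the package directory.

```python
# hatch_build.py (the default file name hatchling looks for with the "custom" hook)
#
# Enabled in pyproject.toml with:
#   [build-system]
#   requires = ["hatchling"]
#   build-backend = "hatchling.build"
#
#   [tool.hatch.build.targets.wheel.hooks.custom]
import sys

from hatchling.builders.hooks.plugin.interface import BuildHookInterface


class CustomBuildHook(BuildHookInterface):
    def initialize(self, version, build_data):
        # Make the project root importable so the package can be imported
        # before it is installed (assumption: <root>/my_package exists).
        sys.path.insert(0, self.root)
        from my_package import pickle_creation_function

        pickle_creation_function()

        # Assumption: the pickles land in my_package/ as *.pkl files.
        # "artifacts" lists patterns to include even if ignored by VCS.
        build_data["artifacts"].append("my_package/*.pkl")
```

Because the hook runs whenever a wheel is built, a single `pip install .` from source should generate the pickles as well. If importing the package requires third-party libraries, they would need to be declared as dependencies of the hook in `pyproject.toml`.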

Pickle files are well supported by backward compatibility guarantees. A pickle made on one version of Python should be able to be read on another version. The latest pickle format version was introduced in Python 3.8, so any version since then should be equivalently compatible. (Even if a future Python version introduces a new format, you will still be able to select protocol version 5, or even version 4 for compatibility all the way back to Python 3.4.)
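A small illustration of pinning the protocol explicitly, so that the files do not depend on the default protocol of whichever interpreter generated them. The dictionary contents are made up for the example:

```python
import pickle

data = {"headers": ["a.h", "b.h"], "count": 2}

# Protocol 4 is readable by Python 3.4 and later; protocol 5 needs 3.8+.
blob = pickle.dumps(data, protocol=4)

# Pickles using protocol 2 or newer start with the PROTO opcode (0x80)
# followed by the protocol number, so the choice can be checked directly.
assert blob[0] == 0x80 and blob[1] == 4

restored = pickle.loads(blob)
assert restored == data
```

The same `protocol` argument is accepted by `pickle.dump` when writing straight to a file.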

But it sounds to me like the pickles are cache files of some sort. That seems like a good reason to create them on the target system, simply to avoid desynchronization. But if you’re going to package and ship them, remember to treat them with the exact same security care that you treat your source code - a tampered-with pickle can execute arbitrary code, and is thus equivalent to a tampered-with source file.
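One possible precaution, sketched here under the assumption that a SHA-256 digest of each shipped pickle is recorded somewhere you already trust (for example in the source code itself), is to compare digests before unpickling. `sha256_hex` and `load_verified` are hypothetical helper names:

```python
import hashlib
import hmac
import pickle


def sha256_hex(blob):
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(blob).hexdigest()


def load_verified(blob, expected_digest):
    """Unpickle blob only if it matches the digest recorded at build time."""
    if not hmac.compare_digest(sha256_hex(blob), expected_digest):
        raise ValueError("pickle does not match the recorded digest")
    return pickle.loads(blob)
```

This only helps if the recorded digest is at least as hard to tamper with as the pickle. It also catches accidental desynchronization between the pickle and the package version that produced it.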

There is no installation script to speak of… there are maybe build scripts, though…

What does your packaging workflow look like? What is the build back-end? Is it setuptools or something else?

This is key to unraveling the issue. What exactly do you mean by “the installation script”? For example, are you using Setuptools as a backend and providing a setup.py?

Sorry if my terminology is incorrect. By “installation script” I mean the script that is called by pip to install the package on a system. I would not usually call this a build script, as in our case nothing is compiled, but maybe the creation of the pickle files would be considered a build stage, or maybe “build” is simply the standard Python terminology?

We currently have a few files that are related to build/installation:

  • MANIFEST.in
  • pyproject.toml
  • setup.cfg
  • setup.py

As far as users are concerned, the 2 use cases with pip are:

  1. Install from PyPI
  2. Install from source

Ideally users should only need to run 1 pip call. We can run multiple commands to create the PyPI release, but I suspect that if there are lots of steps to do this then there will also be lots of steps to install from source.

Currently we are using setuptools<61. The version restriction is due to the workaround that we are currently using to create the pickle files during installation. Our setup.py file looks like this:

 import setuptools 
 from setuptools.command.egg_info import egg_info 
  
 class PickleHeaders(egg_info): 
     """ Class to pickle headers in time for them to be collected 
     by the MANIFEST.in treatment while building a wheel 
     """ 
     def run(self): 
         # Process files to create pickle files. 
         from my_package import pickle_creation_function 
  
         pickle_creation_function() 
  
         # Execute the classic egg_info command 
         super().run() 
  
 if __name__ == "__main__": 
     setuptools.setup(cmdclass={ 
         "egg_info": PickleHeaders, 
         })

However this hack does not work with Python 3.12 as the old version of setuptools doesn’t work properly.

Thanks! Hatch looks really promising. I will need to take some time to look into it, but it could be exactly what I need.

I can confirm that, using Hatch, I was able to reach a cleaner solution which works well with all currently supported versions of Python.

@ebourne For those who might find themselves in a situation similar to yours and who might stumble into this thread in the future, do you mind sharing your solution?