Package Pruning

Is it possible to prune a package to create a venv/lib folder (or a custom /utils/init) that contains only the functions needed (and their direct dependencies)?

For example, I am using xgboost in an application that only requires the predict() function, but the entire package pulled in by the build is 196 MB. The test() and fit() functions of xgboost needed for model training will never be called in this app, so they are not needed in this instance of the runtime. The same could apply to numpy or pandas: only deploy the functions and their dependencies that are actually called.

Because the deployment bundle is too large for the serverless hosting platform I am using, shrinking the bundle size to only what is being called would resolve this problem.

I understand I would have to create my own package or /util, but that would be fine.

Alternatively, I could create a container and deploy that. It may be my only option, but it means more coding and IaC setup work.

This is possible, but I don’t know of a tool to do it automatically. It’s going to be brittle, in the sense that it’s liable to change (or break) from version to version. I’m also not sure how much you will end up saving this way–the actual code files are probably pretty small, and depending on how they are written they might import almost everything (e.g. if they make a lot of stuff available at the top level of the package).

There are also a lot of blog posts out there on how to shrink a docker image that contains python by removing unnecessary files. Sometimes docs and tests directories are bundled in the installation even though they often aren’t necessary at runtime[1]. Similarly, compiled extensions might include detritus from the build process. But again this will depend on the packages–I think numpy and pandas both have a fair amount of bloat that can be removed this way, but I don’t know about xgboost.


  1. In particular, if there’s any test data in there, it can be quite large.
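
To get a feel for how much could be saved by stripping tests and docs, here is a minimal sketch that walks a directory tree and reports the size of directories that are often unnecessary at runtime. The set of directory names and the helper names (`find_prunable`, `dir_size`) are my own assumptions for illustration; some packages do import from their test folders, so check before deleting anything.

```python
import os
from pathlib import Path

# Directory names that are often not needed at runtime. This list is an
# assumption; verify it per package before removing anything.
PRUNABLE_DIR_NAMES = {"tests", "test", "docs", "doc", "__pycache__"}


def dir_size(path: Path) -> int:
    """Total size in bytes of all regular files under path."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def find_prunable(root: Path) -> list[tuple[Path, int]]:
    """Return (directory, size_in_bytes) pairs under root, largest first."""
    found = []
    for current, dirnames, _ in os.walk(root):
        for name in list(dirnames):
            if name in PRUNABLE_DIR_NAMES:
                target = Path(current) / name
                found.append((target, dir_size(target)))
                # Do not descend into a directory that is already counted.
                dirnames.remove(name)
    return sorted(found, key=lambda pair: pair[1], reverse=True)
```

Pointing it at the environment's site-packages directory, for example `find_prunable(Path(sysconfig.get_paths()["purelib"]))`, lists the biggest candidates first. It only reports; it does not delete.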

PyInstaller does this when building an app - the process is described in their documentation. You could probably read their source code to see how they do it, and use that to build your own tool.

But as far as I know there’s currently no tool that does what you’re suggesting - probably because it would be quite fragile (dynamic imports and runtime sys.path manipulation would cause problems for it, for example - see the PyInstaller documentation for more discussion of the issues), or because few people have the need for it.
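
As a rough illustration of this kind of static import analysis, the standard library ships `modulefinder`, which scans a script's import statements without running it. This is not how PyInstaller itself is implemented (it has its own analysis code); it is only a sketch of the general idea, and `imported_modules` is a hypothetical helper name.

```python
from modulefinder import ModuleFinder


def imported_modules(script_path: str) -> set[str]:
    """Names of modules reachable through static import statements.

    The script is analysed, not executed, so imports done dynamically
    (e.g. via __import__ or importlib.import_module) are not detected.
    """
    finder = ModuleFinder()
    finder.run_script(script_path)
    return set(finder.modules)
```

The fragility shows up immediately: a script containing `import json` reports `json` and everything it pulls in, while a script that does `mod = __import__("json")` reports nothing beyond `__main__`, even though both load the same module at runtime.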

xgboost is probably a bad case for this: the Windows and manylinux wheels for x86_64 have one giant binary .so/.dll file making up most of that 120+ MB wheel size. I don’t know the module, so I don’t have any view into why it is different for other platforms, but I imagine any change would need to be made in the build process that produces that binary, as opposed to pruning afterwards.
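
One way to check whether a package is dominated by compiled binaries rather than Python source is to total the installed size per file extension. A minimal sketch, where `size_by_suffix` is a hypothetical helper name and the package directory is whatever path the package is installed at:

```python
from collections import Counter
from pathlib import Path


def size_by_suffix(package_dir: Path) -> list[tuple[str, int]]:
    """Total bytes per file extension under package_dir, largest first."""
    totals = Counter()
    for f in package_dir.rglob("*"):
        if f.is_file():
            # Files without an extension are grouped under "(none)".
            totals[f.suffix or "(none)"] += f.stat().st_size
    return totals.most_common()
```

If `.so` or `.dll` accounts for nearly all of the total, pruning individual Python functions or modules will not move the needle, because the binary cannot be trimmed after the fact.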

I’m just (educated) guessing, but I suspect xgboost is so large because the binaries contain optimized code for a wide variety of platforms, so they can have one binary that runs (well) everywhere.

If so, then building it yourself could yield a much smaller binary.

I have not looked into how hard that would be to do :slight_smile:

In many cases trying to target specific functions won’t save much. In my experience many libraries have an “all roads lead to Rome” organization in which there is a fairly broad base of functions and classes that tons of other things ultimately depend on. Unless the library was specifically written with separability in mind, trying to separate the call chains of individual functions wouldn’t have a huge payoff even if you were able to do it. I don’t know how true this is of xgboost specifically though.
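
A quick way to see the “all roads lead to Rome” effect for a given library is to count how many modules a single import drags in. The sketch below runs the import in a fresh interpreter so that modules already loaded in the current process do not skew the result; `modules_loaded_by` is a hypothetical helper name.

```python
import subprocess
import sys

# Code run in a child interpreter: record sys.modules before and after
# importing the module named in argv[1], then print the newly loaded names.
_PROBE = (
    "import sys, importlib\n"
    "before = set(sys.modules)\n"
    "importlib.import_module(sys.argv[1])\n"
    "print('\\n'.join(sorted(set(sys.modules) - before)))\n"
)


def modules_loaded_by(module_name: str) -> set[str]:
    """Names of modules newly loaded when module_name is imported."""
    result = subprocess.run(
        [sys.executable, "-c", _PROBE, module_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return set(result.stdout.split())
```

If `len(modules_loaded_by("some_package"))` is already a large fraction of the package's modules, then separating out the call chain of one function is unlikely to pay off.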

Agreed. Brittleness due to package owner changes/upgrades would be problematic.

I’ll investigate. Thanks for the package reference!

For others who want to look into this: the academic term for this is dead code elimination, or DCE. JavaScript also calls it “tree shaking.” There’s a lot of prior art out there for dynamic languages like Python, although per @pf_moore’s point Python is uniquely dynamic at the import machinery layer :slightly_smiling_face:
