Easy and recommended way to get path of datafile within package?

jagerber · October 31, 2022, 3:41pm

Suppose I have a package structure like

.
├── package
│   ├── __init__.py
│   ├── data
│   │   └── data.csv
│   └── mymodule.py
├── README.rst
├── MANIFEST.in
└── setup.py

How can/should I get the file path of data.csv from within mymodule.py as a Path object so that I can then open the file however I like? Note, I’m specifically not asking how to get the data from the file but just the path. In this example the data is a csv so its data can be accessed using stdlib functions, but if the data were instead, for example, a .hdf5 file I would need the path and then I’d need to access the file using an h5 python package.

This thread has some information on the topic. There seem to be many approaches, and there doesn’t seem to be good agreement on which is recommended or has promising future prospects.

The main candidates seem to be importlib.resources or pkgutil.

In importlib.resources, if I add an empty __init__.py to package\data\, then I can get the path via:

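import importlib.resources
from pathlib import Path

# files() returns the package directory (a Path when installed on a file system)
data_dir = importlib.resources.files('package.data')
data_path = Path(data_dir, 'data.csv')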

This seems ok but it’s a little annoying to need to include __init__.py in package\data\.
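For what it's worth, a minimal sketch of one way around the extra __init__.py, assuming Python 3.9 or later: anchor importlib.resources.files() on the top-level package and join the subdirectory as ordinary path segments, then use importlib.resources.as_file() to obtain a real Path. The helper name `resource_path` is hypothetical, not part of the standard library.

```python
from contextlib import contextmanager
from importlib.resources import as_file, files
from pathlib import Path
from typing import Iterator


@contextmanager
def resource_path(package: str, *parts: str) -> Iterator[Path]:
    """Yield a real filesystem Path for a resource inside `package`.

    `resource_path` is a hypothetical helper name, not a stdlib function.
    """
    # files() anchors on an importable package; plain subdirectories below it
    # are reached by joining path segments, so they need no __init__.py.
    resource = files(package)
    for part in parts:
        resource = resource / part
    # as_file() yields the existing path when the package is on disk, and
    # extracts to a temporary file when the package lives inside a zip archive.
    with as_file(resource) as path:
        yield path
```

Inside mymodule.py this could be used as `with resource_path('package', 'data', 'data.csv') as p: ...`, passing `p` to whatever library needs a path. The path is only guaranteed to exist inside the `with` block.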

I can’t quite figure out the right way to do this with pkgutil. The best I have is

import pkgutil
from pathlib import Path

# resolve_name() imports 'package.data'; __file__ is its __init__.py
data_dir = Path(pkgutil.resolve_name('package.data').__file__).parent
data_path = Path(data_dir, 'data.csv')

(1) Are both of these strategies ok/recommended?
(2) Is one of these two approaches (or another one) to be preferred and why?
(3) Is trying to get the path of a file within a package for some reason a bad pattern and I should be doing something else? Sometimes reference data files, of complex formats, need to be stored and accessed for a package to operate so it seems like the answer can’t be no here.

cameron · October 31, 2022, 9:28pm

Why not just have something like this in your top __init__.py?

 from os.path import dirname, join as joinpath
 DATADIR = joinpath(dirname(__file__), 'data')
 # ... utility function to compute the path for `data.csv` or whatever ...

or the same written using pathlib (my head’s still in os.path land).

The mymodule.py computes:

 import os.path
 from . import DATADIR
 csvpath = os.path.join(DATADIR, 'data.csv')
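The pathlib spelling of the same idea might look like the sketch below. The helper name `data_path` is made up for illustration; it takes the `__file__` of the package's __init__.py as an argument so the function can be exercised outside a real package.

```python
from pathlib import Path


def data_path(init_file: str, filename: str) -> Path:
    """Build <package dir>/data/<filename> from the package's __init__.py path.

    `data_path` is a hypothetical helper; pass `__file__` from __init__.py.
    """
    # The package directory is the parent of __init__.py.
    return Path(init_file).parent / "data" / filename
```

In the top-level __init__.py this reduces to `DATADIR = Path(__file__).parent / 'data'`, and mymodule.py would then compute `DATADIR / 'data.csv'`.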

Methods involving importlib will inherently try to treat your data dir
as a package, because that’s what importlib is all about.

Cheers,
Cameron Simpson cs@cskk.id.au

jagerber · October 31, 2022, 9:33pm

Apparently it’s bad form to make paths for data resources using paths relative to __file__ as explained in “Bad Way #1” in this answer. I guess the issue is that this approach will break down if the package is packed in zip or wheel. I don’t think that case applies to me but still trying to learn and follow best practices…

steven.daprano · October 31, 2022, 11:25pm

There are no “best practices” for this.

- Simple path manipulation assumes that your library is installed as a package on a file system
  - which is only true 99.99% of the time;
  - although the code is pretty simple and obvious;
  - and gives you an actual pathname to a real file that you can pass to libraries which require actual pathnames to real files;
  - and it works even for naive applications that just dump a .py file and its data files in the current directory.
- The modern importlib.resources API is bad because it requires your data files to be part of a package, not just a subdirectory of a package;
  - and because its API is complicated and confusing (WTH is a Traversable?);
  - and because it doesn’t give you a path to an actual file;
  - but on the positive side, it works for the 0.01% of packages that aren’t installed into a file system.
- The legacy importlib.resources API is bad because it is deprecated and is going to be removed.
- pkg_resources is bad, because it requires setuptools to be available at runtime;
  - and because it is slow;
- pkgutil is bad because it also requires your data files to be part of the package;
  - and because it can only work with binary files, not text;
  - and there is no API for modifying your data files, only reading their static content;
  - and it doesn’t give you a path to an actual file.
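To make the pkgutil points concrete, here is a small sketch of reading a data file with pkgutil.get_data(). It returns the file's contents as bytes rather than a path, so text has to be decoded by hand. The helper name `read_text_resource` and the UTF-8 default are assumptions for illustration.

```python
import pkgutil


def read_text_resource(package: str, resource: str, encoding: str = "utf-8") -> str:
    """Return a package data file as text (hypothetical helper name)."""
    # get_data() takes a '/'-separated path relative to the package and returns
    # bytes, or None if the package cannot be located or its loader lacks get_data.
    raw = pkgutil.get_data(package, resource)
    if raw is None:
        raise FileNotFoundError(f"{resource!r} not available from {package!r}")
    return raw.decode(encoding)
```

For the layout in the original question this would be called as `read_text_resource('package', 'data/data.csv')`. It gives the content only, which is why it does not help when a library such as an HDF5 reader insists on a filename.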

My opinion is that simple filepath manipulation may not be best practice, but it is as close to it as we have right now, except for that tiny minority of packages installed into zip files or wheels.

cameron · November 1, 2022, 12:11am

Just to add to this: a file path may not even be meaningful for a “file” inside an archive that has not been unpacked, such as a zip.

Cheers,
Cameron

jagerber · November 1, 2022, 1:06am

@cameron Yes, that was part of my concern: the whole idea of a “file” doesn’t make sense for zips. I don’t really know anything about zip files or wheels and frankly I’m not hoping to learn about them anytime soon.

@steven.daprano Thank you very much for your summary of the different approaches and how they all have downsides! You hit the nail on the head about what’s been bothering me. Basically, the naive path approach of using __file__ somewhere inside the package seems pretty straightforward, but I was turned away from it by a number of Stack Overflow posts advising against it. The alternative approaches that were proposed always seemed to have their own (even worse) problems and inconveniences, especially given that the __file__ approach works for “only” 99.99% of cases, which covers 100% of my personal cases.

Well, for my part I think I’ll go with __file__ for now until better import/package management systems are developed and agreed upon. That said, still open and curious to hear others’ opinions and thoughts on this point.

bryevdv · November 1, 2022, 3:24am

I keep seeing this stated, but citation needed? We use __file__ and relative paths to locate JS resources shipped with Bokeh and there has never been any issue with the wheel packages we publish.

Or does “installed into wheels” really mean something obscure and different from “installed from wheels”?

jagerber · November 4, 2022, 1:46am

I’ve wondered about this as well. From the language it sounds like the package can somehow be installed into a wheel. i.e. the code is somehow still inside the wheel even when it’s installed then executed. I’m not sure if this can be the final distribution state of a package…

Another possibility is that maybe during the installation of a package that is distributed as a wheel it is possible to execute code within the wheel before it’s finally deployed onto a file system for some reason.

But I know very very little about this. This is just what I’ve inferred from language I’ve seen around this particular subject…

I also don’t know a use case for when you would be executing code within a .zip…

effigies · November 4, 2022, 12:39pm

My impression is that the use case is zipapps, though I haven’t actually tried writing one myself. And possibly py2exe generates archives that need to be accessed without unpacking.

It would be useful to me to have a canonical example (similar to pypa/sampleproject) of a tool that needs resource access where __file__ wouldn’t work, so that I could write a test to ensure that my access method that works fine for unpacked wheels also works for zipped apps.
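As a starting point for such a test, here is a rough sketch (not a canonical example) that builds a zip archive containing a tiny package, imports the package directly from the archive, and compares the two access styles. The package name `zipped_pkg`, the file contents, and both function names are invented for the demonstration. It assumes Python 3.9 or later.

```python
import importlib
import sys
import zipfile
from importlib.resources import as_file, files
from pathlib import Path


def build_zipped_package(directory: Path, name: str = "zipped_pkg") -> Path:
    """Write a zip archive holding a tiny package with one data file.

    The package name and file contents are made up for this demonstration.
    """
    archive = directory / f"{name}.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(f"{name}/__init__.py", "")
        zf.writestr(f"{name}/data/data.csv", "a,b\n1,2\n")
    return archive


def compare_access(archive: Path, name: str = "zipped_pkg") -> dict:
    """Import the package straight from the archive and try both approaches."""
    sys.path.insert(0, str(archive))
    importlib.invalidate_caches()
    try:
        module = importlib.import_module(name)
        # __file__ points inside the archive, so the derived path is not a real file.
        naive = Path(module.__file__).parent / "data" / "data.csv"
        resource = files(name) / "data" / "data.csv"
        # as_file() extracts the resource to a temporary file for the duration
        # of the with block, which gives path-only libraries something to open.
        with as_file(resource) as real_path:
            extracted_exists = real_path.exists()
            extracted_text = real_path.read_text()
        return {
            "naive_exists": naive.exists(),
            "resource_text": resource.read_text(),
            "extracted_exists": extracted_exists,
            "extracted_text": extracted_text,
        }
    finally:
        sys.path.remove(str(archive))
```

Calling `compare_access(build_zipped_package(some_tmp_dir))` and asserting on the returned dictionary could serve as the regression test: the `__file__`-relative path should report that it does not exist, while the importlib.resources route should still return the CSV contents.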