Proper use of importlib.resources for data files

wigging · November 23, 2024, 11:38pm

I have a project layout as shown below. Data files are located in the package in the src/mypackage/data directory.

my-project
├── src
│   └── mypackage
│       ├── data
│       │   ├── fruits.csv
│       │   └── veggies.csv
│       ├── __init__.py
│       └── reader.py
├── README.md
├── example.py
└── pyproject.toml

The pyproject.toml content is shown here:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "mypackage"
version = "0.1"
authors = [{name = "Bart Simpson"}]
description = "A small example package"
requires-python = ">=3.12"
dependencies = ["pandas", "ruff"]

In the reader.py module I have functions that read the CSV files in the data directory and print out the data. Below is a function that reads the fruits.csv file and prints the fruit data.

import pandas as pd
import importlib.resources

def read_fruits():
    """Read fruits CSV file and print data."""
    data_res = importlib.resources.files("mypackage") / "data"

    with importlib.resources.as_file(data_res / "fruits.csv") as f:
        df = pd.read_csv(f)

    print(f"\nFruits data from `fruits.csv` is below\n{df}")

Question 1

I can use

importlib.resources.files("mypackage") / "data"

or I can use

importlib.resources.files("mypackage.data")

to get a Traversable for the data directory in the package. Both forms work, but does it matter which one I use? Is one more performant than the other?

Question 2

The data directory is just a plain folder as shown here

data/
├── fruits.csv
└── veggies.csv

or it can be a package as shown next

data/
├── __init__.py
├── fruits.csv
└── veggies.csv

Both approaches work, but the Python docs make it sound like this directory should be a package. Can someone clarify whether the data directory should be a package?

Question 3

If the data directory is made a package as shown below

data/
├── __init__.py
├── fruits.csv
└── veggies.csv

I can import it as a package and use it as shown here

import pandas as pd
import importlib.resources
from . import data

def read_fruits():
    """Read fruits CSV file and print data."""
    data_res = importlib.resources.files(data)  # <-- use data package here

    with importlib.resources.as_file(data_res / "fruits.csv") as f:
        df = pd.read_csv(f)

    print(f"\nFruits data from `fruits.csv` is below\n{df}")

This approach works too and doesn’t rely on strings to locate the data directory, but it requires adding an __init__.py file and an import statement for the data package. Is there any reason to prefer it over the string-based approach shown above?

bwoodsend · November 24, 2024, 2:17am

1. They achieve the same thing. As for picking one, since your data directory isn’t really intended to be a package (there’s no code in it), I’d encourage importlib.resources.files("mypackage") / "data", not for any functional reason but because it better matches what you mean.
2. Since namespace packages exist, the presence or absence of an __init__.py doesn’t change whether a directory is a package. A subdirectory of anything in sys.path (including your data directory) is always treatable as a (sub)package regardless of whether you want to use it as such. I think all the docs are alluding to is that you can’t use importlib.resources.files("my_package") / "../neighbouring/files", since anything outside of the package is not part of the package (i.e. this API is not an arbitrary file system walker).
3. No, there’s no reason to do that (and for the same reasons as in 1, I’d encourage you not to).

wigging · November 24, 2024, 2:36am

Ok, thanks for the comments. I think I’ll stick to the importlib.resources.files("mypackage") / "data" approach as you suggested.