How to define an importlib Distribution to extract metadata from an sdist?

I want to extract standard metadata fields from sdists. Although PKG-INFO is easy to parse, I would prefer to rely on a standard implementation of the parser rather than write my own.

I wrote the following shim using importlib_metadata but I can’t tell what might be broken about this:

import contextlib
import pathlib
import tarfile

from importlib_metadata import Distribution


class TarballSourceDistribution(Distribution):
    def __init__(self, path: pathlib.Path):
        self.path = path
        self.dist_name = path.name.rsplit(".", 2)[0]

    def read_text(self, filename: str) -> str | None:
        with tarfile.open(self.path, "r:gz") as tf:
            # extractfile() raises KeyError when the member is missing
            with contextlib.suppress(KeyError):
                fp = tf.extractfile(f"{self.dist_name}/{filename}")
                if fp is None:
                    return None
                return fp.read().decode()
        return None

    def locate_file(self, path: str) -> pathlib.PurePosixPath:
        return pathlib.PurePosixPath(path)

I got this far by going straight to the importlib.metadata source and poking around at what Distribution requires.

It seems to work! I can pull metadata from an sdist without having to install it.

But I’m unsure that this is correct in a few ways. The importlib.metadata docs don’t really describe what locate_file should return, and PackagePath.locate seems to be expected to return a path on disk (which is ambiguous for an unextracted tgz).

Does anyone have experience working with custom Distribution types or something similar? I’d appreciate any guidance or affirmation that this is a good approach.
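
One way to probe the concerns above is to run the shim against a synthetic sdist. The sketch below is a variant of the class from the question, assuming the standard-library importlib.metadata (rather than the importlib_metadata backport) and assuming the two abstract methods are read_text and locate_file. The helper make_demo_sdist is hypothetical and only exists to build a test archive. What locate_file should return for an unextracted archive is a guess here, not documented behaviour.

```python
import contextlib
import io
import pathlib
import tarfile
import tempfile
from importlib.metadata import Distribution
from typing import Optional


class TarballSourceDistribution(Distribution):
    """Read metadata straight from a .tar.gz sdist (a sketch, not vetted)."""

    def __init__(self, path: pathlib.Path) -> None:
        self.path = path
        # "demo-1.0.tar.gz" -> "demo-1.0", the top-level directory in the archive
        self.dist_name = path.name.rsplit(".", 2)[0]

    def read_text(self, filename: str) -> Optional[str]:
        with tarfile.open(self.path, "r:gz") as tf:
            # extractfile() raises KeyError when the member does not exist
            with contextlib.suppress(KeyError):
                fp = tf.extractfile(f"{self.dist_name}/{filename}")
                if fp is not None:
                    return fp.read().decode("utf-8")
        return None

    def locate_file(self, path) -> pathlib.PurePosixPath:
        # Assumption: nothing is unpacked, so return a pure path inside the archive
        return pathlib.PurePosixPath(self.dist_name) / path


def make_demo_sdist(directory: pathlib.Path) -> pathlib.Path:
    """Hypothetical helper: write a tiny sdist that contains only PKG-INFO."""
    pkg_info = b"Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n"
    sdist = directory / "demo-1.0.tar.gz"
    with tarfile.open(sdist, "w:gz") as tf:
        info = tarfile.TarInfo("demo-1.0/PKG-INFO")
        info.size = len(pkg_info)
        tf.addfile(info, io.BytesIO(pkg_info))
    return sdist


with tempfile.TemporaryDirectory() as tmp:
    dist = TarballSourceDistribution(make_demo_sdist(pathlib.Path(tmp)))
    print(dist.metadata["Name"], dist.version)  # prints: demo 1.0
```

As far as I can tell, the metadata and version properties only go through read_text, so locate_file is only exercised when something calls PackagePath.locate on an entry from dist.files.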


This is probably incomplete, and a minor hack. It’s linted, but needless to say it’s not checked, tested or debugged. This is random code on the internet, most certainly supplied with absolutely no warranty etc.

But it’s the first thing that occurred to me in the shower and I think it will work: If pyproject.toml is there, you can just return the result of tomllib.load on it. If it’s a setup.py style sdist, you can shadow distutils.core.setup and setuptools.setup, and import setup.py. setup.cfg left as exercise for the reader.

import pathlib
import sys
import tarfile
import tempfile
import tomllib

TMP_DIR = pathlib.Path(tempfile.gettempdir()) / 'sdist_metadata_reader'
TMP_DIR.mkdir(exist_ok=True)


class SdistMetadataReader:

    def __init__(self, path):
        self.path = path
        self._kwargs = {}

    def save_kwargs(self, **kwargs):
        # Stand-in for setup(): record the keyword arguments instead of building.
        self._kwargs = kwargs

    def read_sdist_meta(self) -> dict:
        with tarfile.open(self.path, "r:gz") as tf:
            names = tf.getnames()
            # Files in an sdist sit under a top-level "{name}-{version}/" directory.
            root = names[0].split('/')[0]
            if f'{root}/pyproject.toml' in names:
                with tf.extractfile(f'{root}/pyproject.toml') as f:
                    return tomllib.load(f)
            elif f'{root}/setup.py' in names:
                # On Python 3.12+, distutils is only available via setuptools.
                import setuptools
                import distutils.core

                st_setup = setuptools.setup
                dst_utls_setup = distutils.core.setup
                setuptools.setup = distutils.core.setup = self.save_kwargs
                module_dir = str(TMP_DIR / root)
                sys.path.insert(0, module_dir)

                try:
                    tf.extract(f'{root}/setup.py', path=TMP_DIR, filter='data')
                    import setup  # noqa: F401  (runs setup.py, which calls save_kwargs)
                finally:
                    setuptools.setup = st_setup
                    distutils.core.setup = dst_utls_setup
                    sys.path.remove(module_dir)
                    sys.modules.pop('setup', None)
                    (TMP_DIR / root / 'setup.py').unlink(missing_ok=True)

                return self._kwargs
            else:
                raise ValueError('Neither pyproject.toml nor setup.py is present')

My goal is actually to get away from handling the various ways that build backends may store their data and instead pull from PKG-INFO (or perhaps wheel METADATA in the future).

I interact with various packages using different build systems, plus I want to handle dynamic metadata.
So while your approach is interesting – and it might be useful for other tasks for me; thanks for sharing it! – it’s not quite doing the same job.
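
For reference, reading PKG-INFO directly does not need a custom parser: the core metadata format is the email-header style that the standard-library email package handles. A minimal sketch, assuming PKG-INFO sits directly under the single top-level directory of a .tar.gz sdist. The function name read_pkg_info and the demo package are made up for illustration.

```python
import email.parser
import email.policy
import io
import pathlib
import tarfile
import tempfile


def read_pkg_info(sdist_path):
    """Parse the top-level PKG-INFO of a .tar.gz sdist with the stdlib email parser."""
    with tarfile.open(sdist_path, "r:gz") as tf:
        for member in tf.getmembers():
            # Assumption: PKG-INFO sits directly under the top-level directory.
            parts = pathlib.PurePosixPath(member.name).parts
            if member.isfile() and len(parts) == 2 and parts[1] == "PKG-INFO":
                raw = tf.extractfile(member).read()
                break
        else:
            raise ValueError(f"no top-level PKG-INFO in {sdist_path}")
    # compat32 is the policy the core metadata specification refers to.
    return email.parser.BytesParser(policy=email.policy.compat32).parsebytes(raw)


# Demo on a synthetic archive (the package name and fields are made up).
PKG_INFO = (
    b"Metadata-Version: 2.1\n"
    b"Name: demo\n"
    b"Version: 1.0\n"
    b"Requires-Dist: requests\n"
    b"Requires-Dist: tomli; python_version < '3.11'\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sdist = pathlib.Path(tmp) / "demo-1.0.tar.gz"
    with tarfile.open(sdist, "w:gz") as tf:
        info = tarfile.TarInfo("demo-1.0/PKG-INFO")
        info.size = len(PKG_INFO)
        tf.addfile(info, io.BytesIO(PKG_INFO))
    msg = read_pkg_info(sdist)

print(msg["Name"], msg["Version"])
print(msg.get_all("Requires-Dist"))
```

Multi-valued fields such as Requires-Dist need get_all, since indexing the message returns only the first occurrence.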


Does something like this help?

Hopefully I did not misinterpret the question.


Oh, very cool and interesting, thanks!

I’m already pulling in build as a dependency of my project so that I can convert from the source tree to an sdist, so this would fit more or less perfectly.

I’ll need to look more at how this works, but it seems likely to solve my problem.


Good. Feel free to report after testing…


I just finished up some work with this and it came together nicely. I even tried giving myself an env var for my tool to translate into the isolated flag, which let me get some faster results in the cases where a non-isolated build works.

But overall I’m left wondering… It seems like this does the same work as build --sdist followed by metadata extraction. So although my new code is shorter, it feels like the build CLI might be better supported and maybe I should go back to using it.

However, I at least have a strong notion of where to look in build for how that tool pulls metadata from the sdist. Regardless of exactly what I choose – I’ll have to noodle on it more – I feel like I’m on a better path towards success now.

I guess maybe build.util.project_wheel_metadata is the more “correct” way.

But depending on the circumstances it can be quick or slow. For example some build back-ends might be more efficient than others. Also if there is no dynamic metadata in [project], then it should be quick (if the build back-end is smart) because parsing [project] should be all that is necessary.
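
A rough sketch of how this could be wired up, including the environment-variable switch for isolation mentioned earlier in the thread. The variable name SDIST_META_NO_ISOLATION and both function names are hypothetical. It assumes the build package is installed and that project_wheel_metadata accepts a source directory plus an isolated flag. Note that it works on a source tree, so an sdist tarball would have to be unpacked first.

```python
import os


def isolation_from_env(environ=os.environ, var="SDIST_META_NO_ISOLATION"):
    """Return False (skip isolation) when the hypothetical env var is truthy."""
    return environ.get(var, "").strip().lower() not in {"1", "true", "yes"}


def wheel_metadata(source_dir):
    """Ask the project's build backend for its wheel metadata via build.util."""
    # Imported lazily so the helper above works without `build` installed.
    from build.util import project_wheel_metadata

    return project_wheel_metadata(source_dir, isolated=isolation_from_env())
```

Calling wheel_metadata("path/to/unpacked-sdist")["Name"] should then give the project name. A non-isolated run only works when the build requirements are already installed in the current environment.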


I wasn’t aware that this was a possibility!
My experience with build has always been to use it to build sdists + wheels, so it always operates in the mode of generating an sdist and then using the sdist to build the wheel.

That certainly puts some weight on the scale in favor of using build.util. I only wish it were clear that this is a stable API – the name and 0.x versioning are both subtle signs that it might be unsafe to depend upon.

[This is getting quite deep into the technical details, and I have only surface knowledge of this topic, so take what I say with a grain of salt…]

I believe your use case heavily depends on this hook of the specification: prepare_metadata_for_build_wheel. This is an optional hook. I do not know which build back-ends have implemented support for it yet. Or even if the build front-ends support it. According to this I guess build seems to support it, so that’s good.

So I would say that build.util.project_wheel_metadata is a reliable solution, and that there is a “happy path” on which it delivers the project’s metadata very quickly:

  • the build back-end of the project implements the prepare_metadata_for_build_wheel hook (in a smart and efficient way)
  • the project does not contain any dynamic field

I do not know of any alternative tool or library that offers such an API. For sure other build front-ends have implemented this, but I have no idea if they offer a public API for this functionality. Well, there is pyproject-hooks which is the one that both build and pip use, but as it says on its PyPI page it is a “low-level library”.

I guess it is fair to ask if build’s API is stable and can be relied upon, maybe in this ticket (its contributors are active on this forum, particularly in Packaging).
