Importer: provide a specialisation point for handling new file extensions

My motivation for this is wanting to write a module that handles .gql files by parsing them as GraphQL and compiling the resulting Python, though I can immediately see several things that would benefit from similar support:

  • A templating system that uses the new t-strings.
  • Parser generators
  • Dynamic wrapping of code in other languages into extension modules

…and I’m sure people can think of others.

I asked here whether I’d found a genuine limitation, and it seems that I have.

Right now (3.14) there are two key specialisation points for the import machinery: sys.meta_path and sys.path_hooks. Neither of those is well suited to the task of adding a new suffix and a Loader to handle it: the existing machinery specifies the list of handled suffixes at boot time (when constructing the FileFinder.path_hook to put in sys.path_hooks). What is needed is a specialisation point for file extensions.

I’ve studied the source code and believe it could be added very easily:

  • Create a new sys.suffix_loaders, a list of (suffix, loader) pairs.
  • Augment the FileFinder constructor and path_hook class method to accept a details keyword argument. Either *loader_details must be empty or details must be None. If details is not None, it must be an iterable returning (suffix, loader) pairs. Note that details is inherently compatible with, and could be assigned directly to, the FileFinder._loaders member; this minimises refactoring.
  • Alter _bootstrap_external._install to, instead of creating a FileFinder(*supported_loaders), populate sys.suffix_loaders with them and create a FileFinder(details=sys.suffix_loaders).
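For anyone who wants to see the (loader, suffixes) mechanism in isolation, here is a minimal sketch using only the public FileFinder constructor. It does not use sys.suffix_loaders, which is only a proposal and does not exist. GqlLoader and load_with_extra_suffix are hypothetical names, and the loader merely stores the file's text instead of compiling GraphQL.

```python
import importlib.abc
import importlib.machinery
import importlib.util
import pathlib
import tempfile


class GqlLoader(importlib.abc.Loader):
    """Hypothetical stand-in for a loader that compiles GraphQL to Python."""

    def __init__(self, fullname, path):
        # FileFinder instantiates each loader as loader(fullname, path).
        self.fullname = fullname
        self.path = path

    def exec_module(self, module):
        # A real loader would parse the file and generate code; this sketch
        # only exposes the raw text so the suffix handling can be seen.
        module.QUERY = pathlib.Path(self.path).read_text(encoding="utf-8")


def load_with_extra_suffix(directory, name):
    # loader_details are (loader, suffixes) pairs, given in priority order.
    details = [
        (GqlLoader, [".gql"]),
        (importlib.machinery.SourceFileLoader,
         importlib.machinery.SOURCE_SUFFIXES),
    ]
    finder = importlib.machinery.FileFinder(str(directory), *details)
    spec = finder.find_spec(name)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "hello.gql").write_text(
        "{ viewer { login } }", encoding="utf-8")
    hello = load_with_extra_suffix(tmp, "hello")
```

The finder here is built by hand and never reaches sys.path_hooks, which is exactly the gap: the FileFinder that the interpreter installs at startup gets its loader details fixed before user code can add to them.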

I believe that would be 100% backward compatible, and have negligible performance (CPU, memory, code size) impact.

Does this seem like a sensible improvement to other people?

Not to this person. I use YAML for configuration data, usually pyyaml, sometimes ruamel.yaml. It’s straightforward and explicit. Figuring out what magic is happening when importing non-Python files wouldn’t be worth the cost.

The “magic” would be importing a module. Which is what you expect of import statements.

I’m not anticipating that this would be used for configuration, more for accessing facilities that are programmed in a language other than Python — like GraphQL requests in my main motivating example.

In fact, sys.path_hooks was explicitly designed for the purpose of adding custom import mechanisms for files with non-standard suffixes. Look at the list of authors of PEP 302 if you want confirmation that I’m able to state this with some authority :slightly_smiling_face:

You can write a path importer that handles the files you want. You may have to reimplement some of the file path machinery that’s present in importlib, but that’s not impossible - the file path importer is simply one implementation of the underlying finder protocol. At best, you may have found a place where the helpers in importlib aren’t as helpful as you’d like, but what you’re trying to do is 100% possible.

In fact, the Quixote library implements precisely this - importable .ptl files. See the “PTL Modules” section in the documentation for more information.


In fact, sys.path_hooks was explicitly designed for the purpose of adding custom import mechanisms for files with non-standard suffixes.

Really? I honestly can’t see how it can be used for that. :confused:

By my understanding, when the PathFinder (via sys.meta_path) interrogates the sys.path_hooks it is looking for a file finder for a particular path, not for a particular file?

That is exemplified by the ZIP importer, which handles accessing files within a ZIP file as though it were a directory.

When a path hook returns a finder, that finder has to work for every file (directly) beneath that path.
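A small sketch of that behaviour, assuming nothing beyond the documented FileFinder.path_hook class method: the hook is asked about a directory, rejects a file path outright, and the finder it returns only recognises the suffixes it was built with. The file names are made up for the demonstration.

```python
import importlib.machinery
import pathlib
import tempfile

# The same kind of hook that the interpreter places on sys.path_hooks,
# here built with only the source-file loader.
hook = importlib.machinery.FileFinder.path_hook(
    (importlib.machinery.SourceFileLoader,
     importlib.machinery.SOURCE_SUFFIXES))

with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "mod.py").write_text("X = 1\n", encoding="utf-8")
    pathlib.Path(tmp, "data.gql").write_text("{ x }", encoding="utf-8")

    # The hook is called with a directory and returns one finder for it.
    finder = hook(tmp)
    found_py = finder.find_spec("mod")    # a .py file: handled
    found_gql = finder.find_spec("data")  # unknown suffix: not found

    # Asking the hook about an individual file is refused.
    try:
        hook(str(pathlib.Path(tmp, "data.gql")))
        hook_accepts_file = True
    except ImportError:
        hook_accepts_file = False
```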

Therefore, as things stand:

  • If you place the path hook for your suffix before FileFinder’s, you must handle all the usual file types, because PathFinder will never reach FileFinder’s hook
  • If you place your path hook after FileFinder’s, it will never be reached

Have I missed something, there?

As such, I can see very few approaches currently available, none of them ideal:

  • Write your own clone of PathFinder that handles just your suffix, and place it on sys.meta_path
  • Identify the FileFinder’s path hook (itself necessitating a certain degree of hackery) and:
    • Hack its closure to include your suffix and its loader in the loader_details it will pass to FileFinder’s constructor.
    • Wrap the hook to modify what it returns:
      • Hack the _loaders member of the FileFinder
      • Return a wrapper around the FileFinder

Of those, hacking the FileFinder’s path hook is the easiest and most efficient, though also prone to break without notice in new versions of Python and somewhat nausea-inducing. Putting your own finder on sys.meta_path is clean, I grant, but you have to roll your own sys.path_importer_cache doppelgänger.
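To make the first option concrete, here is a rough sketch of a sys.meta_path finder that handles a single suffix. SuffixFinder, TextLoader, the .txtmod suffix and the module name are all hypothetical, and the sketch skips caching and packages entirely, which is precisely the part of PathFinder and FileFinder you would end up reimplementing.

```python
import importlib
import importlib.abc
import importlib.util
import os
import pathlib
import sys
import tempfile


class TextLoader(importlib.abc.Loader):
    """Hypothetical loader: exposes the file's text as module.TEXT."""

    def __init__(self, path):
        self.path = path

    def exec_module(self, module):
        module.TEXT = pathlib.Path(self.path).read_text(encoding="utf-8")


class SuffixFinder(importlib.abc.MetaPathFinder):
    """Meta path finder that only recognises one suffix."""

    def __init__(self, suffix):
        self.suffix = suffix

    def find_spec(self, fullname, path=None, target=None):
        # Top-level imports search sys.path; submodules search the parent
        # package's __path__. Nothing is cached, unlike PathFinder.
        tail = fullname.rpartition(".")[2]
        for entry in (sys.path if path is None else path):
            if not isinstance(entry, str):
                continue
            candidate = os.path.join(entry or os.getcwd(), tail + self.suffix)
            if os.path.isfile(candidate):
                return importlib.util.spec_from_file_location(
                    fullname, candidate, loader=TextLoader(candidate))
        return None  # "not me": let the next meta path finder try


with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "suffix_demo_greeting.txtmod").write_text(
        "hello", encoding="utf-8")
    finder = SuffixFinder(".txtmod")
    sys.meta_path.append(finder)
    sys.path.insert(0, tmp)
    try:
        greeting = importlib.import_module("suffix_demo_greeting")
    finally:
        sys.path.remove(tmp)
        sys.meta_path.remove(finder)
```

Every project that takes this route adds another finder that rescans the same directories, which is the proliferation concern raised next.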

One thing weighing heavily on my mind is: what if two projects each want to do this for their own suffix?

  • If each adds to sys.meta_path then there’s a proliferation of caches.
  • If you wrap the FileFinder’s path hook, the next project won’t be able to find it.
  • If you attempt to mitigate either of these by creating a standard library everyone should use for adding suffixes, you end up with something harder to write and less good than what I’m suggesting here.

You mentioned Quixote, so I took a look at what it does, and there’s a certain amount of creeping horror:

  • The docstring says its suffix is optional because “not all Quixote applications need to use PTL, and import hooks are deep magic that can cause all sorts of mischief and deeply confuse innocent bystanders”, but this doesn’t help if integrating code that uses Quixote into a larger project, since the suffix handler is global.
  • It also adds “One known problem is that, if you use ZODB, you must import ZODB before calling this function.” which exacerbates my concerns about playing nice with other code.
  • It goes the route of adding a PathFinder clone to sys.meta_path
  • …but that clone is a subclass of PathFinder with _path_importer_cache() overridden to use its own path_importer_cache.
  • …and because it was last touched a decade ago it’s not been updated to track the changes to PathFinder.invalidate_caches resulting from gh-93461.

Of those, I find overriding an undocumented function that begins with an underscore especially distasteful!

Why do you think that?

A very cursory review of the documentation says:

To selectively prevent the import of some modules from a hook early on the meta path (rather than disabling the standard import system entirely), it is sufficient to raise ModuleNotFoundError directly from find_spec() instead of returning None. The latter indicates that the meta path search should continue, while raising an exception terminates it immediately.

I’ve never messed with Python importer hooks specifically but that’s generally the way these things work. You either handle or just say “Not me!” and let something else handle it.

Why do you think that?

  • The documentation (which I do feel could be more clear)
  • The source code of importlib
  • Running the below fragment of diagnostic code and then trying stuff.
import sys

class Diag:
    """Logs every call, then declines so the normal machinery carries on."""
    def __init__(self, name):
        self.name = name
    def find_spec(self, *args, **kwargs):
        # Called when the instance sits on sys.meta_path.
        print(f'find_spec name={self.name!r} {args=} {kwargs=}')
        return None
    def __call__(self, *args, **kwargs):
        # Called when the instance sits on sys.path_hooks.
        print(f'(hook) name={self.name!r} {args=} {kwargs=}')
        raise ImportError

sys.meta_path.insert(0, Diag('pre_meta'))
sys.meta_path.append(Diag('post_meta'))
sys.path_hooks.insert(0, Diag('pre_hook'))
sys.path_hooks.append(Diag('post_hook'))

Yes, I thought that too until I wrote some code and tried to get it to work. :disappointed_face: It turns out that the question being asked of sys.path_hooks is, in effect, “can you find things in path /tmp/crj/example?” not “can you find (with a usable suffix) /tmp/crj/example/submodule?”

The output from using the above import_diag on a simple example looks like this:

>>> import import_diag
>>> import example
find_spec name='pre_meta' args=('example', None, None) kwargs={}
running example/__init__.py
>>> import example.submodule
find_spec name='pre_meta' args=('example.submodule', ['/tmp/crj/example'], None) kwargs={}
(hook) name='pre_hook' args=('/tmp/crj/example',) kwargs={}
running example/submodule.py
>>> import example.nonexistent
find_spec name='pre_meta' args=('example.nonexistent', ['/tmp/crj/example'], None) kwargs={}
find_spec name='post_meta' args=('example.nonexistent', ['/tmp/crj/example'], None) kwargs={}
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    import example.nonexistent
ModuleNotFoundError: No module named 'example.nonexistent'
>>> 

Also, I can then try this:

>>> list((k,v) for (k,v) in sys.path_importer_cache.items() if k.startswith('/tmp/crj'))
[('/tmp/crj', FileFinder('/tmp/crj')), ('/tmp/crj/example', FileFinder('/tmp/crj/example'))]

…and note that there are only two FileFinders there, one for each directory in which it’s looked for files.

(Also, the attempt to import example.nonexistent didn’t use sys.path_hooks at all, because a FileFinder for /tmp/crj/example was already in the cache from importing example.submodule.)

Oh. Sorry.

I’m leaving all that I just wrote because I think it’s a useful medium-depth-dive into what I’m talking about here, and at least demonstrates why I don’t think I’m talking nonsense. :crossed_fingers:

But on re-reading your question I think there’s a far simpler misunderstanding: I was talking about sys.path_hooks and you quoted documentation for sys.meta_path. :slightly_smiling_face:


Sorry, no. You’re right, I misunderstood what you were trying to do.

At the time PEP 302 was added, the basic import mechanisms were hard coded, and PEP 302 simply added extension points to supplement them. New file types weren’t a key use case for the PEP, which was mainly focused on importing from new “containers”, like zipfiles, databases, or URLs. To support importing non-standard suffixes, you’d implement a meta_path hook, and yes, you’d have to replicate the built-in logic for searching sys.path in the finder you registered. I assume that’s what Quixote did originally, and it switched to subclassing PathFinder when that was added.

The PathFinder class was added as part of the introduction of importlib, which reimplemented the standard import logic in Python, using PEP 302 hooks, in order to replace the original custom C implementation of import. It’s not a complete surprise to me to hear that PathFinder wasn’t designed for extensibility. Again, it’s simply that it wasn’t the key goal of the reimplementation. As another example, the support for implicit namespace packages is another area (from what I recall) where it’s not really possible to hook into the mechanism.

So I apologise - you’re right that there’s no extensibility point where adding support for new file extensions can be hooked into the system. I suspect it’s just something that not many people have needed, and the question never came up.

I’m sure that if someone came up with a good PR adding such support to PathFinder, it would be considered. I can’t guarantee it would be accepted, that’s going to be down to the complexity of the change, the maintainability of the code, and the demand for the feature. And a discussion on DPO is always useful to support a PEP - if you get evidence here of a decent level of interest in the feature, that will help the PR’s case.


Just to clarify, if I understand right, it seems that the difficult aspect here is not so much handling a new extension, but handling it in a way that re-uses the existing logic for module-finding. That is, you could easily add a new entry to the meta path that finds files with your given extension; the trick is that you want that finding process to duplicate the existing logic for finding .py files.

It seems that some of this difficulty could be alleviated by refactoring the PathFinder and FileFinder classes so that they are less intertwined with other dimensions of the import system. For instance, right now PathFinder hardcodes references to sys.path_importer_cache, meaning it cannot easily be subclassed. But if this were abstracted a bit (e.g., by giving the class a my_import_cache attribute which for PathFinder points to sys.path_importer_cache), then it would be simpler to subclass PathFinder.

If my understanding is correct then so is yours. :slightly_smiling_face:

Yes, refactoring PathFinder the way you suggest would help, and might even be desirable on general principles.

On the other hand, for simplicity and for performance, I much prefer my original suggestion: letting implementers add items to a new sys.suffix_loaders list.

For reference, I went ahead and tried making the change; it was only twelve lines of substantive code.

I think my main problem now may be that nobody seems to understand what I mean from the subject line, but I’m struggling to think of anything more clear. :thinking: