Trying to robustly patch func.__globals__

Functions have a __globals__ field that you can reassign to change how global lookups are done. It gets initialized from the f_globals of the creating frame, which either comes from the defining function or, at the top level, from the module’s __dict__. You can write your own class with a __getitem__ that, for example, records which global lookups occur. However, if you do this you run into some trouble:

  • CPython internally uses PyDict_* functions on the globals object, meaning that it must actually be a dict or a dict subclass; otherwise you crash.
  • If you make it a dict subclass, the most obvious approach is to take the original globals dictionary on construction and override __getitem__ to do the recording before looking the name up in the original. But this creates a subtle problem – your object is really two dictionaries, the inherited one and the original globals stored as a member. You can try to override every relevant dict method, but the PyDict_* functions bypass user overrides and always call the original dict implementation, so they see the inherited dict, not the member one (a minimal sketch of this follows the list). I don’t need to monitor the internal accesses that the interpreter makes, but I do need the interpreter to see a consistent view so it doesn’t get confused/crash.
  • Alternatively, you could subclass dict but, instead of storing the original globals dict on construction, call self.update(original) to copy all the contents, then override __getitem__ to record. This way there really is only one dictionary for the function. But then you have the problem that module.__dict__ is no longer the same object as func.__globals__, so any new globals at module level are invisible to the function and vice versa.
  • This leads to the idea of patching every module.__dict__ to be a tracking dictionary type and letting that automatically propagate down to the __globals__ of the functions the module defines. But you can’t reassign __dict__. I could try reassigning it in C, but I don’t know if that is expected to keep working in the future.
  • I could try, in C, directly overwriting the ob_type pointer of the module dict to transform it in place into my custom dict subclass. But is that future-proof either?
  • There is also the PyDict_AddWatcher API, but it only lets you watch modifications, not pure lookups.
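
To make the two-dictionaries problem concrete, here is a minimal sketch (the class name is my own) of the naive wrapper from the second bullet, and how raw dict access bypasses it:

# Naive wrapper: the original globals live in a member, so the
# inherited dict storage of the instance itself stays empty.
class RecordingGlobals(dict):
    def __init__(self, original):
        super().__init__()
        self.original = original

    def __getitem__(self, name):
        print('accessing', name)
        return self.original[name]

g = RecordingGlobals({'x': 1})
g['x']                        # prints "accessing x", returns 1
print(dict.get(g, 'x', '?'))  # '?': raw PyDict-style access bypasses
                              # __getitem__ and sees the empty inherited storage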

Would appreciate any tips on the best way to do it.

Do the same problems occur with a class-based __call__able?


You mean if I wanted to change how a __call__ method looked up globals? I think so, since the __call__ method should just be another function that you can patch the __globals__ of. I don’t think there is a different lookup mechanism for methods.

It doesn’t work, but I meant: if you make an instance of a class with a __call__ instance method and use that instead of a function, the class can also be given a __globals__ class or instance variable.


I see, yeah, I think in that case __call__ would still have its own __globals__ that would be used instead of the one stored on the class, unless you also reassigned it to be the class one. But I think you would still run into the same general issues.
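
To illustrate (a quick sketch with names of my own choosing):

class Callable:
    __globals__ = {'a': 'from class attribute'}  # not consulted by lookups

    def __call__(self):
        return a  # resolved via Callable.__call__.__globals__

a = 'from module globals'
c = Callable()
print(c())                                         # from module globals
print(Callable.__call__.__globals__ is globals())  # True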

No, func.__globals__ can’t be reassigned:

def f(): ...
f.__globals__ = {} # AttributeError: readonly attribute

However, we can work around it by creating a new types.FunctionType object with all the same attribute values except globals, which we can replace with an instance of a dict subclass that, per your example, records which global lookups occur.

That is true for updates but untrue for lookups. The Python interpreter has a strange asymmetry here: for globals that are a dict subclass, lookups fall back to a slow path of PyMapping_* calls (so a __getitem__ override is honored), but updates always stick to direct PyDict_* calls.
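
A quick way to observe this asymmetry (a sketch of my own; Spy and template are names I made up, and the exact fast-path behavior is a CPython implementation detail):

from types import FunctionType

class Spy(dict):
    def __getitem__(self, name):
        print('read', name)
        return super().__getitem__(name)

    def __setitem__(self, name, value):
        print('write', name)
        super().__setitem__(name, value)

def template():
    global b
    b = a  # LOAD_GLOBAL a, then STORE_GLOBAL b

g = Spy(a=1)
f = FunctionType(template.__code__, g)
f()                      # prints "read a" only; the store never hits __setitem__
print(dict.get(g, 'b'))  # 1: the update went straight into the dict storage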

Knowing the above, this approach can actually work if we simply update func.__globals__ with the new content after a call, as demonstrated below:

from types import FunctionType

class DictLogger(dict):
    def __init__(self, data):
        self.data = data  # the real globals dict we mirror

    def __getitem__(self, name):
        print('accessing', name)
        return super().__getitem__(name)

    def __enter__(self):
        self.update(self.data)  # copy the current globals in
        return self

    def __exit__(self, *_):
        # copy any changes back so the module-level globals stay in sync
        self.data.clear()
        self.data.update(self)

def log_globals_lookups(func):
    def wrapper(*args, **kwargs):
        with DictLogger(func.__globals__) as logging_dict:
            # re-create the function with the logging dict as its globals
            return FunctionType(
                code=func.__code__,
                globals=logging_dict,
                name=func.__name__,
                argdefs=func.__defaults__,
                kwdefaults=func.__kwdefaults__,
                closure=func.__closure__
            )(*args, **kwargs)
    return wrapper

so that:

@log_globals_lookups
def f():
    global a
    print(f'before incrementing: {a=}')
    a += 1
    print(f'after incrementing: {a=}')

a = 1
f()
print(f'after call: {a=}')

outputs:

accessing print
accessing a
before incrementing: a=1
accessing a
accessing print
accessing a
after incrementing: a=2
after call: a=2

Demo here

The downside is that this isn’t thread-safe, since updates to globals aren’t visible to other threads until they are copied back at the end of the call, so you’ll likely have to lock the entire call if you want to make it thread-safe.
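
A minimal sketch of that locking approach (my own addition, reusing the log_globals_lookups decorator above with a single shared lock):

import threading

_lock = threading.Lock()

def log_globals_lookups_threadsafe(func):
    logged = log_globals_lookups(func)
    def wrapper(*args, **kwargs):
        # Hold the lock for the whole call so no other thread can observe
        # the module globals while they are stale.
        with _lock:
            return logged(*args, **kwargs)
    return wrapper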


Sorry, yeah, I spoke imprecisely; that is exactly what I did originally when I was writing a pure Python module. In C nothing prevents reassigning func->func_globals, although I’m not sure how kosher that is. Since I do it from a PyFunction_AddWatcher callback as soon as the function is created, I think there’s no time for any other Python code to observe the difference.

Yep, I also posted about that shortly after this 🙂

This is an interesting idea. I think you can probably break it even within one thread, though, because you could call other functions before returning that might expect to see changes in globals right away.
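
For example (my own construction, building on the decorator above), a callee can observe stale module globals mid-call:

@log_globals_lookups
def f():
    global a
    a = 2  # only updates the DictLogger copy until f returns
    g()

def g():
    print(a)  # prints 1: g's __globals__ is the real module dict

a = 1
f()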


Good point. So just patching func.__globals__ isn’t enough then. We need to patch module.__dict__, and I’ve found a workaround to do just that: make a “module” that uses a dict subclass as its __dict__:

class LoggingDict(dict):
    def __getitem__(self, name):
        print('accessing', name)
        return super().__getitem__(name)

class LoggingModule:
    def __init__(self):
        # Reassigning __dict__ works here because this is a plain class,
        # not a ModuleType subclass.
        self.__dict__ = LoggingDict()

So with this custom module type, we can create our custom module loader:

import sys
from importlib.util import spec_from_loader
from importlib.abc import InspectLoader

class LoggingLoader(InspectLoader):
    def create_module(self, spec):
        # Hand the import machinery our custom "module", so the source
        # below executes with the logging dict as its globals.
        return LoggingModule()

    def get_source(self, fullname):
        return '''\
def f():
    global a
    print(f'before incrementing: {a=}')
    a += 1
    print(f'after incrementing: {a=}')

a = 1
f()
print(f'after call: {a=}')
'''

class LoggingFinder:
    def find_spec(self, fullname, path=None, target=None):
        if fullname == 'foo':
            return spec_from_loader(fullname, LoggingLoader())

sys.meta_path.insert(0, LoggingFinder())
import foo

This outputs:

accessing f
accessing print
accessing a
before incrementing: a=1
accessing a
accessing print
accessing a
after incrementing: a=2
accessing print
accessing a
after call: a=2

Demo here


I’m confused that this works, because I thought module.__dict__ wasn’t reassignable. It looks like you get away with it because you don’t explicitly inherit from ModuleType, but I assume by not doing so the module is going to be missing other subtle things that may be expected, e.g. __file__, support for users defining __getitem__ at the module level, and I’m not sure what else? When I try to change your example to inherit from it I get:

Traceback (most recent call last):
  File "/ATO/code", line 38, in <module>
    import foo
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 921, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 813, in module_from_spec
  File "/ATO/code", line 17, in create_module
    return LoggingModule()
  File "/ATO/code", line 13, in __init__
    self.__dict__ = LoggingDict()
    ^^^^^^^^^^^^^
AttributeError: readonly attribute

Demo here

In the docs it looks like the ModuleType constructor only takes a name and docstring, so you can’t provide a different __dict__ there either. I’m not sure if it’s somehow possible with the importlib/spec machinery.
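
For reference, a quick check of both points (a sketch of my own):

from types import ModuleType

m = ModuleType('m', 'a docstring')  # name and optional doc are all it accepts
m.__dict__ = {}                     # AttributeError: readonly attribute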


Yes, that’s why I characterized my solution as a workaround and why I put the word “module” in quotes.

Regarding your good point about the missing subtle things that may be expected from ModuleType:

  1. __file__ does not come from ModuleType. It is set by a file-based loader, which the InspectLoader I used in my example is not. It would be set if you used a loader inheriting from SourceFileLoader instead, for example.
  2. Support for a user-defined __getattr__ (not __getitem__, by the way) can easily be implemented by defining __getattr__ on our “module”.
  3. For code that explicitly checks whether our “module” is a ModuleType, we can work around it by setting the __class__ attribute.

And here’s our updated “module” type supporting points 2 and 3:

from types import ModuleType

class LoggingModule:
    # Fake the class so isinstance checks against ModuleType pass.
    __class__ = ModuleType

    def __init__(self):
        self.__dict__ = LoggingDict()

    def __getattr__(self, name):
        # Delegate to a module-level __getattr__ if one is defined,
        # mirroring PEP 562 behavior.
        if __getattr__ := self.__dict__.get('__getattr__'):
            return __getattr__(name)
        raise AttributeError(name)

Demo here, which shows that it correctly supports a user-defined __getattr__ and passes both of the following tests:

assert isinstance(foo, ModuleType)
assert inspect.ismodule(foo)

Thanks for all the help so far, I really appreciate the clarifications; there is a lot of nuance here. I wasn’t familiar with the trick of setting __class__ at the class level. It appears you’re right that it affects isinstance, but it has some weird behavior:

>>> class A:
...     def foo(self):
...         print("A")
...         
>>> class B:
...     __class__ = A
...     def foo(self):
...         print("B")
...         
>>> B().foo()
B
>>> type(B())
<class '__main__.B'>
>>> isinstance(B(), B)
True
>>> isinstance(B(), A)
True

So B() is considered to be both a B and an A, even though __class__ was reassigned?! And you still get the B version of methods, so it’s as if setting __class__ makes B inherit from A instead of actually making it an A?

I tried looking this up in the docs but the main page that comes up AFAICT doesn’t really explain these details.

This is because isinstance checks both the actual type of the given object and its __class__ attribute.
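
Roughly, for a plain class without a custom __instancecheck__, the check behaves like this sketch (my own approximation, not CPython’s exact code):

def isinstance_like(obj, cls):
    if issubclass(type(obj), cls):  # the object's real type matches
        return True
    claimed = getattr(obj, '__class__', type(obj))
    # Otherwise fall back to whatever the object claims via __class__.
    return claimed is not type(obj) and issubclass(claimed, cls)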

This trick of setting __class__ at the class level to fake an object of a different class is actually employed in quite a few CPython test cases.

The docs do explain what object.__class__ does and give an example of using it to fake an object of a different class.

At the end of the day, as long as you don’t have code that does something extremely explicit and generally discouraged like type(foo) is ModuleType, the trick will work just fine for all real-world use cases as far as I can tell.
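
That is, with the fake module foo from the earlier demo (an illustration of my own):

from types import ModuleType

print(isinstance(foo, ModuleType))  # True, via the __class__ trick
print(type(foo) is ModuleType)      # False: the real type is LoggingModule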