PEP 669: Low Impact Monitoring for CPython

Given the absolute silence when I announced this PEP on Python-dev, I thought I’d try again here.

Any comments or thoughts?

Feel free to comment here, or on python-dev.


This looks promising. I have a dedicated library for in-production runtime profiling, but the goal of that library is more high-level than this PEP: I suspect monitoring every function call, then checking if the called function is in the set of functions I want to monitor, could still lead to a significant performance penalty.
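Concretely, the pattern I mean looks like this (a minimal runnable sketch; WATCHED and record_call are my own names):

    import sys

    WATCHED = set()   # code objects we actually want to profile

    def record_call(code):
        print(f"called {code.co_name}")

    def profiler(frame, event, arg):
        # This still fires for *every* call event in the process, which
        # is where the overhead comes from, even if WATCHED is tiny.
        if event == "call" and frame.f_code in WATCHED:
            record_call(frame.f_code)

    def interesting():
        pass

    WATCHED.add(interesting.__code__)
    sys.setprofile(profiler)
    interesting()        # recorded
    sys.setprofile(None)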

I usually refer to callbacks that are invoked when events occur as “event handlers”.

Do you expect tools to update their event handlers for newer versions of Python, or are you guaranteeing that the required event handler signature stays the same in newer versions (or uses some dynamic argument-passing via inspection)? If not, would you consider some struct / named-tuple / slotted-dataclass as the only argument (or code, struct as two arguments) for all handlers going forward?
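For illustration, something like this entirely hypothetical shape is what I have in mind (names made up by me):

    from dataclasses import dataclass
    from types import CodeType

    # Hypothetical: a single slotted event object, so the handler
    # signature never needs to change when new fields are added.
    @dataclass(slots=True)
    class CallEvent:
        code: CodeType
        instruction_offset: int

    def handler(event: CallEvent) -> None:
        print(event.code.co_name, event.instruction_offset)

    handler(CallEvent(handler.__code__, 0))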

You say that “the callee’s frame will be on the stack”, suggesting that the event handler will have access to that frame (and the rest of the stack). Does that mean the event handler’s (invocation’s) frame will be on the same stack (as the next frame)? That seems to be the case judging by the event handler’s signature. This would mean exceptions raised in the handler could bubble up to the user, and look like the user is partly responsible (as their code is part of the traceback). Alternatively, I suggest running the event handler on its own new stack and passing the stack/frame in to the handler (or you could just convert exceptions raised in handlers into warnings).

I read PEP 669 and PEP 659. I found them to be quite amazing and really useful. On the point of PEP 669, I think this is very useful and much more efficient.
I have always found a great need for profiling code that is running in production instead of profiling code over simulated traffic. However, there is always some subtle overhead one incurs even when using sys.setprofile() and sys.settrace(). So if this PEP proposes a more powerful way to implement profiling of production code, I’d gladly take it up.

PEP 659 is also something I’m excited about.

The API will be fixed. Event handlers that work for one version should continue to work indefinitely.
We might add new events, but the old ones would still work.

The event handler will be on the same stack, and any errors in the event handler would propagate up to the user.
Tool writers might need to take care not to raise exceptions, and to report errors by other means.
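For example, a tool could wrap its handler defensively (a sketch, not part of the PEP; the names are illustrative):

    import logging

    logger = logging.getLogger("mytool")

    def safe_callback(callback):
        # Wrap a tool's event handler so that a bug in the tool is
        # logged rather than raised into the user's code.
        def wrapper(*args):
            try:
                return callback(*args)
            except Exception:
                logger.exception("error in monitoring callback")
        return wrapper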

Very interesting! I maintain pyinstrument, a CPython profiler. I think this could improve performance quite a bit, and provide a nicer API too :slight_smile:

A couple of questions on this:

  1. The PEP says

To register a callable for events call:

sys.monitoring.register_callback(event, func)

Functions can be unregistered by calling sys.monitoring.register_callback(event, None).

So does this mean that multiple sys.monitoring users can operate at the same time? E.g. a profiler and a debugger, or multiple profilers at the same time? That would be good. I’ve had cases where users accidentally start multiple profilers at the same time, and with the sys.setprofile API they just silently overwrite each other, which is not good! (See the sketch after this list.)

  2. Should the PEP include a C API like PyEval_SetProfile? I have found that C API to be more performant, presumably because a C function call is a lot cheaper than a Python one.

  3. Why is it necessary to call both sys.monitoring.set_events and register_callback? Could the set of active events instead be computed from the active callbacks that are registered?
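To illustrate the silent-overwrite problem from question 1 (runnable as-is with today’s sys.setprofile):

    import sys

    def profiler_a(frame, event, arg):
        ...   # tool A's handling

    def profiler_b(frame, event, arg):
        ...   # tool B's handling

    sys.setprofile(profiler_a)   # tool A starts profiling
    sys.setprofile(profiler_b)   # tool B silently replaces A; A gets no more events
    sys.setprofile(None)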

  1. Only one callback object can be registered at once.

It is the responsibility of the tools not to trample on each other. You could make a wrapper to pass events to multiple callbacks.
I’ll update the API so that register_callback returns the previously registered callback, so tools will at least know if they need to cooperate (see the sketch after this list).

  2. Registering callbacks is an infrequent operation. Its performance is unimportant. It is the performance of the callback objects that matters. You can implement the callback object in any language you like, as long as it is callable.

  3. There is no reason to couple registering callback objects with turning events on and off. It would reduce flexibility for no obvious benefit. Specifically, we want to be able to turn events on and off for individual code objects.
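Sketching the wrapper idea from point 1, assuming register_callback returns the previous callback as described (the signature is the draft one quoted above; add_callback is an illustrative name):

    import sys

    def add_callback(event, callback):
        # Register our callback and chain to whatever was registered
        # before us, relying on register_callback returning the
        # previously registered callback.
        def chained(*args):
            if previous is not None:
                previous(*args)
            return callback(*args)

        previous = sys.monitoring.register_callback(event, chained)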

Two questions:

  1. Has there been discussion/consideration of registering handlers on a code object? If I am reading correctly, I can enable events for a specific code object, but I can only register a global handler.

Where we would find this useful is writing instrumentation wrapping code for specific functions, where the behavior of the handler would be different per-function.

I can definitely build the desired behavior with the provided API, but per-object handlers would make things much easier.

  2. The draft says the overhead will be much lower than with the existing settrace and setprofile. Do you have any benchmark data to show the improvement?

If it helps for perspective: most APM vendors use wrapt to create a bunch of function wrappers for tracing applications. These can be quite heavy and intrusive. If the performance overhead of this API is at least similar to that of function wrappers, it could make our lives easier.

I’ve not considered per-code-object handlers, but I think the overhead would be prohibitive.
There is nothing stopping you from dispatching to your per-code-object handlers from a global handler, but we don’t want to take the performance hit in general.
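Something along these lines would work (a sketch; handlers and global_callback are illustrative names, and the (code, instruction_offset) signature is the draft one):

    # Dispatch from one global callback to per-code-object handlers.
    handlers = {}   # code object -> handler

    def global_callback(code, instruction_offset):
        handler = handlers.get(code)
        if handler is not None:
            handler(code, instruction_offset)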

The performance impact should be less than that of wrapt, but probably not by much, as most of the overhead is in recording the tracing information.

Yeah, we can definitely make it work with a global handler. If there were a clear/performant way to do this per code object, please mark me as interested :slight_smile:

:+1: while wrapt does have its own performance implications, it definitely isn’t the largest source of performance overhead from tracing libraries.

Apologies for the slow follow up Mark. Thanks for the responses so far.

So, yes, pyinstrument does currently employ a wrapper so that multiple profilers can run concurrently. However, I was thinking more about the ability for multiple consumers to operate concurrently, e.g. a profiler and a debugger, or multiple different profilers. Returning the previously registered callback might help in a couple of cases, but honestly, it’s not a great solution, because the order of registering/unregistering matters, e.g.:

◀───────────────────────────────────────▶  Profiler 1

           ◀───────────────────────────────────────▶  Profiler 2

           │                            │
           Profiler 2 wraps             Profiler 1, unaware that it's been wrapped, calls
           profiler 1, storing its      "sys.monitoring.register_callback(event, None)",
           callback and passing         unregistering both profilers.
           through events when
           it's called.

Perhaps a more robust solution would be to maintain a list of registered callbacks for each event, and add a function sys.monitoring.unregister_callback to remove things from the list. But do bear in mind that I am blissfully unaware of the complexity of how all this stuff works, so please feel free to dismiss this idea as unworkable!
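In other words, something with this shape (entirely hypothetical, not in the PEP):

    _callbacks = {}   # event -> list of registered callbacks

    def register_callback(event, func):
        _callbacks.setdefault(event, []).append(func)

    def unregister_callback(event, func):
        # Each tool removes only its own callback, so tools can't
        # accidentally unregister each other.
        _callbacks[event].remove(func)

    def dispatch(event, *args):
        for func in _callbacks.get(event, ()):
            func(*args)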

Yes, that makes sense. So I suppose the bit I’m not understanding is how I would register a C function pointer as the callback for an event?

Ahhh, I missed the code-object-level setting of these events. That’s neat! Follow-on question then - the PEP says that “No events are active by default.” - is there a reason to default to all off rather than all on? Perhaps performance?


I’d very much like to have this implemented as I think it’ll be really nice for debuggers.

There are 2 things that I feel are missing from the PEP though:

  1. The PEP should state that changing frame.f_lineno inside one of those callbacks needs to be possible. Changing frame.f_lineno is an important feature for debuggers, and right now it can only be done inside a tracing callback; it’s the reason the debugger needs to fall back to tracing after hitting a programmatic breakpoint with PEP 523, and it seems that fallback won’t even be possible with PEP 669, so it would be a blocker for adoption (see the sketch after this list).

  2. The semantics regarding stack-overflow exceptions should be made clear. Right now tracing is disabled when sys.getrecursionlimit() is reached (which is awful), and it’s even worse when the limit is reached inside the tracing call, which usually crashes CPython. Granted, if a callback were invoked because of a stack overflow and then overflowed the stack itself, that wouldn’t help much, so it would be nice to know how the stack-overflow semantics would work in this case.
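For reference on point 1, this is the kind of thing that must keep working; a minimal runnable example of a line jump under sys.settrace (target and tracer are my names):

    import sys

    def target():
        print("line A")   # the jump below skips this line
        print("line B")

    FIRST = target.__code__.co_firstlineno

    def tracer(frame, event, arg):
        if frame.f_code is not target.__code__:
            return None   # don't trace other frames
        if event == "line" and frame.f_lineno == FIRST + 1:
            # Jumping is only permitted from within a 'line' trace
            # callback, which is why the debugger falls back to tracing.
            frame.f_lineno = FIRST + 2   # jump straight to "line B"
        return tracer

    sys.settrace(tracer)
    target()              # prints only "line B"
    sys.settrace(None)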

Apart from that, it’d be really nice if this could be added to Python, as I believe it’s a great improvement over the status quo, and I’ll definitely upgrade pydevd/debugpy to use PEP 669 (if you have early builds with this available, we can work together to iron things out against a real-world use case with pydevd/debugpy).

@markshannon I’m also interested in the questions from @joerick: supporting multiple registered callbacks, instead of forcing tools to cooperate, could be nicer. The current PEP is on par with what’s available now, but it’s not uncommon to get reports from users that the debugger didn’t work because coverage was being used. In practice that cooperation is hard to come by: it’s hard enough for the debugger to set up its own tracing, and returning the last callback isn’t enough when tools are competing over what should be traced, as those things will definitely change during a debugging session.

I think the only real way to make it work would be a different API where whoever registers provides its own id, so that things are contextualized to it, or something close to that. But then again, adding a function call for each function call will make things slower, so we’ll probably just end up with the current status quo, where only one user may be active at any time, as clients will definitely trample over each other with the available API.

You won’t be able to register a C function pointer, but you can implement the PEP 590 vectorcall interface on the callable, for performance close to that of a raw function pointer.


Fabio,
Addressing your points:

  1. It was my intention that anything you can do in a sys.settrace callback, you can also do in a PEP 669 callback. I’ll make that clearer in the PEP.
  2. Our handling of stack overflows is a bit of a mess because we rely on a counter to detect both C stack overflow and runaway recursion.
    Once we handle the two separately, then we can do something sensible when sys.getrecursionlimit() is reached, like temporarily raising the limit.
    We probably wouldn’t do that for C stack overflows, as we don’t want to crash the VM.

Regarding multiple tools being active at the same time.
I don’t see any fundamental reason not to handle multiple callbacks in the interpreter, but it does add some complexity.
In general, I’d like to keep things as simple as possible. It is not easy to make this both fast and reliable.


The latest version of PEP 669 is up: PEP 669 – Low Impact Monitoring for CPython | peps.python.org
Now with the following new features:

  • Up to 6 different tools can be active at once. Useful if you need to debug your coverage tool.
  • Ability to drop events for a particular point in the code, once seen. This allows near-zero-overhead coverage tools and debuggers.
  • Can operate at the same time as sys.setprofile() and sys.settrace(). I don’t know why anyone would want this; it just falls out of the new design.
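For instance, near-zero-overhead line coverage looks roughly like this under the new API (a sketch; COVERAGE_ID, events.LINE, DISABLE, and the function signatures follow the updated PEP, so check the PEP for the exact details):

    from sys import monitoring

    TOOL = monitoring.COVERAGE_ID          # one of the predefined tool ids
    covered = set()

    def line_callback(code, line_number):
        covered.add((code.co_filename, line_number))
        # Drop further events for this exact point once it has been seen:
        return monitoring.DISABLE

    monitoring.use_tool_id(TOOL, "my-coverage")
    monitoring.register_callback(TOOL, monitoring.events.LINE, line_callback)
    monitoring.set_events(TOOL, monitoring.events.LINE)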

Let me know what you think.


Suggestion: in the tool id section, note that ids 6 and 7 are reserved for sys.settrace and sys.setprofile (discussed in the backwards compatibility section).

The bdb and profile modules are based on sys.settrace and sys.setprofile. Do you intend that they be rewritten as id-0 and id-2 tools? Would they run faster? Doing so would exercise sys.monitoring and serve as examples. (But this would lessen the testing of sys.settrace and sys.setprofile. :wink:)

"This outlines the proposed implementation for CPython 3.11. " /11/12/

I’m reluctant to touch either bdb or profile. I think it would be better to work with third-party tools that are actively maintained and updated to get them to use the new API.

I’ve fixed the version number reference, thanks.

I was rereading the latest version. I think having a tool_id is a nice addition.

Still, I just wanted to point out that the part which says:

This makes sys.settrace and sys.setprofile incompatible with PEP 523. Arguably, they already were, as the author knows of no PEP 523 plugin that supports sys.settrace or sys.setprofile correctly. This PEP merely formalizes that.

Isn’t really correct… The pydevd/debugpy debugger does use PEP 523: it uses it as a hook to change the code object to add programmatic breakpoints, then proceeds to call the original eval implementation (what the programmatic breakpoint mainly does is enable sys.settrace after it’s hit and generate a spurious line event). So this PEP is actually breaking that mode of operation, which probably makes PEP 523 unusable by debuggers. That may be OK, as hopefully using PEP 669 will make it possible to be faster without resorting to PEP 523, but I think that paragraph should be reworded to account for the breakage.

Another note:

I think this PEP still doesn’t make it clear that changing frame.f_lineno inside those callbacks needs to be possible.

This isn’t supporting sys.settrace, IMO. This is using both PEP 523 and sys.settrace. Does the pydevd debugger use only PEP 523 for debugging in a way that allows coverage.py to use sys.settrace?

Does the pydevd debugger use only PEP 523 for debugging in a way that allows coverage.py to use sys.settrace?

No.

The debugger uses PEP 523 to change the bytecode to add a programmatic breakpoint (it just uses it as a hook where the frame’s code can still be changed) and calls sys.settrace afterwards, when the programmatic breakpoint is hit. So it should run untraced until a breakpoint is hit, but it still incurs a minimal per-call penalty to check whether the frame’s bytecode must be changed, and it runs with tracing enabled afterwards.

As a note, I don’t think it’s important that it does work (because presumably PEP 669 will have faster/better support for what’s needed in the debugger than PEP 523), but I’d like to know whether this will actually be the case.