PEP 669: Low Impact Monitoring for CPython

I asked about bdb in particular because IDLE has a visual debugger based on a subclass Idb of bdb.Bdb. I realize now that Idb could be rewritten to use this PEP without touching bdb.

From the PEP: "To insert a breakpoint at a given line, the matching instruction offsets should be found from code.co_lines()."

co_lines is not indexed and is not mentioned in the code object doc. From experiment, co_lines yields (start, stop, lineno) tuples, where start and stop are byte offsets. I presume the needed instruction offset for each breakpoint line is the index of the first tuple with that line. Is the following correct?

def co_offsets(code):
    line = 1
    lines = [None]  # Dummy since lines are 1 based.
    for codeoffset, (start, stop, lineno) in enumerate(code.co_lines()):
        if lineno == line:
            # Record the index of the first tuple for this line.
            lines.append(codeoffset)
            line += 1
    return lines

def f(x):
    x += 3
    return x

lines = co_offsets(f.__code__)
# lines = [None, 0, 1, 5]  where 1, 5 are the offsets of initial load x instructions for lines 2 and 3.
# lines[3] (=5) would be instruction offset for breakpoint on the return line (line 3).

The Steering Council accepts PEP 669 (Low Impact Monitoring for CPython).

We’d like to point out that, as with any PEP, if the initial implementation requires design changes, they need to be re-approved and noted in the PEP (see PEP 1).
Please just ping a SC member on the pull request if/when you change the PEP.

To be clear, we expect that all claims in the PEP will hold, and that the PEP will mention all known cases where the implementation degrades performance.

If the changes are unacceptable, the SC may still reject the PEP and ask for reverting the change.
(You could say the PEP is “provisionally accepted, pending the implementation” – except it’s how PEP acceptance works in general. In this case, the SC does still have some reservations, but a successful implementation would be the best way to resolve them.)

— Petr, on behalf of the SC


In this case, the SC does still have some reservations

What reservations exactly?
It would make my life easier if I knew what they were.

Regarding performance, are there specific applications or tools that are of concern?

I’ll reply without consulting the rest of the SC. So it’s not an official reply – but also it’s not my personal point of view:

We had long, drawn-out discussions about details, and decided the best we can do at this point is to unblock work on the initial implementation.
To quote PEP 1:

Standards Track PEPs consist of two parts, a design document and a reference implementation. It is generally recommended that at least a prototype implementation be co-developed with the PEP, as ideas that sound good in principle sometimes turn out to be impractical when subjected to the test of implementation.

This PEP doesn’t have a draft implementation. The text technically allows things the SC wouldn’t like. But writing (and approving) a perfect specification isn’t really the point. It’s not productive to discuss issues that you, with the implementation in mind, find trivial. I think it’s better if you start the implementation, so people can start trying it out, reviewing it, discovering any hidden drawbacks – and finding out their fears are unfounded.

We generally expect any issues that come up can be solved (though we do expect some to come up and need solving). Some community members aren’t convinced, but the implementation will probably be the best way to convince them.

Some specific ones:
For “Coverage tools can be implemented at very low cost, by returning DISABLE in all callbacks” – hopefully that’ll be possible while preserving existing functionality of coverage tools.
Hopefully sys.settrace() (and thus pdb, bdb, pudb, etc.) won’t become unbearably slow.
Hopefully there’ll be an example debugger & profiler to show that the API serves its purpose.
Hopefully there aren’t serious downsides hiding in careful wording, or not mentioned at all.

At this point I don’t think it’s worth discussing them, encoding them in the PEP, or finding new ones. “The SC does still have some reservations, but a successful implementation would be the best way to resolve them.” Show us the code :)

Thanks. That’s definitely helpful.

The first thing to implement will be sys.settrace() and sys.setprofile(). So we should get an idea of the performance impact on those early on.

The text technically allows things the SC wouldn’t like

I’m happy to tighten up the spec, if there are loopholes.

Now we’re getting fully into “my own opinion” territory:

A lawyer-proof spec would just be extra work for everyone.
I don’t want to look for loopholes in PEP wording. I don’t want to encourage anyone to do that. They’re design documents, not contracts. And neither are they marketing materials. As the PEP author, it’s your responsibility to make known potential downsides obvious (along with other details of the change, of course).

The SC accepts the PEP, with a clarification of what acceptance means. (The extra wording is not only meant for you, but also for the SC itself, and everyone else.)

I’m still concerned that this point hasn’t been addressed:

As Petr says, a PEP doesn’t need to be a complete design document. But where it does include specific details, they should be the right details.

Disabling a branch event after its first firing will make it useless. If I only get low impact monitoring by disabling the event, then I won’t have low-impact branch coverage. Is this OK?

Maybe I’m misunderstanding the centrality of this idea.

Branch coverage was brought up before. We hear your concerns.
OTOH, I get the impression that Mark thinks this isn’t worth discussing – and I think that until the implementation appears, we should just assume that. And then let’s check.

The PEP says:

Coverage tools can be implemented at very low cost, by returning DISABLE in all callbacks.

You could read that as “coverage tools can be made faster, but they can be super mega extra fast if they can disable events they no longer care about.” In fact, the very next sentence is:

For heavily instrumented code, e.g. using LINE, performance should be better than sys.settrace, but not by that much as performance will be dominated by the time spent in callbacks.

So it might be worth it to write extra code to check if all branches in a function were already taken, or something along those lines.

I’ve been looking into implementing instrumentation that is directional, but it doesn’t seem feasible without an explosion in the number of instrumented instructions.

This means that you’ll need to maintain a mapping (code, src) -> dest for each branch.
If the destination differs from the one already recorded, then both branches have been taken and monitoring can be disabled for that branch.
Something like this:

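if (code, src) in taken_branches:
    if dest != taken_branches[(code, src)]:
        # Second distinct destination seen: both branches have been taken.
        del taken_branches[(code, src)]
        return DISABLE
else:
    # First time this branch fires: remember where it went.
    taken_branches[(code, src)] = dest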
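
To try the idea without an interpreter that implements the PEP, here is a self-contained sketch. DISABLE is a stand-in sentinel, on_branch and fully_covered are names invented for illustration, and the (code, src, dest) argument order is an assumption about the callback signature.

```python
DISABLE = object()  # stand-in for the PEP's DISABLE value (assumption)

taken_branches = {}    # (code, src) -> first destination seen
fully_covered = set()  # branches for which both destinations were seen


def on_branch(code, src, dest):
    """Hypothetical BRANCH callback: disable a branch once both ways are seen."""
    key = (code, src)
    if key in taken_branches:
        if dest != taken_branches[key]:
            # A second, different destination: nothing more to learn here.
            del taken_branches[key]
            fully_covered.add(key)
            return DISABLE
    else:
        taken_branches[key] = dest
    return None


# Simulate the events a monitoring implementation might deliver.
code = compile("pass", "<example>", "exec")  # any code object works as a key
results = [
    on_branch(code, 10, 20),  # first firing: destination recorded
    on_branch(code, 10, 20),  # same destination again: keep monitoring
    on_branch(code, 10, 30),  # other destination: branch can be disabled
]
```

Keeping the fully_covered set is a design choice added here so the coverage result survives after the entry is deleted from taken_branches.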

Hello, I am following this PEP from a distance. I think it is a very exciting and interesting proposal. I am particularly interested in the ability to enhance per-opcode instrumentation.

One thing I have noticed with the existing implementation of opcode tracing is that it does not appear to provide any information to the callback about either the operands or the result of the operation being performed. There are some tools that would benefit greatly from access to this information (e.g. tools that perform dataflow analysis).
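
To illustrate what the existing mechanism offers, here is a small sketch using sys.settrace with frame.f_trace_opcodes. The names tracer, add and seen are only for this example. The trace function can work out which instruction is about to run, but it receives neither the operands nor the result.

```python
import dis
import sys

seen = []  # (opname, byte offset) pairs collected by the trace function


def tracer(frame, event, arg):
    if event == "call":
        # Opt this frame in to per-opcode events.
        frame.f_trace_opcodes = True
    elif event == "opcode":
        # Only the frame is available; arg is None for opcode events.
        opcode = frame.f_code.co_code[frame.f_lasti]
        seen.append((dis.opname[opcode], frame.f_lasti))
    return tracer


def add(a, b):
    total = a + b
    return total


previous = sys.gettrace()
sys.settrace(tracer)
try:
    add("x", "y")
finally:
    sys.settrace(previous)  # restore whatever trace function was active before

for name, offset in seen:
    print(offset, name)
```

The opcode names printed vary between CPython versions, which is part of why a stable operand API is hard to specify.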

In principle, it seems like at least some of this information is accessible via the reference to the stack that belongs to the frame object. However, it is not exposed as part of the public API either in the Python code or in the underlying structure. I am wondering whether it would be reasonable to make this information public (even if read-only). In the new implementation, now that the per-opcode event appears to be triggered prior to opcode dispatch, it seems like at least the operands could easily be passed to the callback.

Along similar lines, I am wondering whether there would be any appetite for providing an interface to actually override the handlers for certain opcodes. Of particular interest would be opcodes that implement basic operations like BINARY_OP and FORMAT_VALUE. I believe that this could be done in a way that imposes zero cost on uninstrumented code. There is some precedent for this kind of thing in interpreter implementations for other languages and it would be extremely valuable for certain kinds of monitoring tools.

I am wondering whether either of these proposals would be candidates for another PEP (which I would be willing to author). I am also potentially interested in helping with the implementation for any/all of the above.

Thank you!

Opcode tracing is a fairly niche use case, mainly because it has such high overhead.
Passing more information to the monitoring tool would further increase that overhead.

There is also the question of what are the operands and results.
The instructions in the VM change from version to version, and with PEP 659 can be modified at runtime.
So it is hard to specify what the operands and result are, in a way that is useable by tools.

My concern is that the API for this would be large and either constantly changing or obstructive to optimizations.

OOI, what are you wanting to implement with this?

Thank you so much for your response.

OOI, what are you wanting to implement with this?

To boil it down, I want to instrument basic operations (such as +) involving built-in types (such as str).

I realize the use case is fairly niche but it is quite valuable for certain kinds of runtime security analysis. Some recent changes, including the auditing API, seem to suggest that the community is open to changes that would support runtime security tools, and so I am hoping that this or a similar proposal might be considered.

The reason I have followed this particular PEP is because it seems very close to enabling this kind of instrumentation. However not quite enough information is currently available: namely, the operands and result are not publicly available.

Your concerns about performance and flexibility are quite valid. However, I’m wondering whether there is some way for callbacks to opt in to receiving this information, or whether they could access it indirectly via some other API that exposes the current state of the interpreter stack. This would seem to mitigate any incidental performance concern except for callbacks where that data is actually required.

As I mentioned above, some of this information is actually encoded on each frame object but it is not exposed as part of a public API. If that were made a public part of the frame object, it would possibly enable this kind of analysis.

A different option would be to enable the actual action taken by each opcode handler to be overridden. For example, the handler for BINARY_OP_ADD_UNICODE calls PyUnicode_Concat, but it could just as easily call a function pointer add_unicode_handler that is set to PyUnicode_Concat by default but able to be overridden by a public extension API. This is the kind of implementation I have seen in other language interpreters.

Anyway, I really appreciate your time and consideration. Maybe this particular PEP is not the right place for this discussion, but I would really appreciate any advice on whether this kind of thing might be suitable for a different PEP or something else.

Thank you again!

(edited due to accidentally publishing too early)

Be aware that the overhead of per-instruction instrumentation is very large and going to get worse (relatively) as we add more optimizations for 3.13 and beyond. (This level of instrumentation effectively kills all optimizations.)
IMO, that is likely to make it useless for all but some highly specialized uses like a reversible debugger, or some sort of GUI for understanding the VM.

If you still think a 10x slowdown or worse is tolerable for your use case, then adding an API to get a view on the evaluation stack is probably the best approach.
It should have negligible cost when not used, but give you all the information you need (I think).

Okay thanks, that’s helpful information.

Would this also apply to the alternate proposal where we simply allow the actual handler called by each opcode (e.g. PyUnicode_Concat) to be overridden? I don’t believe that would suffer from the same problem since the cost incurred would just be the overhead of the new handler itself (which presumably would also call the original function, but we could leave that up to the user). I realize this becomes fairly complicated with specialization but I think the API would be relatively simple.

Or is this all just going to become something like a JIT where there aren’t any real calls at that level anymore?

I’d be happy to draft a PEP along these lines but I’m just trying to make sure that it wouldn’t be immediately rejected.

In order to optimize Python we are trying to avoid opaque calls into C code.
There are already quite a few hooks into Python execution that we are attempting to coalesce and side step, so adding more is not a viable option.
And as you say, future optimizations could remove these calls entirely.

However, once per-instruction monitoring is turned on, we effectively abandon all attempts to optimize anything. So adding per-instruction monitoring then examining the state of the VM is a viable approach, as we need to support that anyway.

It will be super slow, but (with an appropriate API) you could examine the execution stack for each instruction, and record whatever information you need.

Okay that makes sense. I’m curious to test this out once these particular changes are integrated in order to get a sense of the performance impact (although I realize I still don’t have an API for getting the execution stack).

I’m wondering whether the community is open to discussing other possible mechanisms for supporting runtime dataflow tracing. I think there are a variety of potential solutions (including leveraging the auditing mechanism). Maybe I should start another thread on that topic.

Thanks for the reply. Is it true that all branches only have two destinations? There are none with more than two? I see what you are saying: once we know that all destinations have been reached, we can disable the event. I’m just wondering if it is true that we can know what all the destinations are during runtime (that is, before code analysis).

There is one final open question about PEP 669: how to handle the visibility of code belonging to one tool to other tools.

The PEP states that:

Events are suspended in callback functions and their callees for the tool that registered that callback.

That means that other tools will see events in the callback functions for other tools. This could be useful for debugging a profiling tool, but would produce misleading profiles, as the debugger tool would show up in the profile.

This doesn’t work for two reasons:
First, it breaks sys.setprofile and sys.settrace support, as code executed in profiling functions is not visible in tracing functions, and vice versa.
Second, it is really confusing. Adding a tool can make a mess of the events seen by other tools.
What happens if you do want to profile the debugger (in the above example) or just want to debug the program and want to ignore the profiler?

What I propose is that tools can only see events for the code in other tools if they explicitly request it.

The new API would be:

def set_tool_insight(tool: int, tool_set: int)->None

This would give the tool with ID tool insight into the tools in tool_set.
A ValueError will be raised if tool is in tool_set.
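
A pure-Python model of how this might behave, as a sketch only. It assumes tool_set is a bitmask in which bit i stands for tool ID i; get_tool_insight, can_see and the example IDs are hypothetical and not part of the proposal.

```python
_insight = {}  # tool id -> bitmask of tool ids whose code it may observe


def set_tool_insight(tool: int, tool_set: int) -> None:
    """Model of the proposed call; tool_set is assumed to be a bitmask."""
    if tool_set & (1 << tool):
        raise ValueError("a tool cannot be given insight into itself")
    _insight[tool] = tool_set


def get_tool_insight(tool: int) -> int:
    """Hypothetical getter: the bitmask of tools that `tool` can see."""
    return _insight.get(tool, 0)


def can_see(observer: int, owner: int) -> bool:
    """True if events in code run by owner's callbacks are visible to observer."""
    return bool(get_tool_insight(observer) & (1 << owner))


DEBUGGER, PROFILER = 0, 2  # example tool IDs, chosen arbitrarily

# The debugger explicitly asks to see the profiler's code.
set_tool_insight(DEBUGGER, 1 << PROFILER)
debugger_sees_profiler = can_see(DEBUGGER, PROFILER)
profiler_sees_debugger = can_see(PROFILER, DEBUGGER)
```

With this model, visibility is off by default and one-directional: the profiler never sees the debugger unless it asks for that itself.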

Can anyone see any problems with this approach?

Is it true that all branches only have two destinations?

Yes, all branches that produce a BRANCH event have two destinations.
(Single destination branches are called jumps and produce JUMP events).

If we add multiple destination branches (which is unlikely but possible), then we would add a new event kind. If this ever happens, it will be clearly documented.

I’m just wondering if it is true that we can know what all the destinations are during runtime

If you have seen two destinations for a BRANCH, then you have seen them all. So, if you know that a branch has been reached you can tell whether all destinations have been reached with just a count.

Sounds good. Consider also adding a getter – introspection is always nice to have.
