PEP 669: Low Impact Monitoring for CPython

I don’t know what “vectorcall protocol” means

PEP 590

If I track that information, then I need to map that back to line numbers to produce a report for the user. Is that right?

Yes, you’ll need to do that. code.co_lines() has the offset to line information.
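
For example, here is a minimal sketch of that mapping, assuming the offsets delivered to callbacks fall within the (start, end) ranges that co_lines() yields:

def offset_to_line(code, offset):
    # Map an instruction offset back to a source line using co_lines().
    for start, end, lineno in code.co_lines():
        if start <= offset < end:
            return lineno  # may be None for synthetic instructions
    return None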

Can you say more about why JUMP and BRANCH are more efficient than line-based?

Two reasons.

  1. There are fewer JUMP and BRANCH events than LINE events.
  2. JUMP and BRANCH events map to specific VM instructions, so they can be instrumented more efficiently.

The PEP does not include anything regarding threads.

The PEP makes no mention of threads, because they are not relevant.
Instrumentation is per-interpreter, not per-thread. I’ve added a line to the PEP to make this a bit clearer.

“sys.setprofile() can be made a lot faster by using the API provided by this PEP”

The full sentence has a typo in it, which doesn’t help. I fixed it in the PEP. The full sentence should have read:

However, tools relying on sys.settrace() and sys.setprofile() can be made a lot faster by using the API provided by this PEP.

How is this true? Not because the proposed approach is amazingly fast, but because sys.settrace() and sys.setprofile() are really slow.

How are debuggers supposed to translate that into the provided APIs that receive code objects in a performant way?

I don’t know what “receive code objects in a performant way” means, but if you are asking how one should implement a breakpoint in a way that minimizes performance impact, here is one way:

  • When the debugger is attached, create an empty map of filenames to code objects and an empty map of filenames to uninstrumented breakpoints.
  • When receiving a PY_CALL event:
    • For all breakpoints in the uninstrumented map, if they lie within the code object, insert them. Finding the breakpoints is O(log n), where n is the number of uninstrumented breakpoints per file.
    • Add the code object to the code object map, then return DISABLE.
  • To add a breakpoint:
    • If the code object containing the breakpoint is in the map, use insert_marker() to set the breakpoint.
    • If not in the map, then add the breakpoint to the set of uninstrumented breakpoints.
    • Finding the code object is O(log m) where m is the number of code objects per filename.

Feel free to design your own scheme, but the above scheme is fast enough to implement in Python without noticeable overhead.
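
For concreteness, here is a rough sketch of that scheme in Python. insert_marker() and DISABLE are the draft names from the PEP; the data structures and the line_offsets() helper are hypothetical, and plain dicts and sets stand in for the sorted containers that would give the O(log n) lookups mentioned above:

code_objects = {}  # filename -> code objects seen so far
pending = {}       # filename -> uninstrumented breakpoint lines

def line_offsets(code, line):
    # Hypothetical helper: byte offsets of the instructions for `line`,
    # found by scanning code.co_lines().
    return [start for start, stop, lineno in code.co_lines() if lineno == line]

def on_py_call(code, instruction_offset):
    filename = code.co_filename
    for line in list(pending.get(filename, ())):
        offsets = line_offsets(code, line)
        if offsets:
            for offset in offsets:
                insert_marker(code, offset)  # draft API from the PEP
            pending[filename].discard(line)
    code_objects.setdefault(filename, []).append(code)
    return DISABLE  # no further PY_CALL events needed for this code object

def add_breakpoint(filename, line):
    for code in code_objects.get(filename, ()):
        offsets = line_offsets(code, line)
        if offsets:
            for offset in offsets:
                insert_marker(code, offset)
            return
    pending.setdefault(filename, set()).add(line)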

We don’t believe that is true [that sys.settrace is incompatible with PEP 523]

Are you claiming that all tools using PEP 523 support sys.settrace and sys.setprofile perfectly? Cinder doesn’t. I doubt that any of the debuggers using PEP 523 work flawlessly with pdb. It isn’t even clear what is debugging what.

Rather than hoping for the best, I think it better to just say: “This doesn’t work”.

Could you also add a section outlining how new events can be added in the future if necessary?

I don’t think that makes sense in the PEP.
Future events are likely to come from future language changes, and I have no way to predict how those would be implemented.

Also, I’m not sure whether you are referring to the social or the technical process.
If the social process: a new PEP, or just an issue?
Or do you mean how the CPython source would be changed to support additional events?
If the latter, then it’s no different from any other code change, I guess. Make a PR with the changes.

Although we can more or less understand it from the PEP, it is unclear how a profile function can request granular results

I don’t understand what you mean by “granular results”

a profile function doesn’t want the line number and uses PY_START events

The callback for PY_START events is func(code: CodeType, instruction_offset: int). No line number.
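
So a profiler that only wants call counts never needs line numbers at all. A minimal sketch of such a callback under that signature (how it gets registered depends on the final API, so registration is omitted):

from types import CodeType

call_counts = {}

def on_py_start(code: CodeType, instruction_offset: int):
    # Count calls per code object; line numbers are never computed.
    call_counts[code] = call_counts.get(code, 0) + 1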

how can the API ensure that this information is not calculated if the callback doesn’t need it?

You can’t. Although I am puzzled why any user of the API would worry about the VM doing pointless calculations.

In general the PEP lacks time benchmarks for some common usages like simple coverage, profile or tracing functions. Having time benchmark information is important so we can make an informed decision.

I’m afraid there will be no benchmarks until it is approved, as I’m not willing to implement it until at least conditionally approved.

You could make approval conditional on the performance being good enough. That way I’m not wasting my time implementing this for you to reject it, and you are not accepting it without performance being satisfactory.

Regarding coverage, take a look at Slipcover, which uses instrumentation and is faster with coverage on 3.11 than no coverage at all on 3.10. The instrumentation is a bit fragile, as there is no VM support. With VM support, performance would be even better.

For debuggers, the scheme I described above costs one call into the debugger for each code object (not per call) plus the overhead of the actual breakpoints, and no other overhead.

For profilers, instrumentation will be quicker than sys.setprofile(), but if you care about performance use a statistical profiler 🙂


I hope that clarifies things.

Cinder doesn’t use PEP 523 either, so it’s probably not a relevant example here. (It is true that the Cinder JIT doesn’t support sys.settrace or sys.setprofile at all.)

IIUC, Cinder replaces the entirety of _PyEval_EvalFrameDefault(). So while it may not use PEP 523, it does the equivalent.
I think the same argument also applies to Pyston.

My point is that replacing the _PyEval_EvalFrameDefault() with anything but the most trivial wrapper and correctly supporting sys.settrace() is sufficiently difficult that we might as well just declare it impossible.

I asked about bdb in particular because IDLE has a visual debugger based on a subclass Idb of bdb.Bdb. I realize now that Idb could be rewritten to use this PEP without touching bdb.

From the PEP: " To insert a breakpoint at a given line, the matching instruction offsets should be found from code.co_lines()."

co_lines() is not indexed and is not mentioned in the code object doc. From experiment, co_lines() yields (start, stop, lineno) tuples, where start and stop are byte offsets. I presume the needed instruction offset for each breakpoint line is the start offset of the first tuple with that line number. Is the following correct?

def co_offsets(code):
    line = 1
    lines = [None]  # Dummy since lines are 1-based.
    for start, stop, lineno in code.co_lines():
        # Record the byte offset of the first range for each line.
        # (Assumes every line number appears, in increasing order.)
        if lineno == line:
            lines.append(start)
            line += 1
    return lines

def f(x):
    x += 3
    return x

lines = co_offsets(f.__code__)
# lines[2] and lines[3] are the byte offsets of the first instructions
# for lines 2 and 3 (the exact values vary by Python version).
# lines[3] would be the instruction offset for a breakpoint on the return line (line 3).

The Steering Council accepts PEP 669 (Low Impact Monitoring for CPython).

We’d like to point out that, as with any PEP, if the initial implementation requires design changes, they need to be re-approved and noted in the PEP (see PEP 1).
Please just ping a SC member on the pull request if/when you change the PEP.

To be clear, we expect that all claims in the PEP will hold, and that the PEP will mention all known cases where the implementation degrades performance.

If the changes are unacceptable, the SC may still reject the PEP and ask for the change to be reverted.
(You could say the PEP is “provisionally accepted, pending the implementation” – except that’s how PEP acceptance works in general. In this case, the SC does still have some reservations, but a successful implementation would be the best way to resolve them.)

— Petr, on behalf of the SC

In this case, the SC does still have some reservations

What reservations exactly?
It would make my life easier if I knew what they were.

Regarding performance, are there specific applications or tools that are of concern?

I’ll reply without consulting the rest of the SC. So it’s not an official reply – but also it’s not my personal point of view:

We had long, drawn-out discussions about details, and decided the best we can do at this point is to unblock work on the initial implementation.
To quote PEP 1:

Standards Track PEPs consist of two parts, a design document and a reference implementation. It is generally recommended that at least a prototype implementation be co-developed with the PEP, as ideas that sound good in principle sometimes turn out to be impractical when subjected to the test of implementation.

This PEP doesn’t have a draft implementation. The text technically allows things the SC wouldn’t like. But writing (and approving) a perfect specification isn’t really the point. It’s not productive to discuss issues that you, with the implementation in mind, find trivial. I think it’s better if you start the implementation, so people can start trying it out, reviewing it, discovering any hidden drawbacks – and finding out their fears are unfounded.

We generally expect any issues that come up can be solved (though we do expect some to come up and need solving). Some community members aren’t convinced, but the implementation will probably be the best way to convince them.

Some specific ones:

  • For “Coverage tools can be implemented at very low cost, by returning DISABLE in all callbacks” – hopefully that’ll be possible while preserving the existing functionality of coverage tools.
  • Hopefully sys.settrace() (and thus pdb, bdb, pudb, etc.) won’t become unbearably slow.
  • Hopefully there’ll be an example debugger & profiler to show that the API serves its purpose.
  • Hopefully there aren’t serious downsides hiding in careful wording, or not mentioned at all.

At this point I don’t think it’s worth discussing them, encoding them in the PEP, or finding new ones. “The SC does still have some reservations, but a successful implementation would be the best way to resolve them.” Show us the code :‍)

Thanks. That’s definitely helpful.

The first thing to implement will be sys.settrace() and sys.setprofile(). So we should get an idea of the performance impact on those early on.

The text technically allows things the SC wouldn’t like

I’m happy to tighten up the spec, if there are loopholes.

Now we’re getting fully into “my own opinion” territory:

A lawyer-proof spec would just be extra work for everyone.
I don’t want to look for loopholes in PEP wording. I don’t want to encourage anyone to do that. They’re design documents, not contracts. And neither are they marketing materials. As the PEP author, it’s your responsibility to make known potential downsides obvious (along with other details of the change, of course).

The SC accepts the PEP, with a clarification of what acceptance means. (The extra wording is not only meant for you, but also for the SC itself, and everyone else.)

I’m still concerned that this point hasn’t been addressed:

As Petr says, a PEP doesn’t need to be a complete design document. But where it does include specific details, they should be the right details.

Disabling a branch event after its first firing will make it useless. If I only get low impact monitoring by disabling the event, then I won’t have low-impact branch coverage. Is this OK?

Maybe I’m misunderstanding the centrality of this idea.

Branch coverage was brought up before. We hear your concerns.
OTOH, I get the impression that Mark thinks this isn’t worth discussing – and I think that until the implementation appears, we should just assume that. And then let’s check.

The PEP says:

Coverage tools can be implemented at very low cost, by returning DISABLE in all callbacks.

You could read that as “coverage tools can be made faster, but they can be super mega extra fast if they can disable events they no longer care about.” In fact, the very next sentence is:

For heavily instrumented code, e.g. using LINE, performance should be better than sys.settrace, but not by that much as performance will be dominated by the time spent in callbacks.

So it might be worth it to write extra code to check if all branches in a function were already taken, or something along those lines.
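
For plain line coverage, the pattern is just record-then-disable. A minimal sketch, using the LINE callback signature and DISABLE from the PEP (registration again omitted):

covered = set()

def on_line(code, line_number):
    covered.add((code.co_filename, line_number))
    return DISABLE  # this line is now known to be covered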

I’ve been looking into implementing instrumentation that is directional, but it doesn’t seem feasible without an explosion in the number of instrumented instructions.

This means that you’ll need to maintain a mapping (code, src) → dest for each branch.
If the destination differs from the one already recorded, then both directions of the branch have been taken and monitoring can be disabled for that branch.
Something like this:

taken_branches = {}  # (code, src) -> first destination seen

def branch_callback(code, src, dest):
    if (code, src) in taken_branches:
        if dest != taken_branches[(code, src)]:
            # Both directions have now been seen; stop monitoring this branch.
            del taken_branches[(code, src)]
            return DISABLE
    else:
        taken_branches[(code, src)] = dest

Hello, I am following this PEP from a distance. I think it is a very exciting and interesting proposal. I am particularly interested in the ability to enhance per-opcode instrumentation.

One thing I have noticed with the existing implementation of opcode tracing is that it does not appear to provide any information to the callback about either the operands or the result of the operation being performed. There are some tools that would benefit greatly from access to this information (e.g. tools that perform dataflow analysis).

In principle, it seems like at least some of this information is accessible via the reference to the stack that belongs to the frame object. However, it is not exposed as part of the public API either in the Python code or in the underlying structure. I am wondering whether it would be reasonable to make this information public (even if read-only). In the new implementation, now that the per-opcode event appears to be triggered prior to opcode dispatch, it seems like at least the operands could easily be passed to the callback.

Along similar lines, I am wondering whether there would be any appetite for providing an interface to actually override the handlers for certain opcodes. Of particular interest would be opcodes that implement basic operations like BINARY_OP and FORMAT_VALUE. I believe that this could be done in a way that imposes zero cost on uninstrumented code. There is some precedent for this kind of thing in interpreter implementations for other languages and it would be extremely valuable for certain kinds of monitoring tools.

I am wondering whether either of these proposals would be candidates for another PEP (which I would be willing to author). I am also potentially interested in helping with the implementation for any/all of the above.

Thank you!

Opcode tracing is a fairly niche use case, mainly because it has such high overhead.
Passing more information to the monitoring tool would further increase that overhead.

There is also the question of what the operands and results are.
The instructions in the VM change from version to version, and with PEP 659 can be modified at runtime.
So it is hard to specify what the operands and result are in a way that is usable by tools.

My concern is that the API for this would be large and either constantly changing or obstructive to optimizations.

OOI, what are you wanting to implement with this?

Thank you so much for your response.

OOI, what are you wanting to implement with this?

To boil it down, I want to instrument basic operations (such as +) involving built-in types (such as str).

I realize the use case is fairly niche but it is quite valuable for certain kinds of runtime security analysis. Some recent changes, including the auditing API, seem to suggest that the community is open to changes that would support runtime security tools, and so I am hoping that this or a similar proposal might be considered.

The reason I have followed this particular PEP is because it seems very close to enabling this kind of instrumentation. However not quite enough information is currently available: namely, the operands and result are not publicly available.

Your concerns about performance and flexibility are quite valid. However, I’m wondering whether there is some way for callbacks to opt in to receiving this information, or whether they could access it indirectly via some other API that exposes the current state of the interpreter stack. This would seem to mitigate any incidental performance concern except for callbacks where that data is actually required.

As I mentioned above, some of this information is actually encoded on each frame object but it is not exposed as part of a public API. If that were made a public part of the frame object, it would possibly enable this kind of analysis.

A different option would be to enable the actual action taken by each opcode handler to be overridden. For example, the handler for BINARY_OP_ADD_UNICODE calls PyUnicode_Concat, but it could just as easily call a function pointer add_unicode_handler that is set to PyUnicode_Concat by default but able to be overridden by a public extension API. This is the kind of implementation I have seen in other language interpreters.

Anyway, I really appreciate your time and consideration. Maybe this particular PEP is not the right place for this discussion, but I would really appreciate any advice on whether this kind of thing might be suitable for a different PEP or something else.

Thank you again!

Be aware that the overhead of per-instruction instrumentation is very large, and it is going to get worse (relatively) as we add more optimizations for 3.13 and beyond. (This level of instrumentation effectively kills all optimizations.)
IMO, that is likely to make it useless for all but some highly specialized uses like a reversible debugger, or some sort of GUI for understanding the VM.

If you still think a 10× slowdown or worse is tolerable for your use case, then adding an API to get a view on the evaluation stack is probably the best approach.
It should have negligible cost when not used, but give you all the information you need (I think).

Okay thanks, that’s helpful information.

Would this also apply to the alternate proposal where we simply allow the actual handler called by each opcode (e.g. PyUnicode_Concat) to be overridden? I don’t believe that would suffer from the same problem since the cost incurred would just be the overhead of the new handler itself (which presumably would also call the original function, but we could leave that up to the user). I realize this becomes fairly complicated with specialization but I think the API would be relatively simple.

Or is this all just going to become something like a JIT where there aren’t any real calls at that level anymore?

I’d be happy to draft a PEP along these lines but I’m just trying to make sure that it wouldn’t be immediately rejected.

In order to optimize Python, we are trying to avoid opaque calls into C code.
There are already quite a few hooks into Python execution that we are attempting to coalesce and sidestep, so adding more is not a viable option.
And as you say, future optimizations could remove these calls entirely.

However, once per-instruction monitoring is turned on, we effectively abandon all attempts to optimize anything. So adding per-instruction monitoring and then examining the state of the VM is a viable approach, as we need to support that anyway.

It will be super slow, but (with an appropriate API) you could examine the execution stack for each instruction, and record whatever information you need.

Okay that makes sense. I’m curious to test this out once these particular changes are integrated in order to get a sense of the performance impact (although I realize I still don’t have an API for getting the execution stack).

I’m wondering whether the community is open to discussing other possible mechanisms for supporting runtime dataflow tracing. I think there are a variety of potential solutions (including leveraging the auditing mechanism). Maybe I should start another thread on that topic.