Obtain a Python stack trace when poll() is executed at the C level?

Hello.

I’m working on optimizing some code that’s transferring lots of data intercontinentally.

I ran a test program under “strace --summary-only --summary-wall-clock” and found that the test program is spending over 92% of its time in poll().

I’m not sure poll()'ing is necessary for this, but I’d like to see the context in which it’s being used so I can assess that suspicion.

The module doing the transfer appears to have no calls to Python’s select or poll, so the call could be coming from any of a number of other modules.

Is there a way of getting a Python stack trace and/or Python debugger breakpoint out of the Python code unit that’s causing poll() to execute at the C level?

I tried Googling about this, and got an “AI Summary” with a suggestion. But it looks pretty broken. And the actual hits (below the summary) appear to be talking about related things, not what I’m actually looking for.

Any suggestions?

Thanks!

Are you debugging pure Python code or code that uses a Python C extension? The advice is a little different for the latter.

You can use a plain old C debugger with CPython. You might find using a debug build of CPython to be more ergonomic.

I’m on a Mac so I’m using lldb, but you can do the same thing with gdb:

○  lldb $(pyenv which python) -- -c "import select; select.poll()"
(lldb) target create "/Users/goldbaum/.pyenv/versions/3.12.3/bin/python"
Current executable set to '/Users/goldbaum/.pyenv/versions/3.12.3/bin/python' (arm64).
(lldb) settings set -- target.run-args  "-c" "import select; select.poll()"
(lldb) break set --name poll
Breakpoint 1: where = libsystem_kernel.dylib`poll, address = 0x00000001804116f8
(lldb) c
error: Command requires a current process.
(lldb) r
Process 47403 launched: '/Users/goldbaum/.pyenv/versions/3.12.3/bin/python' (arm64)
Process 47403 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000193b696f8 libsystem_kernel.dylib`poll
libsystem_kernel.dylib`poll:
->  0x193b696f8 <+0>:  mov    x16, #0xe6
    0x193b696fc <+4>:  svc    #0x80
    0x193b69700 <+8>:  b.lo   0x193b69720               ; <+40>
    0x193b69704 <+12>: pacibsp
Target 0: (python) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000193b696f8 libsystem_kernel.dylib`poll
    frame #1: 0x000000010022bd64 select.cpython-312-darwin.so`_select_exec + 172
    frame #2: 0x0000000100b229bc libpython3.12.dylib`PyModule_ExecDef + 164
    frame #3: 0x0000000100bfb4c8 libpython3.12.dylib`_imp_exec_dynamic + 112
    frame #4: 0x0000000100b2133c libpython3.12.dylib`cfunction_vectorcall_O + 336
    frame #5: 0x0000000100bc50b4 libpython3.12.dylib`_PyEval_EvalFrameDefault + 44188
    frame #6: 0x0000000100ad450c libpython3.12.dylib`object_vacall + 292
    frame #7: 0x0000000100ad4344 libpython3.12.dylib`PyObject_CallMethodObjArgs + 104
    frame #8: 0x0000000100bf8db4 libpython3.12.dylib`PyImport_ImportModuleLevelObject + 1268
    frame #9: 0x0000000100bc1464 libpython3.12.dylib`_PyEval_EvalFrameDefault + 28748
    frame #10: 0x0000000100bba1b8 libpython3.12.dylib`PyEval_EvalCode + 288
    frame #11: 0x0000000100c1a9c8 libpython3.12.dylib`run_mod + 168
    frame #12: 0x0000000100c19afc libpython3.12.dylib`PyRun_SimpleStringFlags + 132
    frame #13: 0x0000000100c3c1f8 libpython3.12.dylib`Py_RunMain + 1204
    frame #14: 0x0000000100c3c8fc libpython3.12.dylib`pymain_main + 328
    frame #15: 0x0000000100c3c99c libpython3.12.dylib`Py_BytesMain + 40
    frame #16: 0x0000000193817154 dyld`start + 2476

Also, with gdb you can run py-bt to get a Python stack trace.

I can’t completely rule out a C extension module, but the search is starting in pure Python. This isn’t really my company’s code, but it’s something we’re likely to end up depending on.

Thanks.

For a C extension, the only possibly tricky bit is that you’d need to compile the C extension with debug symbols to get a meaningful stack trace. Many wheels on PyPI don’t include debug symbols.

Without knowing details of the code you are using, I can only guess.

Async code will poll for FDs that are ready to read or write and call handlers to pull or push data as required.

Usually a well-written library will do as much reading or writing as the kernel will allow.
If you are seeing the code in poll() a lot, it could well be waiting on the network.
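
As an illustration, here is a minimal sketch of that readiness pattern using select.poll() directly; real event loops such as asyncio do much more, and the host and request below are just placeholders:

import select
import socket

sock = socket.create_connection(("example.com", 80))  # placeholder host
sock.setblocking(False)

poller = select.poll()
poller.register(sock.fileno(), select.POLLOUT)

pending = b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n"
while pending:
    # Block in poll(2) until the kernel says the socket can take more data.
    for fd, event in poller.poll(60_000):  # timeout in milliseconds
        if event & select.POLLOUT:
            sent = sock.send(pending)  # write as much as the kernel will accept
            pending = pending[sent:]

Each iteration of a loop like that shows up at the C level as a poll() call followed by a read() or write().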

It seems likely that you are using TCP, which can add latency to long-distance transfers, especially if you’re not on a dedicated network. You could explore alternatives like UDP for faster performance.

For some reason:
python3 -c "import select; select.poll()"
…isn’t causing a poll syscall to run for me.

But I went ahead and tried it on my test program, and I did get a poll() from that. 🙂

To get a CPython traceback, for a Python 3.10 binary built with debugging symbols, I did the following:

  • gdb /usr/local/cpython-3.10/bin/python3
  • catch syscall poll
  • run ./my_program --arg1 --arg2
  • source /tmp/python3.10-gdb.py
  • py-bt

Thanks for your kind replies, folks.

Hi Elis.

I think this is an HTTP-based REST API. If there’s a way of doing that over UDP, I’ve not yet heard of it.

Thanks.

My test program’s main loop looks like:

read(3, "<redacted>"..., 16384) = 16384
poll([{fd=4, events=POLLOUT}], 1, 60000) = 1 ([{fd=4, revents=POLLOUT}])
write(4, "<redacted>"..., 16413) = 16413

…over and over again. I’m not sure the poll() part is necessary, and I suspect the polling might be making latency figure into the performance equation more than usual.

HTTP/3 wins unquestionably in intercontinental connections (US East Coast–Germany) with 25% faster download (mean).


Because select.poll() on its own is effectively a no-op: it just constructs the polling object. The poll(2) syscall doesn’t run until you register file descriptors and call the object’s poll() method.
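
For example:

import select
import sys

p = select.poll()                     # just creates the poll object; no syscall yet
p.register(sys.stdin, select.POLLIN)  # still no syscall; only bookkeeping
p.poll(0)                             # this call is what actually issues poll(2)

Running that under strace -e trace=poll should show the syscall.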

You can open up the TCP window and also tune the kernel to allow more outstanding data in the socket buffers. Also, I think you are confusing throughput with latency.
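
For example, a rough sketch of per-socket buffer tuning from the Python side (the 4 MiB value is arbitrary, and the kernel may cap it at net.core.rmem_max / net.core.wmem_max, which have to be raised with sysctl outside Python):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request larger socket buffers before connecting; bigger buffers allow a
# larger TCP window and more in-flight data on high-latency paths.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
sock.connect(("example.com", 443))  # placeholder host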

The code needs to be smarter and try far bigger reads and writes.
It does not seem to be optimised at all.
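
Something along these lines, as a sketch (pump and the 1 MiB chunk size are just illustrative):

def pump(src, dst_sock, chunk_size=1024 * 1024):
    # Copy a binary file object to a connected socket in 1 MiB chunks
    # instead of 16 KiB; sendall() keeps writing until the whole chunk
    # has been handed to the kernel.
    while True:
        data = src.read(chunk_size)
        if not data:
            break
        dst_sock.sendall(data)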

That approach can work well in a controlled environment, but adjusting settings on only one end may not provide significant results due to limitations in the intermediate network.

Only the ends of a TCP connection usually matter.
If there is a content-examining firewall in the middle, then all bets are off.
Intermediates are routing IP and don’t know whether it’s TCP or UDP.
