I’m working on optimizing some code that’s transferring lots of data intercontinentally.
I ran a test program under “strace --summary-only --summary-wall-clock” and found that the test program is spending over 92% of its time in poll().
I’m not sure poll()'ing is necessary for this, but I’d like to see the context in which it’s being used so I can assess that suspicion.
The module doing the transfer appears to have no calls to Python select or poll, so the call could be coming from any of several other modules.
Is there a way of getting a Python stack trace and/or Python debugger breakpoint out of the Python code unit that’s causing poll() to execute at the C level?
I tried Googling this and got an “AI Summary” with a suggestion, but it looks pretty broken, and the actual hits below the summary appear to be about related things rather than what I’m actually looking for.
I can’t completely rule out a C extension module, but the search is starting in pure Python. This isn’t really my company’s code, but it’s something we’re likely to end up depending on.
For a C extension, the only possibly tricky bit is that you’d need to compile it with debug symbols to get a meaningful stack trace. Many wheels on PyPI don’t include debug symbols.
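If it does turn out to be pure Python, one low-effort way to see which Python code is sitting in poll() is to have the process dump its own stacks on demand via the standard faulthandler module. A rough sketch, assuming you can add a couple of lines to the program’s startup (the choice of SIGUSR1 is arbitrary):

```python
import faulthandler
import signal

# Dump every thread's Python stack to stderr when the process receives SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# ... start the transfer as usual ...
# Then, from another shell, while strace shows the process sitting in poll():
#     kill -USR1 <pid>
# The Python call stack that led into poll() is written to stderr.
```

Sending the signal a few times while the transfer is running should show whether the same call site is responsible every time.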
Without knowing details of the code you are using, I can only guess.
Async code will poll for FDs that are ready to read or write and call handlers to pull or push data as required.
Usually a well-written library will do as much reading or writing as the kernel will allow.
If you are seeing the code in poll a lot then it could well be waiting on the network.
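For illustration, that pattern looks roughly like this at the socket level; the function names and chunk sizes are placeholders, not taken from any particular library:

```python
import select
import socket

def drain(sock: socket.socket, chunk_size: int = 1 << 20) -> bytes:
    """Read everything the kernel currently has buffered on a non-blocking socket."""
    buf = bytearray()
    while True:
        try:
            data = sock.recv(chunk_size)
        except BlockingIOError:
            break                    # kernel buffer empty; go back to waiting
        if not data:
            break                    # peer closed the connection
        buf += data
    return bytes(buf)

def receive_loop(sock: socket.socket) -> None:
    sock.setblocking(False)
    p = select.poll()
    p.register(sock.fileno(), select.POLLIN)
    while True:
        p.poll()                     # time spent blocked here is time waiting on the network
        chunk = drain(sock)
        if not chunk:
            break                    # nothing left: the peer has finished sending
        # ... hand `chunk` to whatever consumes the data ...
```

If poll() still dominates the wall clock even with reads this greedy, the process really is just waiting on the network rather than burning time in the polling itself.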
You are probably using TCP, which can add latency on long-distance transfers, especially if you’re not on a dedicated network; UDP-based alternatives may be worth exploring for better performance.
…over and over again. I’m not sure the poll() part is necessary, and I suspect the polling may be making latency figure into the performance equation more than usual.
Because it’s a no-op if you do not do the setup that poll requires.
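For reference, the setup poll requires at the Python level is registering the file descriptors and events you care about before waiting; a minimal example, with a pipe standing in for a socket:

```python
import os
import select

r_fd, w_fd = os.pipe()               # stand-in for a real socket fd
p = select.poll()
p.register(r_fd, select.POLLIN)      # poll() reports nothing for unregistered fds
os.write(w_fd, b"x")
print(p.poll(1000))                  # -> [(r_fd, select.POLLIN)] well within 1000 ms
```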
You can open up the TCP window and also tune the kernel to allow more outstanding data in the sockets. Also, I think you are confusing throughput with latency.
The code needs to be smarter and try far bigger reads and writes.
It does not seem to be optimised at all.
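For concreteness, here is roughly what that looks like from the Python side; the buffer size and endpoint are placeholders, and on Linux the kernel caps explicitly requested buffer sizes at net.core.rmem_max / net.core.wmem_max:

```python
import socket

# Assumed 4 MiB; size this to your bandwidth-delay product, e.g. at 100 Mbit/s
# with a 200 ms round trip you need about 12.5 MB/s * 0.2 s = 2.5 MB in flight.
BUF_SIZE = 4 * 1024 * 1024

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request larger socket buffers *before* connecting so the TCP window can grow.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
sock.connect(("example.com", 443))   # placeholder endpoint

# ...and read in far bigger chunks than a few kilobytes at a time.
data = sock.recv(1 << 20)            # 1 MiB per recv
```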
That approach can work well in a controlled environment, but adjusting settings on only one end may not help much if the intermediate network is the limiting factor.
Only the ends of a TCP connection usually matter.
If there is a content-examining firewall in the middle then all bets are off.
Intermediate hops are routing IP packets and don’t know whether it’s TCP or UDP.