Python sockets and Winerror 10053

ings · August 9, 2023, 2:25pm

Hi,

in my desperation, I’m posting this here in the hope that someone has a solution. This problem has a history of more than a year in our department…

We have multiple Python applications generating WinError 10053 (“An established connection was aborted by the software in your host machine”). All applications use long-lived (several minutes, up to 1 hour) TCP connections to servers (there are Python and also C/C++ servers), and at a random point the client disconnects with the above-mentioned error. Even when the TCP transfers are all exactly the same, the error does not occur at the same position.

One application uses the rpyc protocol. Another one uses a handwritten protocol. What is interesting is that it seems to happen only on Windows. Even running the client in a Linux VM on a Windows host (a host which shows the issue natively) fixes the issue.

What we have tried already:

Disable virus scanner → no effect
Disable firewall → no effect
Tried different versions of rpyc, python → no effect
Run the exact same client on a linux OS (even VM) → error disappears
Port one of the handwritten protocols 1:1 to C → error seems to disappear
Tried to write a small standalone application which reproduces the issue by sending random data over the socket → failed

Searching the internet for this error gives a lot of hits but, unfortunately, not a lot of useful information. Many hints seem to point to antivirus / firewall issues, and this could very well apply here since the machines are all managed by our company’s IT department.

I have no idea how to go forward except trying to port all protocols to C/C++, but obviously this would be a huge effort, and I don’t have any explanation why using sockets in Python should be more unreliable than using sockets in C/C++…

Maybe someone out there has some more hints for me? Thanks in advance.

Rosuav · August 9, 2023, 2:40pm

There’s no reason that doing the same thing in Python or C would have any difference here. However, there are a LOT of ways that you might end up doing something slightly different. The default settings for the sockets might not be the same, buffer sizes could easily vary, etc, etc. So I would be inclined to stick to Python, at least for the time being.
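One way to check whether the Python and C clients really start from the same settings is to print the socket options in both and compare. A minimal sketch for the Python side follows. The helper name describe_socket is made up for illustration, and the list of options is only a guess at what is likely to matter here:

```python
import socket


def describe_socket(sock):
    """Collect settings that commonly differ between two implementations."""
    return {
        "timeout": sock.gettimeout(),
        "SO_RCVBUF": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        "SO_SNDBUF": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "SO_KEEPALIVE": sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE),
        "TCP_NODELAY": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY),
    }


if __name__ == "__main__":
    # A fresh, unconnected socket shows the platform defaults.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        for name, value in describe_socket(s).items():
            print(f"{name}: {value}")
```

Calling the same helper on the real client socket right after connecting, and calling getsockopt for the same options in the C client, would show whether the two actually differ.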

It is somewhat interesting that running in a Linux VM changes matters, but a VM and a host system most commonly work using NAT, so the host system is simply carrying packets, and not managing the connection. Do you have the option to try this in the Windows Subsystem for Linux (WSL)? That would be enlightening.

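And this note I find particularly curious:

“Tried to write a small standalone application which reproduces the issue by sending random data over the socket → failed”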

What do you mean by “failed”? You were unable to get the error? If so, this seems to me like a great avenue of exploration. Try to take your failing app and cut pieces out of it. Does it still fail the same way? Great (in a manner of speaking, of course) - keep cutting pieces out. Does it no longer fail? Then the part you removed might be significant.
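As a starting point for that kind of bisection, here is a rough sketch of a standalone harness that exchanges random blocks with an echo server and reports the first socket error. Everything in it (the function names, the loopback address, the block size) is an assumption for illustration. It is not taken from the failing application, and running over loopback may well hide the problem:

```python
import os
import socket
import threading


def _echo_once(listener):
    """Accept a single connection and echo data until the peer closes."""
    conn, _addr = listener.accept()
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                return
            conn.sendall(data)


def _recv_exactly(sock, size):
    """Read exactly `size` bytes, looping because recv_into may return less."""
    buf = bytearray(size)
    view = memoryview(buf)
    received = 0
    while received < size:
        count = sock.recv_into(view[received:])
        if count == 0:
            raise ConnectionError("peer closed the connection early")
        received += count
    return bytes(buf)


def run_trial(rounds=1000, block_size=4096):
    """Exchange random blocks over loopback.

    Returns (completed_rounds, error); error is None if every round worked.
    """
    completed = 0
    error = None
    with socket.create_server(("127.0.0.1", 0)) as listener:
        port = listener.getsockname()[1]
        server = threading.Thread(target=_echo_once, args=(listener,), daemon=True)
        server.start()
        try:
            with socket.create_connection(("127.0.0.1", port)) as client:
                for _ in range(rounds):
                    payload = os.urandom(block_size)
                    client.sendall(payload)
                    if _recv_exactly(client, block_size) != payload:
                        raise ConnectionError("echoed data does not match")
                    completed += 1
        except OSError as exc:
            error = exc
        server.join(timeout=5)
    return completed, error


if __name__ == "__main__":
    done, err = run_trial()
    print(f"completed {done} rounds, error: {err!r}")
```

If this runs clean, pieces of the real application can be added to it one at a time until the error shows up, which is the same idea as cutting pieces out but from the other direction.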

ings · August 9, 2023, 4:08pm

Hi Chris, thanks for responding.

Unfortunately we can’t use WSL here.

Yes indeed, I was not able to find a simple example which reproduces the issue. And unfortunately, starting from the large application and making that simpler is a highly non-trivial task. I might try that at some time…

I still hope somebody reading this says “hey, I had the same problem and fixed it with simply doing xyz.” …

Rosuav · August 9, 2023, 7:41pm

Oh well, was worth a shot.

Good luck. Unless someone does just happen to recognize the problem off-hand, this is going to be the only way - tedious, but effective.

elis.byberi · August 9, 2023, 8:24pm

Here are some possibilities:

The client attempts to access an open file.
The client has a silent bug in Windows.

Rosuav · August 9, 2023, 8:32pm

Why would client errors result in socket errors? Can you elaborate on that please?

elis.byberi · August 9, 2023, 8:46pm

Perhaps an old CPython bug, or forcibly closing the connection within a silent except clause.

ings · August 11, 2023, 12:32pm

I have found some more insights.

1. The error happens inside a PySide6 GUI application. In this application, we have a pyqtgraph 3D display. When I hide this window (which results in not updating any OpenGL primitives), the error disappears, even if I try to stress the GUI otherwise.
2. The error also disappears with the C implementation, regardless of whether the 3D view is open or not.
3. Because (2) did not show any error, I have ported the C implementation back to Python (because there were some slight interface adaptations due to ctypes usage). Interestingly, this version doesn’t show the error either. I have logged the socket calls and I can’t see any differences between the two Python versions, except that some recv_into calls in the working version are split compared with the failing version. The error happens in a recv_into call. It seems to be just a timing difference between the two versions.
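For anyone who wants to do the same kind of call logging, a thin wrapper along these lines records how long each call took and what it returned, which makes timing differences easier to compare. This is only a sketch: the class name TracedSocket and the logger name are made up, and it is not the logging that was actually used here.

```python
import logging
import socket
import time

log = logging.getLogger("socktrace")


class TracedSocket:
    """Wrap a socket and log recv_into/sendall calls with their duration."""

    def __init__(self, sock):
        self._sock = sock

    def __getattr__(self, name):
        # Anything not wrapped explicitly is forwarded to the real socket.
        return getattr(self._sock, name)

    def _traced(self, name, func, *args):
        start = time.monotonic()
        try:
            result = func(*args)
        except OSError as exc:
            # winerror only exists on Windows, so read it defensively.
            log.error("%s failed after %.6fs: errno=%s winerror=%s",
                      name, time.monotonic() - start, exc.errno,
                      getattr(exc, "winerror", None))
            raise
        log.debug("%s -> %r in %.6fs", name, result, time.monotonic() - start)
        return result

    def recv_into(self, buffer, nbytes=0, flags=0):
        return self._traced("recv_into", self._sock.recv_into,
                            buffer, nbytes, flags)

    def sendall(self, data, flags=0):
        return self._traced("sendall", self._sock.sendall, data, flags)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    left, right = socket.socketpair()
    TracedSocket(left).sendall(b"ping")
    buf = bytearray(4)
    TracedSocket(right).recv_into(buf)
    left.close()
    right.close()
```

Since recv_into may legitimately return fewer bytes than requested, split reads in one version and not the other are expected on their own; the timestamps are the more interesting part of the log.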

komoto48g · August 11, 2023, 2:21pm

Hi, I have the same problem and am still struggling…

In my use case, one thread gets CCD live images from the REST server and displays them using cv2.imshow. Another thread reads the image, does image processing using cv2 and numpy, and sends commands over a socket to the same host.

The problem occurs at random times but is quite reproducible. The program crashes silently with the message [WinError 10053]; sometimes it catches the exception [WinError 10057], and the connection is lost.
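For reference, both numbers are Winsock error codes: 10053 is WSAECONNABORTED and 10057 is WSAENOTCONN. A small sketch of a helper that turns a caught OSError into a readable name for logging is shown below. The helper name and the table are made up for illustration, and the winerror attribute only exists on Windows, so it is read with getattr:

```python
import errno

# Winsock codes relevant to this thread; 10054 is included for comparison.
WINSOCK_NAMES = {
    10053: "WSAECONNABORTED (connection aborted by local software)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10057: "WSAENOTCONN (socket is not connected)",
}


def describe_socket_error(exc):
    """Return a short label for an OSError raised by a socket call."""
    winerror = getattr(exc, "winerror", None)
    if winerror in WINSOCK_NAMES:
        return WINSOCK_NAMES[winerror]
    if exc.errno is not None:
        return errno.errorcode.get(exc.errno, f"errno {exc.errno}")
    return repr(exc)
```

Logging this label in every except OSError block, together with the thread name, would at least show which call on which thread sees the abort first.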

Both host and client are Windows 10 PCs. I have not figured it out, but I’m guessing it’s due to thread-unsafe synchronization.
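If several threads do share one socket, one thing that can be ruled out cheaply is interleaved writes. A minimal sketch, assuming a made-up length-prefixed framing and a hypothetical LockedSender class (neither is from my actual code), serializes the sends with a lock:

```python
import socket
import struct
import threading


class LockedSender:
    """Let several threads send whole messages over one socket safely."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def send_message(self, payload):
        # 4-byte big-endian length prefix, then the payload (assumed framing).
        frame = struct.pack("!I", len(payload)) + payload
        with self._lock:
            # Only one thread at a time may write a frame to the socket.
            self._sock.sendall(frame)


if __name__ == "__main__":
    left, right = socket.socketpair()
    sender = LockedSender(left)
    workers = [
        threading.Thread(target=sender.send_message, args=(bytes([i]) * 8,))
        for i in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    left.close()
    print(len(right.recv(4096)), "bytes received")
    right.close()
```

This only protects the framing on the sending side. Whether it has anything to do with WinError 10053 is an open question, but it removes one possible source of corruption.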