A question about multithreading overhead in Python

I'm working on a project with TensorRT and Python. Don't worry, this question may not involve any GPU computation. I think that for this purpose TensorRT can be treated like NumPy, since both packages release the GIL when execution leaves Python. I've already done a lot of work to get the TensorRT part working, so hopefully this question does not depend much on TensorRT itself.
First, multithreading does work for TensorRT + Python, since TensorRT releases the GIL during its main execution function. But I found that in some cases it doesn't help. The work still runs in parallel, but each thread carries a large overhead, so threading only pays off when the execution takes long enough for that overhead to be negligible. Maybe the right choice is to switch to C++, but I'd really like to optimize this further and stick with Python.
I set up an experiment to see how much overhead there is when the number of subthreads is only 1, and I used yappi to profile it. Here is some code. It's not my real code, but it should be enough. I don't show the run_trt_execution function because I feel it's unnecessary; if anyone needs to know more about the TensorRT side to answer my question, I'm happy to share what little I know about it.

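import threading
import yappi

# Case 1: call the function directly on the main thread.
yappi.start()
run_trt_execution(*args)  # unpack args so both cases make the same call
yappi.stop()
yappi.get_func_stats().print_all()
yappi.clear_stats()  # keep the first run's stats out of the second run

# Case 2: run the same call in a single subthread and wait for it.
yappi.start()
cur_thread = threading.Thread(target=run_trt_execution, args=args)
cur_thread.start()
cur_thread.join()
yappi.stop()
yappi.get_func_stats().print_all()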
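For anyone who wants to reproduce the comparison without TensorRT, here is a minimal stdlib-only sketch of the same experiment. `fake_execution` is a hypothetical stand-in for run_trt_execution: it just calls `time.sleep`, which releases the GIL while it waits, so it only approximates a native call and says nothing about TensorRT itself. The function names and durations are my own choices for illustration.

```python
import threading
import time


def fake_execution(duration_s):
    """Hypothetical stand-in for run_trt_execution: sleeping releases the GIL."""
    time.sleep(duration_s)


def time_direct(duration_s, repeats):
    """Average wall time per call when running on the current thread."""
    start = time.perf_counter()
    for _ in range(repeats):
        fake_execution(duration_s)
    return (time.perf_counter() - start) / repeats


def time_threaded(duration_s, repeats):
    """Average wall time per call when each call gets one fresh subthread."""
    start = time.perf_counter()
    for _ in range(repeats):
        t = threading.Thread(target=fake_execution, args=(duration_s,))
        t.start()
        t.join()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    # Vary the amount of work to see when the per-thread cost stops mattering.
    for duration_s in (0.0005, 0.005, 0.05):
        direct = time_direct(duration_s, repeats=20)
        threaded = time_threaded(duration_s, repeats=20)
        print(f"work={duration_s:.4f}s direct={direct:.6f}s "
              f"threaded={threaded:.6f}s diff={threaded - direct:.6f}s")
```

Averaging over several repeats matters here: a single start/join pair is short enough that scheduler noise can dominate one measurement.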

yappi can record both CPU time and wall time, so I ran the experiment with each clock type for both settings (single subthread and direct call).
Here are the logs, which are rather long:

cpu multithread profile ______________________________+++++++++++++++++++++++++++++++++++++++==
name                                  ncall  tsub      ttot      tavg
..hon3.8/threading.py:859 Thread.run  1      0.000005  0.008511  0.008511
..tc_multithread.py:44 run_inference  1      0.006554  0.008506  0.008506
..da/cuda.py:244 DeviceArray.copy_to  1      0.000026  0.001400  0.001400
..rter.py:159 LazyModule.__getattr__  5      0.000014  0.001319  0.000264
..aphy/mod/importer.py:90 import_mod  5      0.000035  0.001301  0.000260
..r/logger.py:375 Logger.module_info  5      0.000024  0.000893  0.000179
..y:360 Logger._str_from_module_info  5      0.000018  0.000819  0.000164
..hy/logger/logger.py:363 try_append  15     0.000023  0.000802  0.000053
..aphy/logger/logger.py:372 <lambda>  5      0.000017  0.000753  0.000151
..ython3.8/posixpath.py:387 realpath  5      0.000020  0.000726  0.000145
..uda/cuda.py:237 DeviceArray.nbytes  2      0.000013  0.000616  0.000308
..3.8/posixpath.py:396 _joinrealpath  5      0.000093  0.000564  0.000113
../cuda.py:368 DeviceArray.copy_from  1      0.000022  0.000521  0.000521
..raphy/cuda/cuda.py:118 Cuda.memcpy  2      0.000507  0.000512  0.000256
..tlib/__init__.py:109 import_module  5      0.000016  0.000326  0.000065
..rtlib._bootstrap>:1002 _gcd_import  5      0.000013  0.000307  0.000061
..lib._bootstrap>:986 _find_and_load  5      0.000045  0.000283  0.000057
..uda/cuda.py:352 DeviceArray.resize  1      0.000006  0.000281  0.000281
..lib/python3.8/posixpath.py:71 join  30     0.000115  0.000211  0.000007
../python3.8/posixpath.py:164 islink  30     0.000066  0.000209  0.000007
..python3.8/posixpath.py:372 abspath  5      0.000015  0.000140  0.000028
..ython3.8/posixpath.py:334 normpath  5      0.000057  0.000097  0.000019
..b._bootstrap>:157 _get_module_lock  10     0.000045  0.000096  0.000010
..bootstrap>:194 _lock_unlock_module  5      0.000015  0.000083  0.000017
..>:147 _ModuleLockManager.__enter__  5      0.000013  0.000082  0.000016
..._bootstrap>:1017 _handle_fromlist  15     0.000048  0.000074  0.000005
..python3.8/posixpath.py:41 _get_sep  40     0.000043  0.000064  0.000002
..ib/python3.8/posixpath.py:60 isabs  10     0.000026  0.000053  0.000005
..hy/logger/logger.py:207 Logger.log  5      0.000018  0.000049  0.000010
<frozen importlib._bootstrap>:176 cb  10     0.000027  0.000043  0.000004
..bootstrap>:58 _ModuleLock.__init__  10     0.000022  0.000033  0.000003
.._bootstrap>:78 _ModuleLock.acquire  10     0.000024  0.000029  0.000003
..bootstrap>:103 _ModuleLock.release  10     0.000020  0.000025  0.000003
..p>:151 _ModuleLockManager.__exit__  5      0.000008  0.000022  0.000004
..aphy/logger/logger.py:370 <lambda>  5      0.000008  0.000016  0.000003
..tlib._bootstrap>:937 _sanity_check  5      0.000008  0.000011  0.000002
..aphy/logger/logger.py:371 <lambda>  5      0.000006  0.000010  0.000002
../_internal.py:250 _ctypes.__init__  2      0.000005  0.000005  0.000002
..p>:143 _ModuleLockManager.__init__  5      0.000004  0.000004  0.000001
..polygraphy/util/util.py:499 volume  3      0.000004  0.000004  0.000001
..hy/logger/logger.py:276 should_log  5      0.000004  0.000004  0.000001
..0 DeviceArray._check_dtype_matches  2      0.000003  0.000003  0.000002
..olygraphy/cuda/cuda.py:24 void_ptr  4      0.000003  0.000003  0.000001
..ygraphy/cuda/cuda.py:59 Cuda.check  2      0.000002  0.000002  0.000001
..core/_internal.py:304 _ctypes.data  2      0.000002  0.000002  0.000001
..olygraphy/cuda/cuda.py:149 wrapper  2      0.000001  0.000001  0.000001
../cuda.py:203 try_get_stream_handle  2      0.000001  0.000001  0.000001
Function stats for (_MainThread) (0)
Clock type: CPU
Ordered by: totaltime, desc
name                                  ncall  tsub      ttot      tavg
..hon3.8/threading.py:859 Thread.run  1      0.000005  0.008511  0.008511
..tc_multithread.py:44 run_inference  1      0.006554  0.008506  0.008506
..da/cuda.py:244 DeviceArray.copy_to  1      0.000026  0.001400  0.001400
..rter.py:159 LazyModule.__getattr__  5      0.000014  0.001319  0.000264
..aphy/mod/importer.py:90 import_mod  5      0.000035  0.001301  0.000260
..r/logger.py:375 Logger.module_info  5      0.000024  0.000893  0.000179
..y:360 Logger._str_from_module_info  5      0.000018  0.000819  0.000164
..hy/logger/logger.py:363 try_append  15     0.000023  0.000802  0.000053
..aphy/logger/logger.py:372 <lambda>  5      0.000017  0.000753  0.000151
..ython3.8/posixpath.py:387 realpath  5      0.000020  0.000726  0.000145
..uda/cuda.py:237 DeviceArray.nbytes  2      0.000013  0.000616  0.000308
..3.8/posixpath.py:396 _joinrealpath  5      0.000093  0.000564  0.000113
../cuda.py:368 DeviceArray.copy_from  1      0.000022  0.000521  0.000521
..raphy/cuda/cuda.py:118 Cuda.memcpy  2      0.000507  0.000512  0.000256
..tlib/__init__.py:109 import_module  5      0.000016  0.000326  0.000065
..rtlib._bootstrap>:1002 _gcd_import  5      0.000013  0.000307  0.000061
..lib._bootstrap>:986 _find_and_load  5      0.000045  0.000283  0.000057
..uda/cuda.py:352 DeviceArray.resize  1      0.000006  0.000281  0.000281
..lib/python3.8/posixpath.py:71 join  30     0.000115  0.000211  0.000007
../python3.8/posixpath.py:164 islink  30     0.000066  0.000209  0.000007
..python3.8/posixpath.py:372 abspath  5      0.000015  0.000140  0.000028
..n3.8/threading.py:834 Thread.start  1      0.000011  0.000099  0.000099
..ython3.8/posixpath.py:334 normpath  5      0.000057  0.000097  0.000019
..b._bootstrap>:157 _get_module_lock  10     0.000045  0.000096  0.000010
..bootstrap>:194 _lock_unlock_module  5      0.000015  0.000083  0.000017
..>:147 _ModuleLockManager.__enter__  5      0.000013  0.000082  0.000016
..._bootstrap>:1017 _handle_fromlist  15     0.000048  0.000074  0.000005
..python3.8/posixpath.py:41 _get_sep  40     0.000043  0.000064  0.000002
..on3.8/threading.py:979 Thread.join  1      0.000006  0.000054  0.000054
..hon3.8/threading.py:540 Event.wait  1      0.000008  0.000053  0.000053
..ib/python3.8/posixpath.py:60 isabs  10     0.000026  0.000053  0.000005
..hy/logger/logger.py:207 Logger.log  5      0.000018  0.000049  0.000010
..:1017 Thread._wait_for_tstate_lock  1      0.000012  0.000046  0.000046
<frozen importlib._bootstrap>:176 cb  10     0.000027  0.000043  0.000004
...8/threading.py:270 Condition.wait  1      0.000012  0.000038  0.000038
..8/threading.py:761 Thread.__init__  1      0.000015  0.000036  0.000036
..bootstrap>:58 _ModuleLock.__init__  10     0.000022  0.000033  0.000003
.._bootstrap>:78 _ModuleLock.acquire  10     0.000024  0.000029  0.000003
..bootstrap>:103 _ModuleLock.release  10     0.000020  0.000025  0.000003
..p>:151 _ModuleLockManager.__exit__  5      0.000008  0.000022  0.000004
..n3.8/threading.py:944 Thread._stop  1      0.000016  0.000021  0.000021
..aphy/logger/logger.py:370 <lambda>  5      0.000008  0.000016  0.000003
..tlib._bootstrap>:937 _sanity_check  5      0.000008  0.000011  0.000002
..aphy/logger/logger.py:371 <lambda>  5      0.000006  0.000010  0.000002
...8/threading.py:505 Event.__init__  1      0.000004  0.000009  0.000009
..ing.py:255 Condition._release_save  1      0.000007  0.000008  0.000008
../_internal.py:250 _ctypes.__init__  2      0.000005  0.000005  0.000002
..n3.8/_weakrefset.py:81 WeakSet.add  1      0.000004  0.000005  0.000005
..8/threading.py:1306 current_thread  2      0.000004  0.000005  0.000002
..reading.py:246 Condition.__enter__  1      0.000003  0.000004  0.000004
..p>:143 _ModuleLockManager.__init__  5      0.000004  0.000004  0.000001
..polygraphy/util/util.py:499 volume  3      0.000004  0.000004  0.000001
..hreading.py:222 Condition.__init__  1      0.000004  0.000004  0.000004
..hy/logger/logger.py:276 should_log  5      0.000004  0.000004  0.000001
..0 DeviceArray._check_dtype_matches  2      0.000003  0.000003  0.000002
..reading.py:1095 _MainThread.daemon  2      0.000003  0.000003  0.000001
..hreading.py:249 Condition.__exit__  1      0.000002  0.000003  0.000003
..ython3.8/_weakrefset.py:38 _remove  1      0.000002  0.000003  0.000003
..olygraphy/cuda/cuda.py:24 void_ptr  4      0.000003  0.000003  0.000001
..reading.py:261 Condition._is_owned  1      0.000002  0.000003  0.000003
..ython3.8/threading.py:734 _newname  1      0.000003  0.000003  0.000003
...py:258 Condition._acquire_restore  1      0.000001  0.000002  0.000002
..ygraphy/cuda/cuda.py:59 Cuda.check  2      0.000002  0.000002  0.000001
..core/_internal.py:304 _ctypes.data  2      0.000002  0.000002  0.000001
..ng.py:1177 _make_invoke_excepthook  1      0.000001  0.000001  0.000001
..olygraphy/cuda/cuda.py:149 wrapper  2      0.000001  0.000001  0.000001
../cuda.py:203 try_get_stream_handle  2      0.000001  0.000001  0.000001
..n3.8/threading.py:513 Event.is_set  2      0.000001  0.000001  0.000000
wall multithread profile________________________________++++++++++++++++++++++++++++++++==
Clock type: WALL
Ordered by: totaltime, desc

name                                  ncall  tsub      ttot      tavg      
..hon3.8/threading.py:859 Thread.run  1      0.000005  0.009640  0.009640
..tc_multithread.py:44 run_inference  1      0.008012  0.009635  0.009635
..on3.8/threading.py:979 Thread.join  1      0.000004  0.009415  0.009415
..:1017 Thread._wait_for_tstate_lock  1      0.000009  0.009408  0.009408
..rter.py:159 LazyModule.__getattr__  5      0.000011  0.000975  0.000195
..aphy/mod/importer.py:90 import_mod  5      0.000030  0.000963  0.000193
..da/cuda.py:244 DeviceArray.copy_to  1      0.000022  0.000955  0.000955
..r/logger.py:375 Logger.module_info  5      0.000022  0.000681  0.000136
../cuda.py:368 DeviceArray.copy_from  1      0.000032  0.000628  0.000628
..y:360 Logger._str_from_module_info  5      0.000016  0.000623  0.000125
..hy/logger/logger.py:363 try_append  15     0.000010  0.000607  0.000040
..aphy/logger/logger.py:372 <lambda>  5      0.000016  0.000577  0.000115
..ython3.8/posixpath.py:387 realpath  5      0.000013  0.000550  0.000110
..raphy/cuda/cuda.py:118 Cuda.memcpy  2      0.000513  0.000518  0.000259
..3.8/posixpath.py:396 _joinrealpath  5      0.000062  0.000447  0.000089
..uda/cuda.py:352 DeviceArray.resize  1      0.000008  0.000365  0.000365
..n3.8/threading.py:834 Thread.start  1      0.000015  0.000328  0.000328
..uda/cuda.py:237 DeviceArray.nbytes  2      0.000009  0.000319  0.000160
..hon3.8/threading.py:540 Event.wait  1      0.000009  0.000277  0.000277
...8/threading.py:270 Condition.wait  1      0.000016  0.000260  0.000260
../python3.8/posixpath.py:164 islink  30     0.000038  0.000227  0.000008
..tlib/__init__.py:109 import_module  5      0.000012  0.000221  0.000044
..rtlib._bootstrap>:1002 _gcd_import  5      0.000010  0.000207  0.000041
..lib._bootstrap>:986 _find_and_load  5      0.000033  0.000191  0.000038
..lib/python3.8/posixpath.py:71 join  30     0.000074  0.000126  0.000004
..python3.8/posixpath.py:372 abspath  5      0.000008  0.000087  0.000017
..b._bootstrap>:157 _get_module_lock  10     0.000040  0.000067  0.000007
..>:147 _ModuleLockManager.__enter__  5      0.000009  0.000065  0.000013
..ython3.8/posixpath.py:334 normpath  5      0.000042  0.000062  0.000012
..bootstrap>:194 _lock_unlock_module  5      0.000009  0.000049  0.000010
..._bootstrap>:1017 _handle_fromlist  15     0.000034  0.000047  0.000003
..8/threading.py:761 Thread.__init__  1      0.000018  0.000042  0.000042
..python3.8/posixpath.py:41 _get_sep  40     0.000027  0.000037  0.000001
..hy/logger/logger.py:207 Logger.log  5      0.000016  0.000036  0.000007
..ib/python3.8/posixpath.py:60 isabs  10     0.000018  0.000030  0.000003
<frozen importlib._bootstrap>:176 cb  10     0.000018  0.000025  0.000002
.._bootstrap>:78 _ModuleLock.acquire  10     0.000018  0.000023  0.000002
..bootstrap>:58 _ModuleLock.__init__  10     0.000017  0.000022  0.000002
..bootstrap>:103 _ModuleLock.release  10     0.000014  0.000016  0.000002
..p>:151 _ModuleLockManager.__exit__  5      0.000006  0.000016  0.000003
..aphy/logger/logger.py:370 <lambda>  5      0.000006  0.000013  0.000003
...8/threading.py:505 Event.__init__  1      0.000006  0.000011  0.000011
..n3.8/threading.py:944 Thread._stop  1      0.000006  0.000008  0.000008
..aphy/logger/logger.py:371 <lambda>  5      0.000006  0.000007  0.000001
../_internal.py:250 _ctypes.__init__  2      0.000006  0.000006  0.000003
..tlib._bootstrap>:937 _sanity_check  5      0.000005  0.000006  0.000001
..n3.8/_weakrefset.py:81 WeakSet.add  1      0.000005  0.000005  0.000005
..hreading.py:222 Condition.__init__  1      0.000005  0.000005  0.000005
..reading.py:246 Condition.__enter__  1      0.000004  0.000005  0.000005
..8/threading.py:1306 current_thread  2      0.000004  0.000005  0.000002
..hy/logger/logger.py:276 should_log  5      0.000004  0.000004  0.000001
..ython3.8/_weakrefset.py:38 _remove  1      0.000004  0.000004  0.000004
..olygraphy/cuda/cuda.py:24 void_ptr  4      0.000003  0.000003  0.000001
..polygraphy/util/util.py:499 volume  3      0.000003  0.000003  0.000001
..0 DeviceArray._check_dtype_matches  2      0.000003  0.000003  0.000002
..p>:143 _ModuleLockManager.__init__  5      0.000003  0.000003  0.000001
..ing.py:255 Condition._release_save  1      0.000003  0.000003  0.000003
..reading.py:261 Condition._is_owned  1      0.000002  0.000003  0.000003
..hreading.py:249 Condition.__exit__  1      0.000003  0.000003  0.000003
..ython3.8/threading.py:734 _newname  1      0.000003  0.000003  0.000003
..ygraphy/cuda/cuda.py:59 Cuda.check  2      0.000002  0.000002  0.000001
...py:258 Condition._acquire_restore  1      0.000001  0.000002  0.000002
..reading.py:1095 _MainThread.daemon  2      0.000002  0.000002  0.000001
..core/_internal.py:304 _ctypes.data  2      0.000001  0.000001  0.000000
..olygraphy/cuda/cuda.py:149 wrapper  2      0.000001  0.000001  0.000000
..n3.8/threading.py:513 Event.is_set  2      0.000001  0.000001  0.000000
..ng.py:1177 _make_invoke_excepthook  1      0.000001  0.000001  0.000001
../cuda.py:203 try_get_stream_handle  2      0.000000  0.000000  0.000000
Function stats for (Thread) (1)

Clock type: WALL
Ordered by: totaltime, desc

name                                  ncall  tsub      ttot      tavg      
..hon3.8/threading.py:859 Thread.run  1      0.000005  0.009640  0.009640
..tc_multithread.py:44 run_inference  1      0.008012  0.009635  0.009635
..rter.py:159 LazyModule.__getattr__  5      0.000011  0.000975  0.000195
..aphy/mod/importer.py:90 import_mod  5      0.000030  0.000963  0.000193
..da/cuda.py:244 DeviceArray.copy_to  1      0.000022  0.000955  0.000955
..r/logger.py:375 Logger.module_info  5      0.000022  0.000681  0.000136
../cuda.py:368 DeviceArray.copy_from  1      0.000032  0.000628  0.000628
..y:360 Logger._str_from_module_info  5      0.000016  0.000623  0.000125
..hy/logger/logger.py:363 try_append  15     0.000010  0.000607  0.000040
..aphy/logger/logger.py:372 <lambda>  5      0.000016  0.000577  0.000115
..ython3.8/posixpath.py:387 realpath  5      0.000013  0.000550  0.000110
..raphy/cuda/cuda.py:118 Cuda.memcpy  2      0.000513  0.000518  0.000259
..3.8/posixpath.py:396 _joinrealpath  5      0.000062  0.000447  0.000089
..uda/cuda.py:352 DeviceArray.resize  1      0.000008  0.000365  0.000365
..uda/cuda.py:237 DeviceArray.nbytes  2      0.000009  0.000319  0.000160
../python3.8/posixpath.py:164 islink  30     0.000038  0.000227  0.000008
..tlib/__init__.py:109 import_module  5      0.000012  0.000221  0.000044
..rtlib._bootstrap>:1002 _gcd_import  5      0.000010  0.000207  0.000041
..lib._bootstrap>:986 _find_and_load  5      0.000033  0.000191  0.000038
..lib/python3.8/posixpath.py:71 join  30     0.000074  0.000126  0.000004
..python3.8/posixpath.py:372 abspath  5      0.000008  0.000087  0.000017
..b._bootstrap>:157 _get_module_lock  10     0.000040  0.000067  0.000007
..>:147 _ModuleLockManager.__enter__  5      0.000009  0.000065  0.000013
..ython3.8/posixpath.py:334 normpath  5      0.000042  0.000062  0.000012
..bootstrap>:194 _lock_unlock_module  5      0.000009  0.000049  0.000010
..._bootstrap>:1017 _handle_fromlist  15     0.000034  0.000047  0.000003
..python3.8/posixpath.py:41 _get_sep  40     0.000027  0.000037  0.000001
..hy/logger/logger.py:207 Logger.log  5      0.000016  0.000036  0.000007
..ib/python3.8/posixpath.py:60 isabs  10     0.000018  0.000030  0.000003
<frozen importlib._bootstrap>:176 cb  10     0.000018  0.000025  0.000002
.._bootstrap>:78 _ModuleLock.acquire  10     0.000018  0.000023  0.000002
..bootstrap>:58 _ModuleLock.__init__  10     0.000017  0.000022  0.000002
..bootstrap>:103 _ModuleLock.release  10     0.000014  0.000016  0.000002
..p>:151 _ModuleLockManager.__exit__  5      0.000006  0.000016  0.000003
..aphy/logger/logger.py:370 <lambda>  5      0.000006  0.000013  0.000003
..aphy/logger/logger.py:371 <lambda>  5      0.000006  0.000007  0.000001
../_internal.py:250 _ctypes.__init__  2      0.000006  0.000006  0.000003
..tlib._bootstrap>:937 _sanity_check  5      0.000005  0.000006  0.000001
..hy/logger/logger.py:276 should_log  5      0.000004  0.000004  0.000001
..olygraphy/cuda/cuda.py:24 void_ptr  4      0.000003  0.000003  0.000001
..polygraphy/util/util.py:499 volume  3      0.000003  0.000003  0.000001
..0 DeviceArray._check_dtype_matches  2      0.000003  0.000003  0.000002
..p>:143 _ModuleLockManager.__init__  5      0.000003  0.000003  0.000001
..ygraphy/cuda/cuda.py:59 Cuda.check  2      0.000002  0.000002  0.000001
..core/_internal.py:304 _ctypes.data  2      0.000001  0.000001  0.000000
..olygraphy/cuda/cuda.py:149 wrapper  2      0.000001  0.000001  0.000000
../cuda.py:203 try_get_stream_handle  2      0.000000  0.000000  0.000000

one thread cpu time ____________________________++++++++++++++++++++++++++++++++++++++++++++++++++++
name                                  ncall  tsub      ttot      tavg      
..tc_multithread.py:44 run_inference  1      0.005786  0.007297  0.007297
..da/cuda.py:244 DeviceArray.copy_to  1      0.000016  0.001010  0.001010
..rter.py:159 LazyModule.__getattr__  5      0.000010  0.000940  0.000188
..aphy/mod/importer.py:90 import_mod  5      0.000025  0.000928  0.000186
..r/logger.py:375 Logger.module_info  5      0.000018  0.000621  0.000124
..y:360 Logger._str_from_module_info  5      0.000014  0.000568  0.000114
..hy/logger/logger.py:363 try_append  15     0.000018  0.000555  0.000037
..aphy/logger/logger.py:372 <lambda>  5      0.000013  0.000517  0.000103
..ython3.8/posixpath.py:387 realpath  5      0.000014  0.000497  0.000099
..raphy/cuda/cuda.py:118 Cuda.memcpy  2      0.000475  0.000479  0.000239
../cuda.py:368 DeviceArray.copy_from  1      0.000021  0.000474  0.000474
..3.8/posixpath.py:396 _joinrealpath  5      0.000068  0.000383  0.000077
..uda/cuda.py:237 DeviceArray.nbytes  2      0.000007  0.000377  0.000188
..tlib/__init__.py:109 import_module  5      0.000012  0.000250  0.000050
..uda/cuda.py:352 DeviceArray.resize  1      0.000006  0.000247  0.000247
..rtlib._bootstrap>:1002 _gcd_import  5      0.000016  0.000236  0.000047
..lib._bootstrap>:986 _find_and_load  5      0.000035  0.000213  0.000043
..lib/python3.8/posixpath.py:71 join  30     0.000081  0.000150  0.000005
../python3.8/posixpath.py:164 islink  30     0.000048  0.000129  0.000004
..python3.8/posixpath.py:372 abspath  5      0.000011  0.000097  0.000019
..ython3.8/posixpath.py:334 normpath  5      0.000040  0.000066  0.000013
..>:147 _ModuleLockManager.__enter__  5      0.000012  0.000066  0.000013
..b._bootstrap>:157 _get_module_lock  10     0.000033  0.000065  0.000006
..bootstrap>:194 _lock_unlock_module  5      0.000011  0.000057  0.000011
..._bootstrap>:1017 _handle_fromlist  15     0.000032  0.000050  0.000003
..python3.8/posixpath.py:41 _get_sep  40     0.000031  0.000046  0.000001
..ib/python3.8/posixpath.py:60 isabs  10     0.000019  0.000038  0.000004
..hy/logger/logger.py:207 Logger.log  5      0.000013  0.000035  0.000007
<frozen importlib._bootstrap>:176 cb  10     0.000021  0.000033  0.000003
.._bootstrap>:78 _ModuleLock.acquire  10     0.000022  0.000026  0.000003
..bootstrap>:58 _ModuleLock.__init__  10     0.000016  0.000024  0.000002
..bootstrap>:103 _ModuleLock.release  10     0.000016  0.000020  0.000002
..p>:151 _ModuleLockManager.__exit__  5      0.000007  0.000018  0.000004
..aphy/logger/logger.py:370 <lambda>  5      0.000007  0.000012  0.000002
..uda/cuda.py:196 Stream.synchronize  1      0.000003  0.000008  0.000008
..aphy/logger/logger.py:371 <lambda>  5      0.000005  0.000008  0.000002
..tlib._bootstrap>:937 _sanity_check  5      0.000005  0.000007  0.000001
..cuda.py:76 Cuda.stream_synchronize  1      0.000004  0.000005  0.000005
../_internal.py:250 _ctypes.__init__  2      0.000004  0.000004  0.000002
..p>:143 _ModuleLockManager.__init__  5      0.000003  0.000003  0.000001
..olygraphy/cuda/cuda.py:24 void_ptr  5      0.000003  0.000003  0.000001
..hy/logger/logger.py:276 should_log  5      0.000003  0.000003  0.000001
..polygraphy/util/util.py:499 volume  3      0.000002  0.000002  0.000001
..0 DeviceArray._check_dtype_matches  2      0.000002  0.000002  0.000001
..ygraphy/cuda/cuda.py:59 Cuda.check  3      0.000002  0.000002  0.000001
..olygraphy/cuda/cuda.py:149 wrapper  3      0.000001  0.000001  0.000000
..core/_internal.py:304 _ctypes.data  2      0.000001  0.000001  0.000001
../cuda.py:203 try_get_stream_handle  2      0.000001  0.000001  0.000000

one thread wall time__________________++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Clock type: WALL
Ordered by: totaltime, desc

name                                  ncall  tsub      ttot      tavg      
..tc_multithread.py:44 run_inference  1      0.005838  0.006989  0.006989
..da/cuda.py:244 DeviceArray.copy_to  1      0.000013  0.000788  0.000788
..rter.py:159 LazyModule.__getattr__  5      0.000005  0.000601  0.000120
..aphy/mod/importer.py:90 import_mod  5      0.000020  0.000593  0.000119
..raphy/cuda/cuda.py:118 Cuda.memcpy  2      0.000470  0.000472  0.000236
..r/logger.py:375 Logger.module_info  5      0.000013  0.000389  0.000078
..y:360 Logger._str_from_module_info  5      0.000010  0.000355  0.000071
..hy/logger/logger.py:363 try_append  15     0.000008  0.000345  0.000023
../cuda.py:368 DeviceArray.copy_from  1      0.000022  0.000340  0.000340
..aphy/logger/logger.py:372 <lambda>  5      0.000010  0.000325  0.000065
..ython3.8/posixpath.py:387 realpath  5      0.000011  0.000309  0.000062
..3.8/posixpath.py:396 _joinrealpath  5      0.000043  0.000244  0.000049
..uda/cuda.py:237 DeviceArray.nbytes  2      0.000006  0.000227  0.000113
..uda/cuda.py:352 DeviceArray.resize  1      0.000006  0.000184  0.000184
..tlib/__init__.py:109 import_module  5      0.000006  0.000167  0.000033
..rtlib._bootstrap>:1002 _gcd_import  5      0.000010  0.000160  0.000032
..lib._bootstrap>:986 _find_and_load  5      0.000039  0.000149  0.000030
../python3.8/posixpath.py:164 islink  30     0.000027  0.000097  0.000003
..lib/python3.8/posixpath.py:71 join  30     0.000050  0.000078  0.000003
..python3.8/posixpath.py:372 abspath  5      0.000004  0.000052  0.000010
..>:147 _ModuleLockManager.__enter__  5      0.000008  0.000045  0.000009
..b._bootstrap>:157 _get_module_lock  10     0.000023  0.000042  0.000004
..ython3.8/posixpath.py:334 normpath  5      0.000027  0.000037  0.000007
..bootstrap>:194 _lock_unlock_module  5      0.000006  0.000033  0.000007
..._bootstrap>:1017 _handle_fromlist  15     0.000018  0.000027  0.000002
..ib/python3.8/posixpath.py:60 isabs  10     0.000012  0.000021  0.000002
..python3.8/posixpath.py:41 _get_sep  40     0.000016  0.000021  0.000001
..hy/logger/logger.py:207 Logger.log  5      0.000009  0.000021  0.000004
.._bootstrap>:78 _ModuleLock.acquire  10     0.000015  0.000017  0.000002
..bootstrap>:58 _ModuleLock.__init__  10     0.000012  0.000016  0.000002
<frozen importlib._bootstrap>:176 cb  10     0.000009  0.000015  0.000001
..bootstrap>:103 _ModuleLock.release  10     0.000010  0.000013  0.000001
..p>:151 _ModuleLockManager.__exit__  5      0.000004  0.000012  0.000002
..aphy/logger/logger.py:370 <lambda>  5      0.000002  0.000007  0.000001
..aphy/logger/logger.py:371 <lambda>  5      0.000004  0.000005  0.000001
..uda/cuda.py:196 Stream.synchronize  1      0.000001  0.000005  0.000005
..cuda.py:76 Cuda.stream_synchronize  1      0.000004  0.000004  0.000004
..p>:143 _ModuleLockManager.__init__  5      0.000003  0.000003  0.000001
../_internal.py:250 _ctypes.__init__  2      0.000002  0.000002  0.000001
..olygraphy/cuda/cuda.py:24 void_ptr  5      0.000002  0.000002  0.000000
..hy/logger/logger.py:276 should_log  5      0.000002  0.000002  0.000000
..polygraphy/util/util.py:499 volume  3      0.000001  0.000001  0.000000
..0 DeviceArray._check_dtype_matches  2      0.000001  0.000001  0.000000
..tlib._bootstrap>:937 _sanity_check  5      0.000001  0.000001  0.000000
..ygraphy/cuda/cuda.py:59 Cuda.check  3      0.000000  0.000000  0.000000
..core/_internal.py:304 _ctypes.data  2      0.000000  0.000000  0.000000
../cuda.py:203 try_get_stream_handle  2      0.000000  0.000000  0.000000
..olygraphy/cuda/cuda.py:149 wrapper  3      0.000000  0.000000  0.000000

You can see that everything is slower in the threaded version. I just wonder why that is.
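To put a number on the gap, the run_inference totals (the ttot column) from the four logs can be compared directly. This is only a small arithmetic sketch: the dictionary copies the four ttot values from the logs, and the names `ttot` and `overhead` are mine.

```python
# ttot (seconds) of run_inference, copied from the four yappi logs above.
ttot = {
    ("cpu", "direct"): 0.007297,
    ("cpu", "thread"): 0.008506,
    ("wall", "direct"): 0.006989,
    ("wall", "thread"): 0.009635,
}


def overhead(clock):
    """Return (extra seconds, slowdown ratio) of thread vs. direct for one clock."""
    direct = ttot[(clock, "direct")]
    thread = ttot[(clock, "thread")]
    return thread - direct, thread / direct


for clock in ("cpu", "wall"):
    extra, ratio = overhead(clock)
    print(f"{clock}: +{extra * 1e3:.3f} ms, {ratio:.2f}x")
```

That works out to roughly 1.2 ms extra CPU time and 2.6 ms extra wall time per call. Each figure comes from a single run, so some of the difference may be measurement noise.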