Threading or multiprocessing

Hi,

I have 3 cameras and 3 3070 GPUs on one of my computers.

I created 3 threads by subclassing Python's threading.Thread class. In the run() method of the subclass, I do the following:

while True:
	read an image from the camera; the camera and the computer are on the same local LAN
	run yolov3 deep learning inference; this step takes around 30 ms
	call OpenCV's imshow() to display the image

From the displayed image, I could see that the image lags by around 12 seconds. That is a very long delay.

Does this prove that threading in Python is not suitable for my case?
Should I use multiprocessing in place of threading?


Does this prove that threading in Python is not suitable for my case?

Not without more information.

Should I use multiprocessing to take place of threading?

Maybe. That depends on testing.

Put print statements in your code to identify the thread and the
start/stop times of the various steps in the thread. See what consumes
time.
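One way to do that is a small timing helper - a hypothetical sketch, not code from this thread (the timed helper and the step names are my invention; the sleeps stand in for the real camera/inference steps):

```python
import threading
import time
from contextlib import contextmanager

@contextmanager
def timed(step_name):
    # print which thread ran which step, and how long it took
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print("%s: %s took %.3fs" % (threading.current_thread().name, step_name, elapsed))

# usage inside each camera thread's loop; sleep stands in for the real steps
with timed("read frame"):
    time.sleep(0.05)   # e.g. read image from camera
with timed("inference"):
    time.sleep(0.05)   # e.g. yolov3 forward pass
```

Wrapping each step of the while-loop this way shows immediately which step dominates the 12 seconds.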

If 12 seconds is a long time versus what you expect, hopefully the slow
step will be pretty obvious.

Is yolov3 a Python library? Does it do its work in Python code or C
code? If the latter, does it release the GIL while doing
compute-intensive stuff?

Normally, compute-intensive non-Python code tries to release the GIL
while it runs. That leaves the CPU free to run other Python code, and
threading is very effective when that’s the model (and the time
consuming part).

Also, test your code with just one camera and one Thread. Time things.

If the run time with 3 threads is (roughly) 3 times that of the
1-thread/1-camera case, something isn’t sharing the CPU.

These tests should help you figure out where that issue is.

If there’s compute-intensive stuff which does not release the GIL (eg
done by pure Python code, or C code not releasing the GIL) then
multiprocessing may help you by having a separate Python for each task.
But if not, you may find multiprocessing brings no throughput benefit,
and brings a little pain, because setting up the communication with the
subprocesses has its complications (which can be very small, I gather).

Cheers,
Cameron Simpson cs@cskk.id.au

Hi Cameron,
Thank you for your reply!
I did the following experiments:
experiment 1:
Only run 1 thread: the time delay is around 1-2 s.
experiment 2:
Run 2 threads on 2 GPUs respectively: the time delay for thread 1 is around 1-2 s, and for thread 2 around 5-6 s.

My code is pure Python; however, it depends on PyTorch, which is built on Python and C.

I suspect the cause of the time delay is the GIL, which allows only 1 thread to run at a time.

The time-intensive computations should be done with the GIL released, but that’s up to the libraries you’re using. Check if they are releasing it.

Did print() calls around each step corroborate this? Eg:

print(time.time(), "fetch from camera")
# ... fetch ...
print(time.time(), "learn")
# ... pytorch ...
print(time.time(), "done")

Anyway, look at the multiprocessing module; IIRC starting a subprocess
isn’t much harder than starting a thread.
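A minimal sketch of how similar the two APIs are (capture_and_detect is a placeholder name for your camera loop, not real code from this thread):

```python
import multiprocessing
import threading

def capture_and_detect(camera_id):
    # placeholder for the real read-frame / inference / display loop
    return "camera %d done" % camera_id

if __name__ == "__main__":
    # Thread and Process take the same target/args arguments
    t = threading.Thread(target=capture_and_detect, args=(0,))
    p = multiprocessing.Process(target=capture_and_detect, args=(1,))
    t.start()
    p.start()
    t.join()
    p.join()
```

The `if __name__ == "__main__"` guard matters for multiprocessing on Windows, where child processes re-import the main module.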

The question is: do these tasks need to share data? If you do the
inference in separate processes, do they need to communicate?

Have you read this:

https://pytorch.org/tutorials/beginner/dist_overview.html

At the bottom it suggests DataParallel for your situation, I think:

https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html

Cheers,
Cameron Simpson cs@cskk.id.au

Howdy Ardeal,

a clear “yes” - if you ask me.

Multithreading in Python is (just) fine for parallelizing I/O-bound tasks - for example, the part that reads the images.

For CPU-bound tasks the way to go is multiprocessing - for example, the part that processes the images.

As long as you do not need IPC (inter-process communication), multiprocessing is as easy as multithreading; at the current stage, you do not seem to need IPC - so the simplest way might be to just try it out…

If, in the future, you are going to need IPC, consider using multiprocessing.Manager objects - they ease IPC significantly.
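A hedged sketch of what Manager-based IPC might look like (detect_worker and its result strings are illustrative only): a Manager hands out proxy objects, such as a shared dict, that several processes can read and write.

```python
import multiprocessing

def detect_worker(results, camera_id):
    # placeholder per-camera work; the Manager dict proxy is shared across processes
    results[camera_id] = "detections for camera %d" % camera_id

if __name__ == "__main__":
    with multiprocessing.Manager() as manager:
        results = manager.dict()   # proxy object, usable from every process
        procs = [multiprocessing.Process(target=detect_worker, args=(results, i))
                 for i in range(3)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(results))
```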

Cheers, Dominik

Hi Dominik,
Many thanks to you and all the others for your replies!
Your answer “yes” made things clear to me.

Before posting this topic, I searched Google for Python threading. I learned that threading is not good for compute-intensive cases (such as image processing). I could not believe that, so I posted this topic to confirm those comments/answers.

Now that everyone knows the GIL makes Python threading unsuitable for some cases, why doesn’t Python change the GIL to improve threading?

The GIL makes Python threading like a well-known Chinese saying:
Python threading is like chicken ribs - tasteless to eat, but a pity to throw away.

1 Like

Hi Ardeal,

because that is quite a tough task. The GIL was invented to make the development of Python easier and maybe “cleaner”. You cannot just / easily switch off the GIL - unfortunately…

By the way: IronPython comes without a GIL; this means that in IronPython you have the full power of multithreading at hand. But the performance of IronPython is on average worse than that of CPython, as IronPython is based on the CLI (Common Language Infrastructure). Jython also comes without a GIL (as it is JVM-based).

It is true that the GIL is one of the big nuisances of Python :slight_smile:

It is quite typical that interpreted languages come with such restrictions. In C#, e.g., you won’t find multiple inheritance - for the very same reason: it is hard to implement…

Cheers, Dominik

Got it. Thank you!

1 Like

The majority of popular analysis libraries, including image processing, machine learning, array manipulation and statistics, are written in C and invoked by Python. When the CPU is running C code, other threads are allowed to run Python code (unless the C library is written poorly).

I would test with both multithreading and multiprocessing, and see if you get a speed-up: you might find you have I/O contention.
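Such a comparison could be sketched with concurrent.futures (the busy function is a stand-in for the real workload, and absolute timings will vary by machine):

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def busy(n):
    # pure-Python CPU work: it holds the GIL, so threads cannot run it in parallel
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_run(executor_cls, jobs=4, n=200_000):
    # run the same job under the given executor and measure wall time
    start = time.perf_counter()
    with executor_cls(max_workers=jobs) as ex:
        results = list(ex.map(busy, [n] * jobs))
    return time.perf_counter() - start, results

if __name__ == "__main__":
    thread_time, _ = timed_run(ThreadPoolExecutor)
    process_time, _ = timed_run(ProcessPoolExecutor)
    print("threads:   %.2fs" % thread_time)
    print("processes: %.2fs" % process_time)
```

If the workload released the GIL (numpy, OpenCV, PyTorch), the thread timing would look much better than it does for this pure-Python loop.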

1 Like

Hi Laurie,

that does not make sense to me, as a Python function wrapping C code normally will be blocking until the C part has finished, right? I doubt that most Python wrappers are coded in a non-blocking way - am I wrong?

Cheers, Dominik

Before posting this topic, I searched Google for Python threading. I
learned that threading is not good for compute-intensive cases (such as
image processing). I could not believe that, so I posted this topic to
confirm those comments/answers.

Now that everyone knows the GIL makes Python threading unsuitable for some cases, why doesn’t Python change the GIL to improve threading?

The GIL means that compute-intensive pure-Python code doesn’t benefit
much from threading, because only one piece of pure Python code can run
at once - the GIL is the single-big-lock pattern, used so that while it
is held you know that nobody else is doing any Python-specific stuff.
You’re free to manipulate the interpreter’s internal state without
worrying about races - allocate Python variables, whatever. (Not you the
programmer, I mean the Python interpreter.)

However, truly compute-intensive tasks are poorly served by
interpreted languages, particularly dynamic interpreted languages - the
overhead of interpretation itself brings an order-of-magnitude cost
beyond the theoretical limits of the hardware. We use Python for its
superior expressiveness and memory safety etc etc.

People do do high performance computing using Python, for example
using Numpy and likely PyTorch. All these libraries have high-speed,
lower-level cores (usually eg C or C++), which get good machine
performance. All of them are arranged around pieces of extension code
which look like this:

# inside the Python interpreter
do some Python things to set up
release the GIL
... do high speed C stuff here, ideally as much as possible ...
reclaim the GIL
update Python interpreter state

I’ve only written one serious piece of C extension code, here:

https://hg.sr.ht/~cameron-simpson/css/browse/lib/python/cs/vt/_scan.c?rev=tip

You can see the pattern above in the scan_scanbuf() function; the chunk
in the middle:

Py_BEGIN_ALLOW_THREADS
unsigned long   offset = 0;
unsigned char   *cp = buf;
for (; buflen; cp++, buflen--, offset++) {
    unsigned char b = *cp;
    hash_value = ( ( ( hash_value & 0x001fffff ) << 7
                   )
                 | ( ( b & 0x7f )^( (b & 0x80)>>7 )
                   )
                 );
    if (hash_value % 4093 == 4091) {
        offsets[noffsets++] = offset;
    }
}
Py_END_ALLOW_THREADS

It does some Python internal setup, releases the GIL at
Py_BEGIN_ALLOW_THREADS, does a pure C scan of a memory buffer (in this
case potentially quite large), then reacquires the GIL at
Py_END_ALLOW_THREADS and updates the Python state before returning to
the outer Python programme.

While the pure C chunk above is running, other Python threads can
execute at the same time.

While the GIL is released, the C stuff runs at full speed in that
thread and other Python threads are free to execute at the same time.
If things are arranged well, this can produce good multithreaded
performance for compute intensive stuff. Likewise I/O bound stuff - the
interpreter releases the GIL while waiting for significant I/O, so that
other threads run freely while this thread is blocked.

The GIL makes Python threading like a well-known Chinese saying:
Python threading is like chicken ribs - tasteless to eat, but a pity to throw away.

No, it just means you need to do the right things. Even for high
performance stuff, Python’s great for orchestration, and Threads are one
form of orchestration.

Cheers,
Cameron Simpson cs@cskk.id.au

1 Like

For performance stuff or blocking I/O stuff, you’re wrong.

It is easy to let go of the GIL, do some pure C high speed stuff, and
take it back. During the high speed stuff, other Python threads can run.
The same applies for I/O in C: setup, release the GIL, do the I/O, maybe
blocking while stuff happens, reacquire the GIL, return.

That’s one of the great attractions of threads - you don’t need to write
nonblocking stuff - instead you use blocking stuff, but block only that
thread - other threads proceed happily.
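A tiny illustration of that (sleep stands in for a blocking read; CPython releases the GIL while sleeping, as it does for real I/O):

```python
import threading
import time

def blocking_io(delay):
    time.sleep(delay)   # stands in for a blocking network/disk read

start = time.perf_counter()
threads = [threading.Thread(target=blocking_io, args=(0.2,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# the five 0.2s "blocking" calls overlap: total is roughly 0.2s, not 1.0s
print("elapsed: %.2fs" % elapsed)
```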

Cheers,
Cameron Simpson cs@cskk.id.au

1 Like

Hi Cameron,

Thank you for your reply!
You are very professional in Python.
One short question: do you work on the Python language? Do you work together with Guido?

In order to make full use of Python threading, we have to understand the GIL thoroughly.

Hi Cameron,

I see, this is what Laurie alluded to - got it. That’s true: if you release the GIL, the following parts no longer block other threads. Thanks for pointing that out! :+1:

Cheers, Dominik

Hi Ardeal,

agreed.

With multithreading and multiprocessing, it often is not easy to predict the outcome in terms of performance. Sometimes multithreading is even faster than multiprocessing.

This depends not just on the code, but also on the machine your code is executed on - and its capabilities, like the number of threading units / CPU cores and so on. In the end you have to try it out.

But as a rule of thumb: avoiding problems is easier with multiprocessing than with multithreading, in my experience - no GIL there. And we have not even talked about deadlocks yet…

And as soon as Python code is involved significantly (where you cannot release the GIL), multiprocessing’s performance generally is better (unless the computing tasks are so small that the process overhead becomes relevant).

The downside of multiprocessing is that IPC comes with more effort.

Cheers, Dominik

one short question: do you work on the Python language? do you work
together with Guido?

No and no. I’m currently earning $s as a Python developer and doing some
sysadmin on the side. And like anyone, I’ve a bazillion side projects.

In order to make full use of Python threading, we have to understand
the GIL thoroughly.

Well, yes and no. You need to know:

  • the GIL exists - if your code execution is dominated by pure Python
    code then multiple threads do not yield greater CPU utilisation

  • plenty of high performance stuff like numpy has C (or C++ etc)
    libraries which do the high performance stuff and provide nice, very
    expressive, Python-idiomatic access - so you do pythonic, comfortable
    setup, then call a numpy function which runs at machine speed, then
    deal with the result in nice comfortable Python code afterward

  • as with OS system calls, shell scripts etc, the more tiny calls to the
    lower level or outside-the-language things you make, the smaller the
    gain from the low level thing

  • threads are also good for I/O blocked stuff eg fetch data from the net
    or managing the send and receive parts of network traffic separately
    (usually with a little coordination, but not with one side blocking
    the other - multithreads let you pipeline things for greater
    throughput, for example)

  • threads can also be good for algorithm expressivity; plenty of things
    express well as multiple threads cooperating to perform a task,
    instead of some central loop with complex if-statements etc to perform
    bits and pieces interleaved; Queues can be good for passing data back
    and forth; generators go a long way to separating out flow control for
    things like this, but they’re quite serial and sometimes a thread is a
    better/cleaner expression

  • threads are also good for “workers” eg a watcher for a directory
    waiting for things to show up to work on, etc - they can sit there
    doing their thing and hand off tasks to the rest of the programme as
    they show up instead of having the main programme interleave that kind
    of thing in a main loop
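The Queue-plus-worker pattern from the last two points might look like this minimal sketch (the doubling task is a placeholder for real per-item work):

```python
import queue
import threading

def worker(tasks, results):
    # pull work items until the None sentinel arrives
    while True:
        item = tasks.get()
        if item is None:
            break
        results.put(item * 2)   # placeholder for real per-item work

tasks = queue.Queue()
results = queue.Queue()
t = threading.Thread(target=worker, args=(tasks, results), daemon=True)
t.start()
for i in range(3):
    tasks.put(i)
tasks.put(None)     # tell the worker to stop
t.join()
out = [results.get() for _ in range(3)]
print(sorted(out))  # [0, 2, 4]
```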

As Dominik mentioned, communication is the tricky bit with
multiprocessing - because the subtasks are separate Python programmes
you can’t just put an object on a Queue etc. There are facilities for
making this easier, but they mostly involve copying data instead of just
passing a reference, and copying data requires it to be serialisable.

The tricky bit with threads is deadlocks (from too tight integration
between threads or lock/mutex mismanagement) and races (from too loose
control of data structures and lack of locks/mutexes to control shared
access).

Cheers,
Cameron Simpson cs@cskk.id.au

2 Likes

Thank you all - Cameron, Dominik, Laurie, and everyone else!

1 Like

Hi Ardeal,

I have written a small test program which compares sequential, multithreaded, and multiprocessed processing of two tasks - have fun with it:

ThreadingVsProcessing.py:
# -*- coding: utf-8 -*-

#import common libraries
import cv2
import time
import threading
import multiprocessing
import argparse
import os.path
import scipy.ndimage
import numpy


#parse command line arguments
argParser = argparse.ArgumentParser()
argParser.add_argument( "testTypeS",       type=str, nargs="?",                                         \
                        choices =  ("noMulti", "multiThread", "multiProcess"),                          \
                        help     = "type of test to be done - either 'noMulti', 'multiThread' "
                                   "(for multithreading) or 'multiProcess' (for multiprocessing)"       )
argParser.add_argument( "numberOfFramesI", type=int, help = "number of frames to be taken into account" )
argParser.add_argument( "videoPathS",      type=str, help = "path to video file"                        )
arguments = argParser.parse_args()


#check number of frames argument
if not isinstance(arguments.numberOfFramesI, int):
   raise Exception("'numberOfFramesI' must be an integer (and not: %s)!" % arguments.numberOfFramesI)

#check the image file path argument
if not os.path.exists(arguments.videoPathS):
   raise Exception("the file, 'videoPathS' is pointing to: '%s', does not exist!" % arguments.videoPathS)


#task to be accomplished - twice
def openCvTask(params, indexI):
    """ Read and filter params.numberOfFramesI frames from the video
        (params.videoPathS) and save the result in file 'filteredVideoXXX.mp4' -
        where XXX is indexI.
    """

    #open test video
    video                  = cv2.VideoCapture(params.videoPathS)

    #read and filter frames
    framesL                = []
    for looper in range(params.numberOfFramesI):
        #read next frame; stop early if the video runs out of frames
        (ok, currFrame)    = video.read()
        if not ok:
            break

        #canny edge filter current frame
        edges              = cv2.Canny(currFrame, 300, 300)

        #append to list of frames
        framesL.append( edges )

    #close input video file
    video.release()

    #write frames to new video file
    height, width = edges.shape
    video = cv2.VideoWriter('filteredVideo%d.mp4' % indexI, cv2.VideoWriter_fourcc(*'avc1'), 25, (width,height))
    for currFrame in framesL:
        #VideoWriter expects 3-channel BGR frames by default, so convert the grayscale edge image
        video.write( cv2.cvtColor(currFrame, cv2.COLOR_GRAY2BGR) )
        #cv2.imshow( "see frames", currFrame )
        #cv2.waitKey()

    #close output video file
    video.release()


#main
if __name__ == "__main__":

   if   arguments.testTypeS == "noMulti":
        ### the task of processing twice is done sequentially ###
        print ("---noMulti---")
        startF   = time.time()

        openCvTask(arguments, 1)
        openCvTask(arguments, 2)

        endF     = time.time()

   elif arguments.testTypeS == "multiThread":
        ### the task of processing twice is done in parallel using multithreading ###
        print ("---multiThread---")
        startF   = time.time()

        #start background thread
        bgThread = threading.Thread( target=openCvTask, args=(arguments, 1) )
        bgThread.daemon = True
        bgThread.start()

        #start foreground task
        openCvTask(arguments, 2)

        #wait on bgThread
        bgThread.join()

        endF     = time.time()

   elif arguments.testTypeS == "multiProcess":
        ### the task of processing twice is done in parallel using multiprocessing ###
        print ("---multiProcess---")
        startF   = time.time()

        #start background process
        bgProcess = multiprocessing.Process( target=openCvTask, args=(arguments, 1) )
        bgProcess.start()

        #start foreground task
        openCvTask(arguments, 2)

        #wait on bgProcess
        bgProcess.join()

        endF     = time.time()

   else:
        raise Exception("the command line argument testTypeS either must be 'noMulti', 'multiThread' or 'multiProcess'!")


   #result
   print ("Time needed [s]: %s" % (endF - startF))

The outcome on my Windows computer was: [screenshot of timings omitted]

whereas the outcome on my Linux computer was: [screenshot of timings omitted]

- which is a good example of the truth of the statement that whether multithreading performs better than multiprocessing, or vice versa, might depend on the system…

Screenshots with interesting timings from you or the other forum members are welcome :cowboy_hat_face: !

Cheers, Dominik

PS: also have a look at the result (“filteredVideo1.mp4”) - it’s fun too…

1 Like