Performance Optimization and Troubleshooting in GPU-Based Machine Learning Deployments

I recently deployed a GPU-backed machine learning model on two parallel servers, with the request load balanced between them by a load balancer. To reduce resource usage, I decided to switch to a single-server setup by routing all requests directly to one of the servers. That server has an NVIDIA Tesla T4 GPU, and the model uses only about 1.2 GB of its 15 GB memory capacity.

After the change, I observed the following:

CPU Utilization: Remained almost unchanged.

GPU Utilization: Doubled as expected but did not exceed the GPU’s capacity.

Average GPU utilization % went from 7.5% to 16.5%

Max GPU utilization % went from 38% to 48%

90th percentile of GPU utilization % went from 19% to 34%

Model Response Time: Increased by about 10% to 15% on average.

Despite these observations, I’m unable to pinpoint the exact cause of the increased response time from the model. Any insights or suggestions?

Hello,

Without going into heavy analysis, this has the telltale signs of queueing.

https://ece-research.unm.edu/jimp/611/slides/chap6_3.html

From Little's Law:
The mean number of tasks in the system = arrival rate * mean response time

Setting arrival rate = 1:

mean number of tasks in the system = 1 x mean response time

solving for response time:

response time = mean number of tasks in the system
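Little's Law is simple enough to encode directly. A minimal sketch (the function and variable names are mine, not from the post):

```python
def mean_response_time(tasks_in_system, arrival_rate):
    """Little's Law: L = lambda * W, rearranged to W = L / lambda."""
    return tasks_in_system / arrival_rate

# With arrival rate 1, response time equals the mean number of tasks in the system:
print(mean_response_time(4.0, 1.0))  # 4.0
```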

But if you use one server instead of two for the same total number of tasks, that is equivalent to halving the arrival rate (you double the length of the queue while the total number of tasks stays the same), and by Little's Law that is equivalent to doubling the response time:

Setting arrival rate = 1/2:

mean number of tasks in the system = 1/2 x mean response time

solving for response time via a little bit of algebra:

response time = 2 * mean number of tasks in the system

So, in this idealized model, the mean response time doubles.

The total number of tasks has not increased, so overall utilization should not be expected to change by much. And since tasks are not homogeneous (their processing times vary randomly from one task to another), you'll see slight differences in the other metrics.
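To see the queueing effect with non-homogeneous tasks, here is a rough simulation sketch under made-up rates (not measurements from the deployment): each server is modeled as a FIFO queue with exponentially distributed interarrival and service times, and we compare two lightly loaded balanced servers against one server taking all the traffic.

```python
import random

def simulate_queue(arrival_rate, service_rate, n=200_000, seed=0):
    """Mean response time of a FIFO single-server queue (Lindley recursion)."""
    rng = random.Random(seed)
    arrival = 0.0    # arrival time of the current task
    depart = 0.0     # departure time of the previous task
    total = 0.0      # accumulated response times
    for _ in range(n):
        arrival += rng.expovariate(arrival_rate)       # Poisson arrivals
        service = rng.expovariate(service_rate)        # random (heterogeneous) task sizes
        start = max(arrival, depart)                   # wait if the server is busy
        depart = start + service
        total += depart - arrival                      # response = wait + service
    return total / n

lam = 0.33                             # assumed total arrival rate (tasks per unit time)
mu = 1.0                               # assumed service rate of one server
w_two = simulate_queue(lam / 2, mu)    # each of two balanced servers sees half the load
w_one = simulate_queue(lam, mu)        # one server sees all of it
print(w_two, w_one)
```

With these assumed rates the single server's mean response time comes out noticeably higher than the balanced pair's even though both are far from saturation; how much higher depends on the real arrival and service rates, which is consistent with seeing a modest 10–15% increase rather than a full doubling.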