Data Science Performance Challenges via Python

TomMonkeyMan · August 5, 2024, 8:02am

Hi friends,

The background is I’m currently on a data science project where we use Python to perform a series of matrix operations and integrations for thermodynamic temperature calculations. However, I’ve encountered a performance bottleneck and had to switch to Rust because it runs much faster.

I understand that Python should be used in a more Pythonic way and not just like solving a puzzle, but I still want to bring this issue up for discussion. I hope to understand more about this language and its characteristics.

The input data structure is quite simple for a thermal model, like:

epoch_time, current, duration
1, xxxxx, 0.23
2, xxxxx, 0.3
3, xxxxx, 0.5

The physical model has several surfaces for each inductor [R1, R2, R3, R4…] and a matrix of initial parameters. We need to calculate the integral (heat by current over a certain time) via time, where each input value depends on the previous value we calculated.

The calculation parts (matrix operations) were written using numpy and scipy (for the integral part), avoiding dynamic lists/arrays (e.g., append list, np.insert, etc.). However, when the dataset became extremely large (over 10 billion entries), it still ran very slowly.

Then the code was rewritten in Rust (using the rectangular approximation method for integration), and it ran about a hundred times faster than the Python version.

I’m quite confused about why there is such a significant performance difference. numpy is supposed to be written in Fortran, and scipy packages should also be well-optimized. Is this performance gap due to the cost of compiling or something else?

The goal is still to write the entire project in Python so the team can easily understand it, and we do appreciate Python for data science.

Any thoughts and questions are welcome. more details can be provided if needed. or has anyone else done similar work or faced similar issues?

onePythonUser · August 5, 2024, 12:30pm

Hi,

have you considered integrating C into your Python script? Ironically, when developing C embedded applications and we wanted even more speed (for PID feedback loops), we would use assembly language. In this case, using Python, I think that you may have to write some tasks using C if speed is what you’re after.

I believe that it is written in C actually.

From this article here, Numpy makes use of LAPLACK library functions that may be written in FORTRAN or C.

TomMonkeyMan · August 6, 2024, 8:23am

Hi Paul,

Thanks for your reply! The reason why I wrote it in Rust is because I’m not that familiar with C. Though C shall be the best practice, which I totally agree with. I’ll spend more time to make it and compare it with Rust.

Actually, I’m still not sure why the performance is that bad although we are fully using numpy and scipy in the python script…

hprodh · February 7, 2025, 10:40pm

→ Yes.
Consider all the time is spent into memory access and allocation, and not into ‘actual’ computation. The compiler finds the computation nodes then combines chained operations to minimize the temporary data assignments and repeated accesses, (and modernly in a SIMD way).