Optimizing Memory Usage for Large CSV Processing in Python 3.12

Dear Python Community,

I am refining a Python script for a small business analytics application, developed in May 2025, to process large CSV datasets efficiently while minimizing memory consumption. While the script performs its core functions, it encounters memory issues with larger datasets, impacting performance. Your expertise in addressing this challenge would be greatly appreciated.

My application, built with Python 3.12 on a Linux server (Ubuntu 22.04, 8GB RAM, 4 vCPUs), processes CSV files containing sales data, typically 500,000 rows with 20 columns, amounting to approximately 1GB in size. The script uses pandas to load the CSV into a DataFrame for aggregation and filtering, following guidelines from the pandas documentation.

During processing, memory usage peaks at 6GB, occasionally causing the script to crash on larger datasets. To mitigate this, I implemented chunking with pd.read_csv(chunksize=10000), which reduced memory usage to 4GB but increased processing time by 40%, from 2 minutes to nearly 3 minutes. I also profiled the script using memory_profiler, identifying the DataFrame loading as the primary memory bottleneck. Additionally, I experimented with dtype optimization to use smaller data types, such as int32 for numeric columns, but observed only marginal improvements.
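
For context, here is roughly what my chunked version looks like; the file name, column names, and dtypes below are placeholders for the real 20-column schema:

import pandas as pd

# Placeholder column names and dtypes; the real file has 20 columns
dtypes = {"quantity": "int32", "unit_price": "float32"}

totals = {}
for chunk in pd.read_csv("sales.csv", chunksize=10_000, dtype=dtypes):
    # Aggregate each chunk, then fold the partial result into the running totals
    partial = chunk.groupby("category")["quantity"].sum()
    for category, qty in partial.items():
        totals[category] = totals.get(category, 0) + qty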

Despite these efforts, memory consumption remains high, suggesting inefficiencies in my approach to handling large datasets. I am particularly interested in strategies to further optimize memory usage while maintaining reasonable processing speeds, ensuring the script can scale to datasets exceeding 1GB without requiring additional server resources.

What specific techniques or libraries would you recommend to optimize memory usage for processing large CSV datasets in Python 3.12?

Thank you for your insights and guidance.

I haven’t had to try it myself, but you might try reading the CSV file with Polars and then converting the result to a pandas DataFrame. Polars is touted as being fast and memory efficient.
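
Untested on my end, but roughly something like this (the file and column names are just examples, and I believe to_pandas() needs pyarrow installed):

import polars as pl

# Eager: read with Polars, then hand the DataFrame over to pandas
df = pl.read_csv("sales.csv").to_pandas()

# Lazy and lower-memory: filter/aggregate in Polars, convert only the small result
result = (
    pl.scan_csv("sales.csv")
    .group_by("category")
    .agg(pl.col("quantity").sum())
    .collect()
    .to_pandas()
)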

2 Likes

Can you process the CSV files a line at a time?

If so, you avoid the memory overhead of holding the whole CSV in memory.
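
For example, a minimal sketch with the csv module from the standard library (the file and column names are invented):

import csv

totals = {}
with open("sales.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:  # only one row is held in memory at a time
        category = row["category"]
        totals[category] = totals.get(category, 0) + float(row["amount"])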

1 Like

Hello,

A potential solution is to use generator functions, which let you process smaller sets of data at a time rather than the entire set all at once, by using the yield keyword. In simple terms, you iterate (step through) your rows in chunks, process the data, and repeat continuously until the entire set has been handled, instead of attempting to process the whole file at once (divide and conquer). This should alleviate the large memory requirements that you are experiencing.
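
As a rough, untested sketch of the idea (the file name, chunk size, and the process() step are placeholders for your own code):

import csv

def read_in_chunks(path, chunk_size=10_000):
    """Yield lists of rows so that only one chunk is in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:  # yield any leftover rows
            yield chunk

for rows in read_in_chunks("sales.csv"):
    process(rows)  # hypothetical processing step: aggregate, then discard the chunk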

Here are a few tutorials to get you started:

https://labex.io/tutorials/python-how-to-use-csv-reader-with-generators-462141

https://www.datacamp.com/tutorial/python-generators

Hope this helps.

Hello Paul,

Thank you for your insightful suggestion to use function generators with the yield keyword for processing large CSV files. The tutorials you shared were immensely helpful in guiding my approach. Inspired by your advice, I implemented a generator function to process my 1GB sales dataset in 10,000-row chunks using Python 3.12’s csv.reader with yield, reducing memory usage from 4GB to 2.5GB, as measured by memory_profiler.

However, complex aggregations, such as grouping by product categories, still slow processing by 20%. Could you recommend techniques to optimize generator performance for such aggregations in large datasets?

Best,
@Gabrielavale | CEO of Calchoras

First caveat I owe you: I haven’t tried any of this on a sample CSV, although I do use pandas data frames a bit.

A straightforward approach would be only to load the columns that you actually need for a given task. This may mean reading it more times, so you trade memory at run-time for more disk i/o. Of course, don’t leave (references to) the old frame hanging around while you read the new one.
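
For example (invented column names):

import pandas as pd

# Load only the columns this particular aggregation actually needs
df = pd.read_csv("sales.csv", usecols=["category", "quantity", "unit_price"])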

If you have a lot of long string data, it is quite likely the main culprit. (Or maybe it is dates.) There is not much you can do about that directly if you actually need the string data. You could perhaps do part of the process on those strings separately and replace them with the meat they contain.

If the strings/dates are often the same string/date, then you can potentially make big savings. If the strings that can appear in a column are few, then it is categorical data, so essentially a small int. But maybe it isn’t that simple.
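
If that is the case, telling read_csv up front that a column is categorical looks roughly like this (column names invented):

import pandas as pd

# Each distinct string is stored once; the column itself becomes small integer codes
df = pd.read_csv("sales.csv", dtype={"category": "category", "region": "category"})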

A column of object data, like strings or dates, really only contains pointers to the object values, which are in memory elsewhere. But by default (for speed) it allocates a new value for each field it reads.

If you take the trouble to check each value (in a column or related columns) to see if you’ve read it before, you can insert the existing value in the new cell. There is a flag to read_csv() called cache_dates that lets you ask for this to be done for dates, if you are already using parse_dates. For strings I think you would have to roll your own converter (see the converters argument), but I think it need only be a function that returns its argument, decorated with @functools.cache.
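
A rough, unbenchmarked sketch of that idea (the column names are made up):

import functools
import pandas as pd

@functools.cache
def dedupe(value):
    # Returns its argument unchanged; the cache means a string seen before
    # comes back as the same object rather than a fresh copy per row
    return value

df = pd.read_csv(
    "sales.csv",
    parse_dates=["order_date"],
    cache_dates=True,                     # reuse identical parsed dates
    converters={"product_name": dedupe},  # reuse identical strings
)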

A good way to investigate where your memory is going may be pandas.DataFrame.memory_usage (see the pandas 2.3.1 documentation). Be aware that for object data (like strings or dates) it will simply add up the size of all the pointers in that column. If you use the deep argument, it will add up the size of the strings, but I don’t think it is clever enough to recognise when a value isn’t really a new string.
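
Something like:

import pandas as pd

df = pd.read_csv("sales.csv")
print(df.memory_usage(deep=True))                    # bytes per column, counting string contents
print(df.memory_usage(deep=True).sum() / 1e6, "MB")  # rough total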

1 Like

The article Speed up your data science and scientific computing code may have relevant/useful information on how to reduce memory usage and speed up code execution.

1 Like

How large are the numbers that you expect to process in your columns?

int32 covers ranges between -2,147,483,648 to +2,147,483,647.
int16 covers ranges between -32,768 to +32,767.

If you are only expecting positive values, you can use the numpy library, which lets you work with unsigned integers (np.uint8, np.uint16, np.uint32, and np.uint64) and thereby roughly double the positive range of the corresponding signed intxx types. I don’t know what your application’s needs are in this respect, but these are an option if only positive values are expected. If the values you process are all positive and smaller than 64k, then np.uint16 is a viable option, as an FYI. This can also potentially speed things up a bit.
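
For instance, assuming columns whose values are always positive and within range (the names are invented):

import numpy as np
import pandas as pd

# Hypothetical columns: quantities never exceed 65,535, store IDs fit in 32 bits
dtypes = {"quantity": np.uint16, "store_id": np.uint32}
df = pd.read_csv("sales.csv", dtype=dtypes)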

To truly speed things up, you can integrate C into your Python script. I have not personally done this (it’s on my bucket list, however :wink:). Ironically, when I programmed in C for embedded applications, I used assembly to speed things up, …, go figure.

In any case, here are a few tutorials on incorporating Cython / C extensions into Python.

https://arjunsahlot.hashnode.dev/speeding-up-python-100x-using-c-integration

Hope this helps!

It seems you are doing the sort of processing that an SQL database would be very good at.

Have you considered entering your data into an SQL database and using SQL queries to get the results you want? SQL databases are very fast compared to CSV files.
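
For example, with the sqlite3 module from the standard library (the table and column names are only illustrative):

import sqlite3
import pandas as pd

con = sqlite3.connect("sales.db")

# One-off load: stream the CSV into SQLite in chunks so it never sits in memory whole
for chunk in pd.read_csv("sales.csv", chunksize=50_000):
    chunk.to_sql("sales", con, if_exists="append", index=False)

# Let the database do the aggregation and pull back only the small result
result = pd.read_sql(
    "SELECT category, SUM(quantity) AS total_qty FROM sales GROUP BY category",
    con,
)
con.close()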

2 Likes

Probably, pandas is already doing the heavy lifting in C.

However, if there is custom code, and especially if it works by iterating over rows of the dataframe itself, then that can be a high cost. I would look first for features of pandas that let you do the same thing as column operations, because then pandas will manage the iteration for you. It is difficult to make yourself do this, because we think of rows as objects.
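
For instance (invented column names), the two styles look like this:

import pandas as pd

df = pd.read_csv("sales.csv")

# Slow: a Python-level loop over rows
totals = {}
for _, row in df.iterrows():
    totals[row["category"]] = totals.get(row["category"], 0) + row["quantity"]

# Fast: the same aggregation expressed as a column operation, iterated in C by pandas
totals = df.groupby("category")["quantity"].sum()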

1 Like

Hello,

Please refer to the middle link (tutorial) that I provided in my previous post. In that tutorial, they highlight three different pathways for integrating C with Python. For this particular demonstration, I followed the Cython method due to its expected performance and ease of implementation. The performance improvement that I observed with Cython relative to the plain Python version was approximately a factor of 18.

Here is a synopsis of the steps from the tutorial:

Create a folder named cython_tests and place it on your desktop (or anywhere, for easy reference), then place the following three files in it. Note that if you create the fib_cython.pyx file with a Python editor, it may be saved as fib_cython.pyx.py. This is OK; just rename it afterwards by removing the trailing .py.

Create three (3) files:

  • fib_cython.pyx
  • setup.py
  • latency_comparison.py

In this test, we will be running the Fibonacci Sequence function 100 times. For testing purposes, we will use 30 as the sequence series argument. The file contents are as follows:

First file:

"""
   fib_cython.pyx
"""
cpdef int fib(int n):
    if n <= 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

Second file:

"""
   setup.py
"""
from setuptools import setup
from Cython.Build import cythonize
setup(
    ext_modules=cythonize("fib_cython.pyx")
)

Third file:

"""
   latency_comparison.py
  
   Test script to compare `Python` version with the `Cython` version.
"""
from fib_cython import fib
from time import time

# Original Python version for reference
def fibo(n):
    
    if n <= 1:
        return n
    else:
        return fibo(n-1) + fibo(n-2)

def latency_measure(func, seq_num):
    start = time()

    # Run the function 100 times: 10 calls per pass of a
    # 10-iteration loop, to minimize the for-loop overhead
    for i in range(1, 11):
        lat = func(seq_num)
        lat = func(seq_num)
        lat = func(seq_num)
        lat = func(seq_num)
        lat = func(seq_num)
        lat = func(seq_num)
        lat = func(seq_num)
        lat = func(seq_num)
        lat = func(seq_num)
        lat = func(seq_num)

    return time() - start

optimized_cython = latency_measure(fib, 30)
regular_fib = latency_measure(fibo, 30)

print('\nThe Fibonacci function sequence series argument used was: 30')
print('\nTest results for 100 iterations:')
print('--------------------------------')
print(f'Latency using Python script: {regular_fib:.2f} s')
print(f'Latency using cython optimized script: {optimized_cython:.2f} s')
print(f'Factor improvement: {regular_fib/optimized_cython:.1f}')

Go to the cython_tests folder and type cmd in the address bar at the top of the File Explorer window to open a command prompt in that directory.

(screenshot: folder_creation)

Building the Cython extension:

A CMD window will open with its default directory set to C:\Desktop\cython_tests.

At the prompt, type: python setup.py build_ext --inplace, like this:

(screenshot: create_cython_file)

If an error is observed, follow the instructions from the first link; the second link is where Microsoft Visual Studio is downloaded from.

After downloading Microsoft Visual Studio, retry; the build should now succeed. Then run the file latency_comparison.py.

The output that I observed was:

The Fibonacci function sequence series argument used was: 30

Test results for 100 iterations:
--------------------------------
Latency using Python script: 7.62 s
Latency using cython optimized script: 0.42 s
Factor improvement: 18.2

As you can see, the Cython version gives a performance improvement of approximately 18x relative to the plain Python version.

Although the Fibonacci function was used for this tutorial / demonstration, you can apply the same steps to create a Cython version of whichever function performs the critical processing in your small business analytics application.

Hope this helps.