Dear Python Community,
I am refining a Python script, developed in May 2025 for a small business analytics application, that processes large CSV datasets while keeping memory consumption low. The script performs its core functions correctly, but it runs into memory issues on larger datasets, which degrades performance. Your expertise in addressing this challenge would be greatly appreciated.
My application, built with Python 3.12 on a Linux server (Ubuntu 22.04, 8GB RAM, 4 vCPUs), processes CSV files containing sales data, typically 500,000 rows with 20 columns, amounting to approximately 1GB in size. The script uses pandas to load the CSV into a DataFrame for aggregation and filtering, following guidelines from the pandas documentation.
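For reference, here is a simplified sketch of the current loading step; the column names (region, sales_amount) are illustrative placeholders rather than the actual schema:

    import pandas as pd

    def load_and_summarize(path):
        # Load the entire ~1GB CSV into memory at once; this is where usage peaks.
        df = pd.read_csv(path)
        # Example aggregation: total sales per region (illustrative column names).
        summary = df.groupby("region")["sales_amount"].sum()
        # Example filter: keep only high-value transactions.
        high_value = df[df["sales_amount"] > 1000]
        return summary, high_value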
During processing, memory usage peaks at 6GB, occasionally causing the script to crash on larger datasets. To mitigate this, I implemented chunking with pd.read_csv(chunksize=10000), which reduced memory usage to 4GB but increased processing time by about 40%, from 2 minutes to nearly 3 minutes. I also profiled the script with memory_profiler and identified the DataFrame loading step as the primary memory bottleneck. Additionally, I experimented with dtype optimization, specifying smaller data types such as int32 for numeric columns, but observed only marginal improvements.
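To make these mitigations concrete, here is a simplified sketch of the chunked variant, combining chunksize, an explicit dtype map, and the memory_profiler decorator; again, the column names, dtypes, and aggregation are illustrative rather than my exact code:

    import pandas as pd
    from memory_profiler import profile

    # Illustrative dtype map; the real file has 20 columns with mixed types.
    DTYPES = {"sales_amount": "float32", "quantity": "int32"}

    @profile  # reports line-by-line memory usage when run under memory_profiler
    def summarize_in_chunks(path, chunksize=10_000):
        partials = []
        for chunk in pd.read_csv(path, chunksize=chunksize, dtype=DTYPES):
            # Only one chunk is resident at a time; aggregate it immediately.
            partials.append(chunk.groupby("region")["sales_amount"].sum())
        # Combine the per-chunk partial sums into the final totals.
        return pd.concat(partials).groupby(level=0).sum()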
Despite these efforts, memory consumption remains high, suggesting inefficiencies in my approach to handling large datasets. I am particularly interested in strategies to further optimize memory usage while maintaining reasonable processing speeds, ensuring the script can scale to datasets exceeding 1GB without requiring additional server resources.
What specific techniques or libraries would you recommend to optimize memory usage for processing large CSV datasets in Python 3.12?
Thank you for your insights and guidance.