I'm using Python 3.11 on Windows 10. My machine has 32 GB of RAM, so in-memory data structures should not be a problem. I'm reading a price file: a tab-delimited .dat file of about 12 MB, exported from an Excel .XLSX file.
The file has 100,000 lines. For each row I need to store 4 items in a dictionary that belongs to an options class, which I use to pass all kinds of data to different functions. The instance of that class is called `options`, and the dictionary I store the data in is `options.pricefiledict`.
In the first part of the program I read this 100,000-line tab-delimited .dat file and store 4 keys (with their values) per row into `options.pricefiledict`. Reading the file alone takes 20-30 minutes or more. The data in the dictionary is used later in the program, and yes, I do have to read every line. I also have to run the program against about 15 other files that use the same dictionary, so the read step is repeated many times.
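I can't post the real code, but here is a simplified sketch of roughly what the read loop does; the class, column positions, field names, and the price pattern are placeholders rather than my actual code:

```python
import re

# Placeholder pattern; my real validation patterns are more involved.
PRICE_RE = re.compile(r"^\d{1,6}\.\d{2}$")  # e.g. "12.99"

class Options:
    """Stripped-down stand-in for my real options class."""
    def __init__(self):
        self.pricefiledict = {}

def readpricefile(path, options):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Column positions and names are made up for this sketch.
            part_no, desc, price, uom = fields[0], fields[1], fields[2], fields[3]
            # Every row gets regex validation before it is stored.
            if not PRICE_RE.match(price):
                continue  # the real code logs/handles bad rows
            # 4 items stored per row, keyed so they can be looked up later.
            options.pricefiledict[part_no] = (desc, price, uom)
```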
What I tried
- I tried using openpyxl to read the file, but it was much slower.
- I would expect storing these keys and values in a disk-based structure to be even slower than openpyxl (see the sketch below for what I mean by a disk-based store).
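For clarity, this is the kind of disk-based store I have in mind and have *not* tried, a SQLite file holding the key/value pairs; the table and column names are just for illustration:

```python
import sqlite3

# Hypothetical disk-based alternative to the in-memory dict (not something
# I've actually tried); table/column names are made up for this example.
con = sqlite3.connect("prices.db")
con.execute("CREATE TABLE IF NOT EXISTS prices (part_no TEXT PRIMARY KEY, price TEXT)")
con.execute("INSERT OR REPLACE INTO prices VALUES (?, ?)", ("AB1234", "12.99"))
con.commit()
print(con.execute("SELECT price FROM prices WHERE part_no = ?", ("AB1234",)).fetchone())
con.close()
```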
Notes
- In the readpricefile() function I have to use regex to validate prices and other data, and that slows things down; a plain `.find()` won't work here. The regex checks are called a lot in my read routine and I don't know how to speed them up. Is there a way to speed up the regex, or a higher-powered third-party regex library I could use instead of the standard library's `re` module? (A simplified example of what I mean is below.)
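By a third-party library I mean something like the PyPI `regex` package, which as far as I understand is meant to be a drop-in replacement for `re`; the pattern here is a simplified example, not one of my real checks:

```python
# pip install regex
import regex as re  # drop-in replacement for the standard re module (as I understand it)

# Simplified example of a per-field check; my real patterns are more involved.
PRICE_RE = re.compile(r"^\d{1,6}\.\d{2}$")

print(bool(PRICE_RE.match("12.99")))   # True
print(bool(PRICE_RE.match("12.999")))  # False
```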
Is there a faster in-memory data structure than a dictionary for storing these? Whatever it is, it must be keyed, with a value associated with each key.
I'm trying to speed up this read portion of the program. Any ideas on how to do that? I'm unable to post the full program.