How to read and store data in a dictionary-like data structure more quickly?

I use Python 3.11 on Windows 10. My machine has 32GB of RAM, so in-memory data structures should be no problem. I’m reading a price file, a tab-delimited .dat file exported from an Excel .XLSX file; the .dat file is about 12MB.

I have a .dat file with 100,000 lines. For each row in the file I need to store 4 items in a dictionary which is part of an options class. I use this options class to pass all kinds of data to different functions.

I have defined a class which contains this dictionary, the instance of the class is called “options” and the dictionary I store this data in is “options.pricefiledict”.

In the first part of the program I read this 100,000-line tab-delimited .dat file and store 4 keys per row into the options.pricefiledict dictionary. This alone is taking 20-30 minutes or more. The data in the dictionary is used later in the program, and yes, I must read all lines. I must also run this program on about 15 other files using the dictionary, so the read process is done many times.
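Roughly, the setup looks like this (heavily simplified, not my real code; the key format below is just an example):

class Options:
    def __init__(self):
        # Keyed price data ends up here; the real class carries many other settings too.
        self.pricefiledict = {}

options = Options()
# Later, 4+ keys per row, something like:
# options.pricefiledict[f"{model}|{grade}|1"] = price1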

What I tried

  1. I tried using openpyxl to read the file but it was much slower.
  2. I would expect doing disk-based access to store these keys and values would be even slower than openpyxl.

Notes

  1. In this readpricefile() function I have to use regex to validate prices and other data, and that slows things down. A plain .find() won’t work here. I don’t know how to speed up the regexes, and the regex checks are called a lot in my read routine. Perhaps there’s a way to speed them up? Or use a higher-powered third-party regex library rather than the standard library re module?

Is there a faster in-memory data structure than a dictionary to store these? The data structure must be keyed and have a value associated with it.

I’m trying to speed up this read part of the program. Any ideas how I can do this? I am unable to post the full program.

Have you considered using pandas (or polars which I haven’t used much but which has much of the same functionality)?
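For example, something like this reads the whole tab-delimited file in one call and does the validation with vectorised string operations (untested sketch; the filename and column position are placeholders):

import pandas as pd

# Untested sketch: read the whole tab-delimited file at once, as strings.
df = pd.read_csv("pricefile.dat", sep="\t", header=None, dtype=str)

# Vectorised validation of one column (position 4 is a placeholder):
model = df[4]
bad = model[model.str.contains(r"[^a-zA-Z\d\-.]", na=True)]
print(f"{len(bad)} rows with invalid model characters")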

Profile it. Or at least comment out the dict and just run the other parts including the regex, to identify which task is actually the limiting factor. Regexes with reverse lookups can be slow.


I haven’t considered pandas yet as I don’t know if it will improve performance much at all. Nor do I know if I should loop through a pandas DataFrame one row at a time, or use another approach like pandas filtering.

I am also researching how to precompile regular expressions since those don’t change, but the data I’m checking does change on every line of the .dat price file. My loop to read the price file looks something like this. This is oversimplified but you get the idea.

import re

def readpricefile():
    # Code omitted: read file into list prfile.

    lpos = 0 # Starting line to read in prfile.
    maxlines = len(prfile) # Number of lines in prfile.
    while (lpos < maxlines):
        lin = prfile[lpos] # Get one line.
        if ((lpos % 500) == 0): # Show message every 500 lines.
            print(f"Reading line {lpos} of {maxlines}")
        linlist = lin.split("\t") # Split line into a list by tab.
        model = linlist[4]
        grade = linlist[5]
        price1 = linlist[6]
        price2 = linlist[7]
        price3 = linlist[8]
        price4 = linlist[9]
        # Now check model for invalid characters with regex.
        if re.search(r'[^a-zA-Z\d\-\.]', model):
            print(f"ERROR: Invalid characters found in model {model}")
        if re.search(r'[^a-zA-Z\d]', grade):
            print(f"ERROR: Invalid characters found in grade {grade}")
        if re.search(r'[^\d\.]', price1):
            print(f"ERROR: Invalid characters found in price {price1}")

        lpos += 1

There may be errors, I didn’t run this. As you can see, for each of the 100,000 lines I have to run a regex on the model, the grade, and the 4 prices, and those values change on every line. There is more going on, but this is the basic idea.

Instead of using regex for those checks, create a set for each variation and use if not valid_chars.issuperset(field). That should be faster.
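Something along these lines (untested; the character sets mirror your regexes):

from string import ascii_letters, digits

# Build the allowed-character sets once; per line it's just a set operation,
# no regex engine involved.
valid_model_chars = frozenset(ascii_letters + digits + "-.")
valid_grade_chars = frozenset(ascii_letters + digits)
valid_price_chars = frozenset(digits + ".")

model, grade, price1 = "M-100.5", "A1", "19.99"  # example values

if not valid_model_chars.issuperset(model):
    print(f"ERROR: Invalid characters found in model {model}")
if not valid_grade_chars.issuperset(grade):
    print(f"ERROR: Invalid characters found in grade {grade}")
if not valid_price_chars.issuperset(price1):
    print(f"ERROR: Invalid characters found in price {price1}")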


Oh, that might work! I will search for docs on .issuperset(). You see the data is so full of garbage that I have to check for the opposite of valid characters. And the file is so large we cannot clean up the file by hand.

Do you need that regex check at all? What if the data contains non-ASCII Unicode characters? Maybe it would be faster just to use an EAFP approach based around try: float(price).

Also, not sure if this is the same in your actual code, but put everything into a function instead of doing loops at top level. Variable accesses inside a function are much faster than at module level.
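The float() idea would look roughly like this (just a sketch; the function name is made up):

def parse_price(text):
    # EAFP: let float() do the validation and catch the failure,
    # instead of pre-checking the string with a regex.
    try:
        return float(text)
    except ValueError:
        print(f"ERROR: Invalid price {text!r}")
        return None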

That might work for prices, but not for models or the grade. Models and the grade can have letters or digits, but nothing else. Prices should be just a digit or period, no negative sign. We do not have negative prices.

Yes, the loop I provided is inside a function called readpricefile() or something like that. I’ll fix my code.

You can try storing each row as a namedtuple instead, which has a lower overhead than a dict.
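For example (a sketch; the field names are made up):

from collections import namedtuple

# One lightweight, immutable record per row instead of several dict entries.
PriceRow = namedtuple("PriceRow", "model grade price1 price2 price3 price4")

linlist = "a\tb\tc\td\tM-100\tA1\t1.00\t2.00\t3.00\t4.00".split("\t")
row = PriceRow(*linlist[4:10])
print(row.model, row.price1)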

The model check can be done more efficiently with:

from string import ascii_letters, digits

strip_valid_model_chars = str.maketrans('', '', ascii_letters + digits + '-.')

...

if model.translate(strip_valid_model_chars):

The grade check can be done more efficiently with if not (grade.isascii() and grade.isalnum()):.

Once a price has been converted to a number, you can validate against negative prices with if price1 < 0: ....
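Putting those three checks together, roughly (untested sketch):

from string import ascii_letters, digits

strip_valid_model_chars = str.maketrans('', '', ascii_letters + digits + '-.')

def check_row(model, grade, price1):
    # Anything left after stripping the valid characters is invalid.
    if model.translate(strip_valid_model_chars):
        print(f"ERROR: Invalid characters found in model {model}")
    if not (grade.isascii() and grade.isalnum()):
        print(f"ERROR: Invalid characters found in grade {grade}")
    try:
        if float(price1) < 0:
            print(f"ERROR: Negative price {price1}")
    except ValueError:
        print(f"ERROR: Invalid characters found in price {price1}")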

OMG, just using this when checking the model characters is so much faster! Thanks!

I will do more research. This particular project we do twice a year and we always have very large spreadsheets to read.

I did find more regex replacements which can just be dropped in by doing import modulename as re (a quick sketch follows the list).

  1. regex. regex · PyPI
  2. Google’s re2 engine. re2 · PyPI. Not fully compatible with the re API; some advanced features like backreferences are not supported.
  3. Intel’s Hyperscan. GitHub - intel/hyperscan: High-performance regular expression matching library. Cons: setup is more complex; it requires installing a C++ library.
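For example, with the first one the drop-in would look like this (untested):

# Untested: the `regex` package keeps the stdlib `re` API,
# so the existing patterns and calls work unchanged.
import regex as re

invalid_model = re.compile(r'[^a-zA-Z\d\-\.]')
if invalid_model.search("M-100.5"):
    print("ERROR: Invalid characters found in model")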

You mentioned the file is tab-delimited. You should probably use the csv module to read it (or pandas as others suggested). Either one will at least split the lines into fields, which you can then sanity check. If you’re expecting numbers, converting with float or int as appropriate and then catching the (rare?) errors will almost certainly be faster than your last regular expression.
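A sketch of the csv version (untested; the filename is a placeholder, column positions taken from your earlier snippet):

import csv

with open("pricefile.dat", newline="") as f:
    for linlist in csv.reader(f, delimiter="\t"):
        model, grade = linlist[4], linlist[5]
        try:
            # Let the conversion itself catch bad price data.
            prices = [float(p) for p in linlist[6:10]]
        except ValueError:
            print(f"ERROR: Invalid price in row: {linlist}")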

As for the regular expressions themselves, it’s possible you will see a slight speedup by precompiling them. There’s some caching going on, but since you have a fixed number of regular expressions, there’s no sense taking the chance of a cache miss.
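Precompiling just means hoisting the patterns out of the loop and reusing the compiled objects, e.g. (sketch):

import re

# Compile each pattern once at module level and reuse the compiled objects per line.
INVALID_MODEL = re.compile(r'[^a-zA-Z\d\-\.]')
INVALID_GRADE = re.compile(r'[^a-zA-Z\d]')
INVALID_PRICE = re.compile(r'[^\d\.]')

if INVALID_MODEL.search("M-100.5"):
    print("ERROR: Invalid characters found in model")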

Yeah, I know. I’m also suggesting just supporting normal strings: let ‘illegal’ characters in models or the grade through, and deal with them later.

Am I reading this correctly? 20 minutes for a 12MB tab-delimited file?

Yes, 20 minutes to loop through 100,000 rows and on each row grab the model, grade, and 4 prices, do data validation on each, check for garbage characters in all of them, then make 4+ keys for each row and store the keys and new data in a dictionary. It turns out storing the data in the dict wasn’t the bottleneck; the regular expressions were.

The original 12MB price file was read into a list in seconds, and it takes 20+ minutes to process each of the 100,000 rows. There were significant regexes involved to do data validation, as there was plenty of garbage in those 100,000 rows which needed cleanup. The biggest slowdown was probably this regex looking for characters NOT in the allowed set:

if re.search(r'[^a-zA-Z\d_]', model):
    print(f"ERROR: bad model")

I replaced that regex with the reply which I marked as the solution.

And 20 minutes was a low estimate, I never did get through the whole file in the initial stages of working on this program. The real time was likely 30-40 minutes.

A significant problem we had was that when exporting the Excel XLSX file to a tab-delimited .dat file, some entries of the model description were surrounded in double quotes. In most of these there were excess tab characters inside the quoted text, before the closing quote, which of course messed up the tab-delimited file.

It was so bad I had to make a preprocesspricefile() function just to loop through the price file first, before getting all the data I need, to look for bizarre cases like this and spit out every line number that had a problem. There were 440 lines with problems, where I had to manually fix the .dat file in a text editor and remove the unwanted tab characters before the closing quote.

I was unable to remove the embedded tabs from the Excel file because I am on a laptop without a numeric keypad, so I could not do a find-and-replace on ALT+009, which requires entering “009” on the keypad. My laptop has no way to emulate a keypad, and the other people at the office who do have a keypad are on vacation this week.

Which is why in some projects we read a CSV file instead.

Another thing is that splitting on tabs already passes over the entire line, and after each tab is found you already know which type of data to expect next, so the scanning and the conversion can be done in one pass. The split also allocates memory for all those little strings, even though for many of them you really only need the corresponding numeric values, so those intermediate strings are wasted. You only need to view each chunk of lin up until the moment it gets turned into the type of data you actually need.

Just as an example, suppose you have the file

aaa	123.456	bbb
ccc	123.456	ddd

doing

with open('file.txt', 'rb') as infile:
    for line in infile:
        if len(line) > 9:
            print(float(memoryview(line)[4:11]))

In your case, you would know the 4 and 11 from being in the middle of scanning the line: you know the locations of the previous and current tabs, and you are in a “state” in which you are expecting to read a float. With the memoryview you can look at a section of the line without first having to make a str, or even a bytes, out of it.
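A rough sketch of that single-pass scan over the example file above (untested):

# Untested sketch: walk the tabs yourself and convert each field directly
# from a memoryview slice, without building intermediate strings.
with open('file.txt', 'rb') as infile:
    for line in infile:
        view = memoryview(line)
        start = 0
        for col in range(3):                      # 3 columns in the example file
            end = line.find(b'\t', start)
            if end == -1:
                end = len(line.rstrip(b'\r\n'))   # last column: stop before the newline
            if col == 1:                          # the column where a float is expected
                print(float(view[start:end]))
            start = end + 1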