I have Python 3.12 on Windows 10. I have never used any type of caching module. I doubt I know the basic concepts.
A file I’m reading has 1800+ lines. I am making 1800+ queries against our database but each query takes about 3 seconds. That’s 5400 seconds or 90 minutes to run this file. Some files will be larger and have more rows.
My intention is to read the cached data into memory from a file at the beginning of the program, update the memory cache (like a dict) as the program runs, and at the end of my program to write the updated data back to a file.
I’ve cached data as JSON files before but not this much data, which will grow to be quite large eventually. Maybe I will have to date stamp each record as well and purge old records each time my program runs.
Questions
I’d like my app to do the caching with the cache file local to my app directory.
How fast is disk caching in Python, like shelve? I’d like to save the data to disk at one point. This on-disk cache could easily grow to 10,000 records over the next year. Each record being about 100 bytes.
I do not have an SSD, I have a normal hard drive where I would be storing data.
What are my other options?
Is there another more automatic option that will cache my data to a file with very little code?
I’m considering using a Pandas dataframe as the cache as it’s easy read from and write to disk. And I’m familiar with Pandas.
Thank you!
p.s. I’m currently reading up on some web pages about this. Any input would help as I don’t have much experience with auto caching.
I do not have admin access to the database. So I cannot tweak it.
I don’t know what a linear scan is. I only work with the Python end to query the database.
The SQL query I’m doing is connecting 2 tables, one is a shipment table, which is very large. Well over 2 million records I’m sure.
The database is Azure MS SQL on the cloud. The Azure db is known to be quite slow at times, normally just before 8am EST.
I use the Python modules pyodbc and pymssql with it.
In my SQL statement in Python the criteria is for the shipment tracking number and the chance the tracking number has an index at the db level is very low.
A linear scan is when a JOIN has to look at every row in a table to complete the join.
The 3 seconds sounds like that is happening.
2 million records is not large in SQL world.
Your organisation’s db admin should be able to optimise that query in the database by adding the right index. Then you should get answers in milli-seconds.
I have used MS SQL server on slow hardware and it’s performance was very good.
I agree with @barry-scott, it would be very good to have a DBA look at that table. That said, keeping a local copy of the data you’re querying could still be helpful in some circumstances.
Does the data returned by the 1800+ individual queries overlap? That is, are different queries fetching the same data? If so, keeping a cache in memory while running the script could avoid database requests. The more duplication, the more saved requests. If there is not much duplication, then the time it saves may not be worth the extra complexity in your program.
Will records you have already queried in previously change in between the times you run the script? It sounds like you would like to persist the cached data between script runs. The trickiest part about any long term caching is invalidation - knowing when the cached version is out of date and needs to be updated. If each script run is only adding new records and never needs to update ones that are already cached, then invalidation is not such a problem. You just have to be very sure that your local copy of the data is not going to diverge from the copy in the database.
Even with a spinning hard disk, and even if the database was optimized better, reading and writing to a local disk should still be an of magnitude faster than making a database query over the internet to Azure. Further if you do what you describe and load from disk once at the start, modify in memory, then save to disk once at the end, that’s two ops over what should be contiguous data.
On the other hand, if each time you run the script it is possible that something you have previously queried has changed in the database since the last run, caching is less likely to be helpful, because checking whether something needs to be updated will probably require a database query anyway. You would need a way to check whether a record was stale or not without running the full slow query.
Also, 10,000 • 100B is still seems in the range that file backed is fine and a separate key-value store like redis would be overkill. The Shelve module would be a fine place to start, and if you want to skip the Pickling process and handle the value bytes yourself, you can also use the dbm module directly. In either case just make sure synchronized writes are off so it’s not writing to disk on every change.
If your data is a little more complicated so that just key/value storage is not convenient, you could also look into in-process database engines like SQLite or DuckDB. These store their data in local files and don’t require a separate process or server to manage.
I forgot to mention this is a vendor db and our IT dept will not touch it, or add an index. I would have to contact the vendor to make an index on the field I’m doing the query on.
Also, at certain times of day this db seems slower. Like early morning. Because our Azure db is a mirror of our real db. And it gets the most recent data from the real db to the Azure db in the early morning. But I start work at 5:30am EST so that’s when the mirror process is running.
I’m in the process of making a local cache taking into account various variables people mentioned here. For testing, yes my data repeats and I can get it from a cache. But when this program is done there will be only unique data because I’m processing all new shipments each week all with new tracking numbers, and I’m querying the DB based on the tracking number, one at a time.
I’m storing my cache as a JSON file. So at the beginning of the program I read the JSON file into a dict used as a cache, the “cache dict”. Then as the program runs, I try to query the “cache dict” first, and if the key is not found I query the db.
At the end of the program, and after I read 100 records, I save the “cache dict” back to the JSON file.
This has speeded up the testing of this program a lot.
Does the data returned by the 1800+ individual queries overlap?
Only during the testing phase, which I’m in right now. Not during production where I process 1800+ new shipments each week, each with a new carrier tracking number.
Will records you have already queried in previously change in between the times you run the script?
Only ingesting new records gives you a lot of flexibility here. Having a local cache makes perfect sense to me - in the worst case you can always delete your local copy and rebuild it. The strategy you are using sounds reasonable also.
Over time if your local data “outgrows” JSON it would be easy enough to convert that into another format - for example, if you wanted something more space efficient or that supported random lookups without loading the whole file. I couldn’t say what library would be best for that without trying a few.
I am curious to know more about the problem you are solving, if it is something you’d like to share. (But of course do not feel obligated to have the internet critique everything, haha)
I also had the thought, that these database requests sound like a nice fit for either producer/consumer multithreading, or asynchronous tasks (asyncio). If you needed to reeeeally maximize throughput. Any case where your program has to sit and wait for a request to finish fits nicely with those, as long as the requests don’t depend on each other somehow.