Effective strategies for optimizing performance

What are some effective strategies for optimizing the performance of Python applications, particularly in data processing tasks?

You may have to be more specific with your question. But in general, if you really care about performance, you may want to look at third-party libraries such as NumPy or pandas.

Increase performance in which areas? Fetching records from a database? Reading or writing files? Holding temporary data in memory? I do all of these, and I’ll describe what I do.

  1. Database access from Python. Python database modules generally provide a way to fetch a batch of records at once. I might fetch 20-30 records, process those, and then fetch the next 20-30.
  2. Executing database selects. I can’t do anything from Python to speed up the actual SELECT, and we don’t have access to change the database either; it’s a black box for us. But on the database side one can enable caching (which I believe mostly helps when the same records are read repeatedly, and I don’t normally reuse the same records during a program run) and put an index on any field used in a SQL WHERE clause.
  3. Keyed temp data. I use an in-memory dict to store temporary data when the data is keyed. Sometimes I have a customer ID and need to look up the customer name for output. After doing many calculations on this data, I add the required fields to a dataset and write it out as an Excel file.
  4. My own cache JSON files. I sometimes use JSON data stored on disk to cache values at the program level: values that don’t change often, like customer names and IDs. I keep a set of 3-4 functions for each cache file. One reads the JSON file into an in-memory dict at startup. Another looks up the customer name for a given customer ID; if the ID isn’t in the cache, it fetches the name from the database and stores it in the in-memory dict. At the end of the program, any cache file that changed is written back to disk. One side effect: if an existing customer name is updated, my cache file won’t know about it, so I just delete the cache file called customer.json and it gets repopulated on the next run. This works fine for my purposes.
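The batched fetch in point 1 can be sketched with the DB-API’s `fetchmany()`. The table, columns, and batch size below are invented for illustration, and I use an in-memory sqlite3 database so the sketch runs standalone:

```python
import sqlite3

# Hypothetical setup: an in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(100)])

cur = conn.cursor()
cur.execute("SELECT id, amount FROM orders")

total = 0.0
while True:
    # fetchmany() returns up to `size` rows; an empty list means we're done.
    batch = cur.fetchmany(size=25)
    if not batch:
        break
    for row_id, amount in batch:
        total += amount
```

The same loop works with any DB-API 2.0 driver; only the connection setup changes.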
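Points 3 and 4 together amount to a small read-through cache backed by a JSON file. Here is a minimal sketch of that idea; the class, file name, and `fetch_from_db` callable are my own inventions, not the poster’s actual code:

```python
import json
import os

class JsonCache:
    """Read-through cache: in-memory dict persisted to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.dirty = False
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)
        else:
            self.data = {}

    def lookup(self, cust_id, fetch_from_db):
        # JSON object keys are always strings, so normalize the ID.
        key = str(cust_id)
        if key not in self.data:
            # Cache miss: fetch_from_db is a stand-in for a real database
            # query; remember the result so the next run skips the query.
            self.data[key] = fetch_from_db(cust_id)
            self.dirty = True
        return self.data[key]

    def save(self):
        # Only rewrite the file if something changed during this run.
        if self.dirty:
            with open(self.path, "w") as f:
                json.dump(self.data, f)
```

Deleting the JSON file invalidates the cache, matching the “just delete customer.json” approach above.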

General tips

  1. Don’t run commands inside a loop if you don’t have to.
  2. Don’t run SQL selects inside a loop if you don’t have to.
  3. If you only need some data from the database, fetch it once per program run.
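These tips all boil down to hoisting invariant work out of loops. A toy sketch of the difference, where `slow_lookup()` and its rate table are invented stand-ins for a database round trip:

```python
# Pretend each call to slow_lookup() is an expensive database round trip.
CALLS = 0

def slow_lookup():
    global CALLS
    CALLS += 1
    return {"A": 1.1, "B": 2.2}   # made-up rate table

items = ["A", "B", "A", "B"]

# Bad: one lookup per loop iteration.
total_bad = 0.0
for code in items:
    rates = slow_lookup()          # repeated work inside the loop
    total_bad += rates[code]

# Good: fetch once before the loop, reuse inside it.
rates = slow_lookup()
total_good = 0.0
for code in items:
    total_good += rates[code]
```

Both loops produce the same total, but the second version performs one lookup instead of one per record.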

What I’m finding about Python is that it’s really good at managing memory for things like reading large Excel files (5 GB or more) and passing large variables to functions.
