Predict and exploit available memory deterministically

Hi - After spending a day on SO, reading outdated blogs and running experiments, I decided to come back here for advice. The problem is simple, but the answer might not be.

I need to read chunks of data from disk into memory; after loading I need to manipulate the data, for which I need sufficient memory, before writing it back to disk. The manipulation of the data is largely done in numpy arrays, so my code is memory- and I/O-bound.
By minimizing the number of read and write operations I will have done as much optimization as possible (I believe).

Currently I use:

import psutil, os, gc
import numpy as np

free = psutil.virtual_memory().free

# fetch numpy array with 1 item to determine memory consumption per item.
item = read_array(path, slice(0, 1))

# split free memory by 3: 1/3 for read, 1/3 for manipulation, 1/3 for output.
n_items = (free // 3) // item.itemsize  # free // 3 is in bytes; itemsize is bytes per item.
items = read_array(path, slice(0, n_items))

out_data = np.empty_like(items)  # another 1/3 of memory

# do manipulation ...

write_array(out_path, out_data)  # pseudocode: write the result back to disk

del items, out_data
gc.collect()
# repeat as loop until done

Is there a better way?

What is the manipulation you are performing? Is it element-wise, or over some window?

I ask this because it’s a lot simpler if you are operating on single elements (or some small fixed size). The simplest approach is to read the elements one at a time in your code, and let Python and the OS worry about file-buffering for you. Trying to fine-tune it yourself probably isn’t going to do better than that.
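For example, a minimal sketch of that style, assuming a flat binary file of float64 values read in small fixed-size blocks; `process` is a placeholder for whatever your operation is:

import numpy as np

BLOCK = 64 * 1024  # a modest, fixed number of items per read

with open("data.bin", "rb") as src, open("out.bin", "wb") as dst:
    while True:
        block = np.fromfile(src, dtype=np.float64, count=BLOCK)
        if block.size == 0:
            break
        process(block).tofile(dst)  # placeholder: element-wise work, then write

The OS read-ahead and page cache do the buffering for you; there is no need to size the reads to the machine's memory.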

Trying to use all your memory is unlikely to be a good idea; you can do it, but your OS would really prefer to have some free memory around for other stuff, and it will start to chug.

And you really don’t need to call gc.collect() manually or delete items yourself.

2 Likes

I assume you can perform your algorithm on any size of data you choose, and the only difference is performance? If so, this sounds like a great job for a parameter to the program, but if you want to gauge a default, I’d actually measure your usage ratio, rather than assuming 3:1 as in your example. By that, I mean:

  1. Run your program with a known chunk size (not too big)
  2. Use OS-level facilities to see how much memory the program is using
  3. Divide total memory usage by chunk size.

Once you have the actual ratio, which could easily be either higher or lower than the 3:1 you’re using, you can use that with the free memory calculation to determine how much to use. However, I would NOT use virtual memory for this. You’re unlikely to improve performance by pushing yourself into the swapper. If you need (say) 2.5MB for a 1MB chunk, take the physical memory free, divide by 2.5, and use that.
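As a rough sketch of that calculation (the 2.5 ratio and `bytes_per_item` are placeholders for numbers you measure yourself):

import psutil

MEASURED_RATIO = 2.5   # total memory used per byte of input chunk, from your own measurements
HEADROOM = 0.8         # leave some physical memory for the OS and other processes

free_physical = psutil.virtual_memory().free
chunk_budget = int(free_physical * HEADROOM / MEASURED_RATIO)  # bytes for one input chunk
n_items = chunk_budget // bytes_per_item                       # items per chunk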

Side note: Have you looked into memory mapping? You MAY be able to reduce your initial load cost by mapping the file (read-only) into virtual memory, and then just proceeding to refer to it as if you’ve loaded it. The OS can handle the details for you, including discarding parts from memory when they aren’t needed.
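A minimal sketch of that idea with numpy, assuming the file is a flat array of a known dtype (file name and dtype are illustrative):

import numpy as np

# Map the file read-only; nothing is loaded until a slice is actually touched.
src = np.memmap("data.bin", dtype=np.float64, mode="r")

chunk = src[:1_000_000]   # the OS pages this range in on demand
result = chunk * 2        # ...and may discard the pages again under memory pressure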

1 Like

It will be a balance between reducing the number of I/O calls and processing overhead. The best throughput may not be memory limited.
Often using a reasonable buffer size will get you maximum throughput.

As Chris said: measure, do not assume, what is important.
Using a parameter to change the chunk size, measure the effect of increasing the chunk size, then use the size that works well.
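A rough way to do that measurement (the chunk sizes and `run_job` are illustrative):

import time

for chunk_size in (1 << 16, 1 << 20, 1 << 24):  # e.g. 64 Ki, 1 Mi and 16 Mi items
    start = time.perf_counter()
    run_job(chunk_size)                         # placeholder: your read/manipulate/write loop
    print(f"chunk_size={chunk_size:>10}  {time.perf_counter() - start:.2f}s")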

I would not get hung up on details of where the memory is used unless the measurements point to a need to optimise further.

1 Like

The common case is a join-like operation where an index declares the values to be read.

index = [2,5,3,4,6,7,1]  # the index
         -----          chunk 1
               -----    chunk 2
                     -  chunk 3

To populate the output array I need to read the indices from the index and find the appropriate value in the source array, knowing that only a “chunk” will fit into memory at a time.
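In rough pseudocode the chunked gather looks like this, assuming the index and source are flat binary files and only one chunk of indices plus its output need to be in memory at once (file names, dtypes and the chunk size are illustrative):

import numpy as np

CHUNK = 1_000_000  # indices processed per pass

index = np.memmap("index.bin", dtype=np.int64, mode="r")
source = np.memmap("source.bin", dtype=np.float64, mode="r")

with open("out.bin", "wb") as dst:
    for start in range(0, index.shape[0], CHUNK):
        idx_chunk = np.array(index[start:start + CHUNK])  # copy one chunk of indices into RAM
        gathered = source[idx_chunk]                      # fancy indexing copies just those values
        gathered.tofile(dst)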

If I call psutil for free memory and use this value, chunk 2 will run out of memory unless I use gc.collect().

With async I get limited by IOPS. With chunking or batching I get limited by CPU (unless I get an out-of-memory error).

That is correct.

I will try that today.

I have, and this is what I found:

If I mmap a contiguous block and write values to it, I end up having prematurely optimized for the whole array. This means that every subsequent read operation incurs roughly 2.5 times the read time, as mmap has a cost associated with the map operation that stages the data.

The compromise I found more effective was using pages (like databases do), where the page size fits into memory, since the overhead of managing pages is in the nanoseconds.
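Roughly like this, with one reusable page buffer allocated once, so each pass only pays for the read into it (the page size, file name and `handle_page` are simplified placeholders):

import numpy as np

PAGE_ITEMS = 4_000_000                         # sized so one page comfortably fits in memory
page = np.empty(PAGE_ITEMS, dtype=np.float64)  # reusable page buffer, allocated once

with open("source.bin", "rb") as src:
    while True:
        n = src.readinto(memoryview(page).cast("B"))  # fill (part of) the page from disk
        if n == 0:
            break
        handle_page(page[:n // page.itemsize])        # placeholder for the per-page work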

That’s also the direction I was heading. I was mainly wondering if there was some obvious direction I missed.

Thank you all for the inputs. Your responses were very reassuring. I will develop a feature for calibration using a suitable test suite and let that be the source for the optimization.

PS: Here’s the project: tablite · PyPI

Note that the figure of 2.5 was a placeholder, and your ACTUAL figure needs to be determined through experimentation and measurement of your memory usage.

Of course, Chris. Thanks.

2 Likes

Are you on a Linux system?

Memory use in Linux is not simple. It is a bad assumption that the free memory can be assigned to your process with no side effects. You might force the kernel to reclaim caches that were helping you. On the other hand, you can often use more memory than is currently free, because Linux has been caching all the files read in buffer caches. It will not reclaim that memory unless there is pressure to do so.

In most cases it is best to benchmark with various chunk sizes and see what works best, rather than guess what the kernel and C runtime will do under your code.

1 Like

I still don’t understand the operation, I think. So you read an array of indexes from some path, and an array of data from somewhere else, and you’re reading data based on the index? Does the chunk size affect the operation at all, or is it an irrelevant detail (for the operation, not the performance)?

As an alternative: If you’re working with plain numpy arrays[1], this could be a good use-case for something like zarr and potentially dask. I’ve used zarr+dask for iterative processing over very large arrays without having to worry too much about this stuff.
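For instance, a rough sketch of that pattern (store names and the operation are placeholders; dask processes the array chunk by chunk rather than loading it whole):

import dask.array as da

source = da.from_zarr("source.zarr")  # lazy view of the on-disk array
result = source * 2                   # placeholder for the real manipulation
result.to_zarr("out.zarr")            # computed chunk by chunk and streamed to disk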


  1. i.e. this isn’t shorthand for a much more complicated data structure ↩︎

1 Like