Simulating on-disk files

Hi, I have an executable that I can’t modify that takes input and output filenames as arguments.

myexe input_file_path output_file_path

The executable does not accept input through stdin, and stdout is just standard messages from the executable, not interesting output.

As part of a Monte Carlo analysis, I run the executable many hundreds of times for slight changes in the input file and aggregate the output along the way.

Wanting to avoid IO on disk, I was hoping there was a cross-platform method of simulating the input and output files in memory.

I have seen mention of using pipes for this purpose, but am not clear on whether there is a way to convert any of the pipe-related concepts to expected on-disk names. I have also read about using /dev/fd/<proxyfile> as a way to address this issue, but that won’t work on Windows.

Thank you for any ideas or suggestions you may have.

Have you considered a ramdisk?

The details of how to set up one will, of course, depend on the OS.

I don’t think this is something Python can do by itself. But for Windows, ImDisk is apparently recommended for creating in-memory file systems. (I haven’t used it myself, though.)
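On Linux specifically, many systems already mount a tmpfs (RAM-backed file system) at /dev/shm, so no extra tooling is needed there. A minimal sketch of the idea, assuming /dev/shm exists and is tmpfs on the target machine (the file names and the commented-out `myexe` call are just placeholders for the OP’s setup):

```python
import os
import tempfile

# On most Linux systems /dev/shm is a tmpfs mount, so files created there
# live in RAM. This is an assumption about the target system; Windows
# would need a tool such as ImDisk instead. Fall back to the normal
# temp dir if /dev/shm is not available.
ram_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()

in_path = os.path.join(ram_dir, "mc_input.txt")
out_path = os.path.join(ram_dir, "mc_output.txt")

# Write the (modified) input file into the RAM-backed directory.
with open(in_path, "w") as f:
    f.write("parameter = 1.0\n")

# A real run would then be something like:
# subprocess.run(["myexe", in_path, out_path], check=True)

# Read the file back, then clean up.
with open(in_path) as f:
    print(f.read(), end="")
os.remove(in_path)
```

The executable never knows the difference: it still receives ordinary pathnames, but the kernel keeps the data in memory.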

Do you know that you have an I/O issue, or is this a guess?

All operating systems cache files in memory and write them out in the background.
If you have an SSD then you may not see much advantage to using a ram disk.

Any ram disk solution will be OS specific.


Thank you all for your input!
I am guessing that the IO may be an issue on some of the platforms on which my application might be deployed. I was hoping that if there were an easy-to-implement ‘in-memory’ method, I could at least do some benchmarking. Aside from the potential tangible benefit, I was also curious about whether this was even possible.

If you had access to the code (and it was Python), it would be easier to pass data around in memory directly. Even if it is not Python, maybe it could be called via a .so/.dll or similar.

If using an external executable, a ram disk is likely the closest you can get.

Thanks for the suggestion. I will try that locally, but I can’t control what is installed on my users’ machines.

I’m still hoping for a clever hack and will explore whether there are any OS-related calls/resources I could make use of.

This suggests that you can process all of the files in memory. If that is true, you should read the files with buffering; otherwise you will need more memory.

There is such a thing as a “named pipe”. You use os.mkfifo to create one:

(That link goes directly to os.mkfifo.)

Then the programme can use its pathname where you want a file. You’ll of
course have to write data into the write end of the pipe for the
programme to receive. Or the converse if you’re using the pipe for the
output.

Pipes only allow sequential read or write, but that is likely ok.
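A rough sketch of the idea, assuming a Unix system. The `run_with_fifo_input` helper is hypothetical, and `cat` stands in for the real executable, which would receive the FIFO’s pathname as its input-file argument:

```python
import os
import subprocess
import tempfile
import threading


def run_with_fifo_input(cmd, input_bytes):
    # Feed `input_bytes` to a program that expects an input *filename*,
    # without the data ever being written to disk. Unix-only (os.mkfifo).
    fifo_dir = tempfile.mkdtemp()
    fifo_path = os.path.join(fifo_dir, "input_fifo")
    os.mkfifo(fifo_path)

    def feed():
        # Opening the write end blocks until the program opens the read end.
        with open(fifo_path, "wb") as w:
            w.write(input_bytes)

    writer = threading.Thread(target=feed)
    writer.start()
    try:
        # The program sees an ordinary pathname and reads from it as usual.
        result = subprocess.run(cmd + [fifo_path], capture_output=True)
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(fifo_dir)
    return result.stdout


print(run_with_fifo_input(["cat"], b"hello from a pipe"))
```

The same pattern works in reverse for the output file: create a FIFO at the output pathname and read from it in a thread while the executable writes to it.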

It’s worth noting that os.mkfifo / named pipes are only available on Unix systems, so they may not work with the OP’s cross-platform / Windows requirement.


I would advise that you do not attempt to optimise until you have evidence that you have a problem.
It is a waste of your time to optimise for a problem that does not provably exist.


Thank you for the suggestion. I’ll investigate this for the Linux platform.

My profiling indicates that file I/O is taking about 15% of the total time, while the exe accounts for about 80%.

Is it worth your time to reduce that 15%?

I’m weighing that. I am not presently running the software in its ultimate incarnation, which will be Monte Carlo analyses around the exe, all wrapped within a numerical optimization routine. So, running a single case will lead to many thousands of file IO operations. If I’m running the numbers correctly, a batch may take about an hour. If the IO were essentially eliminated, users could save about 10 minutes per batch, which I think would be meaningful to them.

With buffering, most of those disk operations are eliminated. Is there any specific need for an in-memory disk?

Even I/O to RAM has overhead. Benchmark with a ram disk and see whether it reduces the run time. Make sure you do a significant number of runs to amortize the startup overhead.
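One way to do that comparison, sketched below. The `time_runs` helper is hypothetical, `true` stands in for the real executable, and taking the median discounts one-off startup costs:

```python
import statistics
import subprocess
import time


def time_runs(cmd, n=50):
    # Time n invocations of an external command and report the median,
    # which is less sensitive to one-off startup overhead than the mean.
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


# Compare the same command with files on disk vs. on a RAM-backed dir, e.g.:
# time_runs(["myexe", "/tmp/in.txt", "/tmp/out.txt"])
# time_runs(["myexe", "/dev/shm/in.txt", "/dev/shm/out.txt"])
print(f"median: {time_runs(['true'], n=5):.4f}s")
```

If the two medians are close, the OS page cache is already absorbing most of the I/O cost and a ram disk would not buy much.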


Can you please describe this approach in more detail?
Currently, for the input file, I open it, read it, change relevant values, write it back, and close it. For the (usually very large) output file, I open it, read the contents, close the file, and perform some analyses on the content. I do this many times.

Keep files open if possible (using _setmaxstdio or ulimit), and use buffering to avoid unnecessary I/O on disk.

The performance gain is significant.

Benchmark
import os
import random
import time

# Constants
loops = 10
filename_no_buffer = 'no_buffer_file.txt'
filename_with_buffer = 'with_buffer_file.txt'
data_size = 10 ** 5  # Number of repetitions of the 16-character chunk


def data_content():
    # Build one random 16-character string and repeat it data_size times.
    return data_size * ''.join(random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ') for _ in range(16))


# Create the files if they do not exist yet
for name in (filename_no_buffer, filename_with_buffer):
    if not os.path.isfile(name):
        open(name, 'x').close()

# Benchmark: reopen the file on every iteration (no persistent handle)
start_time_no_buffer = time.time()
for _ in range(loops):
    with open(filename_no_buffer, 'r+') as file:
        data = file.read()

        # use data here
        data = data_content()

        file.seek(0)
        file.write(data)

end_time_no_buffer = time.time()
time_no_buffer = end_time_no_buffer - start_time_no_buffer

# Benchmark: keep a single handle open with a large buffer
start_time_with_buffer = time.time()
persistent = open(filename_with_buffer, 'r+', buffering=50 * 1024 * 1024)
for _ in range(loops):
    persistent.seek(0)  # rewind first; otherwise read() returns '' after the first pass
    data = persistent.read()

    # use data here
    data = data_content()

    persistent.seek(0)
    persistent.write(data)

persistent.close()

end_time_with_buffer = time.time()
time_with_buffer = end_time_with_buffer - start_time_with_buffer

# Clean up files
# os.remove(filename_no_buffer)
# os.remove(filename_with_buffer)

print('time_no_buffer  :', time_no_buffer)
print('time_with_buffer:', time_with_buffer)

I understand. Thank you for the great explanation through the example!
