Hi, I have an executable that I can’t modify that takes input and output filenames as arguments.
```
myexe input_file_path output_file_path
```
The executable does not accept input through stdin, and stdout is just standard messages from the executable, not interesting output.
As part of a Monte Carlo analysis, I run the executable many hundreds of times with slight changes to the input file, aggregating the output along the way.
Wanting to avoid disk I/O, I was hoping there was a cross-platform method of simulating the input and output files in memory.
I have seen mention of using pipes for this purpose, but I am not clear on whether any of the pipe-related concepts can be given the on-disk names the executable expects. I have also read about using /dev/fd/<proxyfile> to address this issue, but that won’t work on Windows.
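To make the /dev/fd idea concrete, here is my rough understanding in Python. This is only a sketch (Unix only, input side only), and `myexe` stands in for my real executable:

```python
# Sketch of the /dev/fd idea on Unix (input side only; "myexe" and the
# payload are placeholders). This won't work on Windows, as noted above.
import os
import subprocess
import threading

def run_with_dev_fd(input_bytes):
    r_fd, w_fd = os.pipe()

    # Feed the pipe from a thread so a full pipe buffer can't deadlock us.
    def feed():
        with os.fdopen(w_fd, 'wb') as w:
            w.write(input_bytes)

    t = threading.Thread(target=feed)
    t.start()
    subprocess.run(
        ['myexe', f'/dev/fd/{r_fd}', 'output_file_path'],
        pass_fds=(r_fd,),   # keep the read end open (and numbered) in the child
        check=True,
    )
    os.close(r_fd)
    t.join()
```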
Thank you for any ideas or suggestions you may have.
I don’t think this is something Python can do. For Windows, though, ImDisk is apparently recommended for creating in-memory file systems. (I haven’t used it myself.)
Thank you all for your input!
I am guessing that I/O may be an issue on some of the platforms where my application might be deployed. I was hoping that, if there were an easy-to-implement ‘in-memory’ method, I could at least do some benchmarking. Aside from the potential tangible benefit, I was also curious whether this was even possible.
If you had access to the code (and it was Python), it would be easier to pass memory around directly. Even if it isn’t Python, maybe it could be called via a .so/.dll or similar.
If you’re stuck with an external executable, a RAM disk is likely the closest you can get.
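For example, if the computation were exposed as a shared library instead of an exe, something like this ctypes sketch could pass data in memory. `libmodel` and `run_model` are made-up names; the real library and signature would differ:

```python
# Hypothetical sketch: calling the computation through a shared library
# with ctypes instead of spawning the executable.
import ctypes

lib = ctypes.CDLL('./libmodel.so')   # or a .dll on Windows
lib.run_model.argtypes = [ctypes.c_char_p, ctypes.c_size_t,
                          ctypes.c_char_p, ctypes.c_size_t]
lib.run_model.restype = ctypes.c_int

inp = b'...input file contents...'
out = ctypes.create_string_buffer(10 * 1024 * 1024)  # preallocated output buffer

rc = lib.run_model(inp, len(inp), out, len(out))
# out.raw now holds the output bytes; no files touched disk.
```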
There is such a thing as a “named pipe”. You use os.mkfifo to create one: https://docs.python.org/3/library/os.html#os.mkfifo
Then the programme can use its pathname where you want a file. You’ll of course have to write data into the write end of the pipe for the programme to receive, or the converse if you’re using the pipe for the output.
Pipes only allow sequential read or write, but that is likely ok.
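A sketch of the input side (the executable name and the data are stand-ins for yours):

```python
# Sketch: feeding the executable its input through a named pipe (Unix only).
import os
import subprocess
import threading

fifo = '/tmp/myexe_input.fifo'   # pathname handed to the executable
os.mkfifo(fifo)
try:
    def feed(data):
        # open() blocks until the executable opens the other end
        with open(fifo, 'wb') as w:
            w.write(data)

    t = threading.Thread(target=feed, args=(b'...input contents...',))
    t.start()
    subprocess.run(['myexe', fifo, 'output_file_path'], check=True)
    t.join()
finally:
    os.remove(fifo)
```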
It’s worth noting that os.mkfifo / named pipes are only available on Unix systems, so they may not work with the OP’s cross-platform / Windows requirement.
I would advise that you do not attempt to optimise until you have evidence that you have a problem.
It’s a waste of your time to optimise for a problem that does not provably exist.
I’m weighing that. I am not presently running the software in its ultimate incarnation, which will be Monte Carlo analyses around the exe, all wrapped within a numerical optimization routine. So, running a single case will lead to many thousands of file I/O operations. If I’m running the numbers correctly, a batch may take about an hour. If the I/O were essentially eliminated, users could save about 10 minutes per batch, which I think would be meaningful to them.
Even I/O to RAM will have overhead. Benchmark with a RAM disk and see whether it reduces the run time. Make sure you do a significant number of runs to remove startup overhead.
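Something along these lines would do for a first comparison. /dev/shm is a tmpfs (RAM-backed) mount on most Linux systems; on Windows you would point the RAM directory at an ImDisk drive instead. `myexe` and the input format are placeholders:

```python
# Sketch: run the same workload from a RAM-backed directory and from a
# disk directory, and compare wall-clock time.
import subprocess
import time
from pathlib import Path

def run_batch(workdir, runs=100):
    workdir.mkdir(parents=True, exist_ok=True)
    inp, out = workdir / 'input.txt', workdir / 'output.txt'
    start = time.perf_counter()
    for i in range(runs):
        inp.write_text(f'parameter = {i}\n')             # perturb the input
        subprocess.run(['myexe', str(inp), str(out)], check=True)
        _ = out.read_text()                              # aggregate results here
    return time.perf_counter() - start

print('disk:', run_batch(Path('./bench_disk')))
print('ram :', run_batch(Path('/dev/shm/bench_ram')))    # tmpfs on most Linux
```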
Can you please describe this approach in more detail?
Currently, for the input file, I open it, read it, change relevant values, write it back, and close it. For the (usually very large) output file, I open it, read the contents, close the file, and perform some analyses on the content. I do this many times.
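In code terms, each iteration looks roughly like this (the file names, template, and parameter substitution are placeholders):

```python
# Sketch of one iteration as described above ("myexe", the file names,
# and the {PARAM} substitution are illustrative only).
import subprocess

with open('input_template.txt') as f:   # hypothetical unmodified template
    template = f.read()

def one_case(value):
    # Write the perturbed input file
    with open('input_file_path', 'w') as f:
        f.write(template.replace('{PARAM}', str(value)))
    # Run the executable, then read back the (large) output for analysis
    subprocess.run(['myexe', 'input_file_path', 'output_file_path'], check=True)
    with open('output_file_path') as f:
        return f.read()
```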
Keep files open where possible (raising the open-file limit with _setmaxstdio on Windows or ulimit on Unix if necessary), and use buffering to avoid unnecessary I/O on disk.
The performance gain is significant.
Benchmark:

```python
import time
import os
import random

# Constants
loops = 10
filename_no_buffer = 'no_buffer_file.txt'
filename_with_buffer = 'with_buffer_file.txt'
data_size = 10 ** 5  # number of repetitions of the 16-character chunk

def data_content():
    return data_size * ''.join(random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ') for _ in range(16))

# Create the files if they do not exist yet
if not os.path.isfile(filename_no_buffer):
    f = open(filename_no_buffer, "x")
    f.close()
if not os.path.isfile(filename_with_buffer):
    f = open(filename_with_buffer, "x")
    f.close()

# Benchmark: reopen the file on every loop (no persistent buffer)
start_time_no_buffer = time.time()
for _ in range(loops):
    with open(filename_no_buffer, 'r+') as file:
        data = file.read()
        # use data here
        data = data_content()
        file.seek(0)
        file.write(data)
end_time_no_buffer = time.time()
time_no_buffer = end_time_no_buffer - start_time_no_buffer

# Benchmark: keep one file object open with a large buffer
start_time_with_buffer = time.time()
persistent = open(filename_with_buffer, 'r+', buffering=50 * 1024 * 1024)
for _ in range(loops):
    persistent.seek(0)   # rewind before reading, as reopening did above
    data = persistent.read()
    # use data here
    data = data_content()
    persistent.seek(0)
    persistent.write(data)
persistent.close()
end_time_with_buffer = time.time()
time_with_buffer = end_time_with_buffer - start_time_with_buffer

# Clean up files
# os.remove(filename_no_buffer)
# os.remove(filename_with_buffer)

print('time_no_buffer :', time_no_buffer)
print('time_with_buffer:', time_with_buffer)
```