I’m parsing CSV files and writing them to HDF5 using multiple processors and have observed a curious behaviour. The OS seems to let sleeping processes sleep longer.
All writes to HDF5 are concurrent thanks to mplite, such as this:
@timeout
def mp_write_column(self, values, column_key=None): # column task.
"""
multi processing helper for writing column data.
"""
with h5py.File(self.path, READWRITE) as h5:
new_page = Page(values)
if not column_key:
column_key = self.new_id('/column')
dset = h5.create_dataset(name=f'/column/{column_key}', dtype=h5py.Empty('f'))
dset.attrs['pages'] = json.dumps([new_page.group])
dset.attrs['length'] = len(new_page)
return column_key
… are wrapped with this friendly decorator:
TIMEOUT = 10000 # maximum seconds tolerance waiting for OS to release hdf5 write lock
def timeout(func):
@functools.wraps(func)
def wrapper(*args,**kwargs):
t = 0
start = time.time()
while time.time() - start < TIMEOUT:
try:
begin = time.process_time()
result = func(*args,**kwargs)
end = time.process_time()
print(f"waited {t} msec to exec. {func.__name__} in {round(1000*(end-begin))} msec")
return result
except OSError:
dt = random.randint(0,500)
t+=dt
time.sleep(dt/1000) # <--- emphasis here!
raise OSError(f"couldn't write to disk (slept {t} msec")
return wrapper
When I process 7.2Gb csv test file I see this log output from the decorator:
(notice how there is always some process that is waiting far longer than others).
waited 0 msec to exec. mp_write_column in 1109 msec
waited 0 msec to exec. mp_write_column in 891 msec
waited 0 msec to exec. mp_write_column in 750 msec
waited 0 msec to exec. mp_write_column in 750 msec
waited 0 msec to exec. mp_write_column in 812 msec
waited 0 msec to exec. mp_write_table in 16 msec
waited 4232 msec to exec. mp_write_column in 797 msec # <---- emphasis!
waited 0 msec to exec. mp_write_column in 984 msec
waited 5076 msec to exec. mp_write_column in 859 msec # <---- emphasis!
waited 0 msec to exec. mp_write_column in 953 msec
waited 0 msec to exec. mp_write_column in 844 msec
waited 0 msec to exec. mp_write_column in 828 msec
waited 0 msec to exec. mp_write_column in 953 msec
waited 10122 msec to exec. mp_write_column in 781 msec # <---- emphasis!
waited 0 msec to exec. mp_write_column in 875 msec
waited 0 msec to exec. mp_write_column in 766 msec
waited 0 msec to exec. mp_write_column in 750 msec
waited 0 msec to exec. mp_write_column in 828 msec
waited 0 msec to exec. mp_write_table in 0 msec
waited 13716 msec to exec. mp_write_column in 844 msec # <---- emphasis!
waited 0 msec to exec. mp_write_column in 1188 msec
waited 0 msec to exec. mp_write_column in 891 msec
waited 0 msec to exec. mp_write_column in 859 msec
waited 0 msec to exec. mp_write_column in 922 msec
waited 0 msec to exec. mp_write_table in 0 msec
What is going on?
I am guessing that the OS looks at a process that’s sleeping and decides: “Oh - you’re a sleeping proc, you can’t be important. I’ll look at you, once all others have been dealt with.”
With this assumption I deduce the this causes the OS not to touch the sleeping process until IO is busy, whereby the proc will get OSError can't obtain filesystem lock
and it will enter sleep mode again.
Hence I am guessing the “complex” dynamics of a sleeping process is that it will sleep even longer.
Right? / Wrong? And can/should I do something about it?
Here is the whole log with os.getpid
:
15824 waited 0 msec to exec. mp_write_column in 1031 msec
15824 waited 0 msec to exec. mp_write_column in 719 msec
15824 waited 0 msec to exec. mp_write_column in 609 msec
15824 waited 0 msec to exec. mp_write_column in 609 msec
15824 waited 0 msec to exec. mp_write_column in 688 msec
15824 waited 0 msec to exec. mp_write_table in 0 msec
14940 waited 3190 msec to exec. mp_write_column in 781 msec
14940 waited 0 msec to exec. mp_write_column in 969 msec
14940 waited 0 msec to exec. mp_write_column in 1016 msec
14940 waited 0 msec to exec. mp_write_column in 828 msec
14940 waited 0 msec to exec. mp_write_column in 953 msec
14940 waited 0 msec to exec. mp_write_table in 0 msec
17172 waited 8452 msec to exec. mp_write_column in 906 msec
17172 waited 0 msec to exec. mp_write_column in 1156 msec
17172 waited 0 msec to exec. mp_write_column in 1000 msec
17172 waited 0 msec to exec. mp_write_column in 1062 msec
17172 waited 0 msec to exec. mp_write_column in 922 msec
17172 waited 0 msec to exec. mp_write_table in 0 msec
15824 waited 2332 msec to exec. mp_write_column in 891 msec
15824 waited 0 msec to exec. mp_write_column in 1094 msec
15824 waited 0 msec to exec. mp_write_column in 953 msec
15824 waited 0 msec to exec. mp_write_column in 875 msec
15824 waited 0 msec to exec. mp_write_column in 1125 msec
15824 waited 0 msec to exec. mp_write_table in 0 msec
13196 waited 18320 msec to exec. mp_write_column in 922 msec
13196 waited 0 msec to exec. mp_write_column in 1141 msec
13196 waited 0 msec to exec. mp_write_column in 984 msec
13196 waited 0 msec to exec. mp_write_column in 828 msec
13196 waited 0 msec to exec. mp_write_column in 812 msec
13196 waited 0 msec to exec. mp_write_table in 0 msec
17456 waited 22267 msec to exec. mp_write_column in 1297 msec
17456 waited 0 msec to exec. mp_write_column in 1141 msec
17456 waited 0 msec to exec. mp_write_column in 938 msec
17456 waited 0 msec to exec. mp_write_column in 797 msec
17456 waited 0 msec to exec. mp_write_column in 859 msec
17456 waited 0 msec to exec. mp_write_table in 0 msec
18244 waited 27256 msec to exec. mp_write_column in 906 msec
18244 waited 0 msec to exec. mp_write_column in 1578 msec
18244 waited 0 msec to exec. mp_write_column in 906 msec
18244 waited 0 msec to exec. mp_write_column in 781 msec
18244 waited 0 msec to exec. mp_write_column in 906 msec
18244 waited 0 msec to exec. mp_write_table in 0 msec
17172 waited 11936 msec to exec. mp_write_column in 906 msec
17172 waited 0 msec to exec. mp_write_column in 1219 msec
17172 waited 0 msec to exec. mp_write_column in 1047 msec
17172 waited 0 msec to exec. mp_write_column in 797 msec
17172 waited 0 msec to exec. mp_write_column in 812 msec
17172 waited 0 msec to exec. mp_write_table in 0 msec
14940 waited 22050 msec to exec. mp_write_column in 875 msec
14940 waited 0 msec to exec. mp_write_column in 1141 msec
14940 waited 0 msec to exec. mp_write_column in 984 msec
14940 waited 0 msec to exec. mp_write_column in 906 msec
14940 waited 0 msec to exec. mp_write_column in 875 msec
14940 waited 0 msec to exec. mp_write_table in 0 msec
18244 waited 2616 msec to exec. mp_write_column in 906 msec
18244 waited 0 msec to exec. mp_write_column in 1062 msec
18244 waited 0 msec to exec. mp_write_column in 906 msec
18244 waited 0 msec to exec. mp_write_column in 891 msec
18244 waited 0 msec to exec. mp_write_column in 1109 msec
18244 waited 0 msec to exec. mp_write_table in 0 msec
15824 waited 21785 msec to exec. mp_write_column in 969 msec
15824 waited 0 msec to exec. mp_write_column in 1109 msec
15824 waited 0 msec to exec. mp_write_column in 969 msec
15824 waited 0 msec to exec. mp_write_column in 750 msec
15824 waited 0 msec to exec. mp_write_column in 1016 msec
15824 waited 0 msec to exec. mp_write_table in 0 msec
17456 waited 16903 msec to exec. mp_write_column in 1047 msec
17456 waited 0 msec to exec. mp_write_column in 1094 msec
17456 waited 0 msec to exec. mp_write_column in 1172 msec
17456 waited 0 msec to exec. mp_write_column in 812 msec
17456 waited 0 msec to exec. mp_write_column in 859 msec
17456 waited 0 msec to exec. mp_write_table in 0 msec
13696 waited 57751 msec to exec. mp_write_column in 1078 msec
13696 waited 0 msec to exec. mp_write_column in 1312 msec
13696 waited 0 msec to exec. mp_write_column in 906 msec
13696 waited 0 msec to exec. mp_write_column in 766 msec
13696 waited 0 msec to exec. mp_write_column in 859 msec
13696 waited 0 msec to exec. mp_write_table in 0 msec
17172 waited 16995 msec to exec. mp_write_column in 906 msec
17172 waited 0 msec to exec. mp_write_column in 1203 msec
17172 waited 0 msec to exec. mp_write_column in 1094 msec
17172 waited 0 msec to exec. mp_write_column in 828 msec
17172 waited 0 msec to exec. mp_write_column in 875 msec
17172 waited 0 msec to exec. mp_write_table in 16 msec
14940 waited 16833 msec to exec. mp_write_column in 891 msec
14940 waited 0 msec to exec. mp_write_column in 1109 msec
14940 waited 0 msec to exec. mp_write_column in 1094 msec
14940 waited 0 msec to exec. mp_write_column in 812 msec
14940 waited 0 msec to exec. mp_write_column in 812 msec
14940 waited 0 msec to exec. mp_write_table in 0 msec
13196 waited 41483 msec to exec. mp_write_column in 984 msec
13196 waited 0 msec to exec. mp_write_column in 1203 msec
13196 waited 0 msec to exec. mp_write_column in 984 msec
13196 waited 0 msec to exec. mp_write_column in 1062 msec
13196 waited 0 msec to exec. mp_write_column in 906 msec
13196 waited 0 msec to exec. mp_write_table in 16 msec
15824 waited 16946 msec to exec. mp_write_column in 1031 msec
15824 waited 0 msec to exec. mp_write_column in 1141 msec
15824 waited 0 msec to exec. mp_write_column in 906 msec
15824 waited 0 msec to exec. mp_write_column in 828 msec
15824 waited 0 msec to exec. mp_write_column in 1172 msec
15824 waited 0 msec to exec. mp_write_table in 16 msec
18244 waited 26612 msec to exec. mp_write_column in 1031 msec
18244 waited 0 msec to exec. mp_write_column in 1219 msec
18244 waited 0 msec to exec. mp_write_column in 938 msec
18244 waited 0 msec to exec. mp_write_column in 844 msec
18244 waited 0 msec to exec. mp_write_column in 1094 msec
18244 waited 0 msec to exec. mp_write_table in 16 msec
17456 waited 22633 msec to exec. mp_write_column in 1016 msec
17456 waited 0 msec to exec. mp_write_column in 1031 msec
17456 waited 0 msec to exec. mp_write_column in 875 msec
17456 waited 0 msec to exec. mp_write_column in 781 msec
17456 waited 0 msec to exec. mp_write_column in 844 msec
17456 waited 0 msec to exec. mp_write_table in 0 msec
13196 waited 7084 msec to exec. mp_write_column in 1234 msec
13196 waited 0 msec to exec. mp_write_column in 1172 msec
13196 waited 0 msec to exec. mp_write_column in 906 msec
13196 waited 0 msec to exec. mp_write_column in 859 msec
13196 waited 0 msec to exec. mp_write_column in 906 msec
13196 waited 0 msec to exec. mp_write_table in 0 msec
15824 waited 7851 msec to exec. mp_write_column in 1047 msec
15824 waited 0 msec to exec. mp_write_column in 1359 msec
15824 waited 0 msec to exec. mp_write_column in 953 msec
15824 waited 0 msec to exec. mp_write_column in 797 msec
15824 waited 0 msec to exec. mp_write_column in 938 msec
15824 waited 0 msec to exec. mp_write_table in 0 msec
14940 waited 21827 msec to exec. mp_write_column in 953 msec
14940 waited 0 msec to exec. mp_write_column in 1250 msec
14940 waited 0 msec to exec. mp_write_column in 1125 msec
14940 waited 0 msec to exec. mp_write_column in 812 msec
14940 waited 0 msec to exec. mp_write_column in 812 msec
14940 waited 0 msec to exec. mp_write_table in 16 msec
13196 waited 1602 msec to exec. mp_write_column in 906 msec
13196 waited 0 msec to exec. mp_write_column in 1109 msec
13196 waited 0 msec to exec. mp_write_column in 1047 msec
13196 waited 0 msec to exec. mp_write_column in 969 msec
13196 waited 0 msec to exec. mp_write_column in 922 msec
13196 waited 0 msec to exec. mp_write_table in 0 msec
15824 waited 2030 msec to exec. mp_write_column in 1359 msec
15824 waited 0 msec to exec. mp_write_column in 1250 msec
15824 waited 0 msec to exec. mp_write_column in 1109 msec
15824 waited 0 msec to exec. mp_write_column in 1109 msec
15824 waited 0 msec to exec. mp_write_column in 938 msec
15824 waited 0 msec to exec. mp_write_table in 16 msec
18244 waited 23014 msec to exec. mp_write_column in 1062 msec
18244 waited 0 msec to exec. mp_write_column in 1219 msec
18244 waited 0 msec to exec. mp_write_column in 984 msec
18244 waited 0 msec to exec. mp_write_column in 1266 msec
18244 waited 0 msec to exec. mp_write_column in 1109 msec
18244 waited 0 msec to exec. mp_write_table in 0 msec
13196 waited 2857 msec to exec. mp_write_column in 1125 msec
13196 waited 0 msec to exec. mp_write_column in 1297 msec
13196 waited 0 msec to exec. mp_write_column in 1047 msec
13196 waited 0 msec to exec. mp_write_column in 1125 msec
13196 waited 0 msec to exec. mp_write_column in 938 msec
13196 waited 0 msec to exec. mp_write_table in 0 msec
15824 waited 2483 msec to exec. mp_write_column in 938 msec
15824 waited 0 msec to exec. mp_write_column in 1203 msec
15824 waited 0 msec to exec. mp_write_column in 906 msec
15824 waited 0 msec to exec. mp_write_column in 891 msec
15824 waited 0 msec to exec. mp_write_column in 1172 msec
15824 waited 0 msec to exec. mp_write_table in 16 msec
13696 waited 64654 msec to exec. mp_write_column in 969 msec
13696 waited 0 msec to exec. mp_write_column in 1219 msec
13696 waited 0 msec to exec. mp_write_column in 1000 msec
13696 waited 0 msec to exec. mp_write_column in 812 msec
13696 waited 0 msec to exec. mp_write_column in 1062 msec
13696 waited 0 msec to exec. mp_write_table in 0 msec
17456 waited 39378 msec to exec. mp_write_column in 1109 msec
17456 waited 0 msec to exec. mp_write_column in 1203 msec
17456 waited 0 msec to exec. mp_write_column in 1047 msec
17456 waited 0 msec to exec. mp_write_column in 859 msec
17456 waited 0 msec to exec. mp_write_column in 875 msec
17456 waited 0 msec to exec. mp_write_table in 0 msec
18244 waited 12399 msec to exec. mp_write_column in 1328 msec
18244 waited 0 msec to exec. mp_write_column in 1188 msec
18244 waited 0 msec to exec. mp_write_column in 1000 msec
18244 waited 0 msec to exec. mp_write_column in 875 msec
18244 waited 0 msec to exec. mp_write_column in 891 msec
18244 waited 0 msec to exec. mp_write_table in 0 msec
14940 waited 33549 msec to exec. mp_write_column in 969 msec
14940 waited 0 msec to exec. mp_write_column in 1328 msec
14940 waited 0 msec to exec. mp_write_column in 891 msec
14940 waited 0 msec to exec. mp_write_column in 828 msec
14940 waited 0 msec to exec. mp_write_column in 922 msec
14940 waited 0 msec to exec. mp_write_table in 16 msec
13696 waited 7038 msec to exec. mp_write_column in 953 msec
13696 waited 0 msec to exec. mp_write_column in 1234 msec
17456 waited 4320 msec to exec. mp_write_column in 1156 msec
17456 waited 0 msec to exec. mp_write_column in 922 msec
17456 waited 0 msec to exec. mp_write_column in 797 msec
13696 waited 2805 msec to exec. mp_write_column in 750 msec
13696 waited 0 msec to exec. mp_write_column in 812 msec
13696 waited 0 msec to exec. mp_write_column in 844 msec
13696 waited 0 msec to exec. mp_write_table in 0 msec
17456 waited 2365 msec to exec. mp_write_column in 1109 msec
17456 waited 0 msec to exec. mp_write_column in 969 msec
17456 waited 0 msec to exec. mp_write_table in 16 msec
14940 waited 2435 msec to exec. mp_write_column in 922 msec
14940 waited 0 msec to exec. mp_write_column in 1094 msec
14940 waited 0 msec to exec. mp_write_column in 969 msec
14940 waited 0 msec to exec. mp_write_column in 1062 msec
14940 waited 0 msec to exec. mp_write_column in 1250 msec
14940 waited 0 msec to exec. mp_write_table in 0 msec
17172 waited 95376 msec to exec. mp_write_column in 1156 msec
17172 waited 0 msec to exec. mp_write_column in 1266 msec
17172 waited 0 msec to exec. mp_write_column in 953 msec
17172 waited 0 msec to exec. mp_write_column in 875 msec
17172 waited 0 msec to exec. mp_write_column in 1031 msec
17172 waited 0 msec to exec. mp_write_table in 0 msec
18244 waited 17090 msec to exec. mp_write_column in 734 msec
18244 waited 0 msec to exec. mp_write_column in 734 msec
18244 waited 0 msec to exec. mp_write_column in 609 msec
18244 waited 0 msec to exec. mp_write_column in 609 msec
18244 waited 0 msec to exec. mp_write_column in 672 msec
18244 waited 0 msec to exec. mp_write_table in 0 msec
17456 waited 5570 msec to exec. mp_write_column in 656 msec
17456 waited 0 msec to exec. mp_write_column in 906 msec
17456 waited 0 msec to exec. mp_write_column in 719 msec
17456 waited 0 msec to exec. mp_write_column in 594 msec
17456 waited 0 msec to exec. mp_write_column in 672 msec
17456 waited 0 msec to exec. mp_write_table in 0 msec
13196 waited 44126 msec to exec. mp_write_column in 672 msec
13196 waited 0 msec to exec. mp_write_column in 750 msec
13196 waited 0 msec to exec. mp_write_column in 625 msec
13196 waited 0 msec to exec. mp_write_column in 625 msec
13196 waited 0 msec to exec. mp_write_column in 797 msec
13196 waited 0 msec to exec. mp_write_table in 0 msec
15824 waited 42932 msec to exec. mp_write_column in 812 msec
15824 waited 0 msec to exec. mp_write_column in 797 msec
15824 waited 0 msec to exec. mp_write_column in 609 msec
15824 waited 0 msec to exec. mp_write_column in 625 msec
15824 waited 0 msec to exec. mp_write_column in 719 msec
15824 waited 0 msec to exec. mp_write_table in 0 msec
14940 waited 13789 msec to exec. mp_write_column in 422 msec
14940 waited 0 msec to exec. mp_write_column in 484 msec
14940 waited 0 msec to exec. mp_write_column in 422 msec
14940 waited 0 msec to exec. mp_write_column in 438 msec
14940 waited 0 msec to exec. mp_write_column in 656 msec
14940 waited 0 msec to exec. mp_write_table in 0 msec
13696 waited 21334 msec to exec. mp_write_column in 656 msec
13696 waited 0 msec to exec. mp_write_column in 750 msec
13696 waited 0 msec to exec. mp_write_column in 688 msec
13696 waited 0 msec to exec. mp_write_column in 625 msec
13696 waited 0 msec to exec. mp_write_column in 656 msec
13696 waited 0 msec to exec. mp_write_table in 0 msec
import took 365.2052 secs.