Adding atomicwrite in stdlib

There were many suggestions in the os.workdir() thread I had started on python-ideas, and people could not agree on a home for it (e.g. os, os.path, pathlib, shutil, contextlib). My initial idea was to have it in the os module, next to os.chdir() and with a different name.

FWIW, I think contextlib is as good as any module, and it also underlines that the symbol refers to a context manager rather than a standard function - something which is not always apparent from looking just at the function names or the documentation (e.g. the built-in open() doesn’t even mention that the returned file object can be used as a context manager).

1 Like

For an earlier bikeshed on this topic: Issue 8604: Adding an atomic FS write API - Python tracker

The main reason I still believe this is a feature worthy of the standard library is that without it, even something as simple as “process A periodically writes a file, process B periodically reads it” is unreliable, since process B may try to access the file when process A is halfway through writing it.

By only offering in-place file modification by default, we’re setting new users up to run into a well-known problem where the patterns for solving it are far from obvious (e.g. why it matters that the temp file be created in the same directory as the target file).

At my current workplace, we ended up with a two layer API:

  • An “atomic_write” context manager
  • A “set_file_contents” helper function

The first one is the non-obvious piece that I think is worthy of the stdlib, while the latter probably fails the “not every 3 line helper function needs to be in the stdlib” test.
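
A minimal sketch of what those two layers might look like, assuming the same-directory temp file plus os.replace() approach discussed in this thread (the names atomic_write and set_file_contents mirror the post above; the details are illustrative, not the actual workplace code):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_write(path, mode="w", **kwargs):
    """Write to a temporary file next to *path*, then atomically replace it.

    The temporary file is created in the same directory as the target so the
    final os.replace() is a rename within a single filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, mode, **kwargs) as f:
            yield f
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # readers see either the old file or the new one
    except BaseException:
        os.unlink(tmp_path)  # an error leaves the target untouched
        raise

def set_file_contents(path, data, **kwargs):
    """The three-line convenience helper layered on top of the context manager."""
    with atomic_write(path, **kwargs) as f:
        f.write(data)
```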

(FWIW, I think “shutil.staged_write” would be fine as a name, emphasising the body of the context manager over the end of it. We’d still want to use the phrase “atomic write” somewhere in the documentation, though.)

2 Likes

Regarding the boundaries between contextlib and shutil: they’re definitely blurry, but a rough rule of thumb is that contextlib definitely shouldn’t affect anything other than the running process.

So redirecting standard streams and changing the current working directory are reasonable (albeit still debatable), but creating files is a definite “No, put it in shutil instead”.

1 Like

Antoine Pitrou:

“I would never go and look in contextlib for OS-related utilities.”

But would you look in contextlib for a context manager?

1 Like

Would you look in functools for a random function?
I don’t, and I don’t look into contextlib for random context managers (and there isn’t a classlib module where I can look for random classes, either).

4 Likes

Yes. I was coming here to say this. The individual writes are not necessarily atomic, but the replacement of the whole file is. From the point of view of other processes, it goes from not existing / being there with some old content directly to being there with the new content, with nothing in between those two states. So, something like atomicfile (maybe with an underscore in the middle?) would feel right to me.

Despite using “atomic write” as the name for my own staged write implementations, this thread made me realise that there is a major problem with the name: the “atomic” property doesn’t hold in the presence of multiple writers. Truly atomic operations (think atomic increments and decrements in a CPU) include a mechanism to block other writers to ensure two different writers won’t use the same starting state for their modifications.

While “one writer, multiple readers” is a common underlying assumption in “atomic” write implementations, the key advantage I see to the suggested “staged_write” name is that the name is correct independent of the number of writers (and helps convey why having multiple writers would be a bad idea).

The feature could also theoretically be used as the basis for a future true “atomic_write” implementation in the multiprocessing module, although I personally doubt we’d ever add that (once you need multiple writers, it’s usually worth reaching for sqlite instead of a flat file, and file locking has enough nasty failure modes that it’s hard to create abstract solutions that don’t end up looking like some form of database).

I’m not sure what you mean by that, because multiple writers actually should be fine with such an API. This is why it works for e.g. the JIT caching example I gave: you can have multiple independent processes updating a given cache entry at the same time, and the end result will be correct.

It is also how bytecode caching is implemented, btw, and you can be sure that multiple writers being a problem would have resulted in hundreds of bug reports over the years.

The rename is the only atomic part in the whole implementation, which is why I believe that using the term transaction is more accurate.

In database transactions, the .commit() or .rollback() makes the changes appear for others (modulo transaction isolation settings and some other details, but let’s leave those aside).

Multiple writers will each get a different temporary file name and write to this name. And all will eventually rename this file to the final name.

Now, if all writers write the same content (as in the bytecode example Antoine gave), there’s no problem. The inode of the file will change, but not its contents.

If, however, they write different content, the final state of the file will be undefined, since it is not clear which of the writers will run the os.rename() last.

And the writers won’t notice on Unix, since os.rename() replaces the file, even if it exists.

On Windows, the situation is different: os.rename() won’t overwrite existing files.

There are situations where one or the other is better. On Unix, you can use hardlinks to get similar behavior to Windows, if you want an error raised for existing files.
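
A rough illustration of that hardlink approach, assuming POSIX semantics (the helper name is made up for this example):

```python
import os

def publish_unless_exists(tmp_path, final_path):
    """Give the written temp file its final name, but fail if that name is taken.

    os.link() refuses to overwrite an existing target, so on Unix this gives
    the Windows-style "error if the file already exists" behaviour instead of
    the silent overwrite performed by os.rename()/os.replace().
    """
    try:
        os.link(tmp_path, final_path)  # raises FileExistsError if final_path exists
    finally:
        os.unlink(tmp_path)  # drop the temporary name whether or not the link succeeded
```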

1 Like

This uses os.replace(), which does.

AFAIK only on POSIX, not on Windows.

No.

If dst exists and is a file, it will be replaced silently if the user has permission.

The documentation does not promise that os.replace is atomic on Windows, only on POSIX.

Rename the file or directory src to dst. If dst is a directory, OSError will be raised. If dst exists and is a file, it will be replaced silently if the user has permission. The operation may fail if src and dst are on different filesystems. If successful, the renaming will be an atomic operation (this is a POSIX requirement).

Ok, so you were making an unrelated point while replying to a message that did not talk about atomicity but about overwriting :wink:

It is a good point, however. It is true that MoveFileEx does not make any promise about atomicity, and it certainly cannot be atomic when crossing filesystems, which makes it important to remain on the same filesystem.
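
A small sketch of how one might check that assumption up front (a POSIX-oriented illustration; comparing st_dev is the usual idiom for “same filesystem”, and the helper name is hypothetical):

```python
import os

def same_filesystem(path_a, path_b):
    # Identical device numbers mean a rename between the two paths stays on
    # one filesystem, so os.replace() can remain a single rename operation
    # rather than failing when crossing filesystem boundaries.
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

In the staged-write case the comparison would be between the directory the temporary file is created in and the directory of the target file, which is exactly why creating the temp file next to the target sidesteps the issue.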

True, I was conflating atomic writes in general with a specific use case for them: staged updates, where the write is preceded by a content read.

For that usage to be safe, there can only be one process staging an update at any point in time.

If the source data for the write is independent of the previously written data, then yes, the atomicity holds for multiple writers.
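
To make that concrete, here is a toy staged update building on the set_file_contents sketch from earlier (hypothetical code, not from the thread): with two concurrent writers, both can read the same starting value and one update is silently lost, even though each individual replace is atomic.

```python
def bump_counter(path):
    # Read the previous state...
    try:
        with open(path, "r", encoding="utf-8") as f:
            current = int(f.read())
    except FileNotFoundError:
        current = 0
    # ...then publish the derived state.  Nothing stops a second writer from
    # having read the same old value in the meantime, so this read-modify-write
    # pattern needs a single writer (or real locking) to be safe.
    set_file_contents(path, str(current + 1), encoding="utf-8")
```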

1 Like

I’m not sure if you’re proposing any change to the API here. My feeling is that the semantics being proposed are straightforward, and easily described as “open a new temporary file, do all writes off to the side, then put the new content in place in an atomic action on close”. Anything more complex is specialised enough to not need to be in the stdlib.

If you’re enumerating cases where there might be more complexity that this basic model might not be able to handle, then that’s fine. If you’re proposing that the scope be extended to handle those cases, then I’m not quite clear on the details of what your concerns are - but my initial reaction is likely to be “no, let’s keep the scope as it is”.
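
From the caller’s point of view, those semantics would look something like this (reusing the atomic_write sketch from earlier in the thread; the final name and signature are of course still up for debate):

```python
import json

settings = {"theme": "dark", "autosave": True}

# All writes go to a temporary file off to the side; a successful exit from
# the block publishes the new content in one step, while an exception leaves
# any existing config.json untouched.
with atomic_write("config.json", "w", encoding="utf-8") as f:
    json.dump(settings, f)
```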

There is a reason for placing this function in the tempfile module: the implementation requires tempfile, but tempfile imports shutil, so adding it to shutil would create an import loop.

4 Likes

It’s the former case, as I’m just explaining why I think the “atomic” term can be misleading, and hence why the “staged write” naming may be preferable: “atomic” CPU operations refer to atomic “read and update” instructions that can be safely performed by multiple writers without risking lost updates.

Staged file writes are different: if the new output depends on the previous output, then you need to limit yourself to a single writer to avoid potentially lost writes. You can only easily tolerate multiple writers if each new version of the file is generated independently from other sources.