Adding atomicwrite to the stdlib

There were many suggestions in the os.workdir() thread I had started on python-ideas, and people could not agree on where it should live (e.g. os, os.path, pathlib, shutil, contextlib). My initial idea was to have it in the os module, next to os.chdir(), and with a different name.

FWIW, I think contextlib is as good as any module and it also
underlines that the symbol refers to a context manager rather than
a standard function - something which is not always apparent from
looking just at the function names or the documentation (e.g.
the built-in open() doesn’t even mention that the returned file
object can be used as a context manager).

For an earlier bikeshed on this topic: Issue 8604: Adding an atomic FS write API - Python tracker

The main reason I still believe this is a feature worthy of the standard library is that without it, even something as simple as “process A periodically writes a file, process B periodically reads it” is unreliable, since process B may try to access the file when process A is halfway through writing it.

By only offering in-place file modification by default, we’re setting new users up to run into a well-known problem where the patterns for solving it are far from obvious (e.g. why it matters that the temp file be created in the same directory as the target file).

At my current workplace, we ended up with a two-layer API:

  • An “atomic_write” context manager
  • A “set_file_contents” helper function

The first one is the non-obvious piece that I think is worthy of the stdlib, while the latter probably fails the “not every 3 line helper function needs to be in the stdlib” test.
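For concreteness, here’s a minimal sketch of what that two-layer API might look like. The names atomic_write and set_file_contents are the ones from the list above; everything else (signature, error handling) is an assumption for illustration, not our actual implementation:

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def atomic_write(path, mode="w", **open_kwargs):
    """Stage all writes in a temp file, then move it into place atomically.

    The temp file is created in the target's directory on purpose: the
    final os.replace() is only atomic if it never crosses a filesystem
    boundary.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, mode, **open_kwargs) as f:
            yield f
        os.replace(tmp_path, path)  # the single atomic step
    except BaseException:
        os.unlink(tmp_path)  # discard the partially written file
        raise


def set_file_contents(path, data):
    """The 'not every 3 line helper needs to be in the stdlib' layer."""
    with atomic_write(path) as f:
        f.write(data)
```

A production version would also have to think about file permissions (tempfile.mkstemp creates the file as 0600) and about whether to fsync before the replace - exactly the kind of non-obvious detail that argues for doing this once, properly, in the stdlib.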

(FWIW, I think “shutil.staged_write” would be fine as a name, emphasising the body of the context manager over the end of it. We’d still want to use the phrase “atomic write” somewhere in the documentation, though)

Regarding the boundaries between contextlib and shutil: they’re definitely blurry, but a rough rule of thumb is that contextlib shouldn’t affect anything other than the running process.

So redirecting standard streams and changing the current working directory are reasonable (albeit still debatable), but creating files is a definite “No, put it in shutil instead”.

Antoine Pitrou:

“I would never go and look in contextlib for OS-related utilities.”

But would you look in contextlib for a context manager?

Would you look in functools for a random function?
I don’t, and I don’t look into contextlib for random context managers (and there isn’t a classlib module where I can look for random classes, either).

Yes. I was coming here to say this. The individual writes are not necessarily atomic, but the whole file is. From the point of view of other processes, it goes from not existing / being there with some old content directly to being there with the new content, with nothing in between those two states. So, something like atomicfile (maybe with an underscore in the middle?) would feel right to me.

Despite using “atomic write” as the name for my own staged write implementations, this thread made me realise that there is a major problem with the name: the “atomic” property doesn’t hold in the presence of multiple writers. Truly atomic operations (think atomic increments and decrements in a CPU) include a mechanism to block other writers to ensure two different writers won’t use the same starting state for their modifications.

While “one writer, multiple readers” is a common underlying assumption in “atomic” write implementations, the key advantage I see to the suggested “staged_write” name is that it remains correct independent of the number of writers (and helps convey why having multiple writers would be a bad idea).

The feature could also theoretically be used as the basis for a future true “atomic_write” implementation in the multiprocessing module, although I personally doubt we’d ever add that (once you need multiple writers, it’s usually worth reaching for sqlite instead of a flat file, and file locking has enough nasty failure modes that it’s hard to create abstract solutions that don’t end up looking like some form of database).

I’m not sure what you mean by that, because multiple writers actually should be fine with such an API. This is why it works for e.g. the JIT caching example I gave: you can have multiple independent processes updating a given cache entry at the same time, and the end result will be correct.

It is also how bytecode caching is implemented, btw, and you can be sure that multiple writers being a problem would have resulted in hundreds of bug reports over the years.

The rename is the only atomic part in the whole implementation, which is why I believe that using the term transaction is more accurate.

In database transactions, the .commit() or .rollback() makes the changes appear for others (modulo transaction isolation settings and some other details, but let’s leave those aside).

Multiple writers will each get a different temporary file name and write to this name. And all will eventually rename this file to the final name.

Now, if all writers write the same content (as in the bytecode example Antoine gave), there’s no problem. The inode of the file will change, but not its contents.

If, however, they write different content, the final state of the file will be undefined, since it is not clear which of the writers will run the os.rename() last.

And the writers won’t notice on Unix, since os.rename() replaces the file, even if it exists.

On Windows, the situation is different: os.rename() won’t overwrite existing files.

There are situations where one or the other is better. On Unix, you can use hardlinks to get similar behavior to Windows, if you want an error raised for existing files.
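As a sketch of that last hardlink trick (POSIX only; write_exclusive is a made-up name for illustration):

```python
import os
import tempfile

def write_exclusive(path, data):
    """Stage data in a temp file, then hard-link it into place (POSIX only).

    os.link() raises FileExistsError if the destination already exists,
    giving Windows-style "refuse to overwrite" semantics on Unix.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.link(tmp_path, path)  # atomic; errors out instead of replacing
    finally:
        os.unlink(tmp_path)  # always drop the temp name; the new link survives
```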

This uses os.replace(), which does.

AFAIK only on POSIX, not on Windows.

No.

If dst exists and is a file, it will be replaced silently if the user has permission.

The documentation does not promise that os.replace is atomic on Windows, only on POSIX.

Rename the file or directory src to dst. If dst is a directory, OSError will be raised. If dst exists and is a file, it will be replaced silently if the user has permission. The operation may fail if src and dst are on different filesystems. If successful, the renaming will be an atomic operation (this is a POSIX requirement).

Ok, so you were making an unrelated point while replying to a message that did not talk about atomicity but about overwriting :wink:

It is a good point however. It is true that MoveFileEx does not make any promise about atomicity. It certainly cannot be atomic when crossing filesystems, which makes it important to remain on the same filesystem.

True, I was conflating atomic writes in general with a specific use case for them: staged updates, where the write is preceded by a content read.

For that usage to be safe, there can only be one process staging an update at any point in time.

If the source data for the write is independent of the previously written data, then yes, the atomicity holds for multiple writers.

I’m not sure if you’re proposing any change to the API here. My feeling is that the semantics being proposed are straightforward, and easily described as “open a new temporary file, do all writes off to the side, then put the new content in place in an atomic action on close”. Anything more complex is specialised enough to not need to be in the stdlib.

If you’re enumerating cases with more complexity than this basic model can handle, then that’s fine. If you’re proposing that the scope be extended to handle those cases, then I’m not quite clear on the details of what your concerns are - but my initial reaction is likely to be “no, let’s keep the scope as it is”.

There is a reason for placing this function in the tempfile module: the implementation requires tempfile, but tempfile imports shutil, so adding it to shutil would create an import loop.

It’s the former case, as I’m just explaining why I think the “atomic” term can be misleading, and hence why the “staged write” naming may be preferable: “atomic” CPU operations refer to atomic “read and update” instructions that can be safely performed by multiple writers without risking lost updates.

Staged file writes are different: if the new output depends on the previous output, then you need to limit yourself to a single writer to avoid potentially lost writes. Tolerating multiple writers is only easy if each new version of the file is generated independently from other sources.
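A sketch of that failure mode, reusing the hypothetical atomic_write from earlier in the thread - each os.replace() is atomic on its own, but the read-modify-write sequence as a whole is not:

```python
def increment(path):
    # Read-modify-write through a staged write: NOT safe with two writers.
    with open(path) as f:
        value = int(f.read())    # writers A and B can both read "0" here
    with atomic_write(path) as f:
        f.write(str(value + 1))  # both then write "1"; whichever replace
                                 # runs last wins, and one increment is lost
```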