There were many suggestions in the os.workdir() thread I had started on
python-ideas, and people could not agree (e.g. os, os.path, pathlib,
shutil, contextlib). My initial idea was to have it in the os module,
next to os.chdir() and under a different name.
FWIW, I think contextlib is as good as any module and it also
underlines that the symbol refers to a context manager rather than
a standard function - something which is not always apparent from
looking just at the function names or the documentation (e.g.
the built-in open() doesn't even mention that the returned file
object can be used as a context manager).
The main reason I still believe this is a feature worthy of the standard library is that without it, even something as simple as "process A periodically writes a file, process B periodically reads it" is unreliable, since process B may try to access the file when process A is halfway through writing it.
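A minimal sketch of that failure mode (the interleaving is simulated sequentially in one process, and all names here are hypothetical): a plain `open(path, "w")` truncates the file immediately, so a reader that arrives mid-write sees partial content.

```python
import os
import tempfile

# Set up a file with known "old" content.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("old content")

# "Process A" starts rewriting the file in place...
writer = open(path, "w")   # truncates the file immediately
writer.write("new ")       # only part of the new content so far
writer.flush()

# ...and "process B" reads at exactly the wrong moment:
with open(path) as f:
    seen = f.read()
print(seen)  # "new " -- neither the old content nor the full new one

writer.write("content")
writer.close()
```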
By only offering in-place file modification by default, we're setting new users up to run into a well-known problem where the patterns for solving it are far from obvious (e.g. why it matters that the temp file be created in the same directory as the target file).
At my current workplace, we ended up with a two layer API:
An "atomic_write" context manager
A "set_file_contents" helper function
The first one is the non-obvious piece that I think is worthy of the stdlib, while the latter probably fails the "not every 3-line helper function needs to be in the stdlib" test.
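A sketch of what that two-layer API could look like; the names `atomic_write` and `set_file_contents` come from the post, but this implementation is illustrative rather than the poster's actual code:

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def atomic_write(path, mode="w", encoding=None):
    """Write to a temp file in the *same directory* as ``path`` and
    rename it into place on successful exit.

    Creating the temp file next to the target is the non-obvious part:
    os.replace() is only atomic when source and destination live on
    the same filesystem.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, mode, encoding=encoding) as f:
            yield f
        os.replace(tmp, path)  # atomic on POSIX; overwrites on Windows too
    except BaseException:
        os.unlink(tmp)  # clean up the staged file on any failure
        raise

def set_file_contents(path, data):
    """The three-line convenience helper built on top."""
    with atomic_write(path) as f:
        f.write(data)
```

Readers of `path` only ever see either the previous complete contents or the new complete contents, never an intermediate state.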
(FWIW, I think "shutil.staged_write" would be fine as a name, emphasising the body of the context manager over the end of it. We'd still want to use the phrase "atomic write" somewhere in the documentation, though.)
Regarding the boundaries between contextlib and shutil: they're definitely blurry, but a rough rule of thumb is that contextlib definitely shouldn't affect anything other than the running process.
So redirecting standard streams and changing the current working directory are reasonable (albeit still debatable), but creating files is a definite "No, put it in shutil instead".
Would you look in functools for a random function?
I don't, and I don't look in contextlib for random context managers (and there isn't a classlib module where I can look for random classes, either).
Yes. I was coming here to say this. The individual writes are not necessarily atomic, but the whole file is. From the point of view of other processes, it goes from not existing / being there with some old content directly to being there with new content, with nothing in between those two states. So, something like atomicfile (maybe with underscore in the middle?) would feel right to me.
Despite using "atomic write" as the name for my own staged write implementations, this thread made me realise that there is a major problem with the name: the "atomic" property doesn't hold in the presence of multiple writers. Truly atomic operations (think atomic increments and decrements in a CPU) include a mechanism to block other writers to ensure two different writers won't use the same starting state for their modifications.
While one writer, multiple readers is a common underlying assumption in "atomic" write implementations, the key advantage I see to the suggested "staged_write" name is that the name is correct independent of the number of writers (and helps convey why having multiple writers would be a bad idea).
The feature could also theoretically be used as the basis for a future true "atomic_write" implementation in the multiprocessing module, although I personally doubt we'd ever add that (once you need multiple writers, it's usually worth reaching for sqlite instead of a flat file, and file locking has enough nasty failure modes that it's hard to create abstract solutions that don't end up looking like some form of database).
I'm not sure what you mean by that, because multiple writers actually should be fine with such an API. This is why it works for e.g. the JIT caching example I gave: you can have multiple independent processes updating a given cache entry at the same time, and the end result will be correct.
It is also how bytecode caching is implemented, btw, and you can be sure that multiple writers being a problem would have resulted in hundreds of bug reports over the years:
The rename is the only atomic part in the whole implementation,
which is why I believe that using the term transaction is more
accurate.
In database transactions, the .commit() or .rollback()
makes the changes appear for others (modulo transaction isolation
settings and some other details, but let's leave those aside).
Multiple writers will each get a different temporary file name
and write to this name. And all will eventually rename this file
to the final name.
Now, if all writers write the same content (as in the bytecode
example Antoine gave), there's no problem. The inode of the file
will change, but not its contents.
If, however, they write different content, the final state of
the file will be undefined, since it is not clear which of the
writers will run the os.rename() last.
And the writers won't notice on Unix, since os.rename() replaces
the file, even if it exists.
On Windows, the situation is different: os.rename() won't
overwrite existing files.
There are situations where one or the other is better. On Unix,
you can use hardlinks to get similar behavior to Windows,
if you want an error raised for existing files.
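The hard-link trick can be sketched as follows (POSIX-only, and the helper name is hypothetical): os.link() refuses to clobber an existing destination, so linking the staged temp file into place and then removing the temp name gives Windows-style "fail if the target exists" behaviour.

```python
import os
import tempfile

def create_exclusively(path, data):
    """Stage ``data`` in a temp file, then hard-link it to ``path``.

    Unlike os.rename()/os.replace() on Unix, os.link() raises
    FileExistsError if ``path`` already exists, so a second writer
    gets an error instead of silently overwriting the first.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.link(tmp, path)  # raises FileExistsError if path exists
    finally:
        os.unlink(tmp)  # drop the temporary name either way
```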
The documentation does not promise that os.replace is atomic on Windows, only on POSIX.
Rename the file or directory src to dst. If dst is a directory, OSError will be raised. If dst exists and is a file, it will be replaced silently if the user has permission. The operation may fail if src and dst are on different filesystems. If successful, the renaming will be an atomic operation (this is a POSIX requirement).
Ok, so you were making an unrelated point while replying to a message that did not talk about atomicity but about overwriting.
It is a good point however. It is true that MoveFileEx does not make any promise about atomicity. It will certainly not be able to when crossing filesystems, which makes it important to remain on the same filesystem.
I'm not sure if you're proposing any change to the API here. My feeling is that the semantics being proposed are straightforward, and easily described as "open a new temporary file, do all writes off to the side, then put the new content in place in an atomic action on close". Anything more complex is specialised enough to not need to be in the stdlib.
If you're enumerating cases where there might be more complexity that this basic model might not be able to handle, then that's fine. If you're proposing that the scope be extended to handle those cases, then I'm not quite clear on the details of what your concerns are - but my initial reaction is likely to be "no, let's keep the scope as it is".
There is a reason for placing this function in the tempfile module: the implementation requires tempfile, but tempfile imports shutil, so adding it to shutil would create an import loop.
It's the former case, as I'm just explaining why I think the "atomic" term can be misleading, and hence why the "staged write" naming may be preferable: "atomic" CPU operations refer to atomic "read and update" instructions that can be safely performed from multiple writers without risking lost updates.
Staged file writes are different: if the new output depends on the previous output, then you need to limit yourself to a single writer to avoid potentially lost writes. You can only easily tolerate multiple writers if each new version of the file is generated independently from other sources.
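The lost-update hazard can be shown with a deterministic simulation (hypothetical, with the two writers' interleaving played out sequentially in one process): both writers read the same starting value, each stages its own increment via rename, and the last rename wins, silently dropping the other update.

```python
import os
import tempfile

# A counter file that two writers will try to increment.
path = os.path.join(tempfile.mkdtemp(), "counter.txt")
with open(path, "w") as f:
    f.write("0")

def read_counter():
    with open(path) as f:
        return int(f.read())

def staged_write(value):
    # Stage in the same directory, then atomically rename into place.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        f.write(str(value))
    os.replace(tmp, path)

# Both writers read the same starting state...
a = read_counter()   # 0
b = read_counter()   # 0
# ...then each stages "its" increment:
staged_write(a + 1)  # file now holds "1"
staged_write(b + 1)  # file still holds "1" -- the first update is lost

print(read_counter())  # 1, not the 2 you'd expect from two increments
```

Each rename is atomic from a reader's point of view, but the read-modify-write cycle as a whole is not, which is exactly why "staged write" is the more honest name.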