Adding atomicwrite in stdlib

methane · November 12, 2021, 6:28am

This is a useful idiom:

1. Create a (hidden) temporary file with a random name.
2. Write data to it.
3. Finally, rename the temporary file to the target filename.
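
The three steps above can be sketched with only the standard library. This is a minimal illustration, not the atomicwrites implementation: the function name `write_via_rename` and the `.tmp` suffix are made up for the example, and it assumes the data is already in memory as bytes.

```python
import os
import tempfile


def write_via_rename(target, data):
    """Write *data* (bytes) to *target* via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(target))
    # Step 1: hidden temporary file with a random name, created in the same
    # directory as the target so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(prefix=".", suffix=".tmp", dir=directory)
    try:
        # Step 2: write the data to the temporary file.
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Step 3: rename the temporary file onto the target name.
        os.replace(tmp_path, target)
    except BaseException:
        # On any failure, remove the leftover temporary file and re-raise.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

Note that `tempfile.mkstemp` creates the file readable and writable only by the current user, so the resulting file can end up with different permissions than one created by a plain `open()`.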

This idiom is good for:

- Avoiding the creation of incomplete files.
- Supporting the infile=outfile case for some tools. (e.g. Issue 33927: Allow json.tool to have identical infile and outfile - Python tracker)

As far as I know, the Python stdlib doesn’t provide this idiom directly. Users need to implement it themselves, but implementing it for both Unix and Windows is difficult.

By Googling, I found the atomicwrites library. It supports Unix and Windows.

I am not sure this library is good for stdlib. At least, we need to drop Python 2 support. I want to make fsync on directory optional too (see Option to disable fsync · Issue #17 · untitaker/python-atomicwrites · GitHub).

Anyway, I think this feature is good for stdlib. I want to add this feature regardless of whether its implementation is atomicwrites or not.

1. Do you think the Python stdlib should provide this feature?
2. Do you think the atomicwrites API is good for stdlib?
3. Which module should support this? (tempfile, os, os.path, or a new atomicwrite module?)
4. Can you provide sample use cases where this feature would be useful for many Python users?

CharString · November 12, 2021, 8:59am

If this is a good fit for the stdlib at all, it would be nice if this kind of behaviour fit into the pathlib abstraction. I love that API for handling file-like things.
But maybe, in my wish to have a single clear type that I can just .open() for manipulation, I’m trying to shoehorn too much into it. Like looking for it to handle “-” gracefully.

malemburg · November 12, 2021, 9:31am

I think something like this would be a good addition, but I have
a few issues with the package you quoted:

The name is misleading, since there’s nothing atomic about
writing to the file. The rename that happens last often is
atomic in various file systems, but that’s about it. Before
that happens, other threads can run freely.

I’d suggest “transactional_write” or shorter “tx_write” as name
of the context manager instead, because that’s what it is all
about: it opens a transaction, writes content and only commits
to the correct filename in case of success.

I’m not sure where you’d want to use the non-overwrite logic
provided by the package. If you’re after a way to use the
file system for locking purposes, there are better mechanisms
for this.

The fsync is only needed for applications which want to make
sure that the data is written to disk right after finalizing
the file. It’s not something you’d want to do in general, because
it slows down processing a lot. Having this as an option with
default off is useful, of course.

The most important use case, IMO, is to be able to make sure that
files only start existing under the final name when they are fully
written (and possibly even verified).

This is important for applications such as backup or ETL tools
that use the file system as a database of existing resources, often
combined with certain characteristics such as special filenames
or extensions.

Just like the chdir() context manager, which was recently added, the
right place to put such a utility would be in the contextlib
module.
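
As a rough sketch of the “open a transaction, write, commit only on success” behaviour described above: the name `tx_write` is taken from the suggestion in this post, but the signature and the choice to forward keyword arguments to `open()` are assumptions made for illustration, not a proposed stdlib API.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def tx_write(path, mode="w", **open_kwargs):
    """Yield a file object; publish it as *path* only if the block succeeds."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with open(fd, mode, **open_kwargs) as f:
            yield f
        # Reached only if the with-body raised nothing: "commit" the file
        # under its final name.
        os.replace(tmp_path, path)
    except BaseException:
        # "Rollback": discard the partial file; any existing file at *path*
        # is left untouched.
        with contextlib.suppress(OSError):
            os.unlink(tmp_path)
        raise
```

With this sketch, a block that raises partway through leaves the previous content of `path` in place, and the final name only ever refers to a fully written file.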

pf_moore · November 12, 2021, 10:20am

I agree, this sounds more like a “transactional” write than an “atomic” write to me (even though atomic operations are used under the hood).

One other use case that I have (from pip, where we’ve implemented something like this for ourselves) is transactional replacement of a directory. We unpack a new version of a package into a “working directory” and then replace the existing directory when we’re ready. That would be useful functionality alongside transactional file replacement.

Something that it’s nice to get right (and not always easy, at least in the directory case which is the one I’m more familiar with) is cleanup. In pip, we’ve had issues where the temporary directory is left lying around after an application crash - I don’t know the details, but I’d expect a stdlib implementation to be robust in such situations. (I’d expect our implementation in pip to be robust as well, but obviously it’s trickier to do that than it looks.)
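
To make the directory case concrete, here is a hedged sketch of replacing a directory from a working directory. It is not pip’s implementation; the name `replace_directory` and the `.new-`/`.old-` prefixes are invented for the example. Because a rename cannot replace a non-empty directory, the swap takes two renames and is therefore not a single atomic step.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def replace_directory(target):
    """Yield a fresh working directory; swap it in for *target* on success."""
    target = os.path.abspath(target)
    parent = os.path.dirname(target)
    # Build the new content next to the target so renames stay on one filesystem.
    work = tempfile.mkdtemp(prefix=".new-", dir=parent)
    try:
        yield work
        holder = backup = None
        if os.path.exists(target):
            # Move the old version aside into a uniquely named holder directory.
            holder = tempfile.mkdtemp(prefix=".old-", dir=parent)
            backup = os.path.join(holder, "previous")
            os.rename(target, backup)
        try:
            os.rename(work, target)
        except BaseException:
            if backup is not None:
                os.rename(backup, target)  # put the previous version back
                os.rmdir(holder)
            raise
        if holder is not None:
            shutil.rmtree(holder, ignore_errors=True)
    except BaseException:
        # Any failure: remove the working directory so nothing is left behind.
        shutil.rmtree(work, ignore_errors=True)
        raise
```

This only cleans up when an exception is raised. A hard crash of the process can still leave a `.new-` or `.old-` directory behind, which is the robustness problem mentioned above. Also, `tempfile.mkdtemp` creates the directory accessible only to the current user, so a real implementation would have to decide what permissions the new directory should get.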

I mostly like the API of the atomicwrites implementation - in particular the ability to override things like the arguments to the tempfile.mkstemp call (which IMO would be uncommon, but when you need it, not being able to do so would be a pain).

I don’t know whether the proposal here is just to take the atomic_write context manager, or also the rename_atomic/move_atomic functions. The latter seem unnecessary as os.rename and os.replace are documented as being atomic (at least on Unix - I’d like to see that guarantee extended to Windows, but I don’t see why that couldn’t be done under the existing names).

So for me, +1 on an atomic_write context manager, preferably renamed to use the term “transactional” rather than “atomic”. +1 on a similar function for transactionally replacing a directory. contextlib seems a reasonable place to put these, although I’d also be fine with them being in tempfile. -1 on rename_atomic/move_atomic, but +1 on making os.rename/os.replace atomic on Windows as well as Unix.

erlendaasland · November 12, 2021, 12:33pm

+1 to your contextlib suggestion, @pf_moore.

pitrou · November 12, 2021, 1:34pm

Yes.

Not sure.

You probably want to reflect all the open() arguments (such as encoding, buffering…).
You probably also want to provide different strategies for determining the temporary output file (in some cases you may want to put it into a separate tmp directory, in some cases you’d rather use a special-named file in the current directory…)

shutil sounds like the best place. os is too low-level. tempfile isn’t adequate (the fact that atomic writing may use a temporary file is an implementation detail).

For example Numba has this to cache JIT output on-disk:

github.com/numba/numba/blob/5fc9d3c56da4e4c6aef7189e588ce9c44263d4a6/numba/core/caching.py#L562-L579

          @contextlib.contextmanager
          def _open_for_write(self, filepath):
              """
              Open *filepath* for writing in a race condition-free way
              (hopefully).
              """
              tmpname = '%s.tmp.%d' % (filepath, os.getpid())
              try:
                  with open(tmpname, "wb") as f:
                      yield f
                  os.replace(tmpname, filepath)
              except Exception:
                  # In case of error, remove dangling tmp file
                  try:
                      os.unlink(tmpname)
                  except OSError:
                      pass
                  raise

pf_moore · November 12, 2021, 2:16pm

Both of these are in the atomicwrites API. Keyword args in the call are passed to the mkstemp and open calls, so you can do things like this:

from atomicwrites import atomic_write
from pathlib import Path

d = Path() / "s"

with atomic_write("foo.txt", overwrite=True, dir=d, encoding="utf-8") as f:
    print(list(d.iterdir()))
    f.write("Hello €")

methane · November 14, 2021, 3:05am

Thank you all. No one objected to adding a context manager for this feature.

These are my opinions:

fsync

fsync on the file has a cost, and fsync on the containing directory is too expensive.

I don’t think this API should fsync the directory at all, so I don’t want to call this feature “transactional”: when a “transaction” has succeeded, users don’t expect a rollback after an OS crash. I wouldn’t call it “safe” either, for the same reason.

On the other hand, fsync on the file is debatable. Without fsync, we may end up with a broken (often 0-byte) file after an OS crash, instead of the old or the new file. So an fsync=False (or longer, fsync_on_success=False) option may be nice to have.

But this is not specific to this feature. fsync on close would be convenient for all new files created by the builtin open(), for the same reason.

So my current opinion is not to add fsync to this context manager at all. We may add fsync=False (or fsync_on_close=False) to open() later. If that is rejected, we can discuss adding fsync=False to this context manager again.
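
For readers unfamiliar with the two kinds of fsync being discussed, here is a small sketch. The helper names `fsync_file` and `fsync_directory` are hypothetical, and the directory variant assumes a POSIX system (opening a directory with `os.open` does not work on Windows).

```python
import os


def fsync_file(f):
    """Flush Python's buffers, then ask the OS to write the file data to disk."""
    f.flush()
    os.fsync(f.fileno())


def fsync_directory(path):
    """Ask the OS to persist the directory itself (e.g. a rename). POSIX only."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

In the write-then-rename flow, `fsync_file` would be called on the temporary file before the rename, and `fsync_directory` on the containing directory after it. The first guards against an empty or truncated file after an OS crash; the second guards against the rename itself being lost.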

Module and name

I don’t like the name “transaction” for the reasons above, and many core devs don’t like “atomic” either. So I propose “staging”. The term “staging” is well known now thanks to git.

I don’t want to add it to the contextlib module. contextlib should hold context managers for generic (e.g. closing, ExitStack) or builtin features.

Instead, my current ideas are:

io.staged_write, or io.staged_newfile
- This API hides the fact that this feature uses tempfile.
- But the io module would then depend on tempfile. I don’t like this dependency.
tempfile.staging_file
- I don’t think we must make the use of tempfile 100% an implementation detail.
  - Users can assume that the staging file is created and removed the same way as in tempfile.
- We can add staging_dir later.

Other behavior

I want to use . as the default prefix to hide the file on Unix.

We may be able to use the hidden file feature on Windows, but I am not sure that we can make it unhidden and replace atomically. Not having a file extension may be enough to prevent end users from accidentally opening the staging file with some app.

malemburg · November 14, 2021, 10:44am

I don’t want to add it to the contextlib module. contextlib should hold context
managers for generic (e.g. closing, ExitStack) or builtin features.

contextlib was used for the chdir() context manager as well, which
is rather specific to file systems, so the new context manager
would be in good company.

Putting it in tempfile is misleading, since the context manager will
indeed create a non-temporary file when exiting.

I want to use . as the default prefix to hide the file on Unix.

We may be able to use the hidden file feature on Windows, but I am not sure that we
can make it unhidden and replace atomically. Not having a file extension may be
enough to prevent end users from accidentally opening the staging file with some app.

Please be careful with hiding such files. In case of process crashes,
this could easily cause lots of those temporary files to pile up in
arbitrary directories, without the user noticing (except perhaps by
checking the free disk space and wondering where all the space went).

IMO, this should be an option, which is off by default.

A better approach is to keep the file visible and instead add a
special extension to the name of the temporary file, much like
downloaders often do. In this case, “.staging” would probably be
appropriate.

This could also be made an option, of course.
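
The two naming policies discussed here (hidden leading-dot name versus a visible name with a “.staging” extension) could be exposed as options along these lines. This is only a sketch: `make_staging_file` and its `hidden` and `suffix` parameters are hypothetical names, and the default shown is a policy choice, not a settled one.

```python
import os
import tempfile


def make_staging_file(target, hidden=False, suffix=".staging"):
    """Create an empty staging file next to *target*; return (fd, path)."""
    directory, name = os.path.split(os.path.abspath(target))
    # hidden=True: leading dot, so the file is skipped by a plain `ls` on Unix.
    # hidden=False: the file stays visible and is marked only by its suffix.
    prefix = ("." if hidden else "") + name + "."
    # mkstemp inserts a random component between the prefix and the suffix.
    return tempfile.mkstemp(prefix=prefix, suffix=suffix, dir=directory)
```

For a target named `report.csv`, the staging file gets a name of the form `report.csv.<random>.staging`, or the same with a leading dot when `hidden=True`. Keeping the target name in the prefix makes leftover files after a crash easier to attribute.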

pitrou · November 14, 2021, 4:56pm

Does this happen with modern (journalled) filesystems?

pitrou · November 14, 2021, 4:57pm

It’s not “famous” at all. Even as a git user, I rarely use or encounter the term “staging”. IMHO “atomic” is the right name for this.

cameron · November 14, 2021, 9:25pm

I also prefer the name “atomic”, in that this provides a flow where the
filename is absent, then atomically present with full contents.

I wrote my own version of just this recently, independently, and named
it @atomic_filename which to my mind better indicates the thing that is
atomic - the presence of the filename on completion.

Cheers,
Cameron Simpson cs@cskk.id.au

cameron · November 14, 2021, 9:31pm

I want to use . as the default prefix to hide the file on Unix.

We may be able to use the hidden file feature on Windows, but I am not sure that we
can make it unhidden and replace atomically. Not having a file extension may be
enough to prevent end users from accidentally opening the staging file with some app.

Please be careful with hiding such files. In case of process crashes,
this could easily cause lots of those temporary files to pile up in
arbitrary directories, without the user noticing (except perhaps by
checking the free disk space and wondering where all the space went).

IMO, this should be an option, which is off by default.

I am of the opposite view. I prefer the leading-dot-by-default approach.
As a usability thing, it is nice to walk into a dynamic hot folder
(as a typical use situation for this) and do an “ls” and see complete
things.

There’s plenty of precedent for using a leading dot (or otherwise
marking the file as invisible if not on UNIX); for example rsync uses
exactly this approach for its atomic-update effect: scratch file names
with a dot, renamed onto the final name when complete.

I’m using the same in a work project which makes hot folders: we just
tell the consumer that names commencing with a dot should be ignored,
and anything else should be considered ready for use.

A better approach is to keep the file visible and instead add a
special extension to the name of the temporary file, much like
downloaders often do. In this case, “.staging” would probably be
appropriate.

I’m -1 on this. Almost any extension would be likely to collide with
some legitimate use for such an extension.

This could also be made an option, of course.

It certainly sounds like it should be, since we’re always going to
disagree about this policy setting.


EpicWink · November 14, 2021, 10:00pm

What’s the case against @pitrou’s suggestion to put this functionality in shutil? That was the first place I thought of, with higher-level file/directory manipulation.

To me, a transaction in the database sense is a sequence of multiple operations which only are committed when all operations succeed. This proposal would only affect one operation (the file write), and not other Python commands which prevent the transaction from completing.

Atomic is when the operation is instant in the eyes of all observers (other threads/processors). This seems to be the intended functionality, regardless of implementation. It seems like the popular implementation is to perform an atomic rename on a fully constructed file (required on all platforms).

Staging seems to reference the popular implementation (and does make sense in that regard, but not because of Git, rather from dev-ops deployments), but not the abstract functionality.

methane · November 15, 2021, 12:25am

Hmm…

BTW, why did chdir go into contextlib instead of shutil?

eryksun · November 15, 2021, 1:33am

In Windows, the file can be created in the target directory (with whatever unique name) using CREATE_NEW disposition, GENERIC_WRITE access (or more) and without write/delete access sharing. The single open can be used to write to the file and rename it. To rename, call SetFileInformationByHandle(handle, FileRenameInfo, ...) with ReplaceIfExists enabled. In Windows 10, there’s also FileRenameInfoEx, which supports the flag combination FILE_RENAME_FLAG_REPLACE_IF_EXISTS | FILE_RENAME_FLAG_POSIX_SEMANTICS. This can replace the target file even if it has existing opens, provided they share delete access (still uncommon, unfortunately).

The created file will have a different create time and file attributes, but that’s not a big deal. What bothers me more is that the created file will have a new security descriptor, so the owner and permissions may change. At a minimum that needs to be documented. We could do more. The original file’s security can be queried via GetNamedSecurityInfoW() and set on the new file via SetSecurityInfo(). For this, the open would have to also include WRITE_OWNER and WRITE_DAC access. Setting the file security might fail. For example, the owner might not be one we’re allowed to set. Also, if the target file has a high-integrity label, our process would need at least that integrity level. If we have administrator access, the file integrity level shouldn’t be a problem, and an administrator can enable SeRestorePrivilege and open with FILE_FLAG_BACKUP_SEMANTICS to set an arbitrary owner.

pitrou · November 15, 2021, 9:15am

I don’t know, but that sounds like a mistake to me. I would never go and look in contextlib for OS-related utilities.

malemburg · November 15, 2021, 9:27am

AFAIK, the SC chose contextlib.

cameron · November 15, 2021, 11:25pm

That’s not apparent. I think the submitted PR put it in contextlib and
the SC didn’t object.

PR: bpo-25625: add contextlib.chdir by FFY00 · Pull Request #28271 · python/cpython · GitHub

Personally I think both the chdir() context manager and this atomic
suggestion would both better belong in shutil.


ericvsmith · November 16, 2021, 12:29am

I completely agree.