`open()`able objects

I propose that we teach open() to accept “openable” objects, which are objects with some/all of these methods:

  • __open_r__(buffering=-1, encoding=None, errors=None, newline=None)
  • __open_rb__(buffering=-1)
  • __open_w__(buffering=-1, encoding=None, errors=None, newline=None)
  • __open_wb__(buffering=-1)

These methods would be tried by open() after confirming that its argument isn’t path-like or an integer. If the user requests text mode, but the openable object only provides binary-mode methods, then we call an appropriate method and wrap its result in io.TextIOWrapper.

Someone smarter than me can figure out what methods we should call for other modes, including multi-character modes and modes with “+” characters. I don’t think we need to design an algorithm here and now, as long as we’re satisfied that one could be designed if needed.
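As a minimal sketch of the fallback described above (not the actual magic_open() code), the dispatch for the four basic modes might look roughly like this. The names `open_openable` and `Greeting` are hypothetical and exist only for illustration; the sketch assumes the dunder methods take the arguments listed in the proposal and skips the path-like and integer checks that the real open() would do first.

```python
import io


def open_openable(obj, mode="r", buffering=-1, encoding=None,
                  errors=None, newline=None):
    """Hypothetical helper: dispatch to the proposed __open_*__ methods."""
    if mode not in ("r", "rb", "w", "wb"):
        raise ValueError(f"mode {mode!r} is not handled by this sketch")
    cls = type(obj)
    if "b" in mode:
        method = getattr(cls, f"__open_{mode}__", None)
        if method is None:
            raise TypeError(f"{cls.__name__} can't be opened with mode {mode!r}")
        return method(obj, buffering)
    # Text mode: prefer the text-mode method when the object provides one.
    method = getattr(cls, f"__open_{mode}__", None)
    if method is not None:
        return method(obj, buffering, encoding, errors, newline)
    # Otherwise call the binary-mode method and wrap its result.
    method = getattr(cls, f"__open_{mode}b__", None)
    if method is None:
        raise TypeError(f"{cls.__name__} can't be opened with mode {mode!r}")
    return io.TextIOWrapper(method(obj, buffering), encoding, errors, newline)


class Greeting:
    """Toy openable that only provides a binary-mode reader."""

    def __open_rb__(self, buffering=-1):
        return io.BytesIO(b"hello\n")


with open_openable(Greeting(), "r", encoding="utf-8") as stream:
    print(stream.read())
```

Here the text-mode request succeeds even though `Greeting` only implements `__open_rb__()`, because the binary stream is wrapped in io.TextIOWrapper.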

Background

pathlib.Path provides an __fspath__() method, and both pathlib.Path and zipfile.Path provide open() methods, and so our options for opening some kind of path as a file object currently look like this:

                open(obj)    obj.open()
str             OK           AttributeError
pathlib.Path    OK           OK
zipfile.Path    TypeError    OK
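The table can be reproduced with a short script. This is only an illustrative sketch: the `outcome` helper and the in-memory archive are made up for the demonstration, and the results reflect current behaviour, where zipfile.Path has no __fspath__() method.

```python
import io
import pathlib
import tempfile
import zipfile


def outcome(func):
    """Return "OK" if func() succeeds, else the name of the exception raised."""
    try:
        stream = func()
    except (AttributeError, TypeError) as exc:
        return type(exc).__name__
    stream.close()
    return "OK"


# Build a one-member zip archive in memory for zipfile.Path to point at.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hi")

with tempfile.TemporaryDirectory() as tmp, zipfile.ZipFile(buffer) as archive:
    real = pathlib.Path(tmp, "hello.txt")
    real.write_text("hi")
    candidates = {
        "str": str(real),
        "pathlib.Path": real,
        "zipfile.Path": zipfile.Path(archive, "hello.txt"),
    }
    # Each lambda is called immediately, so it sees the current `obj`.
    results = {
        name: (outcome(lambda: open(obj)), outcome(lambda: obj.open()))
        for name, obj in candidates.items()
    }

for name, (via_builtin, via_method) in results.items():
    print(f"{name:<16}{via_builtin:<13}{via_method}")
```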

Motivation

I’m working on adding ABCs for pathlib.Path-like objects. Most methods fall clearly into either an “input” or “output” group, and so I’ve split the ABCs into ReadablePath and WritablePath (they have a common base class called JoinablePath). But the open() method is an outlier:

  1. It supports both reading and writing
  2. More generally, it bundles quite a lot of useful stuff (text encoding etc) into a single method, which is great for users but annoying for folks who need to implement it.
  3. It shadows a built-in name, for folks who care about that sort of thing.

In this proposal, the ReadablePath ABC would have an __open_rb__() abstract method, and WritablePath would have __open_wb__(). The zipfile.Path class would gain both methods.
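A rough sketch of how that split might look, assuming nothing beyond the two abstract methods named above. The real ABCs have a JoinablePath base class and many more methods, all omitted here, and `InMemoryFile` is a hypothetical implementer used only for illustration.

```python
import io
from abc import ABC, abstractmethod


class ReadablePath(ABC):
    """Sketch only: the real ABC has more methods and a JoinablePath base."""

    @abstractmethod
    def __open_rb__(self, buffering=-1):
        """Return a binary stream open for reading."""


class WritablePath(ABC):
    """Sketch only: the real ABC has more methods and a JoinablePath base."""

    @abstractmethod
    def __open_wb__(self, buffering=-1):
        """Return a binary stream open for writing."""


class InMemoryFile(ReadablePath):
    """Hypothetical read-only implementer backed by a bytes object."""

    def __init__(self, data):
        self._data = data

    def __open_rb__(self, buffering=-1):
        return io.BytesIO(self._data)


readme = InMemoryFile(b"read me")
print(isinstance(readme, ReadablePath), isinstance(readme, WritablePath))
```

With separate dunders, "readable" and "writable" become properties that can be checked from the interface alone, which is the point of the split.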

In an ideal world, perhaps we’d deprecate pathlib.Path.open() and zipfile.Path.open() so that everyone could move to using the open() built-in. But realistically, it’s not worth breaking users’ scripts IMO.

Sample implementation

I’ve implemented a variant[1] of this as an internal magic_open() function. Source code: cpython/Lib/pathlib/_os.py at cc39b19f0fca8db0f881ecaf02f88d72d9f93776 · python/cpython · GitHub

This is exposed as a public function in my pathlib-abc pypi package: API Reference - pathlib-abc documentation

Example magic method implementations for zipp.Path: Add dependency on `pathlib-abc` by barneygale · Pull Request #132 · jaraco/zipp · GitHub

Grateful for any feedback - thanks.


  [1] Supports arbitrary modes, but not very well!


I like this conceptually, because it fits next to builtins like iter as a generic interface that anyone can extend. One could imagine implementing some kind of datastore that uses this, or subclassing Path for a specific file type that needs special handling.

I occasionally use Path.open in a function that I expect to receive Path objects, until I inevitably decide that I would like to handle strings as well and change it. That’s just a me problem, but this convention might nudge me to stop doing it.


To keep things consistent and avoid many methods, maybe one __open__ with mode argument could be an option?

If a mode is not supported, it could raise NotImplementedError. E.g. in your magic_open:

if text:
    try:
        stream = obj.__open__('r', buffering, encoding, errors, newline)
    except NotImplementedError:
        # No text-mode support: open in binary mode and wrap the result.
        stream = obj.__open__('rb', buffering)
        stream = TextIOWrapper(stream, encoding, errors, newline)

@dg-pb I think that loads too much functionality into a single method. Plus it would be good if openable-for-reading and openable-for-writing types were distinguishable just from their interface IMO. Otherwise my ReadablePath / WritablePath problem remains unsolved!

But how would this ever work for mode a (append) or w+ (read-and-write)? I don’t think you can just handwave this as “we can figure it out later”. These modes are an important part of open(), and you [1] need to make sure that their needs are fulfilled.

I can think of a few solutions:

  • Have a method for every unique mode - this is a finite list, but it is quite a long list - I don’t think this is a good solution.
  • Have a mode parameter in these methods in addition to what they already have - this makes the duplicate definitions kinda pointless - it would be the same as just having one __open__ method (which I do think is the better interface tbh)
  • Use a single __open__ method and have a class attribute __supported_open_modes__ that contains a list of which modes are valid to be passed in.

  [1] as the person suggesting them

In practice, implementations like zipfile.Path tend to only support “r” and “w” modes, and maybe “a”. The remaining modes are fairly specific to the local filesystem and don’t usually generalize to virtual paths.

Even so, a general algorithm might look like:

  1. Record whether binary or text mode was requested, and remove “b” and “t” characters
  2. Record whether updating was requested and remove “+” characters
  3. Sort remaining characters
  4. Call method like __open_{mode}{binary}{update}__, where binary is b in binary mode, and update is _plus in update mode.
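The four steps above can be sketched as a small name-building function. The function name `open_method_name` is hypothetical, and the sketch does no validation of the mode string.

```python
def open_method_name(mode):
    """Build the dunder name for a mode string, following the steps above."""
    # Steps 1 and 2: record binary/update, then drop the "b", "t", "+" characters.
    binary = "b" in mode
    update = "+" in mode
    remaining = [char for char in mode if char not in "bt+"]
    # Step 3: sort whatever is left.
    base = "".join(sorted(remaining))
    # Step 4: assemble __open_{mode}{binary}{update}__.
    return f"__open_{base}{'b' if binary else ''}{'_plus' if update else ''}__"


for mode in ("r", "rb", "wt", "w+", "r+b"):
    print(mode, "->", open_method_name(mode))
```

For example, "r+b" and "rb+" both map to `__open_rb_plus__`, so the order of the characters in the mode string does not matter.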

But tbh I don’t think we need it. Supporting “r”, “w” and maybe “a” would be sufficient.

That is a definite positive. But is it crucial? There are quite a few places where NotImplemented is used to signal a lack of support.

An implementation could have separate methods which are called from a single __open__.


I get that it falls nicely into place from the angle you are working from, but I am just thinking from the perspective of a user who doesn’t know much about Path objects.

So to implement iter, one needs to implement __iter__, etc…

If I want to make an arbitrary openable quickly, then intuitively, I can just look up the args of open, and easily infer the rest.

import io

class TextEditorView:
    ...
    def __open__(self, mode='r', *args, **kwds):
        # Only plain read-only text mode is supported here.
        assert not args
        assert not kwds
        assert mode == 'r'
        return io.StringIO(self.join_all_text())

Otherwise, if the interface is special, then I need to go search the docs and understand a bunch of information that I probably don’t need at this point.


I am not very well versed in the whole Path infrastructure. Could you please spell out for me what is the exact problem?

I, as a user and potential implementer of this interface, would be utterly confused by a magic method that only allows me to implement a subset of the behavior of the method itself. This would be like __pow__ not supporting the modulo parameter. Yes, it’s only useful/meaningful sometimes, but it is still part of the semantics.


I think you are tying yourself too closely to the idea of ABCs here. Classes are allowed to communicate supporting something via other means as well. 1 + "a" is invalid despite both classes having __add__ and __radd__ defined.


Adding __open__() would create a magic method that is expected to do too much IMO. Its return value would vary by mode; it would be less clear which modes are supported; it wouldn’t be possible to make reading or writing support a requirement in ABCs; etc. It’s more-or-less a historical accident that we have a single open() function that can do so many different things, and it’s not something I want to propagate to a magic method when it comes with so many drawbacks.


Please add a separate function for this. open() is already too overloaded.

Or you can just call the open() method.


That’s my intention if folks don’t want this in built-in open(). It’ll become pathlib.open() or something like that.

Damn why didn’t I think of that?

I really prefer a direct member method here rather than the builtin open() / io.open() picking up more behaviors. What “Raw I/O” objects (ex. FileIO) that are the base of the I/O stack in open() need to do according to docs (RawIOBase) vs. what they do in code vs. code comments all differ a bit (working on resolving this), and I’m worried more code implementing that interface without more docs is likely to result in more weird edge cases as the Buffered I/O and Text I/O layers depend on that specific behavior. The Raw I/O layer in particular doesn’t retry partial writes, but a lot of code wants/implements the Buffered I/O layer “write all or throw exception” (with special cases around O_NONBLOCK file descriptors).

I think it’s more straightforward for this case to return a “buffered I/O” (BufferedIOBase) level “file-like” object, which can then be composed into text. Having common code which can do that composition reliably would be more interesting to me (e.g. you want to .open('rt'), so make the zip file buffered and then wrap it with a TextIOWrapper). Ideally, to me, open() would be able to use that common helper and just focus on integer file descriptors and/or file names.
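That composition can already be done by hand today. The sketch below uses only existing standard-library calls: zipfile.ZipFile.open() returns a binary ZipExtFile (a BufferedIOBase subclass), and io.TextIOWrapper layers decoding on top of it. The archive contents are made up for the demonstration.

```python
import io
import zipfile

# Build a small archive in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("notes.txt", "line one\nline two\n")

with zipfile.ZipFile(buffer) as archive:
    # archive.open() gives a buffered binary stream; wrap it to get text.
    with io.TextIOWrapper(archive.open("notes.txt"), encoding="utf-8") as text:
        lines = text.read().splitlines()

print(lines)
```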

Instead of adding 4 somewhat oddly-named dunders, could you just decorate open with singledispatch and let objects register themselves?

Another benefit of dispatch is that if Python ever adds a better interface (I agree with your points about open having an overloaded interface), then that better interface can also have the same kind of dispatch and you don’t have this weird interface baked into the objects themselves.
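A minimal sketch of the dispatch idea, using functools.singledispatch on a stand-in function rather than on the real built-in. The names `open_any` and `Note` are hypothetical; the built-in open() is not actually a singledispatch function.

```python
import builtins
import functools
import io


@functools.singledispatch
def open_any(obj, mode="r", **kwargs):
    """Default case: defer to the built-in open() for paths and descriptors."""
    return builtins.open(obj, mode, **kwargs)


class Note:
    """Hypothetical non-path object that wants to be openable."""

    def __init__(self, text):
        self.text = text


@open_any.register
def _(obj: Note, mode="r", **kwargs):
    # The type registers itself; no dunder method is needed on the class.
    if mode != "r":
        raise ValueError(f"Note objects only support mode 'r', not {mode!r}")
    return io.StringIO(obj.text)


with open_any(Note("remember the milk")) as stream:
    print(stream.read())
```

One trade-off of this shape is that registration is tied to a specific function object, so a future replacement for open() would need its own registrations.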

I would like open to have its dunder, while I am unlikely to make use of pathlib.open.

Although open may well be too overloaded for historical reasons, having many dunders for it doesn’t fix the whole situation. IMO, while it would fix this on the dunder side of things, it would bring more confusion in a more general sense by introducing an exception to the builtin <-> __dunder__ correspondence.

Maybe:

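def open_dunder_split(self, mode='r', *args, **kwds):
    # Shared helper: pick __open_r__, __open_rb__, ... based on `mode`.
    ...

class A:
    __open__ = open_dunder_split

    def __open_r__(self, *args, **kwds): ...
    def __open_rb__(self, *args, **kwds): ...
    ...

And place open_dunder_split, with your proposed algorithm, in a convenient location so that it can be re-used.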

Like others above, I don’t really like the idea of having these four specific dunder methods: it’s an incomplete API (for example as stated it doesn’t provide for + at all) and inherits the downside of the classic “bunch of letters” mode parameter without the benefit:

  • the downside being that a bunch of unrelated features are adjusted with arbitrary characters in a string, and that these affect what is allowed to be done with the resulting object in a way that’s incredibly annoying not just for typing but also understanding by people new to this concept, that doesn’t really show up anywhere else in programming these days;
  • the benefit being that a single function is used for many use cases that (probably) have the majority of their functionality in common.

Using a single __open__ dunder would be better than that, but I wouldn’t support this either on the grounds that the downside of the mode parameter still makes it not worth bothering with.

What would be better still?

IMV you need the full set of functions, with some extra arguments, and they should be named more appropriately:

__open_bytes_reader__(*,
    buffering=-1) -> BytesReader
__open_bytes_writer__(*, append=False, must_create=False, truncate=None,
    buffering=-1) -> BytesWriter
__open_bytes_duplex__(*, allow_create=False, must_create=False, truncate=None,
    buffering=-1) -> BytesDuplex
__open_str_reader__(*,
    buffering=-1, encoding=None, errors=None, newline=None) -> StrReader
__open_str_writer__(*, append=False, must_create=False, truncate=None,
    buffering=-1, encoding=None, errors=None, newline=None) -> StrWriter
__open_str_duplex__(*, allow_create=False, must_create=False, truncate=None,
    buffering=-1, encoding=None, errors=None, newline=None) -> StrDuplex

You then translate mode as follows:

r           ..._reader()
w           ..._writer()
x           ..._writer(must_create=True)
a           ..._writer(append=True)
r+          ..._duplex(truncate=False)
w+          ..._duplex(truncate=True)
x+ (PHP)    ..._duplex(must_create=True)
c (PHP)     ..._writer(truncate=False)
c+ (PHP)    ..._duplex(allow_create=True)

combined with

b                  bytes_...
t                  str_...
neither b nor t    str_...

This covers all possible modes you can pass to open, while presenting a better interface which could also be used for a future split of builtin open and pathlib.Path.open to the above six functions (without their leading/trailing underscores of course).
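The two translation tables can be combined into one lookup. The following is a sketch under the assumption that the dunder names and keyword arguments are exactly as listed above; `translate_mode` is a hypothetical helper name. Note that "a+" does not appear in the table, so this sketch rejects it rather than guessing a translation.

```python
# Mode (without "b"/"t") -> (role, keyword arguments), copied from the table.
_TRANSLATION = {
    "r": ("reader", {}),
    "w": ("writer", {}),
    "x": ("writer", {"must_create": True}),
    "a": ("writer", {"append": True}),
    "c": ("writer", {"truncate": False}),
    "r+": ("duplex", {"truncate": False}),
    "w+": ("duplex", {"truncate": True}),
    "x+": ("duplex", {"must_create": True}),
    "c+": ("duplex", {"allow_create": True}),
}


def translate_mode(mode):
    """Return (dunder name, kwargs) for a classic open() mode string."""
    kind = "bytes" if "b" in mode else "str"
    base = "".join(char for char in mode if char in "rwxac")
    key = base + ("+" if "+" in mode else "")
    if key not in _TRANSLATION:
        raise ValueError(f"mode {mode!r} is not covered by this sketch")
    role, kwargs = _TRANSLATION[key]
    return f"__open_{kind}_{role}__", dict(kwargs)


for mode in ("r", "wb", "x", "r+b", "c+"):
    print(mode, "->", translate_mode(mode))
```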

The return type of each function can be specified differently:

  • Bytes... objects take/return bytes objects
  • Str... objects take/return str objects
  • write is available only on ...Writer and ...Duplex
  • read is available only on ...Reader and ...Duplex

What are these PHP modes?

The PHP documentation for fopen describes modes x+, c and c+ which are not mentioned in the Python documentation.

c and c+ aren’t implemented by Python, but IMO they should be - or at least c+ as this is the more useful of the two.

x+ in Python does not raise an exception if I request it (using builtin open) but I’ve not tested thoroughly if it actually does what I expect it to do. If Python does already implement this mode, it ought to be documented; otherwise, it should (preferably, IMO) be implemented and documented, or if it’s decided that Python should not support this mode, attempting to use this mode should raise an exception and, again, this should be documented.
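A quick way to check the x+ behaviour is a throwaway script like the one below. It is only a spot check in a temporary directory, not a thorough test: it opens a new file with x+, writes and reads back through the same handle, then tries x+ again on the now-existing file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "new.txt")

    # First open: the file does not exist yet, so exclusive creation succeeds
    # and the "+" makes the handle readable as well as writable.
    with open(path, "x+") as stream:
        stream.write("hello")
        stream.seek(0)
        first_read = stream.read()

    # Second open: the file now exists, so exclusive creation should fail.
    try:
        open(path, "x+")
    except FileExistsError:
        second_open_failed = True
    else:
        second_open_failed = False

print(first_read, second_open_failed)
```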

If it’s decided that Python shouldn’t support c or c+, then the allow_create arg can be removed from ..._duplex, the truncate arg can be removed from ..._writer, and the default for truncate in ..._duplex could be False, rather than None.

I was in two minds about whether to include these in this post - since it may well seem like an arbitrary, even unrelated, addition to the matter at hand - but I felt ultimately that it was better to include it, since if I simply proposed this API without consideration of x+, c and c+, the small differences in how the arguments would be specified might make it more difficult to add them later in a natural way.

Why the extra arguments?

There’s a possibly subtle reason why I’ve decided to use boolean arguments for append, must_create and truncate, while using separate functions altogether to determine read/write and binary/text. The reason is: append, must_create and truncate do not affect the API of the resulting object, and only affect what happens during the open call and not what happens afterwards (not counting the caveat regarding append and seek/tell described below).

append=True should override must_create and truncate; meanwhile, the default of truncate=None should act like truncate=True for open_..._writer or truncate=False for open_..._duplex. I specify None as the default here rather than True or False because a default of True on open_..._writer would be confusing when setting append=True. (As above, if the c mode is not added, then the truncate argument would be removed from open_..._writer and this problem goes away.)
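The defaulting rules in this paragraph could be centralised in small helpers so that implementers do not each have to re-derive them. This is a sketch only: the helper names are hypothetical, and how must_create should interact with an explicit truncate value is not specified above, so the sketch passes both through unchanged.

```python
def normalize_writer_args(append=False, must_create=False, truncate=None):
    """Resolve writer arguments: append wins, truncate=None means True."""
    if append:
        # append=True overrides both must_create and truncate.
        return {"append": True, "must_create": False, "truncate": False}
    return {
        "append": False,
        "must_create": must_create,
        "truncate": True if truncate is None else truncate,
    }


def normalize_duplex_args(allow_create=False, must_create=False, truncate=None):
    """Resolve duplex arguments: truncate=None means False."""
    return {
        "allow_create": allow_create,
        "must_create": must_create,
        "truncate": False if truncate is None else truncate,
    }


print(normalize_writer_args())             # plain "w": truncates
print(normalize_writer_args(append=True))  # "a": never truncates
print(normalize_duplex_args())             # default duplex: keeps contents
```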

When translating the r+ and w+ modes, the call should explicitly specify the arg in both cases since it’s not necessarily obvious otherwise. (I’ve chosen truncate=False to be the default behavior for open_..._duplex since I believe r+ to be much more commonly used than w+, but I haven’t specifically researched this.)

A note on seek/tell

The above doesn’t specify whether or not seek/tell can be used on the resulting object.

Now in almost all cases, this isn’t a problem, because it is “the thing that is being opened” that will determine this (so the appropriate dunder method can be typed to return an interface that does/doesn’t have seek/tell defined based on the knowledge that files opened from a particular class will/won’t be able to use them).

The sole exception, to my knowledge, is append mode: due to OS-related reasons, seek won’t (or at least might not) work there (and I’m not sure about tell), even if it theoretically could.

I think this combination is minor enough that it suffices to place a note in the doc comment of any seek (and tell if needed) interface function to state that it won’t work if the file was opened in append mode.


@rrolls superb - that’s a really compelling proposal IMO. I’ll have a crack at implementing it for the pathlib ABCs.


@rrolls I’m playing around with a local implementation of your idea and I have some feedback.

The interaction between allow_create, must_create and truncate is a bit subtle I think, and we’re pushing that complexity into user implementations of __open_bytes_duplex__() etc. In practice their implementations will only be called with certain combinations of argument values, but the space of theoretically possible values is larger, perhaps too large.

I do really like the naming though. I wonder: did you consider something like this?

r     ..._reader()
w     ..._writer()
x     ..._creator() (or perhaps _exclusive_writer() to leave room for PHP’s “c”)
a     ..._appender()
r+    ..._duplex_reader() (or perhaps _duplex(truncate=False))
w+    ..._duplex_writer() (or perhaps _duplex(truncate=True))

Obviously we lose the nice uniqueness of the return types, but I think we make life a bit easier for implementers this way.

Thanks for working on this! I’m glad you found my suggestion useful so far :)

Yes, I agree with you that the argument handling is going to be a bit tricky for implementors of these functions, and that did come to mind at the time.

But I suppose I’m optimising to make the caller’s life easier while you’re optimising to make the callee’s life easier. I’m definitely in the habit of prioritising simplicity for the caller over simplicity for the callee. But I can’t say where I got that habit from or whether it’s right or wrong. In this situation I don’t think there’s a perfect solution that wins for both parties (but keep reading for what might be a very good compromise…).

The point of naming the functions reader, writer, duplex is that they create and return a thing of that type: either something that reads, something that writes, or something that does both. I can’t really agree with your suggested extra functions:

  • If you have a function named ..._creator it sounds like calling that function should return something that creates things - which is nonsense here because the file would already have been created; you still get a writer, all you did was say “I want you to raise an exception if the file already exists, but without TOCTTOU problems, please and thank you”. Similarly, ..._exclusive_writer is misleading: it is only the creation of the file that is exclusive; once the file has been opened, some other program can still come along and open it and write to it at the same time, so what you have obtained is not an “exclusive writer” in any sense of the term.
  • ..._appender is rather harder for me to throw out, and honestly, if I were to pick one of your functions to add to my set of three, it’d be this one, since then nobody has to theorise over “what if append=True and truncate=True?” (but this will become irrelevant in my ‘compromise’…) and the seek/tell issue can be resolved too (because the appender function could return a different type and then the existence of seek/tell can vary between writer and appender). The seek/tell issue, though, is not relevant in an “ideal world” - it’s only an issue due to OS limitations - and if we’re talking about abstract path-like things like the contents of zipfiles, it may well not be relevant there either.
  • I really don’t like having different functions for r+ and w+, because these two modes really do do almost identical things: they both open a file with the ability to read from and write to it, and the only difference is that r+ does not truncate while w+ does. And again the names ..._duplex_{reader,writer} are misleading: what is a “duplex reader”? You’re not getting a “duplex reader”, you’re getting a duplex (a thing that can read or write); “duplex reader” implies that only reading is possible in which case the word “duplex” must refer to something else. Same for writer. So I’d say this is the point I’m most adamant about: these should definitely both be the same ..._duplex function with a truncate argument.

The other possibility I thought of (here’s my ‘compromise’!) was to use a single enum argument in place of the append, allow_create, must_create and truncate arguments (there’d be two different enums, one used by open_..._writer and the other used by open_..._duplex). I felt that having the multiple arguments was “better” - because it’s simpler for the caller - but replacing them with a single enum argument would definitely make things simpler for the callee, while still being natural enough for the caller. So I’d say that using an enum, and sticking to the three reader,writer,duplex functions, would be much better than having additional separate functions. I suppose the modes would be:

  • for writer
    • Append (mode a)
    • Truncate_If_Exists (mode w)
    • Must_Create (mode x)
    • Allow_Create_Dont_Truncate (mode c)
  • for duplex
    • Must_Exist_Dont_Truncate (mode r+)
    • Truncate_If_Exists (mode w+)
    • Must_Create (mode x+)
    • Allow_Create_Dont_Truncate (mode c+)

This way, we get to keep the one-to-one correspondence between reader,writer,duplex in the function name and Reader,Writer,Duplex in the return type. Which is natural and intuitive.
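The two enums might be spelled as below. The member names follow the list above; the class names `WriterMode` and `DuplexMode`, the choice of the classic mode strings as values, and the `writer_flags` mapping onto os.open() flags are all assumptions added for illustration, showing how a local-filesystem implementer could consume a single enum argument.

```python
import enum
import os


class WriterMode(enum.Enum):
    Append = "a"
    Truncate_If_Exists = "w"
    Must_Create = "x"
    Allow_Create_Dont_Truncate = "c"


class DuplexMode(enum.Enum):
    Must_Exist_Dont_Truncate = "r+"
    Truncate_If_Exists = "w+"
    Must_Create = "x+"
    Allow_Create_Dont_Truncate = "c+"


def writer_flags(mode):
    """Assumed mapping from WriterMode to os.open() flags for local files."""
    extra = {
        WriterMode.Append: os.O_CREAT | os.O_APPEND,
        WriterMode.Truncate_If_Exists: os.O_CREAT | os.O_TRUNC,
        WriterMode.Must_Create: os.O_CREAT | os.O_EXCL,
        WriterMode.Allow_Create_Dont_Truncate: os.O_CREAT,
    }
    return os.O_WRONLY | extra[mode]


# A classic mode string converts directly to an enum member by value.
print(WriterMode("x"), DuplexMode("r+"))
```

With one enum argument, an implementation only has to handle four well-defined cases per function instead of every combination of three booleans.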

Does this solve your concerns with the subtle interactions between the separate arguments?


While this is generally a balance that should be kept in mind, I think that in this case the callee’s simplicity should be valued more. The caller is almost always going to be open, a builtin function we have control over (or maybe a new version of it somewhere else), i.e. this protocol is almost surely going to have far more implementers than direct consumers (even if both numbers are probably going to stay below 100 for the foreseeable future).
