`open()`able objects

Thank you Rob - I really appreciate your input.

Personally I like this more than the multiple-booleans option. We might be coming full circle here, but what if we accept a mode character instead of an enum member?

r..._reader()
w..._writer(mode='w')
x..._writer(mode='x')
a..._writer(mode='a')
r+..._duplex(mode='r')
w+..._duplex(mode='w')
PHP x+..._duplex(mode='x')
PHP c..._writer(mode='c')
PHP c+..._duplex(mode='c')

Using a string rather than an enum might not sit well in languages like Java, but I think in Python and in this case specifically (where other arguments like encoding and errors accept string arguments) it seems reasonably natural to me.

We’d document that _writer() accepts ‘w’, ‘x’, ‘a’ and ‘c’, and that _duplex() accepts ‘r’, ‘w’, ‘x’ and ‘c’. If we wanted to omit the PHP-specific modes to begin with, we could do that without painted ourselves into a corner.

From a user perspective, the open() algorithm is reasonably easy to grasp:

  1. If the mode includes ‘+’, remove it and call _duplex(); otherwise
  2. If the mode is ‘r’, call _reader(); otherwise
  3. If the mode is non-empty, call _writer(); otherwise
  4. Raise ValueError

(I’m excluding handling of text/bytes and invalid modes for brevity)

What do you reckon?

3 Likes

I think this covers all the bases, actually!

  • We have the three functions/types - reader, writer, duplex - with no oddities
  • All use cases are covered
  • It’s natural for both caller and callee
  • It’s easy to document the possible mode values and the way open() will decide which mode to pass, as you describe
  • It’s a lot simpler to write 'w' than some_module.SomeEnum.Truncate_If_Exists - and in this case, the modes are widely understood and there’s precedent, so a string wins out here

I’d suggest that mode be a required, keyword-only arg, and that the documentation specifies precisely that only those four single-character strings are possible in each case (so implementors can just raise ValueError or whatever if something else is passed, rather than trying to allow for invalid values that’ll never appear). It can also be typed as Literal['a', 'w', 'x', 'c'] (replacing 'a' with 'r' for duplex, of course) to make this abundantly clear.

1 Like

True!.. but there’s one subtlety - my aim was to come up with a good interface not just for dunder methods that open() can call, but for a possible future replacement for open() and pathlib.Path.open() as well:

So with the latest adjustment (see my post immediately above this one) I think we’ve achieved that: something that can be used here and now for the dunder methods, but also in future for publicly callable functions; if the interface for both is the same then the latter can just be trivial passthroughs to the former. :slight_smile:

2 Likes

Any thoughts on whether _reader() should accept a mode argument? Its value would only ever be 'r' in this proposal, but it would be more consistent with _writer() and _duplex(), and would give us the option of adding other read modes in future, if that’s even worth considering

I don’t think it should - the purpose of mode is to determine append, truncation, and creation behavior, and it doesn’t make sense for _reader to do any of these things. It isn’t there to be consistent with open(), it’s there to tell writers how to initialise things. _writer and _duplex are writers; _reader is not.

I also can’t see how including the argument would aid future-proofing. Whether the argument is there or not, the caller cannot know whether the callee supports some newer version of the interface or not; it can only make the call and see what happens. Passing a previously-invalid mode is just as bad as passing an argument that wasn’t previously part of the signature, I think.

(Actually, if you allow use of reflection, then the caller could check what arguments are in the signature before making the call… which means that omitting the argument until it’s needed is better for future-proofing. But I don’t think this is a good way of doing it either, because it breaks cases where the function is defined with **kwargs.)

If you really wanted to be able to call _writer or _duplex in future with a currently-unsupported mode, I’d think you’d have to just go ahead and make the call and let the (ultimate) caller deal with the exception that ought to be raised by the function upon receiving an invalid mode. For _reader, the equivalent would be to just go ahead and make the call and let the TypeError from argument mismatch propagate.

If you insisted (via documentation) that implementors return NotImplemented if passed a valid mode, instead of raising, then I suppose it could make sense to add a mode argument to _reader, but then every implementor has to check that this is r and return NotImplemented otherwise… and that’s kind of fragile, because you’re relying on implementors to do as they’re told; someone will quickly write an implementation that doesn’t bother checking…

Ultimately I think it’d be better not to include the mode argument in _reader, and if some additional feature is desired in future that’d require a different signature, add a new dunder method entirely. I can’t really see any other way to avoid a BC break at that time.

Is there precedent for what should be done to add arguments (or expand the range of allowed values of existing arguments) in the signature of an interface method (whether dunder or not)? Is there precedent for any future-proofing placed in such interfaces (whether or not it was ever used)? None come to my mind…

3 Likes

A probably dumb question: if there will be dunders for open(), shouldn’t there be __close__()? And probably a builtin close() for symmetry? These could be also called by contextlib.closing.

fp = open(obj) # obj.str_reader()
with fp:
    # fp.__enter__()
    ...
    # fp.__exit__() -> fp.close()
3 Likes

Cheers Rob - I was thinking along the same lines :slight_smile:

One more suggestion: should we use “text” rather than “str” in the method names? It might be a slightly better match to pathlib’s read/write_text() (vs read/write_bytes()) methods, and there are other cases where we use “text” like this (subprocess.run(text=...) comes to mind)

That would leave us with:

__open_bytes_reader__(*, buffering=-1) -> BytesReader
__open_bytes_writer__(*, mode, buffering=-1) -> BytesWriter
__open_bytes_duplex__(*, mode, buffering=-1) -> BytesDuplex
__open_text_reader__(*, buffering=-1, encoding=None, errors=None, newline=None) -> StrReader
__open_text_writer__(*, mode, buffering=-1, encoding=None, errors=None, newline=None) -> StrWriter
__open_text_duplex__(*, mode, buffering=-1, encoding=None, errors=None, newline=None) -> StrDuplex
1 Like

For the whole set of functions I’d note in the current I/O stack you can’t do text mode without buffering / the buffering argument must be non-zero (Can either be line buffering or fixed size buffering at the moment).

I’d advocate either explicitly always doing Raw I/O / RawIOBase (direct byte stream access with POSIX read/write semantics “write some” + retry loop necessary) or explicitly always higher level I/O / BufferedIOBase (which generally gives “write all”). From my perspective the sort of exposed buffering at all the layers makes a lot more variants with some very specific behaviors that implementers would need to match. If can keep that out of public API will simplify implementations.

Re: BytesDuplex as a note the current Python I/O code calls this BufferedRandom rather than duplex.

4 Likes

I’d like to ask some general question about the design: Is it reasonable to differentiate “files” with random access from those with sequential access only?

1 Like

In favor of {bytes,str}:

  • (1) It matches the type names bytes and str, and the existing dunder methods __bytes__ and __str__.

In favor of {bytes,text}:

  • (2) It matches pathlib.Path.{read,write}_{bytes,text}.

IMO, pathlib.Path.{read,write}_text should never have been called that and should have had str in the name, not text. But it’d be silly to change that now (or would it?).

IMO, point (1) is more important than point (2), so even with point (2) in mind, I would still use str in the name rather than text, if it were up to me.

However, given that point (2) exists, I wouldn’t be outright offended if the final decision was to use the term text. I’d say I’d be -0 on that spelling.

Either way, I think the terms should match. Either (preferably) __open_str_writer__ and return StrWriter as per my original suggestion, or __open_text_writer__ and return TextWriter. It makes no sense at all to me to have __open_text_writer__ return StrWriter.

Another reason why I consider point (1) to be important: some types are made generic over typing.Union[str, bytes], such as typing.IO: written typing.IO[str] or typing.IO[bytes]. It’s downright weird if you want to refactor between individually-named types and something with a str or bytes typevar, unless you use the terms str and bytes in the individual type names. Note that there is also the pair typing.TextIO, typing.BinaryIO (not BytesIO). In my own code I tend to pair the term str with bytes and the term text with binary, in this same way.

1 Like

Thank you Cody and Rob, I agree with your analyses :slight_smile:

If we disallow buffering=0 then the return types could be annotated like:

__open_binary_reader__(*, buffering=-1) -> io.BufferedIOBase
__open_binary_writer__(*, mode, buffering=-1) -> io.BufferedIOBase
__open_binary_random__(*, mode, buffering=-1) -> io.BufferedIOBase
__open_text_reader__(*, buffering=-1, encoding=None, errors=None, newline=None) -> io.TextIOBase
__open_text_writer__(*, mode, buffering=-1, encoding=None, errors=None, newline=None) -> io.TextIOBase
__open_text_random__(*, mode, buffering=-1, encoding=None, errors=None, newline=None) -> io.TextIOBase

Alternatively we might use typing.BinaryIO and typing.TextIO.

Question for anyone: are the _text_ methods worthwhile? I’m struggling to think of a VFS that provides native support for reading/writing text. Note that our open() function would take care to wrap the result of a _binary_ method in io.TextIOWrapper if text mode is requested.

1 Like

I can imagine data sources (which could theoretically be presented as a VFS) that only provide string data. In particular, something that supports writing text but not binary (because not all binary data is a valid encoding of text) is plausible.

But is it likely to be needed in practice? I doubt it. Is “omit for now, with the option of adding it of a use case comes up” a possible approach?

1 Like

I was wondering similarly… One case for _text_ is if a text document is composed of multiple text documents which should be individually accessible using open(). An example would be a YAML file which contains a number of YAML files/documents inline (which may be encoded themselves / ex. base64), cloud-init type systems have a tendency to do this. Another would be MIME / emails with attachments. In CPython there’s WinConsoleIO (cpython/Modules/_io/winconsoleio.c at main · python/cpython · GitHub) which is the closest I know of, where care is taken to not split utf-8 characters across multiple writes. That said with Windows 10+ that might not be needed…

I’m not sure how much those are worth including as an expansion point in the first version vs. deferring (it’s a lot easier to add later; once something ships need to maintain for time)

1 Like

With the _text_ methods removed and the mode arguments annotated:

__open_binary_reader__(buffering=-1) -> BinaryIO
__open_binary_writer__(mode: Literal['a', 'w', 'x'], buffering=-1) -> BinaryIO
__open_binary_random__(mode: Literal['r', 'w'], buffering=-1) -> BinaryIO

I’ve speculatively allowed positional arguments (removed *, ), because most other dunders seem to allow them.

Hmm. In those cases, would a user doing open(obj, 'r') expect to read the 7 bit-clean ‘carrier’ text, or the decoded text (e.g. roughly b64decode(text.encode(...)).decode() in the former case?).

1 Like

FWIW, I like the Booleans better than the modes. The modes are an anachronism, and force the reader to read documentation. Booleans are self-commenting.

For reference, if we bring back the booleans but exclude the PHP modes we might have:

__open_binary_reader__(buffering=-1) -> BinaryIO
__open_binary_writer__(append=False, must_create=False, buffering=-1) -> BinaryIO
__open_binary_random__(truncate=False, buffering=-1) -> BinaryIO

This leaves only one possible call (_writer__(True, True)) that doesn’t correspond to a mode string, which seems fine to me. If we add support for PHP modes I think this approach gets a bit too complicated for implementers.

2 Likes

Yes, I see your point. I think I’m more interested as a reader of code than as an implementer.

1 Like

From my perspective that is a “what API do you want to provide” / per-library decision. In a general YAML library, it may just return the text and user has to decode. If a library is for a specific well-known format then it may make sense to handle the “embedding” format/details for that. I’m not familiar with enough specialized container formats to have a really good sense (ex. HDF5, Zip, Tar, MIME, container tarballs, kubernetes secrets, …).

Thinking a bit more providing text openers in this case is an optimization rather than necessary. It may save an encode/decode round trip (which in PYTHONUTF8 mode is likely just a memcpy). If it shows up as an implementation nuisance and/or in performance profiles a lot can follow up.

Agreed on readability of the booleans. For me though introducing bool flags/names is a new standard for Python and need to “convert” from open() form to them. Already more legible forms exist such as .read_bytes() / .write_text(). While non-obvious, r/w/a/x are already in implementation and I think reasonably documented.

1 Like

Looking at the latest points discussed:

  • Dropping the text-mode functions: This seems fair enough, because yes, as has been quite rightly pointed out, open() (and similar) can handle converting between text (for the caller) and binary (for the implementor) when text mode is requested.
    • Q: With the text-mode functions removed, is there any reason to keep the word binary in the function names? I guess the text-mode functions probably wouldn’t be re-added in future, and even if they were, there would be no harm in only those functions specifying “text”, because it’s now clear that binary-mode is much more the “normal” use case.
  • Allowing positional arguments: I’m against this - I think it’d be confusing to allow something like __open_writer__('x', 42). I’d suggest we stick with keyword-only unless there’s a very compelling reason to allow them to be passed positionally.
  • Renaming duplex to random: I’m against this too, for multiple reasons:
    • Yes, we know that “random” means “random-access”, but the intuitive meaning is something akin to what you might use import random for. I don’t really think we should be using this term to mean “random-access” in new interfaces.
    • More importantly - “random” (or “random-access”) absolutely does not mean the same thing as “duplex”. “Duplex” means “you can both read from it and write to it”. “Random [access]” means “you can seek in it”. Seekability is a property of the file, so doesn’t need to be expressed in the method name (as I’ve mentioned in a previous post). Being duplex is a property of how you’re interacting with it, not the file itself.
  • mode vs booleans: At this point I firmly believe the single-char str mode is better than the separate boolean args I originally proposed so I’d suggest we keep mode.
  • Return types: We should still have three different return types to represent reader/writer/duplex (just like how in asyncio we have StreamReader and StreamWriter as separate types).
  • PHP modes: Even if we don’t support them initially, it’s probably good to include 'x' and 'c' in the annotations, so that implementors can still choose to support them. If they’re not supported, implementors can raise ValueError for those modes.
  • Q: What’s the difference in practice between using typing.BinaryIO and io.BufferedIOBase here? I see a switch from one to the other but no mention of why.
  • Q: Is there any reason buffering isn’t annotated as int?

My current preference would be for:

__open_reader__(*, buffering=-1) -> Reader
__open_writer__(*, mode: Literal['a', 'w', 'x', 'c'], buffering=-1) -> Writer
__open_duplex__(*, mode: Literal['r', 'w', 'x', 'c'], buffering=-1) -> Duplex
2 Likes