Unbuffered stdin, bug in docs or implementation?

facundo · October 26, 2022, 11:52am

Hello!

The documentation for the -u command line option says: “Force the stdout and stderr streams to be unbuffered. This option has no effect on the stdin stream.”

However, I’m now inspecting the code in main and I find that it calls setbuf/setvbuf also on stdin.

So, is it a bug in the docs or implementation? Or am I misunderstanding something?

Thanks!

. Facundo

guido · October 26, 2022, 3:29pm

It looks like a docs bug. I don’t know when this was introduced (too much was changed) but it’s been like that for over 4 years, probably much longer.

njs · October 26, 2022, 9:48pm

Do those methods affect reading, or just writing? “Buffered input” mostly doesn’t make sense in the first place, so my guess is that it’s always unbuffered whether you call those or not.

guido · October 26, 2022, 11:17pm

I think there’s a way that buffering affects input – if process A writes to a pipe and two processes B and C are reading from it, first B, then C, if B buffers more data than it actually consumes, some data intended for C will be left unprocessed in B’s buffer. E.g. here we might expect the second Python process to print the second line of a file:

cat file | (python -c 'import sys; sys.stdin.readline()'; python -c 'import sys; print(sys.stdin.readline())')

If the first Python process buffers more than one line, the second process will not see that line.
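The effect can be reproduced inside a single process. The following is a minimal sketch, not code from CPython: the helper name leftover_after_readline is made up for illustration, and a direct os.read() on the same file descriptor stands in for the second process.

```python
import os


def leftover_after_readline(buffering):
    """Read one line from a pipe holding two lines and report what is left.

    `buffering` is passed straight to open(): -1 selects the default
    buffered reader, 0 selects a raw, unbuffered FileIO.
    """
    r, w = os.pipe()
    os.write(w, b"first\nsecond\n")
    os.close(w)  # the reader sees EOF once the pipe is drained

    reader = open(r, "rb", buffering=buffering, closefd=False)
    line = reader.readline()

    # Stand-in for the second process: read directly from the descriptor,
    # bypassing whatever `reader` holds in its own buffer.
    rest = os.read(r, 100)
    os.close(r)
    return line, rest


buffered_result = leftover_after_readline(-1)
unbuffered_result = leftover_after_readline(0)
print(buffered_result)
print(unbuffered_result)
```

With the default buffered reader, the second line ends up in the reader's private buffer and nothing is left in the pipe. With buffering=0, the raw readline() fetches one byte at a time, so the second line is still there for the next reader.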

methane · October 26, 2022, 11:53pm

setbuf/setvbuf works for C FILE*. But sys.stdin, sys.stdout, sys.stderr in Python don’t use C FILE*.
So it doesn’t affect Python’s stdin. Updating the doc would confuse Python users.

guido · October 27, 2022, 12:22am

You’re right. But then I’m totally confused.

In initconfig.c there are several calls to setvbuf()/setbuf():

    if (!config->buffered_stdio) {
#ifdef HAVE_SETVBUF
        setvbuf(stdin,  (char *)NULL, _IONBF, BUFSIZ);
        setvbuf(stdout, (char *)NULL, _IONBF, BUFSIZ);
        setvbuf(stderr, (char *)NULL, _IONBF, BUFSIZ);
#else /* !HAVE_SETVBUF */
        setbuf(stdin,  (char *)NULL);
        setbuf(stdout, (char *)NULL);
        setbuf(stderr, (char *)NULL);
#endif /* !HAVE_SETVBUF */
    }

I’m assuming that’s the code Facundo saw. Do those calls have any effect at all? Are they just there for C extensions using <stdio.h>?

It does appear that Python’s sys.std{in,out,err} are initialized in pylifecycle.c according to the docs.

facundo · October 27, 2022, 1:28am

That is exactly what I saw, after following the breadcrumbs of the -u command line flag…

methane · October 27, 2022, 5:54am

Python core uses C stdio in several cases. For example, reading Python script from stdin (e.g. cat script.py | python -u), printing exceptions that can not be handled, etc.

So I am not sure that we can just remove them without breaking anything.

malemburg · October 27, 2022, 7:56am

This is one of the problems with stdin buffering. The other is related to consuming a specified number of input bytes on a non-seekable input.

This (older) page explains things in more detail: stdio buffering

Note: Some of the codecs use such input buffers as well to store read data which could not yet be processed (e.g. say you read 2 bytes of a 3 byte UTF-8 encoded code point).

malemburg · October 27, 2022, 8:51am

It is not really clear to me, whether the unbuffered setting in initconfig.c has an effect on the underlying file descriptors. The code in pylifecycle.c works directly on file descriptors and always opens stdin in buffered mode:

    /* stdin is always opened in buffered mode, first because it shouldn't
       make a difference in common use cases, second because TextIOWrapper
       depends on the presence of a read1() method which only exists on
       buffered streams.
    */
    if (!buffered_stdio && write_mode)
        buffering = 0;
    else
        buffering = -1;
    if (write_mode)
        mode = "wb";
    else
        mode = "rb";
    buf = _PyObject_CallMethod(io, &_Py_ID(open), "isiOOOO",
                               fd, mode, buffering,
                               Py_None, Py_None, /* encoding, errors */
                               Py_None, Py_False); /* newline, closefd */
    if (buf == NULL)
        goto error;

Here’s what the man page on stdin has to say: “…FILEs are a buffering wrapper around UNIX file descriptors… mixing use of FILEs and raw file descriptors can produce unexpected results and should generally be avoided. (For the masochistic among you: POSIX.1, section 8.2.3, describes in detail how this interaction is supposed to work.) A general rule is that file descriptors are handled in the kernel, while stdio is just a library.”

Unfortunately, the POSIX section doesn’t go into details on buffering. Since the POSIX functions for file descriptors don’t support buffering, my guess is that the buffering code in initconfig.c thus only affects C lib stdio usage and not the file descriptor based code used by the io module for sys.stdin.

So it seems that when using python3 -u you get an unbuffered C lib stdio stdin stream, but a buffered Python sys.stdin.
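One way to check this from the outside is to ask a child interpreter which classes sit under its standard streams. This is a rough sketch, assuming the streams are pipes (as they are under subprocess); the helper name binary_layers and the CHILD snippet are made up for illustration.

```python
import subprocess
import sys

# Child program: report the type of the binary layer under each stream.
CHILD = (
    "import sys; "
    "print(type(sys.stdin.buffer).__name__, type(sys.stdout.buffer).__name__)"
)


def binary_layers(*flags):
    """Run a child interpreter with `flags` and return the two type names."""
    result = subprocess.run(
        [sys.executable, *flags, "-c", CHILD],
        input="",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


default_layers = binary_layers()
unbuffered_layers = binary_layers("-u")
print("default:", default_layers)
print("with -u:", unbuffered_layers)
```

With -u, the layer under sys.stdout should no longer be a BufferedWriter, while the layer under sys.stdin should still be a BufferedReader. This only looks at the Python-level objects and says nothing about the C FILE* streams.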

guido · October 27, 2022, 4:56pm

How could it? Buffering is not a property of file descriptors – it is implemented independently by C <stdio.h>'s FILE object, and by Python’s buffered IO classes (inheriting from io.BufferedIOBase).

Given what I’ve learned in this thread, I definitely think a doc update would be good – unless we want to keep all these details implementation secrets, which doesn’t seem a good idea, as they clearly affect users (in some cases).

guido · October 27, 2022, 5:00pm

This does feel somewhat unfortunate. But I guess the saving grace is that the main use for the C-level stdin is to read a script, which Python always reads until EOF. So perhaps it doesn’t matter? Maybe Facundo was just misled by the code (as was I when I tried to reproduce his investigations).

malemburg · October 27, 2022, 5:42pm

Yep, I learned that while researching the rest of the post

gpshead · October 27, 2022, 6:15pm

Yeah, that this is only coming up now, when this code has been in place for likely decade(s)… suggests that. It’d impact an extension module that used C’s FILE* stdin… but really what code would ever do that? And anything that tries without a complete read would then see problems with data “disappearing” from the perspective of Python’s stdin as some was slurped into the C buffer so they’d eventually learn not to.

So +1 to just documenting this state of affairs and no need to change the behavior without a compelling use which has so far never surfaced.

malemburg · October 27, 2022, 7:27pm

Well, stdin is not only used for piping in data, but also in interactive sessions in a text console, where you may want to have unbuffered reads (e.g. to control a cursor or player in a game via keyboard input). I guess those will have to use other methods of getting more direct access to keyboard input with Python 3, as it doesn’t seem possible to make sys.stdin unbuffered.

Anyway, +1 on updating the documentation and just mentioning the status quo.

vstinner · October 31, 2022, 12:19pm

Having a really unbuffered reader in the io module would be “nice”, but the use cases for that are really corner cases and so far, nobody strongly required it so nobody implemented it.

I agree that it’s better to just document the exact behavior, explain the difference between C stdio streams (FILE *stdin) and Python stream objects (sys.stdin).

On Windows, when Python is run in a console, sys.stdin (sys.stdin.buffer.raw) is usually a WindowsConsoleIO which is a different beast.

The relationship between sys.stdin, sys.stdin.buffer and sys.stdin.buffer.raw is not obvious either.

The reality is complicated
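For the non-console case, the three layers can be rebuilt by hand on top of a pipe. This is only an illustrative sketch: the variable names are made up, and the real sys.stdin is created by the interpreter at startup with its own settings (encoding, error handler, and so on).

```python
import io
import os

r, w = os.pipe()
os.write(w, "héllo\n".encode("utf-8"))
os.close(w)

# Each object wraps the one before it, bottom to top.
raw = io.FileIO(r, "rb")                             # role of sys.stdin.buffer.raw
buffered = io.BufferedReader(raw)                    # role of sys.stdin.buffer
text = io.TextIOWrapper(buffered, encoding="utf-8")  # role of sys.stdin

same_buffer = text.buffer is buffered
same_raw = buffered.raw is raw

line = text.readline()  # bytes flow raw -> buffered -> text, decoded on the way
text.close()            # closing the top layer closes the layers beneath it
```

The raw layer talks to the file descriptor, the buffered layer holds bytes read ahead, and the text layer decodes them to str.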

vstinner · October 31, 2022, 12:23pm

Right, handling pipes (another_program | python and python < input_file) is complicated, and I think that the main motivation to always buffer stdin is mostly performance. Reading stdin with the read(1) syscall (one byte at a time) would just be too inefficient. For a pipe, it’s not possible to “ungetch” a character or a sub-string (“seek backward”). Usually, the -u command line option is used to get “partial writes” (without a newline) into stdout and stderr (to see progress/messages “immediately” when stdout/stderr are redirected), rather than to get unbuffered stdin.

storchaka · October 31, 2022, 2:29pm

What if we made -u only turn the output unbuffered, and -uu also turn the input (both C and Python) unbuffered?

malemburg · October 31, 2022, 3:43pm

As I understand the comments in the implementation (pylifecycle.c), the io stack does not support unbuffered streams with TextIOWrapper, so we don’t have the option to make sys.stdin unbuffered.

BTW: Is it still possible to reopen sys.stdin in Python 3 using different io wrappers, e.g. a binary one?
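One approach that appears to work is to open the same file descriptor a second time with open(fd, "rb", buffering=0, closefd=False). Below is a minimal sketch on a pipe; the helper name reopen_unbuffered is hypothetical, and in a real script the descriptor would come from sys.stdin.fileno().

```python
import os


def reopen_unbuffered(fd):
    """Return a raw binary reader on `fd` without taking ownership of it."""
    return open(fd, "rb", buffering=0, closefd=False)


# Demonstration on a pipe instead of the real standard input.
r, w = os.pipe()
os.write(w, b"abcdef")
os.close(w)

raw_stdin = reopen_unbuffered(r)
head = raw_stdin.read(3)  # asks the OS for at most 3 bytes
tail = os.read(r, 100)    # the remaining bytes are still in the pipe
os.close(r)
```

One caveat: anything the original sys.stdin has already read ahead into its own buffer would not be visible through the new object, so the two should not be mixed.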

eryksun · October 31, 2022, 5:28pm

The following comment from create_stdio() in “Python/pylifecycle.c” is mistaken about read1():

    /* stdin is always opened in buffered mode, first because it shouldn't
       make a difference in common use cases, second because TextIOWrapper
       depends on the presence of a read1() method which only exists on
       buffered streams.
    */

TextIOWrapper prefers to read a chunk of bytes via read1() if the wrapped file object has it, but it will otherwise call read(). A raw file doesn’t have read1(). It would be silly if it did.
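A small sketch of that fallback, wrapping a raw FileIO directly with no buffered layer in between. The pipe and the variable names are only for illustration.

```python
import io
import os

r, w = os.pipe()
os.write(w, b"first\nsecond\n")
os.close(w)

raw = io.FileIO(r, "rb", closefd=False)
has_read1 = hasattr(raw, "read1")  # raw files provide read(), not read1()

# No BufferedReader in between: TextIOWrapper has to call raw.read().
text = io.TextIOWrapper(raw, encoding="utf-8")
line = text.readline()

# Check what is still in the pipe after reading a single line.
leftover = os.read(r, 100)
os.close(r)
```

Here readline() works without read1(). Note that the pipe is empty afterwards: TextIOWrapper reads and decodes a whole chunk at a time, so dropping the buffered layer does not by itself stop the text layer from reading ahead.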
