Popen() in very old Python2 script, having issues

moosemaimer · August 23, 2023, 8:50pm

I'm trying to adapt a Python 2.4 script for Python 3. It has a ton of Popen() instances that were working before, and now I’m running into no end of issues. Code extract:

        # Inspect and log an extract of dmesg
        if no_errors_yet:
            log_centered_whitespace(' dmesg ')
            dmesg = Popen(('dmesg',), stdout=PIPE, stderr=STDOUT)
            accumulating = False
            for dmesg_line in dmesg.stdout:
                if 'new high speed USB device' in dmesg_line:
                    dmesg_lines = [dmesg_line,]
                    accumulating = True
                    continue
                if accumulating:
                    dmesg_lines.append(dmesg_line)
            for dmesg_line in dmesg_lines:
                log(dmesg_line)
            log_exit_status(dmesg, 'dmesg')
        
        # Inspect and log current filesystem status
        if no_errors_yet:
            log_centered_whitespace(' df ')
            df = Popen(('df',), stdout=PIPE, stderr=STDOUT)
            for df_line in df.stdout:
                log(df_line)
                find = df_line.find('/offsites')
                if find >= 0 and preexisting_mount_point is None:
                    preexisting_mount_point = df_line[find:].strip()
            if preexisting_mount_point:
                log('Information: Found an already-mounted "offsites" mount point',
                  summary=True)
                log('Information: using preexisting %s' % preexisting_mount_point,
                  summary=True)
            log_exit_status(df, 'df')

Traceback (most recent call last):
  File "./backup_staging3.py", line 211, in <module>
    if 'new high speed USB device' in dmesg_line:
TypeError: a bytes-like object is required, not 'str'

If I try to cast the output as a str like:

for df_line in str(df.stdout):

further down the line I get this as output from a logging function:

230823 14:22                               df
230823 14:22 <
230823 14:22 _
230823 14:22 i
230823 14:22 o
230823 14:22 .
230823 14:22 B
230823 14:22 u
230823 14:22 f
230823 14:22 f
230823 14:22 e
230823 14:22 r
230823 14:22 e
230823 14:22 d
230823 14:22 R
230823 14:22 e
230823 14:22 a
230823 14:22 d
230823 14:22 e
230823 14:22 r
230823 14:22
230823 14:22 n
230823 14:22 a
230823 14:22 m
230823 14:22 e
230823 14:22 =
230823 14:22 6
230823 14:22 >

There is also a function that relies on the return code of the Popen() object, so I’m not sure how to proceed with this.

def log_exit_status(popen_object: Popen, label, summary=False):
        """Record (and return) an exit status in this run's logs."""
        exit_status = popen_object.wait()
        if exit_status:
            log_centered_whitespace(' %s exit status: %s ' % (label, exit_status), summary=True)
        else:
            log_centered_whitespace((' %s: ok ' % label), summary=summary)
        return exit_status

Rosuav · August 23, 2023, 9:10pm

Fortunately, you’re in luck! The Popen constructor can be asked to return strings rather than bytes; if you do that, the iteration will behave the same way it did in Python 2 (yielding individual lines, complete with their line endings). Change the Popen call to this:

df = Popen(('df',), stdout=PIPE, stderr=STDOUT, encoding="utf-8")

and the loop should behave as it did before. This assumes that df is going to return UTF-8, which is probably safe (df is almost certainly going to give you pure ASCII), but if you have other programs being called, you may need to check what they do.

Ultimately, you’ll have to decide on a case-by-case basis whether to specify an encoding on the Popen call (and thus convert everything to text as early as possible), or to read bytes and then decode to text later on, or to work entirely in the bytes domain (which would be appropriate for binary or mixed text/binary protocols).
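As a rough sketch of those three choices side by side (not taken from the original script): the child process here is a small Python one-liner standing in for df so the example runs anywhere, and the function names are made up for illustration.

```python
import sys
from subprocess import Popen, PIPE, STDOUT

# Hypothetical stand-in for `df`: a child process that prints two lines.
CMD = (sys.executable, "-c",
       "print('Filesystem'); print('/dev/sdb1 /offsites')")

def read_as_text():
    """Option 1: pass encoding= so every line arrives as str."""
    with Popen(CMD, stdout=PIPE, stderr=STDOUT, encoding="utf-8") as proc:
        return list(proc.stdout)

def read_then_decode():
    """Option 2: read bytes, then decode each line explicitly."""
    with Popen(CMD, stdout=PIPE, stderr=STDOUT) as proc:
        return [raw.decode("utf-8") for raw in proc.stdout]

def read_as_bytes():
    """Option 3: stay in bytes; note the b'' literal in the test."""
    with Popen(CMD, stdout=PIPE, stderr=STDOUT) as proc:
        return [raw for raw in proc.stdout if b"/offsites" in raw]
```

The first two give the same text on a typical Linux box; the third never produces str at all, which only makes sense if nothing downstream expects text.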

kknechtel · August 24, 2023, 1:27am

Impressive. The last (source-only, security fix only) release on the 2.4 branch was in 2008. But last I checked, the banks are still reliant on COBOL, so I guess nothing really surprises me.

It would be helpful to say what the issues are.

Wrapping the stream in str(), as in str(df.stdout), wouldn’t work in any version of Python. str converts things to human-readable strings for display; it doesn’t do magic related to their type - in particular, it doesn’t read data from a file (such as df.stdout). What you are seeing is the string representation of that file object itself, split apart one character per line - because you’re iterating over the string.
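A small sketch of that effect, using an in-memory io.BufferedReader as a stand-in for df.stdout (an assumption made so the example needs no child process):

```python
import io

# Stand-in for df.stdout: a binary buffered reader over two lines of data.
stream = io.BufferedReader(io.BytesIO(b"line one\nline two\n"))

# str() describes the object; it does not read anything from it.
as_text = str(stream)

# Iterating over that string yields single characters, which is what
# ended up in the log, one per line.
first_chars = list(as_text)[:4]

# Iterating over the stream itself yields the actual lines (as bytes).
real_lines = list(stream)
```

Here as_text starts with "<_io.BufferedReader", matching the characters spelled out in the log output above.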

But the actually interesting part here is what happens without the attempt to “cast” this. You should have said explicitly what happens, but I think I can guess: df.stdout will be a binary-mode stream, so iterating over it gives you bytes objects which are not strings (causing problems later when you try to write them to a text file, or concatenate them with strings, or various other things).

In Python 3, you are required to be explicit about string encoding, one way or another, and raw chunks of bytes cannot simply be treated like text - because they are not text, no matter what your years (or decades) of experience with ASCII and “code pages” might tell you. Aside from that, streams opened in text mode produce strings, while streams opened in binary mode produce byte-sequences - the difference is no longer just a matter of special Windows-specific newline handling.
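To make the bytes-versus-text distinction concrete, here is a minimal sketch with a made-up dmesg line (the sample content is an assumption, not real output):

```python
# A hypothetical line as it would come from a binary-mode stream.
raw_line = b"usb 1-1: new high speed USB device number 2\n"

# Mixing str and bytes in a containment test raises TypeError.
try:
    "new high speed USB device" in raw_line
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Either keep both sides as bytes...
found_in_bytes = b"new high speed USB device" in raw_line

# ...or decode explicitly and keep both sides as text.
found_in_text = "new high speed USB device" in raw_line.decode("utf-8")
```

The message captured in error_message is the same complaint shown in the traceback from the original post.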

There are many ways around the issue, but the direct and intended way is to pass text=True to the Popen constructor. This will tell subprocess to open the stream in text mode, using the system’s default text encoding to interpret the bytes as text. You can override this encoding by passing an appropriate value for the encoding keyword parameter (it should be a string that names an encoding, such as 'utf-8').
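A minimal sketch of text=True, again with a Python one-liner standing in for the real command (an assumption for portability). Note that text= needs Python 3.7 or later.

```python
import sys
from subprocess import Popen, PIPE, STDOUT

# Hypothetical stand-in for the real command.
CMD = (sys.executable, "-c", "print('hello')")

# text=True: the stream is opened in text mode with the default encoding.
with Popen(CMD, stdout=PIPE, stderr=STDOUT, text=True) as proc:
    default_encoding = proc.stdout.encoding  # whatever the system picked
    default_lines = list(proc.stdout)

# encoding=...: also text mode, but with the encoding spelled out.
with Popen(CMD, stdout=PIPE, stderr=STDOUT, encoding="utf-8") as proc:
    utf8_lines = list(proc.stdout)
```

Inspecting default_encoding is a quick way to see which encoding the system default resolves to on a given machine.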

Aside from that, you may find that instead of using the Popen class directly, it is easier to use other wrappers provided by the modern subprocess library, particularly subprocess.run. See the subprocess module documentation.
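For comparison, a sketch of how the df section might look with subprocess.run. The child command and its fake output are stand-ins, and mount_point is a simplified version of the script's preexisting_mount_point logic.

```python
import subprocess
import sys

# Hypothetical stand-in for `df`, printing one header and one data line.
CMD = [sys.executable, "-c",
       "print('Filesystem'); print('/dev/sdb1 9 1 8 10% /offsites')"]

# run() waits for the command and collects all of its output up front.
result = subprocess.run(CMD, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, encoding="utf-8")

mount_point = None
for line in result.stdout.splitlines():
    find = line.find('/offsites')
    if find >= 0 and mount_point is None:
        mount_point = line[find:].strip()

# The exit status is available without a separate wait() call.
exit_status = result.returncode
```

One trade-off: run() returns only after the command has finished, so it does not suit a loop that needs to react to output while the child is still running.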

As for the function that relies on the return code: what actually is the issue there? The .wait method works the same as it always has. If something is going wrong, please be explicit about what happens, what is supposed to happen instead, and how that differs.
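A quick sketch showing that wait() still returns the exit status when encoding= is passed. The helper exit_status_of and the child command are hypothetical, written only for this illustration.

```python
import sys
from subprocess import Popen, PIPE, STDOUT

def exit_status_of(code):
    """Run a child that prints a line and exits with `code`."""
    cmd = (sys.executable, "-c",
           "import sys; print('working'); sys.exit(%d)" % code)
    proc = Popen(cmd, stdout=PIPE, stderr=STDOUT, encoding="utf-8")
    for _line in proc.stdout:   # drain the output, as the script does
        pass
    status = proc.wait()        # same call that log_exit_status makes
    proc.stdout.close()
    return status
```

So a function like log_exit_status should not need changes just because the streams switched from bytes to text.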

You may find, however, that you want to upgrade the string formatting to use newer tools - they are much more pleasant. I wrote a comprehensive guide to string formatting in Python here:

Rosuav · August 24, 2023, 1:33am

Percent formatting is fine. There is no reason to change it, and plenty of reason to retain it. Percent formatting is superior to long-hand formatting in some ways, but most importantly, it is better to keep things unchanged as much as possible during a translation - it saves a lot of risk, hassle, and review effort.

Please don’t push people to make unnecessary changes. Percent formatting is not deprecated.

moosemaimer · August 24, 2023, 1:43pm

df = Popen(('df',), stdout=PIPE, stderr=STDOUT, encoding="utf-8")

This is working AFAICT, script runs to completion now and produces the expected result. Still some weirdness but that’s present on the existing machine TBH. Now I just have to try and un-spaghetti this thing so it makes a little more sense, but that’s for the future.

The author of this script (and some others we use) passed away a number of years ago, so a couple of other folks have made small edits and that’s all the work that’s been done on it since. It tries to mount an external drive and copy a number of folders onto it, so that makes it difficult to test when you don’t have live hardware to work with.

I don’t have much experience period with subprocesses yet, so the I/O part of that was kind of a black box to me.