Error handling pattern

abessman · June 14, 2024, 1:42pm

I’m writing an application where I would like to catch all errors and show the user a short summary of what went wrong, while also logging the full exception to a file. Is there a common, established pattern for this? I have not been able to find one.

What I would like to achieve:

Prevent scary-looking errors from reaching the user
Log those same errors to a file
Exit with 0 if there are no errors, 1 (or possibly a more specific error code) otherwise

This is what I’m currently doing:

import logging
import sys
from io import StringIO

LOGGER = logging.getLogger()

def main():
    debug_stream = setup_logging()

    try:
        step_1()
        step_2()
        return 0
    except Exception:
        LOGGER.debug("", exc_info=True)
        write_debug_log(debug_stream)
        return 1
    except KeyboardInterrupt:
        return 1

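def setup_logging():
    # The root logger defaults to WARNING; lower it so DEBUG records reach the handlers.
    LOGGER.setLevel(logging.DEBUG)
    error_handler = logging.StreamHandler(sys.stderr)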
    error_handler.setLevel(logging.ERROR)
    LOGGER.addHandler(error_handler)
    debug_stream = StringIO()
    debug_handler = logging.StreamHandler(debug_stream)
    debug_handler.setLevel(logging.DEBUG)
    LOGGER.addHandler(debug_handler)
    # INFO and WARNING handlers omitted for brevity.
    return debug_stream

def step_1():
    try:
        ...
    except Exception:
        LOGGER.error("Error: Step 1 failed")
        raise

def step_2():
    try:
        ...
    except Exception:
        LOGGER.error("Error: Step 2 failed")
        raise

def write_debug_log(debug_stream):
    try:
        with open("debug.log", "w") as fout:
            fout.write(debug_stream.getvalue())
    except Exception:
        LOGGER.error("Error: Failed to write debug log")
        LOGGER.error("Please re-run app with --debug and save the output manually")
    else:
        LOGGER.error("Traceback written to debug.log")
    finally:
        LOGGER.error("Please report this bug and include the debug log")

sys.exit(main())

In reality the LOGGER.error("Error: Step 1 failed") messages are more meaningful.

I’m looking for general feedback on this approach, as well as specific feedback on the following points:

Is there a better way to log the exception at DEBUG level than calling logging.debug with an empty message?
Is Exception the right thing to catch? Too broad, or too narrow?
Is catching KeyboardInterrupt silently good practice? I would prefer to avoid showing the user a Traceback (most recent call last): File "<stdin>", line 3, in <module> KeyboardInterrupt-type message.
If so, is 1 an appropriate exit code when catching KeyboardInterrupt?

JamesParrott · June 14, 2024, 2:26pm

If the logging works, then it works.

Overall, I think this approach makes things much trickier to test. Using try / except Exception to catch everything is a classic way to create a debugging nightmare for yourself. Things can be challenging enough, even when only catching ImportErrors. EAFP is great - I just advise putting as little code in the try: as possible, and catching specific classes of Exception. So when something goes wrong, it’s obvious what it was.

If an app suppresses errors (i.e. purposefully hides useful information from the user, that they could otherwise use to try to fix what they’ve done wrong themselves) then the onus is on the developer to think of everything possible that can go wrong in every possible situation, and provide some other sort of constructive feedback to the user for each. That’s not impossible, but it’s akin to assuming your code is bug free, is far more work, confuses Python users, and is un-Pythonic IMHO.
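
A minimal sketch of the "small try, specific exception" advice above. The function name `parse_port` and its default value are hypothetical, chosen only for illustration. Only the call expected to fail sits inside the `try`, so an unrelated failure (here, an out-of-range port) is not swallowed by the handler.

```python
def parse_port(text, default=8080):
    """Parse a port number, falling back to a default for non-numeric input."""
    try:
        # Only the conversion that can raise the expected error is guarded.
        port = int(text)
    except ValueError:
        return default
    # This check is outside the try, so its ValueError reaches the caller.
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


print(parse_port("443"))
print(parse_port("not-a-number"))
```

Because the range check lives outside the `try`, a bad value such as "70000" still raises instead of silently becoming the default.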

franklinvp · June 14, 2024, 3:18pm

It looks like 130 is the exit code used by convention in Linux. See here

abessman · June 14, 2024, 3:47pm

I’m aware of this design principle, and usually adhere to it. But in this case, every text on CLI design I’ve read agrees: Throwing tracebacks at the user when they haven’t explicitly asked for it is Bad Form. Thus, I can see no alternative to wrapping most of my business logic in one big try/except, even knowing the caveats of that approach.

This is my goal. I don’t think this is necessarily akin to thinking one’s code is bug free; rather, it recognizes that if an exception occurs at certain points in the program, that indicates the presence of a bug. I deal with this by asking the user to submit a bug report.

Could it be that pythonic design is at odds with CLI design?

barry-scott · June 14, 2024, 4:05pm

If you have a log with full details then the user can share that log with the developer. I use this pattern myself in production systems and it’s great for maintaining them.

For a CLI you can print a message telling the user that something unexpected went wrong and where to find the log file to attach to a bug report.
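
One way this pattern might look, as a rough sketch rather than anyone's production code. The logger name `app_sketch`, the helper names `build_logger` and `run`, and the default log path are all assumptions made for the example. The full traceback is logged at DEBUG (file only), while a short message is logged at ERROR (stderr and file).

```python
import logging
import sys


def build_logger(log_path):
    """Return a logger that sends everything to a file and only errors to stderr."""
    logger = logging.getLogger("app_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)  # let DEBUG records reach the handlers
    logger.propagate = False

    file_handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    logger.addHandler(file_handler)

    console_handler = logging.StreamHandler(sys.stderr)
    console_handler.setLevel(logging.ERROR)
    logger.addHandler(console_handler)
    return logger


def run(task, log_path="debug.log"):
    """Run task(); return 0 on success, 1 after logging any Exception."""
    logger = build_logger(log_path)
    try:
        task()
    except Exception:
        # Full traceback at DEBUG level: written to the log file only.
        logger.debug("Unhandled exception", exc_info=True)
        # Short summary at ERROR level: shown on stderr and written to the file.
        logger.error(
            "Something unexpected went wrong. Please attach %s to a bug report.",
            log_path,
        )
        return 1
    finally:
        # Detach and close the handlers so the file is flushed.
        for handler in list(logger.handlers):
            logger.removeHandler(handler)
            handler.close()
    return 0
```

A caller would typically end with something like `sys.exit(run(main_task))`, where `main_task` stands in for the application's real entry point.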

JamesParrott · June 14, 2024, 4:55pm

Good point. Logging is seldom a bad idea - I was referring more to masking everything with try / except.

It’s possible to have an internal application to work on and debug, and just use this pattern to mask it, to improve UX and reduce low value bug reports.

barry-scott · June 14, 2024, 4:59pm

How else can you make sure every error is logged?

JamesParrott · June 14, 2024, 5:03pm

Capture stdout and stderr or monkey patch BaseException, but I take your point.
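
Another option, not raised in the thread itself, is the standard library's `sys.excepthook`, which the interpreter calls for any exception that reaches the top level. The sketch below is illustrative only; the helper name `install_logging_excepthook` is an assumption, and the logger is expected to be configured elsewhere.

```python
import logging
import sys


def install_logging_excepthook(logger):
    """Route uncaught exceptions to the given logger instead of a raw traceback."""

    def hook(exc_type, exc_value, exc_tb):
        if issubclass(exc_type, KeyboardInterrupt):
            # Leave Ctrl-C to the interpreter's default hook.
            sys.__excepthook__(exc_type, exc_value, exc_tb)
            return
        # Short message for the user, full traceback for the debug log.
        logger.error("Unexpected error: %s", exc_value)
        logger.debug("Uncaught exception", exc_info=(exc_type, exc_value, exc_tb))

    sys.excepthook = hook
    return hook
```

After the hook runs for an uncaught exception, the interpreter still exits with a nonzero status, so the exit-code goal is kept without a top-level try / except.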

cameron · June 15, 2024, 2:49am

No, it is not.

That table is the return codes presented in the response to a wait*()
call.

Regular programme success is 0.
Regular program failure is nonzero, often 1. I use 2 for usage errors
i.e. bad CLI options etc. A few programmes have a variety of values for
specific failure situations.

128 upward encode programme termination due to a signal. 130 is signal
2 i.e. SIGINT, which is usually caused by someone typing Ctrl-C.
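
The arithmetic behind 130 can be checked with a small sketch. The constant and function names here are hypothetical; only `signal.SIGINT` comes from the standard library.

```python
import signal

SIGNAL_BASE = 128  # offset used when a status reports "terminated by a signal"


def status_for_signal(signum):
    """Return the 128 + signal number value described above."""
    return SIGNAL_BASE + int(signum)


print(status_for_signal(signal.SIGINT))
```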

franklinvp · June 15, 2024, 2:37pm

See their last question.

cameron · June 15, 2024, 9:35pm

If they’ve caught it then 130 is wrong. ~~(And also not doable.)~~

It looks like that second sentence is wrong. You can return a number >= 128 from a UNIX process. Possibly this postdates when I first dug into the UNIX wait() system call; I’m sure this 128+signum was wired directly into things at the OS level then, so you only got a 7 bit value from a process exit.

These days we get an 8 bit value from the exit status and test for a signal with the WIFSIGNALED(status) macro.

py3 -c 'import sys; sys.exit(0)'; echo $?
0
py3 -c 'import sys; sys.exit(1)'; echo $?
1
py3 -c 'import sys; sys.exit(130)'; echo $?
130
py3 -c 'import sys; sys.exit(255)'; echo $?
255
py3 -c 'import sys; sys.exit(256)'; echo $?
0
py3 -c 'import sys; sys.exit(257)'; echo $?
1
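
The same distinction can be observed from Python, as a rough POSIX-only sketch. The helper name `run_child` is an assumption. `subprocess` reports a normal exit as a value from 0 to 255 and a death by signal N as -N, so the two cases do not collide the way they do in the shell's `$?`.

```python
import signal
import subprocess
import sys


def run_child(code):
    """Run a snippet in a child interpreter and return subprocess's returncode."""
    return subprocess.run([sys.executable, "-c", code]).returncode


# Child exits normally with status 130.
exited = run_child("import sys; sys.exit(130)")
# Child is terminated by SIGTERM (Python installs no handler for it by default).
killed = run_child("import os, signal; os.kill(os.getpid(), signal.SIGTERM)")
# Exit statuses are truncated to 8 bits, as in the shell session above.
wrapped = run_child("import sys; sys.exit(257)")

print(exited, killed, wrapped)
```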

I remain of the opinion that 130 is not the typical chosen exit code for catching an interrupt and exiting. I still use 1 for that, absent some weird requirement.

cameron · June 15, 2024, 10:12pm

Bah! Nay, not so. This 128 thing is a shell level thing. A gander at the V7 shell source shows it getting an 8 bit exit value from the wait() status number from the OS. The 128 stuff is some munging of that if there was a signal. I learnt this stuff on V7 UNIX, so this conflation of the process exit status with the shell exit code must have happened in my head then.

Still strongly against returning 130 directly though.

For those who care, the V7 shell goes:

    INT             rc=0, wx=0;
    INT             w;

then:

    p=wait(&w);

to fetch the exit status (into an INT, a 16-bit word then). Then computes rc, the shell level return code, thus:

            w_hi = (w>>8)&LOBYTE;

            IF sig = w&0177
            THEN    IF sig == 0177  /* ptrace! return */
                    THEN    prs("ptrace: ");
                            sig = w_hi;
                    FI
                    IF sysmsg[sig]
                    THEN    IF i!=p ORF (flags&prompt)==0 THEN prp(); prn(p); blank() FI
                            prs(sysmsg[sig]);
                            IF w&0200 THEN prs(coredump) FI
                    FI
                    newline();
            FI

            IF rc==0
            THEN    rc = (sig ? sig|SIGFLG : w_hi);
            FI
            wx |= w;
    OD

    IF wx ANDF flags&errflg
    THEN    exitsh(rc);
    FI
    exitval=rc; exitset();

I’ve omitted some surrounding logic.
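
For readers who find the macro-heavy C hard to follow, here is a loose Python paraphrase of how `rc` is derived from the wait status `w`. It is a sketch, not a translation of the whole loop: the values `LOBYTE = 0o377` and `SIGFLG = 0o200` are assumptions about the shell's headers, and the function name is hypothetical.

```python
LOBYTE = 0o377  # assumed value: mask for one byte
SIGFLG = 0o200  # assumed value: 128, the bit OR-ed onto a signal number


def shell_return_code(w):
    """Derive the shell-level return code from a raw wait() status word."""
    w_hi = (w >> 8) & LOBYTE  # high byte: the process's own exit value
    sig = w & 0o177           # low 7 bits: terminating signal, if any
    if sig == 0o177:          # ptrace stop: the signal number is in the high byte
        sig = w_hi
    return (sig | SIGFLG) if sig else w_hi


print(shell_return_code(1 << 8))  # process called exit(1)
print(shell_return_code(2))       # process killed by signal 2
```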

barry-scott · June 15, 2024, 11:43pm

As trivia, the above is originally from AT&T V6 UNIX, and it’s C code.
There are a set of #define statements that make the IF THEN FI expand to valid C code.