Truncating SyntaxError

Nineteendo · January 1, 2025, 4:02pm

In my own JSON library jsonyx, I use a subclass of SyntaxError to report errors, because it gives the user more context (the file name and the offending line):

json.JSONDecodeError: Expecting value: line 1 column 2 (char 1)

versus

  File "<stdin>", line 1, column 2
    [,]
     ^
jsonyx.JSONSyntaxError: Expecting value

This proposal isn’t about changing this behaviour in the json library ^[1], it’s about providing the tools to implement this in a third-party library. Let me explain.

Unlike Python files, lines in a JSON file can get very long, making the error hard to read:

  File '<stdin>', line 1
    {"glossary": {"title": "example glossary", "GlossDiv": {"title": "S", "G
lossList": {"GlossEntry": {"ID": "SGML", "SortAs": "SGML", "GlossTerm": "Sta
ndard Generalized Markup Language", "Acronym": "SGML", "Abbrev": "ISO 8879:1
986", "GlossDef": {"para": "A meta-markup language, used to create markup la
nguages such as DocBook.", "GlossSeeAlso": ["GML", "XML"]}, "GlossSee"}}}}}
                                                                            
                                                                            
                                                                            
                                                                            
                                                                      ^
jsonyx.JSONSyntaxError: Expecting ':' delimiter

So, I truncated the line and adjusted the offset (which is far from trivial):

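from shutil import get_terminal_size

# _match_whitespace and _match_line_end are not shown in this post; from their
# usage they are assumed to be compiled-regex match functions taking (string, pos).
def _get_err_context(doc: str, start: int, end: int) -> tuple[int, str, int]: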
    line_start: int = max(
        doc.rfind("\n", 0, start), doc.rfind("\r", 0, start),
    ) + 1
    if match := _match_whitespace(doc, line_start):
        line_start = min(match.end(), start)

    if match := _match_line_end(doc, start):
        line_end: int = match.end()
    else:
        line_end = start

    end = min(line_end, end)
    if match := _match_whitespace(doc[::-1], len(doc) - line_end):
        line_end = max(end, len(doc) - match.end())

    if end == start:
        end += 1

    max_chars: int = get_terminal_size().columns - 4  # leading spaces
    if end == line_end + 1:  # newline
        max_chars -= 1

    text_start: int = max(min(
        line_end - max_chars, end - 1 - max_chars // 2,
        start - (max_chars + 2) // 3,
    ), line_start)
    text_end: int = min(max(
        line_start + max_chars, start + (max_chars + 1) // 2,
        end + max_chars // 3,
    ), line_end)
    text: str = doc[text_start:text_end].expandtabs(1)
    if text_start > line_start:
        text = "..." + text[3:]

    if len(text) > max_chars:
        end -= len(text) - max_chars
        text = (
            text[:max_chars // 2 - 1] + "..." + text[2 - (max_chars + 1) // 2:]
        )

    if text_end < line_end:
        text = text[:-3] + "..."

    return start - text_start + 1, text, end - text_start + 1

  File '<stdin>', line 1
    ...s such as DocBook.", "GlossSeeAlso": ["GML", "XML"]}, "GlossSee"}}}}}
                                                                       ^
jsonyx.JSONSyntaxError: Expecting ':' delimiter

But now there’s no way to determine what the column number is ^[2], so I’m requesting a way to truncate the line AND display a column number:

  File '<stdin>', line 1, column 371
    ...s such as DocBook.", "GlossSeeAlso": ["GML", "XML"]}, "GlossSee"}}}}}
                                                                       ^
jsonyx.JSONSyntaxError: Expecting ':' delimiter
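
For reference, the absolute column in such a header can be derived from the document and the error index, independently of any truncation. A minimal sketch, using the same count/rfind arithmetic as the standard library's json.JSONDecodeError; the function name line_and_column is hypothetical, and only "\n" is treated as a line break here (the truncation code above also handles "\r"):

```python
def line_and_column(doc: str, pos: int) -> tuple[int, int]:
    """Return the 1-based (line, column) of the 0-based index pos in doc."""
    lineno = doc.count("\n", 0, pos) + 1
    # rfind() returns -1 on the first line, which makes the column 1-based.
    colno = pos - doc.rfind("\n", 0, pos)
    return lineno, colno


print(line_and_column("[,]", 1))            # (1, 2)
print(line_and_column('{\n  "a" 1\n}', 8))  # (2, 7)
```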

Possible solutions, ranging from least to most work for people using this feature:

Automatically truncate SyntaxError and display the offset
Add a TruncatedSyntaxError which will be truncated automatically and also displays the offset
Allow specifying the column number in the constructor of SyntaxError

My implementation already:

strips whitespace
reserves 1 character at the end of the line
expands tabs
truncates the start, middle and end, potentially all three

But some questions still remain:

min, max and default values for the available number of columns
what to do with unprintable characters
how should this be configured
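
As one possible starting point for the first and last questions, a minimal sketch. The names available_chars, MIN_COLUMNS and INDENT and the floor of 20 are assumptions for illustration, not something settled in this thread. It relies on shutil.get_terminal_size(), which already honours the COLUMNS environment variable and falls back to a default when no terminal is attached:

```python
from shutil import get_terminal_size
from typing import Optional

MIN_COLUMNS = 20  # hypothetical floor so very narrow terminals still show context
INDENT = 4        # the traceback module indents source lines by four spaces


def available_chars(columns: Optional[int] = None) -> int:
    """Return how many characters of source text fit on one line."""
    if columns is None:
        # Honours $COLUMNS, then queries the terminal, then uses the fallback.
        columns = get_terminal_size(fallback=(80, 24)).columns
    return max(columns, MIN_COLUMNS) - INDENT


print(available_chars(80))  # 76
print(available_chars(5))   # 16
```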

Examples from my unit tests

# ("columns", "doc", "start", "end", "offset", "text", "end_offset")

# Remove leading space
(8, " current", 0, 8, 1, " current", 9),
#    ^^^^^^^^             ^^^^^^^^
(8, "\tcurrent", 0, 8, 1, " current", 9),
#    ^^^^^^^^^             ^^^^^^^^
(8, " current", 1, 8, 1, "current", 8),
#     ^^^^^^^             ^^^^^^^
(8, "\tcurrent", 1, 8, 1, "current", 8),
#      ^^^^^^^             ^^^^^^^

# Remove trailing space
(8, "current ", 0, 8, 1, "current ", 9),
#    ^^^^^^^^             ^^^^^^^^
(8, "current\t", 0, 8, 1, "current ", 9),
#    ^^^^^^^^^             ^^^^^^^^
(8, "current ", 0, 7, 1, "current", 8),
#    ^^^^^^^              ^^^^^^^
(8, "current\t", 0, 7, 1, "current", 8),
#    ^^^^^^^               ^^^^^^^

# No newline
(9, "start-end", 0, 5, 1, "start-end", 6),
#    ^^^^^                 ^^^^^

# Newline
(8, "current", 7, 7, 8, "current", 9),
#           ^                   ^
(8, "current", 7, 8, 8, "current", 9),
#           ^                   ^

# At least one character
(9, "start-end", 5, 5, 6, "start-end", 7),
#         ^                     ^

# Expand tabs
(9, "start\tend", 5, 6, 6, "start end", 7),
#         ^^                     ^

# Truncate start
(6, "start-middle-end", 13, 16, 4, "...end", 7),  # line_end
#                 ^^^                  ^^^
(7, "start-middle-end", 16, 17, 7, "...end", 8),  # newline
#                    ^                    ^

# Truncate middle
(12, "start-middle-end", 0, 16, 1, "start...-end", 13),
#     ^^^^^^^^^^^^^^^^              ^^^^^^^^^^^^
(13, "start-middle-end", 0, 16, 1, "start...e-end", 14),
#     ^^^^^^^^^^^^^^^^              ^^^^^^^^^^^^^

# Truncate end
(8, "start-middle-end", 0, 5, 1, "start...", 6),  # line_start
#    ^^^^^                        ^^^^^

# Truncate start and end
(7, "start-middle-end", 5, 6, 4, "...-...", 5),
#         ^                          ^
(8, "start-middle-end", 5, 6, 5, "...t-...", 6),
#         ^                           ^
(11, "start-middle-end", 7, 11, 5, "...middl...", 9),
#            ^^^^                       ^^^^
(12, "start-middle-end", 7, 11, 5, "...middle...", 9),
#            ^^^^                       ^^^^
(13, "start-middle-end", 7, 11, 6, "...-middle...", 10),
#            ^^^^                        ^^^^

If you want to play around with this you can install my library and then use the jsonyx format command:

$ pip install --force-reinstall git+https://github.com/nineteendo/jsonyx
$ echo '[,]' | jsonyx format
  File "<stdin>", line 1, column 2
    [,]
     ^
jsonyx.JSONSyntaxError: Expecting value

[1] Although I personally won’t use it until this functionality is added.
[2] If provided, VS Code allows you to jump to that exact position in the file.

sirosen · January 1, 2025, 5:59pm

No strong opinion from me on the feature/request, but nice write-up! I feel like I understand the problem pretty well from reading your post.

Question: is it “wrong” that you are using SyntaxError rather than ValueError (as stdlib json does) for your base? I feel this might be a source of contention, since SyntaxError is meant for invalid Python code.

I see the difference in presentation which you showed, but can you not achieve something similar with your own error messages? (And you mention VSCode jump-to-line, so maybe there’s a connection there?)

This is the form I think you should advocate for. It’s simple to use and places a very minimal burden on the language to implement.

But I think you need to answer why you can’t get the desired functionality when inheriting directly from Exception.

Nineteendo · January 1, 2025, 7:26pm

Well, it’s also an error that’s raised by a parser, just not a Python parser. I don’t really have a choice anyway, as it’s the only way to display an exception in this format.

I thought about putting this information in the error message, but it doesn’t really look right:

Traceback (most recent call last):
jsonyx.JSONSyntaxError: Expecting value
  File "<stdin>", line 1, column 2
    [,]
     ^

It’s the easiest to implement for me (as I already have the logic to truncate):

if sys.version_info >= (3, 14):
    super().__init__(msg, (filename, lineno, offset, text, end_lineno, end_offset, colno, end_colno))
else:
    self.colno = colno
    self.end_colno = end_colno
    super().__init__(msg, (filename, lineno, offset, text, end_lineno, end_offset))

But it’s the hardest to implement for anyone else, unless I publish a library with a base exception you can inherit from. It also wouldn’t be available to standard library modules like json, which would be a bit of a shame.

gerardw · January 1, 2025, 10:45pm

This is not the exception you are looking for.

The SyntaxError docs say:

Raised when the parser encounters a syntax error. This may occur in an import statement, in a call to the built-in functions compile(), exec(), or eval(), or when reading the initial script or standard input (also interactively).

Emphasis mine. The use of the definite article “the,” the context, and the list of where it may occur make it clear SyntaxError is intended for reporting syntax errors detected by the Python interpreter, not third-party libraries parsing non-Python code.

Subclass Exception

All built-in, non-system-exiting exceptions are derived from this class. All user-defined exceptions should also be derived from this class.

and format the message however you’d like.
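
A minimal sketch of that suggestion, assuming a hypothetical class name ParseError and hypothetical attribute names; it mimics the caret display by building it into the message:

```python
class ParseError(Exception):
    """Hypothetical parse error that renders its own location (1-based)."""

    def __init__(self, msg, filename, lineno, colno, text):
        super().__init__(msg)
        self.msg = msg
        self.filename = filename
        self.lineno = lineno
        self.colno = colno
        self.text = text

    def __str__(self):
        # Pad with spaces so the caret sits under the 1-based column.
        caret = " " * (self.colno - 1) + "^"
        return (
            f"{self.msg}\n"
            f'  File "{self.filename}", line {self.lineno}, column {self.colno}\n'
            f"    {self.text}\n"
            f"    {caret}"
        )


print(ParseError("Expecting value", "<stdin>", 1, 2, "[,]"))
```

When such an exception goes uncaught, the location follows the class name and message instead of preceding them, which is the layout shown earlier in this thread.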

sirosen · January 2, 2025, 6:45am

It is a “syntax error” broadly writ, yes, and it is raised by a parser. But SyntaxError is pretty specifically for syntactic errors in Python source.

I’m not following why this is your only choice. What is it in terms of output formatting which must come from SyntaxError treatment? Why can’t you replicate the behavior that you want with a custom exception class?

Right now this looks to me like an XY problem. You’re saying you want to change SyntaxError, but I still don’t understand why you’re using SyntaxError in the first place.

storchaka · January 2, 2025, 10:03am

SyntaxError is not appropriate for other parsing errors. One problem is that its representation of the error location differs from that of json.JSONDecodeError, re.error, pickle.UnpicklingError, pyexpat.ParseError, etc. (global offset vs the line-column pair, 0- vs 1-based indices, text vs bytes, single point vs span). Another problem is that non-Python sources often have lines that are too long to display. SyntaxError is also tightly associated with Python source errors, so some user code may be confused if other exceptions become subclasses of SyntaxError.

I see two general solutions:

Introduce a special protocol (a set of attributes and methods) for parsing errors. SyntaxError, json.JSONDecodeError, etc. (maybe even UnicodeError) should implement such a protocol, and the traceback module should use that protocol instead of special-casing SyntaxError.
This is part of a more ambitious plan for uniting notes and tracebacks (and the chain of handled exceptions): instead of separate __notes__, __traceback__ and __context__, add notes, tracebacks, etc. to a single linked list. Location information can be added as a kind of traceback node. For example, see the new detailed exception notes for pickling or JSON serializing errors in the main branch. They could be structured nodes instead of plain string notes.
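
To make the first option concrete, a minimal sketch of what such a protocol could look like. Everything here is hypothetical: the name LocatedError, the choice of msg and lineno as the shared attributes, and the describe helper are assumptions for illustration, not an existing or proposed API. It relies only on the fact that both SyntaxError and json.JSONDecodeError already expose msg and lineno:

```python
import json
from typing import Protocol, runtime_checkable


@runtime_checkable
class LocatedError(Protocol):
    """Hypothetical protocol: any error that knows its message and line."""

    msg: str
    lineno: int


def describe(exc: LocatedError) -> str:
    # A renderer written against the protocol, not against SyntaxError.
    return f"line {exc.lineno}: {exc.msg}"


try:
    json.loads("[,]")
except json.JSONDecodeError as exc:
    decode_error = exc

syntax_error = SyntaxError("invalid syntax", ("<stdin>", 3, 1, "x y\n"))

for error in (decode_error, syntax_error):
    assert isinstance(error, LocatedError)
    print(describe(error))
```

The column is deliberately left out of this sketch: SyntaxError calls it offset while json.JSONDecodeError calls it colno, which is the kind of mismatch a shared protocol would have to resolve.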

Nineteendo · January 2, 2025, 5:30pm

VS Code only understands a line-column pair for the start and end position in a text file (not a binary file):

github.com/microsoft/vscode/blob/2bdb3e9b41bd72048ea2067a350d8536c82fc7f6/src/vs/workbench/contrib/terminalContrib/links/browser/terminalLinkParsing.ts#L64-L126

// The comments in the regex below use real strings/numbers for better readability, here's
// the legend:
// - Path    = foo
// - Row     = 339
// - Col     = 12
// - RowEnd  = 341
// - ColEnd  = 789
//
// These all support single quote ' in the place of " and [] in the place of ()
//
// See the tests for an exhaustive list of all supported formats
const lineAndColumnRegexClauses = [
    // foo:339
    // foo:339:12
    // foo:339:12-789
    // foo:339:12-341.789
    // foo:339.12
    // foo 339
    // foo 339:12                              [#140780]
    // foo 339.12

This file has been truncated.

This includes:

foo, line 1, column 3
foo, line 1, column 3-4
foo, line 1-2, column 3
foo, line 1-2, column 3-4
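
A simplified illustration of these four shapes as a single Python regex. This is not the expression VS Code uses (that lives in the file linked above and covers many more formats); the pattern and group names are assumptions made for this sketch:

```python
import re

# Hypothetical, simplified pattern for "path, line R[-R2], column C[-C2]".
LOCATION = re.compile(
    r"(?P<path>.+?), line (?P<row>\d+)(?:-(?P<row_end>\d+))?"
    r", column (?P<col>\d+)(?:-(?P<col_end>\d+))?"
)

for sample in (
    "foo, line 1, column 3",
    "foo, line 1, column 3-4",
    "foo, line 1-2, column 3",
    "foo, line 1-2, column 3-4",
):
    match = LOCATION.fullmatch(sample)
    assert match is not None
    print(match.groupdict())
```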

GitHub uses a similar format in the link above: L64-L126

Hence why I’m requesting to support truncating the line (manually or automatically) to fit on the screen…

Both are fine by me, I just thought a simple suggestion would receive less opposition.

Nineteendo · January 3, 2025, 6:56am

If that is a problem, would it be a solution to introduce BaseSyntaxError? Then the 3k examples from GitHub could do the right thing: /class \w+\(SyntaxError\)/

I would be happy with even a simple solution in Python 3.14. We can add something more sophisticated later.