JSON library unable to provide index during parse operations

This is a repost re some feedback here:

Proposal:

import json

def some_hook_logic(pairs: list):
    # some logic here; whatever the hook returns replaces the decoded object
    return dict(pairs)

if __name__ == '__main__':
    with open('some.json', 'r') as fp:
        json.load(fp, object_pairs_hook=some_hook_logic)

Currently, hooks are provided to allow for some custom processing of JSON files. However, there is no way for a hook to obtain the index (the code point offset into the JSON string being parsed) at which the hooked operation occurred. This would be especially useful for providing validation feedback on files, but there may be other benefits to providing this information to the hooks.

JSONDecoder has a raw_decode method that can be overridden and that could be passed the index. Currently, however, this parameter is essentially unused, and the index is available only to the scanner implementation (which can be either pure Python or the _json.c module).
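For reference, here is a minimal sketch of what raw_decode exposes today. It accepts a starting index and returns the decoded value together with the index where parsing stopped, but neither number is handed to the hooks. The sample strings are made up for illustration.

```python
import json

decoder = json.JSONDecoder()

# raw_decode returns (value, end_index) and tolerates trailing data.
value, end = decoder.raw_decode('{"a": 1} trailing')
print(value, end)  # {'a': 1} 8

# The optional second argument is the offset at which decoding starts.
value2, end2 = decoder.raw_decode('xx{"a": 1}', 2)
print(value2, end2)  # {'a': 1} 10
```

So a caller can learn where a top-level document starts and ends, but not where each nested object sits.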

The proposal/feature is to add the index as a parameter to the hooks, or make it available as a part of JSONDecoder, possibly as new index-aware hooks so as to leave existing code interoperable.

This has been discussed on IRC previously. The current community recommendation is to pull in an external library (e.g. https://pypi.org/project/demjson3/ or https://github.com/ijl/orjson). However, this would not be much of an API change, e.g.:

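def some_hook_logic(pairs: list, index: int):
    # some logic here
    return dict(pairs)

# indexed_object_pairs_hook is the proposed parameter; it does not exist today
json.load(fp, indexed_object_pairs_hook=some_hook_logic)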

This would mostly be a matter of passing the index through in the parser implementations, and/or fixing the raw_decode method so that it is actually used rather than being essentially a dead-code passthrough.
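To make the idea concrete, here is a rough sketch of how the index can be threaded through to a hook with the pure-Python scanner as it exists in CPython. It is not the proposed implementation. IndexAwareDecoder and its indexed_object_pairs_hook argument are hypothetical names, and the sketch leans on undocumented internals (the parse_object and scan_once attributes, json.scanner.py_make_scanner, json.decoder.JSONObject) that may change between versions. It also bypasses the C accelerator, so it will be slower.

```python
import json
import json.decoder
import json.scanner


class IndexAwareDecoder(json.JSONDecoder):
    """Hypothetical decoder that passes the offset of each '{' to a hook."""

    def __init__(self, indexed_object_pairs_hook, **kwargs):
        super().__init__(**kwargs)
        self._indexed_hook = indexed_object_pairs_hook
        # Assumption: these attributes are CPython implementation details.
        # Swap in our object parser, then rebuild the scanner in pure Python
        # so that the replacement is actually consulted.
        self.parse_object = self._parse_object
        self.scan_once = json.scanner.py_make_scanner(self)

    def _parse_object(self, s_and_end, strict, scan_once,
                      object_hook, object_pairs_hook, memo=None):
        _, end = s_and_end      # end points just past the opening '{'
        start = end - 1         # offset of the '{' itself

        def hook(pairs):
            return self._indexed_hook(pairs, start)

        # Delegate the real parsing to the stock implementation.
        return json.decoder.JSONObject(
            s_and_end, strict, scan_once, None, hook, memo)


seen = []


def record(pairs, index):
    seen.append(index)
    return dict(pairs)


text = '{"a": {"b": 1}}'
result = json.loads(text, cls=IndexAwareDecoder,
                    indexed_object_pairs_hook=record)
print(result, seen)  # {'a': {'b': 1}} [6, 0]
```

The inner object is reported first because a hook only runs once its object has been fully parsed.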

Can you provide an example where this index would be useful? By the time the hook is called, the JSON has already been parsed; the object has been transformed from a string to a list of key/value pairs. The job of the hook is to post-process this list, not handle parser errors.

Yes, and the post-processing is a time when we want to handle errors, or do other things as well.

def some_hook_logic(pairs: list, index: int):
    # check the pairs
    print("you aren't allowed those pairs, at %d" % index)
    # or
    if some_determination(index):
        pass  # do one thing
    else:
        pass  # do other thing

Fair enough. That kind of business logic, IMO, should be implemented after json.load returns. The main purpose of this hook is to build a data structure that, unlike a dict, allows for the duplicate keys that are permitted in a JSON object.
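As an illustration of that purpose, here is a small sketch of an object_pairs_hook that notices duplicate keys, which a plain dict would silently collapse. The function name reject_duplicates is made up for this example.

```python
import json


def reject_duplicates(pairs):
    """Build a dict from the pairs, refusing keys that repeat."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


ok = json.loads('{"a": 1, "b": 2}', object_pairs_hook=reject_duplicates)

try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)
```

Note that the error can name the offending key but not where in the file it occurs, which is the gap the proposal is about.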

It seems that for this to be useful, it should also convert the index to line & column. And a filename (or other source identification) would also be helpful.
For this use case, today it’d be better to use some third-party library with better overall support for reporting locations. For json, this change could make sense as a first step toward that, but it would be great to have some bigger plan first.
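Converting an index to a line and column is straightforward if the original text is still at hand. The helper below is a sketch, and index_to_line_col is a made-up name. It uses the same 1-based counting that json.JSONDecodeError reports through its lineno and colno attributes.

```python
def index_to_line_col(text: str, index: int) -> tuple[int, int]:
    """Return the 1-based (line, column) of a character offset in text."""
    line = text.count("\n", 0, index) + 1
    # rfind returns -1 when there is no earlier newline, which makes the
    # column 1-based on the first line as well.
    col = index - text.rfind("\n", 0, index)
    return line, col


doc = '{\n  "a": 1\n}'
print(index_to_line_col(doc, 0))  # (1, 1)
print(index_to_line_col(doc, 4))  # (2, 3)
```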

To do it afterwards, the locations/indexes would need to be stored in json.load’s result, so you still need the hook.