None-safe traversal of dictionaries, e.g. from JSON

None safe

nonesafe makes it safe to parse dictionaries.

When parsing a dictionary from an external source, e.g. a JSON request, dictionary keys might be missing or there may be unknown dictionary keys.

For example, suppose you know (or only care about) the keys a and b at the top level, and that a is itself a dictionary with a key c.

>>> d_ok = {'a': {'c': 1}, 'b': 0}

This would be easy to use directly as a dictionary:

>>> d_ok['a']
{'c': 1}
>>> d_ok['a']['c']
1
>>> d_ok['b']
0

But if instead from the external source you got:

>>> d_not_ok = {'a': {'c': 1}, 'not_b': 0}

Then the code above using a dictionary would fail. Instead, use:

>>> import nonesafe as ns
>>> A = ns.declare('A', c=int)
>>> Safe = ns.declare('Safe', a=A, b=int)
>>> s = Safe(d_not_ok)
>>> s.a
A(c=1)
>>> s.a.c
1
>>> s.b

The missing value b is replaced by None (the doctest above shows no output for s.b because the interpreter does not echo None) and the extra value not_b is ignored. Access of the form s.expr is safe: it will not raise an access exception, but it might return None instead.
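To make the behaviour above concrete, here is a minimal sketch of the idea using only plain Python. It is not nonesafe's implementation: the class name SafeView and its nested-dict way of describing fields are hypothetical, chosen only for illustration.

```python
# Hypothetical stand-in for illustration only; not nonesafe's implementation.
class SafeView:
    """Expose the declared keys of a dict as attributes, defaulting to None."""

    def __init__(self, fields, data=None):
        # Treat a missing or non-dict input as an empty dict.
        data = data if isinstance(data, dict) else {}
        for name, kind in fields.items():
            value = data.get(name)  # a missing key becomes None
            if isinstance(kind, dict):
                # A nested declaration: wrap it so deeper access is also safe.
                value = SafeView(kind, value)
            setattr(self, name, value)  # undeclared keys are never copied


# The declared types (int) are carried along but not checked in this sketch.
fields = {'a': {'c': int}, 'b': int}

s = SafeView(fields, {'a': {'c': 1}, 'not_b': 0})
empty = SafeView(fields, None)
```

With this sketch, s.a.c is 1, s.b is None, and empty.a.c is None rather than an exception.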

There is also a utility function, onnone(value, otherwise), that takes a value that might be None and, if it is, returns otherwise instead. For example:

>>> ns.onnone(s.b, -1)
-1
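Based on the description above, onnone presumably behaves like the following one-liner. This is a sketch of the assumed behaviour, not code copied from nonesafe.py.

```python
def onnone(value, otherwise):
    """Return otherwise if value is None, else value (assumed behaviour)."""
    return otherwise if value is None else value


missing = onnone(None, -1)  # None is replaced by the fallback
present = onnone(0, -1)     # a falsy but non-None value is kept
```

Testing against None explicitly, rather than writing value or otherwise, matters here: 0, '' and [] are legitimate values that should not be replaced.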

The function declare is very flexible; the following are all equivalent:

>>> Ex0 = ns.declare('Ex0', {'a': int, 'b': int})
>>> Ex1 = ns.declare('Ex1', [('a', int), ('b', int)])
>>> Ex2 = ns.declare('Ex2', a=int, b=int)
>>> Ex3 = ns.declare('Ex3', {'a': int}, b=int)
>>> Ex4 = ns.declare('Ex4', [('a', int)], b=int)

Constructing an instance of a nonesafe class is also very flexible; the following are all equivalent:

>>> ex0 = Ex0({'a': 0, 'b': 1})
>>> ex1 = Ex0([('a', 0), ('b', 1)])
>>> ex2 = Ex0(a=0, b=1)
>>> ex3 = Ex0({'a': 0}, b=1)
>>> ex4 = Ex0([('a', 0)], b=1)

and these are also the same as each other:

>>> ex5 = Ex0({})
>>> ex6 = Ex0([])
>>> ex7 = Ex0(None)
>>> ex8 = Ex0()

Installation

Simply copy nonesafe.py (note the LICENCE). To run some examples, execute nonesafe.py.

Alternatives

Something very similar can be achieved with packages like Pydantic, but they are much too heavyweight for casual use and their inclusion has previously been rejected in favour of dataclasses (PEP 557).

There is also a rejected PEP 505 and a proposal to revive it, "Revisiting PEP 505", that failed to reach a consensus. PEP 505 proposed introducing new None-aware operators ?? (same as onnone), ?., and ?[] (the last two are equivalent to declare's behaviour). This module is considerably easier to add than three operators (the current proof of concept is circa 60 lines) and is arguably superior because it is declarative. Note that operators also need to be added to IDEs, type checkers, etc., and need to be taught. For newbies and non-computer-science people they will be unfamiliar.
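For readers unfamiliar with PEP 505, the three proposed operators can be approximated in today's Python with small helper functions. The helper names below (coalesce, maybe_attr, maybe_item) are made up for this sketch, and real operators would also short-circuit the rest of the expression, which functions cannot do.

```python
def coalesce(value, otherwise):
    """Roughly `value ?? otherwise`."""
    return otherwise if value is None else value


def maybe_attr(obj, name):
    """Roughly `obj?.name`: None if obj is None, else the attribute."""
    return None if obj is None else getattr(obj, name)


def maybe_item(obj, key):
    """Roughly `obj?[key]`: None if obj is None, else the item."""
    return None if obj is None else obj[key]


data = {'a': {'c': 1}}
found = maybe_item(maybe_item(data, 'a'), 'c')
absent = maybe_item(maybe_item(None, 'a'), 'c')
```

Note that, as in the PEP, these guard only against None: maybe_item still raises KeyError when obj is a dict that lacks the key.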

TODO

  1. Check field value is of correct type or None (auto-convert if possible). Presently ugly error!
  2. Add todict: should ‘extras’ not parsed be added back in? Should None be omitted? Yes & Yes.
  3. Allow declare to be used as a class decorator. Copy docstring from decorated classes.
  4. Decorated classes can provide defaults other than None.

Next steps

Is this something people are interested in pursuing?

Why not just use .get() with a default value on a regular dict?

Because a.get('b', None) fails if a itself is None (e.g. your external source was empty), and it's verbose.
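A short illustration of that failure, using only built-in behaviour (the variable names are just for this example):

```python
a = None  # e.g. the external source returned nothing

failure = None
try:
    a.get('b', None)
except AttributeError as err:
    # None has no .get method, so the lookup itself blows up.
    failure = str(err)

# Guarding by hand works, but the guard must be repeated at every level.
b = a.get('b') if a is not None else None
```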

This seems to read ok to me:

a.get(b) if a is not None else None

I guess that gets tougher if it’s a.b.c or deeper but I guess that doesn’t typically happen to me. If it’s data I control, I’d rather it be a dataclass or something more typing friendly then have correct annotations to get the type checker to yell at me if I forget something.
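For the deeper a.b.c case, a hand-rolled helper along these lines is one common workaround with plain dicts. The name dig is hypothetical and this is only a sketch, not part of nonesafe or the stdlib.

```python
def dig(data, *keys):
    """Follow keys through nested dicts, returning None as soon as a level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict):
            return None  # covers None as well as unexpected non-dict values
        current = current.get(key)
    return current


hit = dig({'a': {'c': 1}}, 'a', 'c')
miss = dig({'a': None}, 'a', 'c')
empty = dig(None, 'a', 'c')
```

Unlike a declared structure, the path here is just a sequence of strings, so a type checker cannot catch a misspelt key.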

Sort of reminds me a bit about using jq and json to get a specific nested key in a list of objects.

Two comments:

  1. Dataclasses don’t deal with nested dicts nor if the dict is None.
  2. I haven’t added it yet, but todict for the round trip is difficult without nonesafe because parts of a hierarchy could be missing.
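The first point can be checked with the stdlib alone. In this sketch the dataclasses A and Safe mirror the earlier declare example; they are ordinary dataclasses, not nonesafe classes.

```python
from dataclasses import dataclass


@dataclass
class A:
    c: int


@dataclass
class Safe:
    a: A
    b: int


# A nested dict is stored as-is; it is not converted into an A instance.
s = Safe(**{'a': {'c': 1}, 'b': 0})
nested_is_dict = isinstance(s.a, dict)

# An unknown key (and likewise a missing one) makes construction fail.
error = None
try:
    Safe(**{'a': {'c': 1}, 'not_b': 0})
except TypeError as err:
    error = str(err)
```

Passing None instead of a dict fails in the same way, since Safe(**None) also raises TypeError.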

Is this a proposal for the stdlib?

It’s rare that the data is fully unknown, and when it is, the syntactic overhead of handling it is not typically that onerous.
You need the combination of unknown data shape and deep nesting for this sort of approach to be worth it.

And in those cases, why not use glom or jmespath?
Requiring a library for a niche case seems perfectly in line with how the stdlib and language should be developed.

Yes stdlib instead of PEP 505. There is a lot of interest in this space for JSON in particular. However, no agreement on best approach. You often get extra elements in JSON responses across versions and also None data completely omitted.

The benefit of PEP 505 compared to the various third party libraries in my mind is:

  • short concise syntax that makes it obvious for readers that this piece of code is none aware
  • no need to declare the expected structure before hand (some third party libraries already allowed this)

This proposal has neither and is just another solution.

Upload this to PyPI if you want, but I don’t think it’s a good fit for the stdlib.

Related parallel “brainstorming” thread:
https://discuss.python.org/t/linked-booleans-logics-rethinking-pep-505/

Isn’t this what, (a), JSON Schema is supposed to do?
If there is already a leading json schema, should we not implement and use that?

Given that JSON Schema has multiple drafts (versions) and that many of the older drafts have ambiguities, I think direct integration is a bad idea.

I work with the JSON Schema folks a little bit and they’re actively interested in improving the language (that is, the language of schemas).

So it’s a moving target. Even jsonschema doesn’t have full support for some of the features – it’s all very active and good stuff as far as most people ought to be concerned, but not the sort of thing you’d want pinned to a stale version in the stdlib.

The typing elements are especially interesting, since you’d need a type checker plugin which can read a schema and deduce Python types. And that’s not always even possible given a schema (unless you want to count Any).

Glom’s lookup path based data access is the most mature library I’m aware of that works along the lines of the OP’s proposal here.

Assuming PEP 750’s template string proposal gets accepted, it will become even more powerful (since it will become easier to inject values for data dependent lookup requests)

Thanks for the feedback everyone. I’ve updated the API to add todict so it can now be used to write and read/modify/write as well as read dicts to/from external data (e.g. JSON). Details in README on GitHub.

A motivating example:

Declare what is of interest (there should be at least a dict with a and b and a is a dict with c):

>>> from nonesafe import *
>>> A = nsdict('A', c=int)
>>> Safe = nsdict('Safe', a=A, b=int)

Consider a particularly tricky example. Suppose we read:

>>> tricky = {'b': None, 'unknown': 'u'}

Then add a.c:

>>> st = Safe(tricky)
>>> st.a.c = 0

Finally write it out again:

>>> st.todict()
{'b': None, 'unknown': 'u', 'a': {'c': 0}}

There is a lot going on in this example:

  1. a.c has been added at the end; it is not in the input tricky, hence it comes last.
  2. b, despite being None, is in the output because it was in tricky. A field that is in the input is retained even if it is None, which would normally be trimmed.
  3. unknown is retained, even though Safe doesn’t know about this field. It is retained because it is in the input.

The above example is much harder using any of the other alternatives.

My motivation for writing nonesafe came from a previous company where we supplied a wrapper around a JSON API to customers, that was built using dataclasses, and also from processing data from an internal Asana database (this code used Pandas). In both cases the nonesafe library would have been superior (but I hadn’t thought of it!).