Unpacking generalization for dict assignment with type conversion/data normalization

Preface:
The following idea was originally part of my earlier proposal to Expand structural pattern matching-like syntax to assignments/generalized unpacking, but that proposal was shot down because it tried to achieve both pattern matching and unpacking at the same time and would too easily result in ambiguous situations. This proposal hopefully avoids ambiguity by instead focusing only on unpacking for dict assignment, with optional type conversion and/or data normalization. Okay, that may still be too ambitious with two proposals in one, but I think they go hand in hand to achieve the goal of an assignment with an ad-hoc schema.

Motivation:
When dealing with dicts from records of a predefined schema, it happens all too often that all we want is for the values of certain keys to be assigned to respective variables, while converting some of them to appropriate types or formats:

import csv
from datetime import date
from enum import Enum

Gender = Enum('Gender', {'MALE': 'M', 'FEMALE': 'F'})

with open('persons.csv') as file:
    for row in csv.DictReader(file):
        name = row['name']
        gender = Gender(row['gender'])
        birthdate = date.fromisoformat(row['birthdate'])
        # get_fortune is assumed to be defined elsewhere
        print(f"{name}'s fortune of the day is {get_fortune(gender, birthdate)}")

Would it not be cleaner if the variable = constructor(row[key]) boilerplate above could be more intuitively rewritten with the following unpacking syntax instead?

for {
    'name': name,
    'gender': Gender(gender),
    'birthdate': date.fromisoformat(birthdate)
} in csv.DictReader(file):
    print(f"{name}'s fortune of the day is {get_fortune(gender, birthdate)}'")

Proposals:

  1. Dict assignment: If an assignment target is a dict literal enclosed in { and }, unpack the corresponding value on the right as a mapping.
    a. The value for each specified key on the left is retrieved from the mapping on the right and assigned to the variable given under that key:
    {'foo': a} = {'foo': 1, 'bar': 2} # a = 1
    
    b. For convenience, we may not want to enforce that all keys in the mapping on the right are accounted for in the dict literal on the left, so the above assignment is valid even if the dict literal on the left is missing the key 'bar'. But whether or not we enforce it, we should allow the unspecified keys and values to be stored in a double-starred dict variable (a sketch of these semantics in today's Python follows the proposals):
    {'foo': a, **rest} = {'foo': 1, 'bar': 2} # a = 1; rest = {'bar': 2}
    
  2. Type conversion/data normalization: If an assignment target on the left is not a plain variable name:
    a. If the expression contains a call (or multiple calls), there should be one and only one variable name in the entire expression that would be substituted with the corresponding value on the right to make the call(s), and the return value of the call(s) would be assigned to that variable name:
    int(a), b = '123.1', 1 # a = 123; b = 1
    
    We can allow method calls too:
    a.replace(' ', '').rstrip(), b = 'foo bar\n', 1 # a = 'foobar'; b = 1
    
    But multiple variables in the same assignment target should produce an error due to ambiguity:
    base = 2
    int(a, base), b = '10', 1 # disallowed because both a and base are variable names
    
    b. Subscripts are treated, together with the name they belong to, as a single assignment target; variables within subscripts are allowed and are evaluated first:
    b = 0
    int(a[b]), c = '1.2', 2 # a[0] = 1; c = 2
    
    c. Dotted names are valid assignment targets as they are today:
    int(foo.a), b = '1.2', 2 # foo.a = 1; b = 2
    
  3. Structurally nestable: If an assignment target is a literal of dict, tuple or list, recursively unpack the corresponding value on the right to apply further assignments, like how assignments of nested tuples and lists work today:
    {'foo': {'bar': a}} = {'foo': {'bar': 1}} # a = 1
    
    and:
    {
        'foo': (
            a,
            {'baz': int(b)},
            *c
        )
    } = {'foo': ('bar', {'baz': '1.2'}, 1)} # a = 'bar'; b = 1; c = [1]
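
As a point of comparison, the semantics of proposal 1 (without the type conversion part) can be approximated in today's Python with a small helper; the unpack name here is purely illustrative:

def unpack(mapping, *keys):
    # Mimics the proposed {'key': var, **rest} = mapping semantics:
    # pop the requested keys, collect the leftovers in a dict.
    rest = dict(mapping)
    values = [rest.pop(k) for k in keys]
    return *values, rest

a, rest = unpack({'foo': 1, 'bar': 2}, 'foo')  # a = 1; rest = {'bar': 2}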
    

Current workaround:
One can currently use operator.itemgetter together with tuple unpacking assignment to assign the values under specific keys to corresponding variable names. It is not as pretty as the proposal above IMHO, does not allow nested structures, and cannot perform type conversion/data normalization in one go:

from operator import itemgetter
name, weight = itemgetter('name', 'weight')({'name': 'foo', 'age': '30', 'weight': '150.5'})
weight = float(weight)

I think the structural assignments make some amount of sense, even though they look a little odd and will take time to get used to.

The data conversion/normalization part of this is one step too far for me. It looks wrong and confusing and I’m not sure how you would actually make it work correctly in an unambiguous way, since you now have actual expressions on the LHS that need to be de-sugared and executed on the RHS.

The whole point of structural unpacking is that the structure on the LHS matches the one on the RHS, so if you start putting arbitrary expressions on the LHS for type conversion, the symmetry is broken, especially since on the LHS the conversion is supposed to be applied after the structural matching, and you get the output of the function rather than the input. But the name is used in the argument list of the call, so it looks like an input…

Data conversion would need its own clearly distinct unambiguous syntax that belongs with the declaration of the name, i.e. something like this would make a lot more sense to me. Although there’s probably a better keyword for this operation.

for {
    'name': name,
    'gender': gender astype Gender,
    'birthdate': birthdate astype date.fromisoformat
} in csv.DictReader(file):
    print(f"{name}'s fortune of the day is {get_fortune(gender, birthdate)}'")

Although I personally think that’s still too much logic to pack into variable assignments.


Perhaps you can use PEP 634 patterns for what you want to do?

That PEP and its predecessor explain why all of pattern matching could not make it into assignment statements.

for row in csv.DictReader(file):
    match row:
        case {
            'name': name,
            'gender': Gender(gender),
            'birthdate': birthdate,
        }:
            print(f"{name}'s fortune of the day is {get_fortune(gender, date.fromisoformat(birthdate))}")

I would probably solve this problem by defining a dataclass to represent rows of the CSV, which then separates the task of parsing from the computation performed to get the output.
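
For instance, a minimal sketch of that approach (the Person class and from_row helper are illustrative names, reusing Gender and get_fortune from the opening post):

import csv
from dataclasses import dataclass
from datetime import date

@dataclass
class Person:
    name: str
    gender: Gender
    birthdate: date

    @classmethod
    def from_row(cls, row):
        # All parsing/conversion lives here, separate from the computation.
        return cls(
            name=row['name'],
            gender=Gender(row['gender']),
            birthdate=date.fromisoformat(row['birthdate']),
        )

with open('persons.csv') as file:
    for row in csv.DictReader(file):
        person = Person.from_row(row)
        print(f"{person.name}'s fortune of the day is {get_fortune(person.gender, person.birthdate)}")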

But the name is used in the argument list of the call, so it looks like an input…

Data conversion would need its own clearly distinct unambiguous syntax that belongs with the declaration of the name, i.e. something like this would make a lot more sense to me. Although there’s probably a better keyword for this operation.

Okay, I totally get why reusing the same variable name as both an argument to the data cleaning callable(s) and an assignment target can be considered confusing (where int(x), = '1.2', implicitly means x = '1.2'; x = int(x)), but the point is to provide a more concise way to perform data cleaning with callables that don’t necessarily take the input as the first argument, or involve multiple calls made in different ways, e.g. int(x.rstrip()), = '1.2\n', means x = '1.2\n'; x = int(x.rstrip()), and have the result assigned back to a variable, all in an unpacking assignment. This simply cannot be done with the variable astype constructor syntax that you propose since it forces constructor to be one callable that takes one argument.

So how about we use a special variable name such as _ for catching an input value from unpacking, and then use the as keyword to assign the resulting value from calls to an assignment target? The _ variable name would only be special when there is an as keyword:

int(_.rstrip()) as a, = '1.2\n', # _ = '1.2\n'; a = int(_.rstrip())

Perhaps you can use PEP 634 patterns for what you want to do?

That PEP and its predecessor explain why all of pattern matching could not make it into assignment statements.

That’s actually why my previous proposal included pattern matching in this new syntax: I was indeed inspired by the structural unpacking feature that the pattern matching syntax offers. But that proposal failed, and deserved to fail, because pattern matching and unpacking assignment serve very different goals.

The match statement is first and foremost meant for pattern matching and variable binding, rather than type conversion/data cleaning/normalization and assignment, so the callables in a case clause are simply used for type matching and are not actually called to perform conversion or normalization. What I would like to see for my use case is to force calls on items unpacked from an assignment before the values are assigned to variables.
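
A small demonstration of that point with today's match statement: class patterns perform isinstance checks and never call the class:

from enum import Enum

Gender = Enum('Gender', {'MALE': 'M', 'FEMALE': 'F'})

match 'M':
    case Gender():
        print('enum')   # not reached: 'M' is a str, not a Gender member
    case str(s):
        print(s)        # prints 'M'; str(s) is a pattern that binds the
                        # subject itself, str() is never called on it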

(Side note: int("1.2\n".rstrip()) will raise ValueError - was that your intention?)

It sounds like you want a completely general system, which is great, but it’s going to fall foul of ambiguities. Consider:

  • int(x) = "5" - equivalent to x = "5"; x = int(x)
  • int(x[1]) = "A5" - equivalent to x = "A5"; x = int(x[1])
  • x[1] = "A5" - is this equivalent to x = "A5"; x = x[1] or is it subscript assignment into x?

Since the third one already has meaning, this can’t be changed, so you’re going to need to come up with some sort of rule about exactly what limits this new syntax. And even if you can come up with something that’s technically unambiguous (that is, that the Python syntax has precisely one interpretation for), it’s still going to be confusingly similar.
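
For reference, the third form already has a well-defined meaning today, which is why it cannot be repurposed:

x = ['A', 'B', 'C']
x[1] = "A5"        # subscript assignment: x == ['A', 'A5', 'C'],
                   # not x = "A5"; x = x[1]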

If this were a proposal with no cost, it might be of some small value, but I don’t think it has enough to justify the complexity it adds.

Thanks for appreciating the effort to generalize the syntax of unpacking. The ambiguity you raise here is actually already addressed by section b. of my proposal 2:

  2. b. Subscripts are treated, together with the name they belong to, as a single assignment target; variables within subscripts are allowed and are evaluated first:
    b = 0
    int(a[b]), c = '1.2', 2 # a[0] = 1; c = 2
    

Or with my revised syntax proposed above in response to David Salvisberg’s comment:

b = 0
int(_) as a[b], c = '1.2', 2 # _ = '1.2'; _ = int(_); a[b] = _; c = 2

And ah yes, I totally forgot that while the int constructor can take either a float or a str as an argument, the string cannot be a representation of a float. Consider it valid just for illustration purposes for now. Thanks.

This is what I mean about a technical vs practical ambiguity, though. You may be able to design a set of rules that avoid technical ambiguities, but they will still be extremely confusing, as there will be highly similar constructs that have vastly different results. This is a path to chaos.

Yeah, if you switch it out for float instead, it’ll be valid and meaningful, and won’t change your argument any.

With my proposal 2b. it actually aligns perfectly with the existing way of assigning a value to a subscripted target, so I fail to see your point about ambiguity here:

Current assignment to a subscript:

a[b], = int(1.2),

Proposed assignment to a subscript:

int(_) as a[b], = 1.2,

In both of the above cases, a[b] is treated as a whole as an assignment target.

I never said that the RHS of astype needs to be a callable; it could just as well be an arbitrary expression where the locals created by the unpacking operation are already available. This is new syntax, so any number of things are possible. But even if you stipulated that it needs to be a callable, you could just use a lambda.

Your proposed syntax, along with the few hacks to address a subset of the ambiguity concerns, is confusing to parse for both machines and humans, and it arbitrarily limits the things it can do as well. So you are not getting enough return for your mental investment.

I’m not even really on board with the astype version, even though it has much cleaner and easier-to-understand semantics and has to make none of the concessions you were forced to make. You could even say that astype happens in a second step after unpacking, so you could combine multiple elements in the conversion (e.g. to perform something like a cross product).

In any case, it’s very rare that I want data transformation on this level without data validation, so I’d much rather use a pydantic model for this kind of operation.
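
For example, a minimal sketch with pydantic (PersonRow is an illustrative name; this relies on pydantic's standard coercion of enum values and ISO-format date strings, and reuses Gender from the opening post):

import csv
from datetime import date
from pydantic import BaseModel

class PersonRow(BaseModel):
    name: str
    gender: Gender      # 'M'/'F' is validated and coerced to the enum by value
    birthdate: date     # ISO 8601 strings are parsed and validated

with open('persons.csv') as file:
    for row in csv.DictReader(file):
        person = PersonRow(**row)  # raises ValidationError on bad rows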
