Better fields access and allowing a new character at the start of identifiers

samuelcolvin · March 22, 2022, 5:38pm

namedtuples, dataclasses, pydantic, django models, SQLAlchemy and many others suffer from a problem:

How do you allow arbitrary field names without the risk that they clash with library-provided properties and functions?

Minimal example:

from typing import List
from whatever_library import orm_decorator

@orm_decorator
class Fisherman:
    name: str
    seas: List[str]

fisherman = Fisherman(name='Fred', seas=['atlantic', 'pacific'])
print(fisherman.seas)  # -> ['atlantic', 'pacific']
# here "fields" is a method which returns the field names
print(fisherman.fields())  # -> ['name', 'seas']

@orm_decorator
class Farmer:
    name: str
    fields: List[str]  # !!! this breaks, either now or later

This is not a new problem, it’s been around for years. Existing libraries deal with it with
a variety of hacks:

namedtuples provides an ._asdict() method - uses private-looking underscore names
dataclasses provides dataclasses.fields()
pydantic just uses .dict(), .json() etc. and forbids those names for fields
django uses model_instance.objects.whatever which is slightly different (table vs. row) but is used for a similar purpose

All these approaches have significant drawbacks.
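For reference, the two standard-library approaches listed above can be sketched roughly as follows. This is only an illustration using `collections.namedtuple` and `dataclasses`; the `FarmerTuple` and `FarmerData` names are made up for this example.

```python
from collections import namedtuple
from dataclasses import dataclass, fields
from typing import List

# namedtuple: library helpers are underscore-prefixed, so they cannot clash
# with user field names, even a field called "fields".
FarmerTuple = namedtuple("FarmerTuple", ["name", "fields"])
farmer_tuple = FarmerTuple(name="Jones", fields=["meadow", "highlands"])
tuple_as_dict = farmer_tuple._asdict()
tuple_field_names = FarmerTuple._fields


# dataclasses: introspection lives in a module-level function instead,
# so the class itself carries no extra public names.
@dataclass
class FarmerData:
    name: str
    fields: List[str]


farmer_data = FarmerData(name="Jones", fields=["meadow", "highlands"])
data_field_names = [f.name for f in fields(farmer_data)]
```

In both cases a field named `fields` is accepted, at the cost of either underscore-prefixed helpers or a separate import.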

I therefore propose that a new character is allowed at the start of an identifier
(2. Lexical analysis — Python 3.10.3 documentation) which is available
on most keyboards, and by convention that character is used in field names
within ORM/dataclass like contexts.

Two obvious options are “@” or “$”.

The above example would therefore become:

from whatever_library import orm_decorator

@orm_decorator
class Farmer:
    $name: str
    $fields: List[str]  # this works fine

farmer = Farmer(name='Jones', fields=['meadow', 'highlands'])
print(farmer.$fields)  # -> ['meadow', 'highlands']

The other potential solution to this would be to create a new “accessor” operator,
e.g. : or ::. So fields could be accessed via farmer::fields while the method is
still available via farmer.fields (or vice versa).
IMHO this would be more confusing and might require a bigger language change, and is
therefore a worse solution.

Here’s a discussion about potential workarounds in pydantic: pydantic#1001 (sorry, I can only include 2 links in this post as I’m a new user)

What do people think?

Is there another option I haven’t thought about?

Since we’re near April 1st, there’s also the idea of using a random, rarely used unicode character (https://twitter.com/samuel_colvin/status/1472283581087158273):

from whatever_library import orm_decorator

@orm_decorator
class Farmer:
    ᚑname: str
    ᚑfields: List[str]  # this works fine

farmer = Farmer(name='Jones', fields=['meadow', 'highlands'])
print(farmer.ᚑfields)  # -> ['meadow', 'highlands']

This already works, but don’t do it!
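A quick way to check which prefixes the current lexical rules accept is `str.isidentifier()`. The sketch below only tests the candidate characters mentioned in this post; the bare `Farmer` class is a stand-in, not part of any library.

```python
# str.isidentifier() reports whether a string is a valid Python identifier.
candidates = ["fields", "_fields", "ᚑfields", "$fields", "@fields"]
valid = {name: name.isidentifier() for name in candidates}


class Farmer:
    pass


farmer = Farmer()
# setattr() accepts any string as an attribute name, but only names that are
# valid identifiers can be reached with the normal dot syntax.
setattr(farmer, "ᚑfields", ["meadow", "highlands"])
via_dot_syntax = farmer.ᚑfields
```

Here `valid` marks the Unicode-prefixed name as an identifier while `$fields` and `@fields` are rejected.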

guido · March 22, 2022, 6:22pm

We’re coming pretty close to April 1 again…

On the assumption you’re actually serious: allowing a new character in identifiers would just postpone the issue – sooner or later someone else in a totally different area adopts that same character for a different special case, and at some point the two conventions clash.

Python itself uses dunders for this kind of thing (though not consistently).

Using @ is impossible since it is already a token used for decorators.

Using $ might work, but will be really confusing to anyone who is used to its special meaning in shell, Makefile, and many other languages (Perl comes to mind :-).

Can’t you just use a single leading _?

samuelcolvin · March 22, 2022, 11:14pm

Can’t you just use a single leading _ ?

Well yes, but that’s pretty confusing since it suggests the method is private. Hence why all implementations I know of except namedtuples do something else.

Forgot to say above, the reverse of what I suggested might work better - methods/properties use the $ or similar prefix, and field names stay vanilla. So you have model_instance.fields for the field and model_instance.$fields for the library method to get information about the model’s fields.

We’re coming pretty close to April 1 again…

The thing is, there are already thousands of non alphanumeric characters that can be used in identifiers (including as the first character), see https://gist.github.com/samuelcolvin/68f56c96bf18b03bb8a3ffa988380966. E.g. the weird cross I showed above, or this forward slash: Ⳇ (that’s a coptic capital alfa). The only problem is that they don’t appear on typical keyboards so are hard for most developers to use.

In summary: allowing one more character in identifiers would be really helpful and wouldn’t actually represent a fundamental change to what identifiers are.

The other option is to have a single letter namespace for methods, e.g. model_instance.m.json().
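A rough sketch of what such a single-letter namespace could look like, assuming a dataclass-based model. `ModelNamespace` and the `m` property are hypothetical names for illustration and do not describe any existing library.

```python
from dataclasses import asdict, dataclass
from json import dumps
from typing import List


class ModelNamespace:
    """Hypothetical holder for the library-provided methods of one instance."""

    def __init__(self, instance):
        self._instance = instance

    def dict(self):
        return asdict(self._instance)

    def json(self):
        return dumps(self.dict())

    def fields(self):
        return list(self.dict())


@dataclass
class Farmer:
    name: str
    fields: List[str]
    json: str

    @property
    def m(self):
        # "m" is the only name the library reserves on the model.
        return ModelNamespace(self)


farmer = Farmer(name="Jones", fields=["meadow"], json="raw")
```

With this layout `farmer.json` is the user's field and `farmer.m.json()` is the library method; the trade-off is that `m` itself can no longer be a field name.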

guido · March 22, 2022, 11:35pm

You’d have to write a PEP, and you’d have to argue it a bit better (using examples encountered in the real world, not seas/fields). Let’s see if there’s anyone else here who’s run into this problem, and how they dealt with it. (And what they do if a field is named e.g. ‘class’.)

samuelcolvin · March 22, 2022, 11:44pm

Of course. I’ve asked to present it at the language summit, if accepted perhaps I can argue it (better) then.

WRT class etc. - in pydantic we allow “aliases” (basically alternative external names for fields) which take care of this case as well as field names like “kebab-case”.
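The alias idea can be illustrated with the standard library alone. The sketch below is not pydantic's API; it stores a hypothetical "alias" key in dataclass field metadata, and `from_external` / `to_external` are made-up helper names.

```python
from dataclasses import dataclass, field, fields


@dataclass
class HtmlElement:
    # "class" is a keyword and "data-id" is not a valid identifier, so the
    # Python names differ from the external names kept under "alias".
    class_: str = field(metadata={"alias": "class"})
    data_id: str = field(metadata={"alias": "data-id"})


def from_external(cls, data):
    """Build an instance from a dict keyed by external (aliased) names."""
    kwargs = {f.name: data[f.metadata.get("alias", f.name)] for f in fields(cls)}
    return cls(**kwargs)


def to_external(obj):
    """Return a dict keyed by external (aliased) names."""
    return {f.metadata.get("alias", f.name): getattr(obj, f.name) for f in fields(obj)}


element = from_external(HtmlElement, {"class": "btn", "data-id": "42"})
```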

There are lots of real world examples - people regularly want to use json, fields, dict and many other names for fields.

fonini · March 22, 2022, 11:49pm

I always had the impression that _semi_dunder_names_, starting and ending with one underscore, were a sort of informal standard for “names that are reserved for use by libraries”: they should never be defined by end users (unless as part of a protocol with semantics defined by a library), but since they’re not reserved for the language itself, libraries are free to choose what to do with them.

Of course, this doesn’t work 100%: there’s quite a long stack of “actors” between “the language” and “the end user”, and I don’t think it’s always clear who has the prerogative to define the meaning of such names. Anyway, it’s a convention that’s always worked quite well in my experience.

EDIT: there are many, many more of those in the stdlib than I thought:

~/cpython$ ack -w '_[a-zA-Z][a-zA-Z_]*[a-zA-Z]_' -o | \
         > cut -d: -f3 | sort -u | head
_aa_
_abc_
_abstract_
_after_
_align_h_
_all_
_all_bits_
_always_
_AM_COND_VALUE_
_and_

~/cpython$ ack -w '_[a-zA-Z][a-zA-Z_]*[a-zA-Z]_' -o | \
         > cut -d: -f3 | sort -u | wc
    182     182    2038
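Two standard-library cases of this single-underscore-both-sides pattern, shown as a small sketch: `ctypes` reads a structure layout from `_fields_`, and `enum` members expose `_name_` and `_value_`. The `Point` and `Color` classes are only examples.

```python
import ctypes
import enum


class Point(ctypes.Structure):
    # ctypes reads the structure layout from the "_fields_" attribute.
    _fields_ = [("x", ctypes.c_int), ("y", ctypes.c_int)]


class Color(enum.Enum):
    RED = 1


point = Point(1, 2)
red = Color.RED
red_name = red._name_
red_value = red._value_
```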

steven.daprano · March 23, 2022, 12:47am

This has been a (non?) problem since the early days of Python 1.x, when the Python cookbook published a recipe for the “bunch” class:

https://code.activestate.com/recipes/52308/

It is also why Python dicts haven’t copied Javascript in allowing dot access to key/value pairs.

Any time you then try to use the same interface for both the object’s API and the user-specified data, you run into the problem that they can collide. So in that sense, it is a problem, but in another sense it is a non-problem: provide two interfaces, one which is purely used for your object API, and one for user data:

dict.update

dict['update']

Problem solved. For some definition of “solved”.
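To make the collision concrete, here is a small sketch of a dict subclass that adds dot access to keys. `AttrDict` is a hypothetical class written for this illustration, in the spirit of the “bunch” recipe.

```python
class AttrDict(dict):
    """Toy dict that also allows dot access to keys."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so real dict
        # methods such as "update" always win over keys of the same name.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None


record = AttrDict(name="Fred", update="2022-03-23")
name_via_dot = record.name           # falls through to the key
update_via_index = record["update"]  # subscript access always reaches the data
update_via_dot = record.update       # the dict method, not the key
```

Subscript access stays unambiguous, while dot access silently returns the method for any key that shares a name with the API.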

Another solution is “stropping”:

This has been discussed before, in the context of allowing reserved words as identifiers:

https://mail.python.org/archives/list/python-ideas@python.org/thread/3BJLET3HCEZTTAP45HHL7W36X4RU54KT/#3BJLET3HCEZTTAP45HHL7W36X4RU54KT

Aside from the colour of the bikeshed (backslash, at-sign, dollar-sign, something else?), we need to argue precedence. Consider a mapping class that allows dot access to keys:

mapping.name

Under current behaviour, dot access first looks for the attribute associated with the instance (and so a key “name” will over-ride a method “name”). Changing that will break backwards compatibility, so presumably we don’t change that.

This implies that the stropped version

mapping.\name

will have the opposite effect, skipping the instance attribute (the field) and allowing access to the class attribute (presumably a method).

But that is the reverse of the effect we might want for verbatim names, where the name with a sigil overrides the reserved name:

if = value   # syntax error due to reserved word

\if = value  # verbatim name allows use of reserved word

I’m not sure how to reconcile the two without breaking backwards compatibility.
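The current lookup order described above can be checked with a toy class. This `Mapping` class is hypothetical and stores keys as instance attributes, which is one possible way to implement dot access.

```python
class Mapping:
    """Toy mapping with dot access; keys are stored as instance attributes."""

    def __init__(self, **keys):
        self.__dict__.update(keys)

    def name(self):
        return "the library method"


mapping = Mapping(name="the user's key")

# Plain functions are non-data descriptors, so the instance attribute
# shadows the method on normal dot access.
via_instance = mapping.name

# The method is still reachable by going through the class explicitly.
via_class = type(mapping).name(mapping)
```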

ferdnyc · March 23, 2022, 5:43am

Riverbank used trailing underscores, rather than leading ones, for the Qt methods that were special names in Python. Even ones where they didn’t strictly have to (or maybe they did, in Python 2?), like QApplication.exec(). That used to be called QApplication.exec_(), and is still available under that name as a deprecated alias.

The trailing-rather-than-leading thing nicely skirts around the whole loaded question of member privacy, so that’s a plus.

hynek · March 23, 2022, 8:16am

Since I’ve been asked for feedback re: attrs (but I think that applies to dataclasses as well):

I see no benefit of adding methods to generated classes at all. I find the current approach of keeping a private state and then using functions to work on it much cleaner, easier/safer to extend and all-in-all user-friendlier. I chose it specifically because I believe that keeping model classes as clean as possible is a virtue (probably also due to the experience with namedtuple’s underscore shenanigans). If a user wants methods that do that specific work, they can add the methods themselves and call functions from there.

Therefore this whole problem looks like self-inflicted pain to me, not worth solving at a language level.

Tinche · March 23, 2022, 3:21pm

Hi. I’m a major contributor to attrs (second only to Hynek in commit count) and I’m also working on open sourcing a Mongo ODM based on attrs, so this has been fresh in my mind. As it happens, 6 days ago I started an issue over at the attrs repo to brainstorm exactly some of these use cases, with a slightly different focus.

In my opinion this is definitely an issue for tooling, not for the language itself. Two approaches available today:

the attrs/dataclass approach of fields(Model)
the SQLAlchemy approach of having the attributes under an easy to use attribute (Model.c), with the added ability to choose a different attribute name in case c is unavailable

I think these are very adequate in a runtime context.

I think your approach also precludes using the same model in multiple contexts.

@orm_decorator
@json_decorator
class Farmer:
    $name: str
    $fields: List[str]

Farmer().fields() # Do the attributes come from the ORM or JSON?

Whereas the attrs approach just works:

from orm import fields as orm_fields
from myjson import fields as json_fields

orm_fields(Farmer())
json_fields(Farmer())

You need different attributes in different contexts because each library might add specific functionality to the attributes, like the ORM overriding operators so you can make queries and a JSON library providing OpenAPI support, things of this nature.
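A minimal sketch of how two libraries can coexist on one model under this approach. Everything here is hypothetical: the two decorators, the `__orm_fields__` / `__json_fields__` attribute names, and the stored values stand in for whatever a real ORM or JSON library would keep.

```python
def orm_decorator(cls):
    # Hypothetical ORM: record a column description per annotated attribute.
    cls.__orm_fields__ = {name: f"column:{name}" for name in cls.__annotations__}
    return cls


def json_decorator(cls):
    # Hypothetical JSON library: record a key description per attribute.
    cls.__json_fields__ = {name: f"key:{name}" for name in cls.__annotations__}
    return cls


def orm_fields(obj):
    return type(obj).__orm_fields__


def json_fields(obj):
    return type(obj).__json_fields__


@orm_decorator
@json_decorator
class Farmer:
    name: str
    fields: list

    def __init__(self, name, fields):
        self.name = name
        self.fields = fields


farmer = Farmer("Jones", ["meadow"])
```

Each library keeps its metadata under its own dunder name and exposes it through its own function, so neither competes for `Farmer().fields`.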

Anyway, my biggest issue currently is the static analysis context. I’ve merged some work to Mypy recently to expose properly typed attributes under the __attrs_attrs__ magic fields, but due to some limitations I wasn’t aware of (Mypy doesn’t support generic classvars in protocols) I’ll need to implement attrs.fields in the Mypy plugin itself too. This is going to be the next frontier where we can make very exciting things happen, I think.

samuelcolvin · April 4, 2022, 2:29pm

Thanks everyone for your input.

I still think this is a significant problem worthy of a solution. I don’t think most people are taking on board quite how lenient identifiers are - not allowing $ is the exception to the rule of allowing many many symbols at the start of identifiers.

Still, I see that I’m very unlikely to persuade most people of my suggestion. So I see no point in continuing to expend energy on this.

On the specific case of pydantic, I’m going to rename all public fixed methods to either use a trailing underscore or be prefixed by model_ or similar, then prevent (or warn about) using that pattern in field names.

E.g. json() will become either model_json() or json_().
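As a rough idea of how a reserved prefix could be enforced, here is a sketch using `__init_subclass__`. It is not pydantic's implementation; the `Model` base class, the `RESERVED_PREFIX` constant and the error message are all assumptions made for this example.

```python
import json

RESERVED_PREFIX = "model_"


class Model:
    """Hypothetical base class that reserves the "model_" prefix."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        annotations = getattr(cls, "__annotations__", {})
        clashes = [n for n in annotations if n.startswith(RESERVED_PREFIX)]
        if clashes:
            raise TypeError(f"field names may not start with {RESERVED_PREFIX!r}: {clashes}")

    def __init__(self, **data):
        self.__dict__.update(data)

    def model_json(self):
        return json.dumps(self.__dict__)


class Farmer(Model):
    json: str
    fields: list


farmer = Farmer(json="raw", fields=["meadow"])

try:
    class Broken(Model):
        model_json: str
except TypeError as exc:
    error = str(exc)
else:
    error = None
```

A field called `json` is now allowed, while a field called `model_json` is rejected when the class is defined.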

stoneleaf · April 6, 2022, 8:11pm

Many thanks.

The solution I used for my dbf library is the same as @Tinche’s: I moved the methods out of the record class (which supports both attribute and index lookup), and into the module as separate functions. So instead of record.delete() to delete a record, it became dbf.delete(record).

I would recommend the same approach for pydantic – I personally hate trailing underscores, and a method_ prefix is very verbose.

guido · April 6, 2022, 8:24pm

The <module>. prefix is also pretty verbose though. What’s wrong with a leading underscore, other than that people might mistake it to mean “internal” or “private” or some such?

stoneleaf · April 6, 2022, 8:39pm

The internal/private suggestion of a leading underscore is pretty significant. As a module level function, a “normally” named routine could be imported directly, or the module could be aliased to a much shorter name; e.g. pyd instead of pydantic.

daniele · April 6, 2022, 8:45pm

I know there is not much consistency in this, but the operator module (I would guess one of the oldest modules in the standard library) uses a trailing underscore for symbols that would clash with keywords. This is also what I use in my code. It is not aesthetically very pleasing but it is effective and it solves the issue of people mistaking the symbols as private.
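For example, these functions in the operator module carry a trailing underscore because the plain names (`and`, `or`, `not`, `is`) are keywords:

```python
import operator

# Each function mirrors an operator whose name is a reserved word.
bitwise_and = operator.and_(0b110, 0b011)
bitwise_or = operator.or_(0b110, 0b011)
negated = operator.not_(0)
same_object = operator.is_(None, None)
```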

CAM-Gerlach · April 7, 2022, 8:16pm

FWIW, this is also the convention I’ve seen in a number of other major libraries, e.g. scikit-learn

samuelcolvin · April 21, 2022, 6:21am

Seems to me you’re all inadvertently making a good case for a fix - there seems no agreement whatsoever about which of the workarounds is the least worst…

guido · April 22, 2022, 12:37am

So find a core dev or PEP editor who is willing to sponsor the PEP you’re going to write about this.

storchaka · April 22, 2022, 4:38am

I concur with Ethan. We write len(obj), and not obj._len(), obj.len_() or obj.$len(). If we want the behavior depending on the object, we add a dunder attribute or method, but use an external function in user code.

In your example I suggest adding an attribute __fields__ or __mylib_fields__ to your class and adding a global function fields() for convenience.

And I think that we should set an example by moving from underscored attributes in named tuples to dunders and global functions. It would be nice if the same function worked with dataclasses and named tuples, and user classes that implement some protocol.
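One possible shape for such a function, as a sketch only. The `fields()` helper below and the `__mylib_fields__` protocol attribute are hypothetical; the dataclass and named tuple branches rely on the existing `dataclasses.fields()` and `_fields` APIs.

```python
import dataclasses
from collections import namedtuple
from typing import List


def fields(obj):
    """Return field names for a class or instance (hypothetical helper)."""
    cls = obj if isinstance(obj, type) else type(obj)
    declared = getattr(cls, "__mylib_fields__", None)
    if declared is not None:
        # User classes opt in by defining the dunder attribute.
        return tuple(declared)
    if dataclasses.is_dataclass(cls):
        return tuple(f.name for f in dataclasses.fields(cls))
    if issubclass(cls, tuple) and hasattr(cls, "_fields"):
        return tuple(cls._fields)
    raise TypeError(f"{cls.__name__} does not declare any fields")


@dataclasses.dataclass
class Farmer:
    name: str
    fields: List[str]


Fisherman = namedtuple("Fisherman", ["name", "seas"])


class Shepherd:
    __mylib_fields__ = ("name", "flock")
```

As with `len(obj)`, the user-facing call is a plain function, and a field named `fields` on the model does not interfere with it.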

guido · April 22, 2022, 4:52am

Now that would be a proposal I could get behind.