Introduce a "bareword" list/dict literal

One of the annoyances when I jumped from Perl to Python many years ago was the lack of a qw-equivalent operator/literal to make the common pattern of defining a long list of words/identifiers clean and easy. The Perl expression:

qw(foo bar baz)

is semantically equivalent to the list:

"foo", "bar", "baz"

Wouldn’t it be nice if we could take advantage of one of the currently meaningless pairs of symbols, say < and >, to denote a list literal in which barewords are parsed not as names but as strings, with commas being optional?

Taking opcode.py as an example, instead of writing:

__all__ = ["cmp_op", "stack_effect", "hascompare", "opname", "opmap",
           "HAVE_ARGUMENT", "EXTENDED_ARG", "hasarg", "hasconst", "hasname",
           "hasjump", "hasjrel", "hasjabs", "hasfree", "haslocal", "hasexc"]

we can write:

__all__ = <cmp_op stack_effect hascompare opname opmap
           HAVE_ARGUMENT EXTENDED_ARG hasarg hasconst hasname
           hasjump hasjrel hasjabs hasfree haslocal hasexc>

Furthermore, Perl also allows keys in hashes (what we call dicts in Python) to be barewords when they conform to the rules of an identifier, so:

%hash = (foo => 3, bar => 8, baz => 5);

is semantically equivalent to:

%hash = ("foo" => 3, "bar" => 8, "baz" => 5);

So perhaps we can make < and > denote a dict as well, where bareword keys are parsed as strings, and, while we’re at it, make commas optional when keys are on separate lines.

Taking opcode.py again as an example, instead of writing:

_cache_format = {
    "LOAD_GLOBAL": {
        "counter": 1,
        "index": 1,
        "module_keys_version": 1,
        "builtin_keys_version": 1,
    },
    "BINARY_OP": {
        "counter": 1,
    },
    ...
}

we can write:

_cache_format = <
    LOAD_GLOBAL: <
        counter: 1
        index: 1
        module_keys_version: 1
        builtin_keys_version: 1
    >
    BINARY_OP: <
        counter: 1
    >
    ...
>

or make it even cleaner by going full-blown YAML-like and letting indentation imply dict nesting:

_cache_format = <
    LOAD_GLOBAL:
        counter: 1
        index: 1
        module_keys_version: 1
        builtin_keys_version: 1
    BINARY_OP:
        counter: 1
    ...
>

And while we’re at it, we can generalize indentation-implied nesting to lists as well, so that this dict of lists from _opcode_metadata.py:

_specializations = {
    "RESUME": [
        "RESUME_CHECK",
    ],
    "TO_BOOL": [
        "TO_BOOL_ALWAYS_TRUE",
        "TO_BOOL_BOOL",
        "TO_BOOL_INT",
        "TO_BOOL_LIST",
        "TO_BOOL_NONE",
        "TO_BOOL_STR",
    ],
    ...
}

can be optionally written as:

_specializations = <
    RESUME:
        RESUME_CHECK
    TO_BOOL:
        TO_BOOL_ALWAYS_TRUE
        TO_BOOL_BOOL
        TO_BOOL_INT
        TO_BOOL_LIST
        TO_BOOL_NONE
        TO_BOOL_STR
>

Note that only bareword keys are parsed as strings, while dict values are parsed normally as expressions.
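To make that rule concrete, here is an illustrative sketch of how a bareword dict literal would map onto today’s syntax (the names are made up for the example):

DEFAULT_TIMEOUT = 30

# proposed:
#     config = <timeout: DEFAULT_TIMEOUT  retries: 3>
# would be equivalent to today's:
config = {"timeout": DEFAULT_TIMEOUT, "retries": 3}
# the bareword keys become the strings "timeout" and "retries",
# while the value DEFAULT_TIMEOUT is still looked up as a name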

Lastly, we can allow items and keys to be optionally quoted for more flexibility:

commands = <
    add
    find
    "list-servers" # we can arguably make quotes optional here too since dashes 
    "list-clients" # don't make these words syntatically ambiguous
    remove
>

By removing the noise of quotes and commas from these literals, the definitions become more readable and easier to maintain.

The current workaround for defining a long bareword list is to use textwrap.dedent and str.splitlines on a triple-quoted string:

import textwrap

commands = textwrap.dedent("""\
    add
    find
    list-servers
    list-clients
    remove
""").splitlines()

But still it would be cleaner if Python supported the usage natively.

2 Likes

I’ll preface by saying that I think the treatment of unquoted symbols as strings, which Perl and Ruby both support (I’m personally more familiar with the Ruby variants), is unnecessary and makes code more confusing. I much prefer the Python way, in which strings are consistently quoted and no special syntaxes alter that.

How does this make the code easier to maintain?
I see that it removes some characters and makes the code marginally shorter, but that’s a marginal gain. I don’t think you can treat this as self-evident. You’ll need to make a more complete argument if you want to convince people, especially skeptical people like me, that Python needs some form of bare-word array.

I can’t recall ever seeing code which uses a multi-line string as a string array like this. Can you cite one or more examples in the stdlib or in mainstream Python packages or programs?

Projects like codespell use independent text data files, but that’s much more a matter of separating code from data than anything else. That’s the closest thing I can think of.
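For reference, the data-file approach looks roughly like this; a minimal sketch, not codespell’s actual mechanism, and the filename is made up:

from pathlib import Path

# one entry per line, kept in an external text file rather than in the source
commands = Path("commands.txt").read_text(encoding="utf-8").split()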

8 Likes

I’m not a fan of this.

It’s confusing and implicit. Does < thing1 thing2 ...> make a list, a set, or some other kind of iterable?

Explicit is better than implicit. I don’t see an issue with the current syntax in this case.

If you insist on this type of thing, do:

__all__ = '''cmp_op stack_effect hascompare opname opmap
           HAVE_ARGUMENT EXTENDED_ARG hasarg hasconst hasname
           hasjump hasjrel hasjabs hasfree haslocal hasexc'''.split()

Edit:

If you don’t like quotes in the dict case, do:

_cache_format = dict(
    LOAD_GLOBAL=dict(
        counter=1,
        index=1,
        module_keys_version=1,
        builtin_keys_version=1,
    ),
    BINARY_OP=dict(
        counter=1,
    ),
    ...
)

This is very explicit and easy to read as well.

19 Likes

It’s one of the reasons why YAML has become a preferred configuration file format for many projects: most strings can be left unquoted as long as they don’t contain special characters, so they are easy to write and clean to look at.

You don’t see such a workaround in any reputable big project because it makes the code slower to load and incompatible with linters, which understand foo as a name reference in __all__ = ['foo', 'bar'] but don’t if it’s written as __all__ = 'foo bar'.split(). The proposed new syntax would make it both efficient to load and compatible with linters.
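To illustrate the linter point, compare the two forms; this is only a sketch of the difference, not the output of any particular tool:

# A linter can cross-check each string here against the names defined in the module:
__all__ = ["foo", "bar"]

# It generally can't with the workaround, because the list only exists at runtime:
__all__ = "foo bar".split()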

The proposed syntax would be useful for lists that are tightly related to the code itself (such as __all__ or a list of commands) and/or not quite long enough to justify living in a separate text file.

2 Likes

Why do you introduce confusion yourself by mentioning a set or an iterable when I only propose this syntax for a list and a dict?

Do you mistake [1] for a set? No, because the docs say [...] is a list literal. We all learned that. The same applies to <...>: we can learn that too. Note also that we all learned to distinguish a dict literal from a set literal by looking for a colon in the {...} syntax.

See my previous post for the reasoning about loading time and linter compatibility.

There’s a reason why in most established codebases {key: value} is preferred over dict(key=value): the former is cleaner to look at. This proposal would make the former look even cleaner when the keys are strings. It also allows the strings to be optionally quoted as necessary, whereas keyword arguments to the dict constructor simply can’t, because they have to conform to identifier rules.

4 Likes

Updated my original post to optionally adopt a more YAML-like syntax for a dict literal by making indentation imply dict nesting:

_cache_format = <
    LOAD_GLOBAL:
        counter: 1
        index: 1
        module_keys_version: 1
        builtin_keys_version: 1
    BINARY_OP:
        counter: 1
    ...
>
1 Like

I very strongly disagree. YAML is popular for a variety of reasons, but I wouldn’t look at it as a model of good design.

Specifically, regarding the “benefit” of allowing unquoted strings, it leads to all kinds of bugs.

The following YAML loads differently depending on the parser used:

yes: 22:22

That may be equivalent to {true: 1342}, if you use a particularly cursed parser.

Another classic I see a lot of is

python-version: 3.10

(granted, that one’s also possible in JSON, but the YAML habit of leaving strings unquoted makes it harder to spot)
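For anyone who wants to reproduce that one, a quick sketch assuming PyYAML is installed:

import yaml

print(yaml.safe_load("python-version: 3.10"))
# {'python-version': 3.1} -- the unquoted value is resolved as a float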

I consider it a good sign when I see engineers produce YAML files in which all or most strings are quoted. That way, “no” won’t turn into “false” by surprise.

14 Likes

Did I ever propose to adopt YAML in its entirety? No. So why are you pointing out one of the rarely used features of YAML that I didn’t include in my proposal as a counter-point?

The bareword string feature of YAML is one of the reasons why it became so popular. Enforced indentation, as in Python, is another, and I have now added it as a possible inclusion in the new syntax.

2 Likes

Let’s please not turn Python into YAML.

YAML has become a preferred configuration file format for many projects.

Not everyone likes YAML. I find it unnecessarily complex and confusing, to the point where the moment I find out that something requires me to write YAML, I look for an alternative product. If I have to use YAML, I end up cargo-culting it.

Regardless of YAML, using <> for this purpose is incredibly unintuitive. Why shouldn’t we use <> for inline XML like JSX/TSX do? JS/TS has Electron, people have written transpilers for Python → JS, and people have written webapps in Python. I could easily see someone advocating for Electron-style apps in Python, and they’d have a far better argument for using <> for something convenient to them than any argument for using it for weirdly-represented dicts.

(I’m not even a fan of the idea of Electron apps, I prefer toolkits that use native solutions like wxWidgets, but there’s still a strong argument to reserve this kind of usage of <> for if inline XML is ever needed for any reason.)

One of the annoyances when I jumped from Perl to Python many years ago was the lack of a qw-equivalent operator/literal

If you prefer Perl, use Perl! :slight_smile:

Personally I actually use “qw-equivalent literals” in Python pretty frequently, just fine, with 'foo bar baz'.split(), similar to what @csm10495 suggested. If you want it to be even more like Perl, you could even write def qw(s): return s.split() and then you can type qw('foo bar baz') so that the only difference is the quotes.
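For example, a minimal sketch of that helper and how it reads in use:

def qw(s):
    """Split a whitespace-separated word list into a list of strings."""
    return s.split()

commands = qw("add find list-servers list-clients remove")
# ["add", "find", "list-servers", "list-clients", "remove"]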

The Zen of Python:

There should be one-- and preferably only one --obvious way to do it.

Once again, I’m not adopting YAML in its entirety, but rather drawing inspiration from one and only one of its best features, bareword strings (or two if you count indentation implying nesting), neither of which, I believe, is what you are referring to as complex or confusing in YAML.

< and > are merely my suggestion, and I welcome an alternative, more readily understood syntax that implements the same idea, such as prefixing [ and { with \, much like prefixing " with r to denote a raw string; personally, though, I think people can get used to angle brackets more quickly.

We can close about half of the idea posts here if all it takes is for someone to reply “If you like X feature from Y language, use Y language!”. Obviously I prefer Python for many of its strong points over other languages, but that doesn’t mean it cannot draw inspiration from some other languages for improvements.

What’s wrong with native support in Python for better speed and linter compatibility? Again, there’s a good reason why no big projects use this workaround.

2 Likes

These issues are relevant because the allowance for unquoted strings is key to the deceptive “simplicity” of YAML and its many snares and footguns.

I said earlier that you need to make a complete case for how this simplifies or improves maintenance, to which your reply was, to crudely paraphrase, “this is why YAML is popular”.

I’m not sure if that even is an argument for this feature. It seems like a pretty bold assertion about why YAML is widely used. (Personally, I attribute it to the lack of standardization around JSONC.) Is there evidence to back up that claim?

Even that aside, my broader point is that YAML doing something is not a great foundation for an argument. Given that YAML has major problems, I just don’t think this is likely to convince people, myself included, that this would be an improvement.


Going back to the original question I asked, you say this will improve maintenance. It sounds like you’re primarily concerned with __all__. Why is this a better way to write the __all__ tuple?

Is that the only case? What percentage of time in application and library maintenance is spent on those tuples? 0.1%? Less? It’s definitely not a lot of time and it’s not difficult maintenance today.

Improving project maintenance isn’t the only reason for a feature. Is this a QoL improvement for everyday usage? If so, how often does it come up? In what contexts?

Bare word arrays might have enough value to be worth adding to Python, but the case in favor needs to be made clearly so that it can be weighed against the cost.

No, they are not relevant, because I did not include them in my proposal. Once again, my proposal is not “hey, let’s embed YAML in Python”, but rather to adopt only one or two of the most commonly used features from it while avoiding all the other features that make it overly flexible and sometimes a complexity/ambiguity nightmare as a result.

Why do you think there’s only this one usage just because I used __all__ as an example (even though it is indeed a good example, because many modules do have a long one)? Of course I’m not going to enumerate every possible usage of a list of words or a dict of simple key-to-value mappings, at the risk of boring everyone to sleep. But as another example, you can go a little further than opcode.py to the _opcode_metadata.py module that it imports, which contains a good number of long lists and dicts that could benefit from this cleaner syntax.

1 Like

Updated my original post with a possible generalization of indentation-implied nesting for lists, where we can optionally write this dict of lists from _opcode_metadata.py:

_specializations = {
    "RESUME": [
        "RESUME_CHECK",
    ],
    "TO_BOOL": [
        "TO_BOOL_ALWAYS_TRUE",
        "TO_BOOL_BOOL",
        "TO_BOOL_INT",
        "TO_BOOL_LIST",
        "TO_BOOL_NONE",
        "TO_BOOL_STR",
    ],
    ...
}

as:

_specializations = <
    RESUME:
        RESUME_CHECK
    TO_BOOL:
        TO_BOOL_ALWAYS_TRUE
        TO_BOOL_BOOL
        TO_BOOL_INT
        TO_BOOL_LIST
        TO_BOOL_NONE
        TO_BOOL_STR
>

This definitely reads as putting words in my mouth.
I mentioned __all__ because it seemed to be your area of focus. However, I also asked a leading follow-up immediately afterwards.

I’m trying to encourage you to flesh out your argument and make it clear why this is better.

Not everyone is going to have the same opinion that you have. If you just say that something is “cleaner syntax” and call it a day, you won’t be able to convince someone who doesn’t already agree. We all have different experiences and different preferences.

Right now, that contribution – which was meant to be primarily positive here – does not seem to be having the desired effect. The whole thread feels relatively hostile, so I’ll bow out of this one and hopefully you can flesh out the argument better with someone else’s help.

5 Likes

I get that you’re not trying to turn Python into YAML, but not enforcing quoted strings is, AFAICT, the most universally disliked feature of YAML. I think your argument would be much stronger if you just stopped mentioning YAML altogether (it’s a very divisive format) and argued how this would make Python better on its own terms.

Personally, I love explicit quotes and would not want it any other way, even if I sometimes forget them.

5 Likes

We can, and we probably should. If that’s the only argument for the feature, it’s not a good enough argument. If you can’t justify a feature in terms of its benefit to Python users, and its appropriateness in Python, then it shouldn’t be added.

Inspiration, yes of course. And examples of prior art also. But again, that’s not a persuasive argument.

You need a good argument for your proposal over "a b c".split() - not just personal preference, not “performance” unless you can demonstrate that the import-time improvement matters in real-world cases. Establishing a case for a new feature in Python is pretty hard - and it should be, because it has a significant impact.

In fact, I quite liked Perl’s qw/…/ construct, but it’s very much in line with Perl’s philosophy and approach, and IMO that’s not the case for Python.

15 Likes

The desire for that ease and cleanliness is quite understandable, and when seen separately from surrounding Python code that does use quotes, the bareword style has a clean and casual look. But we should consider here the wisdom behind the oft-quoted, though ironic, thirteenth line of the Zen:

There should be one-- and preferably only one --obvious way to do it.

Admittedly, there is already more than one obvious way to denote a string literal, with single and double quote delimiters, triple-quoted multiline string literals, f-strings, and the like. However, what all those have in common is the visual cue of quotes as delimiters. A mix of sections of Python code with and without those quotes risks creating a visual dissonance of style that detracts from readability. The contrast would require cognitively switching back and forth between ways of distinguishing names from literals, as if we were reading two different languages. Then the gain in ease for the writer of the code becomes a loss of ease for those who need to read and maintain it.

3 Likes

I disagree. First, the comparison is between {"key": value} and dict(key=value); you forgot the quotes around key in the dict literal, and that makes a big difference.
Second, readability is not the reason why dict literals are sometimes preferred over keyword arguments to the dict constructor. As far as I know, there are three reasons:

  • support for non-identifier keys, such as 5, (7, "bramb") or "here it is", or evaluating a variable as a key
  • execution time performance, since it saves a function call (albeit implemented in C)
  • the risk of the dict name being shadowed by a variable.

Now, personally, I find the second syntax so much more readable and quicker to write that whenever I can afford for all my keys to be identifiers, I always opt for it, and to hell with the minor performance drawback.

In the case of the list proposal, I just don’t really see the point, but I don’t really mind.
In the case of the dict proposal, however, I think mixing the dict-literal and dict-kw-call syntaxes is a massive footgun which will inevitably lead to people defining the "a" key when they meant a key taken from the value of the a variable, and vice versa. Examples of other languages “adding quotes automatically” (let’s phrase it like that), at least the ones I had the displeasure of working with (such as JS), only comfort me in that opinion.
That, and the fact that you actually made that very mistake in the example I’m quoting (which can happen to anyone; I did it myself a few days ago, we’re humans after all).
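A concrete sketch of that footgun in today’s terms (the last comment speculates about what the proposed literal would mean):

a = "name"

d1 = {a: 1}      # key is the *value* of a        -> {"name": 1}
d2 = dict(a=1)   # key is the literal string "a"  -> {"a": 1}

# Under the proposal, <a: 1> would presumably mean {"a": 1},
# which is easy to misread as {a: 1}.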

2 Likes

"a b c".split() … not “performance” unless you can demonstrate that the import-time improvement matters in real-world cases.

I’m sure that could be optimized away in the compiler if it saw wide use anyway.
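For what it’s worth, current CPython does not fold it; a quick way to check:

import dis

# The .split() call is still performed when the module body runs:
dis.dis(compile("__all__ = 'foo bar'.split()", "<example>", "exec"))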

2 Likes

Explicit is better than implicit. If the example from opcode.py isn’t good enough, there are options such as

__all__ = ("cmp_op stack_effect hascompare opname opmap "
       "HAVE_ARGUMENT EXTENDED_ARG hasarg hasconst hasname "
       "hasjump hasjrel hasjabs hasfree haslocal hasexc").split()

that work perfectly well. While I can appreciate the convenience a Perl programmer might find in the transition to Python, I don’t feel it would be healthy to pander to that need at the cost of unnecessary syntactic complexity.

1 Like