Introduce a "bareword" list/dict literal

kenahoo · December 29, 2023, 11:23pm

Nobody, certainly not OP, has claimed that features of one language should be added to another just because they exist. OP is right to point out that this feature in Perl is extremely commonly used, and the use case in Perl is essentially identical to its proposed use case here, indicating very strong evidence that it would be widely used and appreciated in Python too.

As for me, I do miss both the qw() and => auto-quoting Perl behaviors every day, and I haven’t programmed seriously in Perl for at least 15 years. I do wish Python could find a way to eliminate all the extra quotes all over the place, the way it was so keen on eliminating curly braces and semicolons. So count me as someone supportive of the basic idea.

I’m not OP, but for me the big differences between them are

The proposed addition would be a literal, whereas "a b c".split() is a runtime function call.
IDEs will know about (and support) literals, including understanding that the terms can be line-wrapped by linters, etc. This would never be true of a runtime function call, so the maintenance situation is indeed worse for the split alternative.
It is not clear from the docs of split whether it splits on all whitespace characters (vertical tabs? Wide space characters? Non-breaking spaces? There are 25 Unicode whitespace characters) so it is not actually clear whether the split alternative is actually equivalent - presumably the proposed literal would only allow spaces and returns (personally I wouldn’t even allow tabs).
For long lists of strings, which is probably the use case where this proposal makes the most difference, you have to read all the way past the end of the string before you see the split and realize that this is actually a sequence of strings rather than a single string. The proposal makes that clear immediately.
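To make the first point concrete, here is a small sketch using the standard ast module. It only shows how the current parser classifies the two spellings; the helper name top_level_node is made up for illustration.

```python
import ast

def top_level_node(source: str) -> str:
    """Return the AST node type of a single expression (hypothetical helper)."""
    return type(ast.parse(source, mode="eval").body).__name__

# A list display is parsed as a List node whose elements are constants.
print(top_level_node('["a", "b", "c"]'))   # List
# The split spelling is a method call on one string constant.
print(top_level_node('"a b c".split()'))   # Call
```

A tool that works on the syntax tree sees three separate strings in the first case and one opaque string plus a call in the second.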

kenahoo · December 29, 2023, 11:38pm

For my taste, that saying is way overused and far from a solid argument for or against anything. There are many many examples in Python where an “implicit” construction is much more readable and widely accepted than its “explicit” counterpart.

For one example: for x in array vs. for i in range(len(array)): x = array[i].

For another example, context managers are highly implicit about what they’re doing, but so vastly superior in both readability and functionality to an explicit exception-based construction (and I would argue the readability benefit is much larger than the functionality benefit) that the benefit is unassailable.
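A rough sketch of that comparison, using io.StringIO as a stand-in resource (an arbitrary choice for illustration):

```python
import io

# Explicit version: the cleanup is spelled out with try/finally.
buf = io.StringIO("spam eggs")
try:
    explicit = buf.read().split()
finally:
    buf.close()

# Implicit version: the with statement closes the buffer on exit.
with io.StringIO("spam eggs") as buf2:
    implicit = buf2.read().split()

print(explicit == implicit, buf.closed, buf2.closed)
```

Both halves do the same work; the second simply leaves the cleanup unstated.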

Rosuav · December 30, 2023, 12:09am

Thanks to constant folding, this could functionally be equivalent to a literal. Which parts of this function are actual literals?

def f():
    x = (1, 2, 3)
    y = 3 + 4j
    z = "-" * 50
    return x in {(1,2,3), (4,5,6)}

Technically, only the integers and the 4j are actually literals. But thanks to constant folding, we have the following constants:

(1, 2, 3)
(3 + 4j)
A string containing fifty hyphens
frozenset({(1, 2, 3), (4, 5, 6)})

(That last one is only constant-folded when it’s used immediately in a condition, since otherwise the mutability of the set might be relevant.)
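One way to look at this yourself is to inspect the constants stored on the compiled function. This is only a sketch: the contents of co_consts are a CPython implementation detail, and exactly what gets folded can differ between versions.

```python
import sys

def f():
    x = (1, 2, 3)
    y = 3 + 4j
    z = "-" * 50
    return x in {(1, 2, 3), (4, 5, 6)}

# co_consts holds the constants the compiler stored for f.
# Its exact contents depend on the interpreter and its version.
print(sys.implementation.name, sys.version_info[:2])
for const in f.__code__.co_consts:
    print(repr(const))
```

On a recent CPython, the printed values should include the folded tuple and a frozenset rather than just the individual integers.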

In the same way, it would be entirely possible for a future version of Python to constant-fold methods off literals, such as:

"spam".upper()
(1234).bit_length()

Things ARE a bit trickier with string splitting, though, since "a b c".split() returns a list, not a tuple. If it returned a tuple, it would be easy to constant-fold it. But the benefit of replacing "a b c".split() with ["a", "b", "c"] is more dubious.
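A short illustration of why the list return type gets in the way (plain current behaviour, nothing hypothetical here):

```python
first = "a b c".split()
second = "a b c".split()

print(type(first).__name__)   # list
print(first == second)        # True: same contents
print(first is second)        # False: each call builds a new list

# Because the result is mutable, callers may change it in place,
# so one shared constant could not simply be handed out to everyone.
first.append("d")
print(second)                 # still ['a', 'b', 'c']
```

Any folded form would still have to build a fresh list each time, which is what the list display already does.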

Try it!

Why exclude tabs? I would assume that it would allow everything that is counted as whitespace between tokens. That is to say: Grammatically, this would take a series of NAME tokens and return their string forms.
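As a rough sketch of that idea, the standard tokenize module can already pull NAME tokens out of a run of words. The barewords helper below is hypothetical and only approximates what a real literal might do; it assumes the words are wrapped in parentheses so they can span lines.

```python
import io
import tokenize

def barewords(text: str) -> list[str]:
    """Hypothetical helper: return the NAME tokens in text as strings."""
    # Wrapping in parentheses lets the words span several lines without
    # the tokenizer treating leading whitespace as indentation.
    source = "(" + text + ")\n"
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.NAME]

print(barewords("cmp_op\tstack_effect\n    hascompare opname"))
# ['cmp_op', 'stack_effect', 'hascompare', 'opname']
```

Note that this sketch silently drops anything that is not a NAME token (numbers, operators, strings), whereas an actual literal would presumably reject them.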

Yes, this WOULD be an advantage. I’m not convinced that it’s big enough to make a difference, but it certainly is an advantage.

MegaIng · December 30, 2023, 12:34am

This could be added! Develop a plugin, ask maintainers to include it in their tools. Just like IDEs support detecting the first argument to re.compile as a regex pattern or linters can sometimes detect and warn about improper usage of SQL statements.

kenahoo · December 30, 2023, 12:47am

That’s true, but the fact that these don’t happen in any version of Python so far (AFAIK?) is probably good evidence that they won’t happen anytime soon and/or aren’t seen as a priority.

It doesn’t really matter much what it does empirically, though; what matters is the documented behavior.

That’s probably a good precedent, good point. It looks like the Python 3.7.17 documentation (section 2, “Lexical analysis”) defines the relevant whitespace as spaces, tabs, and form feeds (!). Since that’s talking about just within a single line, newline characters (and also carriage returns, I guess) would be added to that.

Rosuav · December 30, 2023, 12:54am

I dunno about that. Constant folding HAS been expanded in the past. If there’s value in it, it can be done. As long as the behaviour doesn’t change, optimizations can be added and removed based on what’s considered worth the internal complexity.

        When set to None (the default value), will split on any whitespace
        character (including \n \r \t \f and spaces) and will discard
        empty strings from the result.

Seems pretty clear to me… “any whitespace character” means “any whitespace character”. The rest is just some common ones.

Indeed. So anything that counts as the end of a physical line would also count.

kenahoo · December 30, 2023, 5:00am

Au contraire, though! This is why I brought it up - I suspected it actually doesn’t mean “any whitespace character” (which to me would mean anything Unicode defines as whitespace), so I dug around in the source, starting at split.h line 54. It’s using

the STRINGLIB_ISSPACE macro, which is defined in stringdefs.h as
Py_ISSPACE, which is defined in pyctype.h to use
PY_CTF_SPACE, which is defined in pyctype.c explicitly as only the characters '\t', '\n', '\v', '\f', '\r', and ' '.

Let me know if I misinterpreted one of those definitions, I’m not very experienced in the Python source, maybe there’s some conditional/locale stuff happening that I didn’t find?

In any case, I’d say this means the docs don’t effectively communicate what character set it splits on.

The above is a little bit tangential to the overall discussion about a bareword list/dict literal, but I’d actually argue that it matters - a bareword list/dict literal can only really be used in source code, not in data read in from other sources. So it makes sense to restrict its whitespace to whitespace characters that can effectively be used in source code.

On the other hand, split() is intended to work with any string, no matter its origin. To be honest, I’d be a little miffed if I read in some text data and split it using str.split() and then discovered that wide spaces (e.g. from a web page output or PDF text) weren’t considered “whitespace” by it. And then I’d probably dig around and find re.split(r'\s'), which is probably closer to what I want (because unlike str.split(), the \s character class is very explicit about what it matches) - but that’s a whole different world from dealing with readable literals in source code, and I feel like this helps illustrate that a little bit.

pf_moore · December 30, 2023, 5:25am

So file a docs bug? (And PR, if you are comfortable doing so).

ronaldoussoren · December 30, 2023, 9:04am

For strings, you need to look at Objects/unicodeobject.c (in particular the split function in that file), which uses the ucs*lib variants of stringlib (one for each size of code point storage), which define STRINGLIB_ISSPACE as Py_UNICODE_ISSPACE. That macro in the end calls _PyUnicode_IsWhitespace in Objects/unicodetype_db.h, which contains a much larger set of whitespace characters and is documented as:

/* Returns 1 for Unicode characters having the bidirectional
 * type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise.
 */
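The rule in that comment can be approximated from Python with the unicodedata module. This is a sketch, not the C implementation: is_space_by_rule is a made-up helper, and the sample characters are an arbitrary selection.

```python
import unicodedata

def is_space_by_rule(ch: str) -> bool:
    """Apply the rule from the quoted comment via unicodedata (a sketch)."""
    return (unicodedata.bidirectional(ch) in {"WS", "B", "S"}
            or unicodedata.category(ch) == "Zs")

# Tab, space, unit separator, NBSP, em space, ideographic space,
# zero width space (not whitespace), and a plain letter.
samples = ["\t", " ", "\x1f", "\u00a0", "\u2003", "\u3000", "\u200b", "a"]
for ch in samples:
    print(f"U+{ord(ch):04X}", ch.isspace(), is_space_by_rule(ch))
```

For these samples the two columns agree, which is consistent with str.isspace() being driven by the same database.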

Rosuav · December 30, 2023, 11:15am

Did you consider actually trying it in the REPL? It seems to split on Unicode whitespace just fine. So I don’t know what tracing you were doing or where something went wrong, but it definitely does split on every type of Unicode whitespace I tried.
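A quick check along these lines (the sample string is arbitrary) shows the behaviour. The bytes variant is included as a contrast, since bytes.split() splits on ASCII whitespace only, which may be the code path the earlier trace followed.

```python
# str.split() with no arguments: Unicode-aware whitespace.
text = "a\u00a0b\u2003c\u3000d\te"   # NBSP, em space, ideographic space, tab
print(text.split())                  # ['a', 'b', 'c', 'd', 'e']

# bytes.split() with no arguments: ASCII whitespace only, so the
# UTF-8 encoded non-ASCII spaces stay inside the first chunk.
data = text.encode("utf-8")
print(data.split())
```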

boxed · December 31, 2023, 9:05am

I much prefer “atoms” like in Clojure. They are explicit.

kenahoo · January 4, 2024, 3:03am

Argh - quite right! Sorry for the noise, but at least I understand it better now.

KubaSO · December 4, 2024, 10:57pm

If there’s a few dozen lines of these “extraneous quoted words” then IMHO language changes don’t need to address it. It’s clear and well understood what it means.

IMHO if someone really wants to convert “plain text” to Python, then just write a tiny code generating script for it. Personally, I generate constant Python “data” by reading in TOML and dumping Python out. Easy and clean.

Adding special literal data structure grammar to Python just to save on some punctuation isn’t a very appealing idea, personally. It’s like creating a second language within a language, and that’s AFAIK not considered too great of an idea outside of some isolated cases. Say regex strings are a necessary evil. But say in C++ the template metaprogramming was the insidious “second language” and it was and is considered a hardly necessary evil. Over time, a lot of template metaprograms became expressible in the “core” language - precisely because having a second, completely different language-in-language that is hard to understand and maintain was not liked much at all.

If there were savings in expressiveness and verbosity on the scale of “C++ template metaprograms to imperative C++2x”, where source size, compile time and understandability benefited a lot, then that would be worth investigating. Here we’re talking about saving, literally, punctuation…

Wombat · December 5, 2024, 6:26am

Often split will suffice with no need for dedent:

__all__ = """cmp_op stack_effect hascompare opname opmap HAVE_ARGUMENT
             EXTENDED_ARG hasarg hasconst hasname hasjump hasjrel
             hasjabs hasfree haslocal hasexc""".split()