PEP 750: Tag Strings For Writing Domain-Specific Languages

yoavdw · August 10, 2024, 8:01pm

I know this isn’t the purpose of the PEP, but this regex tag that avoids having to manually escape things is really appealing to me. I’m curious to hear about which (if any) tag strings might be added to the stdlib if this PEP gets accepted. If there are a lot more of those, that should also be point in favor of this PEP.

Ben.Skerritt · August 10, 2024, 9:31pm

I share a lot of the sentiment mentioned before, specifically the syntax.
I love the gates this opens up, and can already imagine the use cases.
However, like mentioned previously, trying to google even what this is is non-trivial, and can imagine new pythonists trying attempts like “sql string”, “xml prefixed string” type-phrases.
I understand the sentiment to keeping the f-string type syntax for teachability, but I really do think backticks would be superior, would make learning it easier through being able to read and find other people’s code, and overall (in my mind anyway), makes a lot more sense than prefixing it to a string
(I think we’ve talked enough about dot prefixes, I won’t keep that fire lit)

kdheepak · August 11, 2024, 12:12am

If possible, backticks seems like it would be good syntax to co-opt for this feature. Visual examples:

regex`hello\s+world` # regex
html`<div>hello world</div>` # html DSL
shell`ls -al` # shlex.split("ls -al")

Backticks would be visually distinct enough from regular strings and there’s precedent from other languages too. Being able to use treesitter to inject highlighting for different languages (SQL, shell, HTML, JS etc) would be a popular feature.

fwiw, Julia has string macros that supports a prefix and suffix. Perhaps this feature could be enabled too (in the future?) e.g. backtick based syntax example based on Julia’s documentation:

regex`a+.*b+.*?d$`ism # regex 
# i enables case-insensitive matching
# m treats the ^ and $ tokens as matching the start and end of individual lines, as opposed to the whole string.
# s allows the . modifier to match newlines.

methane · August 11, 2024, 1:00am

If introduce backtick, please support multiline string with dedent.

Main problem with d-string is it making string prefix and lexer too complex.
So new string quote is very important chance to introduce new multiline string behavior.

    body = ```
        <body>
        <div>hello</div>
        </body>
    ```

First newline right after ``` is dropped.
Minimum common indent is stripped. In this example, last line (closing ```) indicates how many indent is removed.

Standre · August 11, 2024, 2:33am

The greet function in the first example doesn’t work, as it unpacks the Interpolation like _, getvalue = recipient instead of getvalue, *_ = recipient

On the topic of backticks, I would like to point out that they are hard to type in many non-English keyboards, which contributed to them being removed in Python 3

jfs · August 11, 2024, 7:40am

A lot has changed since 2007: StackOverflow, GitHub appeared which made markdown with backticks even more common among developers.

Daverball · August 11, 2024, 7:47am

Taking some cues from code blocks in Markdown I think there’s another option that both addresses the tokenizer concerns without the need for backticks and would still allow editors to easily detect common DSLs and provide inline syntax highlighting for them and also doesn’t close the door on more string prefixes being added to the language in the future.

I.e. we just treat the first word within the string literal as a name^[1] and the actual string literal starts after the first whitespace. Syntax highlighting could help with making that very obvious.

i'html <b>hello world</b>'

Or, another possibility would be to keep it outside, but make the i still a required part of the prefix, so something like this:

i:html'<b>hello world</b>'

This may be slightly more difficult to parse/tokenize again. But should overall still be slightly easier than the suggested syntax.

or dotted name, to satisfy that request ↩︎

qexat · August 11, 2024, 10:17am

Firstly, I must say that a lot of work has been put in this PEP, and I think this is really nice. However, I must admit that while I can see the need for a solution to the template string problem that this PEP attempts to fix, I am not really fond of its execution.

Quoting the Motivation section of the PEP 638 (Syntactic Macros):

New language features can be controversial, disruptive and sometimes divisive. Python is now sufficiently powerful and complex, that many proposed additions are a net loss for the language due to the additional complexity.

I believe that the PEP 638 proposes something which also responds to the same issue that this PEP wants to address. I don’t exactly support the whole form of the PEP 638, but I do think that its execution could bring more benefits (by being more general) both to the intended users, but also to the other users of the language which otherwise would not see advantages in this PEP.

I hope my response doesn’t come as dismissive ; it is not my intent. The idea of sort of generalizing f-strings and making them not hardcoded anymore is really compelling to me.
I also think that this PEP, in its current form, introduces syntactic noise that would reduce Python’s quality of being easily readable and parseable for humans. One might argue that it is also the case for the other PEP I’ve mentioned, but I think this is a discussion for another thread

So, perhaps I’m not among the targeted group which this proposal is designed for, because I don’t think I would ever use this feature.

All that said, if this PEP gets accepted, congratulations!
Thank you for the hard work you folks have put in this proposal.

Summertime · August 11, 2024, 1:07pm

I’d expect the following to work how I’d want it to:

x = f'stuff{x}stuff'

I would expect the following to be less useful:

x = lambda: f'stuff{x()}stuff'

but tagged strings can act like either of those. Associating it with the f-string form I think misrepresents how it can be / should potentially be used.

In javascript, this is handled as such:

// becomes something that can be evaluated later
let a = tag`abc`
// since a is lazy going into this
// the tag can handle it as something lazy
// without scope issues.
a = tag`xyz${a}`

In that, a good portion of lazy evaluation is, I think, going to be used with outputs of other tagged strings. Especially in HTML/XML/etc use-cases. So the benefits of lazy evaluation can still mostly be had without tagged templates needing to have lazy inputs, only lazy outputs.

Summertime · August 11, 2024, 1:18pm

Maybe checking __tagged_string__ with a fallback to __call__ might be nice. Would enable html'' and html() to work differently, might also be too magical.

pauleveritt · August 11, 2024, 2:54pm

As a note, JavaScript uses the same symbol (backtick) for template literals (f-strings) as well as tagged template literals (tag strings.) I’m not arguing against adoption of a different symbol, just clarifying JavaScript’s usage.

godlygeek · August 11, 2024, 7:14pm

I think this point deserves a lot more attention. The “Valid Tag Names” section acknowledges that any existing string prefix must not be a valid tag name, but doesn’t mention the corollary that, once we have tag strings, adding any new string prefix would be a backwards-incompatible change since it might change the behavior of existing code using that prefix as a user-defined tag. This seems to be making a bet that we’ll never need another string prefix again, but that doesn’t seem like a wise bet to me.

At the very least, I’d suggest that the PEP should require that tags be at least 2 (or perhaps even 3) characters, enforcing that with a SyntaxError, to allow the implementation to support more built-in 1 character prefixes in the future without breaking backwards compatibility.

charliermarsh · August 11, 2024, 7:16pm

I’m a fan of the feature in general but find myself agreeing with @erictraut:

My understanding is that JavaScript’s tagged templates (mentioned in the PEP) perform eager evaluation. Personally, I’d like to see stronger motivation and examples for lazy evaluation given the drawbacks mentioned above. I think it will be a source of confusion for users especially because f-strings themselves don’t use these semantics.

I would also prefer this. I find it more intuitive and strictly more flexible, since you can always wrap in a lambda (but not the other way around). If you actually want values to be evaluated lazily, it seems right to me that you should be opting into that.

I would argue that explicitly requiring users to write lambda: would be both more explicit about the behavior and more intuitive to users given that f-strings are eagerly evaluated, thus reducing the burden to carefully read the documentation.

sayandipdutta · August 11, 2024, 8:11pm

This is a fantastic feature, and a great addition to the language. This is an extremely powerful feature (too powerful*, more on that later), that would make many complex solutions, elegant.

As a user, I would like to express my gratitude and some of my opinions, concerns, and confusion regarding the proposal.

At the call site, there’s almost no visual feedback. A dodgy search-and-replace could turn a regular function into a tag function, and one would get no syntax errors. A forgotten comma, wrong backspace, wrong formatting, would change the meaning of the code. Without a type checker, this could become very hard to debug. Which brings me to my next point.
Since there is virtually no difference between a regular function, and a tag function, any function that accepts a string, and other optional parameters, and was not written with tag functions in mind, could be used as tag function. Is this intentional? Is this a goal of this PEP?
If not, I think there should be a distinct way to define tag functions. Least destructive would be a decorator like functools.Tag|typing.Tag. The decorator can add a __tag_string__ or __tag_call__ (as mentioned earlier) attribute/method that enables them to be compatible with the TagString syntax. This is one of those “explicit is better than implicit” cases. Other options could be def tag foo() or def foo() as tag, but the decorator approach seems sufficient.
Regarding visual feedback, I find any of the following to be more clear than the proposed syntax.

tag`hello`
!tag"hello"
tag!"hello"
i:tag"hello"

Some of the above can make parsing dotted names easier as well.

*I fear that the feature is overpowered. Along with metaclasses, descriptors, decorators, this opens the way for more magic, more implicit behaviors. Although this is more intended towards library authors, I still feel a bit uncomfortable with this solution because the power is so easily available, at least PEP 638 had some special syntax.

The following, over-the-top example may give some weight to my words. This is a working multi-line lambda code.

symbols = flat_map(
        ocr_results,
        _lambda"""(paragraph: dict[str, list]) -> Iterable[str]:
        lines = paragraph["lines"]
        return flat_map(
                lines,
                {_lambda"""(line: dict[str, list]) -> Iterable[str]:
                    words = line["words"]
                    return flat_map(
                            words,
                            {_lambda"""(word: dict[str, list]) -> Iterable[str]:
                                return "".join(word["symbols"])"""},
                            merge_str=True,
                        )
                    """}
            )
        """
    )

pastebin link to the full code, tried it on the jupyterlite link posted above.

I am not expecting the PEP to go out of its way to discourage things, but a few light recommended guidelines won’t hurt.

Finally, thank you for all the work, to make Python even more awesome. Really like the feature, I would just like some more explicit syntax, goal summary, and recommendation on what is NOT recommended. Thanks again.

oscarbenjamin · August 11, 2024, 8:34pm

I’m not sure I understand the motivation for lazy evaluation but broadly I think I agree that it is better to be able to opt in to this rather than it being automatic. However I do not agree with the suggested fix:

What does “callable” mean here and why should it change the behaviour implicitly?

I presume that this means checking for __call__ and then behaving differently if it is present or not but I think that is generally a bad idea.

In the SymPy codebase I count 76 __call__ methods:

$ git grep 'def __call__' | wc -l
76

These are on all sorts of different objects. A simple example is a polynomial:

In [1]: p = Poly(x**2 + 1)

In [2]: p
Out[2]: Poly(x**2 + 1, x, domain='ZZ')

In [3]: p(5)
Out[3]: 26

It is very natural to define __call__ for a polynomial because pollynomials are typically identified with functions. I don’t want this to alter the way that a Poly object is handled in tag strings though.

How could you opt out of having a Poly treated as callable so that it could be treated as its literal self?

If there is a to be a distinction between which arguments are evaluated lazily vs eagerly then that really needs to be made by something that is statically present in the tag string like:

result = tag"stuff {p:lazy}"

Inspecting the object p to decide on eager vs lazy evaluation will be very error-prone and also difficult for human readers to understand when reading tag strings.

charliermarsh · August 11, 2024, 8:56pm

Just to clarify my own opinion, I think it’s the responsibility of the tagged template implementation (i.e., the function the user implements) to support this or not and document it as such – like, the tagged template function would say, “Anything that’s wrapped in a lambda will be evaluated.” I don’t think there should be anything implicit in the language feature itself. It’s fair to expect users to learn the API of the specific tagged template they’re using (e.g., what happens when you pass in a callable?).

Nietanod · August 11, 2024, 9:03pm

https://discuss.python.org/t/adding-a-simple-way-to-create-string-prefixes/60577

Here is my proposition for an implementation.
It use a class decorator to register the string tag

pf_moore · August 11, 2024, 9:26pm

I’m not at all sure I like this approach. I’ll be honest, I haven’t really gone through the proposal trying to come up with real world use cases, and as a result I don’t yet have a feel for how much value lazy evaluation adds, but I am convinced that no-one will ever use lambda as “explicit lazy evaluation”.

Over the years there have been many requests for some form of lazy evaluation in Python, in various contexts. Every time it has come up, the argument “you can do this right now, with an explicit lambda” has been put forward, but it’s never been seen as an acceptable workaround. People simply don’t think of lambda: some_calc() * 12 as a “lazily evaluated version of some_calc() * 12”.

So while I have some sympathy with the arguments that lazy evaluation might be confusing, I don’t think we should fool ourselves that wrapping values in lambda is “making deferred execution explicit”. The reality is that we’ll simply be “not supporting deferred execution”, and that’s how people will see the new feature. I don’t (yet) have an opinion on whether we should support lazy evaluation, but I definitely do think that we should avoid claiming that lambda is “opt in lazy evaluation” - either support lazy evaluation, or don’t, and leave it at that.

godlygeek · August 11, 2024, 9:30pm

Thinking about this further, I think there are a lot of advantages that would come from requiring that the tag callables use nominal subtyping rather than structural subtyping. Imagine that we added a new class to the stdlib, string.Tag, and anyone who wanted to define a tag string prefix would need to

class SQLTag(string.Tag):
    ...

sql = SQLTag()
x = sql"..."

This seems to solve many of my biggest concerns about this PEP.

It allows rejecting print"foo" with a runtime error rather than having it succeed due to the accidental fact that print is structurally compatible with the tag callables protocol
It allows rejecting abs"foo" with a clear error explaining that abs isn’t a tag function, rather than an unclear error explaining that Decoded is a bad operand type for abs
It allows an extension point for future features that we may want to add to tag strings. For example, PEP 750 says "mytag'{expr=}' is parsed to being the same as mytag'expr={expr} ", but what if in the future we discover a need to create a tag where that substitution is inappropriate? For instance, if r or b strings didn’t already exist, it would be impossible to implement them using tag strings because of this transformation. With a common parent class, though, we could simply add a new dunder method like __tag_uses_debug_expansion__ and have the implementation inherited from the base class return true, and a derived class that wants to suppress that expansion could override it to return false, and the interpreter could check the flags on the tag callable when deciding whether to apply that transformation on the arguments it’s constructing for the call. Similarly, we could have __tag_defer_evaluation__ that defaults one way but allows a particular tag implementation to choose the other, and have the interpreter respect that when calling the tag callable.

Nietanod · August 11, 2024, 9:36pm

Is your idea like mine ? (Here)
I think you can either use a superclass or a class decorator to register the tag