PEP 750: Tag Strings For Writing Domain-Specific Languages

I’m not heavily invested in this design space, but I’ll say you’re not alone - to me this approach [1] is much less “magical”, in a good way. It seems easier to use and explain.


  1. using a t string prefix to create an object ↩︎

14 Likes

As an update, Jim is looking at how a t-string approach could fit into a PEP update, how it would affect other points, what sharp edges it fixes or causes, etc.

5 Likes

(if at all possible, but totally not required. Hopefully we don’t throw this particular baby out with the bathwater if there’s a switch to t strings)

Instead of updating PEP 750, wouldn’t it make more sense to update PEP 501, as it’s already the same proposal for a t-string approach?

2 Likes

Indeed, highlighting the contents of tag strings per the appropriate language in editors - and getting formatting/linting/whatnot from tools - would be a phenomenal benefit.

Getting an editor to support “I see sql"""...""" in your Python code, I’ll render it as SQL” would presumably be a matter of convention, since sql would not be a reserved word of any kind in Python itself. It follows that if editors are going to be enhanced to render the contents of sql"""...""" as SQL, they could just as easily render the contents of sql(t"""...""") as SQL! So this isn’t an argument for one syntax or the other; it works just fine either way. (This of course applies just as much to any tag and associated language, not just SQL…)

Of course, editors could hard code these lookups, but (though this is outside the scope of this PEP) better still would be a standard way (via decorator, annotation, docstring, or any other reasonable method) for function definers to indicate the language that should be used inside the t-string, which could then be used by editors’ semantic analysis.

In fact, I’ve just now realised an excellent use case for tag strings: regular expressions. VS Code highlights r'...' as a regular expression simply because that’s what r'...' is frequently used for… which unnecessarily discourages using r'...' for any other purpose, since it’s then mis-highlighted. Perhaps one day we can have re.compile(t'...'), get full regular expression syntax checking in editors (by way of the editor knowing the argument is required to be a valid regular expression, since it’s a t-string passed to re.compile), and allow r'...' strings to go back to being ‘neutral’, without any special highlighting. The same even applies to regular un-prefixed strings in a more limited fashion: anything inside {} braces gets highlighted, regardless of what the string is actually being used for.

1 Like

For what it’s worth, Visual Studio used to have this (before its syntax highlighting was replaced with the same implementation as VS Code). It’s challenging enough to make this work reliably, mostly due to arbitrary aliasing/renaming, that it leaves me somewhat torn between tag'...' and tag(t'...').[1]

I think there’s definite value in allowing arbitrary tags. Over time, we’re sure to see “standard” tags emerge, and those will get picked up by editors and other source analysis tools.

Where I hesitate is in trying to support “all callables” or “all variables in the current namespace” as tags. That is easy enough in the simple cases (for both regular development and source analysis), but the edge cases (i.e. unexpected bugs) are so numerous that it puts me off. I certainly wouldn’t want to be making my code analyzer try to figure out what a foo"..." string is in a function where foo is an argument![2] A global registry (like copyreg) isn’t often used in Python either, but maybe it’s an option…?
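To make the registry idea concrete, here is a minimal sketch in the spirit of copyreg. All the names (register_tag, dispatch_tag) are hypothetical, and a plain str stands in for the eventual template object; the point is only that tools and runtimes would agree on tag names up front instead of resolving arbitrary local names at the call site:

```python
# Hypothetical global tag registry, in the spirit of copyreg.
_tag_registry: dict = {}

def register_tag(name, func):
    """Associate a tag name with the callable that processes the string."""
    _tag_registry[name] = func

def dispatch_tag(name, raw):
    """Look up a registered tag and apply it.

    A plain str stands in for the template object here.
    """
    return _tag_registry[name](raw)

register_tag("shout", str.upper)
result = dispatch_tag("shout", "select 1")  # 'SELECT 1'
```

A source analyzer would then only need to consult the registry (or its static equivalent), rather than trace name bindings through the whole program.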

These concerns apply equally to tag(t"...") as they do to tag"...", though in the former case we really don’t have any ability to change the lookup. So source analysis tools either have to handle from module import other_tag as tag or just make a big assumption and let their users suffer from it being wrong sometimes.

So I’m not sure what the right answer is here. Saying “we only add t"..." and everyone else uses a function” is easy, but it misses a big benefit that potentially changes the language in a beneficial way. Though it does dodge having to trace another set of name usage through code in order to prove that your code does what you expect.


  1. And for context, I was the one who implemented it and got it working. (Though should clarify it didn’t get all the way to checking the syntax. Just to identify “oh, you’re calling a function that wants a regex, so this literal must be a regex, I’ll treat it like one”.) ↩︎

  2. Again. Once is enough :wink: ↩︎

3 Likes

I think regardless of what we land on, the expectation of automatic pick-up of certain tag names by editors for inline syntax highlighting is going to be problematic. I don’t really want there to be fifteen different implementations of an html tag function that all do slightly different things, where you’re forced to look at the import or use your language server to know which one it is.

I’d consider specifying a solution that would encourage good editor support a separate problem. Maybe what we actually want for that is something that would work with any string prefix and is pure metadata, similar to annotations, i.e. it wouldn’t change how the string is processed, but the compiler will add some internal attribute to the resulting object, so it is introspectable by any function that consumes the object.

This would have the added benefit of the same tag function being able to dynamically dispatch to different parsers based on the supplied metadata, e.g. for disambiguating dialects:

generic_html_tag_function(t:html5"<br>")  # OK
generic_html_tag_function(t:xhtml"<br>")  # Error, not valid XHTML

# also works with other prefixes
u:html"<br>"
r:html"<br>"
b:html"<br>"
...

# maybe we want to support multiple `:` e.g. for language + dialect
t:sql:postgres""

4 Likes

(My OP also included an example of one way to solve the “where do these names come from and what do they do” case where all you care about is editor support or just plain “tag the thing”)

I completely agree that the syntax should be foo(t"bar"). I also fail to see how editors can add syntax highlighting for html"bar" but not html(t"bar"). I think a proper system for syntax highlighting strings can live in typing instead. Maybe some type aliases for type(t"abc").

from typing import SQLString # just a typealias

def sqltag(s: SQLString):
    return s

from mytags import sqltag

q = sqltag(t"select something from something") # editors can now highlight this

5 Likes

This is why I suggested having a standard way to declare the language expected by a function that takes a tag string argument. VS Code’s semantic analysis (which IIRC is powered by mypy) already does enough amazing work of this nature that it doesn’t seem much of a stretch to have it also entirely reliably detect these declarations and forward them on to the editor so it knows to change the highlighting method for that range of source text.

For example, suppose the module a_sql_impl has:

def parse(arg: WithLanguage[TagStringInfo, 'sql']) -> ...:
  ...

(where TagStringInfo is the type of object that’s instantiated by writing t'...', and WithLanguage is a special annotation used just for the purpose of declaring the language, with no effect on the type of arg inside the body of parse; both of these would be new builtins in Python; again, the names could be anything, though I suggest they get put in an appropriate module!)

Then, even if we have something as convoluted as

import a_sql_impl as m

def f():
  class X:
    v = m.parse
  return X

f().v(t'SELECT 1')

The existing code assist in VS Code (and surely other similar tools) already knows that the type of the first (only) argument required in the call to v is WithLanguage[TagStringInfo, 'sql']. The only enhancement that needs to be made to code assist is the logic that “since this is of the form WithLanguage[TagStringInfo, 'something'], something must be the name of some language to be used for highlighting, so the editor should highlight it as that language”. The list of languages that it gets looked up in would be of the same nature as the list used for rendering fenced code blocks in Markdown files, or maybe it could even be the exact same list - either way it would be up to the code editor or the tool (or the user’s configuration thereof), not Python, to decide it.

(Of course, please don’t write actual code like the above; much better to write def sql(arg: WithLanguage[TagStringInfo, 'sql']) -> ...: and from a_sql_impl import sql and sql(t'SELECT 1')… my point was just to highlight that existing tools can already follow these chains of aliases and that the tool doesn’t care about the actual name of the function.)

(I did previously state that this was outside the scope of this PEP, but I think I’ve changed my mind on this: one of the goals of this PEP is to enable tooling support, and merely including the WithLanguage annotation in it would achieve that, so it seems perfectly in scope, now.)
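For what it’s worth, something very close to this can already be expressed with typing.Annotated, which tools can introspect today. A minimal sketch, where WithLanguage and TagStringInfo are the hypothetical names from above (not real Python builtins), implemented on top of Annotated metadata:

```python
from typing import Annotated, get_args, get_type_hints

class TagStringInfo:
    """Hypothetical stand-in for the object a t'...' literal would produce."""
    def __init__(self, raw: str):
        self.raw = raw

class WithLanguage:
    """WithLanguage[X, 'sql'] becomes Annotated[X, ('language', 'sql')]."""
    def __class_getitem__(cls, item):
        typ, language = item
        return Annotated[typ, ("language", language)]

def parse(arg: WithLanguage[TagStringInfo, "sql"]):
    return arg.raw

# A tool (or editor backend) can recover the declared language:
hints = get_type_hints(parse, include_extras=True)
language = get_args(hints["arg"])[1][1]  # 'sql'
```

The annotation has no runtime effect on arg, but anything that walks type hints (as existing code assist already does) can see the declared language and hand it to the highlighter.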

2 Likes

Many languages that can be parsed by existing Python packages already have multiple parser implementations, surely each with their own quirks, so don’t we already (and won’t we always) have this problem?

This would be easily solved with the foo(t'...') syntax; as I pointed out previously, one can easily add an argument to the function call. Imagine, working from your example:

html(t'<br>', dialect='html5')  # OK
html(t'<br>', dialect='xhtml')  # Error, not valid XHTML

Were code assist to be implemented as per my previous post about WithLanguage, some clever use of @typing.overload and typing.Literal could even allow the editor to highlight the dialects differently based on this dialect arg, so it doesn’t even become a limitation there.
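To make that concrete, here’s a minimal sketch of the dialect-dispatch idea using @typing.overload and typing.Literal. A plain str stands in for the eventual template type, and the parser bodies are toy placeholders, not real HTML parsers:

```python
from typing import Literal, overload

@overload
def html(template: str, *, dialect: Literal["html5"] = ...) -> str: ...
@overload
def html(template: str, *, dialect: Literal["xhtml"]) -> str: ...
def html(template: str, *, dialect: str = "html5") -> str:
    # Hypothetical dispatch: a real implementation would hand off to a
    # dialect-specific parser here.
    parsers = {
        "html5": lambda s: s,                            # '<br>' is fine in HTML5
        "xhtml": lambda s: s.replace("<br>", "<br />"),  # XHTML wants self-closing tags
    }
    return parsers[dialect](template)

html("<br>", dialect="xhtml")  # '<br />'
```

A type checker (and, per the above, an editor) can narrow on the Literal dialect value at each call site, so the highlighting could follow the dialect argument.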

(I’ll also note that the t:html5'...' syntax results in at least one ambiguity: imagine putting it inside {}. Is {t:html5'<br>'} a set or a dict?)

1 Like

You misunderstand me. The problem is not that there are multiple different parsers for the same language. The problem is encouraging them to be all named the same, i.e. the namespace pollution of functions with the same name but sometimes vastly different purpose, just so syntax highlighting works. You are conflating two things, that although related, are not the same.

I would like them to be completely separate, so there’s something that says what kind of data this literal contains and then just any function or class with a descriptive name can process it. That way we can also get syntax highlighting on strings that don’t contain any interpolation and we aren’t forced to call the function on the literal immediately, but instead can defer it and reuse the same literal object multiple times.

4 Likes

See also:

I decided to respond to this separately, although this is starting to get a bit off-topic, so I’ll stop there. There’s technically no ambiguity, because html5'<br>' is not valid syntax while t:html5'<br>' is. Granted, it does make the parser’s job a bit more difficult, and it would become ambiguous if you used a meta tag that shadows a string prefix, but the obvious solution to that would be to ban single-character meta tags.

I’m also not married to this syntax; I only used it to illustrate how this could potentially look. The key takeaway should be that the ability to add one or more meta tags to string literals could be useful, since these can very easily be standardized across the ecosystem without causing namespace pollution; after all, it’s just metadata. The disadvantage of most of the other approaches I have seen suggested over the years is that the metadata is not directly tied to the literal, so you can’t (easily) introspect it at runtime.

A comment-based solution, like the one PyCharm provides, is a great option for partial backwards compatibility, similar to how we have type comments to give users the ability to support static type checking while still retaining support for older versions of Python.

1 Like

We left out of this PEP any normative scheme for stating what the specific DSL is, along with its dialect.

But it clearly should not be hardcoded, whether to an r prefix or to an html function. This discussion on the pre-PEP was helpful (Tag string use-case: language marker for syntax highlight · Issue #18 · jimbaker/tagstr · GitHub), including the idea of using Annotated to make the connection to some registry, such as GitHub’s Linguist. I imagine there could be more than one registry out there, although GitHub’s looks comprehensive for agreeing on a specific common name.

Let’s sketch out some possibilities with the Template approach:

class HTML(Protocol):
  # some basic things we would expect here
  ...

HTML = Annotated[HTML, {
  'language': 'html',
  'registry': 'https://github.com/github-linguist/linguist/blob/master/lib/linguist/languages.yml'
}]

def html(template: Template[HTML]) -> HTML:
  # parses the provided template into an AST
  # contextually fills in any interpolations using the AST
  # builds the desired object of type HTML, such as a DOM repr
  ...

Now let’s try the following:

some_html: Template[HTML] = t'<li>{name}</li>'

So this looks good - we can infer the expected DSL from the type here. Syntax highlighting can be provided, as well as syntax checking of the target DSL. It also recursively composes to any expressions in the template. Same with

html(t'<li>{name}</li>')

If it were actually incorrect syntax, this can be highlighted by the IDE or linter. Of course, the only time we know that it’s actually an HTML object is when it’s returned; Template[HTML] is simply used by html to build HTML objects.

@steve.dower I’m not certain this addresses the concerns here, but hopefully it’s getting closer. I could still see some advantages of the original tag string approach: we know it’s html when built with it (or else it raises some exception). Separating this into two parts, a t-string and a function that consumes the resulting further-specified template (Template[HTML]), should work here.

I have of course left out a lot of details in this sketch, but hopefully it can support a good developer experience.
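As a rough runtime illustration of the sketch above, with plain dataclasses standing in for the proposed Template and interpolation types, and a toy html tag function that just escapes interpolated values (all names are stand-ins, not the PEP’s actual API):

```python
from dataclasses import dataclass
import html as html_escaping  # stdlib escaping, renamed to avoid clashing below

@dataclass
class Interpolation:
    """Stand-in for one {expr} field of a t-string."""
    value: object
    expr: str

@dataclass
class Template:
    """Stand-in for the object t'...' would evaluate to: a mix of
    literal strings and Interpolation objects."""
    parts: tuple

def html(template: Template) -> str:
    # A toy tag function: pass literal text through, escape interpolations.
    rendered = []
    for part in template.parts:
        if isinstance(part, Interpolation):
            rendered.append(html_escaping.escape(str(part.value)))
        else:
            rendered.append(part)
    return "".join(rendered)

# Roughly what t'<li>{name}</li>' might decompose into:
name = "Fish & Chips"
result = html(Template(("<li>", Interpolation(name, "name"), "</li>")))
# result == '<li>Fish &amp; Chips</li>'
```

The deferral property falls out naturally: the Template object can be stored, inspected, or passed to a different consumer before (or instead of) being rendered.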

4 Likes

It’s actually powered by Pylance.

3 Likes

While I think leveraging type annotations is a fairly reasonable approach, I am a little bit concerned about the implications for what editors have to do in order to support this. It would mean the parser by itself would rarely have a chance of identifying which string literals should have embedded syntax highlighting, since determining the language of the literal requires semantic analysis. So the editor either has to do that itself or has to talk to a language server, which has to be configured properly so it can pick up all the required dependencies (since html and HTML may be provided by some third-party module, rather than defined in the file or source tree we’re currently working in).

So the barrier to entry for actually getting the promised syntax highlighting out of this feature seems fairly high, especially for beginners. If we can do something simpler, that a parser can easily understand on its own without requiring semantic analysis, I think that would be much better. It would also raise the chances that editors support this out of the box, rather than through some plugin.

Thanks, I had indeed misunderstood you. I think I get what you mean now: we should be able to first specify the language of a tag string, and then pass it to a function to be consumed.

I do think either way works just fine though. If the html function takes an argument of the type Jim has called Template[HTML] (which I had called WithLanguage[TagStringInfo, 'html']) then the html function could just straight up parse the HTML, but it doesn’t have to, it could return the object unchanged for later processing. Or, indeed, as Jim has pointed out, one can just write some_html: Template[HTML] = t'...' and semantic analysis would know that the ... is HTML just the same as when it was an argument being passed to a function. So I think all bases are covered here.

Ah, thanks. And Pylance is powered by Pyright. This isn’t the first time I’ve gotten Pyright and Mypy mixed up :slight_smile:

I think editors can probably have a fallback option if semantic analysis is not available, where they textually recognise expressions of the form lang(t'...' (closing paren intentionally omitted, to allow for extra arguments to be used), or var: Template[lang] = t'...', and automatically highlight the ... as being of language lang, case-insensitively, if known to the editor (like html or sql). (The same would also be applied for double- and/or triple-quoted strings.)

Then, editors that have semantic analysis support can turn off the fallback (to avoid false positives) and use only semantic analysis instead.
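Such a textual fallback could be as simple as a couple of regexes applied per source line. KNOWN_LANGUAGES and both patterns here are purely illustrative, not a proposal for exact matching rules:

```python
import re

# Hypothetical patterns for the two fallback forms discussed above:
#   lang(t'...'                 -- closing paren deliberately not required
#   var: Template[Lang] = t'...'
TAG_CALL = re.compile(r"\b(?P<lang>\w+)\(\s*t['\"]")
TEMPLATE_ANN = re.compile(r":\s*Template\[(?P<lang>\w+)\]\s*=\s*t['\"]")

KNOWN_LANGUAGES = {"html", "sql"}  # e.g. the editor's fenced-code-block list

def guess_embedded_language(line: str):
    """Return the language to highlight the t-string as, or None."""
    for pattern in (TEMPLATE_ANN, TAG_CALL):
        match = pattern.search(line)
        if match and match.group("lang").lower() in KNOWN_LANGUAGES:
            return match.group("lang").lower()
    return None

guess_embedded_language("q = sql(t'SELECT 1', dialect='postgres')")  # 'sql'
guess_embedded_language("page: Template[HTML] = t'<br>'")            # 'html'
```

An editor with semantic analysis would use the real type information instead; this is only the no-analysis baseline.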

In any case, I’d give a solid +1 to Jim’s latest design.

One more thought that comes to mind. In addition to t'...', is it appropriate to include tb'...' (and bt'...' which’d do the same) to indicate that the literal parts of the string are to be bytes rather than str objects? We would presumably then need Template[TheLanguage, str] and Template[TheLanguage, bytes] - though I’m sure Template[TheLanguage] could default to str thanks to PEP 696.

You wouldn’t use tb'...' for things like HTML or SQL, but I can see it being useful in the occasional places where speaking in bytes rather than Unicode is the correct thing to do. There have been times when I’ve written code of the form writer.write(b'foo' + var1 + b'bar' + var2 + b'baz') and it’d be much nicer to only have to specify the b prefix once by using tb for the whole lot; it’s always frustrating to miss out one b and only find out later that I can’t add a str to a bytes. :slight_smile: It’s certainly not a must-have, but it seems like it should be simple enough, and it would preserve the existing symmetry between str and bytes literals.
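Pending any tb'...' support, that last pain point can at least be localised with a tiny helper approximating what a bytes template might render to eagerly (bjoin is a made-up name, and this obviously loses the deferred-template aspect):

```python
def bjoin(literals, values):
    """Interleave bytes literals with bytes values, roughly what
    tb'foo{var1}bar{var2}baz' might render to eagerly."""
    if len(literals) != len(values) + 1:
        raise ValueError("need exactly one more literal than values")
    out = bytearray(literals[0])
    for value, literal in zip(values, literals[1:]):
        out += value   # TypeError raised here, early, if a value is str
        out += literal
    return bytes(out)

var1, var2 = b"-one-", b"-two-"
message = bjoin([b"foo", b"bar", b"baz"], [var1, var2])
# message == b'foo-one-bar-two-baz'
```

With a real tb-string, the compiler would build the literals/values split for you, and a str value could fail at literal-construction time rather than at join time.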

With conversion specifiers being evaluated at rendering time, I realised there’s a straightforward way to support lazy fields even with eager value rendering: define !() as a new conversion specifier that means “call at rendering time”.

For t-strings, that looks like:

logging.debug(t"Logging with eager {expensive_call()}")
logging.debug(t"Logging with lazy {expensive_call!()}")

logging.debug(t"Logging with eager {call_with_args(x, y, z)}")
logging.debug(t"Logging with lazy {(lambda: call_with_args(x, y, z))!()}")

For tagged strings, just drop the (t...) (assuming we add __tag_call__ to the logging APIs and allow dotted prefixes):

logging.debug"Logging with eager {expensive_call()}"
logging.debug"Logging with lazy {expensive_call!()}"

logging.debug"Logging with eager {call_with_args(x, y, z)}"
logging.debug"Logging with lazy {(lambda: call_with_args(x, y, z))!()}"

Offering {-> expensive_call_with_args(x, y, z)!()} as syntactic sugar for {(lambda: expensive_call_with_args(x, y, z))!()} can then be left to a later PEP that cites more specific use cases.
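The deferral benefit can be simulated in today’s Python to show what’s at stake. LazyField and render are invented for illustration; the real proposal would hang this off the template’s conversion machinery rather than an explicit wrapper class:

```python
call_count = 0

def expensive_call():
    """Toy stand-in for work we'd rather skip when logging is disabled."""
    global call_count
    call_count += 1
    return "result"

class LazyField:
    """Stand-in for a {expensive_call!()} field: holds a callable that is
    only invoked at rendering time."""
    def __init__(self, func):
        self.func = func

def render(literals, fields):
    """Interleave literals and fields, calling lazy fields only now."""
    out = [literals[0]]
    for field, literal in zip(fields, literals[1:]):
        value = field.func() if isinstance(field, LazyField) else field
        out.append(str(value))
        out.append(literal)
    return "".join(out)

# Eager: expensive_call() runs when the "template" is built...
eager = (["Logging with eager ", ""], [expensive_call()])      # call_count is now 1
# ...lazy: it only runs if render() is actually reached, e.g. when the
# log level turns out to be enabled.
lazy = (["Logging with lazy ", ""], [LazyField(expensive_call)])  # still 1
rendered = render(*lazy)  # the call happens here; call_count becomes 2
```

If the logging call is filtered out before rendering, the lazy field’s callable is simply never invoked, which is the whole point of the !() specifier.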

As others have noted, the proposal for a template literal syntax is PEP 501 (since we have two PEPs assigned, I think it makes the most sense to have each make the best case it can for its preferred syntax). While @nhumrich and I both like large parts of the PEP 750 implementation proposal (so our proposed PEP 501 implementation approach would now be to start from the PEP 750 implementation and tweak the details of the surface syntax and the public API), we’re definitely not convinced that it makes sense to expand the syntax proposal from a single dedicated t prefix to allowing arbitrary prefixes.

I’ll be starting the PEP 501 update PR based on the notes at PEP 501: improvements inspired by PEP 750's tagged strings · Issue #3904 · python/peps · GitHub tomorrow.

On a related note, I was happy to see in PEP 750: Tag Strings For Writing Domain-Specific Languages - #176 by jimbaker that the current thoughts on the type system updates needed to allow DSL-aware syntax highlighting are expected to work regardless of whether the spelling is tag_function(t"...") or tag_function"...".

The case hasn’t been made to allow “fb-strings” yet, so the case for allowing “tb-strings” is weak. Leaving the door open to adding them in the future is another advantage of restricting ourselves to just t-strings rather than allowing arbitrary tags, though.

3 Likes

While I don’t think we need to support the absolute lowest common denominator in editors, I do think it’s reasonable to aim to provide something that doesn’t require an editor that does semantic analysis to get a decent development experience.

To my mind, the biggest problem with the tag(t'...') approach is exactly the fact that it’s “just” a function call. The only way of knowing how to interpret the data inside the t'...' is to identify the name of the tag function. I’m fine with saying that we look at the function name rather than its definition (so nobody should expect myhtml = html to give a new tag function that editors will understand - they might, but that’s in the realm of semantic interpretation, which simpler editors won’t do). What I’m not fine with is limiting the syntax of the function call.

So, for example, let’s look at

html(t"This is a long string, which will make black want to wrap. Let's see how it goes!")

(lol, Discourse seems to wrap this. It’s a single line of source code). Assuming black handles t'...' the same as f'...', it will reformat this to

html(
    t"This is a long string, which will make black want to wrap. Let's see how it goes!"
)

And even if black gets changed to handle t-strings, human developers could quite reasonably use this form.

Edit: I got distracted, and forgot to make my point clear, which is that we can’t reasonably enforce specifics of spacing, because that’s not how function calls work. Hopefully, it was clear enough what I was trying to get at anyway. (End of added paragraph)

There’s also html ( t'...' ), or any number of variants of whitespace abuse. Which we might say are unreasonable, but they are all legitimate function calls, and having an editor privilege one style over another (in terms of giving a better developer experience) seems to me to be a direction we shouldn’t be going in.

So, to summarise, I prefer tag'...' over tag(t'...') precisely because it’s not syntactically a function call, even though it might be, semantically.

Maybe we could distinguish tags via something like t:html'...', with t:func'...' being equivalent to func(t'...'), and do everything else in terms of a t'...' template object. That gives us distinguished syntax, namespacing of tags, and an actual template object, while still keeping the end user (and source code) syntax identifiable enough to allow decent handling by an editor that (for example) does highlighting via regex-style parsing.

4 Likes