PEP 750: Tag Strings For Writing Domain-Specific Languages

I’m not heavily invested in this design space, but I’ll say you’re not alone - to me this approach [1] is much less “magical”, in a good way. It seems easier to use and explain.


  1. using a t string prefix to create an object ↩︎

14 Likes

As an update, Jim is looking at how a t-string approach could fit into a PEP update, how it would affect other points, what sharp edges it fixes or causes, etc.

5 Likes

(if at all possible, but totally not required. Hopefully we don’t throw this particular baby out with the bathwater if there’s a switch to t strings)

Instead of updating PEP 750, wouldn’t it make more sense to update PEP 501, as it’s already the same proposal for a t-string approach?

2 Likes

Indeed, highlighting the contents of tag strings per the appropriate language in editors - and getting formatting/linting/whatnot from tools - would be a phenomenal benefit.

Getting an editor to support “I see sql"""...""" in your Python code, I’ll render it as SQL” would presumably be a matter of convention, since sql would not be a reserved word of any kind in Python itself. It follows that if editors are going to be enhanced to render the contents of sql"""...""" as SQL, they could just as easily render the contents of sql(t"""...""") as SQL! So this isn’t an argument for one syntax or the other; it works just fine either way. (This of course applies just as much to any tag and associated language, not just SQL…)

Of course, editors could hard code these lookups, but (though this is outside the scope of this PEP) better still would be a standard way (via decorator, annotation, docstring, or any other reasonable method) for function definers to indicate the language that should be used inside the t-string, which could then be used by editors’ semantic analysis.

In fact, I’ve just now realised an excellent use case for tag strings: regular expressions. VS Code highlights r'...' as a regular expression simply because that’s what r'...' is frequently used for… which unnecessarily discourages using r'...' for any other purpose, since it’s then mis-highlighted. Perhaps one day we can have re.compile(t'...'), get full regular expression syntax checking in editors (by way of the editor knowing the argument is required to be a valid regular expression, since it’s a t-string passed to re.compile), and allow r'...' strings to go back to being ‘neutral’, without any special highlighting. The same even applies to regular un-prefixed strings in a more limited fashion: anything inside {} braces gets highlighted, regardless of what the string is actually being used for.

1 Like

For what it’s worth, Visual Studio used to have this (before its syntax highlighting was replaced with the same implementation as VS Code). It’s challenging enough to make this work reliably, mostly due to arbitrary aliasing/renaming, that it leaves me somewhat torn between tag'...' and tag(t'...').[1]

I think there’s definite value in allowing arbitrary tags. Over time, we’re sure to see “standard” tags emerge, and those will get picked up by editors and other source analysis tools.

Where I hesitate is in trying to support “all callables” or “all variables in the current namespace” as tags. That is easy enough in the simple cases (for both regular development and source analysis), but the edge cases (i.e. unexpected bugs) are so numerous that it puts me off. I certainly wouldn’t want to be making my code analyzer try to figure out what a foo"..." string is in a function where foo is an argument![2] A global registry (like copyreg) isn’t often used in Python either, but maybe it’s an option…?
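To make the registry idea concrete, here is a minimal sketch in the spirit of copyreg. All the names (register_tag, dispatch_tag) are hypothetical, and a plain str stands in for the eventual template object; the point is only that tools and runtimes would agree on tag names up front instead of resolving arbitrary local names at the call site:

```python
# Hypothetical global tag registry, in the spirit of copyreg.
_tag_registry: dict = {}

def register_tag(name, func):
    """Associate a tag name with the callable that processes the string."""
    _tag_registry[name] = func

def dispatch_tag(name, raw):
    """Look up a registered tag and apply it.

    A plain str stands in for the template object here.
    """
    return _tag_registry[name](raw)

register_tag("shout", str.upper)
result = dispatch_tag("shout", "select 1")  # 'SELECT 1'
```

A source analyzer would then only need to consult the registry (or its static equivalent), rather than trace name bindings through the whole program.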

These concerns apply equally to tag(t"...") as they do to tag"...", though in the former case we really don’t have any ability to change the lookup. So source analysis tools either have to handle from module import other_tag as tag or just make a big assumption and let their users suffer from it being wrong sometimes.

So I’m not sure what the right answer is here. Saying “we only add t"..." and everyone else uses a function” is easy, but it misses a big benefit that potentially changes the language in a beneficial way. Though it does dodge having to trace another set of name usage through code in order to prove that your code does what you expect.


  1. And for context, I was the one who implemented it and got it working. (Though should clarify it didn’t get all the way to checking the syntax. Just to identify “oh, you’re calling a function that wants a regex, so this literal must be a regex, I’ll treat it like one”.) ↩︎

  2. Again. Once is enough :wink: ↩︎

3 Likes

I think regardless of what we land on, the expectation of automatic pick-up of certain tag names by editors for inline syntax highlighting is going to be problematic. I don’t really want there to be fifteen different implementations of an html tag function that all do slightly different things, where you’re forced to look at the import or use your language server to know which one it is.

I’d consider specifying a solution that would encourage good editor support a separate problem. Maybe what we actually want for that is something that would work with any string prefix and is pure metadata, similar to annotations, i.e. it wouldn’t change how the string is processed, but the compiler will add some internal attribute to the resulting object, so it is introspectable by any function that consumes the object.

This would have the added benefit of the same tag function being able to dynamically dispatch to different parsers based on the supplied metadata, e.g. for disambiguating dialects:

generic_html_tag_function(t:html5"<br>")  # OK
generic_html_tag_function(t:xhtml"<br>")  # Error, not valid XHTML

# also works with other prefixes
u:html"<br>"
r:html"<br>"
b:html"<br>"
...

# maybe we want to support multiple `:` e.g. for language + dialect
t:sql:postgres""

4 Likes

(My OP also included an example of one way to solve the “where do these names come from and what do they do” case where all you care about is editor support or just plain “tag the thing”)

I completely agree that the syntax should be foo(t"bar"). I also fail to see how editors can add syntax highlighting for html"bar" but not html(t"bar"). I think a proper system for syntax highlighting strings can live in typing instead. Maybe some type aliases for type(t"abc").

from typing import SQLString # just a typealias

def sqltag(s: SQLString):
    return s

from mytags import sqltag

q = sqltag(t"select something from something") # editors can now highlight this

5 Likes

This is why I suggested having a standard way to declare the language expected by a function that takes a tag string argument. VS Code’s semantic analysis (which IIRC is powered by mypy) already does enough amazing work of this nature that it doesn’t seem much of a stretch to have it also entirely reliably detect these declarations and forward them on to the editor so it knows to change the highlighting method for that range of source text.

For example, suppose the module a_sql_impl has:

def parse(arg: WithLanguage[TagStringInfo, 'sql']) -> ...:
  ...

(where TagStringInfo is the type of object that’s instantiated by writing t'...', and WithLanguage is a special annotation used just for the purpose of declaring the language, with no effect on the type of arg inside the body of parse; both of these would be new builtins in Python; again, the names could be anything, though I suggest they get put in an appropriate module!)

Then, even if we have something as convoluted as

import a_sql_impl as m

def f():
  class X:
    v = m.parse
  return X

f().v(t'SELECT 1')

The existing code assist in VS Code (and surely other similar tools) already knows that the type of the first (only) argument required in the call to v is WithLanguage[TagStringInfo, 'sql']. The only enhancement that needs to be made to code assist is the logic that “since this is of the form WithLanguage[TagStringInfo, 'something'], something must be the name of some language to be used for highlighting, so the editor should highlight it as that language”. The list of languages that it gets looked up in would be of the same nature as the list used for rendering fenced code blocks in Markdown files, or maybe it could even be the exact same list - either way it would be up to the code editor or the tool (or the user’s configuration thereof), not Python, to decide it.

(Of course, please don’t write actual code like the above; much better to write def sql(arg: WithLanguage[TagStringInfo, 'sql']) -> ...: and from a_sql_impl import sql and sql(t'SELECT 1')… my point was just to highlight that existing tools can already follow these chains of aliases and that the tool doesn’t care about the actual name of the function.)

(I did previously state that this was outside the scope of this PEP, but I think I’ve changed my mind on this: one of the goals of this PEP is to enable tooling support, and merely including the WithLanguage annotation in it would achieve that, so it seems perfectly in scope, now.)
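For what it’s worth, something very close to this can already be expressed with typing.Annotated, which tools can introspect today. A minimal sketch, where WithLanguage and TagStringInfo are the hypothetical names from above (not real Python builtins), implemented on top of Annotated metadata:

```python
from typing import Annotated, get_args, get_type_hints

class TagStringInfo:
    """Hypothetical stand-in for the object a t'...' literal would produce."""
    def __init__(self, raw: str):
        self.raw = raw

class WithLanguage:
    """WithLanguage[X, 'sql'] becomes Annotated[X, ('language', 'sql')]."""
    def __class_getitem__(cls, item):
        typ, language = item
        return Annotated[typ, ("language", language)]

def parse(arg: WithLanguage[TagStringInfo, "sql"]):
    return arg.raw

# A tool (or editor backend) can recover the declared language:
hints = get_type_hints(parse, include_extras=True)
language = get_args(hints["arg"])[1][1]  # 'sql'
```

The annotation has no runtime effect on arg, but anything that walks type hints (as existing code assist already does) can see the declared language and hand it to the highlighter.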

2 Likes

Many languages that can be parsed by existing Python packages already have multiple parser implementations, surely each with their own quirks, so don’t we already (and won’t we always) have this problem?

This would be easily solved with the foo(t'...') syntax; as I pointed out previously, one can easily add an argument to the function call. Imagine, working from your example:

html(t'<br>', dialect='html5')  # OK
html(t'<br>', dialect='xhtml')  # Error, not valid XHTML

Were code assist to be implemented as per my previous post about WithLanguage, some clever use of @typing.overload and typing.Literal could even allow the editor to highlight the dialects differently based on this dialect arg, so it doesn’t even become a limitation there.
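To make that concrete, here’s a minimal sketch of the dialect-dispatch idea using @typing.overload and typing.Literal. A plain str stands in for the eventual template type, and the parser bodies are toy placeholders, not real HTML parsers:

```python
from typing import Literal, overload

@overload
def html(template: str, *, dialect: Literal["html5"] = ...) -> str: ...
@overload
def html(template: str, *, dialect: Literal["xhtml"]) -> str: ...
def html(template: str, *, dialect: str = "html5") -> str:
    # Hypothetical dispatch: a real implementation would hand off to a
    # dialect-specific parser here.
    parsers = {
        "html5": lambda s: s,                            # '<br>' is fine in HTML5
        "xhtml": lambda s: s.replace("<br>", "<br />"),  # XHTML wants self-closing tags
    }
    return parsers[dialect](template)

html("<br>", dialect="xhtml")  # '<br />'
```

A type checker (and, per the above, an editor) can narrow on the Literal dialect value at each call site, so the highlighting could follow the dialect argument.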

(I’ll also note that the t:html5'...' syntax results in at least one ambiguity: imagine putting it inside {}. Is {t:html5'<br>'} a set or a dict?)

1 Like

You misunderstand me. The problem is not that there are multiple different parsers for the same language. The problem is encouraging them to be all named the same, i.e. the namespace pollution of functions with the same name but sometimes vastly different purpose, just so syntax highlighting works. You are conflating two things, that although related, are not the same.

I would like them to be completely separate, so there’s something that says what kind of data this literal contains and then just any function or class with a descriptive name can process it. That way we can also get syntax highlighting on strings that don’t contain any interpolation and we aren’t forced to call the function on the literal immediately, but instead can defer it and reuse the same literal object multiple times.

4 Likes

See also:

I decided to respond to this separately, although this is starting to get a bit off-topic, so I’ll stop there. There’s technically no ambiguity, because html5'<br>' is not valid syntax while t:html5'<br>' is. Granted, it does make the parser’s job a bit more difficult, and it would become ambiguous if you used a meta tag that shadows a string prefix, but the obvious solution to that would be to ban single-character meta tags.

I’m also not married to this syntax; I only used it to illustrate how this could potentially look. The key takeaway should be that the ability to add one or more meta tags to string literals could be useful, since these can very easily be standardized across the ecosystem without causing namespace pollution; after all, it’s just metadata. The disadvantage of most of the other approaches I have seen suggested over the years is that the metadata is not directly tied to the literal, so you can’t (easily) introspect it at runtime.

A comment-based solution, like the one PyCharm provides, is a great option for partial backwards compatibility, similar to how we have type comments to give users the ability to support static type checking while still retaining support for older versions of Python.

1 Like

We left out of this PEP any normative scheme for stating what the specific DSL is, along with its dialect.

But it clearly should not be hardcoded, whether to an r prefix or to an html function. This discussion on the pre-PEP was helpful (Tag string use-case: language marker for syntax highlight · Issue #18 · jimbaker/tagstr · GitHub), including the idea of using Annotated to make the connection to some registry, such as GitHub’s Linguist. I imagine there could be more than one registry out there, although GitHub’s looks comprehensive for agreeing on a specific common name.

Let’s sketch out some possibilities with the Template approach:

class HTML(Protocol):
  # some basic things we would expect here
  ...

HTML = Annotated[HTML, {
  'language': 'html',
  'registry': 'https://github.com/github-linguist/linguist/blob/master/lib/linguist/languages.yml'
}]

def html(template: Template[HTML]) -> HTML:
  # parses the provided template into an AST
  # contextually fills in any interpolations using the AST
  # builds the desired object of type HTML, such as a DOM repr
  ...

Now let’s try the following:

some_html: Template[HTML] = t'<li>{name}</li>'

So this looks good - we can infer the expected DSL from the type here. Syntax highlighting can be provided, as well as syntax checking of the target DSL. It also recursively composes to any expressions in the template. Same with

html(t'<li>{name}</li>')

If it were actually incorrect syntax, this can be highlighted by the IDE or linter. Of course, the only time we know that it’s actually an HTML object is when it’s returned; Template[HTML] is simply used by html to build HTML objects.

@steve.dower I’m not certain this addresses the concerns here, but hopefully it’s getting closer. I could still see some advantages of the original tag string approach: we know it’s html when built with it (or else it raises some exception). Separating this into two parts, a t-string and a function that consumes the resulting further-specified template (Template[HTML]), should work here.

I have of course left out a lot of details in this sketch, but hopefully it can support a good developer experience.
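As a rough runtime illustration of the sketch above, with plain dataclasses standing in for the proposed Template and interpolation types, and a toy html tag function that just escapes interpolated values (all names are stand-ins, not the PEP’s actual API):

```python
from dataclasses import dataclass
import html as html_escaping  # stdlib escaping, renamed to avoid clashing below

@dataclass
class Interpolation:
    """Stand-in for one {expr} field of a t-string."""
    value: object
    expr: str

@dataclass
class Template:
    """Stand-in for the object t'...' would evaluate to: a mix of
    literal strings and Interpolation objects."""
    parts: tuple

def html(template: Template) -> str:
    # A toy tag function: pass literal text through, escape interpolations.
    rendered = []
    for part in template.parts:
        if isinstance(part, Interpolation):
            rendered.append(html_escaping.escape(str(part.value)))
        else:
            rendered.append(part)
    return "".join(rendered)

# Roughly what t'<li>{name}</li>' might decompose into:
name = "Fish & Chips"
result = html(Template(("<li>", Interpolation(name, "name"), "</li>")))
# result == '<li>Fish &amp; Chips</li>'
```

The deferral property falls out naturally: the Template object can be stored, inspected, or passed to a different consumer before (or instead of) being rendered.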

4 Likes

It’s actually powered by Pylance.

3 Likes

While I think leveraging type annotations is a fairly reasonable approach, I am a little bit concerned about the implications for what editors have to do in order to support this. It would mean the parser by itself would rarely have a chance of identifying which string literals should have embedded syntax highlighting, since determining the language of the literal requires semantic analysis. So the editor either has to do that itself or has to talk to a language server, which has to be configured properly so it can pick up all the required dependencies (since html and HTML may be provided by some third-party module, rather than defined in the file or source tree we’re currently working in).

So the barrier to entry for actually getting the promised syntax highlighting out of this feature seems fairly high, especially for beginners. If we can do something simpler, that a parser can easily understand on its own without requiring semantic analysis, I think that would be much better. It would also raise the chances that editors support this out of the box, rather than through some plugin.

Thanks, I had indeed misunderstood you. I think I get what you mean now: we should be able to first specify the language of a tag string, and then pass it to a function to be consumed.

I do think either way works just fine though. If the html function takes an argument of the type Jim has called Template[HTML] (which I had called WithLanguage[TagStringInfo, 'html']) then the html function could just straight up parse the HTML, but it doesn’t have to, it could return the object unchanged for later processing. Or, indeed, as Jim has pointed out, one can just write some_html: Template[HTML] = t'...' and semantic analysis would know that the ... is HTML just the same as when it was an argument being passed to a function. So I think all bases are covered here.

Ah, thanks. And Pylance is powered by Pyright. This isn’t the first time I’ve gotten Pyright and Mypy mixed up :slight_smile:

I think editors can probably have a fallback option if semantic analysis is not available, where they textually recognise expressions of the form lang(t'...' (closing paren intentionally omitted, to allow for extra arguments to be used), or var: Template[lang] = t'...', and automatically highlight the ... as being of language lang, case-insensitively, if known to the editor (like html or sql). (The same would also be applied for double- and/or triple-quoted strings.)

Then, editors that have semantic analysis support can turn off the fallback (to avoid false positives) and use only semantic analysis instead.
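Such a textual fallback could be as simple as a couple of regexes applied per source line. KNOWN_LANGUAGES and both patterns here are purely illustrative, not a proposal for exact matching rules:

```python
import re

# Hypothetical patterns for the two fallback forms discussed above:
#   lang(t'...'                 -- closing paren deliberately not required
#   var: Template[Lang] = t'...'
TAG_CALL = re.compile(r"\b(?P<lang>\w+)\(\s*t['\"]")
TEMPLATE_ANN = re.compile(r":\s*Template\[(?P<lang>\w+)\]\s*=\s*t['\"]")

KNOWN_LANGUAGES = {"html", "sql"}  # e.g. the editor's fenced-code-block list

def guess_embedded_language(line: str):
    """Return the language to highlight the t-string as, or None."""
    for pattern in (TEMPLATE_ANN, TAG_CALL):
        match = pattern.search(line)
        if match and match.group("lang").lower() in KNOWN_LANGUAGES:
            return match.group("lang").lower()
    return None

guess_embedded_language("q = sql(t'SELECT 1', dialect='postgres')")  # 'sql'
guess_embedded_language("page: Template[HTML] = t'<br>'")            # 'html'
```

An editor with semantic analysis would use the real type information instead; this is only the no-analysis baseline.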

In any case, I’d give a solid +1 to Jim’s latest design.

One more thought that comes to mind. In addition to t'...', is it appropriate to include tb'...' (and bt'...' which’d do the same) to indicate that the literal parts of the string are to be bytes rather than str objects? We would presumably then need Template[TheLanguage, str] and Template[TheLanguage, bytes] - though I’m sure Template[TheLanguage] could default to str thanks to PEP 696.

You wouldn’t use tb'...' for things like HTML or SQL, but I can see it being useful in the occasional places where speaking in bytes rather than Unicode is the correct thing to do. There have been times when I’ve written code of the form writer.write(b'foo' + var1 + b'bar' + var2 + b'baz') and it’d be much nicer to only have to specify the b prefix once by using tb for the whole lot; it’s always frustrating to miss out one b and only find out later that I can’t add a str to a bytes. :slight_smile: It’s certainly not a must-have, but it seems like it should be simple enough, and it would preserve the existing symmetry between str and bytes literals.
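Pending any tb'...' support, that last pain point can at least be localised with a tiny helper approximating what a bytes template might render to eagerly (bjoin is a made-up name, and this obviously loses the deferred-template aspect):

```python
def bjoin(literals, values):
    """Interleave bytes literals with bytes values, roughly what
    tb'foo{var1}bar{var2}baz' might render to eagerly."""
    if len(literals) != len(values) + 1:
        raise ValueError("need exactly one more literal than values")
    out = bytearray(literals[0])
    for value, literal in zip(values, literals[1:]):
        out += value   # TypeError raised here, early, if a value is str
        out += literal
    return bytes(out)

var1, var2 = b"-one-", b"-two-"
message = bjoin([b"foo", b"bar", b"baz"], [var1, var2])
# message == b'foo-one-bar-two-baz'
```

With a real tb-string, the compiler would build the literals/values split for you, and a str value could fail at literal-construction time rather than at join time.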

With conversion specifiers being evaluated at rendering time, I realised there’s a straightforward way to support lazy fields even with eager value rendering: define !() as a new conversion specifier that means “call at rendering time”.

For t-strings, that looks like:

logging.debug(t"Logging with eager {expensive_call()}")
logging.debug(t"Logging with lazy {expensive_call!()}")

logging.debug(t"Logging with eager {call_with_args(x, y, z)}")
logging.debug(t"Logging with lazy {(lambda: call_with_args(x, y, z))!()}")

For tagged strings, just drop the (t...) (assuming we add __tag_call__ to the logging APIs and allow dotted prefixes):

logging.debug"Logging with eager {expensive_call()}"
logging.debug"Logging with lazy {expensive_call!()}"

logging.debug"Logging with eager {call_with_args(x, y, z)}"
logging.debug"Logging with lazy {(lambda: call_with_args(x, y, z))!()}"

Offering {-> expensive_call_with_args(x, y, z)!()} as syntactic sugar for {(lambda: expensive_call_with_args(x, y, z))!()} can then be left to a later PEP that cites more specific use cases.
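The deferral benefit can be simulated in today’s Python to show what’s at stake. LazyField and render are invented for illustration; the real proposal would hang this off the template’s conversion machinery rather than an explicit wrapper class:

```python
call_count = 0

def expensive_call():
    """Toy stand-in for work we'd rather skip when logging is disabled."""
    global call_count
    call_count += 1
    return "result"

class LazyField:
    """Stand-in for a {expensive_call!()} field: holds a callable that is
    only invoked at rendering time."""
    def __init__(self, func):
        self.func = func

def render(literals, fields):
    """Interleave literals and fields, calling lazy fields only now."""
    out = [literals[0]]
    for field, literal in zip(fields, literals[1:]):
        value = field.func() if isinstance(field, LazyField) else field
        out.append(str(value))
        out.append(literal)
    return "".join(out)

# Eager: expensive_call() runs when the "template" is built...
eager = (["Logging with eager ", ""], [expensive_call()])      # call_count is now 1
# ...lazy: it only runs if render() is actually reached, e.g. when the
# log level turns out to be enabled.
lazy = (["Logging with lazy ", ""], [LazyField(expensive_call)])  # still 1
rendered = render(*lazy)  # the call happens here; call_count becomes 2
```

If the logging call is filtered out before rendering, the lazy field’s callable is simply never invoked, which is the whole point of the !() specifier.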

As others have noted, the proposal for a template literal syntax is PEP 501 (since we have two PEPs assigned, I think it makes the most sense to have each make the best case it can for its preferred syntax). While @nhumrich and I both like large parts of the PEP 750 implementation proposal (so our proposed PEP 501 implementation approach would now be to start from the PEP 750 implementation and tweak the details of the surface syntax and the public API), we’re definitely not convinced that it makes sense to expand the syntax proposal from a single dedicated t prefix to allowing arbitrary prefixes.

I’ll be starting the PEP 501 update PR based on the notes at PEP 501: improvements inspired by PEP 750's tagged strings · Issue #3904 · python/peps · GitHub tomorrow.

On a related note, I was happy to see in PEP 750: Tag Strings For Writing Domain-Specific Languages - #176 by jimbaker that the current thoughts on the type system updates needed to allow DSL-aware syntax highlighting are expected to work regardless of whether the spelling is tag_function(t"...") or tag_function"...".

The case hasn’t been made to allow “fb-strings” yet, so the case for allowing “tb-strings” is weak. Leaving the door open to adding them in the future is another advantage of restricting ourselves to just t-strings rather than allowing arbitrary tags, though.

3 Likes

While I don’t think we need to support the absolute lowest common denominator in editors, I do think it’s reasonable to aim to provide something that doesn’t require an editor that does semantic analysis to get a decent development experience.

To my mind, the biggest problem with the tag(t'...') approach is exactly the fact that it’s “just” a function call. The only way of knowing how to interpret the data inside the t'...' is to identify the name of the tag function. I’m fine with saying that we look at the function name rather than its definition (so nobody should expect myhtml = html to give a new tag function that editors will understand - they might, but that’s in the realm of semantic interpretation, which simpler editors won’t do). What I’m not fine with is limiting the syntax of the function call.

So, for example, let’s look at

html(t"This is a long string, which will make black want to wrap. Let's see how it goes!")

(lol, Discourse seems to wrap this. It’s a single line of source code). Assuming black handles t'...' the same as f'...', it will reformat this to

html(
    t"This is a long string, which will make black want to wrap. Let's see how it goes!"
)

And even if black gets changed to handle t-strings, human developers could quite reasonably use this form.

Edit: I got distracted, and forgot to make my point clear, which is that we can’t reasonably enforce specifics of spacing, because that’s not how function calls work. Hopefully, it was clear enough what I was trying to get at anyway. (End of added paragraph)

There’s also html ( t'...' ), or any number of variants of whitespace abuse. Which we might say are unreasonable, but they are all legitimate function calls, and having an editor privilege one style over another (in terms of giving a better developer experience) seems to me to be a direction we shouldn’t be going in.

So, to summarise, I prefer tag'...' over tag(t'...') precisely because it’s not syntactically a function call, even though it might be, semantically.

Maybe we could distinguish tags via something like t:html'...', with t:func'...' being equivalent to func(t'...'), and do everything else in terms of a t'...' template object. That gives us distinguished syntax, namespacing of tags, and an actual template object, while still keeping the end user (and source code) syntax identifiable enough to allow decent handling by an editor that (for example) does highlighting via regex-style parsing.

4 Likes