Language hints for PEP 750 template strings

Hi everyone,

Now that PEP 750 is accepted, I’d like to discuss how tools can detect what language is inside a template string. For example, when we write:

query = t"SELECT * FROM users WHERE id = {user_id}"

Ideally, our IDEs should highlight the SQL syntax. But how can our tools determine that this template contains SQL?

The PEP mentions this challenge but delegates the solution to the community: PEP 750 – Template Strings | peps.python.org

I’ve been experimenting with using Annotated for this:

from typing import Annotated
from templatestrings import Template  # or wherever this ends up

query: Annotated[Template, "sql"] = t"SELECT * FROM {table}"

This approach seems promising - I’ve created a prototype LSP to test it (GitHub - koxudaxi/t-linter). However, I’m not sure this is the best path, so I’d like feedback from the community.

The advantage of using Annotated is that type checkers could potentially leverage this information as well - for instance, applying SQL-specific validation rules only to SQL templates. Additionally, it maintains backward compatibility since tools that don’t recognize the annotation can simply ignore it.
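For what it's worth, the metadata is already reachable through the standard typing introspection machinery, so tools wouldn't need anything new at runtime. A minimal sketch (using str as a stand-in for Template, since the final import location is still unsettled):

```python
from typing import Annotated, get_type_hints

# str stands in for Template here; "sql" is the language hint carried
# as Annotated metadata.
class Queries:
    user_query: Annotated[str, "sql"]

hints = get_type_hints(Queries, include_extras=True)
print(hints["user_query"].__metadata__)  # -> ('sql',)
```

A static tool would of course read the annotation from the AST instead, but the convention is the same either way.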

That said, there might be better approaches:

  • Subclassing: class SQLTemplate(Template): ...
  • Hints package: Pre-defined types like from template_hints import SQLTemplate
  • No coordination: Let each tool implement its own detection (though this might lead to fragmentation)

I’d appreciate your thoughts on whether we should coordinate this effort as a community, or if it would be better to let different tools develop their own solutions independently.

I’m also interested in your opinions on the format for language identifiers - should we use simple names like “sql” or MIME-style types (e.g., “text/x-sql”, “text/html”)?

I look forward to hearing your perspectives, especially from those working on type checkers or IDEs.

Depending on the discussion, we might consider creating a shared package with type aliases or documenting best practices - but let’s first see what approach makes the most sense for our community.

Thank you!

4 Likes

Personally, I think it’s too early to be trying to standardise something here. Let’s wait until tools have started to develop some (tool-specific for now) conventions, and then, once we have some practical experience to draw on, start thinking about whether we should standardise.

A much simpler approach could be to have do-nothing functions with conventional names:

def sql(x):
    # Marks a t-string as SQL
    return x

query = sql(t"SELECT * FROM {table}")

or even just structured comments:

query = t"SELECT * FROM {table}" # t-str: sql

At this point, we just don’t know what will be the most usable solution.

2 Likes

See related discussion in Consider making `Template` and `Interpolation` generic at runtime · Issue #133970 · python/cpython · GitHub. @lys.nikolaou there recommended that we discuss further how to make Templates work well with type checking, so here we are.


I’d like to start by thinking about what we’d want to get out of some sort of type hints for templates. Here are some things that could be useful:

  • Syntax highlighting in editors. Requires a way to tag the string, a way for the IDE to recognize the tag, and a way for the IDE to map that tag to some sort of lexer.
  • Syntax checking inside the template (e.g., an error when you have a syntax error in your SQL). Conceptually this could work similarly to syntax highlighting, but it’s more important to get it exactly right; for example, you can likely get away with one form of syntax highlighting for all dialects of SQL, but syntax checking soon requires you to know what kind of SQL you’re dealing with.
  • Checking that the right kind of template is passed to the right function. If a function wants a SQL Template but you pass it a shell command Template, a type checker should tell you.
  • Checking that interpolations are correct. For example, if you interpolate a function object in a SQL template, that’s probably wrong. But maybe in some other kind of template it’s acceptable.

Feel free to add to that list.

I don’t know how you’d solve all of these in the general case. A difficulty is going to be that there’s an infinite possible set of mini-languages for templates, and it’s not clear how you could tell an arbitrary tool like a type checker or IDE about the mini-language for your template.

8 Likes

To build on what Jelle said, syntax highlighting requires that the string can be parsed by a parser for the target language. Even if a tool knows which language a template is targeting, standard parsers will not be able to parse a template string without syntax errors. For example, the parser in CPython would report a syntax error if it were asked to parse the string "x = {value} if {condition} else None". Likewise, a standard SQL parser would report a syntax error if asked to parse "SELECT * FROM {table}".

It’s theoretically possible to write a bespoke parser with syntax-error-recovery logic that can skip over the interpolation expressions (making assumptions about the tokens they generate), but this would be a big undertaking for even one language. It might also be possible for a tool to use some heuristics to replace the interpolation clauses with best-guess placeholder tokens before calling a standard parser, but this would be error prone and would lead to many false positives and negatives. For this reason, syntax highlighting might not be feasible for template strings.
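To make the placeholder idea concrete, here is a deliberately naive sketch of that heuristic (the regex and the placeholder choice are mine, and it misfires in exactly the ways described - a "?" is a plausible guess in a value position but produces invalid SQL when the interpolation is a table name):

```python
import re

def substitute_placeholders(template_source: str, placeholder: str = "?") -> str:
    """Replace {expr} interpolations with a guessed placeholder token
    so a standard parser can attempt the rest of the string.
    No handling of nested braces or format specs - illustration only."""
    return re.sub(r"\{[^{}]*\}", placeholder, template_source)

print(substitute_placeholders("SELECT * FROM users WHERE id = {user_id}"))
# SELECT * FROM users WHERE id = ?
```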

The third bullet in Jelle’s list (checking that the right kind of template is passed to a function) is quite feasible and could be done by either making Template and Interpolation generic or by using NewType to define a subtype of Template that is specific to a target language.

from typing import NewType
from string.templatelib import Template  # PEP 750's runtime type

SqlTemplate = NewType("SqlTemplate", Template)

def format_sql(t: SqlTemplate): ...

query = SqlTemplate(t"SELECT * FROM {table}")
format_sql(query)
8 Likes

We’ve informally referred to this as the “HTML is not HTML” problem on the PEP team. It’s “not” in three ways:

  1. The way Eric mentions: it’s not HTML because it contains curly braces holding Python expressions; no standard HTML parser would accept it.
  2. Beyond that, two different html(t: Template) methods could accept interpolations at different locations in a t-string. (Can I place a substitution inside a tag opener like t"<div {attributes}>..."?)
  3. Beyond that, two different html(t: Template) methods accepting interpolations at the same locations might still accept different types for their values. (If I write t"<div class={value}>...", must value be a str, or does the html() implementation support a list[str] too? etc.)

Stepping back, my (admittedly fuzzy) hope is that (a) over time, the community rallies around (possibly informal) standards for what constitutes “valid t-string HTML” (and SQL, etc.), and (b) we decide how to special-case these in our tooling. It’s a lot of work.

1 Like

Personally, I think it’s too early to be trying to standardise something here. Let’s wait until tools have started to develop some (tool-specific for now) conventions, and then, once we have some practical experience to draw on, start thinking about whether we should standardise.

I understand, but I’ve already created a working prototype (t-linter), and I found that the complete absence of coordination makes it difficult to get started. That’s why I’m seeking community input.

1 Like

I missed that issue. You’re right - PEP 750’s typing has two unresolved challenges: the Generic issue for Interpolation and sub-language annotations.

Thank you for organizing the requirements. I agree with them.

I don’t know how you’d solve all of these in the general case. A difficulty is going to be that there’s an infinite possible set of mini-languages for templates, and it’s not clear how you could tell an arbitrary tool like a type checker or IDE about the mini-language for your template.

While there are certainly performance and other issues to solve, I think a good approach would be for a tool’s Python parser to detect t-strings, recognize the sub-language, and then delegate to external tools for linting and syntax highlighting.

In my experimental t-linter implementation, I used tree-sitter, which allowed me to leverage tree-sitter’s language parsers to highlight sub-languages.

1 Like

The third bullet in Jelle’s list (checking that the right kind of template is passed to a function) is quite feasible and could be done by either making Template and Interpolation generic or by using NewType to define a subtype of Template that is specific to a target language.

This is a great idea that I hadn’t fully considered. Passing it as a type rather than using Annotated makes sense. I think it would be even cleaner if we could solve this with Generics.

For this reason, syntax highlighting might not be feasible for template strings.

Of course, this is a technically challenging problem, but I believe it would be very valuable if we could solve it despite the difficulties. For example, TSX passes templates to React components in a similar way, and editors display them beautifully.

I understand that careful replacement is needed and that dedicated parsers are required, but I wanted to see whether it was practically implementable, so I created the tools. I built t-linter (linked in #1), a VSCode extension, and a tlinter-pycharm-plugin. For t-linter, tree-sitter handled the {} interpolations well. The PyCharm plugin used IDEA’s multi-language injection feature.

In these experiments, I confirmed that basic cases can be highlighted correctly.

I’m attaching screenshots from VSCode and PyCharm for reference.


2 Likes

I’ll echo Josh of last year: PEP 750: Tag Strings For Writing Domain-Specific Languages - #19 by thejcannon

Whatever happens here should ideally be applicable to non-template strings as well, so that linters, highlighters, and formatters run on them too (ideally my YAML formatter would also format the YAML in my Python multiline strings - why shouldn’t they join in the fun?).

@thejcannon
I also considered this idea, but I skipped it due to several concerns that I think would arise.

If we provide syntax highlighting for literals, it would encourage users to write SQL or HTML in them. However, I believe doing this for strings other than t-strings would not be good practice. For example, if we enable SQL syntax highlighting for f-strings, there is a risk that users might write f-strings with SQL injection vulnerabilities or related unsafe code.

Additionally, this relates to the discussion at PEP 750: disallow str + Template. If users write SQL in regular string literals, they might combine them with f-strings or other strings, which could also lead to code with SQL injection vulnerabilities.

I might be overthinking this, so I would appreciate any helpful opinions on this matter.

1 Like

Strictly speaking, is there anything stopping me from doing Annotated[str, SQL]?

Personally, I would be happy if using Annotated[str, SQL] would enable syntax highlighting and linting for all string literals. However, I’m uncertain whether we should allow (or recommend) this as a community due to the security concerns I mentioned earlier.

From a technical perspective as a tool developer, regular string literals are much more numerous than t-strings, so running syntax highlighting triggers (type resolution) for all string literals could be computationally expensive. (Though I think this is something we should consider later - for now, we should focus on what the community should allow or recommend.)

Perhaps we could start with t-strings only and gradually expand to other string types based on community feedback and security best practices?

1 Like

WebStorm has pretty good support for this and the JS ecosystem in general.

For WebStorm, you can configure custom patterns for where it should inject another language.

Prettier detects template literals with a leading /* HTML */ (or GraphQL, CSS etc) comment or if the template tag is html. It offers an option to disable all embed handling.

Getting this right without any form of configuration is probably hard because of the many different dialects that exist.

1 Like

Indeed, as you mentioned, it might be difficult to specify all SQL dialects and similar variations through comments or Annotated.

Since pyproject.toml has already become widely accepted, it might be a good idea to specify dialects and specific highlighting/linting policies there.
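As a purely hypothetical sketch (the table name and keys below are invented for illustration, not an existing convention), such a configuration might look like:

```toml
[tool.t-linter]
# Invented keys: map language hints to concrete dialects and policies
sql-dialect = "postgresql"
html-version = "html5"
```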

Related: Language specifications for strings · Issue #1370 · python/typing · GitHub and Language specifications for strings

A semi-thought-out suggestion that should be general enough to allow tool-specific innovation, at the cost of a small amount of complexity.

In typing (or maybe somewhere else), have a type Tag. I am open to alternative names, maybe Language or LanguageTag; for brevity I am going to stick with Tag for now. This type acts similarly to types.SimpleNamespace, in that it takes an arbitrary number of keyword arguments and exposes them as attributes. But in contrast to SimpleNamespace, Tag should be immutable and hashable, so that it can be used as a dictionary key.

Instances of Tag should be created as module globals with only literal arguments, e.g.

HTML = Tag(
    language="html",
    version="html5"
)

The exact subset of Python syntax that is supported is TBD. I am tempted to say “anything that ast.literal_eval can parse directly”, but that might have issues I haven’t thought of. This restriction is so that static analysis tools have an easy time parsing the information contained within these objects. It cannot and will not be enforced at runtime - emitting corresponding errors is also a job for static analysis tools like linters.

These objects can then be used for Annotated:

def print_html(text: Annotated[str, HTML]): ...

Additionally these objects are callable as runtime identity functions with a signature like[1]

    def __call__(self, x: Annotated[T, self]) -> Annotated[T, self]:
        return x

So that it’s easy to mark a string as a specific language even if you don’t have a function that directly needs it at the literal location.

TEMPLATE = HTML("""
...
""")

Tools can analyze the set of attributes on the tag object and do corresponding syntax highlighting or linting on the passed in strings. This approach easily deals with str, LiteralString and Template as needed.

A benefit of having arbitrary attributes is that additional information can easily be encoded in a structured way. Tools that don’t understand/care about a piece of information can ignore it, and they should ignore all attributes they don’t understand. This can cover dialects, different versions of a language, different templating options, which types are valid interpolations in template strings, how template processors will work with format strings, …

But all of that is future work that can be more easily worked out by the tools once the basic framework exists. For the beginning I would just specify language as a required key that defines the basic language of a string. Contrary to the above example, a MIME type is probably a good standard to refer to here, although I am not sure about that, since I am not too familiar with MIME types and what other approaches might exist.
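To make the shape of this concrete, here is a rough runtime sketch of such a Tag type (the class body is entirely my guess at the semantics - immutable, hashable, callable as an identity function):

```python
class Tag:
    """Sketch: an immutable, hashable namespace of literal metadata."""

    def __init__(self, **attrs):
        # Store a sorted tuple of items so equal attribute sets hash equally.
        object.__setattr__(self, "_attrs", tuple(sorted(attrs.items())))
        for name, value in attrs.items():
            object.__setattr__(self, name, value)

    def __setattr__(self, name, value):
        raise AttributeError("Tag instances are immutable")

    def __hash__(self):
        return hash(self._attrs)

    def __eq__(self, other):
        return isinstance(other, Tag) and self._attrs == other._attrs

    def __call__(self, x):
        # Runtime identity function; the interesting part happens statically.
        return x

HTML = Tag(language="html", version="html5")
assert HTML.language == "html"
assert HTML("<p>hi</p>") == "<p>hi</p>"
```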

The big benefit over just using a string literal in Annotated is that it’s unambiguous that it refers to the language of the literal, and that it can hold arbitrary extra information.


  1. This of course isn’t a valid signature - this requires special casing ↩︎

1 Like