PEP 750: Tag Strings For Writing Domain-Specific Languages

What I intended to emphasize here was that, given that protocols and ABCs tend to be “carved in stone” once officially defined, it might be worth wondering whether, for now, it would not be safer to rely on informal duck-typing-based expectations (explicitly documented – but including the caveat of being open to further extensions) rather than on a formal protocol definition, with the intent that the latter would be formalized in a future version.

I believe that:

  • The first three, i.e. those providing the possibility to specify field values using positional arguments (with a separate argument per field), are not a good idea, at least not for now – considering that it is not clear whether an incremental/partial filling of fields is ever wanted… Generally, positional arguments, unlike keyword arguments, do not mix well with “composition” of consecutive independent calls.
  • I’d opt for the 5th one (Template(...); more on that in the next post…) – rather than the 4th one (str.tformat()).
  • The 6th one (tt"...".fill() == Template().fill() == the 8th, assuming that both t-strings and tt-strings would produce a Template) might be a good idea in the future (no need to rush for it yet, see above…). Same for the 7th.
  • In any variant, fields within a format specifier should be processed eagerly (in the same vein as currently in the PEP).

@dkp @ncoghlan

One more thing… Given that certain details of the [Static]Template’s interface need to be decided now (that is, in PEP 750), and that, once they are decided, they’ll be carved in stone (at least to much extent) – I’d like to ask you to consider whether it wouldn’t be a good idea to:

  • (1) …rename Interpolation to ReplacementField? (considering that, as it has already been noted in an earlier post, replacement field is a well established term – used in the f-strings syntax docs and the relevant portions of the string module’s docs; and that the variable name field seems to feel natural, when referring to instances of that type – see several code snippets in the posts in this thread, even without counting those written by me :slight_smile:)

  • (2) …rename Template().args to Template().segments? (considering that the latter is more specific, and that it would help to establish this terminology element: segment seems quite accurate to refer to any of those str-or-replacement-field components we deal with; and it has also been used in this thread as a variable name in several code snippets by Alyssa)

  • (3) …“move” the current Template() constructor’s low-level signature+semantics to a classmethod: Template.from_segments(), and make the Template() constructor itself have more high-level signature+semantics? Namely, to allow users to create a Template from a template pattern specified as an ordinary str and a mapping from replacement field names to their expression values. Thanks to that:

    • a counterpart to str.format()/format_map(), for which (among others) @pf_moore asked, would be exposed to the user in an articulate manner;

    • a handy building block for future “true templating” tools would be provided (EDIT: and that would help reduce the terminological dissonance I referred to in an earlier post);

    • the crux of an important underlying procedure (of turning a template pattern and data to an actual Template) would be exposed.

    See more information on this idea.

2 Likes

Renaming Interpolation, reworking the template creation signature to better support dynamic templates, and renaming args all sound like good improvements to me.

I confess that repeatedly using the “segments” variable name in examples was intentionally testing the waters before proposing bringing the “segments” attribute name over from PEP 501’s equivalent to the Template type.

There’s also a genuine inaccuracy in using “args”: since the constructor may insert extra empty text segments, or merge adjacent segments, the segment list may not be the exact same sequence as was passed to the constructor.

Edit: to be clear, I think reworking the Template signature would mean I no longer saw any need to revisit the protocol idea. The split constructor signature is a better solution to the same problem.

4 Likes

I think maybe too much is being made of the notion that these aren’t templates. They are templates, just with early binding semantics and, notably, without a fancy syntax for evaluating arbitrary callables at rendering time.

I think these ideas are fine, if the unbound template type is also meant to be exposed.

That is, the names are good – I’ll pick BoundTemplate for the moment – if we expect the following sentence to make sense: “a t-string is sugar for construction of a template and binding it in the current scope”.

But if the unbound template will never be exposed as an idea, then any name like this may carry the false implication that there is such a thing as an unbound template.


It might be an interesting example to include, in How to Teach This, passing a template whose fields pair callables with values, and having the templater simply apply those deferred function calls. I don’t recall seeing an example like that in the PEP.

1 Like

I realised I’m making some assumptions around the “split constructor signature” idea that neither I nor @zuo have explicitly stated (and my assumptions may not be the same as @zuo’s), so writing those down:

  1. I’m assuming the basic type creation signature would be a simplified formatting syntax that only supports auto-numbered style format strings (no field expressions), and accepts an iterable of values. The expr field on the interpolations would be populated with the empty string to indicate the lack of an explicit expression in the template definition.
  2. If private re-style caching of the dynamic templates is included, it would just use a regular Template instance with the field value attributes set to either Ellipsis or None (I don’t think the choice matters much, since the caching would be a private API)
  3. Supporting full dynamic templating would be deferred to a later PEP (more on that below)
  4. The final line of the default constructor would be defined as return cls.from_segments(*segments), so subclasses only need to override Template.from_segments to change the way construction works (such as pre-quoting replacement field values, so naive formatting becomes safe).
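Assumption 4 (routing default construction through an overridable classmethod) might look like the sketch below; all names here are hypothetical stand-ins, not the PEP’s API. The subclass pre-quotes replacement field values so that naive formatting becomes safe:

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for the PEP's Interpolation type.
    value: object
    expr: str = ""


class Template:
    def __init__(self, *segments):
        # Default construction funnels through the overridable
        # low-level classmethod, per assumption 4.
        self.segments = type(self).from_segments(*segments).segments

    @classmethod
    def from_segments(cls, *segments):
        inst = object.__new__(cls)
        inst.segments = tuple(segments)
        return inst


class HtmlTemplate(Template):
    @classmethod
    def from_segments(cls, *segments):
        # Pre-quote replacement field values so naive rendering is safe.
        quoted = [
            Field(html.escape(str(s.value)), s.expr)
            if isinstance(s, Field) else s
            for s in segments
        ]
        return super().from_segments(*quoted)
```

Only from_segments needs overriding; the high-level constructor picks up the changed behaviour for free.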

The reason I’d still like to defer full dynamic template field formatting (with class methods like Template.from_format and Template.from_format_map) is that we don’t have a nice building block that would let us easily represent the dynamic attribute and subscript retrieval that the full dynamic string formatting syntax supports. (I’m less worried about format string interpolation, since that can be handled by “templating the template”)

Specifically things like {0.metrics.weight} and {players[0].name} (inspired by the examples in string — Common string operations — Python 3.13.0 documentation).

While I think these could reasonably be represented as a lookup_chain list consisting of itemgetter() and attrgetter() instances, there are enough additional questions raised that it seems worth postponing these topics to a PEP that can fully focus on them, separately from the underlying foundation of the template string syntax definition and implementation.
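For instance, the lookup behind "{players[0].name}" could be represented and resolved roughly like this (the lookup_chain shape and the resolve() helper are my guesses, not a concrete proposal):

```python
from functools import reduce
from operator import attrgetter, itemgetter
from types import SimpleNamespace

# Hypothetical representation of the dynamic lookup in "{players[0].name}":
lookup_chain = [itemgetter("players"), itemgetter(0), attrgetter("name")]


def resolve(namespace, chain):
    # Apply each getter in turn, starting from the namespace mapping.
    return reduce(lambda obj, getter: getter(obj), chain, namespace)


namespace = {"players": [SimpleNamespace(name="Alice")]}
```

Here `resolve(namespace, lookup_chain)` walks the chain: mapping lookup, indexing, then attribute access.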

Since I now also think we’d want to keep UnboundField and UnboundTemplate as their own types (separate from the bound versions, since we can’t actually make them behave identically due to the missing field value information in the unbound variants), accepting PEP 750 without these features wouldn’t lock us out of anything. Even if PEP 750 doesn’t split the construction signature, anyone wanting to do dynamic templating in the meantime would simply need to build their own on top of string.Formatter instead of Template providing a native way to do it.

Once UnboundTemplate and UnboundField were defined, then the dynamic template caching could be defined in terms of those, and hence provide a foundation for offering Template.from_format and Template.from_format_map.

1 Like

Indeed, your and my assumptions diverge when it comes to the above point 1. :slight_smile:

Specifically, I’m assuming that the basic type creation signature would support only named-fields style format strings (i.e., with field expressions being valid Python identifiers [1]). The constructor, apart from the format string, would accept a namespace (represented as a mapping object, given as a positional argument, and/or some number of keyword arguments – combined in a similar vein to the dict() constructor or, preferably, ChainMap-like behavior [2]), which would be required to contain all field names included by the format string.

The expr field on the interpolations would be populated just with the respective field names – making the resultant Template instance equivalent (and equal in terms of ==) to one created using a similar t-string containing replacement fields in which variables with the respective names and values had been used.

E.g., these two ways to create a Template object would be perfectly equivalent:

t1 = Template(
    "Score for {title!r} is {score:.2%f}.",
    title="Monty Python's Life of Brian",
    score=0.96)

t2 = (lambda title, score:
      t"Score for {title!r} is {score:.2%}.")(
    title="Monty Python's Life of Brian",
    score=0.96)
 

As I wrote in a comment to the related issue in the PEP repo:

I deliberately suggest not accepting field values passed as consecutive str.format()-like positional arguments (or as items of a string.Formatter.vformat()-like sequence argument). After all, f-strings do not support any argument sequences.

OK, strictly speaking, f-strings do not support keyword arguments either; but they do support getting items from namespaces – by using variables as f-string’s replacement field expressions; and I would argue that, in some sense, this is the basic way of using f-strings. And the use of a namespace here (expressed as a mapping and/or keyword arguments) is closely related to that basic way.

I’d also argue that namespace/mapping-based approach is much more flexible in terms of combining/adjusting/customizing field values – especially if one needs to specify them incrementally/in different places/at different stages of processing, but still in a cooperative way.
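That incremental, cooperative way of specifying field values is easy to illustrate with collections.ChainMap, independent of any Template details:

```python
from collections import ChainMap

# Field values supplied at different stages of processing:
app_defaults = {"greeting": "Hello", "name": "world"}
request_overrides = {"name": "Alice"}

# Earlier mappings take precedence, so a later processing stage can
# override earlier defaults without mutating them.
namespace = ChainMap(request_overrides, app_defaults)
```

A namespace like this could then be handed to the (proposed) high-level Template constructor as its mapping argument.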


PS [EDIT] When it comes to my gut feelings regarding future enhancements, especially those related to “true templating” (I agree that, at the moment, any concrete proposals beyond the available building blocks would be premature), I’d have greater hopes in exploring functools.partial()-based [2:1] and factory-maker-like [3] approaches, rather than in introducing new Template-like types… But, obviously, I do not rule them out. :slight_smile:


PPS [EDIT] When it comes to the other points (2., 3., 4.) in the latest post by @ncoghlan, my assumptions are generally consistent with them.


  1. Perhaps it could be made less strict in the future – to accept arbitrary strings, presumably representing arbitrary expressions (see the PPPS part of the comment to the related issue in the PEP-repo). ↩︎

  2. See my comments to the related PEP-repo issue. ↩︎ ↩︎

  3. See: the Variant #4 part of my earlier post. ↩︎

1 Like

Agreed.

This has become a very deep rabbit/rat hole over the idea that t"{x}" must have a value for x at the time the t-string[1] is specified, and any use of the created t-string is not allowed to reinterpret that.

It is.

The rich object representing the {x} part can contain both the value and the expression itself (as a string), and whichever function is processing the string can use whichever it wants. All we have to do is capture any exceptions raised when evaluating x and raise them if the consumer accesses .value (primarily to defer NameError until it’s actually known that the value for the name was needed).
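A minimal sketch of that capture-now, raise-on-access behaviour (the class and attribute names here are hypothetical; the PEP may spell this differently):

```python
class Interpolation:
    def __init__(self, expr, evaluate):
        self.expr = expr  # the source text of the expression
        try:
            # Evaluate eagerly, but hold on to any exception instead
            # of letting it propagate at t-string creation time.
            self._value, self._error = evaluate(), None
        except Exception as exc:
            self._value, self._error = None, exc

    @property
    def value(self):
        # Re-raise the captured exception only if the consumer
        # actually asks for the value (deferring e.g. NameError).
        if self._error is not None:
            raise self._error
        return self._value
```

A consumer that only reads .expr never sees the deferred exception; one that reads .value gets it at the point of use.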

These t-strings have no inherent use on their own. They must be used with a t-string aware function[2] and so it’s always possible to choose whether to use the values or the expressions.

And as a result, they are templates. The aspect that brings value is that they also capture the value of template expressions that are available at time of definition/creation.


  1. I’ll deliberately avoid calling it a “template” for now, though I’m happy enough with that name. ↩︎

  2. Which is why I suggested earlier that the default __str__ could behave like a normal f-string for any t-string. ↩︎

1 Like

You’re right, that would pose fewer unanswered questions than the sequence based version.

1 Like

The thing I don’t like in this approach is that (if I understand the concept correctly) an immediate attempt to evaluate an expression is always made, but effects of that evaluation (including possible errors and/or side effects) are silently ignored unless and until some further code, i.e. the rendering function, asks for the interpolation’s value attribute.

In particular, if that function is never invoked, or – importantly for this approach – if it is invoked but chooses not to ask for value, then those effects are ignored forever. And the worst part is that, in the latter case, that initial evaluation is just pointless (which does not mean harmless), and the programmer’s attention is not directed at it at all. So the effects of that evaluation are likely never to be considered and examined, even if they are harmful (or wasteful, or just have a potential to attract, unnoticed, bugs in the future…) – at least until those effects bite somebody.

Please consider the following example:

import shutil
import tempfile
from pathlib import Path

class SomeService:

    def prepare_work_dir(self, parent_dir):
        self._work_dir = tempfile.TemporaryDirectory(dir=parent_dir)
        return self._work_dir.name

    def get_response_template(self):
        return (
            # Here `prepare` is meant to be a component of this
            # "abstract" template. We specify its intended binding
            # later -- in the `handle()` method (see below...).
            t"<p>OK, prepared work directory is {prepare(
                Path("~/my-precious-data").expanduser()
            )}</p>"
        )
        # Neither the design nor style of this code are
        # great, but all this looks innocent, doesn't it?
        # But sorry: if your home directory contained
        # `my-precious-data` subdirectory, it has just
        # been removed together with all its contents, and
        # then re-created (empty, only with `spam` subdir).
        # Yet, if it did not exist, nothing happened because
        # of a FileNotFoundError (silently swallowed, i.e.,
        # still without giving you any hint, that your code
        # attempted weird actions)

    def render_response(self, template, eval_locals):
        ...
        return ...  # here: code to render `template`, evaluating
                    # expressions it includes, using `eval_locals`

    def handle(self, request):
        ...
        return self.render_response(
            self.get_response_template(),
            eval_locals={
                "prepare": self.prepare_work_dir,
            },
        )

...
# <a few hundred lines of code>
...

# Some unrelated local helper...
def prepare(directory):
    shutil.rmtree(directory)  # BOOOOM!
    Path(directory, "spam").mkdir()


1 Like

No no, side effects happen as normal. It’s just if an error is raised then the error won’t bubble out until it’s used.

So yes, if you do complex things as an expression that you expect to be a later template substitution, you may lose errors:

>>> substitute(t"{1/0}", **{"1/0": "text"})
# Expected (maybe, unless the "substitute" gives you a hint):
ZeroDivisionError()
# Actual:
"text"

Which admittedly does allow for someone to do something dumb like:

>>> substitute(t"{os.system('rm -rf /')}", **{"os.system('rm -rf /')": "text"})
# Expected (if you assume the expression is not evaluated):
"text" and my hard drive still has files
# Actual:
"text" but my hard drive has no files

I’d argue that this isn’t the only way to get weird things to evaluate, and note that the expression to be evaluated must be in the literal, not in an argument value, so it’s no more at risk of injection than any other code.

And I think it’s obvious that if you are writing a template string with names to be substituted later, and you put a complex expression with side-effects in there, then you should expect weirdness. It’s not that much different from a partially evaluated set of function arguments:

f(os.system("rm -rf /"), 1 / 0)
# Removes files, but never calls 'f' because of the ZeroDivisionError

The majority case for a template string with deferred substitutions would be simple names, which have simple NameErrors and no side-effects. And if you are using a template string more like a function call with expressions, you’ll be passing it somewhere (probably immediately) that uses the values rather than the expression, and so the errors will get raised.

Of course, the other option is to have another prefix that just doesn’t evaluate the expressions, but still does all the parsing. That makes it impossible to get the values from the source (which is good! You should have to ask for that), but still keeps all the parsing goodness and syntax checks. Unlike the tt proposal, this is simple :wink: All the .value attributes can just raise some error (but usually the template string would be passed to a function that ignores them anyway).

Yes, they happen – and my point is that (assuming the approach you propose) they happen even in cases when that’s pointless, and when what happens is beyond the programmer’s goals and attention, and what exactly happens may depend on unrelated contents of the module, creating a potential for an unintended action at a distance.

I find that at least highly inelegant, with a potential for becoming evil. :slight_smile:

For the record, I didn’t claim that a code injection is a threat in this case. [EDIT] At least not in the basic sense of that term.

I’d argue that using simple functions or functools.partial objects in conjunction with the “high-level” Template signature (see my previous posts…) would be more explicit and even simpler (especially when it comes to reasoning about what happens), e.g.:

make_cheese = functools.partial(
    Template,
    "Label: {label!r} - Price: {price:{.2f}} - Score: {score:.2%}",
)
leicester = make_cheese(label="Red Leicester", price=628, score=0.96)
limburger = make_cheese(label="Limburger", price=314.15, score=0.91)
gruyere = make_cheese(label="Gruyère", price=1670.99, score=0.95)

Sure, and maybe the answer is “t-strings are not for non-capturing purposes”, and people who assume that “template” implies “non-capturing” just need to learn that it doesn’t imply that.

I don’t think this would be a great solution, because either linters would complain that the variable doesn’t exist, or they wouldn’t complain when you try to pass an unbound variable to a function that always accesses .value.

1 Like

After playing with the new t-string version of this PEP, I wonder whether it would be worthwhile to include recommendations for how existing templating tools should adopt this feature.

My concern here stems from two factors occurring together, which would be benign on their own:

  • Existing template-like tools updating to accept either the current string-plus-optional-arguments form or a single t-string
  • Users finding a t-string used online and thinking “ah like an f-string”, not realizing that the two are quite different.

The issue for library authors

I’m tempted to look at functions like django.utils.html.format_html and think “what if this accepted a string and arguments or a t-string?”

This:

name = "<script>Malicious name</script>"
format_html("<b>{name}</b>", name=name)

Could then optionally become this:

name = "<script>Malicious name</script>"
format_html(t"<b>{name}</b>")

While attempting to implement this with an isinstance check, I realized that this could introduce confusion that might result in more accidental HTML injections.
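To make the hazard concrete, here is a hedged sketch of such a dual-mode function (the real django.utils.html.format_html works differently; Template and Field are stand-ins for the t-string runtime types):

```python
import html
from dataclasses import dataclass


@dataclass
class Field:
    value: object  # stand-in for a t-string interpolation


@dataclass
class Template:
    segments: tuple  # stand-in for the t-string runtime type


def format_html(fmt, /, **kwargs):
    if isinstance(fmt, Template):
        # t-string path: escape each interpolated value.
        return "".join(
            seg if isinstance(seg, str) else html.escape(str(seg.value))
            for seg in fmt.segments)
    # Legacy path: escape the keyword-argument values, then format.
    # The trap: a caller who passes an f-string here gets no escaping
    # at all, yet the call site looks almost identical.
    return fmt.format(**{k: html.escape(str(v)) for k, v in kwargs.items()})
```

With this, both `format_html("<b>{name}</b>", name=evil)` and the Template form escape the payload, but `format_html(f"<b>{evil}</b>")` silently does not.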

The issue for users

I’m imagining someone stumbling upon that t-string version on StackOverflow and mistaking it as an f-string or even thinking it’s an f-string-like tool.

This might lead them to use an f-string, which would be a bad idea in this case:

name = "<script>Malicious name</script>"
format_html(f"<b>{name}</b>")

Possible recommendations that PEP 750 could make

Possible recommendations I can think of to avoid this situation are:

  1. Don’t introduce t-string support to existing template-like tooling. In this case, we should invent a new format_html_template or some other function that accepts only t-strings.
  2. Issue warnings to users when it seems like there may be an issue (e.g. when no additional arguments are passed).
  3. Encourage linters to be aware of these t-string-aware tools and issue warnings.
  4. Transition to only allowing t-strings with the existing tools.

1 means giving up the existing names and paths for these tools, but it’s the safest approach. 2 may upset current uses of these tools. 3 is complex because linters would need awareness of each tool that can accept t-strings or strings (maybe type annotations could help though). 4 would require transitioning all code that currently uses these tools.

I don’t know whether the PEP should make a specific suggestion, but this concern seems like one worth considering.

9 Likes

I have doubts about the entire motivation (although I’m not sure if it’s too late to express them).

I don’t think programming languages should encourage users to build strings of other-language-syntax by string concatenation/interpolation.

Wouldn’t it be better to create HTML using functions and objects rather than strings (e.g., not the best library, but certainly better than strings)? Not only does this avoid injection attacks, you don’t have to worry about mismatched tags, mismatched angle brackets, you can validate the structure in the html function if you want, the result is more strongly typed in two ways: the tag-functions can limit what types they accept (e.g., span(...) should only accept elements whose type is Inline) and the result is strongly-typed (being HTML not str). If you’re using the html object later on, you don’t have to parse it.

The same arguments and more apply to SQL, because we also have parameterized queries at the SQL-engine level. Won’t this feature encourage users to do the wrong thing, write their own probably incorrect SQL quoter, rather than use parameterized queries?

Some others have written on the downsides of template strings as well.

2 Likes

I see people (non-programmers by trade) frequently write raw SQL composed through pure string manipulation without any parametrization. And in no world are those people going to make the leap to translate these queries into equivalent e.g. sqlalchemy constructs. Also, some things cannot be parametrized (like a table name). At the end of the day sqlalchemy is ultimately still doing string manipulation to arrive at the final SQL statement, same as a theoretical SQL templating function.

I think this PEP at least enables there to be templaters that do the correct thing (ideally e.g. built into sqlalchemy), while retaining a python-syntax-highlighting aware raw form which are inevitably going to occur (either due to personal preference or lack of an existing abstraction or what have you).

I feel like jinja is an example where you ultimately have an HTML file, and the syntax support/completion for the dynamic Python bits is very lacking, whereas a comparable system written in Python inverts that, so the static bits are strings and all the dynamic bits retain native Python tooling benefits.

Although the current state of the PEP defers niceties like theoretical support for syntax highlightable templates.

Hello all, I’m new here :slight_smile:

My understanding of the discussion so far is that

t"Hello {...:name}" 

(that is, a template string with an “unbound” interpolation) is little more* than just

"Hello {name}"

that is, just a string. *) Of course it is more, because it does provide the structural access.

It’s only the binding process that might turn t"Hello {...:name}" into t"Hello {name}" that expects the name object to exist.

These deliberations are useful, my question is:

What’s the mechanism of turning "Hello {name}" (a simple string) into t"Hello {...:name}" (an unbound template string) or into t"Hello {name}" (a bound template string)?

In other words, if I want to store the template string strings in a JSON (or such) and load them, convert them into template strings and then have my own template string evaluator/formatter — how do I construct them?
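One eval-free way to get at least the structure of a stored plain pattern (a sketch, not anything PEP 750 defines) is string.Formatter().parse, which splits the pattern into literal text and field descriptions without evaluating anything:

```python
import string


def parse_pattern(pattern):
    # Split "Hello {name}" into literal segments and field descriptions
    # without evaluating any expression. The dict shape used for field
    # segments here is purely illustrative.
    segments = []
    for literal, field, spec, conv in string.Formatter().parse(pattern):
        if literal:
            segments.append(literal)
        if field is not None:
            segments.append({"expr": field, "spec": spec, "conv": conv})
    return segments
```

Binding the resulting field names to actual values would then be up to whatever template-string evaluator consumes these segments.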

In Python, f"""a {"b"} c""" evaluates to "a b c".

So I take it that t"""a {"b"} c""" would also be a valid construct that would effectively carry with it a “structure” of that string (in .args).

I can imagine this being useful in text processing like

paragraph = t"""{"This is a sentence.":metadata}{"This is another sentence.":metadata}"""

Basically, instead of using it as a template string, one could use it as a container of “rich text”. I realize that there are other ways to achieve that, but perhaps there might be some advantages :slight_smile:

(I’m using metadata here very liberally.)

I guess if I have a plain string "Hello {name}" that I’ve loaded from a JSON (or some other source), I could do

tstringplain = "Hello {name}"
tstring = eval(f"""t"{tstringplain}" """)

but at that moment, name gets evaluated as well. So this is definitely not safe. If there was a “lazylazy” way like some of those discussed, ideally one that does not need the ... magic, it would be much safer; I could do

tstringplain = "Hello {name}"
tstring = eval(f"""tt"{tstringplain}" """)

(I’m illustrating it with the tt notation that was suggested here as one of the options).

As far as I know, there isn’t any special method to convert a plain string into an fstring, and I assume there isn’t one for string-to-tstring either, right?

It would certainly be useful to have a “lazylazy” way that doesn’t evaluate the interpolations/expressions in any way, just parses for syntax validity. That would be much safer for dealing with external input – although perhaps if there were one, then actually that could open up another can of worms.