Thanks; I like that having the source string available gives you the chance to use it as a dictionary key (useful both for memoization and for i18n catalog lookup, which is essentially a dict.get(source, source)).
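For instance, a tag could cache its (potentially expensive) static parsing keyed on that source string. A minimal sketch, assuming the tag receives a template argument with a .source attribute (expensive_parse and render are hypothetical stand-ins, not anything proposed in either PEP):

```python
_parse_cache = {}

def html(template):
    # Hypothetical tag: parse the static structure once per source string,
    # then re-render it with whatever the current interpolated values are.
    parsed = _parse_cache.get(template.source)
    if parsed is None:
        parsed = _parse_cache[template.source] = expensive_parse(template.source)
    return parsed.render(template)
```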
Another thing that occurs to me, though, is that to be a more-or-less drop-in replacement for flufl.i18n-style translatable strings, you’d need to be able to support $-string (PEP 292) style placeholders. Is the PEP 750 placeholder syntax restricted to the f-string style {placeholder} syntax?
Would source be a DecodedConcrete instance, so the full raw text can also be accessed?
The other planned changes look good to me, including the idea of offering t as a builtin so PEP 750 becomes a true functional superset of PEP 501 (between the planned changes to 750, and the planned changes to 501, any remaining differences will almost entirely be in the way templating related code looks rather than how it works).
I’m still not convinced we actually need the extra step of generalisation to arbitrary string tag support, but with the revisions, it would be straightforward to devise a two-step implementation plan. In step 1, add specific t-string support to the lexer and compiler (with eager evaluation), and spend a release exploring what can be done with that more restricted version. Then decide whether or not to proceed with step 2: adding lexer and compiler support for arbitrary string tags, as well as the decorator to choose between eager, lazy, and selective field interpolation.
I spent some time yesterday thinking about aligning these templating proposals with string.Template in the context of PEP 501. At the time I was skeptical of the suitability, but I’m starting to see more merit in the possibility now.
However, it would come with some restrictions:
you’d be limited to the ${...} substitution form, since the compiler wouldn’t see $... as defining an interpolation field (whether the $ was required or optional would be up to the templating function)
to allow interpolating more than simple references to named variables, you’d need to adapt the specifier string to include a way of naming fields for i18n substitution (for example, repurposing the specifier string to name the field, such that i18n"The result of adding ${x} to ${y} is ${x+y:expr_result}" or _(t"The result of adding ${x} to ${y} is ${x+y:expr_result}") would map to the English translation catalog entry The result of adding $x to $y is $expr_result; a regular specifier string could still be allowed after a second :, since colons are permitted in specifier strings)
With those caveats though, you’d be able to completely hide the retrieval of the relevant string.Template object for the active language (cached based on template.source, but initially looked up in the catalog based on a normalised form that strips out the details of expressions, format specifiers, and conversion specifiers in favour of just the field names), and then call string.Template.safe_substitute on it with the interpolated values, without needing to do the dynamic name resolution on the interpolated fields in the calling namespace that translation functions currently have to do.
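A rough sketch of what that hidden machinery might look like (catalog, normalise, and the template attributes are all assumed names for illustration, not part of either PEP):

```python
import string

_template_cache = {}  # keyed on the raw source of the template literal

def translate(template):
    tmpl = _template_cache.get(template.source)
    if tmpl is None:
        # The normalised catalog key keeps only the field names, e.g.
        # "The result of adding $x to $y is $expr_result"
        key = normalise(template)
        tmpl = string.Template(catalog.get(key, key))  # dict.get(source, source)
        _template_cache[template.source] = tmpl
    # Substitute the already-evaluated field values; no dynamic name
    # resolution in the calling namespace is needed
    values = {field.name: field.value for field in template.fields}
    return tmpl.safe_substitute(values)
```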
Well, it’d be any expression followed by " at that point, which is quite a different syntactic definition than the string prefix we’re dealing with (if not in practice, then very much in intention). It changes the proposal from generalising an existing syntax into adding an entirely new form of syntactic sugar, and the rationale will need updating.
I don’t know of any precedent in another language where <expression> <string literal> is essentially a function call with special handling of the literal - it’s certainly not mainstream, though it probably exists in some text-processing-heavy language. I’m also not sure whether turning what’s probably a common class of syntax error into a runtime error is a good move; I suspect it wouldn’t be worth it (though if this became a fairly commonly used feature, maybe it would be).
For JavaScript: of course it’s not strictly a string literal, but it can be used in untagged cases, so it’s kind of a string literal:
```javascript
// Any expression can be thrown in there
(a=>a)`foo` // ["foo"]

// Dotted names work too
let bar = {baz: a=>a};
bar.baz`foo` // ["foo"]

// Can be caused by lack of +
"abc" + `xyz` // "abcxyz"
"abc" `xyz`   // At runtime: TypeError: "abc" is not a function
```
Template.source should be the original source string as provided to the tag in the Python source code, so it’s “raw”. This makes it the most straightforward memoization key, since it matches that source exactly.
Atomic expressions, that is, parenthesized expressions, are syntax errors in current Python code when they precede a quoted string.
So this would be something like
(lookup_tag(arg1, arg2))"{foo} along with {bar}"
I don’t know. It’s not the worst code one could write.
But we chose not to include them in the current PEP because, as mentioned earlier, they’re hard to implement; and unlike dotted names, they’re strained. Why not just set up your complex tag function earlier, on separate lines, then use it? Arguably it would read much better, as in the sketch below.
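That is, rather than a parenthesized tag expression, you’d write something like:

```python
# Hypothetical: bind the complex tag expression to a plain name first,
# then use that name as the tag
mytag = lookup_tag(arg1, arg2)
result = mytag"{foo} along with {bar}"
```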
In contrast, the argument for allowing something similar in decorators felt more grounded in how decorators are placed in code, and in keeping that connection. So it would be best to see real examples to make the equivalent case for tag strings.
A big difference with decorators is that you know you’re looking at a decorator right at the start of the line (as you read from left to right), whereas with a complex tag on a string you wouldn’t find out until the end. And as the expressions get more complex, it becomes much easier to miss the point where a normal expression turns into a string tag, and because it’s an expression itself, it may have started in the middle of a line. So the complexity of parsing it as a human reader is much, much higher.
By the time you’re doing something like (lookup_tag(arg1, arg2))"{foo} along with {bar}", you may as well do (lookup_tag(arg1, arg2))(t"{foo} along with {bar}") (assuming t"..." means to interpolate to an object containing all the values, but not all the way to a final string).
Exactly. I don’t see why atomic expression support would be needed. We need to optimize for actual usage needs. As part of that, tag strings should help support writing better Python code, such as the lexical scope vs dynamic scope considerations that started this work.
Yes, it’s restricted to f-string style placeholders. It’s certainly possible, as @ncoghlan discussed, to support a subset of possible strings that use PEP 292 placeholders. But that subset is not a reasonable one:
$var is not supportable, but this is the most common usage as I understand it
Braces would have to be doubled (perhaps less common in i18n strings, but it’s still an arbitrary requirement)
Standardizing on f-string style placeholders seems like a good thing, given that they are now so common in existing code, even where they’re discouraged, such as f-strings in logging calls.
Lastly, moving to tag strings means at least some syntactic change is necessary. As part of that, it should be safe to use a rewriting process for _('$who does $what'), given how translation strings are already identified by tooling.
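Such a rewrite could plausibly be mechanical. A sketch of the placeholder conversion itself (my own illustration, not part of any PEP), handling $name, ${name}, the $$ escape, and literal braces that would need doubling:

```python
import re

_PLACEHOLDER = re.compile(
    r"\$\$"                                        # $$ escape -> literal $
    r"|\$\{(?P<braced>[_a-zA-Z][_a-zA-Z0-9]*)\}"   # ${name}
    r"|\$(?P<named>[_a-zA-Z][_a-zA-Z0-9]*)"        # $name
    r"|(?P<brace>[{}])"                            # literal brace -> doubled
)

def pep292_to_braces(source):
    def repl(match):
        if match.group("brace"):
            return match.group("brace") * 2
        name = match.group("braced") or match.group("named")
        return "{%s}" % name if name else "$"
    return _PLACEHOLDER.sub(repl, source)

print(pep292_to_braces("$who does $what"))  # "{who} does {what}"
```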
I agree, although perhaps for a slightly different reason: complexity makes it harder to pinpoint errors. If you have something where arbitrary expressions can be followed by strings, a large set of bugs become trickier to track down (imagine omitting an operator, or closing a function call parenthesis too soon, or something). So I’m also in favour of restricting it to something simple.
That will be problematic. Think back to when we only had %s substitutions. Translators[1] could handle that fairly well, but it wasn’t sufficient to support placeholder rearrangement in the translated strings, so we supported %(placeholder)s syntax. But that was extremely fragile because translators routinely forgot the trailing s, and that ended up in hard-to-debug crash reports, because you’d only get a crash and traceback say, when viewing a web page in Italian (i.e. if the Italian translation had this bug for one source string).
This was the motivation for PEP 292 and string.Template. You can’t get much simpler than $placeholder, and translators a) usually had enough comfort and prior experience to get that right, and b) even if they messed up, e.g. writing $placeholde, the safe_substitute() method we used wouldn’t crash; it just wouldn’t print quite the right information. So this change solved a lot of practical problems for everyone.
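To illustrate the difference in failure mode with the actual stdlib API:

```python
from string import Template

t = Template("$placeholde said hello")  # translator typo: final "r" missing
print(t.safe_substitute(placeholder="Anne"))  # "$placeholde said hello" - wrong, but no crash
t.substitute(placeholder="Anne")              # raises KeyError: 'placeholde'
```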
My worry is that we’ll end up seeing ${placeholder strings (with the closing brace missing) in translations. Maybe that would only produce bad templates, but it would be unacceptable for such mistakes to crash.
For i18n use cases, I would explicitly disallow anything other than simple name references, at least in any library I maintained. Expressions just aren’t worth the likely cognitive load on translators. Better to let the authors of the code store the value of the expression in a local variable and use the name of that variable in the source string. It’s rare today but when it does come up, it’s a totally acceptable solution.
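Concretely, assuming a flufl.i18n-style _() that resolves $-placeholders by name from the calling namespace:

```python
# Instead of asking translators to cope with an expression-like placeholder,
# bind the computed value to a well-named local and reference that name:
total_price = subtotal + tax
print(_("Your total comes to $total_price"))
```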
[1] i.e. the humans translating from English to their natural language
I think that’s a non-starter for the i18n use case, as I mention in my reply to @ncoghlan. It’s not the acceptability to the source code author I’m concerned about, it’s the acceptability to human translators, who often are not programmers.
Maybe that’s okay and I should just give up on trying to get tag strings to support the i18n use case. string.Template and flufl.i18n aren’t going away, so it’s fine to just say i18n is out of scope. I do worry a bit about people trying to use tag strings for i18n, naively going down that path for a while, and then running into all the painful lessons we learned on the way to my current solution, but I’m not sure you can do much about that.
Completely understand. Because tag strings necessarily have compile-time support, much like f-strings, to get lexical scoping to work, we have to pick one syntax. I should also mention that there’s a lot to like about PEP 292 syntax, including that it’s much easier to target DSLs like LaTeX or C-like languages without brace doubling, so it’s unfortunate.
I very much appreciate this discussion, however, because changing the argument to the tag function to a Template is advantageous, including the support for Template.source.
I think template based i18n would need to break the direct relationship between source strings and catalog lookups (so the translators still see the translator-friendly format, even if the developers are working with a Python native template literal or tagged string).
Specifically, my proposal would be that we define a string.Template.from_compiled_template class method as a template renderer with the following behaviours when working out the translation string to look up:
trailing $ immediately before a substitution field is discarded
simple variable lookups are converted to their names preceded by $ if that would be unambiguous given the following text, otherwise to the form with braces
more complex expressions are either disallowed, or the specifier syntax is changed to put a name field before the actual specifier
This would return a 2-tuple of the string.Template object and a dictionary mapping placeholder names to their runtime values.
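A rough sketch of how that renderer might behave (the compiled template API here is speculative: I’m assuming parts that alternate static strings with interpolation fields exposing .expression, .specifier, and .value):

```python
import string

def from_compiled_template(template):
    chunks = []
    values = {}
    for part in template.parts:  # assumed API
        if isinstance(part, str):
            chunks.append(part)
            continue
        if part.expression.isidentifier():
            name = part.expression               # simple variable: reuse its name
        elif part.specifier:
            name = part.specifier.split(":")[0]  # name carried in the specifier
        else:
            raise ValueError(f"unnamed complex field: {part.expression!r}")
        if chunks and chunks[-1].endswith("$"):
            chunks[-1] = chunks[-1][:-1]         # discard the trailing $ before the field
        chunks.append("${" + name + "}")         # braced form is always unambiguous
        values[name] = part.value
    return string.Template("".join(chunks)), values
```

This sketch uses the braced form unconditionally; choosing between $name and ${name} based on the following text, as proposed above, would be a straightforward refinement.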
I would further propose that string.Template implement the typing.Template protocol in a way that emits the normalised form of the translation catalog string by reporting the placeholder strings for each interpolation field.
That way, only the tools handling template string extraction to build the translation catalog would need to deal with the native template format, not translators themselves.
I’m inclined to think the i18n case is a mild misuse of this feature, but there’s no reason why it wouldn’t still be helpful without interpolations: translated"My $sub with multiple $subs" is totally valid - it will just be passed to the tag as a single string with no compiler substitutions.
Whether it’s better than _("My $sub") or better/worse if _"My $sub" works/doesn’t work is a matter of taste. What this PEP offers is the opportunity to lose the parentheses, even if you don’t take advantage of the interpolations.
(Edit: And potentially to gain variable capture, of course. Which would be nice here, but I agree that we wouldn’t want to be supporting multiple syntaxes. The ${x} syntax isn’t actually a new one, btw, it still works with the current proposal, provided the tag knows to omit the literal $ when it formats.)
Yep, it could be done. Losing the parentheses doesn’t seem like much of an advantage to me, though, so I don’t see it as a motivation in favor of the PEP [1].
[1] I’m not saying there are or are not other reasons to favor PEP 750
Yeah, the only real i18n pay-off is not having to dig around the runtime stack resolving variables. Given that the existing i18n libraries already have that covered, I’m leaning towards describing how i18n could be supported in PEP 501, but also noting that it would only be worthwhile if the runtime performance benefits proved significant (and I doubt they will given template caching).
I’ve never considered i18n a performance-critical feature, and in all my years of doing it, I can’t remember performance ever coming up. That’s not to say you couldn’t hurt performance by putting some i18n code in the wrong place, just that typical i18n use cases don’t lead you to do so in practice.