Language specifications for strings

boxed · December 11, 2022, 1:01pm

In PyCharm you can set the programming/markup language of strings like this:

# language=html
foo = '<div>hello there!</div>'

I find this extremely useful and use it all over the place. A pattern I noticed is that a large proportion of such uses are in fact more like:

# language=html
foo = format_html("<div>{}</div>", bar)

In this case the format_html function doesn’t take any random string as the first argument, it takes a string that is supposed to be html. Reading through my usage of # language= I see that I have CSS, HTML, and JavaScript.

There are many places in my codebases that also have very different types of “languages” that unfortunately aren’t supported as language injections in PyCharm. Some examples:

fully qualified name (module.module.symbol, for example view functions in Django)
module names (module.module, for example app names in Django)
file paths
host names (www.example.com)
urls (https://example.com)
time zone (UTC)
language code (en)
regexes
date formats
django app names

There are probably more, but I think this gets the point across.

PyCharm has some (presumably hardcoded) rules about some of these strings. For example, in settings.py it knows that strings in the list INSTALLED_APPS are module names, so you can jump to the definitions of those modules, and PyCharm will resolve and check them for you. But this is a closed system: variables I introduce myself can’t be validated in this way.

I think it would be good if the typing module could have a facility for this type of thing. When this gets some traction we could see support for it in Python language servers, PyCharm, static analysis tools, etc.

What do you guys think?

mdrissi · December 11, 2022, 1:09pm

This feels like a potentially good use case for Annotated. We could have

HTMLString = Annotated[str, "html"]
SQLString = Annotated[str, "sql"]
etc

I wouldn’t expect a type checker like mypy/pyre to have special logic for sql string but the metadata would be available in the type and a separate tool/ide could use it.
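As a rough illustration of how a separate tool might read that metadata, here is a minimal sketch. The helper name `language_of` is hypothetical, and treating the first plain string in the metadata as the language name is an assumption made for this example, not an established convention.

```python
from typing import Annotated, get_args, get_origin

HTMLString = Annotated[str, "html"]
SQLString = Annotated[str, "sql"]


def language_of(alias):
    """Return the first plain string in the Annotated metadata, or None.

    Hypothetical helper: reading a bare string as "the language" is only
    an assumption for this sketch.
    """
    if get_origin(alias) is not Annotated:
        return None
    # get_args() yields the underlying type first, then the metadata items.
    _base, *metadata = get_args(alias)
    for item in metadata:
        if isinstance(item, str):
            return item
    return None
```

A tool could call `language_of(HTMLString)` to learn that the string should be treated as HTML, while a plain `str` annotation yields `None`.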

boxed · December 11, 2022, 2:05pm

I like it. I think we’d want some intermediate thing though like…

class EmbeddedLanguage(Annotated):
    pass

HTMLString = EmbeddedLanguage[str, 'html']

So that it’s smoothly extensible, and doesn’t trample on other uses of Annotated.
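As far as I know, typing.Annotated cannot be subclassed at runtime, so one way to get the same separation is to put a dedicated marker object inside the metadata instead. The following is only a sketch under that assumption; the `EmbeddedLanguage` dataclass and the `embedded_language` helper are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin


@dataclass(frozen=True)
class EmbeddedLanguage:
    """Hypothetical marker: names the language a string is written in."""
    name: str


HTMLString = Annotated[str, EmbeddedLanguage("html")]
# Other tools' metadata can sit next to the marker without clashing.
Mixed = Annotated[str, "some other tool's note", EmbeddedLanguage("sql")]


def embedded_language(tp):
    """Return the language name from an EmbeddedLanguage marker, or None."""
    if get_origin(tp) is not Annotated:
        return None
    # Skip the underlying type; inspect only the metadata items.
    for item in get_args(tp)[1:]:
        if isinstance(item, EmbeddedLanguage):
            return item.name
    return None
```

Because the lookup only matches `EmbeddedLanguage` instances, a bare string such as `Annotated[str, "html"]` used by some unrelated tool is ignored.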

layday · December 11, 2022, 6:48pm

See also “Allow for arbitrary string prefix of strings” and the jimbaker/tagstr repository on GitHub, which contains an issue tracker, examples, and early work related to PEP 999: Tag Strings.

boxed · December 11, 2022, 7:16pm

Arbitrary string prefixes are a totally different thing.

layday · December 11, 2022, 8:33pm

Apologies for trying to be helpful.

layday · December 11, 2022, 8:35pm

And no, they are not a “totally different thing”; one can imagine that typing metadata could be attached to prefixes to the same effect.

Rosuav · December 11, 2022, 10:27pm

Aside from b, string prefixes don’t attach metadata or any form of typing information. They are different forms of string literal which result in the exact same string. This is extremely different (I won’t argue whether it is “totally different” or not) from something which retains information about the string. It should be quite orthogonal; you should be able to make a triple-quoted SQL string, a raw-literal SQL string, etc.

layday · December 11, 2022, 11:25pm

You are describing how existing built-in string prefixes work, not how “arbitrary string prefixes” or tagstr (might) work.

guido · December 11, 2022, 11:29pm

FWIW Jim Baker’s “tagged strings” proposal (for which I did a proof-of-concept prototype implementation) does allow one to attach metadata to the string at runtime. Essentially, it would make sql"select * from table" equivalent to sql("select * from table"), where the sql() function can return whatever it wants. (To be clear, there’s also f-string behavior baked into that proposal, so the signature of sql() is actually more complicated than that.)

OTOH the OP’s reference was a magic comment in PyCharm which looks to me like it’s only visible at static type checking time.

We should probably be a bit more clear about whether the proposal here is static, runtime, or both, otherwise we could have quite the shouting match.
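To make the runtime side of that distinction concrete, here is a small sketch that emulates the idea with an ordinary function call, since the sql"..." syntax is not valid Python today. `TaggedStr` and this `sql()` function are hypothetical and much simpler than the tagged strings proposal, which also involves f-string behavior.

```python
class TaggedStr(str):
    """Hypothetical str subclass that remembers its language at runtime."""

    def __new__(cls, value, language):
        obj = super().__new__(cls, value)
        # The metadata lives on the object, so it is visible while the
        # program runs, unlike a comment or a static annotation.
        obj.language = language
        return obj


def sql(text):
    """Stand-in for a tag function; the real proposal's signature differs."""
    return TaggedStr(text, "sql")


query = sql("select * from table")
```

Here `query` still behaves like a normal string, but code that receives it can check `query.language` at runtime. A purely static annotation leaves no such trace on the value.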

boxed · December 12, 2022, 5:45am

Static.

This is for improving tooling. More specifically, to make it possible for open source tooling to catch up to PyCharm, and then leapfrog PyCharm quite a bit (eventually forcing PyCharm to adopt this, with everyone winning).

PMLP-novo · December 12, 2022, 12:28pm

If tagged strings got implemented, something like the following code would be allowed:

from string_annotations import sql
query = sql"select * from table"

This would be a pattern that PyCharm could easily spot.

boxed · December 12, 2022, 12:50pm

Yes. And that might be a good thing, but what I’m suggesting is quite different. It is that the called function could say what the argument is. So:

cursor.execute('select * from foo')

would be interpreted as SQL, not because you change the string to a tagged string, but due to execute having an annotation on its first argument saying that it’s SQL.
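A minimal sketch of what that could look like from a tool's point of view follows. The `Cursor` class here is a stand-in rather than a real database driver, `parameter_languages` is a hypothetical helper, and the bare-string metadata convention is an assumption.

```python
from typing import Annotated, get_args, get_origin, get_type_hints

SQLString = Annotated[str, "sql"]


class Cursor:
    """Stand-in for a database cursor; only the annotation matters here."""

    def execute(self, query: SQLString, params: tuple = ()) -> None:
        ...


def parameter_languages(func):
    """Map parameter names to the language named in their annotation."""
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    result = {}
    for name, hint in hints.items():
        if name == "return" or get_origin(hint) is not Annotated:
            continue
        for item in get_args(hint)[1:]:
            if isinstance(item, str):
                result[name] = item
                break
    return result
```

With this, `parameter_languages(Cursor.execute)` reports that `query` is SQL, so a tool could treat the literal in `cursor.execute('select * from foo')` accordingly without any change at the call site.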

boxed · December 12, 2022, 12:53pm

The effect would be quite different. String prefixes would require the programmer to change all the call sites. What I’m suggesting would just require you to change the annotation for a specific argument, and all the call sites in all code bases would then be “upgraded” to have richer information for the tooling to use.

PMLP-novo · December 12, 2022, 1:22pm

I can’t figure out what you’re actually suggesting. The last problem regarding the “con.execute” looks like something that could be solved by PEP 593 – Flexible function and variable annotations, or something like:

from string_annotations import StringType, sql

def execute(query: StringType[sql], *args, **kwargs):
    ...

boxed · December 12, 2022, 1:44pm

That is what I’m suggesting yes (as written by Mehdi Drissi above in this thread too).

The bigger point I think is that for this to be something that catches on, it probably needs to be implemented in the standard library. For example in re, where match could be:

def match(pattern: EmbeddedLanguage[regex], string, flags=0):
    ...

or

def match(pattern: StringType[regex], string, flags=0):
    ...

… whatever would come out of yak shaving at the end

I know I would REALLY like it if every time I wrote a call to re.match() my IDE would syntax highlight the pattern! Imagine all the times I would immediately catch my mistake of thinking it’s string, pattern instead of pattern, string. I don’t know about you, but I make that mistake all the time.
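To give a feel for the kind of static check this could enable, here is a small sketch built on the standard ast module. The `REGEX_PARAMS` table is a hypothetical stand-in for information that would really come from annotations, and `check_regex_literals` is an illustrative name, not an existing tool.

```python
import ast
import re

# Hypothetical stand-in for annotation data: dotted call name -> index of
# the positional argument that is expected to be a regex.
REGEX_PARAMS = {"re.match": 0, "re.search": 0, "re.fullmatch": 0}


def check_regex_literals(source):
    """Return (line, call, message) for string literals that fail to compile."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Only handle simple `module.function(...)` calls in this sketch.
        if not (isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name)):
            continue
        dotted = f"{func.value.id}.{func.attr}"
        index = REGEX_PARAMS.get(dotted)
        if index is None or index >= len(node.args):
            continue
        arg = node.args[index]
        if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
            try:
                re.compile(arg.value)
            except re.error as exc:
                problems.append((node.lineno, dotted, str(exc)))
    return problems
```

This only validates literal patterns, so it catches swapped arguments only when the string in the pattern position happens to be an invalid regex. Highlighting the pattern in an editor would make the mistake visible in more cases.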

Rosuav · December 12, 2022, 1:58pm

Am sympathetic to that one. I make the same mistake too.

pf_moore · December 12, 2022, 4:12pm

Surely typeshed is the place for things like this right now? So there’s no direct impact on CPython in the immediate term, all that’s needed is an agreement on the form such annotations should take.

boxed · December 12, 2022, 4:34pm

True. It’s mostly a culture/PR problem, not a technical one.

boxed · March 15, 2023, 10:15am

Submitted an issue for typeshed here: Language specifications for strings · Issue #9888 · python/typeshed · GitHub