Allow for arbitrary string prefix of strings

This is an idea to allow any string to have a prefix in the form of “sql_”, “py_”, “cpp_”, “js_”, “jinja_”, “sh_” or “test_”, that is, a string of a-z characters ending with “_”. This would improve cross-language scripting, since it would be easy for IDEs to add syntax highlighting and IntelliSense. Furthermore, it could be combined with existing prefixes, e.g. “sql_f”, and would not heavily limit future features, since future built-in prefixes could be added without the underscore.

I expect this to be used heavily when doing cross-language scripting. But I also expect it to affect different Python dialects and libraries, which could use these string types to add metadata in a way that is still compatible with CPython. For example, I could imagine the “test_” prefix being used.

In short, I can’t imagine exactly how users will use it. But even from a quick brainstorm it is clear to me that this feature has many applications.

What do you think?

Examples:

exec(py_"print(a)")

db.session.execute(sql_"SELECT * FROM student;")

def add(a, b):
    test_"assert add(1,2)==3"

subprocess.run(sh_f"cwm --rdf test.rdf --ntriples > {file_name}.nt")


I think you’ll need a much better justification, with actual use cases and specific benefits. The cost of adding this to Python (in terms of both coding/maintenance and the work needed to revise books, documentation, training materials, etc) is significant, and you need to demonstrate that it’s worth the cost.

Do you know of any IDEs that have expressed an interest in this feature?

What about numerics? Someone’s bound to want something like “i18n_” as a prefix… Or non-ascii letters? Without clear use cases, how can you be sure people won’t want to use native language terms?

Do you have any evidence of this? Typically, I would advise people to put non-Python code into separate files and read those files at runtime. If nothing else, embedding code into Python strings requires you to be careful about escaping quotes, and your proposal does nothing to make that any easier. Nor does it address indentation issues - you still have to handle leading whitespace if you align multi-line embedded code with the surrounding Python code.
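To make the indentation point concrete, here is a minimal sketch using only the standard library's textwrap.dedent. The query and the function name load_students_query are made up for illustration; the point is only that the whitespace handling stays the author's job, whatever prefix the string carries.

```python
import textwrap


def load_students_query() -> str:
    # The embedded SQL is indented to line up with the surrounding Python,
    # so every line carries leading spaces that are not part of the query.
    query = """
        SELECT name
        FROM student
        WHERE year = 2
    """
    # dedent() removes the common leading whitespace; strip() drops the
    # blank first and last lines introduced by the triple quotes.
    return textwrap.dedent(query).strip()


print(load_students_query())
```

Keeping the query in a separate .sql file and reading it at runtime avoids this step entirely.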

Overall, I’m sorry, but I don’t think this idea fits well with Python. Although you might want to look at PEP 501, which looks at embedding other language syntaxes into Python from another angle.

This is an example which is explicitly addressed by PEP 501. Your code could very easily be subject to exploits if file_name isn’t carefully sanitised.
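As a rough illustration of the risk, the sketch below builds the same command with the standard library's shlex.quote applied to the interpolated value. The function name build_command and the hostile input are assumptions made for the example; the command string is only built, never executed.

```python
import shlex


def build_command(file_name: str) -> str:
    # shlex.quote() wraps the value so a POSIX shell treats it as one literal word.
    return f"cwm --rdf test.rdf --ntriples > {shlex.quote(file_name + '.nt')}"


# A hostile value that would start a second command if interpolated unquoted.
print(build_command("out; rm -rf ~"))
# A harmless value passes through unchanged.
print(build_command("result"))
```

Nothing in a plain prefix such as sh_f would apply this quoting automatically, which is the gap the reply is pointing at.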

(You’ll also note that the Discourse highlighter for Python would need changing to handle your new prefixes - that’s a good example of the hidden costs of a change like this).


@jimbaker has had some thoughts about similar things.

And I think he had talked with @guido about the idea.

We are working on a set of PEPs (spec, tutorial) for tagstrings, which support custom prefixes, in the jimbaker/tagstr repository on GitHub (this repo contains an issue tracker, examples, and early work related to PEP 999: Tag Strings). There’s a corresponding branch of CPython that @guido wrote that implements most of the functionality discussed (see “I am working in a branch”, Issue #1 in jimbaker/tagstr), which can readily be updated. I should get back into gear on this work and just get it done :smile:.

First, tagstrings are a Pythonic version of JavaScript’s tagged template literals. So this is the way to think about it: tags = custom prefixes. In a nutshell, one can write tagstrings in a target DSL (html, shell, sql, etc.) and the interpolations are properly managed for that target (e.g. quoted). In particular, interpolations can be nested recursively, and the behavior is the same as with f-strings in that lexical scope is respected. (This also opens up being able to work with, for example, numexpr (GitHub - pydata/numexpr: Fast numerical array expression evaluator for Python, NumPy, PyTables, pandas, bcolz and more), which currently uses dynamic scope to resolve expressions.)

Examples:

# NOTE: tags are passed raw strings, so the grep works here
run(sh"find {path} -print | grep '\.py$'", shell=True)

or

table_name = 'lang'
name = 'C'
date = 1972

with sqlite3.connect(':memory:') as conn:
    cur = conn.cursor()
    cur.execute(*sql'create table {Identifier(table_name)} (name, first_appeared)')

Some additional things to know are that interpolations are “lambda wrapped” with a no-arg function; and the interface provided seems to be nicely Pythonic - it’s easy IMHO to write your own tag functions to work with your target DSL. Example for implementing the fl tag, a lazy version of f-strings, assuming some imports and definitions like Thunk (see tagstr/fl.py at main · jimbaker/tagstr · GitHub):

def just_like_f_string(*args: str | Thunk) -> str:
    return ''.join((format_value(arg) for arg in decode_raw(*args)))


@dataclass
class LazyFString:
    args: Sequence[str | Thunk]

    def __str__(self) -> str:
        return self.value

    @cached_property
    def value(self) -> str:
        return just_like_f_string(*self.args)


def fl(*args: str | Thunk) -> LazyFString:
    return LazyFString(args)

Ideally, you wouldn’t use a string at all for this, but rather some kind of structured object. See, for example, python-sql · PyPI. Strings are flimsy and hard to parse.

I expect this to be used heavily if doing cross language scripting.

Even in this case, it’s better to avoid strings and focus on building ASTs of the other language if you can. Take a look at how JAX creates XLA code without ever requiring strings from the user. Instead, it just uses a decorator applied to a Python function to extract an execution graph using intelligent objects.

In general, I think strings are an antiquated idea for the examples in this post, and there’s usually a better solution.

Is there a pre-PEP for this idea already?

Since SQL is being used a lot as an example, I would want to make sure
that we don’t open Python even more to SQL injections by having people
use string formatting to interface to database modules.

Instead, such a tag string object would have to provide a database
compatible rendering of the SQL string using binding parameters and
a list or dict with the parameters.

That way, data and SQL instructions are uncoupled, which effectively
prevents SQL injection. It’s also faster, since database drivers will
often send the data separately and efficiently encoded to the database.
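For readers less familiar with binding parameters, here is a small sketch using the standard library's sqlite3 module. The table, the row, and the hostile input are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table student (name text)")
cur.execute("insert into student values (?)", ("Alice",))

# A hostile value that changes the query if pasted into the SQL text.
hostile = "x' OR '1'='1"

# Unsafe: string formatting makes the data part of the SQL instructions.
unsafe_rows = cur.execute(
    f"select name from student where name = '{hostile}'"
).fetchall()

# Safe: the SQL text holds only a placeholder; the value travels separately.
safe_rows = cur.execute(
    "select name from student where name = ?", (hostile,)
).fetchall()

print(unsafe_rows)  # every row matches because of the injected OR clause
print(safe_rows)    # no student has that literal name
conn.close()
```

A tag string object that renders to the second form would give a pair of SQL text and parameters rather than one formatted string.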

Another detail I wonder about is how those tags will be registered
and how we’ll manage the tag namespace (e.g. Python core tags vs.
user defined tags).


Thanks for pointing me to the work on tagstrings. This is very close to the spirit of what I thought about and is probably a better solution than what I could come up with. Other than SQL, I think HTML and shell scripting are also very common.

Regarding opening Python up further to SQL injections: I actually think it would make people more aware of them. Having such tags would let the IDE give users much better warnings, because it could use the tags to better understand what is happening.

Regarding whether IDEs are interested in it: wouldn’t you think that somebody would make a VS Code extension that looked for patterns like (sql") and applied an SQL syntax highlighter to the string, if it were possible?

Regarding core tags vs. user-defined tags: my original thought was that user-defined tags would end with “_”, which would also allow for tag chaining, but that is probably a bad idea, and I’m not sure it is a problem anyway. I could imagine an IDE highlighting a name collision between a core tag and a user-defined tag.

The best I know of is tagstr/tutorial.rst at main · jimbaker/tagstr · GitHub.

Jim and I are very aware of SQL injections and I believe https://github.com/jimbaker/tagstr/blob/main/examples/sql.py shows an example sql tag implementation specifically aimed at avoiding such attacks, while still providing value for the developer.

Yes, I have a draft PR version, Initial specification PEP by jimbaker · Pull Request #17 · jimbaker/tagstr · GitHub. Besides needing some more filling in, it is way too long, being bogged down by discussion of quoting and injection attacks. Perhaps the motivation section could simply be replaced by the link to the Bobby Tables injection attack in xkcd: Exploits of a Mom ??? :wink:

@guido linked the sql tag example, but I will add that this example does exactly what you suggest, including providing support for DB-API 2 and SQLAlchemy with binding parameters in the SQL object it constructs. The example is of course not fully developed, but I think it illustrates some interesting points in how tag functions would be implemented.

So a tag function simply is a callable that has this type spec:

    class Tag(Protocol):
        def __call__(self, *args: str | Thunk) -> Any:
            ...

where a Thunk has this definition (I’m leaning to a named tuple approach, especially since this is typical for other Python internals):

    class Thunk(NamedTuple):
        getvalue: Callable[[], Any]
        raw: str
        conv: str | None
        formatspec: str | None

OK, with those formalities in place, all a tag function does is the following:

  1. Parse the template strings. This could be super minimal as with the fl tag I showed earlier; or actually more involved with HTML. It’s your callable with a simple API :slight_smile: Note that such parses should be very memoizable and could also do fun stuff like codegen.

  2. Do something with the interpolations, such as evaluating them, wrapping them in bind params, applying formatting, etc.

  3. Return some object. Note that best practice is that this object doesn’t have side effects; instead, it is a filled-in template like PEP 501’s InterpolationTemplate (PEP 501, General purpose string interpolation, peps.python.org). See https://github.com/jimbaker/tagstr/blob/main/examples/interpolation_template.py, which shows how to implement PEP 501’s i prefix/tag with this approach.

With the sql tag example, we see how this works, especially in its analyze_sql function in https://github.com/jimbaker/tagstr/blob/main/examples/sql.py#L66, which matches on SQL text, Identifier, nested SQL objects, and expressions which will then be set up as binding parameters. The returned SQL object can then be executed with respect to a specific database library.

It’s not a great example, but the fact that this recursive common table expression composed of nested SQL fragments and interpolations works interchangeably with DB-API 2 or SQLAlchemy is pretty nice (I should refactor the code to make it more obvious that only the execution API changes, not the SQL and any interpolations):

        num = 50
        num_results = 9  # actually using num_results + 1, or 10
        
        # NOTE: separating out these queries like this probably doesn't
        # make it easier to read, but at least we can show the subquery
        # aspects work as expected, including placeholder usage.
        base_case = sql'select 1, 0, 1'
        inductive_case = sql"""
            select n + 1, next_fib_n, fib_n + next_fib_n
                from fibonacci where n < {num}
            """

        results = cur.execute(*sql"""
            with recursive fibonacci (n, fib_n, next_fib_n) AS
                (
                    {base_case}
                    union all
                    {inductive_case}
                )
                select n, fib_n from fibonacci
                order by n
                limit {num_results + 1}
            """)

So hopefully I addressed that!

So standard Python name semantics are used for binding a tag name to its function - they can be imported, defined, patched, manipulated by looking up the namespace with globals(), etc as usual. So no specific registration required.


To be very clear, a properly written sql tag, or for that matter a tag for some other DSL, should not suffer from injection attacks in its usage. This is because the tag function can clearly delineate between text from the template, which can be trusted; and the interpolations, which have to be quoted, bound, or otherwise worked with.

It seems possible for a plugin for an IDE to support this for some arbitrary tag and its associated DSL. IDEs track explicit definitions/imports to determine the source of a name; and in principle, they presumably could call a registered plugin on specific usage in a tagstring. It would be very cool if they could do this, especially because language support is already available for so many DSLs that could be used - HTML, SQL, shell, etc - they would just need to take into account any interpolations that are used; and work around that.

It’s a good question. Here’s one possible answer with respect to name collisions. I will often use f as a generic function name, or for a throwaway function when using the REPL. But I have never confused that usage with using an f-string. That’s one anecdote, but my feeling here is that the syntactic position really matters.

In any event, if name collisions are a problem, that’s something an IDE could certainly determine for a user.

I’d like to add another use-case for this.

I’m the maintainer of a small Django library called django-components. I’ve run into a problem for which I have a language-level solution (tagged strings) that I think would benefit the wider Python community.

Problem
A component in my library is a combination of Python code, HTML, CSS and JavaScript. Currently I glue things together with a Python file, where you put the paths to the HTML, CSS and JavaScript. When run, it brings all of the files together into a component. But for small components, having to juggle four different files around is cumbersome, so I’ve started to look for a way to put everything related to the component in the same file. This makes a component much easier to work on and understand, with fewer places to make path errors.

Example:

class Calendar(component.Component):
    template_string = '<span class="calendar"></span>'
    css_string = '.calendar { background: pink }'
    js_string = 'document.getElementsByClassName("calendar)[0].onclick = function() { alert("click!") }'

Seems simple enough, right? The problem is: There’s no syntax highlighting in my code editor for the three other languages. This makes for a horrible developer experience, where you constantly have to hunt for characters inside of strings. You saw the missing quote in js_string right? :slight_smile:

If I instead use separate files, I get syntax highlighting and auto-completion for each file, because editors set language based on file type. But should I really have to choose?

First steps

I think a great first step would be to allow these kinds of tags in the language without doing anything with them yet. Then editors could implement highlighting and a lot of the wins could be realized quickly (?).

Why not use annotated strings for that?

js_string: Annotated[str, 'javascript'] = 'document.get....'

You can create a type alias if that’s too long.
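A sketch of what such aliases could look like, using only the standard typing module. The alias names, the simplified Calendar class, and the embedded_languages helper are assumptions for illustration; an editor would more likely read the annotation statically than at runtime.

```python
from typing import Annotated, get_type_hints

# Type aliases, so the annotation stays short at each use site.
JavaScript = Annotated[str, "javascript"]
CSS = Annotated[str, "css"]


class Calendar:
    css_string: CSS = ".calendar { background: pink }"
    js_string: JavaScript = "console.log('click!')"


def embedded_languages(cls: type) -> dict[str, str]:
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    hints = get_type_hints(cls, include_extras=True)
    return {
        name: hint.__metadata__[0]
        for name, hint in hints.items()
        if hasattr(hint, "__metadata__")
    }


print(embedded_languages(Calendar))
```

This works in current Python without any syntax change, at the cost of tagging the variable rather than the string literal itself.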

This is a good idea I think. Very verbose, compared to the original proposal in this thread, but sure. My main goal would be to get the support of code editors, and this seems like a nice way imo.

Even shorter form:

from multilanguage import sql, javascript

js_string: javascript = 'document.get....'
sql_string: sql = "drop table browser_history"