PEP 787: Safer subprocess usage using t-strings

rrolls · May 6, 2025, 8:50pm

I commented earlier about a relatively specific issue with this proposal, but in truth I had a bad gut feeling about the proposal in general but couldn’t put my finger on why.

However, while I was replying to a different t-string-related thread, I figured it out.

I originally posted in the PEP 750 thread:

The syntax should be foo(t'bar'), not foo'bar'.

and this led onto a discussion about using “the function that a t-string is passed to” as a language tag, which could be used by editors with appropriate support to determine the language that the contents of the t-string is supposed to be parsed as; here, foo would be the language and bar the source in that language.

Therefore

subprocess.run(t"some_command {message_from_user}")

would actually be far better written as something like

subprocess.run(shlex.Command(t"some_command {message_from_user}"))

i.e.

subprocess.run would accept either a shlex.Command object, or a str or sequence-of-str for backward compatibility, but would not accept a t-string itself
- (bikeshed the spelling of shlex.Command if you like, I’m not precious about that)
the constructor of shlex.Command would accept a t-string
the constructor of shlex.Command would be typed/annotated to declare that its argument will be parsed like a POSIX shell command line
editors with appropriate support would then be (theoretically) able to syntax-highlight the contents of the t-string in the appropriate way.

Now I do understand that the whole idea of using language-tagged t-strings in this way to perform syntax highlighting in editors was set aside to be considered separately in future (and if it has been further discussed then I’ve totally missed this) but I do think we should be careful about what precedent we set for exactly what should take a t-string.

I would much rather see the following pattern come about

cmd = shlex.Command(t"some_command {message_from_user}")
html = some_html_lib.HTML(t'<a href="{link_target}">{link_text}</a>')
sql = some_sql_lib.SQL(t'select foo from bar where id={bar_id}')

p = subprocess.run(cmd)
foo, = db_connection.execute(sql).fetch_row()
return render(html)

than this:

cmd = t"some_command {message_from_user}"
html = t'<a href="{link_target}">{link_text}</a>'
sql = t'select foo from bar where id={bar_id}'

p = subprocess.run(cmd)
foo, = db_connection.execute(sql).fetch_row()
return render(html)

I also do realise that just as one could annotate the constructor of shlex.Command so that editors could discover the language of the content of the t-string, one could also annotate subprocess.run in the same way, so the case where the t-string was being immediately passed to subprocess.run would work just as well.

But the pattern I’m suggesting here is for the t-string to first be passed to some constructor (or regular function) that would merely parse the t-string without any side-effects (other than raising an appropriate exception if it’s invalid), such that the result could be stored in a variable and then used later (possibly multiple times, possibly not at all depending on the result of a conditional, possibly only after running some other code that needs to know that the t-string was valid, etc.).

I feel there’s enough benefit to this pattern that it could be worthwhile to require it, which would mean not allowing “functions that do things”, like subprocess.run, to accept t-strings directly. If the first significant use of t-strings in Python is to write subprocess.run(t'prog arg1 arg2') then I fear the idea of “t-strings get passed immediately to something which validates them, which also acts as a language tag for static analysers” might vanish into thin air without us realising it.

encukou · May 7, 2025, 7:14am

Yes! Thank you for figuring it out!
Note that in all the cases, it would be useful to be able to switch to a different syntax – mslex.Command, some_html_lib.XHTML or SVG, some_sql_lib.SpecificSQLDialect…

In cases of db_connection.execute or a render_http_response, it might be useful to have a “default syntax” (i.e. take t-strings directly), and we can’t really prevent library authors from allowing that shortcut. But subprocess should be more careful, at least initially.

steve.dower · May 7, 2025, 8:42am

Rob Rolls:

Therefore
subprocess.run(t"some_command {message_from_user}")
would actually be far better written as something like
subprocess.run(shlex.Command(t"some_command {message_from_user}"))

This is just a static typing view being overlaid onto Python’s duck-typing.

I’m inclined to agree that it’s more helpful for the reader than most static typing has shown itself to be, but ultimately it’s not how Python works: the callee gets to decide whether it can use the value provided, rather than the caller trying to get it into the right name/type.

Making users type more is totally fine. Users who want to do that can and will. But I’d rather see it done like the example below, because it still allows the more concise version that doesn’t require further clarification (that is, omit the type because the callee makes it obvious):

cmd : shlex.Command = t"some_command {message_from_user}"

(Or one of the annotation styles that were discussed in the earlier threads.)

rrolls · May 7, 2025, 4:46pm

Not quite.

Yes, from a typing/static-analysis/syntax-highlighting POV, you’re absolutely right. If shlex.Command was just a no-op for typing, I’d agree with you 100%.

But my feeling is that shlex.Command should not be a no-op, it should be a real function or class. Which means that at runtime, there’d be a big difference between

# your suggestion
cmd: shlex.Command = t'prog arg1 arg2'

and

# my suggestion
cmd = shlex.Command(t'prog arg1 arg2')

which is that:

the latter performs validation and will raise an exception at that point if the t-string contents are invalid; and
you’re able to use isinstance(cmd, shlex.Command) (as if to determine the “language tag” of the t-string), call functions on it that are specific to that class, etc.

steve.dower · May 7, 2025, 5:13pm

Yes, I got that. But the difference you’re actually suggesting that I don’t like is:

>>> subprocess.run(t'prog arg1 arg2')
TypeError: cannot use a t-string unless you pass it through shlex.Command

If it knows exactly what to do with the value, it should just do it. And at that point, actually calling shlex.Command explicitly is effectively a no-op.

BrenBarn · May 7, 2025, 6:22pm

Steve Dower:

But the difference you’re actually suggesting that I don’t like is:
>>> subprocess.run(t'prog arg1 arg2')
TypeError: cannot use a t-string unless you pass it through shlex.Command
If it knows exactly what to do with the value, it should just do it.

I dunno, I don’t see that as super different from various APIs that require an open file object instead of a filename, etc. The advantage is that you get a consistent API (“you must pass this kind of object”) instead of a hybrid one that has to document the details of how it handles various different types of inputs that ultimately are getting converted to the same thing.

steve.dower · May 7, 2025, 6:34pm

Most of those APIs do accept an int (file descriptor), though, which is the equivalent of an open file object (it represents an active OS-level resource, rather than a resolvable path to locate one). The operation of opening a file is also something that needs to be managed, as it may introduce conflicts with other uses.

Neither of those aspects apply to wrapping up a t-string with a no-op.

encukou · May 8, 2025, 7:48am

If we’re talking about how Python works, I’d compare the situation to this:

>>> "1" + 2  # `”12”` or 3?
TypeError: can only concatenate str (not "int") to str

>>> subprocess.run(t"echo {message_from_user}")  # Use shlex.split? Other split? Quote for a default shell?

In the HTML & SQL examples, there’s a good “default”, and it should be used implicitly. But IMO, instead of “the function takes Template” the docs should say “Templates are converted using some_html_lib.HTML()”.

Good defaults are good, but, being implicit only pays if there’s exactly one obvious way to do it.

ncoghlan · May 8, 2025, 10:36am

From a type hinting point of view, it’s reasonable to assume that even though t-strings aren’t defined as a generic type yet, that transition (where template variable definitions can be parameterised to indicate the command dialect they use) will eventually happen.

There was some discussion of PEP 750 explicitly mentioning that idea when it shifted from proposing tagged strings to a specific new string literal prefix, but it was eventually omitted as not-yet-in-scope speculation about a possible future enhancement (rather than being part of the core feature proposal).

Parameterising with the expected template parsing output class is certainly one way that could go (the other main idea that came up was to parameterise with simple strings like “html”, “sql”, “sh”, etc).

effigies · May 8, 2025, 12:19pm

I think we’re mixing conversations about implicit/explicit conversion and the unsafeness of str | Template.

A function taking HTML | Template or shlex.Command | Template is entirely fine, and it has nothing to do with how many ways there are to do it. It will work because passing an f-string should fail loudly and quickly, either during static type checking or at runtime.

A function taking str | Template is broken if it uses the template for input sanitization because t-strings are confusable with f-strings. subprocess.run() should not accept Template because it accepts str and is a notorious security bug magnet. I don’t care if it accepts shlex.Command, but if it leads to confusion, it would be entirely reasonable to make a run() method on the shlex.Command class.

steve.dower · May 8, 2025, 1:20pm

Except these are trivial to answer: we can define exactly how t-strings are split by the subprocess module, because no such algorithm exists today and the t-strings are not platform-specific values, and quoting matches the current quoting algorithm used if you pass a list.

This is the equivalent of asking how should the literal in Path(...) / "a/b/c" be parsed on platforms where forward slashes aren’t path separators - the answer is, pathlib decides how to split, not the platform, because the string is an abstracted value.

I assume you mean that accepting str is the bug magnet (accepting Template doesn’t exist and so can’t be “notorious”). But this just means we should deprecate accepting str, or at least warn about it, and so someone who accidentally uses f instead of t will get a warning. (It’s also trivial to detect statically, though admittedly we probably need to commit to the runtime warning for the linters to be comfortable also providing a static warning.)

More run methods will only create more confusion. In the tradeoff between a potential (but not guaranteed!) typo, and deliberately having to import a different module and use a different type to get the same behaviour, I find it really hard to see how the latter is less confusing.

effigies · May 8, 2025, 1:32pm

Yes, str is the problem. I’m not sure I agree that accepting str should be deprecated, as there is already quite a lot of churn in the stdlib, but I think it’s a bad idea to accept both, even if you warn on str.

encukou · May 9, 2025, 7:25am

That we can!
The question is whether the result will be obvious for ~~everyone~~ enough people.
Another question is whether we can come up with the best solution right now. Posts like Alyssa’s make me wonder what kinds of out-of-the-box ideas we’re missing. I worry about painting ourselves in the corner.
An implicit conversion is easy to add later. It’s hard to remove.

Yes, it is equivalent. The main difference is just qualitative: treating / as a path separator is unsurprising and well-defined enough to make sense as a default. On the one platform where this matters, / is invalid in pathnames, so there’s little chance of Python doing the wrong thing.
Another difference at the API “level”: this decision is ntpath’s, not pathlib’s. If builtins.open didn’t accept / separators, it would make much less sense for pathlib to do so.
I don’t think we’re adding t-string support to os.execv?

steve.dower · May 9, 2025, 8:45am

Pathlib could quite easily have kept segments in a list/tuple rather than using join, so there was a deliberate decision to do it, implying a deliberate decision to not do something else.

Nobody would propose adding high-level functionality to low-level implementation functions. No need to introduce it now. subprocess is obviously the right level for the functionality (it already has one slightly awkward way to do it), and thin wrappers around C functions are not.

barry-scott · May 9, 2025, 2:07pm

What platform? It is valid on windows, macOS and unix systems in ther system APIs.

steve.dower · May 9, 2025, 2:38pm

It is normalised by Windows kernel APIs into a valid separator (unless you’ve disabled normalisation, or have bypassed the kernel and are dealing directly with the file system drivers).

pitrou · May 9, 2025, 2:39pm

That’s more or less how Pathlib worked in the past, but nowadays there is a more sophisticated set of attributes, some of them computed on-demand:

github.com/python/cpython

Lib/pathlib/init.py

cd2f234ad


      
          __slots__ = (
              # The `_raw_paths` slot stores unjoined string paths. This is set in
              # the `__init__()` method.
              '_raw_paths',
          
              # The `_drv`, `_root` and `_tail_cached` slots store parsed and
              # normalized parts of the path. They are set when any of the `drive`,
              # `root` or `_tail` properties are accessed for the first time. The
              # three-part division corresponds to the result of
              # `os.path.splitroot()`, except that the tail is further split on path
              # separators (i.e. it is a list of strings), and that the root and
              # tail are normalized.
              '_drv', '_root', '_tail_cached',
          
              # The `_str` slot stores the string representation of the path,
              # computed from the drive, root and tail when `__str__()` is called
              # for the first time. It's used to implement `_str_normcase`
              '_str',
          
              # The `_str_normcase_cached` slot stores the string path with

This file has been truncated. show original

In any case, it seems that any non-trivial operatoin on a Pathlib path will still produce the segments list, while still storing the original constructor arguments. @barneygale may want to correct me here.

steve.dower · May 9, 2025, 2:55pm

The point is, from the user’s POV, they don’t have to consider all the edge cases, because they’ve been given a simple rule that is entirely controlled by us and doesn’t depend on the underlying OS.

Doing the same thing for subprocess would give users an equivalent simple rule^[1] that would let them write simple code without having to consider the underlying OS (as much… obviously if they are intending to call OS-specific apps, then they have to consider it, but not for the syntax of splitting the string).

And the higher-order point is that Python does stuff like this. So an argument that “Python would never” is not a great starting point. If it were a language that strictly typed everything and disallowed duck-typing, we’d have much more concrete examples (such as separate sets of os functions for each pathlib type, rather than a way for the os module to use them).

Something like “splits at spaces (in the template string), grouping together arguments with pairs of single or double quotes”, followed by substituting the insertions without resplitting, and with substitutions being escaped based on how the string is being used (e.g. for shell=True/False and per-platform) ↩︎

barneygale · May 9, 2025, 4:03pm

some pathlib notes, mostly confirming things already said

pathlib defers to posixpath and ntpath for most OS-specific details like separators
For performance reasons, PurePath('a') / 'b' / 'c' / 'd' keeps the raw path segments in a list rather than parsing and joining with every __truediv__().
A normalized string representation is generated on-demand and cached
Alternate separators (‘/’ on Windows) are replaced with primary separators (‘\’ on Windows) in the string representation
The string representation also elides ‘.’ parts and extraneous separators, and on Windows it normalizes the root of UNC paths (e.g. \\server\share becomes \\server\share\)

barry-scott · May 9, 2025, 7:40pm

So its not invalid as @encukou said?

I just did in CMD dir "folder/file" and it worked.
I needed to quote the path to stop cmd/dir parsing the /file as a option.

As subprocess calls the _winapi.CreateProcess is the / vs. \ a problem?