Python should drop custom formatters

JohnnyNajera · April 13, 2023, 5:44pm

Custom 3rd party formatters ('%a%b%Qs~!z') are nothing but confusing. Why should we encourage another obscure language within the good language we already have? More often than not, this just causes developers to turn to some forgotten table in the object’s documentation, squinting as they look for the necromantic rune that solves their problem.

I think formatting should mostly be done in readable code. Usually there are a few common repeating formats which can have dedicated function(s). In the relatively rare cases when someone needs a custom one-time formatting, she can use f-strings to format the object in a prior expression, which is much easier to maintain and understand.

If indeed, in some use cases, it is required to keep generating different and weird formats for an object, and such a mini-language is extremely useful (as all the involved developers indeed know it by heart), then such an object can implement this functionality independently.

But today with f-strings python supports this idea natively at the syntax level.

Dropping it will allow moving format strings to compile time, which is much more reasonable. It would also allow creating a standard, consistent syntax for them (today, padding a datetime requires a different sorcery than padding a number, not to mention 3rd-party libs).

Today the compiler just looks at the format string as an opaque blob. One can easily insert a “syntax” error there and no one will know until execution.
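A minimal sketch of this point, assuming CPython: a specifier that is invalid for int survives compilation and only fails once the value is actually formatted. The variable names are illustrative.

```python
# The f-string below carries a format spec that is invalid for int.
source = "f'{42:zz}'"

# Compilation succeeds: the compiler does not validate the spec.
code = compile(source, "<example>", "eval")

# The error only surfaces when the expression is evaluated.
try:
    eval(code)
    failed_at_runtime = False
except ValueError as exc:
    failed_at_runtime = True
    message = str(exc)
```

Here `failed_at_runtime` ends up `True`, and `message` holds the interpreter's complaint about the specifier.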

If the format strings are limited to standard types and are well defined in the spec at compile time, linters and IDEs will have an easier job. Autocomplete and suggestions may actually make people use all the great functionality to quickly format a string.

Dynamic runtime format strings also open a problem from a security consideration.
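One commonly cited concern (not necessarily the one meant here) is that `str.format` lets the template itself walk attributes of the arguments, so a template from an untrusted source can reach data the caller never meant to expose. A small sketch; the `Account` class and its fields are made up for illustration:

```python
class Account:
    # Hypothetical class used only for this example.
    def __init__(self, name, api_token):
        self.name = name
        self.api_token = api_token

account = Account("alice", "s3cr3t-token")

# Intended template, written by the application author.
trusted_template = "Hello {account.name}"

# A template supplied from outside can walk to any attribute it likes.
untrusted_template = "Hello {account.api_token}"

safe = trusted_template.format(account=account)
leaked = untrusted_template.format(account=account)
```

An f-string literal in source code does not have this problem, since its text is fixed when the code is written.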

More than anything, I think using runtime strings to communicate between components is an idea that should slowly fade away.

EDIT: just to clarify we can still have dynamic formatting capabilities. E.g., dict is a python literal, and a native object, but it can also be read at runtime from a config file (json) when crucial. This can be true for the “formatter” object too.

ericvsmith · April 13, 2023, 6:17pm

I completely disagree. The extensibility of the format specifiers to new and unknown types is a key feature of PEP 3101, where they were introduced. It’s a key differentiator between it and %-formatting.
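As a rough illustration of that extensibility, here is a sketch of a type that defines its own small mini-language through `__format__`. The `Temperature` class and its trailing `C`/`F` unit code are hypothetical, not taken from any library.

```python
class Temperature:
    """Hypothetical type whose format spec may end in a unit code: C or F."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        # Split an optional trailing unit letter from the numeric part.
        unit = "C"
        if spec and spec[-1] in "CF":
            unit, spec = spec[-1], spec[:-1]
        value = self.celsius if unit == "C" else self.celsius * 9 / 5 + 32
        # Delegate width, precision, etc. to the standard float formatting.
        return format(float(value), spec or ".1f") + " " + unit

t = Temperature(21.5)
as_celsius = f"{t}"                       # empty spec: default unit and precision
as_fahrenheit = "{:.0fF}".format(t)       # same hook, reached via str.format
```

Both f-strings and `str.format` reach the same `__format__` hook, which is what makes the mechanism open to types the standard library knows nothing about.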

Even if you think it’s a bad idea, it’s far too late to change how they work.

It’s also not reasonable to force the use of f-strings everywhere. There are plenty of safe uses of str.format, including use of dynamically supplied format specifiers.
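For instance, a sketch of dynamically supplied specifiers, assuming the width and precision arrive as data (the `settings` dict is illustrative):

```python
# Width and precision arrive as data, e.g. from user settings.
settings = {"width": 10, "precision": 3}

# Nested replacement fields build the spec at call time.
padded = "{:{width}.{precision}f}".format(3.14159265, **settings)

# Equivalent: assemble the spec string and hand it to format().
spec = f"{settings['width']}.{settings['precision']}f"
same = format(3.14159265, spec)
```

Both routes produce the same ten-character, right-aligned result.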

JohnnyNajera · April 13, 2023, 6:28pm

It’s also not reasonable to force the use of f-strings everywhere.

Slowly – why not?

Instead of generating runtime strings to be parsed later on – one could generate a formatter object, which will probably be faster and more strict.

But if it is too late, then it is too late.

ericvsmith · April 13, 2023, 6:30pm

Because, for example, they might be read from a config file and not be available to the compiler.

JohnnyNajera · April 13, 2023, 6:37pm

One can do runtime formatting by creating a “formatter object”, which most of the time will probably be faster than generating and parsing. This object can be created according to the config file at runtime. You can even create a layer which parses the formatting strings at runtime, just like today, to create this formatter object.
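What such a “formatter object” might look like is left open in the thread. The following is only a sketch under that assumption: `NumberFormat` and `from_config` are hypothetical names, not an existing Python API, and the object still builds a spec string internally.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NumberFormat:
    """Hypothetical formatter object; not an existing Python API."""
    width: int = 0
    precision: int = 2
    zero_pad: bool = False

    def __call__(self, value):
        # Today this still round-trips through a spec string internally.
        pad = "0" if self.zero_pad else ""
        width = str(self.width) if self.width else ""
        return format(value, f"{pad}{width}.{self.precision}f")

    @classmethod
    def from_config(cls, options):
        # Unknown keys raise TypeError here, before any value is formatted.
        return cls(**options)

price = NumberFormat.from_config({"width": 8, "precision": 2, "zero_pad": True})
formatted = price(3.14159)
```

The strictness comes from the constructor: a misspelled option fails when the object is built rather than when the first value is formatted.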

Just like a dict, it is natively an object, and if you must read it from a file, you can parse a json, or equivalent.

ericvsmith · April 13, 2023, 6:48pm

The original post said the compiler should be aware of format specifiers, I’m just saying that’s not always possible.

chepner · April 13, 2023, 7:02pm

F-strings allow the same kind of formatting.

>>> import datetime
>>> f'{datetime.datetime.now():%Y-%m-%d}'
'2023-04-13'

It’s not clear how this is any more prone to errors than

>>> datetime.datetime.now().strftime("%Y-%m-%d")
'2023-04-13'

unless you are advocating that everyone should, instead, write code like

>>> x = datetime.datetime.now()
>>> f'{x.year}-{x.month}-{x.day}'
'2023-4-13'

(I intentionally left out the format spec that pads the month with a 0. I’d be interested to hear if you consider {x.month:02} “confusing” as well.)
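To compare the three spellings side by side, here is a small sketch with a fixed date so the results are stable (the variable names are illustrative):

```python
import datetime

d = datetime.datetime(2023, 4, 13)  # fixed date so the results are stable

via_fstring = f"{d:%Y-%m-%d}"          # spec is forwarded to datetime.__format__
via_strftime = d.strftime("%Y-%m-%d")  # the same mini-language, called directly
via_attributes = f"{d.year}-{d.month:02}-{d.day:02}"  # integer specs only
```

All three produce the same zero-padded string; the third simply swaps the `%`-codes for integer padding specs.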

Rosuav · April 13, 2023, 7:25pm

They’re way WAY more than confusing. They are extremely useful and powerful. If you are unable to make use of them, then don’t; but the language doesn’t need to be weakened to accommodate this.

Remember, it is absolutely okay to ignore parts of a language. I have never used large slabs of the standard library at all, and there are entire language features that I’ve only ever used in testing. This is not a problem. The language provides such a huge set of features that, even if you ignore some of them, it’s still a very powerful language.

You are, of course, completely free to disallow third-party formatters in your codebases. People do this sort of thing all the time (particularly with JavaScript, which has some utterly appalling misfeatures baked into the language, but has much better ways of doing the same things - for example, lots of codebases mandate that you avoid the var keyword and use let instead).

That’s true, but you can also put a spelling error into a string literal and no one will know until execution. Or you could write x + 10 when you should have written x - 10. Bugs come in many MANY forms. Format strings are a compact language for a specific purpose, just like regular expressions, and have a lot of advantages and disadvantages - just like regular expressions do. Want to use them? Go ahead and use them. Want to avoid them in favour of something else? That’s not a problem either.

JohnnyNajera · April 13, 2023, 8:06pm

First of all this is phrased in somewhat a rude manner and I don’t know why it is necessary.

Can you give a nice example? Maybe it will give me a better perspective.

Anyway, as I said in the post, nothing prohibits a library from having its own format function. The question is whether or not this should be supported natively by Python. We should be comparing the added value of the native support of custom runtime formatters to the price. I’m sure you would agree we are paying at least some price by supporting this.

My problem is not that they are just there. My problem is the fact that their syntax-level native support prevents doing something (imo) more remarkable, like turning a few of the extremely useful formatters (numbers, strings, etc.) into part of the language spec. Turning it into a compile time feature. 3rd party formatters can stay an independent runtime feature.

To me, disassembling this

>>> dis.dis(lambda x: f'{(x+5)**2:30.2f}')
  1           0 LOAD_FAST                0 (x)
              2 LOAD_CONST               1 (5)
              4 BINARY_ADD
              6 LOAD_CONST               2 (2)
              8 BINARY_POWER
             10 LOAD_CONST               3 ('30.2f')
             12 FORMAT_VALUE             4 (with format)
             14 RETURN_VALUE

and seeing the format string abandoned there as a runtime string feels unbaked. It seems so random to have there a string all of a sudden.
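The listing above is specific to one CPython version, and opcode names have changed between releases. A hedged way to reproduce the observation on whatever interpreter is at hand, assuming CPython keeps a constant spec as a string constant on the code object:

```python
import dis
import io

fn = lambda x: f"{(x + 5) ** 2:30.2f}"

# The spec is stored as an ordinary string constant on the code object.
spec_is_constant = "30.2f" in fn.__code__.co_consts

# Capture the disassembly; opcode names differ between CPython versions.
buffer = io.StringIO()
dis.dis(fn, file=buffer)
listing = buffer.getvalue()
```

Printing `listing` shows the local equivalent of the bytecode quoted above.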

At its core, a format specification is (very often) static. Parsing it only at runtime feels almost unnatural.

barry-scott · April 13, 2023, 8:41pm

Does not work for internationalisation.
The string together with its formatting is a mandatory requirement for internationalisation.
It cannot be compiled, as the string and the placement of the replacements are read at runtime from files that depend on the language of the user.

TeamSpen210 · April 13, 2023, 8:41pm

With the specialising interpreter, formatting could be optimised without needing to require any changes at all. It could look for LOAD_CONST + FORMAT_VALUE, then specialise it based on the type it observes, and stash the parsed format specification somewhere on the code object. Main question is whether implementing that would actually improve performance enough to be worth it.

Rosuav · April 13, 2023, 8:52pm

This would only work when it truly is a constant, and thus fails on I18n (where the format string may be shoved off into a config file somewhere).

Is there actually a performance problem to be solved here, though?

JohnnyNajera · April 13, 2023, 9:07pm

Can’t find my comment, reposting.

The formatting of course happens at runtime, but is there any problem having the parsing of the format itself in compile time?

barry-scott · April 13, 2023, 9:17pm

The format is in an external database file. How will you compile that?
For gettext that is the .mo binary file that is accessed by the gettext code.
For Qt the code is in the Qt modules.
In both cases these are tool chains that are outside of Python’s ability to change.

JohnnyNajera · April 13, 2023, 9:43pm

Maybe I don’t understand you correctly. Numbers for example, have their format spec in the documentation. I want to turn the format string into an object which encodes the format specification. If some external data is required at runtime in order to do the actual formatting, it can still be fetched at runtime, no problem. It is not as if the entire mini-language syntax is completely unpredictable before runtime.

aroberge · April 13, 2023, 9:47pm

At compile time, you don’t know what language will be requested by the end-user. Internationalisation is done lazily: the translation cannot be done ahead of time.

ericvsmith · April 13, 2023, 10:57pm

At compile time you can’t know if the format spec is meant for an integer.

import datetime

def f(o):
    print(f'{o:02}')

f(4)
f(datetime.datetime.now())

What could the compiler do with the format spec ‘02’ to improve things? If this were really in need of optimization, which I doubt until shown otherwise, you could already parse the ‘02’ if o were an int and reuse that information the next time it was used with an int. I don’t think there’s any need to involve the compiler itself in this.

JohnnyNajera · April 14, 2023, 1:14am

Optimization is not the main point here at all, it was just a side point.

And yes, regarding your question, I suggest that the standardized format syntax should indicate the type of the formatting (in practice this may be the type of the object).

Because anyway it doesn’t make sense to format an object of type you don’t know, as the format syntax is drastically different from one object to the other. If there are common behaviours between the formatting of different types, of course this can still be preserved.

So for example the first character of the format could denote what kind of formatting we are doing here.

result = f'{o:n02}'  # n denotes number formatting

But this is not the only option. You can find many ways to standardize the format language in a way which doesn’t have ambiguities.

ericvsmith · April 14, 2023, 1:19am

Okay.

We can never remove the existing behavior, and there’s no sense having two formatting languages, so I’m going to drop out of the conversation.

kknechtel · April 14, 2023, 11:24am

I think there are two separate topics here: 1) f-strings vs. explicit formatting requests, and 2) {}-based vs. custom (%-based and others) format specifications.

f-strings vs. explicit formatting

That’s called a string. You just put {} placeholders in it.
No, it cannot do arbitrary calculations. F-strings support that to save some typing. When you are doing i18n in the real world, you do not want to do that kind of calculation on the fly. You want to look up some locale-specific strings and values and substitute them into a template. (Sometimes the template will also have to be locale-specific.) If you don’t already have the values, then you just write the code to calculate them first, and store them with descriptive names, then do the actual interpolation. (Or if it’s simple enough, you stuff them into the arguments for the call to the formatter.)

In case you were not using Python before 3.6: this is done using the .format method of the built-in string type.

>>> test = 1
>>> f'{test}'
'1'
>>> '{}'.format(test)
'1'

Yes; the format will be determined at runtime.

As a trivial example: some cultures prefer to write today’s date (at the time of day I’m writing this, it should be the same day in nearly all the populated places in the world) in the order 4-14-2023 and others as 14-4-2023. Even supposing that we don’t use the datetime library at all, and just have three variables with those numbers in them, with the .format method of strings we can write

date_format = '{month}-{day}-{year}' if month_first_locale() else '{day}-{month}-{year}'

And then later in the program do e.g.

today = date_format.format(day=14, month=4, year=2023)

And the date_format string can be passed around the program like any other string, read from a file, etc. These things are impossible with f-strings. You cannot store the f-string in a file in a meaningful way, because there is nowhere to put the f. If you used an f-string in the code that writes the file, then the current values are hard-coded into your saved data - you don’t have a reusable template.
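A sketch of the “read from a file” case, with a JSON string standing in for the file on disk. The `date_format` key is made up for this example:

```python
import json

# Stand-in for a file on disk; the key name is made up for this example.
config_text = '{"date_format": "{day:02}/{month:02}/{year}"}'

config = json.loads(config_text)
date_format = config["date_format"]  # an ordinary str, reusable as a template

# The same template is applied to different values at different times.
first = date_format.format(day=14, month=4, year=2023)
second = date_format.format(day=1, month=12, year=2024)
```

The template, including its per-field specs, is plain data until `.format` is called on it.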

So, f-strings don’t belong in the same category as all the other options. The syntax shared by f-strings and the .format method belongs in the syntax category, alongside the %-based syntax used by the % operator (and by the standard library logging module, some SQL bindings, datetime.strptime/.strftime, etc. etc.) as well as other custom syntaxes.

The actual format-specification syntax

There is already some support for this. Please see the documentation, specifically the type field:

>>> f'{100:f}'
'100.000000'

However, it’s often much less useful than one might like, and of course only a few types can be “privileged” in this way to an extent that would matter for the compiler. It wouldn’t work for library types like datetime.datetime to try to “claim” a type specifier, because the parser wouldn’t be aware of them, much less the compiler.

Instead, the grammar allows the part after : to have an arbitrary format, the “standard” in the doc notwithstanding, and this is eventually passed to the __format__ method of whatever is getting interpolated.
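A tiny sketch of that pass-through, using an illustrative class that simply reports the spec it was handed:

```python
class ShowSpec:
    """Illustrative class that returns whatever spec it receives."""

    def __format__(self, spec):
        return f"<spec={spec!r}>"

obj = ShowSpec()

# All three entry points hand the text after the colon to __format__ unchanged.
from_fstring = f"{obj:%Y anything}"
from_method = "{:%Y anything}".format(obj)
from_builtin = format(obj, "%Y anything")
```

Nothing in the grammar interprets `%Y anything`; what it means is entirely up to the receiving type.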

Now, my own thoughts, which are entirely about the second topic.

The {}-based syntax is really nice for working with an overall string that needs to have multiple pieces of data formatted in. Many library authors seem to think of their types (like datetime.datetime, whatever object represents an SQL query, etc.) as “single” pieces of data that might be formatted either separately or in a larger context. (The logging module is the way it is only for historical reasons, I’m sure. After all, the data type there is just str.)

So, they’ve invented (or emulated from an older source: e.g., an existing C library which either inspires the Python feature, or is implementing it under the hood while Python provides minimalist bindings) a variety of custom format specifications of their own. I’m generally not a big fan of these, like OP: they tend to be confusing (%m vs %M in datetime.datetime formats is hard to remember, and recently I learned it is the other way around for Numpy!) and ugly ({} is symmetrical and makes it clear what the bounds of the “placeholder” are; most % syntaxes expect a single character, although you get weird compromises like Python’s old %(varname)s) and redundant (in the syntaxes where %s is a placeholder for a string, it will typically accept non-strings anyway, and the more-type-specific formatters might not do noticeably different things).

{} syntaxes address these problems elegantly: it’s clear where the beginning and end are, it looks nice, and you only have to specify type conversions etc. when necessary (stuff like !r to make Python use repr instead of str, or :f to treat integers as float - a !s is never required, so the common case is an empty string, rather than s). Finally, it allows you to embed the custom syntaxes, as shown up-thread by @chepner.

But more importantly for contexts like datetime.datetime, .format already offers some limited destructuring:

>>> '{0[hello]}{0[hello]}'.format({'hello': 'world'})
'worldworld'

Similarly with attributes rather than dict keys.

I personally would like to see the standard library move more in this direction. While it is possible to write

>>> import datetime
>>> '{:%Y-%m-%d}'.format(datetime.datetime.now())
'2023-04-14'

I would on aesthetic grounds much rather follow the second example already:

>>> '{x.year:04}-{x.month:02}-{x.day:02}'.format(x=datetime.datetime.now())
'2023-04-14'

and I would like to be able to have a shorter way to do it. For example, if the str class supported something like

>>> class Example(str):
...     def format_attrs(self, obj):
...         class _: # throwaway class to provide a method that uses `obj` from the closure
...             def __getitem__(self, name):
...                 return getattr(obj, name)
...         return self.format_map(_())

Then we could do (without the need to wrap in the subclass):

>>> Example('{year:4}-{month:02d}-{day:02}').format_attrs(datetime.datetime.now())
'2023-04-14'

(Similarly, it would be nice to have some mechanism to restrict, or supply separately, the environment from which f-strings draw names.)

I think this is a lot clearer (of course, it could also have shorter aliases for the property names, but definitely not the ambiguous m). I don’t want to have to think about the type’s own formatting API, because the instance already has attributes which I already understand how to work with in the standard way. I don’t have to mentally correspond b to months (for the name of a month - what on Earth??) And I don’t have to remember what the code is for, say, a numeric month value that isn’t zero-padded (trick question: there isn’t one).

If custom per-type formatters have any use here IMO, it’s for things like specifying a 4-digit vs 2-digit year, or a short vs. full month name, because that actually involves type-specific processing. Putting day/month/year in a specific order and putting literal hyphens or slashes between them, are boring, generic tasks and I don’t need or want the class’ help with them. I do want its help to know that e.g. MR rather than MA is the 2-letter abbreviation for March, or that Thursday as a single letter is R rather than T, at least, in the contexts where that’s true.

Of course, implementing that involves the datetime module coming up with its own type to represent months, or days-of-the-week, which can stringize or format in various ways. Probably making use of enums, now that those exist. All of these seem like improvements to me. Just imagine (though of course the semantics could be defined differently; this is just what makes the most sense to me off the top of my head):

>>> april = datetime.datetime.now().month # now some enum type that implements `__format__`
>>> f'{april}' # raises a ValueError - ambiguous
>>> f'{april:s}' # April
>>> f'{april:3s}' # Apr
>>> f'{april:3S}' # APR
>>> f'{april:2S}' # AP ; but May would be MY, not MA
>>> f'{april:1S}' # raises a ValueError - no such valid abbreviation
>>> f'{april:d}' # 4
>>> f'{april:02d}' # 04

(And this, of course, is why I specified the superfluous :4 for the year in the previous example; my thinking is that a similar year type could interpret :2 to take the last 2 digits - something that int does not do.)
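A rough sketch of part of the imagined semantics, as a standalone class rather than a change to datetime. `Month` and `MONTH_NAMES` are hypothetical, and only the full name, the 3-letter forms and the numeric forms are implemented; the 1- and 2-letter abbreviations are left out.

```python
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)

class Month:
    """Hypothetical month type; datetime has no such class today."""

    def __init__(self, number):
        self.number = number

    def __format__(self, spec):
        if spec.endswith("d"):
            # Numeric forms reuse the int mini-language.
            return format(self.number, spec)
        if spec in ("s", "3s", "3S"):
            name = MONTH_NAMES[self.number - 1]
            text = name if spec == "s" else name[:3]
            return text.upper() if spec.endswith("S") else text
        # Covers the empty spec and abbreviations this sketch does not define.
        raise ValueError(f"unsupported month format spec: {spec!r}")

april = Month(4)
full_name = f"{april:s}"
short_name = f"{april:3s}"
short_upper = f"{april:3S}"
padded_number = f"{april:02d}"
```

As in the imagined session, formatting with an empty spec (`f"{april}"`) raises `ValueError` in this sketch.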
