New format specifiers for string formatting of floats with SI and IEC prefixes

avylove · May 19, 2023, 4:49pm

I was thinking of writing a PEP to support formatting floats with SI (decimal) and IEC (binary) prefixes natively for float (and maybe other types if it makes sense) but wanted to feel it out first.

I’ve implemented what I want to propose in the Prefixed package as a subclass of float. It came out of some work I was doing a few years ago and has been refined since then through use and feedback.

The format specification changes are outlined here, but in summary it adds:

New presentation types
- 'h': SI format. Outputs the number with closest divisible SI prefix. (k, M, G, …)
- 'H': Same as 'h' but treats precision as significant digits
- 'k': IEC Format. Outputs the number with closest divisible IEC prefix. (Ki, Mi, Gi, …)
- 'K': Same as 'k' but treats precision as significant digits
- 'm': Short IEC Format. Same as 'k' but only a single character. (K, M, G, …)
- 'M': Same as 'm' but treats precision as significant digits
New flags
- '!': Add a single space between number and prefix
- '!!': Same as '!' but drop space if there is no prefix
New field
- Margin, denoted with '%', raises or lowers the threshold for prefixes

Here are some examples of what this would look like if implemented directly in the float type. You can do these today with prefixed.Float.

Simple SI (decimal) case

>>> f'{3250.0:.2h}'
'3.25k'

Same thing with the '!' flag to add a space

>>> f'{3250.0:!.2h}'
'3.25 k'

Simple IEC (binary) case

>>> f'{2048.0:.2k}B'
'2.00KiB'

Same thing, but short form

>>> f'{2048.0:.2m}B'
'2.00KB'

Difference in the way precision is handled when treated like significant digits

>>> f'{1246.0:.3h}'
'1.246k'

>>> f'{1246.0:.3H}'
'1.25k'

Difference in the way prefixes are determined based on margin field. The example lowers the threshold by 5%.

>>> f'{950.0:.2h}'
'950.00'

>>> f'{950.0:%-5.2h}'
'0.95k'

I couldn’t find a previous discussion of something similar on this forum, but there was some discussion in CPython issue 75930.

When I originally looked at this problem, there seemed to be a lot of packages and code samples solving specific uses, but not really any generic solutions. This is my attempt to imagine what it would look like if this was generic and native. I think it’s a common problem and would get a lot of use.

I would appreciate your thoughts and suggestions. Please check out Prefixed if you want to test drive the proposal. And if you’d like to help with writing the PEP or the C code, please reach out.

(Sorry for the lack of links, apparently new users can only have 2 in a post)

effigies · May 19, 2023, 5:02pm

Why h, k and m? There doesn’t seem an obvious mnemonic there to remember them by.

avylove · May 19, 2023, 5:18pm

There are only so many letters. They made sense at the time, but I’m open to others if anyone has ideas. It actually used to be h and j/J (for JEDEC), but it was changed when the significant digit logic was added.

guido · May 19, 2023, 5:22pm

Why does this need to be in the stdlib? It seems fine as a PyPI package.

avylove · May 19, 2023, 5:53pm

@guido, I wasn’t suggesting adding the package to the standard library. I was proposing adding similar logic to the built-in float data type. The example implementation in Prefixed is really just a subclass of float that overrides __format__() and was done as an experiment to see what it would look like if implemented natively in the language. I think it’s a common enough problem and generic enough to be considered.
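To make the "subclass of float that overrides __format__()" idea concrete, here is a minimal sketch. It is not the Prefixed implementation: the class name `SIFloat` and the `SI_PREFIXES` table are hypothetical, only the 'h' presentation type is handled, and the flags, margin field, and significant-digit variants are left out.

```python
import math

# Hypothetical illustration only: this is not the Prefixed implementation.
# Keys are powers of 1000, values are the matching SI prefix.
SI_PREFIXES = {-4: 'p', -3: 'n', -2: 'µ', -1: 'm', 0: '',
               1: 'k', 2: 'M', 3: 'G', 4: 'T'}


class SIFloat(float):
    """float subclass that understands a trailing 'h' presentation type."""

    def __format__(self, spec):
        if not spec.endswith('h'):
            # Anything else is handled by the normal float machinery.
            return super().__format__(spec)
        value = float(self)
        fixed_spec = spec[:-1] + 'f'
        if value == 0 or not math.isfinite(value):
            return format(value, fixed_spec)
        # Pick the power of 1000 at or below the magnitude,
        # clamped to the prefixes this sketch knows about.
        group = math.floor(math.log10(abs(value)) / 3)
        group = max(min(group, max(SI_PREFIXES)), min(SI_PREFIXES))
        scaled = value / 1000.0 ** group
        return format(scaled, fixed_spec) + SI_PREFIXES[group]


print(f'{SIFloat(3250.0):.2h}')  # scaled by 1000 and suffixed with 'k'
print(f'{SIFloat(950.0):.2h}')   # below 1000, so no prefix is added
```

The sketch does not handle rounding at a prefix boundary (a value just under 1000 can still round up to four integer digits), which is one of the details a real implementation has to get right.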

Updated the original post to hopefully make this clearer.

guido · May 19, 2023, 7:11pm

Sorry, the boundaries between the built-ins and the stdlib are often fuzzy, and core dev terminology often lumps them together under stdlib.

I think you will have to argue this more forcefully and with data before it will be considered. Without that I feel that people who like to format their numbers in different ways can easily use a 3rd party package like yours to do the formatting for them.

avylove · May 19, 2023, 9:58pm

@guido, I think Rich Jones (not sure if he’s on here) explained it well in the referenced issue:

It’s not a complex problem, the solutions are fairly simple, but there are many ways to shoot yourself in the foot when rolling your own. There are already many packages which attempt this, most of which aren’t used by any serious projects, who instead use their own implementations. There are just as many snippets of partial solutions floating around the internet as well. There is no canonical way to solve this common problem.

This is exactly why this common functionality should be added to the standard library, so that this extremely common function doesn’t have to be imported from some-random-jamook’s-untrustworthy-project-on-PyPI or rewritten from scratch for every project.

I think we can take it a step farther and just make it part of standard string formatting. It makes it more efficient and reduces the complexity of converting types and pulling in imports.

A cursory search yielded a lot of results. Many one-off implementations and small projects on PyPI. Many projects roll their own with varying quality. Before I created Prefixed, I personally must have written code to format byte sizes for output at least 100 times, usually just enough for what I cared about in the moment (for example only MiB or KiB). How many times have you done something similar?

This is where I had a bunch of links, but I’m still limited to 2 per post (when does that go away?), so here’s a link to a GitHub gist with a bunch of links from my initial search.

pf_moore · May 19, 2023, 10:18pm

This sounds to me as if there’s not yet an “obvious” implementation that captures the best API and semantics. Maybe it’s worth waiting a while longer until a clear “best of breed” implementation emerges?

Very rarely, to be honest - although it’s something I might have used on occasion[1] if there had been a well-known implementation. I don’t think it’s something that really needs to be in the stdlib, and I definitely don’t think it warrants being built into the float implementation’s format method. For example, one obvious (to me) use case is formatting file sizes, and those are integers, not floats, so having this as a float method rather than a function would be a disadvantage in that case.

Also, as a function rather than an extension to float.__format__, it’s much easier to publish a reference implementation on PyPI (and if it does get added to the stdlib, having a backport is a significant advantage as well).
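As a rough sketch of the function-based alternative, assuming a hypothetical helper named `format_size` (not taken from any existing package), the file-size case could look like this. It accepts plain integers, which is the point of doing it outside float.__format__.

```python
def format_size(num_bytes, precision=2):
    """Format a byte count with an IEC prefix (hypothetical helper)."""
    prefixes = ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei']
    value = float(num_bytes)
    index = 0
    # Divide by 1024 until the value drops below the next prefix threshold
    # or the table of prefixes runs out.
    while abs(value) >= 1024 and index < len(prefixes) - 1:
        value /= 1024
        index += 1
    return f'{value:.{precision}f}{prefixes[index]}B'


print(format_size(2048))            # an int goes straight in, no subclass needed
print(format_size(5 * 1024**3, 1))
```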

[1] Although probably only as a substitute for a “format in a human-readable way” function.

avylove · May 19, 2023, 10:41pm

I read it a little differently. I think most of the solutions are too simplified or specific to be used as a general solution. For example, almost none of them take into account significant digits, which is extremely important in scientific fields.

I’ve gone back and forth on whether this should be applied to integers as well, but it seems it likely should. The SI prefixes for magnitudes below 1 (milli, micro, …) wouldn’t apply to integers, but the logic would be almost the same.

I just don’t see the point of a function when the capability is already built into the language. It would make sense if these were arbitrary units, but we’re talking standardized magnitudes.

Let me put this another way, how can we justify including scientific notation in standard formatting and not include engineering notation? Engineering notation has been used on calculators since 1975 and, depending on your field, can be much more commonly used than scientific notation.

MRAB · May 20, 2023, 12:20am

There’s a difference between “engineering notation”, where the number is expressed with an exponent that’s a multiple of 3, and SI or similar prefixes. FWIW, I’d be OK with adding engineering notation as a built-in, but I agree with the others about handling prefixes.
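For illustration, engineering notation in this sense (a plain exponent that is a multiple of 3, with no prefix letter) could be sketched as below. The function name `eng_notation` and its output style are assumptions made for this example, not an existing or proposed built-in.

```python
import math


def eng_notation(value, precision=2):
    """Format value as <mantissa>e<exp>, where exp is a multiple of 3
    and 1 <= |mantissa| < 1000 (hypothetical helper)."""
    if value == 0 or not math.isfinite(value):
        return f'{value:.{precision}f}'
    # Round the base-10 exponent down to the nearest multiple of 3.
    exponent = 3 * math.floor(math.log10(abs(value)) / 3)
    mantissa = value / 10.0 ** exponent
    return f'{mantissa:.{precision}f}e{exponent}'


print(eng_notation(3250.0))      # exponent 3, where an SI formatter would print 'k'
print(eng_notation(0.00047, 1))  # exponent -6, where an SI formatter would print 'µ'
```

Mapping that exponent to a letter such as k or µ is the separate step that the prefix proposal adds on top.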

jrivers · May 20, 2023, 12:21am

How prefixes are used varies a lot depending on what you’re measuring. This would work well with watts (for the SI units) or for file sizes (for the IEC units). However:

For distances: Small distances are often measured in centimeters (cm), but that prefix isn’t even supported in this implementation. But, long distances aren’t measured in megameters (Mm)—the distance all the way around the earth is 40,000 km, not 40 Mm (which would be confused with millimeters).
For volumes: Sometimes deciliters (dL) are used, and that’s also not one that you’ve included. But, other times that unit isn’t used.
For mass: 1 million grams is often not a megagram (Mg), but a tonne (t). And the larger prefixes are used together with the tonne, e.g., kilotonnes (kt), megatonnes (Mt).
For time: The prefixes are used for times smaller than a second, but, if it’s larger than a second, then minutes and hours are used.

This makes me think that there can’t really be a general solution for using prefixes.

tjreedy · May 20, 2023, 1:42am

Which is to say, a format module would need several functions for different types of units.

I think we should continue to leave such things (other than time and dates) out of the stdlib. On the other hand, a C-coded multiple-of-3 format option would be useful for such 3rd party modules.

avylove · May 20, 2023, 10:45am

Providing SI prefixes is another format of engineering notation. It’s arguably the more popular one.

We’re not talking about arbitrary prefixes, we’re talking about SI (decimal) and IEC (binary) prefixes. These are standards set by the International Bureau of Weights and Measures and International Electrotechnical Commission. “centi” and “deci” are not part of the standard. This is why you don’t generally see them in scientific and engineering context. It’s true that time is treated differently, mainly because it’s not base-10 or base-2, but we have a module for that.

pf_moore · May 20, 2023, 11:03am

The stdlib needs to cover a broader range of use cases, though. Not all Python users work in science/engineering, and dismissing centimetres (for example) simply because they “aren’t part of the standard” isn’t really practical.

A library on PyPI can have a more focused audience. Maybe that’s why there are multiple implementations out there, because they cater for different use cases?

Just to repeat - I’d personally be interested in something like this (but as a function, not a class with a specialised __format__ method), but on PyPI, not as a stdlib module. So I’m not against the idea, just the proposal that it needs to be part of the standard library or the float format spec.

ntessore · May 20, 2023, 11:27am

Who hasn’t wished for the one additional format specifier to transform number of bytes into human-readable form? (Or maybe two, for k, M, G, … and Ki, Mi, Gi, … respectively).

avylove · May 20, 2023, 11:41am

And that’s the goal here, to add the most popular forms of engineering notation. It won’t solve every use case, but it will solve many. We’re never going to solve every use case. For example, we support formatting with base-10 scientific notation, but not base-2. Centimeters aren’t engineering notation and that’s also a case where the user wants to use a specific unit of measure, which is not what engineering notation is for. It is to express magnitude in a succinct way.

ajoino · May 20, 2023, 4:03pm

According to the Wikipedia page for metric prefixes, centi and deci are part of the standard. Maybe not the best source but I trust it on this one.

NIST also includes them https://www.nist.gov/pml/owm/metric-si-prefixes

avylove · May 20, 2023, 4:17pm

Yes, valid prefixes, but not engineering notation because they do not represent a power of 10 which is divisible by 3. Also discouraged or removed in many downstream standards (AIA, ASTM, GSA, etc).

jagerber · May 20, 2023, 6:55pm

My opinion is that it would be nice if the following were built in format specifications for float:

A formatting mode that ALWAYS displays a specified number of sig figs (not precision) along with additional options for forcing the display to be in
- Standard notation, i.e. no scientific notation/exponent: 123.456 → 120 for 2 sig figs
- Scientific notation: The non-zero number is set such that its mantissa satisfies 1 <= m < 10 so that 123.456 -> 1.2e2 for 2 sig figs
- Standard engineering notation: The number is shown in scientific notation but the mantissa is chosen such that 1 <= m < 1000 and the exponent is forced to be a multiple of 3 so 123.456 -> 120e0.
- “Shifted” engineering notation: the exponent is again a multiple of 3 but the mantissa ranges between 0.1 <= m < 100 so that 123.456 -> 0.12e3.

The existing #.2g format specification realizes the goal of always displaying the correct number of sig figs BUT, it has an automated routine to determine whether it uses standard or scientific notation. The user can’t control this automated selection with options in the format specification. Furthermore, there is no functionality for engineering notation.
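The four modes listed above can be sketched as a plain function. This is only an illustration under assumptions: the name `sig_format` and the mode strings are hypothetical, the rounding is delegated to the existing 'e' presentation type, and `decimal.Decimal` is used just to shift the decimal point without introducing new rounding error.

```python
import math
from decimal import Decimal


def sig_format(value, sig=2, mode='standard'):
    """Format value with exactly `sig` significant figures (hypothetical helper)."""
    if value == 0 or not math.isfinite(value):
        return format(value, 'g')
    # Let the existing 'e' type do the rounding, then shift digits exactly.
    rounded = Decimal(f'{value:.{sig - 1}e}')
    exp = rounded.adjusted()  # base-10 exponent of the leading digit
    if mode == 'standard':
        return format(rounded, 'f')
    if mode == 'scientific':
        exp_out = exp                   # 1 <= m < 10
    elif mode == 'engineering':
        exp_out = 3 * (exp // 3)        # 1 <= m < 1000
    elif mode == 'shifted':
        exp_out = 3 * ((exp + 1) // 3)  # 0.1 <= m < 100
    else:
        raise ValueError(f'unknown mode: {mode!r}')
    return f'{rounded.scaleb(-exp_out):f}e{exp_out}'


for mode in ('standard', 'scientific', 'engineering', 'shifted'):
    print(mode, sig_format(123.456, 2, mode))
```

With 123.456 and 2 significant figures, this reproduces the four outputs given in the list: 120, 1.2e2, 120e0, and 0.12e3.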

I would strongly support this sort of behavior becoming part of the built-in string formatting specification. The current string formatting options give scientific users close to the full set of format specifications necessary but not quite, and the not quite ends up being pretty frustrating.

I do not vote in favor of the functionality for built-in format specification to append alphabetical characters to indicate the scale of a number. The number of applications is just too broad and too many people will be interested in slightly different functionality. It’s not clear how to decide what is “standard” or not. You might point to NIST standards or such, but is there precedent in Python built-in/stdlib to cite NIST as a reference for its behaviors? Just seems too application-specific for a language as broadly used as Python.

However, if the feature for engineering notation that I described above were added, it would become exceedingly easy to write a really good PyPI package that parses the Python-formatted float into something which has alphabetical character suffixes indicating the order of magnitude for the float and the type of unit. I’d expect such a package to try to cover unit formatting across a wide range of disciplines; it could also nicely handle the odd non-engineering notation units like cm.
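A rough sketch of what such a package's core step might look like, assuming its input is a string already in engineering notation such as '3.25e3'. The function `eng_to_prefix` and the `SI_BY_EXPONENT` table are hypothetical names for this example, and only a handful of prefixes are included.

```python
# Hypothetical mapping from a multiple-of-3 exponent to an SI prefix.
SI_BY_EXPONENT = {-12: 'p', -9: 'n', -6: 'µ', -3: 'm', 0: '',
                  3: 'k', 6: 'M', 9: 'G', 12: 'T'}


def eng_to_prefix(text, unit=''):
    """Turn an engineering-notation string into '<mantissa> <prefix><unit>'."""
    mantissa, _, exponent = text.partition('e')
    # A string without an 'e' part is treated as having exponent 0.
    suffix = SI_BY_EXPONENT[int(exponent or 0)] + unit
    return f'{mantissa} {suffix}' if suffix else mantissa


print(eng_to_prefix('3.25e3', 'W'))
print(eng_to_prefix('120e0', 'm'))
```

An exponent outside the table raises KeyError here; a real package would need a policy for that case, as well as for field-specific conventions like tonnes or centimetres raised earlier in the thread.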

Question: Should I make a new thread specifically with the ideas expressed at the top of this post? It seems this thread is focused on the appending of alphabetical characters, though the sig fig control and engineering notation is part of the idea in this thread.

jagerber · May 20, 2023, 7:15pm

A follow on: Why controlled sig figs + engineering notation as built in float formatting? If not then I’m forced to either cast the float to a custom “nicely formattable” float type or call a function every time I want to print a float nicely. I basically lose all the elegance of inline formatting of floats in f-strings which is frustrating because, as I said, the current format specification options are almost there.