Support hexadecimal floating-point literals

Since CPython 2.6 (see Let bin/oct/hex show floats · Issue #47258 · python/cpython · GitHub for some history) we have had the float.fromhex()/float.hex() methods. That’s fine, but I think we could offer a better and more powerful interface.
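For reference, the current interface looks like this (output from a standard CPython REPL):

>>> (0.1).hex()
'0x1.999999999999ap-4'
>>> float.fromhex('0x1.999999999999ap-4')
0.1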

Let’s introduce support for hexadecimal floats. That would include hexadecimal float literals (available in a number of Python competitors, e.g. Go, Ruby, Julia) and also a new presentation type for floats in “new style” string formatting (and maybe in old printf-style formatting too). The fromhex/hex methods could eventually be deprecated.

We could use the same form for hexadecimal literals as the current float.fromhex() (with a mandatory ‘0x’ prefix), or use a more IEEE 754-2008-compatible syntax in which the exponent part is required (see e.g. floats in Go).
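To illustrate the difference: float.fromhex() currently accepts input with or without an exponent part, while the stricter Go-style syntax would always require one:

>>> float.fromhex('0x1.8')    # exponent optional in the fromhex() syntax
1.5
>>> float.fromhex('0x1.8p0')  # Go-style literals would require the exponent
1.5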

The hexadecimal presentation type in str.format() could be ‘a’ (or ‘A’), in C-like style. This somewhat clashes with the ascii conversion option (see the current format string syntax), but it’s not a big deal: we have a similar clash for the ‘s’ presentation type. On the other hand, if we would like to extend support for hexadecimal floats to the C API, we should probably use a different letter (‘h’/‘H’?), as ‘A’ would conflict with an existing conversion specifier, e.g. in PyUnicode_FromFormat().
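For context, the existing ‘s’ pair already coexists without trouble — ‘!s’ as a conversion and ‘s’ as a presentation type:

>>> f"{42!s:>5s}"
'   42'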

Almost everything is already there, except for some (I think) simple changes on the lexical analyzer side. The new presentation type in str.format() will also require some work beyond what float.hex() offers, but why not reuse the %a format type support from the C stdlib here? As a side note, I would like to recall Guido’s comment from the issue referenced above:

Now C11 is required to build CPython. Maybe it’s time to revisit this suggestion?

From my quick experiments this seems to be possible. With this patch:

Comparing python:main...skirpichev:fromhex-hex-from-stdlib · python/cpython · GitHub
I get just one test failure in the CPython test suite (commented out). The other failures are related to different but equivalent representations of hexadecimal floats (trailing zeros). The new methods also seem slightly faster on my system.


It seems to me that some people like the idea, so here is a follow-up:

In the branch above, basic support for hexadecimal float literals and the new presentation type in str.format() has been added:

>>> -0x1p-1074
-5e-324
>>> 0x1.ffffp10
2047.984375
>>> f"{_:a}"
'0x1.ffffp+10'
>>> f"{-0.1:a}"
'-0x1.999999999999ap-4'
>>> f"{-0.1:.2a}"
'-0x1.9ap-4'
>>> f"{3.14159:+a}"
'+0x1.921f9f01b866ep+1'

Let me know if this is worth a PR. And, if so, does it require a PEP?


Just replying to say that I would use this quite regularly, as a replacement for module-level .fromhex() calls to encode specific float16 or float32 constants such as one might find in a paper.


For the language syntax change, I think we should have a PEP. There needs to be a case made that the benefits are big enough to warrant syntax changes, and there are some choices to make that it would be good to discuss and record. It’s worth noting that something like 0x1.bp-4 is already valid at the syntax level (even though it’s not particularly useful, since there’s no bp attribute on integers).

>>> import ast
>>> print(ast.dump(ast.parse("0x1.bp-4", mode="eval"), indent=4))
Expression(
    body=BinOp(
        left=Attribute(
            value=Constant(value=1),
            attr='bp',
            ctx=Load()),
        op=Sub(),
        right=Constant(value=4)))

If we start allowing . in 0x literals, then there’s also the question of whether something like 0x1.bit_length should remain legal syntax, or whether it should become illegal in the same way that 1.bit_length currently is. If it stays legal, what are the exact rules for determining when something like 0x1.abc is interpreted as a hex floating-point literal and when it’s interpreted as an attribute access on an integer?
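For reference, a quick way to see how the tokenizer splits such an expression today (the hexinteger token stops at the dot, so 0x1.abc is currently an attribute access):

import io
import tokenize

# Current behaviour: 0x1.abc tokenizes as NUMBER '0x1', OP '.', NAME 'abc'
# (plus NEWLINE/ENDMARKER), i.e. attribute access on the integer 1.
for tok in tokenize.generate_tokens(io.StringIO("0x1.abc").readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))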

For the formatting addition, I think a careful and complete description of the proposed new functionality in a GitHub issue would be enough, though again there are many details to be determined. BTW, why not presentation type x, with semantics similar to those of x for int? So the 0x prefix would be omitted unless using #x.

E.g., we already have:

>>> format(123, 'x')
'7b'
>>> format(123, '#x')
'0x7b'

and I’d propose something along the lines of

>>> format(123.4, 'x')
'1.ed9999999999ap+6'
>>> format(123.4, '#x')
'0x1.ed9999999999ap+6'
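Not part of any patch, but as a rough illustration of that proposal, the behaviour can be approximated today with float.hex() (the helper name hex_format below is hypothetical):

def hex_format(x, alt=False):
    """Rough sketch of the proposed 'x' / '#x' float presentation types,
    emulated with float.hex(); not the actual implementation."""
    s = float(x).hex()                  # e.g. '0x1.ed9999999999ap+6'
    if not alt:
        s = s.replace("0x", "", 1)      # drop the prefix unless '#' was given
    return s

print(hex_format(123.4))             # 1.ed9999999999ap+6
print(hex_format(123.4, alt=True))   # 0x1.ed9999999999ap+6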

Ok, I’ll work on this.

But if it’s going to be a hard path anyway, maybe we should also include binary floating-point literals (0b10101.1110101p+123, as in e.g. MPFR)?

Currently the lexer has a simple rule: “Where ambiguity exists, a token comprises the longest possible string that forms a legal token, when read from left to right.” So in this example, the hexinteger literal syntax (0x1) would clash with the 0x1.b syntax for hexadecimal floats only if the exponent part is optional (as in the patch above). If the exponent part is required (and the fractional part can’t end with “.”), the old syntax remains legal.

In that case, the only thing affected would be the fate of integer attributes with names like [0-9a-f]+p. Can anyone invent something useful with this pattern?

That’s an interesting option, which could be extended to a ‘b’ type as well, to old-style string formatting, and to C-API functions like the PyUnicode_FromFormat() mentioned above. The downside is that this breaks compatibility with the C stdlib’s printf(). On the other hand, other languages do this (e.g. Go’s fmt package), as do some C libraries (MPFR uses the ‘b’ type to print floats in binary, but ‘a’ for hex output).

For the floating-point presentation types, # currently means “don’t remove the decimal point (and don’t strip trailing zeros for ‘g’)”. Maybe we should always print the 0x prefix and keep the g-like meaning? I don’t see strong reasons to avoid 0x: e.g. we don’t have an int-like base keyword argument for the float() constructor.
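For reference, the current meaning of # for float presentation types:

>>> format(1.0, '.0f'), format(1.0, '#.0f')
('1', '1.')
>>> format(1.5, '.5g'), format(1.5, '#.5g')
('1.5', '1.5000')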

But if we require the exponent part for hexadecimal floating-point literals, as suggested above, I think the # flag could instead be used to control the 0x prefix.

PS: Sorry for the edits.

They aren’t special, but this runs into the parsing ambiguity that will make this proposal tricky. (1).bit_length() works fine; 1.bit_length() looks like a malformed float. Perhaps the parser can be smart enough to distinguish them?

I think it should be possible, but I’m not sure it’s worth the complication. Right now the lexical analyzer has this simple rule: “Where ambiguity exists, a token comprises the longest possible string that forms a legal token, when read from left to right.” So we get 1. in this example:

$ echo  '1.a'|python3 -m tokenize
1,0-1,2:            NUMBER         '1.'           
1,2-1,3:            NAME           'a'            
1,3-1,4:            NEWLINE        '\n'           
2,0-2,0:            ENDMARKER      ''

If we add hexadecimal floating-point literals, these will be preferred over hexinteger literals in cases like 0x1a.z.

How often do you use float.fromhex() with a literal string argument in your code?

find -name '*.py' -exec egrep '\bfloat\.fromhex\(["'\''][^"'\'']+["'\'']\)' '{}' +

Hmm, not sure if that is a good metric here.

On the other hand, the float.fromhex() format is the simplest way to represent floats exactly. Perhaps that’s the reason it is used extensively for tests in CPython itself:

$ grep '0x[0-9a-f.]\+p' Lib/test/test_math.py \
                        Lib/test/test_random.py | wc -l
79

Or see practically every test in Lib/test/test_strtod.py.

Remember you’ll need a sponsor or core dev co-author:

Any volunteers?

Thanks, I know.

BTW, here is a draft PEP for potential sponsors/co-authors:

The linked implementation has a stricter format for hexadecimal literals, with a mandatory exponent (while float() will accept the same syntax as float.fromhex()). That should address the backward-compatibility concerns:

>>> 0x1.bit_length()
1

I also did some cleanup to reduce the diff (esp. for changes in floatobject.c).

The proposal for the str.format() changes, per @mdickinson’s suggestion, has been moved to an issue: Support formatting floats in hexadecimal (and binary?) notation · Issue #113804 · python/cpython · GitHub
