Converting between floating-point numbers and strings is apparently a surprisingly common bottleneck (for example, in JSON parsing/serialization and similar areas).
CPython seems to use dtoa for this, but a newer algorithm, Ryū, is apparently much faster.
Microsoft’s C++ standard library developers recently reported adopting it, saying “the speedups are indeed massive due to algorithmic improvements of Ryu.”
Would this be something CPython should adopt as well?
Do you have any benchmarks that show a bottleneck in dtoa.c? There’s lots of other overhead in Python. Before we start optimizing we should understand where the time is currently spent.
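For instance, one quick way to isolate the float-to-string cost itself would be a micro-benchmark like this (a sketch using only the standard library; values and sizes are illustrative):

import timeit

xs = [3.141592653589793] * 1000
# Time only the float -> str conversions, nothing else.
print(timeit.timeit(lambda: [repr(x) for x in xs], number=1000))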
Ulf Adams claims in his paper: “We present Ryū, a new routine to convert binary floating-point numbers to their decimal representations using only fixed-size integer operations, and prove its correctness. Ryū is simpler and approximately three times faster than the previously fastest implementation.” https://doi.org/10.1145/3192366.3192369
It performs better than dtoa on some inputs but the output is less human readable. For example dtoa and serde_json prefer to print 0.3 as 0.3, while Ryū currently prints it as 3E-1.
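For reference, CPython’s current behavior is easy to check (the ‘3E-1’ is what the Ryū reference d2s() prints for the same value, per the example above):

print(repr(0.3))   # CPython (via dtoa.c) prints '0.3'
# The Ryū reference d2s() currently prints this value as '3E-1'.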
I’d definitely like to hear @mdickinson’s opinion on this topic.
An advantage of Ryū besides speed is simplicity. dtoa.c is a hairy piece of code, and it’d be nice to have something simpler and more modern instead. On the other hand, dtoa.c is very battle-tested at this point.
I’d be happy to see the change, provided it’s well tested. I’ve never been much of a fan of the complexity of dtoa.c. But dtoa.c won’t go away entirely: IIUC, Ryū is only for float-to-string conversions, right? And for performance, I’d consider that the less important direction - e.g., reading text files of numeric data happens more often than writing such files, and anyone with serious amounts of numeric data should be writing in a binary format rather than converting to string anyway.
One unusual aspect of dtoa.c is that it supports fixed-point formatting (i.e., %f-style formatting) with a negative precision; that facility is used by Python for doing correct rounding with a negative second argument. It’s not clear to me at a glance whether Ryū supports this. Can anyone confirm or deny?
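For context, that facility surfaces in Python as round() with a negative second argument, e.g.:

# Rounding a float to a negative number of decimal places (tens, hundreds, ...):
print(round(123456.789, -2))   # 123500.0
print(round(123456.789, -4))   # 120000.0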
Finally, I’d caution that replacing dtoa.c won’t be a small amount of work.
From the frepr documentation: “Output can be slightly different from the native function, due to floating-point rounding.”
The shortest-string algorithm requirements are well-defined, so there’s exactly one “correct” answer for that algorithm (and absent bugs, that’s what repr produces). If frepr is giving different results, that doesn’t give me confidence that it’s a good replacement. At a minimum, I’d want to know more about what these different results are and under what circumstances they can occur.
My understanding is that the underlying library (Google’s double-conversion) always rounds away from 0: “The buffer will choose the representation that is closest to ‘v’. If there are two at the same distance, then the one farther away from 0 is chosen (halfway cases - ending with 5 - are rounded up).”
I don’t know whether that is a concern for us or not. In any case, that caveat is about frepr; the Ryū library seems faster still.
Hello @all,
Being new here, please be tolerant if I’m not yet well adapted to the local manners.
While working on a similar question for Gnumeric, I found a link to this thread in the Ryū GitHub repo. If this topic is still of interest, please check whether the info/code provided in the discussion about output formatting there is of any help to you; it enables different output formats in Ryū. Tests, corrections, feedback, and improvements are welcome.
IMHO ‘d2s’ is nicely fast, ‘generic_128’ is not; I urgently need a fast solution for 80-bit long doubles.
b.
P.S. @mdickinson: you wrote > The shortest-string algorithm requirements are well-defined, so there’s exactly one “correct” answer for that algorithm (and absent bugs, that’s what repr produces). - I’m new to these things and unsure about some details; can you help me with a link to the appropriate documentation? TIA
It’s not 100% clear to me what sort of docs you’re looking for here.
For Python, these are the requirements:
- define the “length” of a positive decimal value d as the smallest positive integer f such that d can be written in the form d = m * 10**e, with m and e integers and m < 10**f
- for a given positive float x, the value of repr(x) must be the shortest (in the sense of the “length” just defined) decimal value that rounds back to x under the round-ties-to-even rounding mode
- if there’s more than one such shortest value, the closest to x must be chosen
- if there are two equally close values, the one with the even last digit must be chosen
These requirements uniquely determine the value of repr(x), though they don’t uniquely determine the form: e.g., they leave open whether to express the repr of 10**16 as 10000000000000000.0 or as 10000000000000000. or as 1e+16.
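For what it’s worth, this is how CPython currently resolves that formatting choice:

print(repr(1e15))   # '1000000000000000.0': positional form below the threshold
print(repr(1e16))   # '1e+16': repr switches to exponent form at 1e16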
Note that there are some subtleties here: it is not true that the shortest string defined above is always the same as the result of rounding x to a suitable number of decimal digits. For an example of a weird corner case, consider x = 1 / 2**24. This has exact value 0.000000059604644775390625. If we round x to 16 significant digits we get 5.960464477539062e-08, which does not round back to x. If we round x to 17 significant digits we get 5.9604644775390625e-08, which does round back to x. So you might think that the latter gives the shortest string representation in this case. However, the shortest string representation is 5.960464477539063e-08.
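The corner case is easy to reproduce (output assumes IEEE 754 doubles):

from decimal import Decimal

x = 1 / 2**24
print(Decimal(x))                     # exact value: 5.9604644775390625E-8
s16 = format(x, '.15e')               # correctly rounded to 16 significant digits
print(s16, float(s16) == x)           # 5.960464477539062e-08 False (no round-trip)
print(repr(x), float(repr(x)) == x)   # 5.960464477539063e-08 True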
I thought there might be some ‘official standard’; your explanation is fully sufficient.
Let me try to express it less scientifically: ‘the shortest string of decimal digits which, scaled by an integral power of ten (possibly zero or negative), reproduces the original float when converted to IEEE format’ - correct?
Is that ‘nearest, ties to even’ on the decimal side? IMHO it’s already problematic for binaries.
Thank you; that pinpoints very well where I, and others, often think too simply.
That’s the problem I meant: ‘ties to even’ on the decimal side improves the problem rather than the solution. Not having the human/decimal-friendly ‘nearest, ties away’ as an alternative was a weakness of IEEE 754 from the beginning, and the committee has been stepping back from that: 754-2008 added a ‘ties away’ rounding attribute and 754-2019 added the ‘augmented’ operations, but that’s another story.
‘Shortest round-tripping’ seems to me a good step toward better syncing between decimal and binary math, but ‘ties to even’ on the binary side is still a stumbling block. For this sample, ‘ties away’ in both decimal and binary would match; it’s not clear to me at the moment whether that applies to all such cases.
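To make the binary-side ‘ties to even’ concrete: a decimal value sitting exactly halfway between two doubles converts to the one with the even significand. A quick check (assumes Python 3.9+ for math.nextafter):

import math
from fractions import Fraction

lo, hi = 1.0, math.nextafter(1.0, 2.0)   # two adjacent doubles
mid = (Fraction(lo) + Fraction(hi)) / 2  # their exact halfway point, as a rational

# The tie rounds to the even significand, which here is 1.0:
print(float(mid) == lo)   # True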
Just a comment, since this thread popped up again: this field has continued to advance quite quickly in recent years, and other competitive methods have appeared, some of which outperform Ryū. To my knowledge, one of the best currently is Dragonbox.
Reviving an old thread, but I recently stumbled on the relatively poor performance of Python’s float-to-string conversion. It happened because our app was enjoying the high speed of orjson’s serialization (thanks to Ryū, by way of Rust), and we then found that, on PyPy, there is nothing available that comes anywhere near that performance (since PyPy uses CPython’s dtoa, orjson doesn’t support PyPy, and it would suffer the huge C-extension tax anyway).
It’s an indirect observation by way of JSON serialization, but you can see that orjson.dumps() of a large float array is 15x faster than json.dumps():
$ python -m pyperf timeit --fast \
-s 'from json import dumps; x = [3.141592653589793] * 100' \
'dumps(x)'
Mean +- std dev: 78.6 us +- 1.0 us
$ python -m pyperf timeit --fast \
-s 'from orjson import dumps; x = [3.141592653589793] * 100' \
'dumps(x)'
Mean +- std dev: 4.64 us +- 0.03 us
Just to confirm that it’s really due to float-to-string, rather than some unfortunate overhead of json, here is the nearly identical result from CPython str():
$ python -m pyperf timeit --fast \
-s 'x = [3.141592653589793] * 100' \
'str(x)'
Mean +- std dev: 76.3 us +- 1.0 us
If we’re talking about a 15x speedup, and given how pervasive float-to-string operations are (serialization for networking and storage, logging, …), I wouldn’t be surprised if the typical app got a several-percent boost by replacing dtoa in CPython. I know it would be a lot of work; perhaps this should be on the Faster CPython radar?
What times do you get with the integer 31415926535897932 or the string “314159265358979” instead? Just wondering whether the slowness comes from float or from json. Seeing all six times (three types with both libraries) might be useful.
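Something like the following would cover all six cases (a sketch; it assumes pyperf and orjson are installed):

import json
import orjson
import pyperf

runner = pyperf.Runner()
for name, value in [('float', 3.141592653589793),
                    ('int', 31415926535897932),
                    ('str', '314159265358979')]:
    x = [value] * 100
    # Benchmark both serializers on the same 100-element payload.
    runner.bench_func('json-' + name, json.dumps, x)
    runner.bench_func('orjson-' + name, orjson.dumps, x)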
I would be very surprised. Float-to-string operations are typically not a bottleneck for most Python applications. The only reasonable use case that comes to my mind is writing out data in CSV format.
(and, of course, CSV itself is inefficient compared to storage formats like Parquet)
Maybe float-to-string is affected by the need to return the shortest string that rounds back to the original value, assuming IEEE doubles.