Converting between floating-point numbers and strings is apparently a surprisingly common bottleneck (for example, in JSON parsing/serialization and similar areas).
CPython seems to use dtoa for this, but a newer algorithm, Ryū, is apparently much faster.
Microsoft’s C++ standard library developers recently reported adopting it, saying “the speedups are indeed massive due to algorithmic improvements of Ryu.”
Would this be something CPython should adopt as well?
Do you have any benchmarks that show a bottleneck in dtoa.c? There’s lots of other overhead in Python. Before we start optimizing we should understand where the time is currently spent.
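For instance, one quick way to isolate the float-to-string cost itself would be a micro-benchmark like this (a sketch using only the standard library; values and sizes are illustrative):

import timeit

xs = [3.141592653589793] * 1000
# Time only the float -> str conversions, nothing else.
print(timeit.timeit(lambda: [repr(x) for x in xs], number=1000))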
Ulf Adams claims in his paper: “We present Ryū, a new routine to convert binary floating-point numbers to their decimal representations using only fixed-size integer operations, and prove its correctness. Ryū is simpler and approximately three times faster than the previously fastest implementation.” https://doi.org/10.1145/3192366.3192369
It performs better than dtoa on some inputs but the output is less human readable. For example dtoa and serde_json prefer to print 0.3 as 0.3, while Ryū currently prints it as 3E-1.
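For reference, CPython’s current behavior is easy to check (the ‘3E-1’ is what the Ryū reference d2s() prints for the same value, per the example above):

print(repr(0.3))   # CPython (via dtoa.c) prints '0.3'
# The Ryū reference d2s() currently prints this value as '3E-1'.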
I’d definitely like to hear @mdickinson’s opinion on this topic.
An advantage of Ryū besides speed is simplicity. dtoa.c is a hairy piece of code, and it’d be nice to have something simpler and more modern instead. On the other hand, dtoa.c is very battle-tested at this point.
I’d be happy to see the change, provided it’s well tested. I’ve never been much of a fan of the complexity of dtoa.c. But dtoa.c won’t go away entirely: IIUC, Ryū is only for float-to-string conversions, right? And for performance, I’d consider that the less important direction - e.g., reading text files of numeric data happens more often than writing such files, and anyone with serious amounts of numeric data should be writing in a binary format rather than converting to string anyway.
One unusual aspect of dtoa.c is that it supports fixed-point formatting (i.e., %f-style formatting) with a negative precision; that facility is used by Python for doing correct rounding with a negative second argument. It’s not clear to me at a glance whether Ryū supports this. Can anyone confirm or deny?
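For context, that facility surfaces in Python as round() with a negative second argument, e.g.:

# Rounding a float to a negative number of decimal places (tens, hundreds, ...):
print(round(123456.789, -2))   # 123500.0
print(round(123456.789, -4))   # 120000.0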
Finally, I’d caution that replacing dtoa.c won’t be a small amount of work.
From the frepr documentation: “Output can be slightly different from the native function, due to floating-point rounding.”
The shortest-string algorithm requirements are well-defined, so there’s exactly one “correct” answer for that algorithm (and absent bugs, that’s what repr produces). If frepr is giving different results, that doesn’t give me confidence that it’s a good replacement. At a minimum, I’d want to know more about what these different results are and under what circumstances they can occur.
My understanding is that the underlying library (Google’s double-conversion) always rounds away from 0: “The buffer will choose the representation that is closest to ‘v’. If there are two at the same distance, then the one farther away from 0 is chosen (halfway cases - ending with 5 - are rounded up).”
I don’t know whether that is a concern for us or not. In any case, that caveat is about frepr; the Ryū library seems faster still.
Hello @all,
Being new here, please be tolerant if I’m not yet well adapted to the local manners.
While working on a similar question for Gnumeric, I found a link to this thread in the Ryū GitHub repo. If this topic is still of interest, please check whether the info/code provided in the discussion about output formatting there is of any help to you; it enables different output formats in Ryū. Tests, corrections, feedback, and improvements are welcome.
IMHO ‘d2s’ is nicely fast, ‘generic_128’ is not; I urgently need a fast solution for 80-bit long doubles.
b.
P.S. @mdickinson: you wrote > The shortest-string algorithm requirements are well-defined, so there’s exactly one “correct” answer for that algorithm (and absent bugs, that’s what repr produces). - I’m new to these things and unsure about some details; can you help me with a link to the appropriate documentation? TIA
It’s not 100% clear to me what sort of docs you’re looking for here.
For Python, these are the requirements:
- define the “length” of a positive decimal value d as the smallest positive integer f such that d can be written in the form d = m * 10**e, with m and e integers and m < 10**f
- for a given positive float x, the value of repr(x) must be the shortest (in the sense of the “length” just defined) decimal value that rounds back to x under the round-ties-to-even rounding mode
- if there’s more than one such shortest value, the closest to x must be chosen
- if there are two equally close values, the one with the even last digit must be chosen
These requirements uniquely determine the value of repr(x), though they don’t uniquely determine the form: e.g., they leave open whether to express the repr of 10**16 as 10000000000000000.0 or as 10000000000000000. or as 1e+16.
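For what it’s worth, this is how CPython currently resolves that formatting choice:

print(repr(1e15))   # '1000000000000000.0': positional form below the threshold
print(repr(1e16))   # '1e+16': repr switches to exponent form at 1e16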
Note that there are some subtleties here: it is not true that the shortest string defined above is always the same as the result of rounding x to a suitable number of decimal digits. For an example of a weird corner case, consider x = 1 / 2**24. This has exact value 0.000000059604644775390625. If we round x to 16 significant digits we get 5.960464477539062e-08, which does not round back to x. If we round x to 17 significant digits we get 5.9604644775390625e-08, which does round back to x. So you might think that the latter gives the shortest string representation in this case. However, the shortest string representation is 5.960464477539063e-08.
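The corner case is easy to reproduce (output assumes IEEE 754 doubles):

from decimal import Decimal

x = 1 / 2**24
print(Decimal(x))                     # exact value: 5.9604644775390625E-8
s16 = format(x, '.15e')               # correctly rounded to 16 significant digits
print(s16, float(s16) == x)           # 5.960464477539062e-08 False (no round-trip)
print(repr(x), float(repr(x)) == x)   # 5.960464477539063e-08 True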
I thought there might be some ‘official standard’; your explanation is fully sufficient.
Let me try to express it less scientifically: ‘the shortest string of decimal digits which, scaled by an integral power of ten (possibly zero or negative), reproduces the original float when converted to IEEE format’ - correct?
Is that ‘nearest, ties to even’ on the decimal side? IMHO it’s already problematic for binaries.
Thank you; that pinpoints very well where I, and others, often think too simply.
That’s the problem I meant: ‘ties to even’ on the decimal side improves the problem rather than the solution. Not having the human/decimal-friendly ‘nearest, ties away’ as an alternative was a weakness of IEEE 754 from the beginning, and the committee has been stepping back from that: 754-2008 added a ‘ties away’ rounding attribute and 754-2019 added the ‘augmented’ operations, but that’s another story.
‘Shortest round-tripping’ seems to me a good step toward better syncing between decimal and binary math, but ‘ties to even’ on the binary side is still a stumbling block. For this sample, ‘ties away’ in both decimal and binary would match; it’s not clear to me at the moment whether that applies to all such cases.
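To make the binary-side ‘ties to even’ concrete: a decimal value sitting exactly halfway between two doubles converts to the one with the even significand. A quick check (assumes Python 3.9+ for math.nextafter):

import math
from fractions import Fraction

lo, hi = 1.0, math.nextafter(1.0, 2.0)   # two adjacent doubles
mid = (Fraction(lo) + Fraction(hi)) / 2  # their exact halfway point, as a rational

# The tie rounds to the even significand, which here is 1.0:
print(float(mid) == lo)   # True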
Just a comment, since this thread popped up again: this field has continued to advance quite quickly in recent years, and other competitive methods have appeared, some of which outperform Ryū. To my knowledge, one of the best currently is Dragonbox.
Reviving an old thread, but I recently stumbled on the relatively poor performance of Python’s float-to-string conversion. It happened because our app was enjoying the high speed of orjson’s serialization (thanks to Ryū, by way of Rust), and we then found that, on PyPy, there is nothing available that comes anywhere near that performance (since PyPy uses CPython’s dtoa, orjson doesn’t support PyPy, and it would suffer the huge C-extension tax anyway).
It’s an indirect observation by way of JSON serialization, but you can see that orjson.dumps() of a large float array is 15x faster than json.dumps():
$ python -m pyperf timeit --fast \
-s 'from json import dumps; x = [3.141592653589793] * 100' \
'dumps(x)'
Mean +- std dev: 78.6 us +- 1.0 us
$ python -m pyperf timeit --fast \
-s 'from orjson import dumps; x = [3.141592653589793] * 100' \
'dumps(x)'
Mean +- std dev: 4.64 us +- 0.03 us
Just to confirm that it’s really due to float-to-string, rather than some unfortunate overhead of json, here is the nearly identical result from CPython str():
$ python -m pyperf timeit --fast \
-s 'x = [3.141592653589793] * 100' \
'str(x)'
Mean +- std dev: 76.3 us +- 1.0 us
If we’re talking about a 15x speedup, and given how pervasive float-to-string operations are (serialization for networking and storage, logging, …), I wouldn’t be surprised if the typical app got a several-percent boost by replacing dtoa in CPython. I know it would be a lot of work; perhaps this should be on the Faster CPython radar?
What times do you get with the integer 31415926535897932 or the string “314159265358979” instead? Just wondering whether the slowness comes from float or from json. Seeing all six times (three types with both libraries) might be useful.
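Something like the following would cover all six cases (a sketch; it assumes pyperf and orjson are installed):

import json
import orjson
import pyperf

runner = pyperf.Runner()
for name, value in [('float', 3.141592653589793),
                    ('int', 31415926535897932),
                    ('str', '314159265358979')]:
    x = [value] * 100
    # Benchmark both serializers on the same 100-element payload.
    runner.bench_func('json-' + name, json.dumps, x)
    runner.bench_func('orjson-' + name, orjson.dumps, x)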
I would be very surprised. Float-to-string operations are typically not a bottleneck for most Python applications. The only reasonable use case that comes to my mind is writing out data in CSV format.
(and, of course, CSV itself is inefficient compared to storage formats like Parquet)
Maybe float-to-string is affected by the need to return the shortest string that rounds back to the original value, assuming IEEE doubles.