For reasons* I need to get from a json.dump to BytesIO.
My first thought was:
out = io.BytesIO(json.dumps(data).encode('utf-8'))
But it occurred to me that json.dump accepts a SupportsWrite[str] (i.e. a text-mode file-like object such as StringIO), and this SO answer suggests using codecs.getwriter to put a str-accepting wrapper around a BytesIO, e.g.:
out = io.BytesIO()
writer = codecs.getwriter('utf-8')(out)
json.dump(data, writer)
It works, but to my surprise it is much slower than the encode version, at least according to my test** (results in gist).
Could someone sanity check whether that test is ‘realistic’, and offer ideas as to why the latter implementation is so much slower?
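For reference, a minimal version of the comparison (with a stand-in payload, not the exact test from the gist) looks roughly like:

import codecs
import io
import json
import timeit

data = {"numbers": list(range(100_000))}  # stand-in payload

def with_encode():
    # One big str, one encode() call, one BytesIO construction.
    return io.BytesIO(json.dumps(data).encode("utf-8"))

def with_getwriter():
    # json.dump pushes many small str fragments through the codec writer.
    out = io.BytesIO()
    json.dump(data, codecs.getwriter("utf-8")(out))
    return out

print("encode:   ", timeit.timeit(with_encode, number=20))
print("getwriter:", timeit.timeit(with_getwriter, number=20))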
* by using the ‘file-like’ BytesIO I presume I get the benefit of wsgi.file_wrapper, per:
My guess is that it’s the number of function calls that is making the difference.
With encode() there is a single call that converts the JSON-encoded data from str to bytes.
With the getwriter version, the codec is invoked repeatedly, once for each piece of encoded data.
You should be able to confirm that by using a wrapper around the writer that counts the number of calls made, as in the sketch below.
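A minimal counting wrapper (the CountingWriter name is made up) could look like:

import codecs
import io
import json

class CountingWriter:
    # Counts how many times json.dump calls write() on the wrapped writer.
    def __init__(self, writer):
        self.writer = writer
        self.calls = 0

    def write(self, s):
        self.calls += 1
        return self.writer.write(s)

data = {"numbers": list(range(1_000))}
out = io.BytesIO()
writer = CountingWriter(codecs.getwriter("utf-8")(out))
json.dump(data, writer)
print(writer.calls)  # expect one call per encoded fragment, i.e. thousands here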
I’m not so sure about this premise. Typically, the point of using a file-like object is to avoid loading data all at once. But in this case, all of the data is already in a Python object. Without knowing anything about the library you’re using, I would guess that using BytesIO in this way is going to be strictly slower, since 1) you’re creating extra copies of the data, and 2) you’re forcing the library to make many function calls to access the data chunk by chunk instead of letting it use the already existing str/bytes object.
Along those lines: it’s just trading speed for memory. The first version converts the whole thing in one go; the second writes it out a bit at a time. For large data this can make a difference in timing, but for very large data the streaming approach becomes necessary.
UTF-8 is an incredibly common encoding, and str.encode() has a fast path for it. I guess codecs.getwriter() doesn’t, which might change if anyone cares enough, but since most people use str.encode(), that’s the one worth optimizing the most.
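Related, untested thought: if you want a text interface over a BytesIO anyway, io.TextIOWrapper is the buffered, C-implemented alternative to codecs.getwriter and would be worth timing, e.g.:

import io
import json

data = {"numbers": list(range(1_000))}  # stand-in payload
out = io.BytesIO()
wrapper = io.TextIOWrapper(out, encoding="utf-8")  # buffers writes before encoding
json.dump(data, wrapper)
wrapper.flush()
wrapper.detach()  # release out so the wrapper's cleanup doesn't close it
print(out.getvalue()[:40])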
So initially I was using dumps (to str) and setting Falcon’s Response.text. When I did that and made a request with a large response body (several MB), it would block other requests.
I switched to Response.set_stream with a BytesIO, and that blocking behavior went away.
I assume it is GIL related, as hinted here; so perhaps the streaming version would be slower (for the reasons you mention) were it not for the GIL.
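Roughly what the two versions look like in the responder (a simplified Falcon 3.x sketch; the resource and payload are made up):

import io
import json
import falcon

class ThingsResource:
    def on_get(self, req, resp):
        data = {"items": list(range(100_000))}  # made-up large payload
        resp.content_type = falcon.MEDIA_JSON

        # Original version: the whole body as a str on the response.
        # resp.text = json.dumps(data)

        # Streaming version: a file-like body, which lets the WSGI server
        # use wsgi.file_wrapper if it supports it.
        payload = json.dumps(data).encode("utf-8")
        resp.set_stream(io.BytesIO(payload), len(payload))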
Sounds like an issue in Falcon’s (the library?) code: it is not doing network I/O fairly.
I doubt the GIL is involved, as I think you are describing an I/O issue, not a CPU-bound problem.