Get single-byte bytes objects from a bytes object

Yeah, that’s fair. Though in the case of a bytestring, it’s much less clear, since a byte IS an octet IS a small integer, and yet, a string of bytes often really IS a meaningful text sequence. Practicality beats purity, after all, and while we could dig our heels in and say “No! Bytes and text are not the same! Waaaaaaahhhh”, it’s not going to change reality. An HTTP response will genuinely include text headers followed by binary data, without anyone batting an eyelid.


Out of interest, who calls it a mistake?

I would call it a mistake.

It makes it difficult to avoid this type error:

from collections.abc import Iterable

def foo(a: Iterable[str]):
    ...


foo("abc")  # accepted by type checkers: a str is itself an Iterable[str]

What we often want is a way to specify that this function takes a data structure that the user intended to be multiple separate strings - a list of strings, or a tuple of strings, or a dictionary with string keys, or a custom container that yields strings when iterated.

A single string is almost never what we want, but there’s no way for a type checker to see this error.

Because of this, I am glad that iterating through bytes doesn’t yield single-element bytes.
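
To make the contrast concrete: iterating a str yields length-1 strings, which are themselves iterable, while iterating a bytes object yields plain integers, so the recursion stops there:

>>> list("abc")
['a', 'b', 'c']
>>> list(b"abc")
[97, 98, 99]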


And many other people don’t. I was more asking whether anyone who actually makes decisions about Python - that is, PSF, core devs, or someone like that - had said that it’s a mistake.

It’s definitely a quirk. The fact that iterating over a string produces more iterables, all the way down, is definitely a bit of a surprise the first time you come across it (or the first several times, even). But it’s also an extremely useful one in many situations.

Whether it’s surprising is not relevant to my point. Even if it’s not surprising at all, and you know it’s coming, it’s still a problem. There’s no good solution to this.


Well, I guess a lot of us disagree with you then. It’s incredibly useful, not a problem, for strings to be iterable.

Why should a byte string be a sequence of ASCII characters?

For the same reason that you can put ASCII characters into a bytestring literal, for the same reason that bytestrings have percent formatting, and so on. A lot of byte streams DO include some readable, printable ASCII. Some are even entirely so:

>>> import random, base64
>>> base64.b64encode(random.randbytes(36))
b'+b4sMNkX3F01n28TYiJmHyG4aUQJV13iiB1+c2DWoS8uye9r'

Should b64encode return str or bytes? Either way, the value it returns is guaranteed to be entirely readable ASCII, so it makes very good sense to display it that way.

Don’t forget that this is just the object’s repr. Sometimes it’ll happen to be compact, sometimes it won’t. No big deal either way.

I understand, but I don’t really want to deal with 128 ASCII characters when working with bytes; an integer value from 0 to 255 makes more sense.

I always use print(list(b"byte string")) to print bytes because ASCII characters make no sense in a byte string.

I didn’t say it’s a problem for strings to be iterable. I definitely think it’s good and not a problem for strings to be iterable.
It’s looking like you didn’t read what I posted.

Also, multiple times you’ve said it’s useful, but you haven’t given any argument or example for why it’s useful for the yielded type to be a string, rather than some other type, like a character data type.

Because Python doesn’t have a character data type, and it would be a massive backward compatibility break to add one and have iterating over strings yield it.

It’s critical to remember that the onus is on someone proposing a new feature to demonstrate its benefit. There is no requirement for anyone to justify the current behaviour - the status quo (i.e., “do nothing”) always wins by default.

I’m not proposing any change, I’m pointing out that there is a problem.

I would be against trying to change this before Python 4, because I recognize the backwards compatibility problem.

But many of us do process bytes based on them containing text.
HTTP and similar protocols are a major use case for Python and for bytes.

In other use cases, bytes being a sequence of unsigned 0-255 values is perfect.
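
For example, computing a simple checksum or pulling a length field out of a binary header works naturally with the integer view (the header layout below is made up purely for illustration):

header = bytes([0x01, 0x00, 0x20, 0xFF])
checksum = sum(header) % 256                  # integers are exactly what we want here
length = int.from_bytes(header[1:3], "big")   # 0x0020 == 32
print(checksum, length)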

Yes, I’ve also built a custom HTTP server, but it works with ASCII characters only. It is irrelevant whether you use bytes or strings (Unicode) for the text part (not the binary content).

I admitted that it was surprising that iterating over strings produced more iterables. You now say you don’t see it as a problem. What, then? You want strings to be iterable, but single-character strings not to be? Or strings are iterable, but produce some sort of “character” type (not an integer)?

The most obvious interpretation of what you posted was exactly what I took. I did read what you posted. I interpreted it, especially the parts you didn’t actually write, in the most sensible and obvious way I could think of.

[quote=“Karl Knechtel, post:6, topic:41709, username:kknechtel”]
The real question is what subsequent problem you hope to solve by getting this result, and how common that need is.
[/quote]

There are plenty of areas where working directly with bytes as ASCII-encoded text makes sense, which is why PEP 461 was accepted.
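
As a reminder of what PEP 461 gives us, printf-style formatting works directly on bytes (Python 3.5+), so an ASCII-framed protocol message can be assembled without a round trip through str. A minimal sketch; the header values are just illustrative:

body = b'{"ok": true}'
# %d and %s work on bytes objects, and the result stays a bytes object.
response = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Length: %d\r\n"
            b"Content-Type: %s\r\n"
            b"\r\n"
            b"%s") % (len(body), b"application/json", body)
print(response)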

It makes a big difference in performance by avoiding the encode and decode. It is not irrelevant; if you ever try this at scale, you will find that out. My day job until recently was working on such code.
We did approximately 3,000,000,000 transactions a day in Python.

It looks like a PEP related to this topic has recently been revived. The discussion thread is here: PEP 467: Minor API improvements for binary sequences

It looks like, if implemented, it would add getbyte and iterbytes methods that directly address the pain points from the OP (plus some other helpful bytes/bytearray methods).
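
For anyone who hasn't read the PEP, the proposed methods would behave roughly like the plain-Python equivalents below. The method names come from the PEP draft; the bodies are my own approximation, not the reference implementation:

def getbyte(data, index):
    # Roughly what the proposed data.getbyte(index) would return:
    # the byte at *index* as a length-1 bytes object (IndexError if out of range).
    return bytes([data[index]])

def iterbytes(data):
    # Roughly what the proposed data.iterbytes() would yield:
    # each byte as a length-1 bytes object instead of an int.
    return (bytes([b]) for b in data)

print(getbyte(b"byte string", 0))   # b'b'
print(list(iterbytes(b"abc")))      # [b'a', b'b', b'c']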


This is how you should do it:
import string

[char.encode() for char in string.ascii_lowercase]  # [b'a', b'b', ..., b'z']

I would recommend that anyone use Python's built-in str to encode to bytes, regardless of which character set they prefer.
For example:

NUL = '\0'.encode()
A = 'A'.encode()
B = chr(66).encode()  # PEP 467: bytes.fromint(66)
CR = '\r'.encode()
face = '\U0001f604'.encode()

print(NUL, A, B, CR, face)
print(A.decode(), B.decode(), face.decode())

string = b'byte_string'
byte = chr(string[0]).encode()  # PEP 467: string.getbyte(0)
# note: chr(x).encode() only round-trips byte values 0-127; for 128-255 you
# would need .encode('latin-1') to get the original single byte back

I am not against PEP 467, but there is already a way to encode every character set.

If you are going to work with ASCII characters, you really need a prebuilt ASCII table. You don’t need to use encode from str or the getbyte() method from PEP 467 in your code.
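
If I understand that suggestion correctly, such a prebuilt table could be as simple as a list of all 256 length-1 bytes objects (the ASCII range plus the rest of the byte values), built once up front; the names here are just illustrative:

# Every length-1 bytes object, indexed by byte value.
BYTE_TABLE = [bytes([i]) for i in range(256)]

A = BYTE_TABLE[65]             # b'A'
CR = BYTE_TABLE[13]            # b'\r'

data = b"byte_string"
first = BYTE_TABLE[data[0]]    # b'b' -- same result as the proposed data.getbyte(0)
print(A, CR, first)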

That is 0.0000288 seconds per transaction. Knowing (from the benchmark below) that it takes roughly 700 nanoseconds to encode and decode 4096 random ASCII characters, and that sending 4096 bytes over a 1 Gbps link takes about 33 microseconds, the extra ~0.7 microseconds of encode/decode works out to about a 2.2% decrease in reqs/sec if you transfer at 1 Gbps.

Benchmark
import timeit
import random
import string

# Generate 4096 random ASCII letter bytes
random_bytes = bytes(
    random.choices(list(string.ascii_letters.encode()), k=4096))

# Measure time for decoding (bytes -> str)
number = 1000000
time_taken = timeit.timeit(stmt="random_bytes.decode('ascii')",
                           setup="from __main__ import random_bytes",
                           number=number)

decode_time = (time_taken / number) * 1000000000
print(f"Time taken to decode: {decode_time} nanoseconds")

# Measure time for encoding (str -> bytes)
number = 1000000
random_bytes_str = random_bytes.decode('ascii')
time_taken = timeit.timeit(stmt="random_bytes_str.encode('ascii')",
                           setup="from __main__ import random_bytes_str",
                           number=number)

encode_time = (time_taken / number) * 1000000000
print(f"Time taken to encode: {encode_time} nanoseconds")
print(f"Time taken to encode and decode: {encode_time + decode_time} nanoseconds")

That’s an average. But I would be EXTREMELY impressed if Barry’s doing thirty-odd thousand transactions per second on a single thread. That sounds, to me, like a job for parallelization - which would then allow each request to actually take more time than the 28 µs that this estimate would suggest!

Around 750 Xeon CPUs with lots of RAM and cores…
And yes, we cared about a 2% saving; that’s a lot of money in hardware.