Get single-byte bytes objects from a bytes object

A bytes object yields int values when iterated over:

>>> list(b'abc')
[97, 98, 99]

I find myself wanting something like this:

>>> b'abc'.to_list_of_single_bytes()
[b'a', b'b', b'c']

I thought .split() was the method I was looking for, but apparently it rejects empty separators, unlike, say, JavaScript’s .split():

>>> b'abc'.split(b'') 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: empty separator 

Of course, the result can also be achieved in a number of other ways:

>>> [bytes([byte]) for byte in b'abc']
[b'a', b'b', b'c']
>>> [byte.to_bytes(1, 'big') for byte in b'abc']
[b'a', b'b', b'c']
>>> import re
>>> re.findall(b'.', b'abc', flags = re.S)
[b'a', b'b', b'c']

…but all of them are somewhat verbose and do not look quite as “nice” as list(b'abc').

Is there a better way that I’m ignorant of? If not, would it be suitable to add such a method to the language?


How about

>>> list(map(int.to_bytes, b'abc'))
[b'a', b'b', b'c']

int.to_bytes doesn’t work without additional arguments:

In [1]: list(map(int.to_bytes, b'abc'))
TypeError                                 Traceback (most recent call last)
Cell In[1], line 1
----> 1 list(map(int.to_bytes, b'abc'))

TypeError: to_bytes() missing required argument 'length' (pos 1)

The cleanest way I can think of would be to slice:

In [5]: b = b'abc'

In [6]: [b[i:i+1] for i in range(len(b))]
Out[6]: [b'a', b'b', b'c']

In Python >= 3.11 those arguments have defaults, so it works.


The real question is what subsequent problem you hope to solve by getting this result, and how common that need is. I think it’s likely a sign that you’re doing something wrong. Conceptually, bytes do not represent text, which is why the behaviour changed in 3.x (list(b'abc') does exactly what you want already in 2.x, modulo the fact that the b prefix is the default and Unicode strings with a u prefix are special). Indexing a bytes now gives an integer, because conceptually a bytes is a sequence of individual bytes, and “individual bytes” are fundamentally integral values, if they can be said to have any meaning at all. (Of course, the choice to interpret them as unsigned is somewhat arbitrary.)

But no, you haven’t really missed a better way to get the result you want. There isn’t a built-in conversion, so you indeed will either need to exploit some higher level functionality (like the byte-regex you show), or apply a conversion to each element in the standard ways (like the list comprehensions you show; of course list(map(...)) can be adapted to the purpose, but that’s largely bikeshedding).
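For instance, one such adaptation of list(map(...)) might be (just a sketch; the helper name is arbitrary):

```python
# A tiny helper that wraps an int back into a length-1 bytes object;
# functionally the same as the list comprehensions shown earlier.
def single_byte(i):
    return bytes([i])

parts = list(map(single_byte, b'abc'))
print(parts)  # [b'a', b'b', b'c']
```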


While this is true, bytes often do contain ASCII-encoded text, which is why you can do things like this:

>>> b"\xFF\x04%04d\xFE" % 123
b'\xff\x040123\xfe'

It would make a lot of sense to have an easy way to “chunk” a bytestring up into units of whatever size you like, with single-byte chunks being what the OP asked for. There just doesn’t happen to be an easy way to do that in the core bytes object.


A bytes object is a sequence of ints. Indexing a sequence yields a single value; slicing a sequence yields another sequence. This is true of sequences in general, bytes isn’t special here except for its repr. (value[i:i+1] for i in range(len(value))) will get you 1-byte sequences of bytes, and the slice can be modified to return other lengths. Libraries such as more-itertools and boltons provide chunked functions: API Reference — more-itertools 10.1.0 documentation
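A generalized slice-based version might look like this (a sketch; `chunked` here is a hypothetical helper, named after the more-itertools function):

```python
def chunked(data, size):
    """Split `data` into consecutive slices of at most `size` items each."""
    return [data[i:i + size] for i in range(0, len(data), size)]

print(chunked(b'abc', 1))    # [b'a', b'b', b'c']
print(chunked(b'abcde', 2))  # [b'ab', b'cd', b'e']
```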


I have needed to do something like this in the past. My usecase was that I needed to send bytes over a serial connection one byte at a time, using pyserial’s serial.Serial which only accepts iterables as input, even when only sending a single byte.

I can’t recall why I needed to do that. Possibly for debugging purposes.

I’m writing a package that deals with character/string ranges (e.g., abc-azz with the “base” of only ASCII lowercase letters would yield abc, abd, …, abz, aca, …, azz when being iterated over).

Since bytes and str have enough similarities that even some parts of the stdlib support both in more or less the same way (e.g. os.PathLike or re.Pattern), I think support for bytes ranges would be a nice feature to have.

I know that bytes objects do not generally represent text, but my package is called character_range: The user should be fully aware that if bytes objects are passed as arguments to the package’s API they would be interpreted (nearly) the same as normal ASCII characters (so b'abc'-b'azz' would yield b'abc', b'abd', etc.) and not just integers in the range [0, 0xFF].

If the purpose is to build new bytes objects, then you don’t need to split into single-byte bytes anyway. The integers should actually be more convenient for the reassembly step after whatever logic you have for the range. For example:

>>> import itertools
>>> # Of course, this equally works with `b'abc'`, but the point here
>>> # is to show that we are explicitly working with integers.
>>> list(map(bytes, itertools.product(range(97, 100), repeat=3)))
[b'aaa', b'aab', b'aac', b'aba', b'abb', b'abc', b'aca', b'acb', b'acc', b'baa', b'bab', b'bac', b'bba', b'bbb', b'bbc', b'bca', b'bcb', b'bcc', b'caa', b'cab', b'cac', b'cba', b'cbb', b'cbc', b'cca', b'ccb', b'ccc']

If our input used a sequence of single-byte bytes, we’d have to concatenate them, which isn’t any easier:

>>> list(map(b''.join, itertools.product((b'a', b'b', b'c'), repeat=3)))
[b'aaa', b'aab', b'aac', b'aba', b'abb', b'abc', b'aca', b'acb', b'acc', b'baa', b'bab', b'bac', b'bba', b'bbb', b'bbc', b'bca', b'bcb', b'bcc', b'caa', b'cab', b'cac', b'cba', b'cbb', b'cbc', b'cca', b'ccb', b'ccc']

I would guess that working with the integers would be more performant, too, but of course one never knows without a benchmark.

I did implement my own base-n integers to do the counting part. The need arose when I needed to test the code, which requires checking that iterating over the range b'a'-b'z' yields the expected 26 bytes objects representing the letters in string.ascii_lowercase.

In other words, I’m too lazy to write out all 26, so I tried string.ascii_lowercase.encode('utf-8').split(b'') and the rabbit hole began.
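For the record, the expected 26 values can also be generated rather than written out, e.g. (one possible sketch):

```python
import string

# Rebuild the 26 expected single-byte objects from string.ascii_lowercase.
expected = [bytes([code]) for code in string.ascii_lowercase.encode('ascii')]
print(expected[:3], expected[-1])  # [b'a', b'b', b'c'] b'z'
```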


That’s what was thought at the time, but subsequently bytes objects
regained a lot of the functionality that strings had in 2.x, because
people found it useful. If we were designing it with the knowledge
we now have, we might have done things differently.

They find it useful because of other people’s hacks that inappropriately pretend the data represents text inherently. Unicode is now older than ASCII was when Unicode first appeared.

There are important use cases for bytes with text in them:
HTTP, for example, and other protocols.

That iterating over a string yields length-1 strings has been talked about as a mistake that Python 3 should not have carried over from Python 2.

But that opens up the whole code-point vs. grapheme-cluster debate.


Out of interest, who calls it a mistake? It’s certainly a quirk, but it’s also an extremely practical one, and I generally prefer subscripting of strings to yield strings rather than integers. Although it’s perhaps worth separating subscripting from iterating here; maybe it would have been better to change iteration but keep subscripting? But that would break other things.


One of the core devs as I recall - but I cannot figure out how to web search for a reference. I recall it being discussed but I cannot recall the context.
Could have been during python 3 design or shortly after 3.0 discussing a problem that this change would have solved.

FYI I’ve been reading python dev for >25 years somewhere in there I suspect I read this. Or maybe on this forum.

Putting that aside, one would need to establish a type for whatever was subsequently considered to be an element of a string. Even code points aren’t really integers; their mapping to text characters (!= graphemes) really is determined by the Unicode standard, which is global this time (and even includes a bunch of stuff that one wouldn’t ordinarily think of as textual, while still having plenty of expansion room).


I’m not sure what you mean by that. Quoting from the Unicode spec:

The Unicode Standard specifies a numeric value (code point) and a name for each of its

In other words, the code point IS the integer value, and it, along with the name (eg “LATIN SMALL LETTER Q”), are fundamental parts of a character’s definition.


Ah, good point, I got the terminology wrong.

The argument I want to make is that the string conceptually consists of characters, not the corresponding integral code points. Whether you consider the characters clustered into graphemes or separately, they’re a meaningful abstraction that shouldn’t be equated with the code point.

Another short one (I like the “trick” of using zip on a single iterable):

[*map(bytes, zip(b'abc'))]

I was curious about speed, so …

Benchmark results
bs = b'abc'
 0.7 μs  [bytes([byte]) for byte in bs]
 0.5 μs  [byte.to_bytes(1, 'big') for byte in bs]
 1.0 μs  re.findall(b'.', bs, flags = re.S)
 0.4 μs  list(map(int.to_bytes, bs))
 0.7 μs  [bs[i:i+1] for i in range(len(bs))]
 0.7 μs  [*map(bytes, zip(bs))]

bs = string.ascii_lowercase.encode('utf-8')
 4.7 μs  [bytes([byte]) for byte in bs]
 2.5 μs  [byte.to_bytes(1, 'big') for byte in bs]
 4.0 μs  re.findall(b'.', bs, flags = re.S)
 1.6 μs  list(map(int.to_bytes, bs))
 3.0 μs  [bs[i:i+1] for i in range(len(bs))]
 3.1 μs  [*map(bytes, zip(bs))]

bs = bytes(range(256))
40.8 μs  [bytes([byte]) for byte in bs]
20.1 μs  [byte.to_bytes(1, 'big') for byte in bs]
29.4 μs  re.findall(b'.', bs, flags = re.S)
12.7 μs  list(map(int.to_bytes, bs))
24.4 μs  [bs[i:i+1] for i in range(len(bs))]
25.5 μs  [*map(bytes, zip(bs))]

bs = bytes(range(256)) * 100
 4.1 ms  [bytes([byte]) for byte in bs]
 2.6 ms  [byte.to_bytes(1, 'big') for byte in bs]
 3.0 ms  re.findall(b'.', bs, flags = re.S)
 1.8 ms  list(map(int.to_bytes, bs))
 2.9 ms  [bs[i:i+1] for i in range(len(bs))]
 3.1 ms  [*map(bytes, zip(bs))]
Benchmark script
from timeit import repeat, default_timer as time

codes = '''
[bytes([byte]) for byte in bs]
[byte.to_bytes(1, 'big') for byte in bs]
re.findall(b'.', bs, flags = re.S)
list(map(int.to_bytes, bs))
[bs[i:i+1] for i in range(len(bs))]
[*map(bytes, zip(bs))]
'''.strip().splitlines()

def test(case, number, scale=6, unit='μs'):
    setup = f'''
import re, string
bs = {case}
'''
    t0 = time()
    for code in codes:
        t = min(repeat(code, setup, number=number, repeat=100)) / number
        print(f'{t*10**scale:4.1f} {unit} ', code)
    # print(time() - t0)

test("b'abc'", 10**3)
test("string.ascii_lowercase.encode('utf-8')", 10**3)
test("bytes(range(256))", 1)
test("bytes(range(256)) * 100", 1, 3, 'ms')

Attempt This Online!