Allow `bytes(mystring)` without specifying the encoding

Currently, calling bytes on a str object without specifying an encoding raises a TypeError:

>>> bytes("hello")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding

In contrast, calling the .encode method on a str object without specifying an encoding assumes UTF-8 by default:

>>> "hello".encode()
b'hello'

For consistency, I would suggest that calling bytes on a str object without an encoding also assumes UTF-8 by default, as in

>>> bytes("hello")
b'hello'

This is relevant for functions that call bytes to convert a variety of arguments to a bytes object. Currently we are doing

>>> try:
...     b = bytes(arg)
... except TypeError:  # string
...     b = bytes(arg, "UTF-8")

which is unnecessarily complicated.

This was done intentionally, to alert users coming from a Python 2 background (or converting a legacy code base from Python 2 to 3). In Python 2, bytes(u"abc") worked, but bytes(u"\xff") raised UnicodeEncodeError. This was a common pitfall, so we changed bytes() to always insist on an encoding (UTF-8 wasn’t as obvious a default then as it is now, and it would have masked important errors given the totally different approach to Unicode in Python 2).

Maybe it’s time to reconsider this, but before we go there, can you tell us what other types the argument can take on in your code?

Other types are bytes itself, as well as a number of classes (defined by us) that have a __bytes__ method.

I would write it as

if isinstance(arg, str):
    b = bytes(arg, "UTF-8")
else:
    b = bytes(arg)

if str is a common argument type.
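
To make the isinstance approach concrete, here is a minimal sketch wrapped in a helper. The names `to_bytes` and `Packet` are made up for illustration; `Packet` only stands in for "a class with a __bytes__ method" and is not taken from anyone's actual code.

```python
class Packet:
    """Hypothetical user-defined class that knows how to serialize itself."""

    def __init__(self, payload):
        self.payload = payload

    def __bytes__(self):
        return b"PKT:" + self.payload


def to_bytes(arg):
    """Convert str, bytes, or an object defining __bytes__ to bytes."""
    if isinstance(arg, str):
        # str needs an explicit encoding; UTF-8 matches the default of str.encode()
        return bytes(arg, "UTF-8")
    # bytes objects and objects with __bytes__ go through the one-argument form
    return bytes(arg)


results = [to_bytes("héllo"), to_bytes(b"raw"), to_bytes(Packet(b"data"))]
```

One possible advantage over the try/except version: a TypeError raised inside a __bytes__ method is not mistaken for the str case.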

bytes() also accepts a single argument if it is an integer or an iterable of integers. This can cause problems in your program if you do not handle these cases. It would be better if the bytes constructor were less overloaded and there were alternate constructors for the different types of arguments.
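
A short illustration of the integer and iterable cases, plus one way a conversion helper might guard against the integer case. `to_bytes_strict` is a hypothetical name, and rejecting ints outright is just one possible policy.

```python
zero_filled = bytes(3)      # an int means "this many zero bytes"
from_iterable = bytes([3])  # an iterable of ints means "these byte values"


def to_bytes_strict(arg):
    """Hypothetical wrapper that refuses a bare int instead of zero-filling."""
    if isinstance(arg, int):
        raise TypeError("refusing to build a zero-filled buffer from an int")
    return bytes(arg)
```

With this guard, passing 3 by accident fails loudly instead of silently producing b'\x00\x00\x00'.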

Note also that str(b) and str(b, 'utf-8') produce very different results if b is bytes or bytearray. It is such a large issue that there is a dedicated warning emitted in the former case (disabled by default), and much effort is spent avoiding these warnings when they are enabled. I am afraid that allowing bytes(s) will open a similar can of worms.
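
For anyone who has not seen that warning: it is BytesWarning, which CPython enables with the -b command-line option (-bb turns it into an error). The sketch below runs a child interpreter with -bb; it assumes a CPython build where that option is available.

```python
import subprocess
import sys

# Run a child interpreter with -bb so that str() on a bytes object
# raises BytesWarning as an error instead of passing silently.
proc = subprocess.run(
    [sys.executable, "-bb", "-c", "str(b'ABCD')"],
    capture_output=True,
    text=True,
)
warned = "BytesWarning" in proc.stderr
```

Without -b or -bb, the same call just returns the repr-style string.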

I think that bytes() accepting an integer is the odd case. For other arguments, the behavior of bytes is consistent with e.g. list and tuple, which are similarly overloaded.

For example, for list:

>>> list([1,2,3])
[1, 2, 3]
>>> list((1,2,3))
[1, 2, 3]
>>> list(range(3))
[0, 1, 2]
>>> list({1,2,3})
[1, 2, 3]

and

>>> list("ABC")
['A', 'B', 'C']

and for bytes:

>>> bytes([1,2,3])
b'\x01\x02\x03'
>>> bytes((1,2,3))
b'\x01\x02\x03'
>>> bytes(range(3))
b'\x00\x01\x02'
>>> bytes({1,2,3})
b'\x01\x02\x03'

but

>>> bytes("ABC")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding
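
One way to see why the str case differs from the other iterables: iterating a str yields one-character strings, not integers, so there is no byte value to use until some mapping from characters to bytes is chosen. A rough sketch, using code points as a naive mapping (this is an illustration only, not how bytes() works internally):

```python
ascii_text = "ABC"
# Mapping each character to its code point works while every value is below 256
from_code_points = bytes(ord(c) for c in ascii_text)

try:
    bytes(ord(c) for c in "€")  # U+20AC is 8364, which does not fit in one byte
    fits_in_bytes = True
except ValueError:
    fits_in_bytes = False

# An encoding such as UTF-8 defines the multi-byte representation instead
utf8_encoded = "€".encode("utf-8")
```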

Yes, that is true:

>>> b = b"ABCD"
>>> str(b)
"b'ABCD'"
>>> str(b, 'utf-8')
'ABCD'

I guess the issue is that str(b, 'utf-8') shows the information contained in b, while str(b) also shows how this information is stored (i.e. in a bytes object). However, repr(b) already does that, so perhaps str(b) could return 'ABCD' while repr(b) returns b'ABCD'. But I didn’t follow the discussion during the Python 2 to Python 3 transition, so I may be missing something, and anyway changing the behavior of str(b) would break lots of code. Maybe something for Python 4.

As bytes(s) currently just raises a TypeError, it seems harmless to let bytes(s) return bytes(s, 'utf8'). But I may be missing something. Do you have a particular problem in mind that allowing bytes(s) could cause?