`final` argument of codecs.utf_8_decode

jgirardet · April 29, 2019, 7:44am

Hi,
What is the purpose of the final argument of codecs.utf_8_decode?

It's not documented, and the result is not obvious:

In [18]: codecs.utf_8_decode(b"lkjlkjlijli", None, 90)                          
Out[18]: ('lkjlkjlijli', 11)

In [19]: codecs.utf_8_decode(b"lkjlkjlijli", None, 0)                           
Out[19]: ('lkjlkjlijli', 11)

thanks

malemburg · April 29, 2019, 9:25am

The final argument is needed for incremental codecs, e.g. for .decode():

https://docs.python.org/3.8/library/codecs.html#codecs.IncrementalDecoder.decode

Passing a true value for final tells the decoder that no more input will follow, so it must process the remaining data and flush all buffers.

Thanks,

mjpieters · April 29, 2019, 10:42am

The codecs.utf_8_decode() function is not documented, no, because it is an implementation detail, part of the codecs UTF-8 infrastructure for incremental decoding. It is used indirectly to underpin the IncrementalDecoder.decode() method registered for the UTF-8 codec. This is where the final argument is documented:

decode(object[, final])
Decodes object (taking the current state of the decoder into account) and returns the resulting decoded object. If this is the last call to decode() final must be true (the default is false). If final is true the decoder must decode the input completely and must flush all buffers. If this isn’t possible (e.g. because of incomplete byte sequences at the end of the input) it must initiate error handling just like in the stateless case (which might raise an exception).
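
As a minimal sketch of that contract through the public API (codecs.getincrementaldecoder() is documented; the variable names here are only for illustration), feeding one character in two pieces could look like this:

```python
import codecs

# Public way to obtain the incremental decoder class for a codec
decoder = codecs.getincrementaldecoder("utf-8")()

snowman = "\u2603".encode("utf-8")  # three bytes: e2 98 83

# Not the last call: the incomplete sequence is held back for now
first = decoder.decode(snowman[:2])

# Last call: final=True says that no more data will follow
second = decoder.decode(snowman[2:], final=True)

print(repr(first), repr(second))
```

The first call returns an empty string because the decoder is still waiting for the rest of the character; the second call completes it.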

The function is directly called as the implementation of the abstract _buffer_decode() method of the codecs.BufferedIncrementalDecoder() base class:

def _buffer_decode(self, input, errors, final):
    # Overwrite this method in subclasses: It must decode input
    # and return an (output, length consumed) tuple
    raise NotImplementedError

which documents exactly what you see returned: a tuple with the decoding result and the number of bytes consumed. This allows for partial decoding of possibly-incomplete data streaming into Python from some form of I/O.
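
To make the relationship concrete, here is a small sketch of a subclass that fills in _buffer_decode() by delegating to codecs.utf_8_decode(). The class name SketchUTF8Decoder is hypothetical and exists only for this illustration; it is not the class the standard library registers.

```python
import codecs

class SketchUTF8Decoder(codecs.BufferedIncrementalDecoder):
    """Hypothetical subclass, for illustration only."""

    def _buffer_decode(self, input, errors, final):
        # Delegate to the low-level helper; it returns (text, bytes_consumed)
        return codecs.utf_8_decode(input, errors, final)

sketch = SketchUTF8Decoder()  # errors defaults to 'strict'

head = sketch.decode(b"snow: \xe2\x98")  # trailing two bytes are incomplete
leftover = sketch.getstate()[0]          # bytes the base class kept back
tail = sketch.decode(b"\x83", final=True)

print(repr(head), leftover, repr(tail))
```

The base class does the buffering; the subclass only has to report how many bytes it managed to consume.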

If you pass it incomplete UTF-8 bytes, you’ll see the consumed length reflect this:

>>> codecs.utf_8_decode(b'U+2603 SNOWMAN: \xe2\x98', None, False)  # the final \x83 byte is missing
('U+2603 SNOWMAN: ', 16)

The b'\xe2\x98' byte sequence is not complete; the third byte, b'\x83', is missing. The next call to the IncrementalDecoder.decode() method should start with that byte, and the BufferedIncrementalDecoder implementation keeps the bytes not consumed by _buffer_decode() so they can be prepended to the next batch of data in a future call and decoded properly:

>>> codecs.utf_8_decode(b'\xe2\x98\x83\n', None, False)
('☃\n', 4)
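
Roughly, the bookkeeping described above amounts to the following loop. This is a hand-written sketch under the assumption that the data arrives as a list of byte chunks; decode_chunks is a hypothetical helper, not standard library code.

```python
import codecs

def decode_chunks(chunks, errors="strict"):
    """Decode an iterable of UTF-8 byte chunks, carrying over incomplete tails."""
    pending = b""
    pieces = []
    for chunk in chunks:
        data = pending + chunk
        # final=False: an incomplete sequence at the end is left unconsumed
        text, consumed = codecs.utf_8_decode(data, errors, False)
        pieces.append(text)
        pending = data[consumed:]  # keep the undecoded bytes for the next round
    # No more data: final=True turns any leftover bytes into an error
    text, _ = codecs.utf_8_decode(pending, errors, True)
    pieces.append(text)
    return "".join(pieces)

result = decode_chunks([b"U+2603 SNOWMAN: \xe2\x98", b"\x83\n"])
print(repr(result))
```

In practice you would use the incremental decoder from codecs.getincrementaldecoder("utf-8") rather than write this loop yourself.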

But if the 3rd argument, final, is True, then the codecs.utf_8_decode() function will have to treat missing bytes as an error, and raise an exception or put in a replacement as per the second argument (errors):

>>> codecs.utf_8_decode(b'U+2603 SNOWMAN: \xe2\x98', None, True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 16-17: unexpected end of data
>>> codecs.utf_8_decode(b'U+2603 SNOWMAN: \xe2\x98', 'replace', True)
('U+2603 SNOWMAN: �', 18)

Note that when the error handler is replace, all 18 bytes of the input have been consumed.
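
A short side-by-side sketch of how final interacts with the error handler, using the same truncated input (the variable names are only for illustration):

```python
import codecs

truncated = b"U+2603 SNOWMAN: \xe2\x98"

# final=False: the incomplete tail is not an error yet, so the handler is not invoked
pending = codecs.utf_8_decode(truncated, "replace", False)

# final=True: the handler decides what happens to the two dangling bytes
replaced = codecs.utf_8_decode(truncated, "replace", True)
ignored = codecs.utf_8_decode(truncated, "ignore", True)

for label, (text, consumed) in [("pending", pending),
                                ("replaced", replaced),
                                ("ignored", ignored)]:
    print(label, repr(text), consumed)
```

With final=False only 16 bytes are reported as consumed; with final=True both handlers report all 18, and they differ only in whether a replacement character is emitted.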

jgirardet · April 29, 2019, 6:02pm

Thank you for the answers.
