`final` argument of codecs.utf_8_decode

(Jgirardet) #1

What is the purpose of the final argument of codecs.utf_8_decode?

It's not documented, and the result is not obvious:

In [18]: codecs.utf_8_decode(b"lkjlkjlijli", None, 90)                          
Out[18]: ('lkjlkjlijli', 11)

In [19]: codecs.utf_8_decode(b"lkjlkjlijli", None, 0)                           
Out[19]: ('lkjlkjlijli', 11)


(Marc-André Lemburg) #2

The final argument is needed for incremental codecs, e.g. for


The purpose is to process and flush all buffers.


(Martijn Pieters) #3

The codecs.utf_8_decode() function is not documented, no, because it is an implementation detail, part of the codecs UTF-8 infrastructure for incremental decoding. It is used indirectly to underpin the IncrementalDecoder.decode() method registered for the UTF-8 codec. This is where the final argument is documented:

decode(object[, final])
Decodes object (taking the current state of the decoder into account) and returns the resulting decoded object. If this is the last call to decode() final must be true (the default is false). If final is true the decoder must decode the input completely and must flush all buffers. If this isn’t possible (e.g. because of incomplete byte sequences at the end of the input) it must initiate error handling just like in the stateless case (which might raise an exception).
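As a rough illustration of that contract, here is a minimal sketch that feeds a UTF-8 stream to the standard incremental decoder in arbitrary chunks. The chunk boundaries and variable names are made up for the example; the last call passes final=True to signal that no more data will follow.

```python
import codecs

# Obtain the incremental decoder class registered for UTF-8 and instantiate it.
decoder = codecs.getincrementaldecoder("utf-8")()

# Hypothetical chunks: the 3-byte snowman (e2 98 83) is split across all three.
chunks = [b"snow: \xe2", b"\x98", b"\x83\n"]

# Intermediate calls use the default final=False, so incomplete sequences
# at the end of a chunk are held back instead of being reported as errors.
parts = [decoder.decode(chunk) for chunk in chunks]

# The last call sets final=True so the decoder flushes whatever it still holds.
parts.append(decoder.decode(b"", final=True))

text = "".join(parts)
print(parts)
print(repr(text))
```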

The function is directly called as the implementation of the abstract _buffer_decode() method of the codecs.BufferedIncrementalDecoder() base class:

def _buffer_decode(self, input, errors, final):
     # Overwrite this method in subclasses: It must decode input
     # and return an (output, length consumed) tuple
     raise NotImplementedError

which documents exactly what you see returned: a tuple with the decoding result and the number of bytes consumed. This allows for partial decoding of possibly-incomplete data streaming into Python from some form of I/O.

If you pass it incomplete UTF-8 bytes, you’ll see the consumed length reflect this:

>>> codecs.utf_8_decode(b'U+2603 SNOWMAN: \xe2\x98', None, False)  # missing the final \x83 byte
('U+2603 SNOWMAN: ', 16)

The b'\xe2\x98' byte sequence is not complete; the third byte, b'\x83', is missing. The next call to the IncrementalDecoder.decode() method should start with that byte, and the BufferedIncrementalDecoder implementation keeps the bytes not consumed by _buffer_decode() so they can be prepended to the next batch of data in a future call and decoded properly:

>>> codecs.utf_8_decode(b'\xe2\x98\x83\n', None, False)
('☃\n', 4)
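The buffering described above can be sketched by hand. The class below is a hypothetical, simplified stand-in (SketchUtf8Decoder is not a real codecs class) that roughly mirrors what a buffered incremental decoder does with the consumed-length value: anything past that offset is kept and prepended to the next chunk.

```python
import codecs


class SketchUtf8Decoder:
    """Illustrative only: keeps undecoded trailing bytes between calls."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.buffer = b""  # bytes left over from the previous call

    def decode(self, data, final=False):
        # Prepend whatever the previous call could not decode yet.
        data = self.buffer + data
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        # Keep the bytes that were not consumed for the next call.
        self.buffer = data[consumed:]
        return text


sketch = SketchUtf8Decoder()
first = sketch.decode(b"U+2603 SNOWMAN: \xe2\x98")
leftover = sketch.buffer
second = sketch.decode(b"\x83\n")
print(repr(first), leftover, repr(second))
```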

But if the 3rd argument, final, is True, then the codecs.utf_8_decode() function will have to treat missing bytes as an error, and raise an exception or put in a replacement as per the second argument (errors):

>>> codecs.utf_8_decode(b'U+2603 SNOWMAN: \xe2\x98', None, True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 16-17: unexpected end of data
>>> codecs.utf_8_decode(b'U+2603 SNOWMAN: \xe2\x98', 'replace', True)
('U+2603 SNOWMAN: �', 18)

Note that when the error handler is replace, all 18 bytes of the input have been consumed.
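The same behaviour shows up through the public incremental decoder when a stream ends mid-character. The sketch below uses a hypothetical helper, decode_stream(), and invented chunk data; it only illustrates how the final flush interacts with the errors setting.

```python
import codecs


def decode_stream(chunks, errors="strict"):
    """Hypothetical helper: decode an iterable of byte chunks as UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors)
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True forces the decoder to deal with any bytes it is still holding.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


# A stream that stops before the last byte (\x83) of the snowman arrives.
truncated = [b"U+2603 SNOWMAN: ", b"\xe2\x98"]

try:
    decode_stream(truncated)
    strict_failed = False
except UnicodeDecodeError:
    # With the strict handler the incomplete tail is an error at the final call.
    strict_failed = True

# With 'replace', the incomplete tail becomes U+FFFD instead of raising.
replaced = decode_stream(truncated, errors="replace")
print(strict_failed, repr(replaced))
```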

(Jgirardet) #4

Thank you for the answers.