Hi all! Sorry for my apparent inactivity. I’ve thought a bit about this, reading the base64 RFC more carefully and writing my own encoding and decoding functions in C.
I think that adding a `data_length` argument, as suggested by me and @encukou, isn’t the best choice for a few reasons:
- It is somewhat complex to handle and implement
- This complexity is also exposed to the user
- While it is possible to encode and decode a stream of data with a data length different from 8 (a byte/octet), I’m unsure how applicable this use case is to Python. As far as I know, Python has no data type that can represent anything more granular than a byte array (correct me if I’m wrong, I’m not an expert!). Suppose you decoded a base64 input with a `data_size` of 5. Where would you store the output? Would a byte array be appropriate for this case? Should each array element contain a 5-bit datum? Or should they be packed as a continuous stream of bits? If so, what about the last element?
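To make the storage question concrete, here is a minimal sketch of the two layouts mentioned in the last point. The helper `pack_bits` is hypothetical (it is not part of any library), and the zero-fill of the final byte is just one possible answer to "what about the last element?":

```python
def pack_bits(values: list[int], width: int) -> bytes:
    """Pack `width`-bit integers into one contiguous big-endian bit stream."""
    acc = 0    # accumulator holding all bits seen so far
    nbits = 0  # number of meaningful bits in `acc`
    for v in values:
        if not 0 <= v < (1 << width):
            raise ValueError(f"{v} does not fit in {width} bits")
        acc = (acc << width) | v
        nbits += width
    # Zero-fill the last partial byte. The count of fill bits is not
    # stored anywhere, so the receiver cannot recover it from the bytes alone.
    fill = -nbits % 8
    acc <<= fill
    return acc.to_bytes((nbits + fill) // 8, "big")


values = [0b10101, 0b00011, 0b11111]  # three 5-bit data

one_per_byte = bytes(values)   # layout 1: 3 bytes, 3 unused bits in each
packed = pack_bits(values, 5)  # layout 2: 15 bits -> 2 bytes, 1 fill bit
```

Neither layout is obviously the right one, which is part of why I think the option would be hard to explain to users.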
All things considered, I believe that adding an option which only concerns padding would be better, and especially easier for users to understand.
I’m also not convinced by @Rosuav’s `strict_padding` option. With it, the only way to accept unpadded inputs would be to set `strict_padding=False`, but this also implies accepting wrongly padded inputs (i.e. inputs with “not enough padding”). The RFC mentions either correctly padded inputs or unpadded inputs as “acceptable”; inputs with incorrect padding should, in my opinion, never be acceptable.
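As a rough illustration of that distinction, the sketch below accepts correctly padded or fully unpadded input and rejects everything in between. The function name `decode_padded_or_unpadded` is made up for this post; it only relies on the existing `base64.b64decode` and `binascii.Error`:

```python
import base64
import binascii


def decode_padded_or_unpadded(data: str) -> bytes:
    """Accept correctly padded or fully unpadded input; reject anything else."""
    stripped = data.rstrip("=")
    n_pad = len(data) - len(stripped)
    # Number of '=' characters a correctly padded input would carry.
    expected = -len(stripped) % 4
    if expected == 3:
        # A length of 1 mod 4 can never be produced by a base64 encoder.
        raise binascii.Error("invalid length")
    if n_pad not in (0, expected):
        raise binascii.Error("incorrect padding")
    # Normalise to the padded form and let the stdlib do the strict decoding.
    return base64.b64decode(stripped + "=" * expected, validate=True)
```

With this, `cHl0aG9uMw==` and `cHl0aG9uMw` both decode, while `cHl0aG9uMw=` raises `binascii.Error`.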
Adding a `padded` option, on the other hand, would also make it clearer to people reading the code that, in that given situation, the developer chose to use the unpadded profile of base64. `strict_padding=False`, by contrast, would be less descriptive/precise.
The Go language, for reference, has different encoders/decoders for different base64 profiles: standard with padding, standard without padding, urlsafe with padding, urlsafe without padding. See https://pkg.go.dev/encoding/base64#pkg-variables.
Lastly, here’s a code comparison of `padded` and `strict_padding`, and how they interact with `validate`:
from base64 import b64decode, b64encode
assert b64encode(b"python3") == b"cHl0aG9uMw=="
# padded option
b64decode("cHl0aG9uMw==", padded=True, validate=True) # ok
b64decode("cHl0aG9uMw==", padded=True, validate=False) # ok
b64decode("cHl0aG9uMw==", padded=False, validate=False) # ok? error? I'd argue error
b64decode("cHl0aG9uMw==", padded=False, validate=True) # error
b64decode("cHl0aG9uMw=", padded=True, validate=True) # error
b64decode("cHl0aG9uMw=", padded=True, validate=False) # error
b64decode("cHl0aG9uMw=", padded=False, validate=False) # ok? error? I'd argue error
b64decode("cHl0aG9uMw=", padded=False, validate=True) # error
b64decode("cHl0aG9uMw", padded=True, validate=True) # error
b64decode("cHl0aG9uMw", padded=True, validate=False) # error
b64decode("cHl0aG9uMw", padded=False, validate=False) # ok
b64decode("cHl0aG9uMw", padded=False, validate=True) # ok
# strict_padding option
b64decode("cHl0aG9uMw==", strict_padding=True, validate=True) # ok
b64decode("cHl0aG9uMw==", strict_padding=True, validate=False) # ok
b64decode("cHl0aG9uMw==", strict_padding=False, validate=False) # ok
b64decode("cHl0aG9uMw==", strict_padding=False, validate=True) # ok
b64decode("cHl0aG9uMw=", strict_padding=True, validate=True) # error
b64decode("cHl0aG9uMw=", strict_padding=True, validate=False) # error
b64decode("cHl0aG9uMw=", strict_padding=False, validate=False) # ok
b64decode("cHl0aG9uMw=", strict_padding=False, validate=True) # ok
b64decode("cHl0aG9uMw", strict_padding=True, validate=True) # error
b64decode("cHl0aG9uMw", strict_padding=True, validate=False) # error
b64decode("cHl0aG9uMw", strict_padding=False, validate=False) # ok
b64decode("cHl0aG9uMw", strict_padding=False, validate=True) # ok
From these examples, I hope it’s clear that `padded` implies a more conscious decision. It’s also way faster to type!
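For anyone who wants to play with the `padded` semantics today, here is a minimal sketch of a wrapper around the existing `b64decode`. The name `b64decode_padded` and its keyword are hypothetical stand-ins for the proposal, not an existing API. It takes the stricter reading for the "ok? error?" rows (any `=` is an error when `padded=False`) and, for simplicity, always decodes with `validate=True`:

```python
import base64
import binascii


def b64decode_padded(data: str, *, padded: bool = True) -> bytes:
    """Hypothetical stand-in for the proposed b64decode(..., padded=...)."""
    if padded:
        # Padded profile: the stdlib already rejects missing or partial padding.
        return base64.b64decode(data, validate=True)
    # Unpadded profile: any '=' is an error (the stricter reading argued above).
    if "=" in data:
        raise binascii.Error("padding not allowed in the unpadded profile")
    if len(data) % 4 == 1:
        raise binascii.Error("invalid length")
    # Re-add the padding internally so the stdlib decoder accepts the input.
    return base64.b64decode(data + "=" * (-len(data) % 4), validate=True)
```

Under these assumptions, `padded=True` accepts only `cHl0aG9uMw==`, and `padded=False` accepts only `cHl0aG9uMw`.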
Once there’s consensus, I’ll resume working on the patch : )
Edit: the `base64` command line utility, written by the same guy who wrote the RFC, happily decodes both `cHl0aG9uMw==` and `cHl0aG9uMw`, but throws an error with `cHl0aG9uMw=`.