Hi! First post here, so feel free to redirect me to the most appropriate place where I should submit this
I think it’d be great if the stdlib’s base64 decoding functions accepted a padded keyword argument specifying whether the to-be-decoded string includes padding or not.
This would allow decoding strings which do not have padding on purpose. For example, the JSON Web Signatures (and JWTs) standard(s) specify that their base64-encoded components should not include any trailing = characters (i.e. no padding).
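To illustrate what happens today with an unpadded segment, and the workaround this proposal would make unnecessary (a sketch; the payload here is just an example JWT-style claim):

```python
import base64
import binascii

# A JWS/JWT segment is base64url-encoded with the '=' padding stripped.
payload = b'{"sub":"1234567890"}'
segment = base64.urlsafe_b64encode(payload).rstrip(b"=")

try:
    base64.urlsafe_b64decode(segment)
except binascii.Error as err:
    print(err)  # Incorrect padding

# Today's workaround: re-append '=' until the length is a multiple of 4.
padded = segment + b"=" * (-len(segment) % 4)
assert base64.urlsafe_b64decode(padded) == payload
```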
Base64 encoding using the URL- and filename-safe character set defined in Section 5 of RFC 4648, with all trailing ‘=’ characters omitted (as permitted by Section 3.2)
In some circumstances, the use of padding (“=”) in base-encoded data is not required or used.
and
The pad character “=” is typically percent-encoded when used in a URI, but if the data length is known implicitly, this can be avoided by skipping the padding
Of course, this new boolean would default to true, to preserve the current behaviour.
Alternatively, a data_length argument could be added instead, so that the decoder can understand on its own what bits to ignore at the end. This StackOverflow answer may be useful in understanding what I mean: https://stackoverflow.com/a/56240229/10767647
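To sketch what I mean (the function name is hypothetical, not a proposed API): if the decoded length is known, the expected number of encoded characters can be computed and checked, and the padding inferred:

```python
import base64

def b64decode_with_length(s: bytes, data_length: int) -> bytes:
    """Hypothetical sketch: decode unpadded base64 when the caller
    already knows how many bytes the output should contain."""
    # n bytes of data always encode to (n * 4 + 2) // 3 non-pad characters.
    expected_chars = (data_length * 4 + 2) // 3
    if len(s) != expected_chars:
        raise ValueError(f"expected {expected_chars} characters, got {len(s)}")
    # The padding is now implied, so it is safe to re-add it and decode.
    return base64.b64decode(s + b"=" * (-len(s) % 4))
```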
This is the right place.
Adding a checked data_length argument (and thorough tests) to b64decode and b32decode, and the corresponding binascii functions, sounds reasonable to me. Do you want to work on a PR?
I’m definitely interested, yes! I’ve never hacked on the cpython codebase though, do you have any suggestion as to where I should start? (I’m comfortable writing C code and dealing with non-conventional build systems, so that’s not the issue for me)
You’ll want to build CPython from source.
Then, hack on Modules/binascii.c and add tests to Lib/test/test_binascii.py. You’ll run into our code-generation tool for argument handling, Argument Clinic. It has docs; the short story is that you’ll want to change the magic comment and run make clinic (or Tools\clinic\clinic.py --make on Windows). Use the existing strict_mode argument as an example; for a count you’ll want the Py_ssize_t type (signed, so you can use -1 as the default).
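For orientation, the Argument Clinic magic comment for a2b_base64 looks roughly like this (paraphrased from Modules/binascii.c, so double-check against the actual source); a new parameter would be declared alongside strict_mode and the boilerplate regenerated:

```
/*[clinic input]
binascii.a2b_base64

    data: ascii_buffer
    /
    *
    strict_mode: bool = False

Decode a line of base64 data.
[clinic start generated code]*/
```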
Also, file an issue, link to this conversation, and mention @encukou on it.
RFC 4648 does not permit the omission of padding. The omission of padding can be implemented in a different library whose specification explicitly allows this, such as RFC 7515.
Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise.
That was an oversight; it should (or must) return percent-encoded characters.
Note that I’m not proposing to change encoding functions, only decoding ones. All base64-encoded strings created in Python will still have padding.
Whether encoding functions should be extended too should probably be discussed elsewhere (and it would maybe go against the classic “be conservative in what you send, be liberal in what you accept”)
Bikeshedding this a bit: I’d be inclined to make the argument strict=True. If it’s True (which would be the default), then - as now - incorrect padding is an error, and the string MUST contain the correct number of equals signs. (Note that this may be zero; if the original binary string was a multiple of three bytes, the only correct amount of padding is none.) However, if it’s False, padding is allowed to be omitted.
Personally, I would be inclined to have ALL trailing equals signs be ignored when in non-strict mode, as this is convenient in many cases. The alternative would be to accept precisely two options: correct padding, or no padding.
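In today’s terms, the non-strict behaviour I’m describing (ignore all trailing equals signs) could be sketched as a wrapper; purely illustrative, not a proposed implementation:

```python
import base64

def b64decode_proposed(s: str, *, strict: bool = True) -> bytes:
    """Illustrative only: strict=True keeps today's behaviour;
    strict=False drops any trailing '=' and re-pads before decoding."""
    if not strict:
        s = s.rstrip("=")
        s = s + "=" * (-len(s) % 4)
    return base64.b64decode(s)
```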
Looking for prior art led me down a rabbit hole. I started with Pike’s MIME.decode_base64 function, which takes the simple and easy approach of stopping decoding when it hits an equals sign, but otherwise not counting them at all (you can even give it a string of dozens of equals signs and it won’t complain). It’s easy and it’s convenient, and the difference isn’t ever likely to come up anyway. Cool. Straight-forward.
JavaScript’s atob function (Window: atob() method - Web APIs | MDN) … okay, now it’s rabbit hole time. In Chrome, Firefox, and Node.js, it seems to accept either correct padding or no padding, but rejects anything else (e.g. you can’t atob("AA=")).

Browsing MDN led me (via several steps) to this specification of the “forgiving” decode. I think what it’s saying is that you’re allowed either perfect padding (so that the result is a multiple of 4 characters including the padding) or no padding (which leaves either 2 or 3 characters beyond the last block of four), but nothing else. However, the specification doesn’t ACTUALLY say that you need to reject in step 2 if it has any other number of equals signs. I guess it’s implied? But, for example, "A===" and "AAAA====" both have code point lengths that are multiples of four, and they don’t end with “one or two U+003D”, so… I guess the specification is saying that you fail? If I were to naively code this up, I’d probably land in step 4 and discover that there’s still an equals sign and therefore return failure, but it’s kinda confusing.
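Here’s my reading of the forgiving-decode steps as a sketch (my own interpretation of the spec text, not a reference implementation):

```python
import base64
import binascii

def forgiving_b64decode(s: str) -> bytes:
    # Step 1: remove ASCII whitespace.
    s = "".join(s.split())
    # Step 2: if the length is a multiple of 4, strip one or two trailing '='.
    if len(s) % 4 == 0:
        if s.endswith("=="):
            s = s[:-2]
        elif s.endswith("="):
            s = s[:-1]
    # Steps 3-4: a leftover length of 1 mod 4, or any '=' still present,
    # is a failure (this is where "A===" and "AAAA====" get rejected).
    if len(s) % 4 == 1 or "=" in s:
        raise binascii.Error("invalid forgiving base64")
    return base64.b64decode(s + "=" * (-len(s) % 4))
```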
Anyway!! All that’s to say that I think there are two quite reasonable interpretations here (“allow any padding” and “allow correct or none, but nothing else”), both of which have their merits. But I think it’s better to describe this as a strictness rule rather than a “this was/wasn’t padded” rule.
As I understand it, the base64 standard indicates that implementations must not support encoding or decoding non-padded base64 strings. Defining such behavior is left to other specifications.
For example, RFC 7515 (Appendix C) shows how to decode when padding is omitted: re-add the missing pad characters, then hand the string to a standard decoder. Condensed from its C# example:
switch (s.Length % 4) { case 2: s += "=="; break; case 3: s += "="; break; } // Pad with trailing '='s
return Convert.FromBase64String(s); // Standard base64 decoder
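In Python, the equivalent (a hypothetical helper, not an existing stdlib function) is to re-add the padding and call the standard urlsafe decoder:

```python
import base64

def b64url_decode_nopad(s: str) -> bytes:
    # Hypothetical helper: re-add the '=' padding that RFC 7515 omits,
    # then hand the result to the standard urlsafe decoder.
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))
```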
How would you implement this without breaking b64decode’s existing interface, though? It already specifies a “validate” boolean, which when true throws an error if the string doesn’t have enough padding. Are you proposing to deprecate its current behaviour?
validate=True isn’t what causes that, so I would expect that it still wouldn’t. When you set validate=True, you enforce that extraneous characters are errors:
>>> base64.b64decode("AAAA\nBBBB")
b'\x00\x00\x00\x04\x10A'
>>> base64.b64decode("AAAABBBB")
b'\x00\x00\x00\x04\x10A'
>>> base64.b64decode("AAAA\nBBBB", validate=True)
Traceback (most recent call last):
File "<python-input-4>", line 1, in <module>
base64.b64decode("AAAA\nBBBB", validate=True)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.14/base64.py", line 86, in b64decode
return binascii.a2b_base64(s, strict_mode=validate)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
binascii.Error: Only base64 data is allowed
So I would see them as orthogonal, albeit with confusingly-similar names. Using validate=True mandates that no extraneous characters slip in; using strict=True mandates that the padding at the end be correct. It’s also going to be more than a little strange that the default is validate=False, strict=True and I would personally be fine with the default being False, False, but that would be a notable change in behaviour and it’s reasonable to demand that the default maintain the status quo.
Validate and not strict would allow the padding to be omitted while demanding no newline in the middle of the text. Strict and not validate is the opposite. Both are reasonable in their respective circumstances.
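Concretely, the 2x2 matrix could behave like this (a sketch wrapping the current b64decode; parameter names as discussed above, purely illustrative):

```python
import base64
import binascii

def decode(s: str, *, validate: bool = False, strict: bool = True) -> bytes:
    """Illustrative wrapper only: 'validate' polices the body of the
    string, 'strict' polices the padding at the end."""
    if not strict:
        s = s.rstrip("=")
        s = s + "=" * (-len(s) % 4)
    return base64.b64decode(s, validate=validate)

assert decode("AAAA\nBBBB") == b"\x00\x00\x00\x04\x10A"  # lenient body, padded
assert decode("QQ", strict=False) == b"A"                # padding omitted
try:
    decode("AAAA\nBBBB", validate=True)  # newline in body -> error
except binascii.Error:
    pass
try:
    decode("QQ")  # missing padding under strict=True -> error
except binascii.Error:
    pass
```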
I don’t want to be pedantic, but RFC 4648 explicitly states that ‘base64url’ is not the same as ‘base64’, nor does it require a library implementing RFC 4648 to provide ‘base64url’.
This encoding may be referred to as “base64url”. This encoding should not be regarded as the same as the “base64” encoding and should not be referred to as only “base64”. Unless clarified otherwise, “base64” refers to the base 64 in the previous section.
I believe that the omission of padding should be implemented only in the ‘base64.urlsafe_b64*’ implementation. Otherwise, any willful violation should be documented, effectively creating a new standard.
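For what it’s worth, the stdlib already keeps the two alphabets separate, which is why scoping the change to the urlsafe_b64* functions would be natural. For example, bytes whose 6-bit groups hit indices 62 and 63 encode differently in each alphabet:

```python
import base64

data = b"\xfb\xff\xfe"  # 6-bit groups: 62, 63, 63, 62
print(base64.b64encode(data))          # b'+//+'  (base64 alphabet)
print(base64.urlsafe_b64encode(data))  # b'-__-'  (base64url alphabet)
```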