Add boolean `padded` argument to `base64.b64decode` et al

Hi! First post here, so feel free to redirect me to the most appropriate place where I should submit this :slight_smile:

I think it’d be great if the stdlib’s base64 decoding functions would allow taking a padded keyword argument in order to specify whether the to-be-decoded string includes padding or not.

This would allow decoding strings which do not have padding on purpose. For example, the JSON Web Signatures (and JWTs) standard(s) specify that their base64-encoded components should not include any trailing = characters (i.e. no padding).

Citing RFC 7515 (JWS):

Base64 encoding using the URL- and filename-safe character set defined in Section 5 of RFC 4648, with all trailing ‘=’ characters omitted (as permitted by Section 3.2)

Also, RFC 4648 says:

In some circumstances, the use of padding (“=”) in base-encoded data is not required or used.

and

The pad character “=” is typically percent-encoded when used in an URI, but if the data length is known implicitly, this can be avoided by skipping the padding

Of course, this new boolean would default to true, to preserve the current behaviour.
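
To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written as a wrapper around the existing decoder. The name `b64decode_sketch` and its `padded` keyword are hypothetical and only illustrate the proposal; nothing like this exists in the stdlib as far as this thread establishes.

```python
import base64
import binascii


def b64decode_sketch(s: str, *, padded: bool = True) -> bytes:
    """Hypothetical wrapper (not a stdlib API): decode with or without padding."""
    if not padded:
        if s.endswith("="):
            raise binascii.Error("padding present but padded=False")
        # Restore the '=' characters that the standard decoder expects.
        s += "=" * (-len(s) % 4)
    return base64.urlsafe_b64decode(s)


# Today, an unpadded string is rejected by the stdlib decoder.
try:
    base64.urlsafe_b64decode("SGVsbG8")  # b'Hello' with the trailing '=' removed
except binascii.Error as exc:
    print("current behaviour:", exc)

print(b64decode_sketch("SGVsbG8", padded=False))  # b'Hello'
print(b64decode_sketch("SGVsbG8="))               # b'Hello'
```

With `padded=True` (the default) the wrapper defers entirely to the existing decoder, so current behaviour is preserved.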

Alternatively, a data_length argument could be added instead, so that the decoder can understand on its own what bits to ignore at the end. This StackOverflow answer may be useful in understanding what I mean: https://stackoverflow.com/a/56240229/10767647
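
One possible reading of the data_length idea, as a rough sketch: the caller states how many bytes the decoded result should have, and the helper checks that the number of base64 characters matches before restoring the padding. The helper name and the exact semantics of `data_length` are assumptions made for illustration.

```python
import base64
import binascii


def b64decode_with_length(s: str, data_length: int) -> bytes:
    """Hypothetical helper: decode base64 given the expected decoded byte count."""
    body = s.rstrip("=")
    # n bytes are 8n bits, which need ceil(8n / 6) base64 characters.
    expected_chars = (data_length * 8 + 5) // 6
    if len(body) != expected_chars:
        raise binascii.Error(
            f"expected {expected_chars} data characters, got {len(body)}"
        )
    # Re-pad to a multiple of four so the standard decoder accepts it.
    return base64.b64decode(body + "=" * (-len(body) % 4))


print(b64decode_with_length("SGVsbG8", 5))  # b'Hello'
```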

Bye!

At first glance, this seems reasonable, given the RFC. But I am not an expert on the module.

This is the right place.
Adding a checked data_length argument (and thorough tests) to b64decode and b32decode, and the corresponding binascii functions, sounds reasonable to me. Do you want to work on a PR?

No one is :)

Thanks for your fast replies!

I’m definitely interested, yes! I’ve never hacked on the cpython codebase though, do you have any suggestion as to where I should start? (I’m comfortable writing C code and dealing with non-conventional build systems, so that’s not the issue for me)

Thank you!

You’ll want to build CPython from source.
Then, hack on Modules/binascii.c and add tests to Lib/test/test_binascii.py. You’ll run into our code-generation tool for argument handling, Argument Clinic. It has docs; the short story is that you’ll want to change the magic comment and run make clinic (or Tools\clinic\clinic.py --make on Windows). Use the existing strict_mode argument as an example; for a count you’ll want the Py_ssize_t type (signed, so you can use -1 as default).

Also, file an issue, link to this conversation, and mention @encukou on it.

Let me know if you run into any issues.

Very helpful, thanks! Starting now…

Shouldn’t urlsafe_b64encode already produce an unpadded base64 string, and urlsafe_b64decode be able to decode these unpadded base64 strings?

import base64

data = b'Hello'

urlsafe_base64 = base64.urlsafe_b64encode(data).decode('utf-8')
print(urlsafe_base64)  # SGVsbG8= is not URL-safe

Currently, urlsafe_b64encode does not produce URL-safe base64 strings.

RFC 4648 does not permit the omission of padding. The omission of padding can be implemented in a different library whose specification explicitly allows this, such as RFC 7515.

Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise.


That was an oversight; it should (or must) return percent-encoded characters.

Note that I’m not proposing to change encoding functions, only decoding ones. All base64-encoded strings created in Python will still have padding.

Whether encoding functions should be extended too should probably be discussed elsewhere (and it would maybe go against the classic “be conservative in what you send, be liberal in what you accept”)

Bikeshedding this a bit: I’d be inclined to make the argument strict=True. If it’s True (which would be the default), then - as now - incorrect padding is an error, and the string MUST contain the correct number of equals signs. (Note that this may be zero; if the original binary string was a multiple of three bytes, the only correct amount of padding is none.) However, if it’s False, padding is allowed to be omitted.

Personally, I would be inclined to have ALL trailing equals signs be ignored when in non-strict mode, as this is convenient in many cases. The alternative would be to accept precisely two options: correct padding, or no padding.

Looking for prior art led me down a rabbit hole. I started with Pike’s MIME.decode_base64 function, which takes the simple and easy approach of stopping decoding when it hits an equals sign, but otherwise not counting them at all (you can even give it a string of dozens of equals signs and it won’t complain). It’s easy and it’s convenient, and the difference isn’t ever likely to come up anyway. Cool. Straight-forward.

JavaScript’s atob function (Window: atob() method - Web APIs | MDN) … okay, now it’s rabbit hole time. In Chrome, Firefox, and Node.js, it seems to accept either correct padding or no padding, but rejects anything else (eg you can’t atob("AA=")). Browsing MDN led me (via several steps) to this specification of the “forgiving” decode. I think that what this is saying is that you’re allowed to have either perfect padding (so that it results in a multiple of 4 characters including the padding) or no padding (which would result in either 2 or 3 characters beyond the last block of four), but nothing else. However, the specification doesn’t ACTUALLY say that you need to reject in step 2 if it has any other number of equals signs. I guess it’s implied? But - for example - "A===" and "AAAA====" both have code point lengths that are multiples of four, and they don’t end with “one or two U+003D”, so… I guess the specification is saying that you fail? If I were to naively code this up, I’d probably land in step 4 and discover that there’s still an equals sign and therefore return failure, but it’s kinda confusing.

Anyway!! All that’s to say that I think there are two quite reasonable interpretations here (“allow any padding” and “allow correct or none, but nothing else”), both of which have their merits. But I think it’s better to describe this as a strictness rule rather than a “this was/wasn’t padded” rule.
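
To show the difference between the two interpretations, here is a rough sketch of each as a wrapper over the current decoder. Both function names are made up for this post; neither is an existing or proposed stdlib API.

```python
import base64
import binascii


def decode_any_padding(s: str) -> bytes:
    """Hypothetical: ignore however many trailing '=' signs are present."""
    body = s.rstrip("=")
    return base64.b64decode(body + "=" * (-len(body) % 4))


def decode_correct_or_none(s: str) -> bytes:
    """Hypothetical: accept exactly the right padding, or no padding at all."""
    body = s.rstrip("=")
    needed = -len(body) % 4          # pad characters a strict encoder would emit
    given = len(s) - len(body)       # pad characters actually present
    if given not in (0, needed):
        raise binascii.Error("padding must be correct or absent")
    return base64.b64decode(body + "=" * needed)


print(decode_any_padding("AA="))       # b'\x00'
print(decode_correct_or_none("AA"))    # b'\x00'
print(decode_correct_or_none("AA=="))  # b'\x00'
```

Under the second interpretation, "AA=" and "AAAA====" are both rejected, which matches what I saw from atob in the browsers I tried.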

As I understand it, the base64 standard indicates that implementations must not support encoding or decoding non-padded base64 strings. Defining such behavior is left to other specifications.

For example, here is how to decode when padding is omitted:

return Convert.FromBase64String(s); // Standard base64 decoder
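
The line above looks like the tail of a longer snippet. A Python sketch of the same idea (restore the padding from the string length, then hand the result to the standard decoder) might look like this; the function name is hypothetical.

```python
import base64


def base64url_decode(arg: str) -> bytes:
    """Hypothetical helper: re-pad an unpadded base64url string, then decode."""
    remainder = len(arg) % 4
    if remainder == 1:
        # No whole number of bytes encodes to 4k + 1 characters.
        raise ValueError("illegal base64url string")
    # remainder 0, 2, 3 needs 0, 2, 1 pad characters respectively.
    padded = arg + "=" * (-remainder % 4)
    return base64.urlsafe_b64decode(padded)  # standard decoder


print(base64url_decode("SGVsbG8"))  # b'Hello'
```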

There is an open issue for this:

How would you implement this without breaking b64decode’s existing interface, though? It already specifies a “validate” boolean, which when true throws an error if the string doesn’t have enough padding. Are you proposing to deprecate its current behaviour?

validate=True isn’t what causes that, so I would expect that it still wouldn’t. When you set validate=True, you enforce that extraneous characters are errors:

>>> base64.b64decode("AAAA\nBBBB")
b'\x00\x00\x00\x04\x10A'
>>> base64.b64decode("AAAABBBB")
b'\x00\x00\x00\x04\x10A'
>>> base64.b64decode("AAAA\nBBBB", validate=True)
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    base64.b64decode("AAAA\nBBBB", validate=True)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/base64.py", line 86, in b64decode
    return binascii.a2b_base64(s, strict_mode=validate)
           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
binascii.Error: Only base64 data is allowed

So I would see them as orthogonal, albeit with confusingly-similar names. Using validate=True mandates that no extraneous characters slip in; using strict=True mandates that the padding at the end be correct. It’s also going to be more than a little strange that the default is validate=False, strict=True and I would personally be fine with the default being False, False, but that would be a notable change in behaviour and it’s reasonable to demand that the default maintain the status quo.

Validate and not strict would allow the padding to be omitted while demanding no newline in the middle of the text. Strict and not validate is the opposite. Both are reasonable in their respective circumstances.
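
As a sketch of how the two flags could combine, assuming a hypothetical `strict_padding` keyword layered on top of today's `validate` (the wrapper name is made up too):

```python
import base64
import string

_ALPHABET = set(string.ascii_letters + string.digits + "+/")


def b64decode_flags(s: str, *, validate: bool = False,
                    strict_padding: bool = True) -> bytes:
    """Hypothetical wrapper: `validate` polices characters, `strict_padding` the '='."""
    if not strict_padding:
        body = s.rstrip("=")
        # Count only alphabet characters, since validate=False skips the rest.
        data_chars = sum(c in _ALPHABET for c in body)
        s = body + "=" * (-data_chars % 4)
    return base64.b64decode(s, validate=validate)


# Neither flag set: a stray newline and missing padding are both tolerated.
print(b64decode_flags("SGVs\nbG8", validate=False, strict_padding=False))  # b'Hello'
# validate only: padding may be omitted, but the newline would be an error.
print(b64decode_flags("SGVsbG8", validate=True, strict_padding=False))     # b'Hello'
# strict_padding only: the newline is skipped, but the '=' must be there.
print(b64decode_flags("SGVs\nbG8=", validate=False, strict_padding=True))  # b'Hello'
```

This is only meant to show that the two checks are independent; it does not handle every edge case (for example, whitespace after the padding).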

Maybe strict should be strict_padding.

:+1: Good way to distinguish them, I like. It’s too late to rename validate to strict_charset but we can at least document the distinction that way.

I don’t want to be pedantic, but RFC 4648 explicitly states that ‘base64url’ is not the same as ‘base64’, nor does it require ‘base64url’ implementation if a library is named rfc4648.py.

This encoding may be referred to as “base64url”. This encoding should not be regarded as the same as the “base64” encoding and should not be referred to as only “base64”. Unless clarified otherwise, “base64” refers to the base 64 in the previous section.

I believe that the omission of padding should be implemented only in the ‘base64.urlsafe_b64*’ implementation. Otherwise, any willful violation should be documented, effectively creating a new standard.

I interpret section 3.2 as “padding should be there by default, but it’s of course reasonable to leave it out if your environment does not need that”.

But again, here I just want to improve decoding functions, not (yet) encoding ones.

You should improve both urlsafe_b64encode and urlsafe_b64decode instead. Refer to RFC 7515 for implementation details.