Add boolean `padded` argument to `base64.b64decode` et al

Hi! First post here, so feel free to redirect me to the most appropriate place where I should submit this :slight_smile:

I think it’d be great if the stdlib’s base64 decoding functions would allow taking a padded keyword argument in order to specify whether the to-be-decoded string includes padding or not.

This would allow decoding strings which do not have padding on purpose. For example, the JSON Web Signatures (and JWTs) standard(s) specify that their base64-encoded components should not include any trailing = characters (i.e. no padding).

Citing RFC 7515 (JWS):

Base64 encoding using the URL- and filename-safe character set defined in Section 5 of RFC 4648, with all trailing ‘=’ characters omitted (as permitted by Section 3.2)

Also, RFC 4648 says:

In some circumstances, the use of padding (“=”) in base-encoded data is not required or used.

and

The pad character “=” is typically percent-encoded when used in an URI, but if the data length is known implicitly, this can be avoided by skipping the padding

Of course, this new boolean would default to true, to preserve the current behaviour.
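
To make the status quo concrete, here is a minimal sketch of what callers have to do today: `base64.b64decode` rejects the unpadded form, so the padding is re-added by hand before decoding. The helper name `b64decode_unpadded` is hypothetical and not part of the stdlib.

```python
import base64
import binascii


def b64decode_unpadded(data: str) -> bytes:
    """Hypothetical helper: decode base64 text whose trailing '=' were stripped."""
    # A padded base64 string always has a length that is a multiple of 4,
    # so restore however many '=' characters the encoder left out.
    missing = -len(data) % 4
    return base64.b64decode(data + "=" * missing)


# The stdlib decoder rejects the unpadded form...
try:
    base64.b64decode("cHl0aG9uMw")
except binascii.Error as exc:
    print("rejected:", exc)

# ...so the padding has to be restored manually.
print(b64decode_unpadded("cHl0aG9uMw"))  # b'python3'
```

A `padded=False` argument would essentially move this length arithmetic into the decoder.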

Alternatively, a data_length argument could be added instead, so that the decoder can understand on its own what bits to ignore at the end. This StackOverflow answer may be useful in understanding what I mean: https://stackoverflow.com/a/56240229/10767647

Bye!

At first glance, this seems reasonable, given the RFC. But I am not an expert on the module.

This is the right place.
Adding a checked data_length argument (and thorough tests) to b64decode and b32decode, and the corresponding binascii functions, sounds reasonable to me. Do you want to work on a PR?

No one is :‍)

Thanks for your fast replies!

I’m definitely interested, yes! I’ve never hacked on the cpython codebase though, do you have any suggestion as to where I should start? (I’m comfortable writing C code and dealing with non-conventional build systems, so that’s not the issue for me)

Thank you!

You’ll want to build CPython from source.
Then, hack on Modules/binascii.c and add tests to Lib/test/test_binascii.py. You’ll run into our code-generation tool for argument-handling, Argument Clinic. It has docs; the short story is that you’ll want to change the magic comment and run make clinic (or Tools\clinic\clinic.py --make on Windows). Use the existing strict_mode argument as an example; for a count you’ll want the Py_ssize_t type (signed, so you can use -1 as default).

Also, file an issue, link to this conversation, and mention @encukou on it.

Let me know if you run into any issues.

Very helpful, thanks! Starting now…

Shouldn’t urlsafe_b64encode already produce an unpadded base64 string, and urlsafe_b64decode be able to decode these unpadded base64 strings?

import base64

data = b'Hello'

urlsafe_base64 = base64.urlsafe_b64encode(data).decode('utf-8')
print(urlsafe_base64)  # SGVsbG8= is not URL-safe

Currently, urlsafe_b64encode does not produce URL-safe base64 strings.
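
For anyone following along, a small sketch of what the urlsafe variant actually changes: only the two alphabet characters are swapped, while the padding stays. The input bytes are picked purely for illustration, because they exercise both special characters.

```python
import base64

raw = b"\xfb\xff"  # chosen so that the encoding uses both special characters

standard = base64.b64encode(raw)
urlsafe = base64.urlsafe_b64encode(raw)

print(standard)  # b'+/8='
print(urlsafe)   # b'-_8='

# Dropping the padding is a separate, explicit step (the form RFC 7515 asks for).
unpadded = urlsafe.rstrip(b"=")
print(unpadded)  # b'-_8'
```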

RFC 4648 does not permit the omission of padding. The omission of padding can be implemented in a different library whose specification explicitly allows this, such as RFC 7515.

Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise.


That was an oversight; it should (or must) return percent-encoded characters.

Note that I’m not proposing to change encoding functions, only decoding ones. All base64-encoded strings created in Python will still have padding.

Whether encoding functions should be extended too should probably be discussed elsewhere (and it would maybe go against the classic “be conservative in what you send, be liberal in what you accept”)

Bikeshedding this a bit: I’d be inclined to make the argument strict=True. If it’s True (which would be the default), then - as now - incorrect padding is an error, and the string MUST contain the correct number of equals signs. (Note that this may be zero; if the original binary string was a multiple of three bytes, the only correct amount of padding is none.) However, if it’s False, padding is allowed to be omitted.

Personally, I would be inclined to have ALL trailing equals signs be ignored when in non-strict mode, as this is convenient in many cases. The alternative would be to accept precisely two options: correct padding, or no padding.

Looking for prior art led me down a rabbit hole. I started with Pike’s MIME.decode_base64 function, which takes the simple and easy approach of stopping decoding when it hits an equals sign, but otherwise not counting them at all (you can even give it a string of dozens of equals signs and it won’t complain). It’s easy and it’s convenient, and the difference isn’t ever likely to come up anyway. Cool. Straight-forward.

JavaScript’s atob function (see “Window: atob() method” on MDN) … okay, now it’s rabbit hole time. In Chrome, Firefox, and Node.js, it seems to accept either correct padding or no padding, but rejects anything else (eg you can’t atob("AA=")). Browsing MDN led me (via several steps) to this specification of the “forgiving” decode. I think that what this is saying is that you’re allowed to have either perfect padding (so that it results in a multiple of 4 characters including the padding) or no padding (which would result in either 2 or 3 characters beyond the last block of four), but nothing else. However, the specification doesn’t ACTUALLY say that you need to reject in step 2 if it has any other number of equals signs. I guess it’s implied? But - for example - "A===" and "AAAA====" both have code point lengths that are multiples of four, and they don’t end with “one or two U+003D”, so… I guess the specification is saying that you fail? If I were to naively code this up, I’d probably land in step 4 and discover that there’s still an equals sign and therefore return failure, but it’s kinda confusing.

Anyway!! All that’s to say that I think there are two quite reasonable interpretations here (“allow any padding” and “allow correct or none, but nothing else”), both of which have their merits. But I think it’s better to describe this as a strictness rule rather than a “this was/wasn’t padded” rule.
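
A rough sketch of the two interpretations, written as wrappers around the existing decoder. Both function names are hypothetical, and `validate=True` is used only so that stray characters cannot throw off the length arithmetic; neither function claims to match what Pike or the browsers do internally.

```python
import base64
import binascii


def decode_ignore_all_padding(text: str) -> bytes:
    """Interpretation 1: any number of trailing '=' is accepted and ignored."""
    body = text.rstrip("=")
    # Re-pad to a multiple of 4 so the stdlib decoder is satisfied.
    return base64.b64decode(body + "=" * (-len(body) % 4), validate=True)


def decode_correct_or_none(text: str) -> bytes:
    """Interpretation 2: padding must be exactly right or entirely absent."""
    body = text.rstrip("=")
    pads = len(text) - len(body)   # how many '=' the caller supplied
    expected = -len(body) % 4      # how many '=' correct padding requires
    if pads not in (0, expected):
        raise binascii.Error("padding must be either correct or absent")
    return base64.b64decode(body + "=" * expected, validate=True)


for candidate in ("cHl0aG9uMw==", "cHl0aG9uMw", "cHl0aG9uMw="):
    print(candidate, decode_ignore_all_padding(candidate))
```

Under the second interpretation, "cHl0aG9uMw=" and "AAAA====" are both rejected, while the first interpretation accepts them.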

As I understand it, the base64 standard indicates that implementations must not support encoding or decoding non-padded base64 strings. Defining such behavior is left to other specifications.

For example, here is how to decode when padding is omitted:

return Convert.FromBase64String(s); // Standard base64 decoder

There is an open issue for this:

How would you implement this without breaking b64decode’s existing interface, though? It already specifies a “validate” boolean, which when true throws an error if the string doesn’t have enough padding. Are you proposing to deprecate its current behaviour?

validate=True isn’t what causes that, so I would expect that it still wouldn’t. When you set validate=True, you enforce that extraneous characters are errors:

>>> base64.b64decode("AAAA\nBBBB")
b'\x00\x00\x00\x04\x10A'
>>> base64.b64decode("AAAABBBB")
b'\x00\x00\x00\x04\x10A'
>>> base64.b64decode("AAAA\nBBBB", validate=True)
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    base64.b64decode("AAAA\nBBBB", validate=True)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/base64.py", line 86, in b64decode
    return binascii.a2b_base64(s, strict_mode=validate)
           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
binascii.Error: Only base64 data is allowed

So I would see them as orthogonal, albeit with confusingly-similar names. Using validate=True mandates that no extraneous characters slip in; using strict=True mandates that the padding at the end be correct. It’s also going to be more than a little strange that the default is validate=False, strict=True and I would personally be fine with the default being False, False, but that would be a notable change in behaviour and it’s reasonable to demand that the default maintain the status quo.

Validate and not strict would allow the padding to be omitted while demanding no newline in the middle of the text. Strict and not validate is the opposite. Both are reasonable in their respective circumstances.

Maybe strict should be strict_padding.

:+1: Good way to distinguish them, I like. It’s too late to rename validate to strict_charset but we can at least document the distinction that way.

I don’t want to be pedantic, but RFC 4648 explicitly states that ‘base64url’ is not the same as ‘base64’, nor does it require ‘base64url’ implementation if a library is named rfc4648.py.

This encoding may be referred to as “base64url”. This encoding should not be regarded as the same as the “base64” encoding and should not be referred to as only “base64”. Unless clarified otherwise, “base64” refers to the base 64 in the previous section.

I believe that the omission of padding should be implemented only in the ‘base64.urlsafe_b64*’ implementation. Otherwise, any willful violation should be documented, effectively creating a new standard.

I interpret section 3.2 as “padding should be there by default, but it’s of course reasonable to leave it out if your environment does not need that”.

But again, here I just want to improve decoding functions, not (yet) encoding ones.

You should improve both urlsafe_b64encode and urlsafe_b64decode instead. Refer to RFC 7515 for implementation details.

Hi all! Sorry for my apparent inactivity. I’ve thought a bit about this, reading the base64 RFC more carefully and writing my own encoding and decoding functions in C.

I think that adding a data_length argument, as suggested by me and @encukou, isn’t the best choice for a few reasons:

  1. It is somewhat complex to handle and implement
  2. This complexity is also exposed to the user
  3. While it is possible to encode and decode a stream of data with a data length different from 8 (a byte/octet), I’m unsure how much this use case is applicable for Python. As far as I know, Python does not have a data type capable of representing anything more granular than a byte array (correct me if I’m wrong, I’m not an expert!). Suppose you’d decode a base64 input with a data_size of 5. Where would you store the output? Would a byte array be appropriate for this case? Should each array element contain a 5-bit datum? Or should they be packed as a continuous stream of bits? If so, what about the last element?

All things considered, I believe that adding an option which only concerns padding would be better, and especially easier to understand for users.

I’m also not convinced by @Rosuav’s strict_padding option. With it, the only option to accept unpadded inputs would be to set strict_padding=False, but this also implies accepting wrongly padded inputs (i.e. inputs with “not enough padding”). The RFC mentions either correctly padded inputs or unpadded inputs as “acceptable”, and inputs with incorrect padding would be in my opinion never acceptable.

Adding a padded option, on the other hand, would also make it more clear to people reading the code that in that given situation the developer chose to use the unpadded profile of base64. strict_padding=False, instead, would be less descriptive/precise.

The Go language, for reference, has different encoders/decoders for different base64 profiles: standard with padding, standard without padding, urlsafe with padding, urlsafe without padding. See https://pkg.go.dev/encoding/base64#pkg-variables.
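
For comparison, this is roughly what the "one codec object per profile" design could look like in Python. It is only a sketch: `Base64Profile` and the four constants are hypothetical names that mirror Go's StdEncoding, RawStdEncoding, URLEncoding and RawURLEncoding, and rejecting any '=' in the unpadded profiles is my own assumption.

```python
import base64
import binascii
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Base64Profile:
    """Hypothetical codec object: an alphabet plus a padding policy."""

    altchars: Optional[bytes]  # None selects the standard "+/" alphabet
    padded: bool

    def encode(self, data: bytes) -> bytes:
        out = base64.b64encode(data, altchars=self.altchars)
        return out if self.padded else out.rstrip(b"=")

    def decode(self, text: bytes) -> bytes:
        if not self.padded:
            # Assumption: an unpadded profile treats any '=' as an error.
            if b"=" in text:
                raise binascii.Error("padding is not allowed in this profile")
            text += b"=" * (-len(text) % 4)
        return base64.b64decode(text, altchars=self.altchars, validate=True)


STD = Base64Profile(altchars=None, padded=True)
RAW_STD = Base64Profile(altchars=None, padded=False)
URL = Base64Profile(altchars=b"-_", padded=True)
RAW_URL = Base64Profile(altchars=b"-_", padded=False)

print(RAW_URL.encode(b"\xfb\xff"))  # b'-_8'
print(RAW_URL.decode(b"-_8"))       # b'\xfb\xff'
```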

Lastly, here’s a code comparison of padded and strict_padding, and how they interact with validate:

from base64 import b64decode, b64encode

assert b64encode(b"python3") == b"cHl0aG9uMw=="  # b64encode takes and returns bytes

# padded option

b64decode("cHl0aG9uMw==", padded=True,  validate=True)  # ok
b64decode("cHl0aG9uMw==", padded=True,  validate=False) # ok
b64decode("cHl0aG9uMw==", padded=False, validate=False) # ok? error? I'd argue error
b64decode("cHl0aG9uMw==", padded=False, validate=True)  # error

b64decode("cHl0aG9uMw=", padded=True,  validate=True)  # error
b64decode("cHl0aG9uMw=", padded=True,  validate=False) # error
b64decode("cHl0aG9uMw=", padded=False, validate=False) # ok? error? I'd argue error
b64decode("cHl0aG9uMw=", padded=False, validate=True)  # error

b64decode("cHl0aG9uMw", padded=True,  validate=True)  # error
b64decode("cHl0aG9uMw", padded=True,  validate=False) # error
b64decode("cHl0aG9uMw", padded=False, validate=False) # ok
b64decode("cHl0aG9uMw", padded=False, validate=True)  # ok

# strict_padding option

b64decode("cHl0aG9uMw==", strict_padding=True,  validate=True)  # ok
b64decode("cHl0aG9uMw==", strict_padding=True,  validate=False) # ok
b64decode("cHl0aG9uMw==", strict_padding=False, validate=False) # ok
b64decode("cHl0aG9uMw==", strict_padding=False, validate=True)  # ok

b64decode("cHl0aG9uMw=", strict_padding=True,  validate=True)  # error
b64decode("cHl0aG9uMw=", strict_padding=True,  validate=False) # error
b64decode("cHl0aG9uMw=", strict_padding=False, validate=False) # ok
b64decode("cHl0aG9uMw=", strict_padding=False, validate=True)  # ok

b64decode("cHl0aG9uMw", strict_padding=True,  validate=True)  # error
b64decode("cHl0aG9uMw", strict_padding=True,  validate=False) # error
b64decode("cHl0aG9uMw", strict_padding=False, validate=False) # ok
b64decode("cHl0aG9uMw", strict_padding=False, validate=True)  # ok

From these examples, I hope it’s clear that padded implies a more conscious decision. It’s also way faster to type!
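
Since `padded` does not exist yet, here is a runnable approximation of the semantics I have in mind, written as a wrapper around today's `b64decode`. The name `b64decode_padded` is hypothetical, and the sketch encodes my "I'd argue error" preference for padded input combined with padded=False.

```python
import base64
import binascii


def b64decode_padded(text: str, *, padded: bool = True, validate: bool = False) -> bytes:
    """Hypothetical wrapper emulating the proposed `padded` flag."""
    if not padded:
        if text.endswith("="):
            # The "I'd argue error" rows: padding where none was expected.
            raise binascii.Error("unexpected padding")
        # Restore the padding the sender deliberately left out.
        text += "=" * (-len(text) % 4)
    return base64.b64decode(text, validate=validate)


# Print one row per input/flag combination from the comparison.
for text in ("cHl0aG9uMw==", "cHl0aG9uMw=", "cHl0aG9uMw"):
    for padded in (True, False):
        try:
            b64decode_padded(text, padded=padded)
            outcome = "ok"
        except binascii.Error:
            outcome = "error"
        print(f"{text!r:16} padded={padded!s:5} -> {outcome}")
```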

Once there’s consensus, I’ll resume working on the patch : )

Edit: the base64 command line utility, written by the same guy who wrote the RFC, happily decodes both cHl0aG9uMw== and cHl0aG9uMw, but throws an error with cHl0aG9uMw=
