How does Python's binascii.a2b_base64 (base64.b64decode) work?

I read the CPython source code and wrote a function that is meant to behave the same as Python's binascii.a2b_base64 function:

static PyObject *
binascii_a2b_base64_impl(PyObject *module, Py_buffer *data, int strict_mode)

I reimplemented this function in C++, following the original Python function as closely as I could, in order to better understand how Base64 decoding works.

However, my implementation does not handle non-Base64 characters correctly, and I don't know why.

I left the _PyBytesWriter helpers (_PyBytesWriter_Init, _PyBytesWriter_Alloc, _PyBytesWriter_Finish, …) out of my code, because after reading them I concluded that they do not affect how non-Base64 characters are handled.

When the input is a Base64 string that complies with RFC 4648, or when \n is the only non-Base64 character in it, my function produces the same result as the corresponding Python function.
For example:

const char *encoded = {
    "QUJDREVGR0hJSktMTU5PUFFSU1RVVldYWVpBQkNERUZHSElKS0xNTk9QUVJTVFVW\n"
    "V1hZWkFCQ0RFRkdISUpLTE1OT1BRUlNUVVZXWFlaQUJDREVGR0hJSktMTU5PUFFS\n"
    "U1RVVldYWVpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWg==\n"
};

With this input, both my function and Python's binascii.a2b_base64 return the following:

ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ

Here is the specific implementation of my code:

#define BASE64PAD '='

constexpr uint8_t b64de_table[256] = {
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255, 62, 255,255,255, 63,
    52 , 53, 54, 55,  56, 57, 58, 59,  60, 61,255,255, 255,  0,255,255,

    255,  0,  1,  2,   3,  4,  5,  6,   7,  8,  9, 10,  11, 12, 13, 14,
    15 , 16, 17, 18,  19, 20, 21, 22,  23, 24, 25,255, 255,255,255,255,
    255, 26, 27, 28,  29, 30, 31, 32,  33, 34, 35, 36,  37, 38, 39, 40,
    41 , 42, 43, 44,  45, 46, 47, 48,  49, 50, 51,255, 255,255,255,255,

    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,

    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255};

uint8_t *
pyBase64Decode(const char *buffer, size_t &length,
               bool strict_mode = false)
{
    std::string error_message;

    const uint8_t *ascii_data = (const uint8_t *)buffer;
    size_t ascii_len = length;
    bool padding_started = 0;

    size_t bin_len = ascii_len / 4 * 3; 
    uint8_t *bin_data = new (std::nothrow) uint8_t[bin_len + 1];
    if(!bin_data) {
        throw std::runtime_error("Failed to allocate memory for bin_data.");
    }
    uint8_t *bin_data_start = bin_data;
    bin_data[bin_len] = 0x0;

    uint8_t leftchar = 0;
    uint32_t quad_pos = 0;
    uint32_t pads = 0;

    if(strict_mode && (ascii_len > 0) && (*ascii_data == BASE64PAD)) {
        error_message = "Leading padding not allowed.";
        goto error_end;
    }

    size_t i;
    uint8_t this_ch;
    for(i = 0; i < ascii_len; ++i) {
        this_ch = ascii_data[i];

        if(this_ch == BASE64PAD) {
            padding_started = true;
            // If the current character is a padding character, the length
            // will be reduced by one to obtain the decoded true length.
            bin_len--;

            if(strict_mode && (!quad_pos)) {
                error_message = "Excess padding not allowed.";
                goto error_end;
            }

            if((quad_pos >= 2) && (quad_pos + (++pads) >= 4)) {

                if(strict_mode && ((i + 1) < ascii_len)) {
                    error_message = "Excess data after padding.";
                    goto error_end;
                }

                goto done;
            }

            continue;
        }

        this_ch = b64de_table[this_ch];
        if(this_ch == 255) {
            if(strict_mode) {
                error_message = "Only base64 data is allowed.";
                goto error_end;
            }
            continue;
        }

        if(strict_mode && padding_started) {
            error_message = "Discontinuous padding not allowed.";
            goto error_end;
        }

        pads = 0;

        switch(quad_pos) {
        case 0:
            quad_pos = 1;
            leftchar = this_ch;
            break;
        case 1:
            quad_pos = 2;
            *bin_data++ = (leftchar << 2) | (this_ch >> 4);
            leftchar = this_ch & 0xf;
            break;
        case 2:
            quad_pos = 3;
            *bin_data++ = (leftchar << 4) | (this_ch >> 2);
            leftchar = this_ch & 0x3;
            break;
        case 3:
            quad_pos = 0;
            *bin_data++ = (leftchar << 6) | (this_ch);
            leftchar = 0;
            break;
        }
    }

    if(quad_pos) {
        if(quad_pos == 1) {
            char tmpMsg[128]{};
            snprintf(tmpMsg, sizeof(tmpMsg),
                    "Invalid base64-encoded string: "
                    "number of data characters (%zd) cannot be 1 more "
                    "than a multiple of 4",
                    (bin_data - bin_data_start) / 3 * 4 + 1);
            error_message = tmpMsg;
            goto error_end;
        } else {
            error_message = "Incorrect padding.";
            goto error_end;
        }
        error_end:
        delete[] bin_data;
        throw std::runtime_error(error_message);
    }

done:
    length = bin_len;
    return bin_data_start;
}

How to use this function:

int main()
{
    const char *encoded = "aGVsbG8sIHdvcmxkLg==";
    size_t length = strlen(encoded);
    uint8_t *decoded = pyBase64Decode(encoded, length);
    printf("decoded: %s\n", decoded);
    return 0;
}

Here are a few samples where Python and my code give different results.

original decoded:

stackoverflow

original encoded:

c3RhY2tvdmVyZmxvdw==

sample 1:

original: "c3##RhY2t…vdmV!?y~Zmxvdw=="
result of Python: "stackoverflow"
result of pyBase64Decode: "stackoverflowP" [1]
result of pyBase64Decode: "stackoverflow" [2], but length: 19

sample 2:

original: "c3\n\nRh~Y2tvd#$mVyZmx$vdw=="
result of Python: "stackoverflow"
result of pyBase64Decode: "stackoverflow" [1], but length: 16

sample 3:

original: "c3Rh$$$$$$$$$$$$$$$$$$$$$Y2tvdmVy###############Zmxvdw=="
result of Python: "stackoverflow"
result of pyBase64Decode: "stackoverflowP\2;SP2;SPROFILE_" [1], length: 40 bytes
result of pyBase64Decode: "stackoverflow" [2], but length: 40


  [1] Printed with: cout << std::string((char *)result, length) << endl;

  [2] Printed with: printf("%s", result);

You have a C++ code problem and you are asking the Python community for help?

:frowning: yes, please help me.

I don't understand why my code, which I reproduced almost entirely from Python's source code, behaves differently from Python's actual implementation.

I see C++ code in your first post, not Python code. What am I misunderstanding?

This may not come across as polite, but I really need help.
I currently have no questions or confusion about using Python as a programming language.
I am a developer who uses C, C++, Python, and C#. Recently I became curious about how Python implements Base64 decoding, so I browsed CPython's C source code (one feature surprised me) and tried to replicate it exactly in my own code. After writing the code, though, my results did not match. I first asked on StackOverflow but did not receive a valid answer. I then opened an issue on CPython's GitHub page, and an administrator told me to ask here instead.

Well, the first thing I see is that your initial calculation of bin_len is different. The original uses ((ascii_len+3)/4)*3, while you are using ascii_len / 4 * 3. This probably explains the off-by-one mistakes you are seeing.
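
As a rough illustration of how the two formulas differ, here is a small sketch. The function names are made up for this example and do not come from CPython. bytes_written models how many bytes a decoding loop like the one above stores for n valid data characters (no padding, no junk), which can exceed the rounded-down bound when n is not a multiple of 4.

```cpp
#include <cstddef>

// Hypothetical helper names, for illustration only.

// Rounds down: ignores a trailing, incomplete group of input characters.
constexpr std::size_t bound_round_down(std::size_t ascii_len) {
    return ascii_len / 4 * 3;
}

// Rounds up, as in the formula quoted above: every started group of
// 4 input characters reserves 3 output bytes.
constexpr std::size_t bound_round_up(std::size_t ascii_len) {
    return (ascii_len + 3) / 4 * 3;
}

// Bytes stored after n valid data characters: 3 per complete group,
// plus 0, 1 or 2 for a trailing group of 1, 2 or 3 characters.
constexpr std::size_t bytes_written(std::size_t n) {
    return n / 4 * 3 + (n % 4 ? n % 4 - 1 : 0);
}

// With 7 data characters the loop stores 5 bytes before it can report
// incorrect padding; rounding down reserves only 3, rounding up 6.
static_assert(bytes_written(7) == 5, "5 bytes stored");
static_assert(bound_round_down(7) == 3, "too small");
static_assert(bound_round_up(7) == 6, "large enough");
```

Note that neither formula is the decoded length itself. Both are only sizes for the initial allocation.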

I tried changing the calculation to the one Python uses, but it made no difference.

Aha, yeah, the way the Python code adjusts the final length is hidden inside _PyBytesWriter_Finish. It remembers the starting location and is passed the final location, from which it calculates the correct length.

This fixes your program:

    done:
    length = bin_data - bin_data_start;
    return bin_data_start;

The garbage characters you are seeing are just reading uninitialized memory returned from new[].
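
To show why counting '=' characters is not enough, here is a stripped-down sketch of a lenient decoder. It is not the CPython code: lenient_decode and b64_value are hypothetical names, there is no strict mode and no error reporting, and it assumes an ASCII character set. The initial size is only an upper bound that counts every input character, including the skipped ones, so the real length has to come from the final write position.

```cpp
#include <cstddef>
#include <string>

// Map one character to its 6-bit value, or -1 if it is not in the
// standard base64 alphabet. Assumes ASCII.
static int b64_value(unsigned char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Simplified lenient decoder (hypothetical): skips characters outside
// the alphabet and silently ignores malformed trailing groups.
std::string lenient_decode(const std::string &in) {
    std::string out((in.size() + 3) / 4 * 3, '\0');  // upper bound only
    char *start = &out[0];
    char *p = start;
    unsigned leftchar = 0;
    int quad_pos = 0;
    int pads = 0;
    for (unsigned char c : in) {
        if (c == '=') {
            // Stop once the padding completes the current group.
            if (quad_pos >= 2 && quad_pos + ++pads >= 4) break;
            continue;
        }
        int v = b64_value(c);
        if (v < 0) continue;  // skipped: contributes no output bytes
        unsigned uv = static_cast<unsigned>(v);
        pads = 0;
        switch (quad_pos) {
        case 0: leftchar = uv; quad_pos = 1; break;
        case 1: *p++ = static_cast<char>((leftchar << 2) | (uv >> 4));
                leftchar = uv & 0xf; quad_pos = 2; break;
        case 2: *p++ = static_cast<char>((leftchar << 4) | (uv >> 2));
                leftchar = uv & 0x3; quad_pos = 3; break;
        case 3: *p++ = static_cast<char>((leftchar << 6) | uv);
                leftchar = 0; quad_pos = 0; break;
        }
    }
    // The real length is the distance the write pointer travelled.
    out.resize(static_cast<std::size_t>(p - start));
    return out;
}
```

With the third sample input from the first post, the upper bound is 42 bytes, but the returned string has size 13, because the skipped '$' and '#' characters never advance the write pointer.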

This solution can solve this problem, but do you have any good solutions for initially requesting more memory space?

Not sure what you mean? Why would you request more memory space? The initial bin_len computed in the CPython version is a guaranteed upper bound. You could consider reducing the size of the allocation at the end.

Sorry, English is not my first language. I used translation software, which may be why I did not express myself well.

Specifically, uint8_t *bin_data = new (std::nothrow) uint8_t[bin_len + 1]; allocates more memory than the final length = bin_data - bin_data_start; actually needs.

I have come up with two solutions, but neither seems very good. The first is to allocate a new buffer of the final length and copy the contents of bin_data into it. The second is to allocate bin_data with malloc from the start and shrink it with realloc at the end.
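
For what it is worth, here is a minimal sketch of the second idea, plus a std::vector variant that avoids manual memory management. Both helper names (shrink_to_used, trim_to_used) are made up for this example, and the sketch assumes the buffer was obtained from std::malloc and holds at least used + 1 bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical helper: `buf` must come from std::malloc and hold at
// least `used + 1` bytes. Terminates the data, then returns the excess
// memory to the allocator.
std::uint8_t *shrink_to_used(std::uint8_t *buf, std::size_t used) {
    buf[used] = 0;  // terminate first, while the large block is valid
    void *smaller = std::realloc(buf, used + 1);
    // If realloc fails, the original block is untouched and still usable.
    return smaller ? static_cast<std::uint8_t *>(smaller) : buf;
}

// Hypothetical helper: the same idea with std::vector, which tracks
// its own size so no separate length variable is needed.
void trim_to_used(std::vector<std::uint8_t> &out, std::size_t used) {
    out.resize(used);     // logical size becomes the decoded length
    out.shrink_to_fit();  // non-binding request to drop spare capacity
}
```

The caller still has to release the malloc-based buffer with std::free, not delete[]. The vector variant frees itself, and since the over-allocation is at most a few bytes per input character, leaving the spare capacity in place is also a reasonable choice.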