How does Python's binascii.a2b_base64 (base64.b64decode) work?

sngrotesque · June 22, 2024, 3:01pm

I read the CPython source code and implemented a function that is meant to behave the same as Python’s binascii.a2b_base64 function.

static PyObject *
binascii_a2b_base64_impl(PyObject *module, Py_buffer *data, int strict_mode)

I reimplemented this function in C++ in my own code, following the original function in Python, in order to better understand how Base64 decoding works.

However, I don’t know why the function I implemented does not handle non-Base64 characters correctly.

I checked helper functions such as _PyBytesWriter_Init, _PyBytesWriter_Alloc, _PyBytesWriter_Finish, … and concluded that they do not affect how the function handles non-Base64 characters, so I left them out of my code.

When the input is a Base64 string that complies with RFC 4648, or when \n is the only non-Base64 character it contains, my function produces the same result as the corresponding function in Python.
For example:

const char *encoded = {
    "QUJDREVGR0hJSktMTU5PUFFSU1RVVldYWVpBQkNERUZHSElKS0xNTk9QUVJTVFVW\n"
    "V1hZWkFCQ0RFRkdISUpLTE1OT1BRUlNUVVZXWFlaQUJDREVGR0hJSktMTU5PUFFS\n"
    "U1RVVldYWVpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWg==\n"
};

Using either my function or Python’s binascii.a2b_base64 function will yield the same result as the following:

ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ

Here is the specific implementation of my code:

#define BASE64PAD '='

constexpr uint8_t b64de_table[256] = {
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255, 62, 255,255,255, 63,
    52 , 53, 54, 55,  56, 57, 58, 59,  60, 61,255,255, 255,  0,255,255,

    255,  0,  1,  2,   3,  4,  5,  6,   7,  8,  9, 10,  11, 12, 13, 14,
    15 , 16, 17, 18,  19, 20, 21, 22,  23, 24, 25,255, 255,255,255,255,
    255, 26, 27, 28,  29, 30, 31, 32,  33, 34, 35, 36,  37, 38, 39, 40,
    41 , 42, 43, 44,  45, 46, 47, 48,  49, 50, 51,255, 255,255,255,255,

    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,

    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255,
    255,255,255,255, 255,255,255,255, 255,255,255,255, 255,255,255,255};

uint8_t *
pyBase64Decode(const char *buffer, size_t &length,
               bool strict_mode = false)
{
    std::string error_message;

    const uint8_t *ascii_data = (const uint8_t *)buffer;
    size_t ascii_len = length;
    bool padding_started = 0;

    size_t bin_len = ascii_len / 4 * 3; 
    uint8_t *bin_data = new (std::nothrow) uint8_t[bin_len + 1];
    if(!bin_data) {
        throw std::runtime_error("Failed to allocate memory for bin_data.");
    }
    uint8_t *bin_data_start = bin_data;
    bin_data[bin_len] = 0x0;

    uint8_t leftchar = 0;
    uint32_t quad_pos = 0;
    uint32_t pads = 0;

    if(strict_mode && (ascii_len > 0) && (*ascii_data == BASE64PAD)) {
        error_message = "Leading padding not allowed.";
        goto error_end;
    }

    size_t i;
    uint8_t this_ch;
    for(i = 0; i < ascii_len; ++i) {
        this_ch = ascii_data[i];

        if(this_ch == BASE64PAD) {
            padding_started = true;
            // If the current character is a padding character, the length
            // will be reduced by one to obtain the decoded true length.
            bin_len--;

            if(strict_mode && (!quad_pos)) {
                error_message = "Excess padding not allowed.";
                goto error_end;
            }

            if((quad_pos >= 2) && (quad_pos + (++pads) >= 4)) {

                if(strict_mode && ((i + 1) < ascii_len)) {
                    error_message = "Excess data after padding.";
                    goto error_end;
                }

                goto done;
            }

            continue;
        }

        this_ch = b64de_table[this_ch];
        if(this_ch == 255) {
            if(strict_mode) {
                error_message = "Only base64 data is allowed.";
                goto error_end;
            }
            continue;
        }

        if(strict_mode && padding_started) {
            error_message = "Discontinuous padding not allowed.";
            goto error_end;
        }

        pads = 0;

        switch(quad_pos) {
        case 0:
            quad_pos = 1;
            leftchar = this_ch;
            break;
        case 1:
            quad_pos = 2;
            *bin_data++ = (leftchar << 2) | (this_ch >> 4);
            leftchar = this_ch & 0xf;
            break;
        case 2:
            quad_pos = 3;
            *bin_data++ = (leftchar << 4) | (this_ch >> 2);
            leftchar = this_ch & 0x3;
            break;
        case 3:
            quad_pos = 0;
            *bin_data++ = (leftchar << 6) | (this_ch);
            leftchar = 0;
            break;
        }
    }

    if(quad_pos) {
        if(quad_pos == 1) {
            char tmpMsg[128]{};
            snprintf(tmpMsg, sizeof(tmpMsg),
                    "Invalid base64-encoded string: "
                    "number of data characters (%zd) cannot be 1 more "
                    "than a multiple of 4",
                    (bin_data - bin_data_start) / 3 * 4 + 1);
            error_message = tmpMsg;
            goto error_end;
        } else {
            error_message = "Incorrect padding.";
            goto error_end;
        }
        error_end:
        delete[] bin_data;
        throw std::runtime_error(error_message);
    }

done:
    length = bin_len;
    return bin_data_start;
}

How to use this function:

int main()
{
    const char *encoded = "aGVsbG8sIHdvcmxkLg==";
    size_t length = strlen(encoded);
    uint8_t *decoded = pyBase64Decode(encoded, length);
    printf("decoded: %s\n", decoded);
    return 0;
}

Here are a few samples for which Python and my code produce different results.

original decoded:

stackoverflow

original encoded:

c3RhY2tvdmVyZmxvdw==

sample 1:

original	"c3##RhY2t…vdmV!?y~Zmxvdw=="
result of python	"stackoverflow"
result of pyBase64Decode [1]	"stackoverflowP"
result of pyBase64Decode [2]	"stackoverflow", but length: 19

sample 2:

original	"c3\n\nRh~Y2tvd#$mVyZmx$vdw=="
result of python	"stackoverflow"
result of pyBase64Decode [1]	"stackoverflow", but length: 16

sample 3:

original	"c3Rh$$$$$$$$$$$$$$$$$$$$$Y2tvdmVy###############Zmxvdw=="
result of python	"stackoverflow"
result of pyBase64Decode [1]	"stackoverflowP\2;SP2;SPROFILE_", length: 40 bytes
result of pyBase64Decode [2]	"stackoverflow", but length: 40

[1] printed with: cout << std::string((char *)result, length) << endl;
[2] printed with: printf("%s", result);
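The two print calls treat the buffer differently, which is why the same decode can look right with one and wrong with the other. A small sketch of the difference (as_counted and as_c_string are hypothetical helper names used only for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// What a length-aware print emits: exactly `length` bytes, including any
// bytes that were never written by the decoder.
std::string as_counted(const unsigned char *buf, std::size_t length) {
    assert(buf != nullptr);
    return std::string(reinterpret_cast<const char *>(buf), length);
}

// What printf("%s") emits: everything before the first zero byte,
// regardless of the reported length.
std::string as_c_string(const unsigned char *buf) {
    assert(buf != nullptr);
    return std::string(reinterpret_cast<const char *>(buf));
}
```

If the reported length is larger than the number of bytes actually decoded, the counted form shows the extra bytes, while the C-string form hides them whenever a zero byte happens to follow the decoded data.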

barry-scott · June 22, 2024, 3:40pm

You have a C++ code problem and you are asking for help from the Python community?

sngrotesque · June 22, 2024, 9:36pm

Yes, please help me.

sngrotesque · June 22, 2024, 9:45pm

Because I don’t understand why my reproduction, which is based almost entirely on Python’s source code, behaves differently from the actual implementation in Python.

barry-scott · June 22, 2024, 10:10pm

I see C++ code in your first post, not Python code. What am I misunderstanding?

sngrotesque · June 22, 2024, 10:19pm

It may not be polite of me to say this, but I really need help.
I currently have no questions or confusion about using Python as a programming language.
I am a developer who uses C, C++, Python, and C#. Recently I became curious about how Python implements Base64 decoding, so I browsed CPython’s C source code (because one feature surprised me) and tried to replicate it in my own code. After writing the code, however, the results did not match. I first asked on StackOverflow but did not receive a valid answer. I then opened an issue on the CPython GitHub page, where an administrator told me to ask here instead.

MegaIng · June 22, 2024, 10:41pm

Well, the first thing I see is that your initial calculation of bin_len is different. The original uses ((ascii_len+3)/4)*3, while you are using ascii_len / 4 * 3. This probably explains the off-by-one mistakes you are seeing.
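To make the difference between the two formulas concrete, here is a small sketch. The function names are made up for illustration; the per-character byte counts follow the switch statement in the posted code (no byte after the first character of a quad, one byte after each of the next three).

```cpp
#include <cassert>
#include <cstddef>

// Estimate used in the question: a trailing partial quad is dropped.
constexpr std::size_t truncated_estimate(std::size_t ascii_len) {
    return ascii_len / 4 * 3;
}

// Rounded-up estimate quoted above: every started quad counts as 3 bytes.
constexpr std::size_t upper_bound(std::size_t ascii_len) {
    return (ascii_len + 3) / 4 * 3;
}

// Bytes the decoding loop has written after n valid Base64 characters.
std::size_t bytes_written(std::size_t n) {
    static const std::size_t partial[4] = {0, 0, 1, 2};
    const std::size_t written = n / 4 * 3 + partial[n % 4];
    assert(written <= upper_bound(n));  // the rounded-up estimate always suffices
    return written;
}
```

For a 3-character input such as "QUI", truncated_estimate gives 0 while the loop writes 2 bytes before the padding error is detected, so the truncating formula can under-allocate. For inputs whose length is a multiple of 4, both formulas agree.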

sngrotesque · June 22, 2024, 10:44pm

I tried changing the calculation to the one used in Python, but it did not help.

MegaIng · June 22, 2024, 11:01pm

Aha, yeah, the way the Python code adjusts the final length is hidden inside of _PyBytesWriter_Finish. It remembers the starting location, and the final location is passed in, from which it calculates the correct length.

This fixes your program:

    done:
    length = bin_data - bin_data_start;
    return bin_data_start;

The garbage characters you are seeing are just reading uninitialized memory returned from new[].
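For reference, a condensed sketch of the non-strict path with that fix applied might look like the following. It mirrors the code posted in this thread rather than CPython's exact implementation; decode_lenient and make_table are made-up names, std::vector stands in for new[], and strict mode is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Reverse lookup table; 255 marks "not a Base64 character".
static std::array<uint8_t, 256> make_table() {
    std::array<uint8_t, 256> t;
    t.fill(255);
    const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (std::size_t i = 0; i < alphabet.size(); ++i)
        t[static_cast<unsigned char>(alphabet[i])] = static_cast<uint8_t>(i);
    return t;
}

// Non-strict decode: characters outside the alphabet are skipped.
std::vector<uint8_t> decode_lenient(const std::string &ascii) {
    static const std::array<uint8_t, 256> table = make_table();

    std::vector<uint8_t> out((ascii.size() + 3) / 4 * 3);  // upper bound
    uint8_t *p = out.data();
    uint8_t *const start = p;
    uint8_t leftchar = 0;
    unsigned quad_pos = 0, pads = 0;

    for (unsigned char ch : ascii) {
        if (ch == '=') {
            // A complete pad sequence ends decoding, as in the posted code.
            if (quad_pos >= 2 && quad_pos + ++pads >= 4) { quad_pos = 0; break; }
            continue;
        }
        const uint8_t v = table[ch];
        if (v == 255) continue;  // ignore non-Base64 characters
        pads = 0;
        switch (quad_pos) {
        case 0: leftchar = v; quad_pos = 1; break;
        case 1: *p++ = static_cast<uint8_t>((leftchar << 2) | (v >> 4));
                leftchar = v & 0x0f; quad_pos = 2; break;
        case 2: *p++ = static_cast<uint8_t>((leftchar << 4) | (v >> 2));
                leftchar = v & 0x03; quad_pos = 3; break;
        case 3: *p++ = static_cast<uint8_t>((leftchar << 6) | v);
                leftchar = 0; quad_pos = 0; break;
        }
    }
    if (quad_pos != 0) throw std::runtime_error("Incorrect padding.");

    assert(static_cast<std::size_t>(p - start) <= out.size());
    out.resize(static_cast<std::size_t>(p - start));  // length = end - start
    return out;
}
```

Because the length comes from the write pointer, skipped characters no longer inflate it. Using std::vector also avoids calling delete[] on bin_data after it has been advanced, which the error path in the posted code does.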

sngrotesque · June 23, 2024, 3:05am

This solution can solve this problem, but do you have any good solutions for initially requesting more memory space?

MegaIng · June 23, 2024, 3:07am

Not sure what you mean? Why would you request more memory space? The initial bin_len computed in the CPython version is a guaranteed upper bound. You could consider reducing the size of the allocation at the end.

sngrotesque · June 23, 2024, 3:12am

Sorry, English is not my main language. I used translation software, which may be why I did not express myself well.

Specifically, in uint8_t *bin_data = new (std::nothrow) uint8_t[bin_len + 1];, I allocate more memory than the final length = bin_data - bin_data_start; requires, so the buffer ends up larger than it needs to be.

sngrotesque · June 23, 2024, 3:17am

I have come up with two solutions to this, but neither seems very good. The first is to create a new pointer, allocate memory of the final determined length, and then copy the contents of bin_data over. The second is to allocate bin_data with malloc from the beginning and then use realloc at the end to adjust the memory size.
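A minimal sketch of the second option, assuming the buffer was obtained from malloc (shrink_to_used is a hypothetical helper name, not something from CPython):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Shrink a malloc'd buffer from its upper-bound size down to the number of
// bytes actually used. If realloc fails, the original block is still valid,
// so it is returned unchanged.
uint8_t *shrink_to_used(uint8_t *buf, std::size_t used) {
    assert(buf != nullptr);
    // Requesting zero bytes from realloc is not portable, so keep at least one.
    void *smaller = std::realloc(buf, used != 0 ? used : 1);
    return smaller != nullptr ? static_cast<uint8_t *>(smaller) : buf;
}
```

A buffer handled this way must be released with free, not delete[], and memory obtained from new[] must not be passed to realloc. The first option (allocate the final size and copy) is what std::vector or std::string would do for you if the result is returned in one of those types.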