Base-utf8 encoding without escape sequences?

Base64 maps binary data to the [a-zA-Z0-9=/+] space of characters. I wonder: is there some codec library that maps binary data to the space of valid Python UTF-8 literals that don’t require escaping?

I want to represent binary data as the most compact Python Unicode literal. Is there something ready-made that does that?

Do not use text at all if the binary data must be as small as possible.
Think about compressing the binary data.

If you must have a text encoding of the data, what damage do you need to
protect against?

For example, base64 was designed to survive the damage that email and HTTP
header processing will do to binary data. Damage like having the top bit of each byte set to 0,
or having bytes stripped or replaced, for example.

Once you know what the damage will be you can do better than base64, if your requirements allow.
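A quick illustration of the damage model base64 was designed for: every character of the encoded form is printable 7-bit ASCII, so a transport that zeroes the top bit of each byte cannot corrupt it silently. A minimal sketch:

```python
import base64

data = bytes(range(256))  # every possible byte value
encoded = base64.b64encode(data)

# The encoded form uses only printable 7-bit ASCII, so a transport
# that strips the top bit of each byte leaves it untouched.
assert all(b < 0x80 for b in encoded)

# And it round-trips back to the original bytes.
assert base64.b64decode(encoded) == data
```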

Using Unicode is unlikely to be the solution as it uses code points that do not fit in a byte.

You need up to 21 bits to represent a Unicode code point, but data transmission and storage are in bytes, 8 bits at a time.

1 Like

Hi Barry, I appreciate that you took the time to read my question and answer me. I really do! I also appreciate that you want to save me from myself, that is very valiant of you! It’s just that I’m a bit tired of the game: person asks question → person immediately has to justify why and gets lectured on how what they want is wrong. So please accept that I have that unusual need and don’t want something else. I hope you don’t think it rude of me to rather avoid having a deep discussion on the philosophy of proper software engineering right now. Sorry. Maybe the better way would be to answer the question that was asked first, and then explain the dangers and that there might be a better way.
Surprisingly, it is really not about the bit-size to me; I actually care about a visually compact representation of the data as a Python Unicode literal and a transformation back. I should have made that clearer.

Hi Barry, I appreciate that you took the time to read my question and
answer me. I really do! I also appreciate that you want to save me
from myself, that is very valiant of you! It’s just that I’m a bit
tired of the game: person asks question → person immediately has to justify why and gets lectured on how what they want is wrong.

Barry’s explaining that the purpose of encodings is embedding the string
in some transport, and that how you choose to encode depends on your
objective.

Also, very often, people ask for a specific technical approach to some
larger undescribed problem which often has a better technical solution.
So we often ask about the context.

So please accept that I have that unusual need and don’t want something
else. […]
Surprisingly, it is really not about the bit-size to me, I actually care
about a visually compact representation of the data as a python unicode
literal. I should have made that clearer.

And here’s the larger context. Thank you.

Note that UTF-8 is a binary encoding with no relationship to your
“visually compact” objective. You just want “Unicode text legal in a Python
literal”.

The specification for a Python Unicode literal is here:

It suggests that you can possibly use any Unicode character except the
quote, the backslash and the newline. You’d probably do well with
something very simple which escapes (e.g. with a backslash) just those 3
characters.
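A minimal sketch of that idea, assuming a Latin-1 mapping from bytes to code points (my assumption, not from the post) and escaping only those three problem characters:

```python
# Sketch: map each byte to the code point of the same value (Latin-1)
# and backslash-escape only the backslash, the quote and the newline.
def encode(data: bytes) -> str:
    text = data.decode("latin-1")
    # Escape the backslash first so later escapes aren't double-escaped.
    text = text.replace("\\", "\\\\")
    text = text.replace("'", "\\'")
    # A raw backslash-newline would be a line continuation and vanish,
    # so encode the newline as the two characters backslash and "n".
    text = text.replace("\n", "\\n")
    return text

def decode(text: str) -> bytes:
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars)
            out.append("\n" if nxt == "n" else nxt)
        else:
            out.append(ch)
    return "".join(out).encode("latin-1")

assert decode(encode(bytes(range(256)))) == bytes(range(256))
```

Only three byte values grow by one character, so for random data this is far denser visually than base64, though control characters may still render oddly.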

Or you could get very fancy, and run zlib.compress on your data and
then encode as a string. That will often be smaller.
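A sketch of that combination, here pairing `zlib` with base85 (the choice of base85 is mine, not from the post):

```python
import base64
import zlib

data = b"hello world " * 100  # repetitive data compresses well

# Compress first, then encode the compressed bytes as text.
packed = base64.b85encode(zlib.compress(data)).decode("ascii")

# Round-trip: decode the text, then decompress.
assert zlib.decompress(base64.b85decode(packed)) == data

# For compressible input this is far shorter than plain base64
# of the raw bytes.
assert len(packed) < len(base64.b64encode(data))
```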

However, keep in mind that base64 and the like are chosen not just to
get through many email systems with varying character sets and 8-bit
cleanliness but also to be human readable. The more characters you use
beyond a core set, the more visual ambiguity there can be to someone
reading the text, and this can depend a lot on fonts too. Is that an
“i”, an “I”, an “l”, a “1”? A “0” or an “O”? And that’s without moving
beyond the ASCII Latin letters and Arabic numerals.

So: should your encoding be visually clear to a human reader, eg to
someone reading it aloud to another person or debugging an encoding
problem? Maybe not, but you should consider it.

Also, you will want to write some tests to check that things round
trip through your encoding and back to the original bytes, and also that
Python accepts the string literals you’re generating.
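A sketch of such a test, using base64 as a stand-in encoder and `ast.literal_eval` to confirm that Python actually accepts the generated literal:

```python
import ast
import base64

data = b"\x00\xffarbitrary bytes"

# Stand-in encoder: any scheme producing a str would do here.
text = base64.b64encode(data).decode("ascii")

# repr() produces a Python string literal; literal_eval confirms
# Python accepts it and that the text survives unchanged.
literal = repr(text)
assert ast.literal_eval(literal) == text

# And the full round trip recovers the original bytes.
assert base64.b64decode(ast.literal_eval(literal)) == data
```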

Cheers,
Cameron Simpson cs@cskk.id.au

1 Like

I guess that the actual practical purpose for this is to embed binary data in JSON, which uses a very similar string format to Python’s string literals (but only using double quotes as delimiters). I’ve considered this problem a fair bit in the past.

You really can’t save a lot by following the road you have in mind. I made a start at designing something that exploits UTF-8, but it really only pushes the limit to 6 encoding bits for 5 data bits (base64 uses 4 for 3; there is base85 that gets 5 for 4) and is quite complex.

Aside from being a variable-length encoding, UTF-8 wastes a lot of bits on state encoding (if you randomly index the data, you can instantly tell whether you are inside a multi-byte sequence by examining the current byte) and a lot of code points are just not assigned or will have weird display effects. Just the state encoding is already a massive limitation: if you only used two-byte UTF-8 sequences, you would not even be able to beat base64, even by using unassigned code points (each pair of bytes will use five bits on state encoding).
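The density figures quoted above are easy to check directly against the standard library:

```python
import base64

data = bytes(300)  # 300 bytes, a multiple of both 3 and 4

# base64: 4 output characters per 3 input bytes.
assert len(base64.b64encode(data)) == 400

# base85: 5 output characters per 4 input bytes.
assert len(base64.b85encode(data)) == 375
```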

One of my major use cases for it would be to inline binary data as unobtrusively as possible into single-script files for easy sharing. But it’s not the only one. I’ve managed a solution in the meantime but am still curious as to whether there’s a better one.

I’m looking for a codec library that can efficiently map binary data to valid Python UTF-8 literals without requiring any escaping. While Base64 is common for encoding binary data, I’m specifically interested in a method that directly maps to Python Unicode literals, optimizing for compactness. Has anyone come across a codec library or a workaround that achieves this? My goal is to represent binary data in the most space-efficient form within Python strings.

zlib has already been mentioned in this thread. That’s almost certainly the most compact way to represent text. If it’s not text but arbitrary binary data (maybe FLAC audio data, which is already compressed and won’t zip down usefully), Base 64 or Base 85 would most likely be your best bets. What is a ā€œUTF-8 literalā€? Are you talking about text or not?

As I understand the OP, it’s to minimise the text as seen in a Unicode editor.
It is not to make the .py file small.

I embed icons into .py files for my PyQt apps,
which I machine-generate and put in their own file using base64.
As I never need to edit that file, the minimisation is not an issue for me.
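A hedged sketch of that generate-a-module approach (the icon bytes and names below are made up for illustration; in practice the payload comes from a real .png file):

```python
import base64

# Hypothetical icon bytes; a real generator would read these
# from an image file on disk.
icon_bytes = b"\x89PNG\r\n\x1a\n...not a real icon..."

# Machine-generate Python source holding the base64 payload.
# base64 output contains no quotes or backslashes, so it is
# always safe inside a plain string literal.
module_source = 'ICON_PNG = "%s"\n' % base64.b64encode(icon_bytes).decode("ascii")

# The app would import the generated module and decode at runtime;
# exec() stands in for the import here.
namespace = {}
exec(module_source, namespace)
assert base64.b64decode(namespace["ICON_PNG"]) == icon_bytes
```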

Yes, that is what I want in a nutshell. I’m aware that not all glyphs are supported by all fonts, but mapping bytes to a reasonably well supported subset of usually displayable, 1-char wide glyphs, would be nice.

ā€œ1-char wide glyphsā€? Okay, here’s a stupid strategy then.

data = b"put your arbitrary binary data here"
import base64
# Map the 64 base64 alphabet characters onto the 64 combining
# diacritical marks U+0300..U+033F; "=" padding is left untouched.
tr = str.maketrans('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',
	''.join(chr(x) for x in range(0x300, 0x340)))
# Prepend one base character for all the combining marks to stack on.
string = "a" + base64.b64encode(data).decode("ascii").translate(tr)

This gives you a VERY compact string literal - at least visually. To decode, reverse the procedure:

# Sample output of the scheme above, written with \u escapes so the
# combining marks survive copy-and-paste; this one decodes to b"Hi".
string = 'a\u0312\u0306\u0324='
import base64
tr = str.maketrans(''.join(chr(x) for x in range(0x300, 0x340)),
	'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/')
data = base64.b64decode(string[1:].translate(tr).encode("ascii"))

Does that count?

2 Likes