Expression for embedding static, relative-path content

Klaider · June 12, 2022, 6:55pm

I wish it were possible to include an octet stream or plain text in a Python script, based on a relative path, without any runtime instructions. As far as I know, some languages have this feature: Rust (let some_text = include_str!("./data.txt");, let some_ba = include_bytes!("./data.bin");), ActionScript ([Embed]) and .NET languages (which combine project configuration and a resource API for that).

Python’s compiled bytecode gives the language a good opportunity to have this feature. The bytecode format could easily contain an octet stream.

What I want would be something like:

some_ba = embed './data.bin': bytearray
some_text = embed './data.txt': str

If the script is located at project/src/lib.py, then some_ba resolves to the content located at project/src/data.bin.

(Note: embed could be a contextual keyword followed by a string literal.)

guido · June 12, 2022, 7:18pm

IIUC the Rust version of this just constructs a string literal from the contents of the indicated file at compile time. The Python compiler could also do something like that. (But the fact that we use bytecode doesn’t make it easier.)
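As a rough illustration of "constructing a literal from the file contents at compile time", here is a minimal sketch that does it as a preprocessing step with the standard ast module. The embed("...") call form, the InlineEmbeds class and the compile_with_embeds helper are all hypothetical names invented for this sketch; nothing like them exists in CPython, and the proposed embed keyword syntax is not valid Python today.

```python
import ast
import tempfile
from pathlib import Path


class InlineEmbeds(ast.NodeTransformer):
    """Replace calls to a hypothetical embed("path") with the file's bytes."""

    def __init__(self, base_dir: Path):
        self.base_dir = base_dir

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)
        is_embed = isinstance(node.func, ast.Name) and node.func.id == "embed"
        if (is_embed and len(node.args) == 1 and not node.keywords
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            data = (self.base_dir / node.args[0].value).read_bytes()
            # The file contents become an ordinary bytes constant in the code object.
            return ast.copy_location(ast.Constant(value=data), node)
        return node


def compile_with_embeds(source: str, base_dir: Path):
    """Parse, inline every embed("...") call, and compile to a code object."""
    tree = InlineEmbeds(base_dir).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return compile(tree, "<embedded>", "exec")


# Usage: the data file is read once, while compiling.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "data.bin").write_bytes(b"\x00\x01\x02")
    code = compile_with_embeds('some_ba = bytearray(embed("./data.bin"))', base)

namespace = {}
exec(code, namespace)  # the temporary file no longer exists at this point
print(namespace["some_ba"])
```

Because the data ends up as a plain constant, the code object keeps working after the file is deleted, which is also why later edits to the file would go unnoticed.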

I’m not excited about this though – I don’t think I’ve ever felt the need to do something like this. Certainly not until you (@Klaider) have answered this: What would you use it for? How often do you think that use case occurs in real code?

Klaider · June 12, 2022, 7:21pm

At one point I needed that feature in ECMAScript for embedding Unicode General Categories data, which I was going to use for parsing an ECMAScript dialect. That’s not the case anymore. But I keep wondering whether it may be useful for internationalization (i18n) data, which would be inefficient if it were stored in JSON.

Usually the data I’m talking about is generated before building the Python program or library.

Embedding text is still useful because retrieving it won’t require any runtime instructions. Data in octets can be structured to be parsed faster.

ajoino · June 12, 2022, 8:12pm

Could you explain how this feature would be used? I am not familiar with any of the languages you listed and you don’t explain why this feature is nice beyond referencing those languages.

Klaider · June 12, 2022, 8:19pm

Suppose we have this project tree:

src
- lib.py
- some_font.ttf

src/lib.py

some_ba = embed './some_font.ttf': bytearray

When this script is compiled to a .pyc file, the embed expression will return a bytearray containing the octets (or “bytes”) from the file src/some_font.ttf. So some_ba will be a TTF file as a bytearray.

And what is nice about it? It resolves the file by relative path and attaches the embedded file content to the pyc's output bytecode.

Also, if a type annotation is inconvenient, then the syntax could be embed '...' as 'application/octet-stream'

ajoino · June 12, 2022, 8:29pm

Thanks for the quick answer!

So it’s essentially some kind of “compile-time file reading with caching”?

Not sure I’ve ever needed it but I’ll let more knowledgeable people comment on the usefulness.

guido · June 12, 2022, 8:40pm

Does it include the contents of the file as it is at compile time? Or as it is when the pyc file runs?

Klaider · June 12, 2022, 8:43pm

Yes, it’d be verified/resolved at compile-time (when generating the bytecode).

pf_moore · June 12, 2022, 9:30pm

So if I have a file

a_str = embed './some_text.txt': str
print(a_str)

then I do py the_file.py it will print the contents of some_text.txt. And if I then change some_text.txt and re-run py the_file.py, it will print the old contents of the file, as that’s what’s embedded in the bytecode? That seems rather error-prone. Or would it need to recompile whenever either the source or any embedded files get changed?

Klaider · June 12, 2022, 9:32pm

Oh yes, I hadn’t thought about that caching issue. If having the file cached in the bytecode is a problem, then the developer would need to recompile the program when needed.
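To make the "recompile whenever either the source or any embedded files get changed" option concrete, here is a minimal sketch of a staleness check based on modification times. The needs_recompile helper is hypothetical; CPython's default .pyc validation looks only at the source file, so an embed feature would need some additional check of this kind.

```python
import os
import tempfile
from pathlib import Path


def needs_recompile(cached: Path, dependencies: list[Path]) -> bool:
    """Return True if the cache is missing or older than any dependency."""
    if not cached.exists():
        return True
    cached_mtime = cached.stat().st_mtime_ns
    return any(dep.stat().st_mtime_ns > cached_mtime for dep in dependencies)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    source = base / "the_file.py"
    data = base / "some_text.txt"
    cached = base / "the_file.pyc"
    for path in (source, data, cached):
        path.write_bytes(b"")

    # Set explicit timestamps so the example does not depend on clock resolution.
    os.utime(source, times=(1_600_000_000, 1_600_000_000))
    os.utime(data, times=(1_600_000_000, 1_600_000_000))
    os.utime(cached, times=(1_600_000_100, 1_600_000_100))
    fresh = needs_recompile(cached, [source, data])

    # Touch the embedded file so that it is newer than the cache.
    os.utime(data, times=(1_600_000_200, 1_600_000_200))
    stale = needs_recompile(cached, [source, data])

print(fresh, stale)
```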

steven.daprano · June 13, 2022, 3:25am

I don’t understand your example or why this is a keyword “embed” rather than a function that returns the requested binary data.

You have a TTF file some_font.ttf, which you want to read into a byte array. I am completely perplexed by your comment that this:

“attaches the embedded file content to the pyc's output bytecode.”

I don’t understand that part. A TTF font is not bytecode. Why do you want to append it to a .pyc file? How will you retrieve it afterwards?

Is this something like the old Basic DATA command?

I can see two good improvements here:

1. make it easier to locate the current module’s location (as opposed to the current working directory)
2. make it easier to read from a file into a bytearray.

For 1, I never remember the steps to find a module’s directory. By memory, I think the dance goes something like this:

import __main__
import os
module_dir = os.path.dirname(__main__.__file__)
data_file_location = os.path.join(module_dir, 'somefont.ttf')

It would be nice to have a more obvious and easier way to do this.

For 2, if you know the file is smallish, you can just do this:

with open(data_file_location, 'rb') as f:
    ba = bytearray(f.read())

but if you want to read it incrementally, to avoid building up a giant bytes object that needs to be garbage collected, it gets more annoying. So again I can see the benefit of a nicer method to read into a bytearray (if there isn’t already one).

But I don’t understand this business of pretending that an arbitrary binary file is byte code and appending it to the .pyc file.

What am I missing?
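On the second point, one existing way to avoid the intermediate bytes object is to preallocate the bytearray and fill it with readinto(). This is a minimal sketch; read_into_bytearray is a hypothetical helper name, and it assumes a regular file whose size is known up front.

```python
import os
import tempfile


def read_into_bytearray(path: str) -> bytearray:
    """Read a whole file directly into a preallocated bytearray."""
    buf = bytearray(os.path.getsize(path))
    filled = 0
    with open(path, "rb") as f, memoryview(buf) as view:
        while filled < len(buf):
            n = f.readinto(view[filled:])
            if not n:
                break  # the file shrank while it was being read
            filled += n
    if filled < len(buf):
        del buf[filled:]  # drop the unused tail
    return buf


# Usage with a throwaway file standing in for a real font.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"not really a font")
ba = read_into_bytearray(tmp.name)
os.remove(tmp.name)
print(ba)
```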

storchaka · June 13, 2022, 4:48am

It is a solution to a problem that does not exist in Python. In all your examples the end result of compilation is a monolithic binary file which is distributed separately from the source files. A Python program is usually distributed as source and stored as a tree of files. There is no problem with including one more file in the distribution, and there is no benefit to adding its content to one of the .pyc files. Even if you decide to distribute only .pyc files and pack them in a ZIP archive, there is no problem with adding one more file to the archive.

In modern Python the idiomatic way to access the content of such a file is using the importlib.resources module. It works with zipimport, .pyc-only distributions, etc.
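For reference, a minimal sketch of reading a packaged data file with importlib.resources.files() (available since Python 3.9). The package name demo_pkg and the file data.bin are made up for the example and created on the fly so that the snippet is self-contained; in a real project the package would already be installed.

```python
import importlib
import sys
import tempfile
from importlib.resources import files
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway package (hypothetical name) containing one data file.
    pkg_dir = Path(tmp) / "demo_pkg"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "data.bin").write_bytes(b"\x00\x01\x02")

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        # Locate the resource relative to the package, not the working directory.
        resource_bytes = files("demo_pkg").joinpath("data.bin").read_bytes()
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("demo_pkg", None)

print(resource_bytes)
```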

malemburg · June 13, 2022, 7:35am

In situations where you might want this, the common way of adding such external (binary) content to a Python source file is by having a separate script take the file, compress it and add it via base64 encoding to a long string literal, together with logic to decode and uncompress it.

Since special processing is needed, this typically goes beyond just embedding a raw data blob and so special syntax wouldn’t make this any easier.

I have written such code in the past for e.g. creating self-extracting files written in Python, or to have byte code embedded in .py files (rather than shipping pyc/pyo files). A special keyword to have the compiler read external data wouldn’t have helped with these, since I needed the embedding to happen even before the compiler is run.
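The general technique described above can be sketched as follows. This is only an illustration under assumed names (make_embedded_source and BLOB are hypothetical), not the code from any of the tools mentioned: a generator step compresses the data, base64-encodes it, and emits Python source that decodes it again when run.

```python
import base64
import zlib


def make_embedded_source(data: bytes, name: str = "BLOB") -> str:
    """Return Python source that defines `name` as the original bytes."""
    encoded = base64.b64encode(zlib.compress(data, 9)).decode("ascii")
    # Split the long string into 76-character pieces; adjacent string
    # literals inside parentheses are concatenated by the compiler.
    lines = [encoded[i:i + 76] for i in range(0, len(encoded), 76)]
    literal = "\n".join(f'    "{line}"' for line in lines)
    return (
        "import base64, zlib\n"
        f"_{name}_B64 = (\n{literal}\n)\n"
        f"{name} = zlib.decompress(base64.b64decode(_{name}_B64))\n"
    )


# Usage: generate the source, then run it as the shipped script would.
payload = bytes(range(256)) * 4
generated = make_embedded_source(payload)
namespace = {}
exec(generated, namespace)
print(len(namespace["BLOB"]))
```

The decoding cost is paid at import time, which is the runtime overhead raised later in the thread.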

pf_moore · June 13, 2022, 10:19am

Agreed. An example of this is get-pip.py, which embeds the whole of pip into a single script, so we can provide a “download this one file and run it” installation method. An “embed” mechanism like the one proposed here wouldn’t help, for the reasons you mention, as well as the fact that we want to ship the source script, not bytecode.

So I’m -1 on this idea, as I don’t think it solves any problem that isn’t better solved by other means.

steven.daprano · June 13, 2022, 1:21pm

“An example of this is get-pip.py, which embeds the whole of pip into a single script”

Why not use a zip app?

That’s the technique used by the extremely popular youtube-dl program.

pf_moore · June 13, 2022, 1:44pm

Because it doesn’t work for our requirements [1]. Pip vendors at least one module that isn’t “zip-safe”. As the core dev responsible for zipapp, and a pip maintainer, I can confirm I was aware of that option. But a zipapp is a very good approach if it works for you. And tools like shiv create self-extracting zipapps which even handle cases that “plain” zipapps don’t.

This is pretty off-topic, though.

[1] Technically it might now; some of the constraints that caused us problems might have changed. But we would have to test that and what we have works fine.

CAM-Gerlach · June 13, 2022, 5:28pm

Steven D'Aprano:

For 1, I never remember the steps to find a module’s directory. By memory, I think the dance goes something like this:

import __main__
import os
module_dir = os.path.dirname(__main__.__file__)
data_file_location = os.path.join(module_dir, 'somefont.ttf')

It would be nice to have a more obvious and easier way to do this.

If you actually want the module’s directory, that’s just the parent of __file__, and with pathlib, you can read the file in a one-liner. So, you could replace both steps with just

from pathlib import Path
ba = (Path(__file__).parent / 'somefont.ttf').read_bytes()

But the right way to do it is as @storchaka said,

Klaider · June 13, 2022, 6:46pm

What if Python cannot always be distributed in sources? For example, when you distribute Python for a web browser page, it’d usually make sense to pass the page its bytecode, not the original Python sources.

Actually, the PyScript framework allows you to load Python scripts from actual Python sources in a web browser page, but I find that idea bad (because the sources are parsed inside the web browser page).

That’s why bytecode containing everything would be ideal. The solution provided by importlib.resources seems to be the equivalent of Node.js’s fs.readFileSync(path.resolve(__dirname, './some_thing.bin')). This implies that the program is provided as source to the front-end users.

Klaider · June 13, 2022, 6:53pm

The problem is that the base64 content will have to be converted back to binary content at runtime. This is especially a problem for large files.

pf_moore · June 13, 2022, 7:16pm

Why is that a problem? Have you measured your application and confirmed that this is the critical performance bottleneck that makes the difference between your application working and failing?

I don’t personally consider that to be a valid use case. If you’re distributing bytecode, you have to ensure that your users are all using the exact same version of Python because we offer no guarantees that bytecode will be stable between versions. I guess there may be cases where that’s necessary, but they aren’t going to be common enough to justify a new language feature.
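The version coupling can be seen in the .pyc header itself: the first four bytes are a magic number that changes whenever the bytecode format changes, and the import system rejects files whose magic number does not match the running interpreter. A minimal sketch, with lib.py as a made-up file name:

```python
import importlib.util
import py_compile
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "lib.py"
    source.write_text("x = 1\n")
    # Compile to an explicit .pyc path and read back its 4-byte magic number.
    pyc_path = py_compile.compile(
        str(source), cfile=str(Path(tmp) / "lib.pyc"), doraise=True
    )
    header = Path(pyc_path).read_bytes()[:4]

# The header matches only the interpreter version that produced the file.
matches_this_interpreter = header == importlib.util.MAGIC_NUMBER
print(sys.version_info[:2], matches_this_interpreter)
```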