I wish it were possible to include an octet stream or plain text in a Python script, referenced by relative path, without any runtime instructions. Some languages have this feature, as far as I know: Rust (`let some_text = include_str!("./data.txt");`, `let some_ba = include_bytes!("./data.bin");`), ActionScript (`[Embed]`), and .NET languages (which combine project configuration with a resource API for this).
Python’s compiled bytecode gives the language a good opportunity to support this feature: the bytecode format could easily carry an octet stream.
IIUC the Rust version of this just constructs a string literal from the contents of the indicated file at compile time. The Python compiler could also do something like that. (But the fact that we use bytecode doesn’t make it easier.)
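To illustrate the point about compile-time string construction: a bytes literal written into Python source already ends up as a constant inside the compiled code object, which is exactly what a `.pyc` file stores (via `marshal`). This is only a sketch of that mechanism, not the proposed feature itself:

```python
import marshal

# A bytes literal in the source becomes a constant of the code object.
code = compile("DATA = b'\\x00binary\\xffpayload'", "<embedded>", "exec")

# Round-trip through marshal, which is essentially what writing and
# loading a .pyc file does with the code object.
restored = marshal.loads(marshal.dumps(code))

ns = {}
exec(restored, ns)
print(ns["DATA"])  # b'\x00binary\xffpayload'
```

So a hypothetical `include_bytes`-style feature would mostly be a compiler front-end change: read the file and emit it as such a constant.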
I’m not excited about this though – I don’t think I’ve ever felt the need to do something like this. Certainly not until you (@Klaider) have answered this: What would you use it for? How often do you think that use case occurs in real code?
At one point I needed that feature in ECMAScript for embedding Unicode General Category data, which I was going to use for parsing an ECMAScript dialect. That’s not the case anymore, but I keep wondering whether it may be useful for internationalization (i18n) data, which would be inefficient to store as JSON.
Usually the data I’m talking about is generated before building the Python program or library.
The feature to embed text would still be useful because retrieving the data wouldn’t require any runtime instructions. And data stored as octets can be structured to parse faster.
Then when I do `py the_file.py`, it will print the contents of `some_text.txt`. And if I then change `some_text.txt` and re-run `py the_file.py`, it will print the old contents of the file, as that’s what’s embedded in the bytecode? That seems rather error-prone. Or would it need to recompile whenever either the source or any of the embedded files changed?
It would be nice to have a more obvious and easier way to do this.
For 2, if you know the file is smallish, you can just do this:
```python
with open(data_file_location, 'rb') as f:
    ba = bytearray(f.read())
```
but if you want to read it incrementally, to avoid building up a giant bytes object that needs to be garbage collected, it gets more annoying. So again I can see the benefit of a nicer method to read into a bytearray (if there isn’t already one).
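For the incremental case, one way to avoid the intermediate whole-file `bytes` object is `readinto()` with a reusable buffer. A sketch (the helper name and chunk size are my own):

```python
import os
import tempfile

def read_into_bytearray(path, chunk_size=64 * 1024):
    """Read a file into a bytearray chunk by chunk, never building
    a single bytes object for the whole file."""
    ba = bytearray()
    buf = bytearray(chunk_size)
    view = memoryview(buf)
    with open(path, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buf)
            if not n:
                break
            ba += view[:n]  # append only the bytes actually read
    return ba

# Demo with a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc" * 1000)
data = read_into_bytearray(tmp.name)
os.remove(tmp.name)
print(len(data))  # 3000
```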
But I don’t understand this business of pretending that an arbitrary binary file is byte code and appending it to the .pyc file.
It is a solution to a problem that does not really exist in Python. In all your examples, the end result of compilation is a monolithic binary file which is distributed separately from the source files. A Python program is usually distributed as source and stored as a tree of files. There is no problem with including one more file in the distribution, and no benefit to adding its content to one of the .pyc files. Even if you decide to distribute only .pyc files and pack them into a ZIP archive, there is no problem with adding one more file to the archive.
In modern Python, the idiomatic way to access the content of such a file is the importlib.resources module. It works with zipimport, .pyc-only distributions, etc.
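A minimal, self-contained sketch of that API (the package name `demo_pkg` and the data file are made up for the demo; normally the package would already be installed):

```python
import sys
import tempfile
from pathlib import Path
from importlib import resources

# Build a throwaway package with a data file, just for demonstration.
tmp = Path(tempfile.mkdtemp())
pkg = tmp / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "data.bin").write_bytes(b"\x01\x02\x03")
sys.path.insert(0, str(tmp))

# The idiomatic lookup (Python 3.9+); it works whether the package
# lives on the filesystem, inside a zip, or in a .pyc-only layout.
payload = resources.files("demo_pkg").joinpath("data.bin").read_bytes()
print(payload)  # b'\x01\x02\x03'
```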
In situations where you might want this, the common way of adding such external (binary) content to a Python source file is to have a separate script take the file, compress it, and add it as a base64-encoded long string literal, together with logic to decode and decompress it.
Since special processing is needed, this typically goes beyond just embedding a raw data blob and so special syntax wouldn’t make this any easier.
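The compress-then-base64 round trip described above can be sketched like this (the variable names are illustrative; in practice a build script writes the literal into the generated `.py` file):

```python
import base64
import zlib

payload = b"pretend this is a large binary blob" * 10

# What the preprocessing script computes and writes into the .py file:
literal = base64.b64encode(zlib.compress(payload)).decode("ascii")

# What the generated module does at import time:
EMBEDDED = literal  # in the real file, a long string literal
data = zlib.decompress(base64.b64decode(EMBEDDED))

print(data == payload)  # True
```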
I have written such code in the past for e.g. creating self-extracting files written in Python, or to have byte code embedded in .py files (rather than shipping pyc/pyo files). A special keyword to have the compiler read external data wouldn’t have helped with these, since I needed the embedding to happen even before the compiler is run.
Agreed. An example of this is get-pip.py, which embeds the whole of pip into a single script, so we can provide a “download this one file and run it” installation method. An “embed” mechanism like the one proposed here wouldn’t help, for the reasons you mention, as well as the fact that we want to ship the source script, not bytecode.
So I’m -1 on this idea, as I don’t think it solves any problem that isn’t better solved by other means.
Because it doesn’t work for our requirements. Pip vendors at least one module that isn’t “zip-safe”. As the core dev responsible for zipapp, and a pip maintainer, I can confirm I was aware of that option. But a zipapp is a very good approach if it works for you, and tools like shiv create self-extracting zipapps which even handle cases that “plain” zipapps don’t.
This is pretty off-topic, though.
Technically it might work now; some of the constraints that caused us problems might have changed. But we would have to test that, and what we have works fine.
What if Python code cannot always be distributed as source? For example, when you ship Python to a web browser page, it would usually make sense to send the page its bytecode, not the original Python sources.
Actually, the PyScript framework lets you load Python scripts as actual Python source in a web browser page, but I find that idea bad (because the sources are parsed inside the browser).
That’s why bytecode containing everything would be ideal. The solution provided by importlib.resources seems to be the Node.js equivalent of `fs.readFileSync(path.resolve(__dirname, './some_thing.bin'))`, which implies that the program is shipped as source to front-end users.
Why is that a problem? Have you measured your application and confirmed that this is the critical performance bottleneck that makes the difference between your application working and failing?
I don’t personally consider that to be a valid use case. If you’re distributing bytecode, you have to ensure that your users are all using the exact same version of Python because we offer no guarantees that bytecode will be stable between versions. I guess there may be cases where that’s necessary, but they aren’t going to be common enough to justify a new language feature.