How to read 1MB of input from stdin?

A Python script using sys.stdin.buffer.read() exits when the input to stdin is longer than 873816 bytes. 1MB is 1048576 bytes.

Have you tried reading multiple smaller chunks until you have a total of 1MB?

I tried using the code from “How to read exactly N bytes from stdin in Native Messaging host? - #2 by Rosuav”.

I can read input to stdin with length 1045001 (new Array(209000)) sent from JavaScript using C, C++, and JavaScript (QuickJS, Deno, Node.js) hosts. I am trying to achieve the same using Python, which currently exits when the input is new Array(174763).

I was just experimenting with os.read() and can’t get that working as expected either. I don’t have a preference for how to achieve the requirement, just to achieve it.

FWIW, this is how I achieve the algorithm in C, C++, and QuickJS.

This is the source of the original Python Native Messaging host from MDN that I slightly modified, though I didn’t change the getMessage function.

This works out of the box using QuickJS: native-messaging-quickjs/nm_qjs.js at main · guest271314/native-messaging-quickjs · GitHub

function getMessage() {
  const header = new Uint32Array(1);
  std.in.read(header.buffer, 0, 4);
  const output = new Uint8Array(header[0]);
  std.in.read(output.buffer, 0, output.length);
  return output;
}

I just need a Python equivalent. Whether a loop or a one-time read is used doesn’t matter.

Try something like this:

import sys

def get_message():
    # The first 4 bytes hold the length of the message that follows.
    header = read_bytes(sys.stdin.buffer, 4)
    size = int.from_bytes(header, 'little') # Assuming little-endian.
    return read_bytes(sys.stdin.buffer, size)

def read_bytes(input_stream, size):
    # Read in chunks until exactly `size` bytes have been collected.
    MAX_CHUNK_SIZE = 1024 ** 2
    data = b''

    while len(data) < size:
        chunk = input_stream.read(min(size - len(data), MAX_CHUNK_SIZE))

        if not chunk:
            # The stream ended before the full message arrived.
            raise EOFError()

        data += chunk

    return data
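
One way to try a helper like this without a browser is to feed it an in-memory stream. The following is only a sketch: it assumes a 4-byte little-endian length prefix as in the snippet above, passes the stream in as a parameter so it can be tested, and uses a made-up helper named frame plus a sample payload built to be exactly 1MB.

```python
import io


def read_bytes(input_stream, size):
    # Collect chunks until exactly `size` bytes have been read.
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = input_stream.read(remaining)
        if not chunk:
            raise EOFError(f'expected {size} bytes, stream ended early')
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)


def get_message(input_stream):
    # 4-byte length header (assumed little-endian), then the payload.
    header = read_bytes(input_stream, 4)
    size = int.from_bytes(header, 'little')
    return read_bytes(input_stream, size)


def frame(payload):
    # Hypothetical helper: prepend the 4-byte length header.
    return len(payload).to_bytes(4, 'little') + payload


# 209715 nulls in compact JSON form is exactly 1024 * 1024 bytes.
payload = b'[' + b','.join([b'null'] * 209715) + b']'
message = get_message(io.BytesIO(frame(payload)))
print(len(message))
```

A truncated stream raises EOFError instead of silently returning a short message, which makes a reading problem easy to tell apart from a writing problem.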

What platform and Python version are you using? Here’s Python 3.9 on Linux reading 1 GiB (1073741824 bytes) piped to stdin:

$ python3.9 -c "import sys; sys.stdout.buffer.write(b'a' * (1024*1024*1024))" |
> python3.9 -c "import sys; print('read:', len(sys.stdin.buffer.read()))"
read: 1073741824

I cannot replicate your assertion:

I use Python to write 5 MiB (plus one newline) to stdout, redirect it to stdin, and read it back again:

[steve stdin_test]$ cat writer.py 
# Write five mebibytes (plus one newline) of text to stdout.
print('spam '*(1024*1024))

[steve stdin_test]$ cat reader.py 
import sys
arr = sys.stdin.buffer.read()
print(len(arr))

[steve stdin_test]$ python3.10 writer.py | python3.10 reader.py
5242881

Works fine for me. I think you might be misinterpreting what you are seeing. How do you know that stdin actually contains a full 1 MiB of data?

Can you show a minimal reproducible example?

P.S. Definitions of the SI units: The binary prefixes

Unfortunately the Python script exited.

Linux.

$ python3 --version
Python 3.10.6

Sure.

I am testing a Python Native Messaging host slightly modified from this MDN code Fix native message examples with python3 by evolighting · Pull Request #157 · mdn/webextensions-examples · GitHub.

Create a folder, e.g., native-messaging-python, include the following files

nm_python.py

#!/usr/bin/env -S python3 -u
# https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Native_messaging
# https://github.com/mdn/webextensions-examples/pull/157
# Note that running python with the `-u` flag is required on Windows,
# in order to ensure that stdin and stdout are opened in binary, rather
# than text, mode.

import sys
import json
import struct
import os

try:
    # Python 3.x version
    # Read a message from stdin and decode it.
    def getMessage():
        rawLength = sys.stdin.buffer.read(4)
        if len(rawLength) == 0:
            sys.exit(0)
        messageLength = struct.unpack('@I', rawLength)[0]

        f = open('log.txt', 'w')
        f.write(str(messageLength))
        f.close() 

        message = sys.stdin.buffer.read(messageLength).decode('utf-8')
        return json.loads(message)

    # Encode a message for transmission,
    # given its content.
    def encodeMessage(messageContent):
        encodedContent = json.dumps(messageContent).encode('utf-8')
        encodedLength = struct.pack('@I', len(encodedContent))
        return {'length': encodedLength, 'content': encodedContent}

    # Send an encoded message to stdout
    def sendMessage(encodedMessage):
        sys.stdout.buffer.write(encodedMessage['length'])
        sys.stdout.buffer.write(encodedMessage['content'])
        sys.stdout.buffer.flush()

    # os.set_blocking(sys.stdout.fileno(), False)
    while True:
        receivedMessage = getMessage()

        f = open('data.txt', 'w')
        f.write(str(len(receivedMessage)))
        f.close()

        sendMessage(encodeMessage(receivedMessage))

except Exception as e:
    sys.stdout.buffer.flush()
    sys.stdin.buffer.flush()
    sys.exit(0)

manifest.json

{
  "name": "nm-python",
  "short_name": "nm_python",
  "version": "1.0",
  "description": "Python Native Messaging host",
  "manifest_version": 3,
  "permissions": ["nativeMessaging"],
  "background": {
    "service_worker": "background.js",
    "type": "module"
  },
  "action": {}
}

background.js

globalThis.name = chrome.runtime.getManifest().short_name;

globalThis.port = chrome.runtime.connectNative(globalThis.name);
port.onMessage.addListener((message) => console.log(JSON.stringify(message).length, message));
port.onDisconnect.addListener((p) => console.log(chrome.runtime.lastError));
port.postMessage(new Array(174762));

chrome.runtime.onInstalled.addListener((reason) => {
  console.log(reason);
});

nm_python.json

{
  "name": "nm_python",
  "description": "Native Messaging Host Protocol Example",
  "path": "/home/user/native-messaging-python/nm_python.py",
  "type": "stdio",
  "allowed_origins": [
    "chrome-extension://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/"
  ]
}

Make sure nm_python.py is executable: chmod u+x nm_python.py.

On Chrome or Chromium navigate to chrome://extensions, click “Developer mode”, then click “Load unpacked” and select the folder created above. Note the generated extension ID, and replace the 32-character placeholder “xxx…” in nm_python.json with that extension ID. Copy nm_python.json to ~/.config/google-chrome or the equivalent directory for whatever version of Chrome or Chromium you are using. I’m testing on Chromium Version 111.0.5522.0 (Developer Build) (64-bit). Now reload the extension and click the “service_worker” link on the extension page.

“background.js” will automatically establish a connection to the Python Native Messaging host and echo back input. Adjust the value input to the Python Native Messaging host to

port.postMessage(new Array(174763));

to observe the limitation for Python.

Then repeat the same steps above for the C, C++, and JavaScript engine (QuickJS, Deno, and Node.js) Native Messaging hosts.

Replace the message I hardcoded in the GitHub repositories with

port.postMessage(new Array(209715));

and observe the C, C++, and JavaScript (QuickJS, Deno, Node.js) hosts echo back the 1MB input message:

JSON.stringify(new Array(209715)).length === 1024*1024 // true

I didn’t write the Python Native Messaging host. I’m just trying to figure out how to get it working to produce what is produced using other programming languages.

Ugh. You’re eating all exceptions and doing nothing with them. Fix that and you might find that there’s something else going on.

Fix that how? I’m trying to fix the entire script.

Before fixing the entire script, you need to understand what’s going on.
So you must not throw away the exception.

Put this code in the except clause and read the log file after your script exits.

import traceback
with open("/tmp/myscript.log", "w", encoding="utf-8") as f:
    traceback.print_exc(file=f)

Traceback (most recent call last):
  File "/home/user/native-messaging-python/nm_python.py", line 51, in <module>
    sendMessage(encodeMessage(receivedMessage))
  File "/home/user/native-messaging-python/nm_python.py", line 40, in sendMessage
    sys.stdout.buffer.write(encodedMessage['content'])
BrokenPipeError: [Errno 32] Broken pipe

So your problem is writing large data into pipe, not reading.
This is why you must not guess your problem.

“So your problem is writing large data into pipe, not reading.”

That’s what it looks like. I was narrowing the issue down using open() and writing to file.

“This is why you must not guess your problem.”

Sure I have hypotheses. I eliminate steps that are not the issue as I go, with your help.

I have been just using the Python Native Messaging host from MDN without testing the limit of Native Messaging protocol. QuickJS, C++, C worked out of the box with calls to read().

I write more JavaScript than Python. I’m asking the experts in Python how to fix the script.

The other end of the pipe was closed. Maybe there’s something wrong with the data. Try skipping the intermediate data processing of json.loads() and json.dumps(). Send the same data that you read, just like what you wrote in the C/C++ examples.
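
To illustrate what “the other end of the pipe was closed” looks like from Python, here is a minimal sketch using os.pipe() on a POSIX system. It does not reproduce the browser’s behaviour; write_to_closed_pipe is a hypothetical helper, and the assumption is simply that the reader goes away before the write happens.

```python
import errno
import os


def write_to_closed_pipe(data):
    # Create a pipe, then close the read end to simulate a reader
    # that has gone away before we write.
    read_fd, write_fd = os.pipe()
    os.close(read_fd)
    try:
        os.write(write_fd, data)
    except BrokenPipeError as exc:
        # Python ignores SIGPIPE by default, so the failed write
        # surfaces as an exception with errno EPIPE.
        return exc.errno
    finally:
        os.close(write_fd)
    return None


print(write_to_closed_pipe(b'x') == errno.EPIPE)
```

The same exception in the traceback above therefore says that the reader stopped listening, not that Python failed to read its own input.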

I figured it out by moving open(), write(), and close() around in the code, with help from this Stack Overflow answer: python 3.x - Why is the length of json.dumps(json.loads(line)) bigger than line?

def encodeMessage(messageContent):
    encodedContent = json.dumps(messageContent).encode('utf-8')
    # Log the encoded length to a file for debugging.
    f = open('data.txt', 'w')
    # [b for b in encodedMessage['length']]
    f.write(str(len(encodedContent)))
    f.close()
    encodedLength = struct.pack('@I', len(encodedContent))
    return {'length': encodedLength, 'content': encodedContent}

When separators=(',', ':') is not passed to json.dumps(), the encoded length of the input port.postMessage(new Array(174763)); from JavaScript (which Chrome serializes to a JSON-like format) is reported in Python as

1048578

which is greater than 1024*1024. Essentially, the formatting space characters that json.dumps() adds by default were being counted as part of the length of the JSON.
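
The lengths quoted in this thread can be reproduced without a browser. This is a small sketch, assuming Chrome serializes new Array(n) as a JSON array of n nulls with no spaces (which is consistent with the 873816 and 1048576 figures above); compact_length and default_length are names made up for the illustration.

```python
import json

LIMIT = 1024 * 1024  # 1048576, the 1MB figure discussed in this thread


def compact_length(n):
    # Length of [null,null,...] with no spaces, the form that
    # JSON.stringify(new Array(n)) produces in JavaScript.
    return len(json.dumps([None] * n, separators=(',', ':')))


def default_length(n):
    # json.dumps() defaults to ', ' and ': ' as separators, so every
    # comma is followed by a space.
    return len(json.dumps([None] * n))


for n in (174762, 174763, 209715):
    print(n, compact_length(n), default_length(n), default_length(n) <= LIMIT)
```

With the default separators, 174762 elements come to 1048572 bytes, just under the limit, while 174763 elements come to 1048578, just over it. With compact separators, 209715 elements come to exactly 1048576.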

Solved by passing separators=(',', ':') to json.dumps() for “compact encoding”, as described in the json module section of the Python 3.11.1 documentation.

Now we can write 1MB with

port.postMessage(new Array(209715));

from JavaScript, without counting formatting space characters as part of the encoded message length.
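
For reference, a sketch of what the encoding side might look like with the fix applied. The names encode_message, send_message, and MAX_MESSAGE_SIZE are my own, and the 1MB cap is taken from the figures in this thread rather than checked against the browser source, so treat it as an assumption.

```python
import io
import json
import struct

MAX_MESSAGE_SIZE = 1024 * 1024  # assumed 1MB cap on host-to-browser messages


def encode_message(message_content):
    # Compact separators avoid the space that json.dumps() would
    # otherwise add after every ',' and ':'.
    encoded = json.dumps(message_content, separators=(',', ':')).encode('utf-8')
    if len(encoded) > MAX_MESSAGE_SIZE:
        # Fail loudly here rather than let the write end in a broken pipe.
        raise ValueError(f'message is {len(encoded)} bytes, over the assumed limit')
    # 4-byte length in native byte order, followed by the JSON bytes.
    return struct.pack('@I', len(encoded)) + encoded


def send_message(stream, framed_message):
    stream.write(framed_message)
    stream.flush()


# Demonstration with an in-memory stream instead of sys.stdout.buffer.
out = io.BytesIO()
send_message(out, encode_message([None] * 209715))
print(len(out.getvalue()))
```

In the real host the stream would be sys.stdout.buffer. Raising ValueError before writing keeps an oversized message from showing up only as a BrokenPipeError.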

Thanks for your help.
