Making shutil.copyfileobj faster

I’m working on a project[1] that spends a lot of its runtime downloading files over HTTP and writing them to local files. I noticed that shutil.copyfileobj allocates a new buffer for every read it performs, which seems inefficient:

def copyfileobj(fsrc, fdst, length=COPY_BUFSIZE):
    while buf := fsrc.read(length):  # 😬 allocates a new buffer on every read
        fdst.write(buf)

I noticed there is a private shutil._copyfileobj_readinto, which uses the more efficient approach of allocating a single buffer in advance and reusing it for all the reads:

def _copyfileobj_readinto(fsrc, fdst, length=COPY_BUFSIZE):
    mv = memoryview(bytearray(length))
    while n := fsrc.readinto(mv):  # 🙂 reuses buffer mv
        if n < length:
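            # NOTE: assumes a short read occurs only at EOF, which holds
            # for blocking buffered streams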
            fdst.write(mv[:n])
            break
        else:
            fdst.write(mv)

It seems copyfileobj would be faster if it used the _copyfileobj_readinto approach whenever possible:

def copyfileobj(fsrc, fdst, length=COPY_BUFSIZE):
    if hasattr(fsrc, 'readinto'):
        _copyfileobj_readinto(fsrc, fdst, length)
    else:
        # Original implementation of shutil.copyfileobj
        while buf := fsrc.read(length):
            fdst.write(buf)
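
For what it’s worth, here’s a quick sanity check of that dispatch, assuming hasattr(fsrc, 'readinto') is an adequate proxy for “binary source that supports readinto”:

import io

# Binary streams expose readinto and would take the fast path:
print(hasattr(io.BytesIO(b'abc'), 'readinto'))  # True
# Text streams do not, so they would use the read() fallback:
print(hasattr(io.StringIO('abc'), 'readinto'))  # False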

Does anyone see a problem with me submitting this kind of change as a PR?


P.S. copyfileobj could be made even faster if fsrc is known to be a real file object that supports os.sendfile, but I’ll leave that discussion to another thread.
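
For reference, a minimal sketch of what such a fast path might look like, loosely modeled on the private shutil._fastcopy_sendfile (copy_with_sendfile is a hypothetical name; it assumes Linux and real file descriptors on both ends):

import os

def copy_with_sendfile(fsrc, fdst):
    # Zero-copy transfer in kernel space. Assumes fsrc and fdst are real
    # binary files with usable fileno(), on a platform where os.sendfile
    # accepts a regular file as the output descriptor.
    infd = fsrc.fileno()
    outfd = fdst.fileno()
    blocksize = max(os.fstat(infd).st_size, 2 ** 23)  # at least 8 MiB
    offset = 0
    while True:
        sent = os.sendfile(outfd, infd, offset, blocksize)
        if sent == 0:
            break  # EOF
        offset += sent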


  1. In case anyone is curious, the project I’m working on is Crystal, a website downloader. ↩︎

There’s a note in _fastcopy_sendfile, which was added in #7160 (#77852):

Note: copyfileobj() is left alone in order to not introduce any unexpected breakage. Possible risks by using zero-copy calls in copyfileobj() are:

  • fdst cannot be open in “a”(ppend) mode
  • fsrc and fdst may be open in “t”(ext) mode
  • fsrc may be a BufferedReader (which hides unread data in a buffer), GzipFile (which decompresses data), HTTPResponse (which decodes chunks).
  • possibly others (e.g. encrypted fs/partition?)
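
To make the GzipFile risk concrete: sendfile operates on raw file descriptors, so it bypasses the Python-level wrapper entirely. A hedged sketch (hypothetical file names; assumes a platform where os.sendfile can write to a regular file):

import gzip
import os

with gzip.open('data.gz', 'rb') as fsrc, open('out.bin', 'wb') as fdst:
    # shutil.copyfileobj(fsrc, fdst) would read through the wrapper and
    # write the decompressed data. A zero-copy call sees only the
    # underlying descriptor and writes the raw compressed stream instead:
    size = os.fstat(fsrc.fileno()).st_size
    os.sendfile(fdst.fileno(), fsrc.fileno(), 0, size)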

That discussion may have further relevant notes as to why copyfileobj wasn’t changed when the zero-copy methods were added—though it was five years ago now.

cc: @grodola


1 Like

Those items appear to be related to altering shutil.copyfileobj to use os.sendfile, which I’m not proposing here.

My proposal is to have shutil.copyfileobj use readinto rather than read when possible.

Have you measured any of this, or are you just going by a hunch? I wouldn’t think memory allocation would be the long pole here, but I haven’t measured anything either.

Oh, you’re quite right, sorry for my misunderstanding.


I see this:

COPY_BUFSIZE = 1024 * 1024 if _WINDOWS else 64 * 1024

I’m not sure why a 1 MiB buffer isn’t used by default on all OSes.

I have always increased the buffer size to help speed this up.
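
For reference, the buffer size is just the optional third argument of copyfileobj, so callers can pass a larger one without touching shutil (hypothetical file names):

import shutil

# Copy with an 8 MiB buffer instead of the default COPY_BUFSIZE
with open('src.bin', 'rb') as fsrc, open('dst.bin', 'wb') as fdst:
    shutil.copyfileobj(fsrc, fdst, length=8 * 1024 * 1024)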

Is the default length of COPY_BUFSIZE too small for your application?

I’m going by a hunch. I speculate that a new approach doing only one dynamic memory allocation would be faster than the old approach that does many.

I can put a benchmark together in a day or two to check.

1 Like

I created a benchmark where:

  • An HTTP server loads a byte array simulating a large file (~200 MB) into memory at startup and responds to all GET requests with that byte array as the file content.
  • An HTTP client makes a request to that HTTP server, running on the same computer, and uses either copyfileobj or _copyfileobj_readinto to pipe the socket data to /dev/null.
  • timeit is used to run the HTTP client 20 times and print the minimum runtime.

In general, the _copyfileobj_readinto method is faster than copyfileobj, but only by about 0.002 s (roughly 5%) under the tested workload:

  • _copyfileobj_readinto:
    • Min Times (repeat=20): 0.0458, 0.0452, 0.0446, 0.0460, 0.0445, 0.0439
  • copyfileobj:
    • Min Times (repeat=20): 0.0474, 0.0491, 0.0460, 0.0476, 0.0468, 0.0485

That’s not enough of a performance gain for me to continue advocating for changing the behavior of the default shutil.copyfileobj.

benchmark.py
from http.client import HTTPConnection, HTTPSConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
import shutil
import sys
import timeit

# Case 1: Serve file quickly
def run_server():
    port = 7777
    content_type = 'video/mp4'
    body_bytes = bytearray(193_645_614)
    
    class RequestHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:  # override
            self.send_response(200)
            self.send_header('Content-Type', content_type)
            self.end_headers()
            self.wfile.write(body_bytes)
    
    print(f'Running server on port {port}')
    address = ('', port)
    server = HTTPServer(address, RequestHandler)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass

# Case 2: Download file quickly
def run_client():
    HTTPConnectionClass = HTTPConnection
    host_and_port = 'localhost:7777'
    path_and_query = '/'
    body_filepath = '/dev/null'
    
    conn = HTTPConnectionClass(host_and_port)
    conn.request('GET', path_and_query)
    response = conn.getresponse()
    
    with open(body_filepath, 'wb') as body_file:
        if True:  # 👈 change this condition to alter download method
            shutil.copyfileobj(response, body_file)
        else:
            shutil._copyfileobj_readinto(response, body_file)

if __name__ == '__main__':
    if len(sys.argv) != 2 or sys.argv[1] not in ['client', 'server', 'timeit']:
        sys.exit('Usage: python3 benchmark.py client|server|timeit')
    mode = sys.argv[1]
    if mode == 'server':
        run_server()
    elif mode == 'client':
        run_client()
    elif mode == 'timeit':
        print('Timing...')
        t = timeit.Timer(run_client)
        times = t.repeat(20, number=1)
        print(f'Times: {times}')
        print(f'Min Time: {min(times)}')
4 Likes

Thank you for not advocating premature optimization!

1 Like