Significantly improve `shutil.copytree`?

A discussion of speeding up shutil.copytree

Version

  • Python: 3.11.9

Background

shutil is a very useful module in Python. You can find its source on GitHub: https://github.com/python/cpython/blob/master/Lib/shutil.py

shutil.copytree is a function that copies a directory tree to another location.
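
A minimal usage example (by default the destination must not exist yet; passing dirs_exist_ok=True relaxes that):

import shutil

# copy the whole tree rooted at 'source' into a new directory 'destination'
shutil.copytree('source', 'destination')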

Internally, it calls the _copytree function to do the actual copying.

What does _copytree do?

  1. Ignoring specified files/directories.
  2. Creating destination directories.
  3. Copying files or directories while handling symbolic links.
  4. Collecting and eventually raising errors encountered (e.g., permission issues).
  5. Replicating metadata of the source directory to the destination directory.
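
To make that concrete, here is a heavily simplified sketch of that flow (this is not the actual CPython source; symlink handling and several details are reduced to comments):

import os
import shutil

def _copytree_sketch(src, dst, copy_function=shutil.copy2, ignore=None):
    # 1. compute the names to skip
    entries = list(os.scandir(src))
    ignored = ignore(src, [e.name for e in entries]) if ignore else set()

    # 2. create the destination directory
    os.makedirs(dst)

    errors = []
    for entry in entries:
        if entry.name in ignored:
            continue
        srcname = os.path.join(src, entry.name)
        dstname = os.path.join(dst, entry.name)
        try:
            # 3. recurse into directories, copy files
            #    (the real code also handles symlinks here)
            if entry.is_dir():
                _copytree_sketch(srcname, dstname, copy_function, ignore)
            else:
                copy_function(srcname, dstname)
        except OSError as why:
            # 4. collect errors instead of failing on the first one
            errors.append((srcname, dstname, str(why)))

    # 5. replicate the source directory's metadata
    shutil.copystat(src, dst)
    if errors:
        raise shutil.Error(errors)
    return dst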

Problems

_copytree is not very fast when there are many files or the files are large.

Here is a test:

import os
import shutil

os.mkdir('test')
os.mkdir('test/source')

def bench_mark(func, *args):
    import time
    start = time.time()
    func(*args)
    end = time.time()
    print(f'{func.__name__} takes {end - start} seconds')
    return end - start

# write 5000 files
def write_in_5000_files():
    for i in range(5000):
        with open(f'test/source/{i}.txt', 'w') as f:
            f.write('Hello World' + os.urandom(24).hex())

bench_mark(write_in_5000_files)

def copy():
    shutil.copytree('test/source', 'test/destination')

bench_mark(copy)

The result is:

write_in_5000_files takes 4.084963083267212 seconds
copy takes 27.12768316268921 seconds

What I did

Multithreading

I used multithreading to speed up the copying process. I renamed the original function to _copytree_single_threaded and added a new function _copytree_multithreaded. Here is _copytree_multithreaded:

import os
import sys
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed

def _copytree_multithreaded(src, dst, symlinks=False, ignore=None, copy_function=shutil.copy2,
                            ignore_dangling_symlinks=False, dirs_exist_ok=False, max_workers=4):
    """Recursively copy a directory tree using multiple threads."""
    sys.audit("shutil.copytree", src, dst)

    # get the entries to copy
    entries = list(os.scandir(src))

    # make the pool
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit the tasks
        futures = [
            executor.submit(_copytree_single_threaded, entries=[entry], src=src, dst=dst,
                            symlinks=symlinks, ignore=ignore, copy_function=copy_function,
                            ignore_dangling_symlinks=ignore_dangling_symlinks,
                            dirs_exist_ok=dirs_exist_ok)
            for entry in entries
        ]

        # wait for the tasks
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f"Failed to copy: {e}")
                raise

I added a check to decide whether to use the multithreaded version or not:

if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100 * 1024 * 1024:
    # multithreaded version
    return _copytree_multithreaded(src, dst, symlinks=symlinks, ignore=ignore,
                                   copy_function=copy_function,
                                   ignore_dangling_symlinks=ignore_dangling_symlinks,
                                   dirs_exist_ok=dirs_exist_ok)
else:
    # single-threaded version
    return _copytree_single_threaded(entries=entries, src=src, dst=dst,
                                     symlinks=symlinks, ignore=ignore,
                                     copy_function=copy_function,
                                     ignore_dangling_symlinks=ignore_dangling_symlinks,
                                     dirs_exist_ok=dirs_exist_ok)
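
For context, here is a hedged sketch of where this check could live, i.e. inside the public copytree wrapper (the signature mirrors shutil.copytree; _copytree_single_threaded and _copytree_multithreaded are the helpers described above):

def copytree(src, dst, symlinks=False, ignore=None, copy_function=shutil.copy2,
             ignore_dangling_symlinks=False, dirs_exist_ok=False):
    sys.audit("shutil.copytree", src, dst)
    # scan once, then pick a strategy based on entry count and total size
    entries = list(os.scandir(src))
    if len(entries) >= 100 or sum(os.path.getsize(e.path) for e in entries) >= 100 * 1024 * 1024:
        return _copytree_multithreaded(src, dst, symlinks=symlinks, ignore=ignore,
                                       copy_function=copy_function,
                                       ignore_dangling_symlinks=ignore_dangling_symlinks,
                                       dirs_exist_ok=dirs_exist_ok)
    return _copytree_single_threaded(entries=entries, src=src, dst=dst,
                                     symlinks=symlinks, ignore=ignore,
                                     copy_function=copy_function,
                                     ignore_dangling_symlinks=ignore_dangling_symlinks,
                                     dirs_exist_ok=dirs_exist_ok)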

Test

I wrote 50,000 files to the source folder. The benchmark function:


def bench_mark(func, *args):
    import time
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    print(f"{func.__name__} costs {end - start}s")

Writing the files:


import os
os.mkdir("Test")
os.mkdir("Test/source")

# write in 50000 files
def write_in_file():
    for i in range(50000):
        with open(f"Test/source/{i}.txt", 'w') as f:
            f.write(f"{i}")

The two functions to compare:


def copy1():
    import shutil
    shutil.copytree('Test/source', 'Test/destination1')

def copy2():
    import my_shutil
    my_shutil.copytree('Test/source', 'Test/destination2')

  • “my_shutil” is my modified version of shutil.

copy1 costs 173.04780609999943s
copy2 costs 155.81321870000102s

copy2 is a lot faster than copy1. You can run the test many times to confirm.

Advantages & Disadvantages

Using multithreading can speed up the copying process, but it increases memory usage. On the other hand, callers do not need to write any threading code themselves.

Async

Thanks to Barry Scott. I will follow their suggestion:

You might get the same improvement for less overhead by using async I/O.

I wrote this code:


import os
import shutil
import my_shutil  # the modified shutil with the multithreaded copytree
import asyncio
from concurrent.futures import ThreadPoolExecutor
import time


# create directory
def create_target_directory(dst):
    os.makedirs(dst, exist_ok=True)

# copy 1 file
async def copy_file_async(src, dst):
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, shutil.copy2, src, dst)

# copy directory
async def copy_directory_async(src, dst, symlinks=False, ignore=None, dirs_exist_ok=False):
    entries = os.scandir(src)
    create_target_directory(dst)

    tasks = []
    for entry in entries:
        src_path = entry.path
        dst_path = os.path.join(dst, entry.name)

        if entry.is_dir(follow_symlinks=not symlinks):
            tasks.append(copy_directory_async(src_path, dst_path, symlinks, ignore, dirs_exist_ok))
        else:
            tasks.append(copy_file_async(src_path, dst_path))

    await asyncio.gather(*tasks)

# choose copy method
def choose_copy_method(entries, src, dst, **kwargs):
    if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100 * 1024 * 1024:
        # async version
        asyncio.run(copy_directory_async(src, dst, **kwargs))
    else:
        # single thread version
        shutil.copytree(src, dst, **kwargs)

# test function
def bench_mark(func, *args):
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    print(f"{func.__name__} costs {end - start:.2f}s")

# write in 50000 files
def write_in_50000_files():
    for i in range(50000):
        with open(f"Test/source/{i}.txt", 'w') as f:
            f.write(f"{i}")

def main():
    os.makedirs('Test/source', exist_ok=True)
    write_in_50000_files()

    # single threading
    def copy1():
        shutil.copytree('Test/source', 'Test/destination1')

    # multi threading (my_shutil is the modified module)
    def copy2():
        my_shutil.copytree('Test/source', 'Test/destination2')

    # async
    def copy3():
        entries = list(os.scandir('Test/source'))
        choose_copy_method(entries, 'Test/source', 'Test/destination3')

    bench_mark(copy1)
    bench_mark(copy2)
    bench_mark(copy3)

    shutil.rmtree('Test')

if __name__ == "__main__":
    main()

Output:

copy1 costs 187.21s
copy2 costs 244.33s
copy3 costs 111.27s


You can see that the async version is faster than the single-threaded version, but the single-threaded version is faster than the multithreaded version. (Maybe my test environment is not ideal; you can try it yourself and send your results as a reply.)

Thank you, Barry Scott!

Advantages & Disadvantages

Async is a good choice, but no solution is perfect. If you find any problems, please post them as a reply.

End

This is my first time writing a discussion topic on python.org. If there is any problem, please let me know. Thank you.

That’s about 10% improvement in performance, which I wouldn’t characterize as “a lot”, but is maybe still somewhat worth pursuing.

However, on a filesystem on a medium such as a conventional hard drive that performs much better with sequential access than with random access, you’ll most likely see “a lot” of performance degradation rather than improvement with multithreaded copying.

You may have to take a much more sophisticated approach than simple size checks in your choose_copy_method to avoid potentially large performance degradation, perhaps by first performing a small amount of multithreaded copying to determine if the filesystem is random-access-friendly before deciding on which way to go for the rest of the copying.
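
To make that concrete, here is a rough, hedged sketch of such a probe (the sample size of 16 files, 4 workers, and the 1.2x speedup threshold are arbitrary assumptions that would need tuning per OS and medium):

import os
import shutil
import time
from concurrent.futures import ThreadPoolExecutor

def _time_copies(pairs, workers):
    """Copy (src, dst) file pairs and return the elapsed wall-clock time."""
    start = time.perf_counter()
    if workers == 1:
        for s, d in pairs:
            shutil.copy2(s, d)
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(lambda p: shutil.copy2(*p), pairs))
    return time.perf_counter() - start

def parallel_copy_seems_worthwhile(src, dst, sample=16, workers=4):
    """Probe a small sample of files before committing to threaded copying."""
    files = [e for e in os.scandir(src) if e.is_file()][:2 * sample]
    if len(files) < 2 * sample:
        return False  # too few files for the probe to be meaningful
    os.makedirs(dst, exist_ok=True)
    # the probed files land in their final destinations, so no work is wasted
    pairs = [(e.path, os.path.join(dst, e.name)) for e in files]
    sequential = _time_copies(pairs[:sample], workers=1)
    threaded = _time_copies(pairs[sample:], workers=workers)
    # only call it a win if threads are clearly faster on this filesystem
    return sequential > 1.2 * threaded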

3 Likes

You might get the same improvement for less overhead by using async I/O.

1 Like

Run this code and send me the results:


import os
import shutil
import my_shutil
import asyncio
from concurrent.futures import ThreadPoolExecutor
import time


# create directory
def create_target_directory(dst):
    os.makedirs(dst, exist_ok=True)

# copy 1 file
async def copy_file_async(src, dst):
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, shutil.copy2, src, dst)

# copy directory
async def copy_directory_async(src, dst, symlinks=False, ignore=None, dirs_exist_ok=False):
    entries = os.scandir(src)
    create_target_directory(dst)

    tasks = []
    for entry in entries:
        src_path = entry.path
        dst_path = os.path.join(dst, entry.name)

        if entry.is_dir(follow_symlinks=not symlinks):
            tasks.append(copy_directory_async(src_path, dst_path, symlinks, ignore, dirs_exist_ok))
        else:
            tasks.append(copy_file_async(src_path, dst_path))

    await asyncio.gather(*tasks)

# choose copy method
def choose_copy_method(entries, src, dst, **kwargs):
    if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100 * 1024 * 1024:
        # async version
        asyncio.run(copy_directory_async(src, dst, **kwargs))
    else:
        # single thread version
        shutil.copytree(src, dst, **kwargs)

# test function
def bench_mark(func, *args):
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    print(f"{func.__name__} costs {end - start:.2f}s")
    return end-start

# write n files
def write_in_n_files(n):
    for i in range(n):
        with open(f"Test/source/{i}.txt", 'w') as f:
            f.write(f"{i}")

def main(n):

    os.makedirs('Test/source', exist_ok=True)
    write_in_n_files(n)
    def copy1():
        shutil.copytree('Test/source', 'Test/destination1')

    def copy2():
        my_shutil.copytree('Test/source', 'Test/destination2')

    def copy3():
        entries = list(os.scandir('Test/source'))
        choose_copy_method(entries, 'Test/source', 'Test/destination3')

    single_threading_time = bench_mark(copy1)
    multi_threading_time = bench_mark(copy2)
    async_time = bench_mark(copy3)

    shutil.rmtree('Test')

    return single_threading_time, multi_threading_time, async_time


if __name__ == '__main__':
    single_threading_times = []
    multi_threading_times = []
    async_times = []
    for n in range(1001, 50001, 100):
        print(f"{n} files")
        single_threading_time, multi_threading_time, async_time = main(n)
        single_threading_times.append(single_threading_time)
        multi_threading_times.append(multi_threading_time)
        async_times.append(async_time)
        print("========================")

    import matplotlib.pyplot as plt
    plt.plot(range(1001, 50001, 100), single_threading_times, label='single threading')
    plt.plot(range(1001, 50001, 100), multi_threading_times, label='multi threading')
    plt.plot(range(1001, 50001, 100), async_times, label='async')
    plt.xlabel('Number of files')
    plt.ylabel('Time (s)')
    plt.legend()
    plt.show()

[End : 2024/9/30 ]

[PS: You should update my_shutil.py because the original multithreading version is slow. See my own reply below for the improved version.]

ADD: Improved multi-threading version

This is an improved version (it is written to live inside my_shutil.py, a copy of Lib/shutil.py, so copy2, copystat, Error, os and sys are already available as module-level names):



def _copytree_multithreaded(src, dst, symlinks=False, ignore=None, copy_function=copy2,
                            ignore_dangling_symlinks=False, dirs_exist_ok=False, max_workers=4):
    """Recursively copy a directory tree using multiple threads."""
    sys.audit("shutil.copytree", src, dst)
    entries = list(os.scandir(src))
    os.makedirs(dst, exist_ok=dirs_exist_ok)
    errors = []

    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for srcentry in entries:
            srcname = os.path.join(src, srcentry.name)
            dstname = os.path.join(dst, srcentry.name)
            if srcentry.is_dir():
                futures.append(executor.submit(
                    _copytree_multithreaded,
                    srcname, dstname,
                    symlinks=symlinks, ignore=ignore, copy_function=copy_function,
                    ignore_dangling_symlinks=ignore_dangling_symlinks,
                    dirs_exist_ok=dirs_exist_ok
                ))
            else:
                futures.append(executor.submit(
                    copy_function, srcname, dstname
                ))

        for future in futures:
            try:
                future.result()
            except Exception as e:
                print(f"Failed to copy: {e}")
                raise

    try:
        copystat(src, dst)
    except OSError as why:
        if getattr(why, 'winerror', None) is None:
            errors.append((src, dst, str(why)))
    if errors:
        raise Error(errors)
    return dst

It is a lot faster than before.
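
For completeness, a hedged usage sketch (max_workers=8 is just an example value to tune; note that each recursive call into a subdirectory creates its own thread pool, so deep trees multiply the worker count):

import my_shutil  # the modified copy of shutil.py containing the function above

my_shutil._copytree_multithreaded('Test/source', 'Test/destination',
                                  dirs_exist_ok=True, max_workers=8)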

Love the way you’re thinking. In scenarios where you have a lot of small files to copy, another approach could be to first compress (e.g., using tar or zip) your files or folders and then copy the archive. This way, you’d only need to handle the metadata for a single file. However, it’s important to balance the time spent on compressing and decompressing with the time saved during copying.

1 Like

Thank you, Tom Monkey Man. I'm sorry I did not reply to your message earlier because I have been busy. I think your approach is useful, especially when there are a lot of small files.

This is the code following your suggestion:


import shutil
import os
import tarfile

source_dir = '/path/to/source'
destination_dir = '/path/to/destination'

temp_dir = '/tmp/temp_archive'
os.makedirs(temp_dir, exist_ok=True)

# path of the compressed archive
archive_path = os.path.join(temp_dir, 'archive.tar.gz')

# compress the source directory
with tarfile.open(archive_path, 'w:gz') as tar:
    tar.add(source_dir, arcname=os.path.basename(source_dir))

# copy the archive to the destination
shutil.copy2(archive_path, destination_dir)

# extract the archive at the destination
archive_path_in_destination = os.path.join(destination_dir, os.path.basename(archive_path))

with tarfile.open(archive_path_in_destination, 'r:gz') as tar:
    tar.extractall(path=destination_dir)

And we need a way to judge whether there are a lot of small files:


import os

def count_files_and_sizes(directory):
    file_count = 0
    total_size = 0
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            file_count += 1
            total_size += os.path.getsize(file_path)
    
    return file_count, total_size

def check_large_number_of_small_files(directory, threshold_count=1000, threshold_size_kb=10):
    file_count, total_size = count_files_and_sizes(directory)
    average_size = total_size / file_count if file_count > 0 else 0
    
    # too many files?
    if file_count > threshold_count:
        return True

    # average file size below the threshold?
    if average_size < threshold_size_kb * 1024:
        return True

    return False
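
To tie this check to the archive-based copy above, here is a hedged sketch (copy_via_tar and smart_copy are hypothetical helper names; check_large_number_of_small_files is the function defined just above):

import os
import shutil
import tarfile
import tempfile

def copy_via_tar(src, dst):
    """Hypothetical helper: pack src, copy the archive, unpack it at dst."""
    os.makedirs(dst, exist_ok=True)
    with tempfile.TemporaryDirectory() as temp_dir:
        archive = os.path.join(temp_dir, 'archive.tar.gz')
        with tarfile.open(archive, 'w:gz') as tar:
            tar.add(src, arcname=os.path.basename(src))
        shutil.copy2(archive, dst)
    # extract the copied archive in place, then remove it
    copied = os.path.join(dst, 'archive.tar.gz')
    with tarfile.open(copied, 'r:gz') as tar:
        tar.extractall(path=dst)
    os.remove(copied)

def smart_copy(src, dst):
    # only take the archive detour when there really are many small files
    if check_large_number_of_small_files(src):
        copy_via_tar(src, dst)
    else:
        shutil.copytree(src, dst)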


1 Like

Remember that there are zip -0 and plain tar without any compression.

1 Like

Thank you for the reminder! Using zip -0 or tar without compression can indeed speed up the process when dealing with a large number of small files. This approach reduces the time spent on compression and decompression while still providing the benefits of handling a single archive file.

Here are the two solutions:


import os
import shutil
import zipfile

# Define source and destination directories
source_dir = '/path/to/source'
destination_dir = '/path/to/destination'

# Create a temporary directory to store the archive
temp_dir = '/tmp/temp_archive'
os.makedirs(temp_dir, exist_ok=True)

# Use zip -0 to pack files
archive_path = os.path.join(temp_dir, 'archive.zip')

# Pack the directory
with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_STORED) as zipf:
    for root, dirs, files in os.walk(source_dir):
        for file in files:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, source_dir)
            zipf.write(file_path, arcname)

print(f"Packing completed: {archive_path}")

# Copy the archived file to the destination
shutil.copy2(archive_path, destination_dir)

print(f"Archive file copied to: {os.path.join(destination_dir, os.path.basename(archive_path))}")

# Path of the archive at the destination
archive_path_in_destination = os.path.join(destination_dir, os.path.basename(archive_path))

# Extract the archive
with zipfile.ZipFile(archive_path_in_destination, 'r') as zipf:
    zipf.extractall(destination_dir)

print(f"Extraction completed: {destination_dir}")

OR:

import os
import shutil
import tarfile

# Define source and destination directories
source_dir = '/path/to/source'
destination_dir = '/path/to/destination'

# Create a temporary directory to store the archive
temp_dir = '/tmp/temp_archive'
os.makedirs(temp_dir, exist_ok=True)

# Use tar to pack files without compression
archive_path = os.path.join(temp_dir, 'archive.tar')

# Pack the directory
with tarfile.open(archive_path, 'w') as tar:
    tar.add(source_dir, arcname=os.path.basename(source_dir))

print(f"Packing completed: {archive_path}")

# Copy the archived file to the destination
shutil.copy2(archive_path, destination_dir)

print(f"Archive file copied to: {os.path.join(destination_dir, os.path.basename(archive_path))}")

# Path of the archive at the destination
archive_path_in_destination = os.path.join(destination_dir, os.path.basename(archive_path))

# Extract the archive
with tarfile.open(archive_path_in_destination, 'r') as tar:
    tar.extractall(destination_dir)

print(f"Extraction completed: {destination_dir}")

Best regards,
Meng Qinyuan

1 Like

Sorry, I forgot to add one more trick: the tar format is extremely resistant to any kind of abuse, so it is possible to just send it through a pipe between two tar instances, one archiving, the other extracting to the new location (I am not sure whether the zip format would survive that, but tar does). Or, if not that, the archive could be stored in a FIFO (named pipe) and extracted from there? (not sure about that) The result is that you don't have to actually create a physical temporary file, and you are not limited by the available drive space. And of course, you get the advantages of parallel processing for free.

The same trick can be used even for other purposes (using shell commands as an example, since it is simpler):

tar cf - sourcedir/ | ssh somewhere tar -C targetlocation/ -xf -
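
For reference, here is a hedged Python sketch of the same pipe trick using subprocess (local-to-local; for a remote target the second command would become the ssh ... tar invocation shown above):

import os
import subprocess

def tar_pipe_copy(source_dir, target_dir):
    """Stream a tar archive from one tar process straight into another."""
    os.makedirs(target_dir, exist_ok=True)
    # producer writes the archive to stdout, consumer extracts from stdin
    producer = subprocess.Popen(['tar', 'cf', '-', source_dir],
                                stdout=subprocess.PIPE)
    consumer = subprocess.Popen(['tar', '-C', target_dir, '-xf', '-'],
                                stdin=producer.stdout)
    producer.stdout.close()  # let the producer get SIGPIPE if the consumer exits
    consumer.wait()
    producer.wait()
    if producer.returncode or consumer.returncode:
        raise RuntimeError('tar pipe copy failed')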
1 Like

Lots of small files can cost you in metadata lookups and the kernel locking that they require. Using a tar or zip archive can help batch up a set of small files, but it does not help with the metadata and its kernel locking. For that you do indeed need multiple parallel operations.

As always with this type of problem, benchmark to see where the bottlenecks are in practice. Oh, and it's likely to be different on each OS.

1 Like

Yes, "|" (the pipe) is very useful.

Not sure how ptgz (GitHub - rchui/ptgz: fast, parallel C++ file archive, bundling, and compression utility) works.

1 Like

Are you asking for information beyond what the README explains?

1 Like

It’s not clear to me what the idea to improve the Python language is. Are you suggesting changing the stdlib implementation of shutil.copytree?

As other commentators have noted, the speed of execution is going to be very dependent on the OS and on what is implementing the file system (e.g. RAM disk, local disk, NFS, sshfs, etc.).

The stdlib version should be simple and predictable, and optimization of large copies left to libraries outside of stdlib.

4 Likes

Hi Gerardwx,

Thanks for your insights. Here are some clarifications:

  1. Improving File Copying Performance:

    • The goal is to significantly improve the speed of copying files, especially when dealing with a large number of small files.
    • We are not proposing changes to the standard library itself, but rather exploring alternative approaches and optimizations that can be implemented in external libraries.
  2. Execution Speed Factors:

    • You’re right that the execution speed depends heavily on the underlying file system and operating system. Different environments (e.g., RAM disk, local disk, NFS, sshfs) will have varying performance characteristics.
    • It’s important to benchmark and test these optimizations in different environments to understand their impact.
  3. Standard Library Simplicity:

    • Keeping the standard library simple and predictable is crucial. External libraries can provide more advanced and optimized solutions for specific use cases.
  4. Alternative Approaches:

    • We’ve explored several approaches, including:
      • Improved multi-threading using ThreadPoolExecutor.
      • Compressing files into archives before copying and then decompressing.
      • Using asynchronous I/O to reduce overhead.

By testing these approaches, we can identify the best methods for optimizing file copying in various scenarios.

New Title: “Significantly Improve Speed of Copying Files”

Best regards,
Meng Qinyuan

1 Like

No, no, no! This is not NodeJS, we have batteries included; the standard library is actually meant to be useful, not just a toy/example. If there is any issue with any part of it that can be resolved with a reasonable amount of fuss and without negatively affecting any other uses, it should be fixed!

1 Like

This makes a lot of sense to me, as Python isn't primarily designed for problem-solving. However, I really appreciate Meng's idea of doing optimizations purely in Python code, especially considering how easy it is to port Python code to heterogeneous environments nowadays. I've encountered similar issues when syncing chunks of data, but I've never fully worked out the best practices. I think this is a topic worth discussing, even though it may take some time to determine whether it warrants a PR against the standard library.

1 Like

Sorry, I am busy at the moment. I may not reply to some messages because I am at school now.

You will add my name to it, yes?