Discussion: speeding up shutil.copytree
Version
- Python: 3.11.9
Background
shutil is a very useful module in Python. You can find its source on GitHub: https://github.com/python/cpython/blob/master/Lib/shutil.py
shutil.copytree is a function that copies a directory tree to another location.
Internally, it calls the _copytree function to do the actual copying.
What does _copytree do?
- Ignoring specified files/directories.
- Creating destination directories.
- Copying files or directories while handling symbolic links.
- Collecting and eventually raising errors encountered (e.g., permission issues).
- Replicating metadata of the source directory to the destination directory.
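The steps above can be sketched roughly as follows. This is a simplified illustration, not the real CPython implementation: the name copytree_sketch is hypothetical, and options such as dirs_exist_ok, copy_function and ignore_dangling_symlinks are left out.

```python
import os
import shutil


def copytree_sketch(src, dst, ignore=None, symlinks=False):
    """Simplified illustration of the listed steps; not the real shutil code."""
    with os.scandir(src) as it:
        entries = list(it)
    # 1. Ask the ignore callback which names to skip.
    ignored = set(ignore(src, [e.name for e in entries])) if ignore else set()
    # 2. Create the destination directory.
    os.makedirs(dst)
    errors = []
    for entry in entries:
        if entry.name in ignored:
            continue
        src_path = entry.path
        dst_path = os.path.join(dst, entry.name)
        try:
            if entry.is_symlink() and symlinks:
                # 3a. Recreate the link itself instead of copying its target.
                os.symlink(os.readlink(src_path), dst_path)
            elif entry.is_dir():
                # 3b. Recurse into subdirectories.
                copytree_sketch(src_path, dst_path, ignore, symlinks)
            else:
                # 3c. Copy a regular file together with its metadata.
                shutil.copy2(src_path, dst_path)
        except shutil.Error as err:
            # Errors from a nested call are already a list of tuples.
            errors.extend(err.args[0])
        except OSError as why:
            # 4. Collect the error and keep going with the other entries.
            errors.append((src_path, dst_path, str(why)))
    # 5. Replicate the metadata of the source directory itself.
    try:
        shutil.copystat(src, dst)
    except OSError as why:
        errors.append((src, dst, str(why)))
    if errors:
        raise shutil.Error(errors)
    return dst
```

The point to notice is that errors are collected per entry and raised once at the end, so one failing file does not stop the rest of the copy.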
Problems
_copytree is slow when the number of files is large or the files themselves are large.
Test here:
import os
import shutil
import time

os.mkdir('test')
os.mkdir('test/source')

def bench_mark(func, *args):
    start = time.time()
    func(*args)
    end = time.time()
    print(f'{func.__name__} takes {end - start} seconds')
    return end - start

# write in 5000 files
def write_in_5000_files():
    for i in range(5000):
        # the with block closes the file, so no explicit close() is needed
        with open(f'test/source/{i}.txt', 'w') as f:
            f.write('Hello World' + os.urandom(24).hex())

bench_mark(write_in_5000_files)

def copy():
    shutil.copytree('test/source', 'test/destination')

bench_mark(copy)
The result is:
write_in_5000_files takes 4.084963083267212 seconds
copy takes 27.12768316268921 seconds
What I did
Multithreading
I use multithreading to speed up the copying process. I renamed the original function to _copytree_single_threaded and added a new function, _copytree_multithreaded. Here is _copytree_multithreaded:
from concurrent.futures import ThreadPoolExecutor, as_completed

def _copytree_multithreaded(src, dst, symlinks=False, ignore=None, copy_function=shutil.copy2,
                            ignore_dangling_symlinks=False, dirs_exist_ok=False, max_workers=4):
    """Recursively copy a directory tree using multiple threads."""
    sys.audit("shutil.copytree", src, dst)

    # get the entries to copy
    with os.scandir(src) as itr:
        entries = list(itr)

    # make the pool
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit one task per top-level entry
        futures = [
            executor.submit(_copytree_single_threaded, entries=[entry], src=src, dst=dst,
                            symlinks=symlinks, ignore=ignore, copy_function=copy_function,
                            ignore_dangling_symlinks=ignore_dangling_symlinks,
                            dirs_exist_ok=dirs_exist_ok)
            for entry in entries
        ]

        # wait for the tasks and re-raise the first failure
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f"Failed to copy: {e}")
                raise
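The function above re-raises the first failure, whereas the original _copytree collects errors and raises them together at the end. A minimal sketch of how that behaviour could be kept with a thread pool is shown below. It only handles the regular files directly inside one directory, and copy_files_threaded is a hypothetical helper name, not part of shutil or of my patch.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed


def copy_files_threaded(src, dst, max_workers=4):
    """Copy the regular files directly inside src into dst, collecting errors."""
    os.makedirs(dst, exist_ok=True)
    with os.scandir(src) as it:
        files = [e for e in it if e.is_file()]
    errors = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its paths so failures can be reported.
        futures = {
            executor.submit(shutil.copy2, e.path, os.path.join(dst, e.name)):
                (e.path, os.path.join(dst, e.name))
            for e in files
        }
        for future in as_completed(futures):
            try:
                future.result()
            except OSError as why:
                # Remember the failure and let the other copies finish.
                errors.append((*futures[future], str(why)))
    if errors:
        raise shutil.Error(errors)
    return dst
```

Keeping a dictionary from future to paths costs a little memory, but it makes the error report as detailed as the one from the single-threaded code.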
I added a check to decide whether to use multithreading or not.
if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100*1024*1024:
    # multithreaded version
    return _copytree_multithreaded(src, dst, symlinks=symlinks, ignore=ignore,
                                   copy_function=copy_function,
                                   ignore_dangling_symlinks=ignore_dangling_symlinks,
                                   dirs_exist_ok=dirs_exist_ok)
else:
    # single threaded version
    return _copytree_single_threaded(entries=entries, src=src, dst=dst,
                                     symlinks=symlinks, ignore=ignore,
                                     copy_function=copy_function,
                                     ignore_dangling_symlinks=ignore_dangling_symlinks,
                                     dirs_exist_ok=dirs_exist_ok)
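The same check could also be written as a small helper that stops summing sizes as soon as a threshold is reached. This is only a sketch: should_parallelize and the two constant names are hypothetical, and it uses entry.stat(follow_symlinks=False), which differs from os.path.getsize in that it does not follow symbolic links.

```python
import os

FILE_COUNT_THRESHOLD = 100                # same limits as in the check above
TOTAL_SIZE_THRESHOLD = 100 * 1024 * 1024  # 100 MiB


def should_parallelize(entries,
                       count_threshold=FILE_COUNT_THRESHOLD,
                       size_threshold=TOTAL_SIZE_THRESHOLD):
    """Return True if the entries look big enough to justify a thread pool."""
    if len(entries) >= count_threshold:
        return True
    total = 0
    for entry in entries:
        # DirEntry.stat() can reuse data cached by os.scandir().
        total += entry.stat(follow_symlinks=False).st_size
        if total >= size_threshold:
            return True  # stop early, no need to stat the remaining entries
    return False
```

With many entries the count check returns first, so no file sizes are read at all in that case.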
Test
I wrote 50000 files into the source folder. Benchmark helper:
def bench_mark(func, *args):
    import time
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    print(f"{func.__name__} costs {end - start}s")
Writing the files:
import os

os.mkdir("Test")
os.mkdir("Test/source")

# write in 50000 files
def write_in_file():
    for i in range(50000):
        # the with block closes the file, so no explicit close() is needed
        with open(f"Test/source/{i}.txt", 'w') as f:
            f.write(f"{i}")
The two versions being compared:
def copy1():
    import shutil
    # same "Test" directory that the files were written to
    shutil.copytree('Test/source', 'Test/destination1')

def copy2():
    import my_shutil
    my_shutil.copytree('Test/source', 'Test/destination2')
- “my_shutil” is my modified version of shutil.
copy1 costs 173.04780609999943s
copy2 costs 155.81321870000102s
copy2 is about 10% faster than copy1 in this run. You can run it several times to check.
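Since a single run can be noisy, a helper that repeats the measurement may give a fairer picture. The sketch below is an assumption about how one might do that; bench_repeat is a hypothetical name and is not the helper used for the numbers above.

```python
import statistics
import time


def bench_repeat(func, repeats=3):
    """Run func several times and return the list of timings in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    print(f"{func.__name__}: min {min(timings):.3f}s, "
          f"mean {statistics.mean(timings):.3f}s over {repeats} runs")
    return timings
```

For a copy benchmark, func has to remove or rename the destination folder between runs, because copytree fails if the destination already exists (unless dirs_exist_ok=True is passed).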
Advantages & Disadvantages
Using multithreading can speed up the copying process, but it also increases memory usage. On the plus side, the threading logic does not have to be written by hand, because ThreadPoolExecutor already provides it.
Async
Thanks to “Barry Scott”. I will follow his/her suggestion:
You might get the same improvement for less overhead by using async I/O.
I wrote this code:
import os
import shutil
import asyncio
from concurrent.futures import ThreadPoolExecutor
import time

# create directory
def create_target_directory(dst):
    os.makedirs(dst, exist_ok=True)

# copy 1 file
async def copy_file_async(src, dst):
    # run the blocking copy in the default thread pool of the running loop
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, shutil.copy2, src, dst)

# copy directory
async def copy_directory_async(src, dst, symlinks=False, ignore=None, dirs_exist_ok=False):
    # read all entries first so the scandir iterator is closed promptly
    with os.scandir(src) as it:
        entries = list(it)
    create_target_directory(dst)

    tasks = []
    for entry in entries:
        src_path = entry.path
        dst_path = os.path.join(dst, entry.name)

        if entry.is_dir(follow_symlinks=not symlinks):
            tasks.append(copy_directory_async(src_path, dst_path, symlinks, ignore, dirs_exist_ok))
        else:
            tasks.append(copy_file_async(src_path, dst_path))

    await asyncio.gather(*tasks)

# choose copy method
def choose_copy_method(entries, src, dst, **kwargs):
    if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100 * 1024 * 1024:
        # async version
        asyncio.run(copy_directory_async(src, dst, **kwargs))
    else:
        # single thread version
        shutil.copytree(src, dst, **kwargs)

# test function
def bench_mark(func, *args):
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    print(f"{func.__name__} costs {end - start:.2f}s")

# write in 50000 files
def write_in_50000_files():
    for i in range(50000):
        with open(f"Test/source/{i}.txt", 'w') as f:
            f.write(f"{i}")

def main():
    os.makedirs('Test/source', exist_ok=True)
    write_in_50000_files()

    # single threading
    def copy1():
        shutil.copytree('Test/source', 'Test/destination1')

    def copy2():
        shutil.copytree('Test/source', 'Test/destination2')

    # async
    def copy3():
        entries = list(os.scandir('Test/source'))
        choose_copy_method(entries, 'Test/source', 'Test/destination3')

    bench_mark(copy1)
    bench_mark(copy2)
    bench_mark(copy3)

    shutil.rmtree('Test')

if __name__ == "__main__":
    main()
Output:
copy1 costs 187.21s
copy2 costs 244.33s
copy3 costs 111.27s
You can see that the async version is faster than the single-threaded version. But the single-threaded version is faster than the multithreaded version. (Maybe my test environment is not very good; you can try it yourself and send me your results as a reply.)
Thank you, Barry Scott!
Advantages & Disadvantages
Async is a good choice, but no solution is perfect. If you find any problems, please let me know in a reply.
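One possible problem is that asyncio.gather starts every copy task at once, which can mean tens of thousands of pending tasks for a large folder. The sketch below shows one way to bound that with a semaphore. It is an assumption on my side, not a tested patch: copy_tree_bounded and its limit parameter are hypothetical names. Note that asyncio.to_thread still runs shutil.copy2 in the default thread pool, so the real work is done by threads here as well.

```python
import asyncio
import os
import shutil


async def copy_tree_bounded(src, dst, limit=32):
    """Copy src into dst, running at most `limit` file copies at once."""
    semaphore = asyncio.Semaphore(limit)

    async def copy_file(src_path, dst_path):
        async with semaphore:
            # The blocking copy runs in the default executor (a thread pool).
            await asyncio.to_thread(shutil.copy2, src_path, dst_path)

    async def copy_dir(src_dir, dst_dir):
        os.makedirs(dst_dir, exist_ok=True)
        with os.scandir(src_dir) as it:
            entries = list(it)
        tasks = []
        for entry in entries:
            target = os.path.join(dst_dir, entry.name)
            if entry.is_dir():
                tasks.append(copy_dir(entry.path, target))
            else:
                tasks.append(copy_file(entry.path, target))
        await asyncio.gather(*tasks)

    await copy_dir(src, dst)
```

It can be started with asyncio.run(copy_tree_bounded('Test/source', 'Test/destination3')). Symbolic links, the ignore callback and error collection are not handled in this sketch.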
End
This is my first time writing a discussion on python.org. If there is any problem, please let me know. Thank you.