I would like to copy a given amount of content from file a to file b in a streamed manner (i.e. it works with files of arbitrary size with little impact on memory usage). I was led to shutil.copyfileobj, but it lacks the ability to copy only a given number of bytes: it copies everything from file a’s current position until file a is exhausted. That is, it works for skipping padding at the beginning of file a, but not padding at the end.
A more careful search hinted that tarfile.copyfileobj does exactly what I want. However, this function is neither in the documentation nor shown in help(tarfile). It appears to have been part of CPython since v3.6.
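For example, here is roughly how I would use it (the file names and byte counts below are made up for illustration):

import tarfile

LEADING_PADDING = 128       # hypothetical amount of padding to skip at the start
PAYLOAD_LENGTH = 1_000_000  # hypothetical number of payload bytes to copy

with open("a.bin", "rb") as src, open("b.bin", "wb") as dst:
    src.seek(LEADING_PADDING)                      # skip the leading padding
    tarfile.copyfileobj(src, dst, PAYLOAD_LENGTH)  # copy exactly that many bytes, buffered
    # the trailing padding of a.bin is simply never read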
Questions:
Are there other documented functions that would achieve what I would like to do?
Would shutil.copyfileobj be extended to allow specifying the length of data to be copied?
Is it safe to use tarfile.copyfileobj in my project - might this function be modified, or could it be documented and stabilized as part of the standard library?
copyfileobj is not in tarfile.__all__, which means copyfileobj is private. A future Python release may remove it without a deprecation period, so it is unsafe.
Instead, you can just copy the function. “A little copying is better than a little dependency.”
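For instance, a self-contained copy could look roughly like this (the name and defaults are mine, not from the stdlib):

def copy_n_bytes(src, dst, length, bufsize=16 * 1024):
    """Stream exactly `length` bytes from src to dst in bufsize-sized chunks."""
    while length > 0:
        buf = src.read(min(bufsize, length))
        if not buf:
            raise OSError("unexpected end of data")
        dst.write(buf)
        length -= len(buf)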
If the function is really useful to many Python developers, it should be moved into shutil. I am -1 on documenting and stabilizing it as a public tarfile member.
It’s a good function, but although you said “in a streamed manner”, it’s not actually doing anything fancy, nor any async stuff. It’s implemented in the simple, obvious way that probably first occurred to us all, which boils down to this for loop:
for b in range(blocks):
    buf = src.read(bufsize)
    if len(buf) < bufsize:
        # `exception` is a caller-supplied exception class, OSError by default
        raise exception("unexpected end of data")
    dst.write(buf)
Checking git blame, it was written 22 years ago and last updated 8 years ago. So yes, it is unsafe, but I’d say it’s pretty low risk, and easily mitigated, to depend on this particular implementation detail and import it from tarfile - and to copy the old version or reimplement it if the need is felt to change it for Python 3.14 for some reason.
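In other words, something along these lines (a sketch; the fallback just vendors the same loop shown above):

try:
    from tarfile import copyfileobj as copy_exactly   # private helper, stable for years
except ImportError:
    # Fallback in case a future release removes or changes the private helper.
    def copy_exactly(src, dst, length, bufsize=16 * 1024):
        blocks, remainder = divmod(length, bufsize)
        for _ in range(blocks):
            buf = src.read(bufsize)
            if len(buf) < bufsize:
                raise OSError("unexpected end of data")
            dst.write(buf)
        if remainder:
            buf = src.read(remainder)
            if len(buf) < remainder:
                raise OSError("unexpected end of data")
            dst.write(buf)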
shutil.copyfileobj uses the same logic. Everyone is doing the same thing: read a block of size x from file a and write it to file b, though I am not sure whether it might gain some platform-specific optimization in the future.
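For reference, its loop boils down to roughly this (paraphrased; note that its length parameter is the per-read chunk size, not a total byte count):

def copyfileobj_like(fsrc, fdst, length=64 * 1024):
    # Read fixed-size chunks until fsrc is exhausted; there is no way
    # to stop after a given total number of bytes.
    while True:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)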
I just need something to make my code cleaner - defining such a size-constrained version of shutil.copyfileobj as custom code is not much fun, nor is it fun to introduce a hard-coded, arbitrary copy_bufsize that I have to explain.
Given that tarfile.copyfileobj has been there since its debut, I might just keep using it.
Does anyone think getting such a function into shutil would be helpful? I believe it could be used in the following situations:
extracting a slice of a file to another location for further processing
handling streams in a length-payload format, such as SCGI (sketched below)
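For the second case, a rough sketch of what I have in mind (a netstring-style length prefix; the function name is made up and the helper call is the undocumented tarfile one):

import tarfile

def copy_netstring_payload(src, dst):
    # Read the ASCII length prefix up to the ':' separator.
    digits = b""
    while (ch := src.read(1)) != b":":
        if not ch.isdigit():
            raise ValueError("malformed length prefix")
        digits += ch
    length = int(digits)
    tarfile.copyfileobj(src, dst, length)  # stream exactly `length` payload bytes
    if src.read(1) != b",":
        raise ValueError("missing netstring terminator")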
Also, if it could get into shutil, should it replace copyfileobj with a signature change, or should it be its own function with a new name such as copyfileslice?
There will be use cases, e.g. copying fixed-size file headers. But I think it’s a special case: you have to know exactly how many bytes of a file are needed (and that they start at the beginning of the file), and those bytes should be copied to another file using buffering rather than read into memory.
I think most code needs to do further processing based on the structure and content of what’s actually in those bytes.
As I already said, I’d just import it from tarfile and fix any issues afterwards should they arise. But I don’t think it’s worth doing any more work on it than that, to get it into shutil or otherwise.