Read/write byte objects

I’m coding a file backup app and I’ve got the basics down, aside from the ‘engine’, so to speak.

I’ve coded this:

def read_write(source, target):
    rBytes = True
    with open(source, mode='rb') as read_file, open(target, mode='wb') as write_file:
        while rBytes:
            rBytes = read_file.read(1024)
            write_file.write(rBytes)

… and it works. I’ve tested it with a number of files, the smallest being 415 bytes and the largest being 167.5 MB.

My thinking is: just because it works, does not mean that it’s the correct or even the best way to do this. Having to set rBytes = True seems like a bit of a hack to me, but I can’t think how else to initialize the read/write. I’ve done that in 1024 chunks, so that larger files do not have to occupy a whole lot of RAM.

Any thoughts and guidance would be appreciated as I’ve never even tried to read/write byte objects before now.

Thanks.


Related to: File backup app [solved]

> My thinking is: just because it works, does not mean that it’s the
> correct or even the best way to do this. Having to set rBytes = True
> seems like a bit of a hack to me, but I can’t think how else to
> initialize the read/write. I’ve done that in 1024 chunks, so that
> larger files do not have to occupy a whole lot of RAM.

These days 1MiB is not unreasonable.

> Any thoughts and guidance would be appreciated as I’ve never even tried
> to read/write byte objects before now.

Try the walrus operator?

while rBytes := read_file.read(1024):
    write_file.write(rBytes)

Cheers,
Cameron Simpson cs@cskk.id.au

Thanks, but I need the app to be backward compatible. I have systems that are still on Python 3.6.9

That does seem (to me) to be one of the best use cases I’ve seen for the ‘walrus’.
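
For reference, a 3.6-compatible way to drop the sentinel variable is the two-argument form of iter(), which keeps calling the read until it returns b''. A sketch along those lines (same function shape as above):

def read_write(source, target):
    # iter(callable, sentinel) calls read_file.read(1024) repeatedly
    # and stops when it returns b'' (end of file); works on Python 3.6.
    with open(source, mode='rb') as read_file, open(target, mode='wb') as write_file:
        for chunk in iter(lambda: read_file.read(1024), b''):
            write_file.write(chunk)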

If you are trying to make a copy of a file, you should consider using the functions in the shutil module.

Otherwise, your function is fine for what it does.
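
For example, shutil.copyfile copies just the contents, and shutil.copy2 also tries to preserve metadata such as modification times, which an incremental backup may care about (the file names here are placeholders):

import shutil

# Copy the file contents only.
shutil.copyfile('source.bin', 'target.bin')

# Copy the contents and also attempt to preserve metadata
# (modification time, permission bits).
shutil.copy2('source.bin', 'target_with_metadata.bin')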


In that case, I prefer the pattern:

def read_write(source, target):
    with open(source, mode='rb') as read_file, open(target, mode='wb') as write_file:
        while 1:
            rBytes = read_file.read(1024)
            write_file.write(rBytes)
            if rBytes < 1024:
                break

That loses the end of the file. You need this:

if rBytes == b'':
    break

I tend to find 1MiB small for copying big files. I’ve used 128MiB in cases where speed and large files are involved.
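
If the copy goes through shutil, the chunk size is a one-argument change: shutil.copyfileobj takes a buffer length. A sketch with a 128 MiB buffer (the paths are placeholders):

import shutil

with open('source.bin', 'rb') as read_file, open('target.bin', 'wb') as write_file:
    # The third argument is the buffer size in bytes; here 128 MiB.
    shutil.copyfileobj(read_file, write_file, 128 * 1024 * 1024)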

It’s already written the content out, so in the normal case, it should be fine. However, this can only be guaranteed for ordinary files, and shouldn’t be trusted for pipes.

This is a very good use for automatic boolification. You can use while rBytes or if not rBytes: break safely here, as there’s no way for it ever to mean the wrong thing; but with the comparison, you risk an infinite loop if ever you switch your files from text to binary or vice versa. Why would you introduce a completely unnecessary bug risk if simpler code is also safer?
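
Concretely, a 3.6-friendly version of that test looks something like this sketch, with the check right after the read so an empty chunk is never written:

def read_write(source, target):
    with open(source, mode='rb') as read_file, open(target, mode='wb') as write_file:
        while True:
            rBytes = read_file.read(1024)
            if not rBytes:  # b'' is falsy, so this fires exactly at end of file
                break
            write_file.write(rBytes)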

As my system will be doing incremental backups, the only speed issue (such as it will be) will be the first time it is run or if a large file (define as you will) has been created or altered since the previous run.

On the other points here: as my function is working as I intended and nobody has said “Hey Rob, you’re going to run into issues with that!” I see no compelling reason to alter it, but I do, nonetheless, appreciate the feedback and the time you have all spent in posting such.

Oh right. It’s normal to have the if after the read, so that it avoids a write of an empty string.

The read method will return up to the specified size, so it might return fewer bytes even if more are available. The only guarantee is that it’ll return 0 bytes at the end of the file but > 0 otherwise.
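
For ordinary files opened in binary mode this rarely shows up, but if you ever copy from a pipe or socket you’d need to loop until you have what you asked for. A sketch of that idea (read_exact is a made-up helper name):

def read_exact(read_file, size):
    # Keep reading until we have `size` bytes or hit end of file;
    # useful for pipes/sockets where a single read may come up short.
    parts = []
    remaining = size
    while remaining > 0:
        chunk = read_file.read(remaining)
        if not chunk:
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b''.join(parts)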

I was wrong, I meant:

if len(rBytes) < 1024:
    break

I think comparing contents can take one more loop iteration than comparing the length, though both cost about the same.

Kinda OT, but I’d think you would want something slightly smaller than your disk cache – which you probably don’t know.

In practice, when I’ve performance tested this kind of thing, there is very little difference for anything between tiny and huge – try it, but tiny is less than O(1k) or so, and huge is greater than O(1GB), depending on your hardware of course.
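
If you want to measure it on your own machine, a rough timing sketch (the file names are hypothetical, and OS caching will skew repeated runs, so treat the numbers as indicative only):

import shutil
import time

def copy_with_chunk(source, target, chunk_size):
    with open(source, 'rb') as read_file, open(target, 'wb') as write_file:
        shutil.copyfileobj(read_file, write_file, chunk_size)

# Try a spread of chunk sizes, from tiny to huge.
for chunk_size in (1024, 64 * 1024, 1024 * 1024, 128 * 1024 * 1024):
    start = time.perf_counter()
    copy_with_chunk('big_test_file.bin', 'copy_test.bin', chunk_size)
    print(chunk_size, round(time.perf_counter() - start, 3))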

I believe that Python does its own internal buffering of reads/writes, so even when you read a byte at a time, Python is actually reading a larger block from disk and then feeding you one byte at a time from that buffer, not hitting the disk over and over again for one byte. Likewise for writes.
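
You can see that layer directly: a binary open() hands back a BufferedReader (or BufferedWriter) wrapping the raw file, and io.DEFAULT_BUFFER_SIZE is the fallback buffer size it uses (the actual size may follow the filesystem’s preferred block size). A quick check, with a placeholder file name:

import io

print(io.DEFAULT_BUFFER_SIZE)          # fallback buffer size, in bytes

with open('example.bin', 'rb') as f:   # placeholder file name
    print(type(f))                     # <class '_io.BufferedReader'>
    print(type(f.raw))                 # the unbuffered FileIO underneath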

And then the OS also does caching, and so does the drive itself, so with two or three distinct layers of indirection there is no longer any obvious correlation between how much data you read or write in Python and when / how much data hits the spinning metal. (If there even is spinning metal.)

The results can depend on your drive, e.g. many hard drives will lie to the OS and claim to have finished writing data to disk when it is still cached in volatile memory. So you think the data is written but it can take seconds more before it actually is written permanently.
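
You can at least push the data past Python’s buffer and the OS cache with flush() followed by os.fsync(); whether the drive’s own cache honours that is, as noted, up to the drive. A sketch:

import os

def write_and_sync(target, data):
    with open(target, 'wb') as write_file:
        write_file.write(data)
        write_file.flush()              # flush Python's buffer to the OS
        os.fsync(write_file.fileno())   # ask the OS to flush its cache to the device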

That was a problem a long time ago. But after Microsoft refused to give WHQL certification to disks that lied, that stopped, or so I believe.
Do you have recent evidence of this problem coming back?

Hmmm, well we agree that it was a problem, and I hope we agree that not all of those hard drives from “a long time ago” (ten years?) will have been retired, so I suppose we agree that I was technically correct (the best sort of correct!) that some drives will lie to the OS.

I’m not aware of the problem “coming back”, but neither am I aware of the problem “going away”.

At least 15 years ago. It’s not something that you need to worry about pragmatically.