I’m coding a file backup app and I’ve got the basics down, aside from the ‘engine’, so to speak.
I’ve coded this:
def read_write(source, target):
    rBytes = True
    with open(source, mode='rb') as read_file, open(target, mode='wb') as write_file:
        while rBytes:
            rBytes = read_file.read(1024)
            write_file.write(rBytes)
… and it works. I’ve tested it with a number of files, the smallest being 415 bytes, the largest being 167.5 MB
My thinking is: just because it works, does not mean that it’s the correct or even the best way to do this. Having to set rBytes = True seems like a bit of a hack to me, but I can’t think how else to initialize the read/write. I’ve done that in 1024 chunks, so that larger files do not have to occupy a whole lot of RAM.
Any thoughts and guidance would be appreciated as I’ve never even tried to read/write byte objects before now.
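For comparison (this isn't from the thread itself): the standard library already ships this exact chunked-copy loop as shutil.copyfileobj, which reads until read() returns an empty chunk. A minimal sketch using in-memory streams:

```python
import io
import shutil

data = b"hello backup " * 100          # 1300 bytes, not a multiple of 1024
src = io.BytesIO(data)
dst = io.BytesIO()

# copyfileobj loops read()/write() in chunks (here 1024 bytes) until
# read() returns b"", exactly like the hand-rolled engine above
shutil.copyfileobj(src, dst, length=1024)

assert dst.getvalue() == data
```

So if the hand-written loop ever feels like a liability, the stdlib version is a drop-in alternative.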
These days 1MiB is not unreasonable.
Try the walrus operator?
while rBytes := read_file.read(1024):
    write_file.write(rBytes)
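Dropped into the original function, that suggestion gives a complete version with no sentinel initialization (the walrus operator needs Python 3.8+). A quick round-trip check on a size that is deliberately not a multiple of 1024:

```python
import os
import tempfile

def read_write(source, target):
    # := assigns and tests the chunk in one step; the loop ends
    # when read() returns the empty bytes object b""
    with open(source, mode='rb') as read_file, open(target, mode='wb') as write_file:
        while rBytes := read_file.read(1024):
            write_file.write(rBytes)

with tempfile.TemporaryDirectory() as d:
    src_path = os.path.join(d, 'src.bin')
    dst_path = os.path.join(d, 'dst.bin')
    data = os.urandom(3000)            # partial final chunk
    with open(src_path, 'wb') as f:
        f.write(data)
    read_write(src_path, dst_path)
    with open(dst_path, 'rb') as f:
        copied = f.read()

assert copied == data
```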
def read_write(source, target):
    with open(source, mode='rb') as read_file, open(target, mode='wb') as write_file:
        while 1:
            rBytes = read_file.read(1024)
            write_file.write(rBytes)
            if len(rBytes) < 1024:
                break
It’s already written the content out, so in the normal case, it should be fine. However, this can only be guaranteed for ordinary files, and shouldn’t be trusted for pipes.
This is a very good use for automatic boolification. You can use while rBytes or if not rBytes: break safely here, as there’s no way for it ever to mean the wrong thing; but with the comparison, you risk an infinite loop if ever you switch your files from text to binary or vice versa. Why would you introduce a completely unnecessary bug risk if simpler code is also safer?
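The boolification point can be checked directly: empty reads are falsy in both binary and text mode, so `while rBytes:` terminates at EOF either way, with no mode-specific sentinel to get wrong.

```python
# An exhausted file returns b"" in binary mode and "" in text mode;
# both are falsy, so truth-testing the chunk works in either mode.
assert bool(b"") is False
assert bool("") is False
# A real chunk is always truthy, even if it contains only NUL bytes.
assert bool(b"\x00") is True
```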
As my system will be doing incremental backups, the only speed issue (such as it will be) will be the first time it is run or if a large file (define as you will) has been created or altered since the previous run.
On the other points here: as my function is working as I intended and nobody has said “Hey Rob; you’re going to run into issues with that!” I see no compelling reason to alter it, but I do, nonetheless, appreciate the feedback and the time you have all spent in posting such.
The read method will return up to the specified size, so it might return fewer bytes even if more are available. The only guarantee is that it’ll return 0 bytes at the end of the file but > 0 otherwise.
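A contrived reader (hypothetical, purely for illustration) makes the hazard concrete: if a stream hands back fewer bytes than requested while more remain, a `len(chunk) < size` test would stop the copy early, whereas testing for an empty chunk would not.

```python
import io

class TricklingReader(io.RawIOBase):
    """Raw stream that returns at most 3 bytes per read, like a slow pipe."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def read(self, size=-1):
        # Never return more than 3 bytes, even if more are buffered
        return self._buf.read(min(size, 3) if size >= 0 else 3)

r = TricklingReader(b"0123456789")
chunk = r.read(1024)        # asked for 1024, got only 3 -- not EOF!
assert chunk == b"012"
assert len(chunk) < 1024    # a length test would break the loop here, too early
```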
Kinda OT, but I'd think you would want slightly smaller than your disk cache – which you probably don't know.
In practice, when I've performance-tested this kind of thing, there is very little difference between anything tiny and huge – try it, but tiny is less than O(1k) or so, and huge is greater than O(1GB) (depending on your hardware, of course).
I believe that Python does its own internal caching of reads/writes, so even when you read a byte at a time, Python is actually reading a larger block from disk, and then feeding you one byte at a time from that, not hitting the disk over and over again for one byte. Likewise for writes.
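That buffering layer can be observed directly (a sketch assuming CPython's defaults): a normal binary `open()` wraps the raw file in a `BufferedReader` that fetches `io.DEFAULT_BUFFER_SIZE` bytes from the OS at a time, and `buffering=0` strips that layer off.

```python
import io
import os
import tempfile

buf_size = io.DEFAULT_BUFFER_SIZE      # 8192 on the CPython builds I've seen

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'probe.bin')
    with open(path, 'wb') as f:
        f.write(b'x')
    with open(path, 'rb') as buffered:
        assert isinstance(buffered, io.BufferedReader)  # buffering layer present
    with open(path, 'rb', buffering=0) as raw:
        assert isinstance(raw, io.FileIO)               # no Python-level buffer
```

So even `read(1)` in a loop only hits the OS once per `buf_size` bytes.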
And then the OS also does caching, and so does the drive itself, so with two or three distinct layers of indirection there is no longer any obvious correlation between how much data you read or write in Python and when / how much data hits the spinning metal. (If there even is spinning metal.)
The results can depend on your drive, e.g. many hard drives will lie to the OS and claim to have finished writing data to disk when it is still cached in volatile memory. So you think the data is written but it can take seconds more before it actually is written permanently.
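For what it's worth, Python lets you push data down through the first two of those layers explicitly (a sketch; whether the drive's own cache then honours it is exactly the caveat above):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'backup.bin')
    with open(path, 'wb') as f:
        f.write(b'precious backup data')
        f.flush()               # Python's buffer -> OS cache
        os.fsync(f.fileno())    # OS cache -> device, as far as the OS knows
    with open(path, 'rb') as f:
        restored = f.read()

assert restored == b'precious backup data'
```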
That was a problem a long time ago. But after Microsoft refused to give WHQL certification to disks that lied, that stopped, or so I believe.
Do you have recent evidence of this problem coming back?
Hmmm, well we agree that it was a problem, and I hope we agree that not all of those hard drives from “a long time ago” (ten years?) will have been retired, so I suppose we agree that I was technically correct (the best sort of correct!) that some drives will lie to the OS.
I’m not aware of the problem “coming back”, but neither am I aware of the problem “going away”.