Disk space used by a file

I was looking for a way to estimate the disk space in use by a file, i.e. a Python version of the du command. Naively, I thought that shutil.disk_usage would give me that, since its docstring says: Return disk usage statistics about the given path. But it turns out that it is instead a Python equivalent of the df command: it shows information about the file system on which the path resides. So, I have two questions:

  • Is there a Python equivalent of the du command? I am happy for now with calling du as an external command, but a Python variant might be useful for repeated use.
  • Can the docstring of shutil.disk_usage be improved to make clear that it only shows results about the whole file system, not about the individual path?

If available on your system, os.stat_result.st_blocks may be of help.
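For example, here is a minimal sketch of using st_blocks this way. It assumes st_blocks is in 512-byte units, which is true on Linux but not guaranteed by POSIX (more on that below):

```python
import os
import tempfile

def allocated_bytes(path):
    # st_blocks is usually counted in 512-byte units (true on Linux,
    # but not guaranteed by POSIX)
    return os.stat(path).st_blocks * 512

# demonstration: even a tiny file occupies at least one whole block
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

print(allocated_bytes(path))
os.unlink(path)
```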

In a pinch, you could try rounding st_size up to the next multiple of the filesystem block size. Of course, that won’t be correct for sparse files and the like.
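A sketch of that rounding approach, with a hypothetical default of 4096 bytes (a common but by no means universal filesystem block size):

```python
import os

def apparent_disk_usage(path, block_size=4096):
    # round st_size up to the next multiple of block_size
    # using ceiling division; this overestimates for sparse
    # or compressed files and ignores filesystem metadata
    size = os.stat(path).st_size
    return -(-size // block_size) * block_size
```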

From a cursory look at how du is actually implemented, it essentially just computes st_blocks * 512 when st_blocks is available. You can read through the full details below and try to match the logic for your system.

Thanks for the suggestions. I may dig into it when I have more time. For now, I will stick with running du in a subprocess, using the following function:

import os
import subprocess

def disk_usage(paths):
    """Map each path to its disk usage in bytes, as reported by `du`."""
    cmd = ['du', '-B1', '-s', '-D'] + [os.path.expanduser(p) for p in paths]
    P = subprocess.run(cmd, capture_output=True, encoding='utf-8')
    out = {}
    for line in P.stdout.split('\n'):
        s = line.split('\t')
        if len(s) == 2:  # du output lines look like "<bytes>\t<path>"
            out[s[1]] = int(s[0])
    return out

But don’t physical on-disk block sizes vary over time as drives get bigger? I remember when an HDD block size was 32K; now it might be 512K or more.

How would one find the block size for an individual drive so this routine would work for 30 years?

Are block sizes managed differently for HDD and SDD?

I was looking for a way to estimate the disk space in use by a file,

Somehow on my Windows system I have du from https://www.sysinternals.com. Hm, but that reports directory sizes only, not the on-disk size of a single file.

Or maybe take a look at the GNU coreutils du source code. Start somewhere around here: Coreutils - GNU core utilities

In these cases, there should be a separate st_blksize attribute to tell you the block size.

I have to say, it’s quite strange not to see an obvious, high-level interface for this in the standard library.

To be clear, st_blocks isn’t measured in units of the filesystem block size or in units of st_blksize (all three are uncorrelated).

The units of st_blocks aren’t POSIX-standardized, but they are most often 512-byte blocks, often enough that the Python docs simply document it as “Number of 512-byte blocks allocated for file.”

The actual filesystem blocksize is found in statvfs.f_bsize, which has the Python analog os.statvfs("...").f_bsize.
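To illustrate, a quick sketch querying both values on the current directory’s filesystem (f_frsize, the fundamental allocation unit, is also worth printing; on Linux it is often equal to f_bsize):

```python
import os

vfs = os.statvfs(".")
# f_bsize: preferred I/O block size for this filesystem
# f_frsize: fundamental allocation unit ("fragment" size)
print("f_bsize:", vfs.f_bsize)
print("f_frsize:", vfs.f_frsize)
```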

Per sys/stat.h:

  • st_blksize - A file system-specific preferred I/O block size for this object. In some file system types, this may vary from file to file.
  • st_blocks - Number of blocks allocated for this object.

[…]

The unit for the st_blocks member of the stat structure is not defined within IEEE Std 1003.1-2001. In some implementations it is 512 bytes. It may differ on a file system basis. There is no correlation between values of the st_blocks and st_blksize, and the f_bsize (from <sys/statvfs.h>) structure members.

The value that st_blocks reports has no direct relation to the underlying filesystem block size. Per the coreutils source code linked in my post above, du assumes that st_blocks is in units of 512 bytes unless you redefine it to some other value with a macro at compile time.
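Under that same 512-byte assumption, a rough pure-Python analog of `du -s -B1` can be sketched like this. Note it is only a sketch: unlike real du, it counts hard-linked files once per link and does not handle permission errors:

```python
import os
import stat

def du_bytes(path):
    # assumes st_blocks is in 512-byte units, as coreutils du
    # does by default; hard links are counted more than once
    st = os.lstat(path)
    total = st.st_blocks * 512
    if stat.S_ISDIR(st.st_mode):
        with os.scandir(path) as it:
            for entry in it:
                total += du_bytes(entry.path)
    return total
```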
