Add a basic progressbar implementation to `shutil`?

ncoghlan · June 7, 2024, 7:36am

I recently wanted a progressbar implementation that I could hook into the filtering callback on tar.add so there was some evidence of progress being made when archiving a directory tree with lots of files.

If you care enough about progress bars to want to add a dependency for them, there are a lot of options for them referenced from Python Progress Bar - Stack Overflow

There are also several short, self-contained progress bar suggestions on that SO question, but I found that most of them shared a common problem: they had a tendency to generate massive log spam if piped to something other than an interactive terminal (no terminal → \r does nothing → every output request gets written to the stream rather than overwriting).

I contributed my own answer that curbs some of the worst potential offences (skip rewriting the output when the text hasn’t changed at all, skip rewriting non-interactive output when only the percentage has changed), but the sheer number of answers to that question prompted me to wonder: is this a case where we can come up with a “good enough” option for shutil that will cover the cases where folks would like to add a progress bar to a CLI application, but don’t want to add a dependency for it?

(although, adding such a function to shutil would make more sense if there was a straightforward way to hook it into shutil.make_archive. At the moment there isn’t a simple way to do that - for the use case where this came up, I already had other reasons to derive my own functions based on shutil._make_tarball and shutil._make_zipfile rather than using shutil.make_archive directly).

Monarch · June 7, 2024, 9:19am

+1 on this. I’ve wanted a simple progress bar on various occasions but having to pull heavy dependencies like Rich or tqdm is a bit of a deterrent especially when I’m writing a script instead of a project.

abessman · June 7, 2024, 9:45am

+1, though I wonder if shutil is the right place for it? If I, with no prior knowledge, were to go looking through the stdlib for a progressbar, the first place I’d look would probably be argparse.

The argparse module makes it easy to write user-friendly command-line interfaces. The program defines what arguments it requires, and argparse will figure out how to parse those out of sys.argv. The argparse module also automatically generates help and usage messages.

Seems to me like a progressbar could fit under this description.

encukou · June 7, 2024, 9:53am

Progress bars are a rabbit hole of complexity; it’ll be difficult to find a clear, stdlib-sized scope for this feature.
Simple progress bar implementations have limitations that only make sense if you know the implementation (i.e. you know it’s \r). For example: What happens if you print() while a progress bar is active, or get an exception traceback? What happens if you nest progress bars?

The log spam issue you mention is similar to terminal colours: there you also don’t want the cruft (escape sequences) to go in logs, and you want options to force/disable the feature in case terminal detection fails. Alas, CPython’s _colorize.py is private and unstable.

Maybe the right API to expose is only a function that determines whether you should send fancy terminal codes to stdout?
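
A minimal sketch of what such a function might look like, assuming it only needs to combine an isatty() check with a way to force or disable the feature. The function name is hypothetical, and borrowing the NO_COLOR / FORCE_COLOR environment variable conventions as override switches is an assumption (those conventions are about colour specifically, not terminal codes in general):

```python
import io
import os
import sys

def stream_supports_fancy_output(stream=None, environ=None):
    """Guess whether carriage returns and escape codes are safe to emit.

    Hypothetical helper: the name and the precedence rules are assumptions.
    """
    stream = sys.stdout if stream is None else stream
    environ = os.environ if environ is None else environ
    # Explicit overrides come first, in case terminal detection gets it wrong
    if environ.get("NO_COLOR"):
        return False
    if environ.get("FORCE_COLOR"):
        return True
    if environ.get("TERM") == "dumb":
        return False
    # Fall back to asking the stream itself (objects without isatty() count as plain)
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())
```

Passing the stream and environment in as arguments keeps the check testable and lets callers ask about stderr or a log file rather than only stdout.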

storchaka · June 7, 2024, 11:23am

I agree with @encukou that this hole is deeper than it looks. I wrote multiple progress bar implementations for my code, and used several third-party implementations from CLI and TUI libraries. There are design questions:

What to output besides the bar? Absolute value (it requires special formatting if it is the size in bytes)? Percentage of completion? Passed time, estimated total and/or remaining time? And in what format?
If it outputs the currently processed file etc., what to do if the name is too long?
What to do if steps are too fast? Redrawing the progress bar for every step can add significant cost.
What if the end value is not known, or changes while the work is in progress?
Should it use some blinking/rotating indicator to show progress even if it is so slow that you doubt whether it moves at all?
How to use fancy symbols? For example “▁▂▃▄▅▆▇█” or even just “-|/”.
What if you need to show several progress bars simultaneously? For example the progress for a file and for the whole archive. Or the progress in bytes and in the number of files.
What if the number of simultaneous tasks is variable? Display progress bars for all of them, or only for a few, and how to choose these few?
How to show the progress for hierarchical data? For example show the current directory and progress for files in it? And what about nested directories?

It deserves a separate module, and it will not cover all use cases.
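
To illustrate just one item from the list (steps that arrive faster than it is worth redrawing), here is a rough sketch of a time-based throttle. The class name, the default interval and the injectable clock are all assumptions made for the example, not taken from any existing library:

```python
import time

class RedrawThrottle:
    """Allow at most one redraw per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval=0.1, clock=time.monotonic):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_redraw = None

    def should_redraw(self, *, force=False):
        """Return True if enough time has passed since the last redraw."""
        now = self._clock()
        if (force or self._last_redraw is None
                or now - self._last_redraw >= self.min_interval):
            self._last_redraw = now
            return True
        return False
```

A caller would check should_redraw() before each repaint and pass force=True for the final update, so the finished state is always shown.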

ncoghlan · June 7, 2024, 11:37am

We pretty much have that already in sys.stdout.isatty(). That’s what I use in my SO answer to detect interactive vs non-interactive output.

I’ll drop the class I defined for that answer inline, as it sits at the level of complexity where I think the stdlib could reasonably play:

import sys

class ProgressBar:
    """Display & update a progress bar"""
    TEXT_ABORTING = "Aborting..."
    TEXT_COMPLETE = "Complete!"
    TEXT_PROGRESS = "Progress"

    def __init__(self, bar_length=25, stream=sys.stdout):
        self.bar_length = bar_length
        self.stream = stream
        self._last_displayed_text = None
        self._last_displayed_summary = None

    def reset(self):
        """Forget any previously displayed text (affects subsequent call to show())"""
        self._last_displayed_text = None
        self._last_displayed_summary = None

    def _format_progress(self, progress, aborting):
        """Internal helper that also reports the number of completed increments and the displayed status"""
        bar_length = self.bar_length
        progress = float(progress)
        if progress >= 1:
            # Report task completion
            completed_increments = bar_length
            status = " " + self.TEXT_COMPLETE
            progress = 1.0
        else:
            # Truncate progress to ensure bar only fills when complete
            progress = max(progress, 0.0) # Clamp negative values to zero
            completed_increments = int(progress * bar_length)
            status = (" " + self.TEXT_ABORTING) if aborting else ""
        remaining_increments = bar_length - completed_increments
        bar_content = f"{'#'*completed_increments}{'-'*remaining_increments}"
        percentage = f"{progress*100:.2f}"
        progress_text = f"{self.TEXT_PROGRESS}: [{bar_content}] {percentage}%{status}"
        return progress_text, (completed_increments, status)

    def format_progress(self, progress, *, aborting=False):
        """Format progress bar, percentage, and status for given fractional progress"""
        return self._format_progress(progress, aborting)[0]

    def show(self, progress, *, aborting=False):
        """Display the current progress on the console"""
        progress_text, progress_summary = self._format_progress(progress, aborting)
        if progress_text == self._last_displayed_text:
            # No change to display output, so skip writing anything
            # (this reduces overhead on both interactive and non-interactive streams)
            return
        interactive = self.stream.isatty()
        if not interactive and progress_summary == self._last_displayed_summary:
            # For non-interactive streams, skip output if only the percentage has changed
            # (this avoids flooding the output on non-interactive streams that ignore '\r')
            return
        if not interactive or aborting or progress >= 1:
            # Final or non-interactive output, so advance to next line
            line_end = "\n"
        else:
            # Interactive progress output, so try to return to start of current line
            line_end = "\r"
        # Write to the configured stream (not unconditionally to sys.stdout)
        self.stream.write(progress_text + line_end)
        self.stream.flush() # Ensure text is emitted regardless of stream buffering
        self._last_displayed_text = progress_text
        self._last_displayed_summary = progress_summary

Key features:

bar length and output stream can be set per instance.
display elements (such as the prefix and the status text messages) can be customised via subclassing (and you could easily add PERCENTAGE_PRECISION, BAR_SYMBOL_COMPLETE and BAR_SYMBOL_INCOMPLETE to the set of elements customisable that way)
only updates interactive streams if output changes (repeatedly writing same output is a common SO answer bug that is slow on any stream, and super spammy for non-interactive ones)
only updates non-interactive streams if the progress bar state or status text change (limits number of messages to the number of progress notches when non-interactive, rather than potentially emitting 1000 messages if every possible percentage value is emitted)
seeks to the beginning after each line, so other output will appear over the top of the progress bar rather than starting off to the right of screen
if other output is emitted between calls to show(), then the progress bar will simply emit a new output line once the progress status changes
the format_progress method allows the formatting behaviour to be used without tying it directly to the IO operation

One thing this class doesn’t do, but a stdlib version should is to make stream either a read-only property, or else a read/write property that resets the output history when modified.
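
A rough sketch of the read/write property variant, reduced to just the stream handling so it stays short. The class name is hypothetical; only the two history attributes are carried over from the class above:

```python
import io
import sys

class StreamHistory:
    """Only the stream-handling part of a progress bar (hypothetical sketch)."""

    def __init__(self, stream=None):
        self._stream = sys.stdout if stream is None else stream
        self.reset()

    def reset(self):
        """Forget any previously displayed text."""
        self._last_displayed_text = None
        self._last_displayed_summary = None

    @property
    def stream(self):
        return self._stream

    @stream.setter
    def stream(self, new_stream):
        # Output history describes the old stream, so it is stale after a switch
        self._stream = new_stream
        self.reset()
```

Dropping the setter would give the read-only variant instead.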

For anything beyond the level of what this class offers, I’d put a few recipes in the standard library, and then mention the availability of third party progressbar implementations, either in libraries dedicated to that purpose, or in more general command line utility libraries.

The recipes I’d suggest including would be:

using a subclass to customise the output
calling bar.show(progress) from a callback function in an API like shutil.copytree (see below)
using it to make an iterator wrapper that updates the progress bar (see below)

Example of integrating with a callback API:

import os
from shutil import copy2, copytree, ProgressBar

def count_files(src, ignore=None):
    """Recursively count files in a tree (respecting a `copytree` `ignore` filter)"""
    total_files = 0
    for this_dir, dirnames, filenames in os.walk(src):
        if ignore is None:
            total_files += len(filenames)
            continue
        ignored_names = ignore(this_dir, dirnames + filenames)
        # Don't count ignored files
        total_files += sum(1 for name in filenames if name not in ignored_names)
        # Don't iterate over ignored directories
        dirnames[:] = [name for name in dirnames if name not in ignored_names]
    return total_files

def copytree_with_progress(src, dst,
                           symlinks=False, ignore=None,
                           copy_function=copy2,
                           ignore_dangling_symlinks=False,
                           dirs_exist_ok=False):
    """Display a console progress bar while copytree is running"""
    progress_bar = ProgressBar()
    total_files_to_copy = count_files(src, ignore)
    files_copied = 0  # Updated by the nested callback via nonlocal
    def copy_with_progress_update(src, dst, *, follow_symlinks=True):
        nonlocal files_copied
        result = copy_function(src, dst, follow_symlinks=follow_symlinks)
        files_copied += 1
        progress_bar.show(files_copied / total_files_to_copy)
        return result
    progress_bar.show(0.0)
    copytree(src, dst, symlinks, ignore, copy_with_progress_update,
             ignore_dangling_symlinks, dirs_exist_ok)
    progress_bar.show(1.0)

Wrapping an iterator or iterable:

from shutil import ProgressBar

def iter_with_progress(iterable, *, max_iterations=None):
    """Display a progress bar while iterating over an iterable"""
    if max_iterations is None:
        # Iterable must define __len__ if max_iterations is not given
        max_iterations = len(iterable)
    progress_bar = ProgressBar()
    progress_bar.show(0.0)
    items_processed = 0
    for item in iterable:
        yield item
        items_processed += 1
        progress_bar.show(items_processed / max_iterations)
        if items_processed >= max_iterations:
            break # Terminate now even if the underlying iterator isn't complete

# Passing in a sequence means the maximum progress value is determined automatically
from pathlib import Path
all_files = list(path for path in Path(input_dir).rglob("*") if path.is_file())
processed_files = [process_file(fpath) for fpath in iter_with_progress(all_files)]

# Alternatively, the maximum progress value can be passed in explicitly
num_files = sum(1 for path in Path(input_dir).rglob("*") if path.is_file())
iter_files = (path for path in Path(input_dir).rglob("*") if path.is_file())
processed_files = [process_file(fpath) for fpath in iter_with_progress(iter_files, max_iterations=num_files)]

As far as “Why in shutil?” goes, part of my motivation is that shutil has some example use cases for a progress bar (copytree now, and maybe someday make_archive), and the rest is that it and argparse are the only real contenders, and argparse has never really been a general purpose CLI utility library, while shutil already has get_terminal_size().

There would then be several things I would declare as explicitly out of scope and leave them to third party libraries (yes, this is inspired directly by Serhiy’s list of the more complex cases that arise):

displaying multiple concurrent progress bars
displaying hierarchical progress rather than linear progress
displaying anything other than a simple completion percentage against a fixed target
displaying more complex output than a fixed length bar that changes from one text character to another

For all those use cases, the recommendation should be to reach for a specialised third party library, since they need more advanced console manipulation tools than a simple \r character in the output stream. (edit: according to tqdm’s readme, I’m wrong about that requirement. It still needs more complex code than this to achieve it, though)

The one other utility that I think might reasonably live in the stdlib is a basic rotating -\|/ activity indicator with some text after it that can be used when you don’t have any useful way to estimate progress, but do want to indicate that the process is still doing something when running at an interactive console.
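
As a rough idea of the scale involved, a sketch of such an indicator might look like the following. The class name and the choice to emit the text only once on non-interactive streams are assumptions for illustration:

```python
import io
import itertools
import sys

class ActivityIndicator:
    """Rotating activity indicator followed by some text (hypothetical sketch)."""
    FRAMES = "-\\|/"

    def __init__(self, text="Working...", stream=None):
        self.text = text
        self.stream = sys.stdout if stream is None else stream
        self._frames = itertools.cycle(self.FRAMES)
        self._announced = False

    def tick(self):
        """Advance the indicator by one frame."""
        if self.stream.isatty():
            # Interactive: repaint the same line with the next frame
            self.stream.write(f"{next(self._frames)} {self.text}\r")
        elif not self._announced:
            # Non-interactive: report the activity once, then stay quiet
            self.stream.write(self.text + "\n")
        else:
            return
        self._announced = True
        self.stream.flush()
```

Calling tick() from the same kind of per-item callback used for the progress bar recipes would be enough to drive it.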

fungi · June 7, 2024, 12:51pm

What happens if you print() while a progress bar is active, or
get an exception traceback?

Periodic checkpoints are a workaround I’ve found useful. Some
software I maintain includes a very simple progress bar which does
no redisplaying (just flushing the buffer after each “.” character
or percentage value), uses no non-ASCII characters, and fits into an
80-column terminal display with 2% quantization/granularity. The
full bar looks like this:

0%....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%

Yes it’s not pretty, but at least if stdout (or whatever fd you’re
using) gets interrupted and scrolls off the display you’ll still
know what the progress is again within a few updates. That approach
needs no TTY checking and doesn’t spam logs (it gets echoed into
some distro package build logs for example and comes out looking
fine).

I agree though, every use case is different and there is no OSFA
solution to this general category of terminal widget.
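
For anyone curious, an append-only bar of that shape can be sketched in a few lines. This is a reconstruction from the description above rather than the actual code in the software mentioned, and the class name is made up:

```python
import io
import sys

class DottedProgress:
    """Append-only progress display in 2% steps (sketch, not the original code)."""
    STEPS = 50  # 100% divided into 2% steps

    def __init__(self, stream=None):
        self.stream = sys.stdout if stream is None else stream
        self._steps_shown = -1  # -1 means the leading "0%" is still pending

    def update(self, fraction):
        """Emit any markers between the last shown step and `fraction`."""
        fraction = max(0.0, min(1.0, float(fraction)))
        target = int(fraction * self.STEPS)
        while self._steps_shown < target:
            self._steps_shown += 1
            percent = self._steps_shown * 100 // self.STEPS
            # Label every 10%, otherwise emit a single dot
            self.stream.write(f"{percent}%" if percent % 10 == 0 else ".")
            if self._steps_shown == self.STEPS:
                self.stream.write("\n")
        self.stream.flush()
```

Because nothing is ever overwritten, repeated or out-of-order calls write nothing new, and the output reads the same in a log file as on a terminal.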

kknechtel · June 7, 2024, 5:09pm

Given this, I don’t really understand the original motivation:

Why not just check for a TTY and then conditionally just… not use the suggested implementation?

As regards the idea: I agree with others here that there are way too many possible customization options for a progress bar to allow a reasonably simple interface. It’s almost a mini application in itself, rather than a library.

What I’d rather see is a higher-level wrapper over curses that tries to ensure cross-platform portability (by choosing a backend - curses, msvcrt, etc.), uses friendlier names, tries to make the abstractions easier to understand etc. Basically the same sort of treatment that Requests offers for the urllib family. But maybe this is already on PyPI.

If some kind of progress bar functionality were added to the standard library, I would definitely want it to be in a new module or package. The suggestion of shutil in particular makes no sense to me. My first thought was, “what’s so special about moving or copying files that you’d want a progress bar for those, as opposed to anything else the standard library can do that takes a long time?”. shutil contains shell utilities, not terminal utilities. That’s very different. Similarly, while argparse’s documentation might describe it as being used for creating CLIs, it clearly is specifically about parsing the command-line arguments for the program.

davidism · June 7, 2024, 5:14pm

Click has a progress bar implementation. It attracts constant bug reports and feature requests, as there are a myriad of different ways people might expect it to act in different situations. Often the different ways are contradictory, or the current behavior isn’t wrong, just different from what was expected. I just don’t have the expertise to manage all of that and the quirks of terminal output modes, on top of maintaining Click’s core functionality. Now I say no to further changes and just point people at tqdm or rich if they need a progress bar, both of which are focused on making progress bars very very nice. Speaking as a maintainer, I would not recommend trying to implement yet another progress bar, it will cause way too much noise compared to any benefit.

ncoghlan · June 7, 2024, 6:14pm

Yeah, I thought there might be a sweet spot we could hit, but I now suspect this falls into the same category as staged (aka atomic) file writes:

useful enough often enough to be worthy of space in the standard library
hard enough to get right that it would be really nice to provide as an included battery to help lower the barriers to entry for writing more correct or user friendly software
almost certainly an absolute nightmare to maintain as users (or potential users) make incorrect assumptions about how it works and the boundaries of what it is intended to offer, so it just isn’t worth inviting the extra hassle

barry-scott · June 7, 2024, 6:55pm

A progress bar traditionally needs to know what 100% value is so that it
can mark how far the current value is from the end.

How does the tarfile, zipfile etc expose in their API what 100% looks like?
Is that 100% of the files or 100% of the size of the files?

How many more worms can I see wriggling inside the can we just opened?

eryksun · June 7, 2024, 10:29pm

On Windows, the isatty() method is true for any character device, not necessarily a console/terminal. Most commonly the false positive will be for the “\\.\nul” device. But even a console file may not support “fancy terminal codes”, which I assume means virtual terminal codes. _colorize.can_colorize() checks for the latter via nt._supports_virtual_terminal(), though the current implementation is not generally useful, or even correct [1].

[1] The implementation of nt._supports_virtual_terminal() assumes that WinAPI STD_ERROR_HANDLE corresponds to sys.stderr, but they can get out of sync at a low level in C or at a high level in Python. More generally, it lacks support for checking a particular file descriptor, such as that of sys.stdout or any other duped or opened file descriptor – e.g. from opening “\\.\conout$”, or maybe a new screen buffer file created via WinAPI CreateConsoleScreenBuffer(). It should be implemented to take a file descriptor argument, e.g. nt._supports_virtual_terminal(sys.stderr.fileno()), from which the implementation gets the OS handle via _Py_get_osfhandle_noraise(fd).

cameron · June 7, 2024, 11:53pm

I recently wanted a progressbar implementation that I could hook into
the filtering callback on tar.add so there was some evidence of
progress being made when archiving a directory tree with lots of files.

Yeah, everyone wants one of these. I’ve got an untar script which
runs a nice progressbar for similar reasons.

If you care enough about progress bars to want to add a dependency for them, there are a lot of options for them referenced from Python Progress Bar - Stack Overflow

Aye, and I’ve got my own as cs.progress.

The difficulty is that there are, as pointed out, many issues. The
biggest one is how you interleave other output with the display.

My own approach has three pieces:

a Progress class which tracks progress, with optional total/expected value
a progressbar iterable wrapper to wrap something, handy for for-loops, making a Progress and presenting a display
an underlying cs.upd package which does multiline status displays, and the progress bar uses one

The interleave-output stuff does this with three main pieces:

cs.upd autodisables its display if the target output (default sys.stderr) is not a tty
cs.upd provides a print() drop-in to withdraw the display, do a
print(), restore the display
similar wrapping for logging to the tty
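
The withdraw/print/restore idea can be sketched for a single status line as below. This is not the cs.upd API; the class and method names are invented for the example, and it assumes the status text fits on one terminal line:

```python
import io
import sys

class StatusLine:
    """Single-line status display with a print() stand-in (hypothetical sketch)."""

    def __init__(self, stream=None):
        self.stream = sys.stderr if stream is None else stream
        self.enabled = self.stream.isatty()  # auto-disable when not a tty
        self._text = ""

    def _erase(self):
        # Overwrite the old text with spaces, then return to column zero
        self.stream.write("\r" + " " * len(self._text) + "\r")

    def show(self, text):
        """Replace the status line with new text."""
        if not self.enabled:
            return
        self._erase()
        self._text = text
        self.stream.write(text)
        self.stream.flush()

    def print(self, *args, **kwargs):
        """Withdraw the status line, print normally, then restore it."""
        kwargs.setdefault("file", self.stream)
        if self.enabled:
            self._erase()
        print(*args, **kwargs)
        if self.enabled:
            self.stream.write(self._text)
            self.stream.flush()
```

Routing logging through the same erase/restore pair would follow the same pattern.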

Oh, Karl: curses is a terrible way to do this - it takes over the whole
display. I do use it to look up the terminal capability strings though.

cameron · June 7, 2024, 11:55pm

My own progress bars have optional totals. It omits the bar itself with no total, but still reports throughput.
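
As an illustration only (this is not how cs.progress formats things, and the function name, parameters and output layout are made up), an optional total could be handled like this:

```python
def format_progress(done, total=None, *, elapsed=0.0, bar_length=20, unit="B"):
    """Format a status line; the bar is omitted when `total` is unknown."""
    parts = []
    if total:
        fraction = min(max(done / total, 0.0), 1.0)
        filled = int(fraction * bar_length)
        parts.append(f"[{'#' * filled}{'-' * (bar_length - filled)}]")
        parts.append(f"{done}/{total}{unit}")
    else:
        # No total: nothing to draw a bar against, so only show the count
        parts.append(f"{done}{unit}")
    if elapsed > 0:
        # Throughput is available with or without a total
        parts.append(f"{done / elapsed:.1f}{unit}/s")
    return " ".join(parts)
```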

ncoghlan · June 9, 2024, 1:05am

There’s a reason my copytree example included a count_files function that accepts a copytree style ignore filter.

For archiving, cumulative size is technically a better indicator than a simple file count, but a file count will be good enough when the goal is just to show some progress being made over time. Either way, you have to scan the tree beforehand to calculate the expected maximum rather than the archiving APIs offering a way to get the info.

The progress bar API itself avoids the problem by working solely with fractional progress (accepting a float and clamping it to a value between 0.0 and 1.0 inclusive).

(Edit: I decided an improved helper function for getting file counts and sizes out of a tree while respecting a shutil.copytree style ignore filter was worth posting to one of the assorted Stack Overflow questions about this topic: python - Return total number of files in directory and subdirectories - Stack Overflow)
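
A pre-scan along those lines, collecting both the file count and the cumulative size, could be sketched as follows. This is a minimal version written for this thread, not the helper from the Stack Overflow answer, and the function name is an assumption:

```python
import os

def scan_tree(src, ignore=None):
    """Return (file_count, total_bytes) for a tree, honouring a copytree-style ignore."""
    file_count = 0
    total_bytes = 0
    for this_dir, dirnames, filenames in os.walk(src):
        if ignore is not None:
            ignored = set(ignore(this_dir, dirnames + filenames))
            filenames = [name for name in filenames if name not in ignored]
            # Prune ignored directories in place so os.walk skips them
            dirnames[:] = [name for name in dirnames if name not in ignored]
        for name in filenames:
            file_count += 1
            # lstat so that symlinks are measured rather than followed
            total_bytes += os.lstat(os.path.join(this_dir, name)).st_size
    return file_count, total_bytes
```

Either number can then serve as the denominator when computing the fraction passed to the progress bar.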

adamsilkey · June 11, 2024, 2:23pm

I think this is right on the money.

As an alternative to including something in the Standard Library, maybe there’s a documentation play somewhere? I think there are a lot of common recipes that do not make sense to be part of the Standard Library but would be helpful if made accessible for beginners.

I’m not even sure where that would go inside the documentation space, but I think it’s a worthy thing to consider and possibly discuss.

ncoghlan · June 14, 2024, 5:55am

I considered that myself, but honestly, Stack Overflow is probably a better platform for knowledge sharing in this particular case. Folks that are happy with a dependency will be guided towards tqdm by the top answer (a very reasonable approach), while folks that want to avoid a dependency will find some reasonable options to use as a starting point if they keep reading further down the list.

Examples and specific HOWTO guides in the core Python docs are useful when there are standard lib components specifically related to the topic, but progress bars are mostly a matter of using carriage returns (i.e. \r) effectively to keep repainting the same line in an interactive line-oriented display (and checking stream.isatty() to make a reasonable guess as to when the output is non-interactive if you want to avoid flooding log files with incremental progress reports).

rrolls · June 17, 2024, 4:14pm

Personally, I don’t think shutil (or any other builtin Python thing that “does work”) should have a console-oriented progress bar implementation built in to it. I think that’s too specific.

Instead, if Python is going to add progress reporting at all, what I think would be a good/useful idea is to add an optional kwarg on functions that “might do a lot of work” (copytree, rmtree etc.), which would take a callback function.

The callback function could be written like this (most likely not using a global or a direct call to print, but this should illustrate the point):

tenths_reported = 0
def the_callback(work_done: int, work_total: int) -> None:
  global tenths_reported
  tenths = work_done * 10 // work_total
  if tenths_reported < tenths:
    tenths_reported = tenths
    print(str(tenths * 10) + '% done')

By supplying the callback (as opposed to not supplying it), you would be requesting not only that it is called every time some amount of work is done, but also that the total amount of work to be done should be calculated before starting - sometimes this calculation itself can take a long time, so it’s important to give the caller the option whether to do that for the sake of progress or skip it and not be able to report progress (but possibly be faster overall).

Then, when someone wants to see a “pretty” progress bar in their console application, their “pretty progress bar implementation” of choice can provide a callback, and they can supply that directly to the work-doing function. Python could have a new stdlib module added to provide such a “pretty progress bar implementation” built-in, or this could be left as a job for a third party package.

Meanwhile, people writing GUI apps, or server backends, or whatnot, that don’t have any sensible console to print output to - or people using a console that doesn’t support ANSI escape sequences, or Unicode, or whatnot - would be able to supply a different callback function suited to their needs. The writer of the callback function can also write their own time estimator or decide whether or not to print the absolute values as well as a percentage, etc etc. So it’s very flexible.

Note also:

I’ve written the callback as taking two separate int args - rather than a single float or a int in a predefined range (such as 0-100). That’s intentional, and I think it’s quite important: it means whoever writes the callback can decide what kind of granularity they want. “Work-doing functions” should not be deciding that the caller wants progress reported in tenths, or hundredths, or anything else; but they will all know how many units of work are done and how many there are in total (provided they calculate this in advance, of course).
The work-doing function could change the value of work_total in subsequent calls, if for some reason the initial calculated number turns out to be wrong.
The work-doing function should be required to ensure that work_total >= 1 and 0 <= work_done <= work_total hold on all calls, so that writers of callback functions don’t need to check for nonsensical values.
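
To make the shape of the proposal concrete, here is a sketch of a work-doing function that accepts such a callback. The function and its progress keyword are hypothetical (no such keyword exists on shutil's own functions), and it counts files as the unit of work purely to keep the example short:

```python
import os
import shutil

def copy_files(paths, dst_dir, *, progress=None):
    """Copy files into dst_dir, calling progress(work_done, work_total) per file.

    Hypothetical function: the `progress` keyword is an assumption.
    """
    paths = list(paths)
    work_total = None
    if progress is not None:
        # Only work out the total when the caller asked for progress reports
        work_total = max(1, len(paths))
    os.makedirs(dst_dir, exist_ok=True)
    for work_done, path in enumerate(paths, start=1):
        shutil.copy2(path, dst_dir)
        if progress is not None:
            progress(work_done, work_total)
```

The the_callback function shown earlier could be passed as progress unchanged, as could a callback that updates a GUI widget or writes to a log.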

However, even with all this said, it still leaves open the important question of… what gets counted as units of work? For example, in a copytree operation, should this be the number of files? Or number of bytes? Should there be another kwarg to determine which one is reported? What if the caller wants both - use two callbacks, or use more args in one callback? Other operations might have even more ambiguity over what is considered work.

So overall, from me (just a random Python user at the end of the day): a +1 if sensible solutions to these issues are found, a slight -1 to implementing this without considering these issues, and a big -1 to implementing any kind of simple show_progress=True kwarg on functions like copytree that would then only work on a console.

ncoghlan · June 18, 2024, 2:25am

These complications are the main reason a “unit of work” callback like the one on tarfile’s add method ends up being more convenient than a dedicated “progress” callback: the author of the callback is then free to decide how much progress each unit of work represents. If the callback is triggered immediately before doing work, it even becomes possible to alter the work to be done (or skip it entirely). The latter is actually what the tarball callback was designed to handle; it just works for progress reporting, too.
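
As a sketch of that pattern, the filter argument of TarFile.add can double as a progress hook. The helper names and the report callable are assumptions for the example, and counting one unit per archive entry (directories included) is just one possible choice:

```python
import os
import tarfile

def count_entries(root):
    """Count the entries tar.add() will visit: the root plus everything below it."""
    total = 1
    for _this_dir, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total

def make_tarball_with_progress(src_dir, archive_path, report):
    """Archive src_dir, calling report(fraction) once per archive entry."""
    total = count_entries(src_dir)
    seen = 0

    def track(tarinfo):
        nonlocal seen
        seen += 1
        report(seen / total)
        return tarinfo  # Returning None here would skip the entry instead

    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir), filter=track)
    return seen, total
```

Here report could be the show method of the ProgressBar class posted earlier, or any other callable that accepts a fraction between 0.0 and 1.0.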

Similarly, while there’s an argument in favour of adding helper functions that estimate the expected amount of work that a long running function will need to do, it’s OK if that is a separate operation that needs to be called explicitly.

Edit: on the question of adding more activity callbacks to APIs like shutil.make_archive, I’m currently ambivalent. For the use case where adding a progress bar came up, I ended up using the tarfile and zipfile functions in shutil as templates for my own custom archiving code. They’re not actually that long, and there comes a point in higher level API design complexity where giving someone a pointer to the code and saying “the license lets you copy and edit this implementation code, you’re not stuck with using the API it exposes” is a better idea than adding ever more API options. On the other hand, if we can come up with a reasonable shutil.ArchiveEntry abstraction (similar to TarInfo, but valid for zipfile entries as well), then a filter_entry callback on shutil.make_archive could be genuinely interesting. Further discussion of that idea should be posted as a new Ideas thread, though, since it’s potentially useful for more than just updating progress bars.