Remove old backups when the retain count is exceeded

Sorry, I’m using LLMs to achieve my goal.

I’m trying to remove old backups when the retain count is exceeded. None of the LLMs got it right in terms of the pytest and the expected output, except one (which used a chain of thought), and I still don’t know why. I’ve spent many hours trying to debug this, so I know I’m stuck and need help now.

Can someone please provide me with the basic code to remove old backups/files when the retain count is exceeded?

Your question is too vague for anyone to give a reasonable answer. I suspect this is also why the LLMs were not of use.

3 Likes

As I said, I spent many hours debugging it, so it really hurts to say that.

Also, one LLM got it right, at least the output part, though not the pytest, so…

Alternatively, I can use this ls -rt | head -n -3 | xargs rm -v command.

Advantage: less code.
Disadvantage: only works on Linux.

Do you know what that does? If you do, it’s not hard to duplicate it in Python.

List the files, sort them in reverse order by timestamp (modification time), and delete all but the last 3 (i.e. all but the 3 most recent ones).
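
For instance, a minimal sketch of the same thing in Python (assuming you run it in the backup directory and that it contains only regular files):

    # Rough equivalent of `ls -rt | head -n -3 | xargs rm -v`:
    # sort by modification time, keep the 3 newest, delete the rest.
    import os

    files = sorted(os.listdir("."), key=os.path.getmtime)  # oldest first
    for name in files[:-3]:  # everything except the 3 most recent
        os.remove(name)
        print(f"removed {name!r}")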

Yes, that’s what I’m trying to do, but somehow when sorting the list (using .sort() with a lambda as its key argument) things get weird and “don’t work”. And even if it does work, is it worth it? I’d have to maintain that code when a more professional solution is already available.

As you haven’t provided any code and haven’t explained exactly what’s going wrong, there’s not much anyone here can do to help.

2 Likes

That’s true, and I’m sorry. Can I just ask whether I’m on the right track by using the Linux command, if no library is available for my goal, instead of creating my own solution?

If you want me to share the code, even though it was made by an LLM, I’ll share it.

@MRAB

If you want help with the code (and it’s not too long), you need to show it.

Just remember that what an LLM produces can be convincing, but not necessarily correct.

1 Like

Both fail the generated unit tests. It seems that the name format (timestamp) of the backup files contributes to the problem.

GPT-4o
Code
# File: backup_file.py
import os
import shutil
from datetime import datetime


def backup_file(filepath, backup_dir, retain_count=3):
    """
    Create a timestamped backup of a file and retain only a specified number of the latest backups.
    
    Args:
        filepath (str): Path to the file to be backed up.
        backup_dir (str): Directory where backups will be stored.
        retain_count (int): Number of latest backups to retain (default is 3).
        
    Returns:
        str: Path to the created backup file.
        
    Raises:
        FileNotFoundError: If the file to be backed up does not exist.
        ValueError: If the retain_count is negative.
        OSError: If there are issues with file or directory permissions.
        
    Examples:
        >>> backup_file("example.py", "/backups", retain_count=2)
        '/backups/example_20231223_153045.py'
    """
    if not os.path.isfile(filepath):
        raise FileNotFoundError(f"The file {filepath} does not exist.")
    if not os.path.isdir(backup_dir):
        os.makedirs(backup_dir)
    if retain_count < 0:
        raise ValueError("retain_count must be a non-negative integer.")

    # Generate the backup file name
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = os.path.basename(filepath)
    backup_name = f"{os.path.splitext(filename)[0]}_{timestamp}{os.path.splitext(filename)[1]}"
    backup_path = os.path.join(backup_dir, backup_name)

    # Create the backup
    shutil.copy2(filepath, backup_path)

    # Manage retention of backups
    backups = sorted(
        [f for f in os.listdir(backup_dir) if f.startswith(os.path.splitext(filename)[0]) and f.endswith(os.path.splitext(filename)[1])],
        key=lambda f: os.path.getctime(os.path.join(backup_dir, f))
    )
    while len(backups) > retain_count:
        oldest_backup = backups.pop(0)
        os.remove(os.path.join(backup_dir, oldest_backup))

    return backup_path

Pytest
# File: test_backup_file.py
import os
import pytest
import tempfile
from backup_file import backup_file


@pytest.fixture
def sample_file():
    """Fixture to create a sample file for testing."""
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(b"Sample content for testing.")
        return temp_file.name


@pytest.fixture
def backup_directory():
    """Fixture to create a temporary backup directory."""
    backup_dir = tempfile.mkdtemp()
    yield backup_dir
    for file in os.listdir(backup_dir):
        os.remove(os.path.join(backup_dir, file))
    os.rmdir(backup_dir)


def test_backup_file_creates_backup(sample_file, backup_directory):
    """Test normal case of creating a backup."""
    backup_path = backup_file(sample_file, backup_directory)
    assert os.path.exists(backup_path), "Backup file should be created."


def test_backup_file_retains_specified_count(sample_file, backup_directory):
    """Test that only the specified number of backups are retained."""
    for _ in range(5):
        backup_file(sample_file, backup_directory, retain_count=3)
    backups = [f for f in os.listdir(backup_directory) if os.path.isfile(os.path.join(backup_directory, f))]
    assert len(backups) == 3, "Only 3 backups should be retained."


def test_backup_file_handles_non_existent_file(backup_directory):
    """Test that an error is raised if the file does not exist."""
    with pytest.raises(FileNotFoundError):
        backup_file("nonexistent.py", backup_directory)


def test_backup_file_creates_directory_if_missing(sample_file):
    """Test that the backup directory is created if it does not exist."""
    temp_dir = tempfile.mkdtemp()
    os.rmdir(temp_dir)
    backup_file(sample_file, temp_dir)
    assert os.path.isdir(temp_dir), "Backup directory should be created."


def test_backup_file_invalid_retain_count(sample_file, backup_directory):
    """Test that an error is raised if retain_count is negative."""
    with pytest.raises(ValueError, match="retain_count must be a non-negative integer."):
        backup_file(sample_file, backup_directory, retain_count=-1)


@pytest.mark.parametrize("retain_count", [0, 1])
def test_backup_file_retain_count_zero_or_one(sample_file, backup_directory, retain_count):
    """Test edge cases where retain_count is 0 or 1."""
    for _ in range(3):
        backup_file(sample_file, backup_directory, retain_count=retain_count)
    backups = [f for f in os.listdir(backup_directory) if os.path.isfile(os.path.join(backup_directory, f))]
    assert len(backups) == retain_count, f"Only {retain_count} backups should be retained."


Claude 3.5
Code
# File: file_backup.py
import os
import shutil
from datetime import datetime
import re
from pathlib import Path
from typing import List, Optional

def backup_file(filepath: str, retain_count: Optional[int] = None) -> str:
    """
    Create a backup of a file with timestamp and optionally retain only N latest backups.
    
    Args:
        filepath (str): Path to the file to backup
        retain_count (int, optional): Number of recent backups to keep (including new backup).
                                    If None, keeps all backups. Defaults to None.
    
    Returns:
        str: Path of the created backup file
        
    Raises:
        FileNotFoundError: If source file doesn't exist
        ValueError: If retain_count is less than 1
        
    Examples:
        >>> backup_file('example.py', retain_count=3)
        'example_20240223_143022.py'
    """
    # Validate inputs
    if not os.path.isfile(filepath):
        raise FileNotFoundError(f"Source file not found: {filepath}")
    if retain_count is not None and retain_count < 1:
        raise ValueError("retain_count must be at least 1")
    
    # Parse file path components
    path = Path(filepath)
    directory = str(path.parent) if str(path.parent) != '.' else ''
    basename = path.stem
    extension = path.suffix
    
    # Generate timestamp and backup filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_filename = f"{basename}_{timestamp}{extension}"
    backup_path = os.path.join(directory, backup_filename) if directory else backup_filename
    
    # Create the backup
    shutil.copy2(filepath, backup_path)
    
    # Handle backup retention if specified
    if retain_count is not None:
        cleanup_old_backups(filepath, retain_count)
        
    return backup_path

def cleanup_old_backups(filepath: str, retain_count: int) -> List[str]:
    """
    Remove old backups keeping only the N most recent ones.
    
    Args:
        filepath (str): Original file path to match backup pattern
        retain_count (int): Number of recent backups to keep
        
    Returns:
        List[str]: List of removed backup files
    """
    path = Path(filepath)
    directory = str(path.parent) if str(path.parent) != '.' else ''
    basename = path.stem
    extension = path.suffix
    
    # Pattern to match backup files
    pattern = f"{basename}_[0-9]{{8}}_[0-9]{{6}}{extension}"
    
    # List all matching backup files
    backup_files = []
    for f in os.listdir(directory) if directory else os.listdir('.'):
        if re.match(pattern, f):
            full_path = os.path.join(directory, f) if directory else f
            backup_files.append((full_path, os.path.getmtime(full_path)))
    
    # Sort by modification time and get files to remove
    backup_files.sort(key=lambda x: x[1])  # Sort by modification time
    files_to_remove = [f[0] for f in backup_files[:-retain_count]] if len(backup_files) > retain_count else []
    
    # Remove old backup files
    for file in files_to_remove:
        os.remove(file)
    
    return files_to_remove

Pytest
# File: test_file_backup.py
import pytest
import os
import re  # needed for the re.match calls in the tests below
from datetime import datetime
import time
from pathlib import Path
from file_backup import backup_file, cleanup_old_backups

@pytest.fixture
def temp_file(tmp_path):
    """Create a temporary file for testing."""
    file_path = tmp_path / "test.txt"
    with open(file_path, "w") as f:
        f.write("test content")
    return str(file_path)

@pytest.fixture
def temp_file_with_backups(temp_file):
    """Create multiple backups of a file for testing."""
    backups = []
    for _ in range(5):
        backup_path = backup_file(temp_file)
        backups.append(backup_path)
        time.sleep(0.1)  # Ensure unique timestamps
    return temp_file, backups

def test_backup_file_creates_backup(temp_file):
    backup_path = backup_file(temp_file)
    
    # Verify backup exists
    assert os.path.exists(backup_path), "Backup file should exist"
    
    # Verify timestamp format in filename
    filename = Path(backup_path).name
    assert re.match(r'test_\d{8}_\d{6}\.txt', filename), \
        "Backup filename should match expected format"
    
    # Verify backup content matches original
    with open(temp_file) as f1, open(backup_path) as f2:
        assert f1.read() == f2.read(), "Backup content should match original"

def test_backup_file_with_retention(temp_file_with_backups):
    original_file, existing_backups = temp_file_with_backups
    retain_count = 3
    
    # Create one more backup with retention
    backup_file(original_file, retain_count=retain_count)
    
    # Get current backups
    directory = str(Path(original_file).parent)
    current_backups = [f for f in os.listdir(directory) 
                      if re.match(r'test_\d{8}_\d{6}\.txt', f)]
    
    assert len(current_backups) == retain_count, \
        f"Should retain exactly {retain_count} backups"

def test_backup_file_nonexistent_file():
    with pytest.raises(FileNotFoundError) as exc_info:
        backup_file("nonexistent.txt")
    assert "Source file not found" in str(exc_info.value)

def test_backup_file_invalid_retain_count(temp_file):
    with pytest.raises(ValueError) as exc_info:
        backup_file(temp_file, retain_count=0)
    assert "retain_count must be at least 1" in str(exc_info.value)

@pytest.mark.parametrize("retain_count,expected_remaining", [
    (1, 1),
    (2, 2),
    (3, 3),
    (5, 5),
])
def test_cleanup_old_backups_various_counts(temp_file_with_backups, retain_count, expected_remaining):
    original_file, _ = temp_file_with_backups
    
    # Perform cleanup
    cleanup_old_backups(original_file, retain_count)
    
    # Count remaining backups
    directory = str(Path(original_file).parent)
    remaining_backups = [f for f in os.listdir(directory) 
                        if re.match(r'test_\d{8}_\d{6}\.txt', f)]
    
    assert len(remaining_backups) == expected_remaining, \
        f"Should have {expected_remaining} backups remaining"

Consider doing micro-testing to identify why the code isn’t working as intended. Debugging that section as part of a larger algorithm will be difficult.

This:

    backups = sorted(
        [f for f in os.listdir(backup_dir) if f.startswith(os.path.splitext(filename)[0]) and f.endswith(os.path.splitext(filename)[1])],
        key=lambda f: os.path.getctime(os.path.join(backup_dir, f))
    )

packs too much into one line.

It can be simplified to:

    backups = [f for f in os.listdir(backup_dir) if f.startswith(os.path.splitext(filename)[0]) and f.endswith(os.path.splitext(filename)[1])]
    backups = sorted(
        backups,
        key=lambda f: os.path.getctime(os.path.join(backup_dir, f))
    )

and then to:

    backups = [f for f in os.listdir(backup_dir) if f.startswith(os.path.splitext(filename)[0]) and f.endswith(os.path.splitext(filename)[1])]
    backups.sort(key=lambda f: os.path.getctime(os.path.join(backup_dir, f)))

Unfortunately, the first line has a subtle problem.

Suppose you have 2 files, “foo.txt” and “foobar.txt”.

Backing them up could lead to backups called, say, “foo_20241223_171418.txt” and “foobar_20241223_172025.txt”.

Now look at what the retention code is doing: f.startswith(os.path.splitext(filename)[0]) and f.endswith(os.path.splitext(filename)[1]).

When you back up “foo.txt”, it asks whether “foo_20241223_171418.txt” starts with “foo”. It does.

But it also asks whether “foobar_20241223_172025.txt” starts with “foo”. It does.

So it treats “foobar_20241223_172025.txt” as though it’s a backup of “foo.txt”.

That’s not what you want.

You want it to check whether the name of the backup file is the name of the original file plus a timestamp plus the extension. Best to put that test into a function to keep the code clearer:

    backups = [f for f in os.listdir(backup_dir) if is_backup_file(f, filename)]
    backups.sort(key=lambda f: os.path.getctime(os.path.join(backup_dir, f)))

and define the functions:

    def is_backup_file(backup_filename, orig_filename):
        backup_base, backup_ext = os.path.splitext(backup_filename)
        orig_base, orig_ext = os.path.splitext(orig_filename)
        return backup_base.startswith(orig_base) and is_timestamp(backup_base[len(orig_base) : ]) and backup_ext == orig_ext

    def is_timestamp(ts):
        try:
            datetime.strptime(ts, "_%Y%m%d_%H%M%S")
        except ValueError:
            return False

        return True
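
For example, with the names from above:

    >>> is_backup_file("foo_20241223_171418.txt", "foo.txt")
    True
    >>> is_backup_file("foobar_20241223_172025.txt", "foo.txt")
    False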
1 Like

What does micro-testing mean? Does it mean I should also test the lambda?

Technically, you solved my problem indirectly. Thanks also for simplifying the code and pointing out the subtle file-matching problem.

The reason the test fails on test_backup_file_retains_specified_count is that there’s no time.sleep(1) between the backups in the test, so backups made within the same second get identical names and overwrite each other, and the test fails.
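
For reference, a sketch of the adjusted test with the sleep added (same code and fixtures as the GPT-4o version above):

    # Adjusted test: sleep between backups so each one gets a distinct
    # "%Y%m%d_%H%M%S" timestamp and the names no longer collide.
    import time

    def test_backup_file_retains_specified_count(sample_file, backup_directory):
        for _ in range(5):
            backup_file(sample_file, backup_directory, retain_count=3)
            time.sleep(1)  # timestamp resolution is one second
        backups = [f for f in os.listdir(backup_directory)
                   if os.path.isfile(os.path.join(backup_directory, f))]
        assert len(backups) == 3, "Only 3 backups should be retained."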

Ideally, you should test everything, especially if you’re not fully confident in a programming language. Overconfidence can lead to skipping tests, increasing the risk of bugs and issues. It’s essential to strike a balance.

By micro-testing, I mean writing small, focused tests that run code in isolation with predefined inputs and outputs, then comparing the actual results to the expected ones. It doesn’t need to be a full unit test—just printing the lambda result for simple use cases is enough.

In this case, I would hardcode the expected lambda result to rule out the possibility of a logic bug. Then, I would test what the lambda actually returns, step by step.
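
For example, something this small already tells you whether the sorting logic itself is sound, with made-up ctimes standing in for real files:

    # Micro-test: run just the sorting step with hardcoded inputs and a
    # hardcoded expected result, instead of real files on disk.
    fake_ctimes = {"b_20241223_170200.txt": 2.0,
                   "a_20241223_170100.txt": 1.0,
                   "c_20241223_170300.txt": 3.0}
    backups = sorted(fake_ctimes, key=lambda f: fake_ctimes[f])
    print(backups)  # see what the key lambda actually produces
    assert backups == ["a_20241223_170100.txt",
                       "b_20241223_170200.txt",
                       "c_20241223_170300.txt"]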

1 Like