How to make requests_cache store human-readable JSON files that can also serve as debug HTML?

I’m working on an AO3 fanfiction recommender script that makes many HTTP requests to Archive of Our Own. Currently I have two separate systems:

  1. SQLite cache (requests_cache) - stores responses to avoid re-fetching
  2. Debug HTML files - saves raw HTML to disk with --save-debug-html flag for debugging

Current setup

Here’s what my client currently looks like:

import datetime

import requests
import requests_cache
from bs4 import BeautifulSoup

class AO3Client:
    def __init__(self, use_cache=True, save_debug_html=False):
        # SQLite cache
        if use_cache:
            self.session = requests_cache.CachedSession(
                "cache/ao3_cache",
                backend="sqlite",
                expire_after=datetime.timedelta(days=30),
            )
        else:
            self.session = requests.Session()
        
        # Separate debug HTML
        self.save_debug_html = save_debug_html
    
    def get_soup(self, url):
        response = self.session.get(url)
        
        # Save debug HTML separately
        if self.save_debug_html:
            timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"{timestamp}_{url.replace('/', '_')}.html"
            with open(f"debug_html/{filename}", "w") as f:
                f.write(response.text)
        
        return BeautifulSoup(response.text, "html.parser")

Why I need both

When something goes wrong (parsing fails, content changed, etc.), I need to inspect what the server returned. The SQLite cache is a binary blob that’s hard to inspect:

$ file cache/ao3_cache.sqlite
cache/ao3_cache.sqlite: SQLite 3.x database
$ sqlite3 cache/ao3_cache.sqlite "SELECT substr(value,1,100) FROM responses LIMIT 1"
(blob, not human-readable)

So I added --save-debug-html which writes raw HTML files:

debug_html/
  20260325_015941_https___archiveofourown.org_works_1113588_view_adult_true.html

I can open these in a browser to see exactly what the script saw. This works great for debugging but has downsides:

  • Duplicates data (already cached, now also saved as HTML)
  • Separate code path to maintain
  • Have to remember to run with --save-debug-html when debugging

My goal

I want to replace both with a single cache that:

  1. Stores responses as human-readable JSON files (not binary SQLite)
  2. Filenames include the URL so I can find them: archiveofourown.org_works_123456_a1b2c3d4.json
  3. Automatically saves an accompanying HTML file for browser viewing
  4. Has an index.json for fast lookups

This would give me:

  • Caching without re-fetching ✓
  • Built-in debugging (no separate flag needed) ✓
  • Single system to maintain ✓

What I’ve tried

I implemented a custom storage class based on requests_cache.backends.base.BaseStorage:

import hashlib
import json
import re
from pathlib import Path
from requests_cache.backends.base import BaseCache, BaseStorage

class JSONFileStorage(BaseStorage):
    def __init__(self, directory: str, **kwargs):
        super().__init__(**kwargs)
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.index_file = self.directory / "index.json"
        self.index = self._load_index()
    
    def _load_index(self):
        if self.index_file.exists():
            with open(self.index_file) as f:
                return json.load(f)
        return {}
    
    def _save_index(self):
        with open(self.index_file, "w") as f:
            json.dump(self.index, f, indent=2)
    
    def _make_filename(self, key: str, url: str) -> str:
        # Human-readable from URL: archiveofourown.org_works_123456_a1b2c3d4.json
        safe_url = url.replace("https://", "").replace("/", "_")
        safe_url = re.sub(r"[\\/*?:\"<>|]", "_", safe_url)
        key_hash = hashlib.sha256(key.encode()).hexdigest()[:8]
        return f"{safe_url}_{key_hash}.json"
    
    def __setitem__(self, key: str, value):
        # value is whatever requests_cache passes - maybe a dict or CachedResponse?
        url = value.get("url", key) if isinstance(value, dict) else key
        filename = self._make_filename(key, url)
        path = self.directory / filename
        
        # Save JSON
        with open(path, "w") as f:
            json.dump(value, f, indent=2)
        
        # Save HTML for debugging
        html_path = path.with_suffix(".html")
        if "text" in value:
            with open(html_path, "w") as f:
                f.write(value["text"])
        
        self.index[key] = filename
        self._save_index()
    
    def __getitem__(self, key: str):
        path = self.directory / self.index[key]
        with open(path) as f:
            return json.load(f)  # But requests_cache expects something else!
    
    # ... other required methods (__len__, __iter__, __delitem__, clear)
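
For completeness, here is one way those remaining methods could look. This is sketched on a stripped-down stand-in class (no requests_cache dependency) that mirrors the index design above; the class name is mine and this is illustrative, not a tested backend:

```python
import json
from pathlib import Path

class JSONFileStorageSketch:
    """Stand-in mirroring JSONFileStorage's index design, minus requests_cache."""

    def __init__(self, directory: str):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.index_file = self.directory / "index.json"
        self.index = {}

    def _save_index(self):
        self.index_file.write_text(json.dumps(self.index, indent=2))

    # The remaining mapping methods a BaseStorage subclass must provide:
    def __delitem__(self, key: str):
        filename = self.index.pop(key)  # raises KeyError like a dict, as expected
        path = self.directory / filename
        path.unlink(missing_ok=True)
        path.with_suffix(".html").unlink(missing_ok=True)  # remove the debug twin too
        self._save_index()

    def __iter__(self):
        return iter(self.index)

    def __len__(self):
        return len(self.index)

    def clear(self):
        for key in list(self.index):
            del self[key]
```

The point of `__delitem__` deleting both files is that expiry/cleanup in the cache also cleans up the browser-viewable HTML, so the two never drift apart.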

And then I wrap the storage in a BaseCache subclass (the backend needs `responses` and `redirects` storages) and use it:

class JSONFileCache(BaseCache):
    def __init__(self, directory, **kwargs):
        super().__init__(**kwargs)
        self.responses = JSONFileStorage(directory)
        self.redirects = JSONFileStorage(f"{directory}/redirects")

cache = JSONFileCache("cache/ao3_cache")
self.session = requests_cache.CachedSession(
    backend=cache,
    expire_after=datetime.timedelta(days=30),
)

The problem

When retrieving from cache, requests_cache expects the cached response to have specific attributes like:

  • .expires - datetime or None
  • .is_expired - property/method
  • .text - response body
  • .url - original URL
  • .status_code - HTTP status

My __getitem__ returns a dict, but requests_cache internally does things like:

# Inside requests_cache's policy/actions.py
if cached_response.expires is None:  # AttributeError if it's a dict
    return True
if cached_response.is_expired:       # AttributeError
    return False

So I tried wrapping the dict in a custom class:

class CachedResponse:
    def __init__(self, data):
        self.text = data.get("text", "")
        self.url = data.get("url", "")
        self.status_code = data.get("status_code", 200)
        self._expires = data.get("expires")
    
    @property
    def expires(self):
        return self._expires
    
    @property
    def is_expired(self):
        return False  # Simplified for now

But I still get errors:

TypeError: can only concatenate str (not "datetime.timedelta") to str

It seems requests_cache expects the expires field to be a datetime object, not a string, and it performs arithmetic on it.
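
If I stay with JSON on disk, one way around that seems to be storing `expires` as an ISO-8601 string and converting it back to a real datetime in the wrapper, so date arithmetic works again. A minimal stdlib sketch (the function names are mine, not part of requests_cache):

```python
import datetime
import json

def serialize_expires(expires):
    """datetime -> ISO-8601 string (or None) for JSON storage."""
    return expires.isoformat() if expires is not None else None

def deserialize_expires(raw):
    """ISO-8601 string (or None) -> datetime, restoring arithmetic support."""
    return datetime.datetime.fromisoformat(raw) if raw is not None else None

expires = datetime.datetime(2026, 3, 25, 12, 0, 0)
payload = json.dumps({"expires": serialize_expires(expires)})
restored = deserialize_expires(json.loads(payload)["expires"])
# restored - datetime.datetime.now() is now a timedelta, not a TypeError
```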

Questions

  1. What’s the correct interface I need to implement for a custom backend? Is there documentation or an example I can follow?

  2. Is there a simpler approach than trying to hack into requests_cache’s internals? Maybe just wrapping requests.Session with my own caching layer:

class HTMLCachingSession:
    def get(self, url):
        cached = self._check_cache(url)
        if cached:
            return cached
        response = self.session.get(url)
        self._save_cache(url, response)
        return response

Would this be more maintainable than fighting with requests_cache?

  3. Should I just give up and keep the two separate systems? The duplication bothers me, but maybe it’s not worth the complexity.
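
To make question 2 concrete, here is how the wrapper idea could be fleshed out into something self-contained. The helper names and the `CachedPage` class are hypothetical, and the underlying session is injected, so any object with a `.get(url)` method works:

```python
import hashlib
import json
from pathlib import Path

class CachedPage:
    """Minimal response stand-in exposing just what get_soup() needs."""
    def __init__(self, url, status_code, text):
        self.url = url
        self.status_code = status_code
        self.text = text

class HTMLCachingSession:
    """Cache-then-fetch wrapper; `session` is any object with a .get(url) method."""
    def __init__(self, session, directory="cache/ao3_cache"):
        self.session = session
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url):
        # Same human-readable naming scheme as the custom backend attempt
        safe = url.replace("https://", "").replace("/", "_")
        digest = hashlib.sha256(url.encode()).hexdigest()[:8]
        return self.directory / f"{safe}_{digest}.json"

    def get(self, url):
        path = self._path_for(url)
        if path.exists():  # cache hit: no network call
            data = json.loads(path.read_text())
            return CachedPage(data["url"], data["status_code"], data["text"])
        response = self.session.get(url)
        data = {"url": url, "status_code": response.status_code, "text": response.text}
        path.write_text(json.dumps(data, indent=2))
        path.with_suffix(".html").write_text(response.text)  # browser-viewable twin
        return CachedPage(url, response.status_code, response.text)
```

This sketch has no expiry logic, headers, or error handling, which is exactly the trade-off in question 2: less surface area than a requests_cache backend, but everything is mine to maintain.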

Any guidance appreciated!


It may not fulfill all of your requirements, but would the Filesystem backend be of use? At a glance it seems like it would at least give you human-readable JSON files on disk (when paired with the JSON serializer), rather than the binary blob of the SQLite backend.