I’m working on an AO3 fanfiction recommender script that makes many HTTP requests to Archive of Our Own. Currently I have two separate systems:
- SQLite cache (`requests_cache`) - stores responses to avoid re-fetching
- Debug HTML files - saves raw HTML to disk with the `--save-debug-html` flag for debugging
## Current setup
Here’s what my client currently looks like:
```python
import datetime

import requests
import requests_cache
from bs4 import BeautifulSoup


class AO3Client:
    def __init__(self, use_cache=True, save_debug_html=False):
        # SQLite cache
        if use_cache:
            self.session = requests_cache.CachedSession(
                "cache/ao3_cache",
                backend="sqlite",
                expire_after=datetime.timedelta(days=30),
            )
        else:
            self.session = requests.Session()

        # Separate debug HTML
        self.save_debug_html = save_debug_html

    def get_soup(self, url):
        response = self.session.get(url)

        # Save debug HTML separately
        if self.save_debug_html:
            timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"{timestamp}_{url.replace('/', '_')}.html"
            with open(f"debug_html/{filename}", "w") as f:
                f.write(response.text)

        return BeautifulSoup(response.text, "html.parser")
```
## Why I need both
When something goes wrong (parsing fails, content changed, etc.), I need to inspect what the server returned. The SQLite cache is a binary blob that’s hard to inspect:
```
$ file cache/ao3_cache.sqlite
cache/ao3_cache.sqlite: SQLite 3.x database

$ sqlite3 cache/ao3_cache.sqlite "SELECT substr(value,1,100) FROM responses LIMIT 1"
(blob, not human-readable)
```
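As far as I can tell, the blob is opaque because `requests_cache`'s default serializer pickles the response object before storing it. A self-contained sketch of what's going on (using an in-memory database and a stand-in payload rather than the real cache file):

```python
import pickle
import sqlite3

# Stand-in for the real cache: the default serializer pickles the response
# before writing it, which is why the sqlite3 CLI shows an unreadable blob.
# Point connect() at cache/ao3_cache.sqlite to peek at real entries
# (requests_cache must be importable for those pickles to load).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE responses (key TEXT, value BLOB)")
payload = {"url": "https://archiveofourown.org/works/123", "status_code": 200}
con.execute("INSERT INTO responses VALUES (?, ?)", ("k1", pickle.dumps(payload)))

blob = con.execute("SELECT value FROM responses LIMIT 1").fetchone()[0]
print(blob[:10])           # raw pickle bytes - not human-readable
print(pickle.loads(blob))  # but loadable back into a Python object
```

So the data is all there; it's just not something I can open in an editor or browser.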
So I added `--save-debug-html`, which writes raw HTML files:

```
debug_html/
20260325_015941_https___archiveofourown.org_works_1113588_view_adult_true.html
```
I can open these in a browser to see exactly what the script saw. This works great for debugging but has downsides:
- Duplicates data (already cached, now also saved as HTML)
- Separate code path to maintain
- Have to remember to run with `--save-debug-html` when debugging
## My goal
I want to replace both with a single cache that:
- Stores responses as human-readable JSON files (not binary SQLite)
- Filenames include the URL so I can find them: `archiveofourown.org_works_123456_a1b2c3d4.json`
- Automatically saves an accompanying HTML file for browser viewing
- Has an `index.json` for fast lookups
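Concretely, the on-disk layout I have in mind would look something like this (filenames illustrative, following the scheme above):

```
cache/ao3_cache/
    index.json
    archiveofourown.org_works_123456_a1b2c3d4.json
    archiveofourown.org_works_123456_a1b2c3d4.html
```

where `index.json` maps cache keys to JSON filenames.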
This would give me:
- Caching without re-fetching ✓
- Built-in debugging (no separate flag needed) ✓
- Single system to maintain ✓
## What I’ve tried
I implemented a custom storage class based on `requests_cache.backends.base.BaseStorage`:
```python
import hashlib
import json
import re
from pathlib import Path

from requests_cache.backends.base import BaseCache, BaseStorage


class JSONFileStorage(BaseStorage):
    def __init__(self, directory: str, **kwargs):
        super().__init__(**kwargs)
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.index_file = self.directory / "index.json"
        self.index = self._load_index()

    def _load_index(self):
        if self.index_file.exists():
            with open(self.index_file) as f:
                return json.load(f)
        return {}

    def _save_index(self):
        with open(self.index_file, "w") as f:
            json.dump(self.index, f, indent=2)

    def _make_filename(self, key: str, url: str) -> str:
        # Human-readable from URL: archiveofourown.org_works_123456_a1b2c3d4.json
        safe_url = url.replace("https://", "").replace("/", "_")
        safe_url = re.sub(r"[\\/*?:\"<>|]", "_", safe_url)
        key_hash = hashlib.sha256(key.encode()).hexdigest()[:8]
        return f"{safe_url}_{key_hash}.json"

    def __setitem__(self, key: str, value):
        # value is whatever requests_cache passes - maybe a dict or CachedResponse?
        url = value.get("url", key) if isinstance(value, dict) else key
        filename = self._make_filename(key, url)
        path = self.directory / filename

        # Save JSON
        with open(path, "w") as f:
            json.dump(value, f, indent=2)

        # Save HTML for debugging
        html_path = path.with_suffix(".html")
        if "text" in value:
            with open(html_path, "w") as f:
                f.write(value["text"])

        self.index[key] = filename
        self._save_index()

    def __getitem__(self, key: str):
        path = self.directory / self.index[key]
        with open(path) as f:
            return json.load(f)  # But requests_cache expects something else!

    # ... other required methods (__len__, __iter__, __delitem__, clear)
```
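For completeness, the elided methods are roughly what you'd expect from dict semantics. Here they are on a stripped-down stand-in class so they run on their own (the real versions live on `JSONFileStorage` and also delete the paired `.html` file):

```python
import json
from pathlib import Path
from tempfile import mkdtemp


class StorageDemo:
    """Stand-in with just the index machinery, showing the remaining
    dict-style methods a BaseStorage subclass needs to provide."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.index = {}

    def _save_index(self):
        (self.directory / "index.json").write_text(json.dumps(self.index))

    def __delitem__(self, key):
        filename = self.index.pop(key)  # raises KeyError if missing
        (self.directory / filename).unlink(missing_ok=True)
        self._save_index()

    def __iter__(self):
        return iter(self.index)

    def __len__(self):
        return len(self.index)

    def clear(self):
        for filename in list(self.index.values()):
            (self.directory / filename).unlink(missing_ok=True)
        self.index = {}
        self._save_index()


demo = StorageDemo(mkdtemp())
demo.index["key1"] = "a.json"
demo._save_index()
del demo["key1"]
print(len(demo))  # 0
```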
And then I try to use it:

```python
storage = JSONFileStorage(directory="cache/ao3_cache")
backend = BaseCache()
backend.responses = storage  # route response storage through the JSON files

self.session = requests_cache.CachedSession(
    backend=backend,
    expire_after=datetime.timedelta(days=30),  # a bare int is seconds, so be explicit
)
```
## The problem
When retrieving from cache, requests_cache expects the cached response to have specific attributes like:
- `.expires` - datetime or None
- `.is_expired` - property/method
- `.text` - response body
- `.url` - original URL
- `.status_code` - HTTP status
My `__getitem__` returns a dict, but requests_cache internally does things like:

```python
# Inside requests_cache's policy/actions.py
if cached_response.expires is None:  # AttributeError if it's a dict
    return True
if cached_response.is_expired:  # AttributeError
    return False
```
So I tried wrapping the dict in a custom class:
```python
class CachedResponse:
    def __init__(self, data):
        self.text = data.get("text", "")
        self.url = data.get("url", "")
        self.status_code = data.get("status_code", 200)
        self._expires = data.get("expires")

    @property
    def expires(self):
        return self._expires

    @property
    def is_expired(self):
        return False  # Simplified for now
```
But I still get errors:

```
TypeError: can only concatenate str (not "datetime.timedelta") to str
```
It seems requests_cache expects the `expires` field to be an actual `datetime` object, not a string, and it performs arithmetic on it.
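If I keep going down this road, I suppose `__getitem__` would have to parse `expires` back into a real `datetime` before handing the response to requests_cache - something like this round-trip (function names are mine, purely illustrative):

```python
from datetime import datetime, timedelta

# Round-tripping `expires` through JSON: store it as an ISO-8601 string,
# parse it back into a real datetime on load so the library can compare
# and subtract it without blowing up on a str.
def expires_to_json(expires):
    return expires.isoformat() if expires is not None else None

def expires_from_json(value):
    return datetime.fromisoformat(value) if value is not None else None

stored = expires_to_json(datetime(2026, 3, 25) + timedelta(days=30))
restored = expires_from_json(stored)
print(restored)  # 2026-04-24 00:00:00
```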
## Questions
1. What's the correct interface I need to implement for a custom backend? Is there documentation or an example I can follow?

2. Is there a simpler approach than trying to hack into `requests_cache`'s internals? Maybe just wrapping `requests.Session` with my own caching layer:
```python
class HTMLCachingSession:
    def get(self, url):
        cached = self._check_cache(url)
        if cached:
            return cached
        response = self.session.get(url)
        self._save_cache(url, response)
        return response
```
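If I went this way, the helpers might look something like the following (a sketch; the names, paths, and the decision to return plain dicts instead of `requests.Response` objects are all mine):

```python
import hashlib
import json
from pathlib import Path


class HTMLCachingSession:
    """Sketch of the wrapper idea: JSON metadata plus a sibling .html file
    per URL, no requests_cache involved. `session` is anything with a
    .get(url) returning an object with .status_code and .text
    (e.g. requests.Session())."""

    def __init__(self, session, cache_dir="cache/ao3_cache"):
        self.session = session
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url):
        # Same naming scheme as above: readable URL slug + short hash
        digest = hashlib.sha256(url.encode()).hexdigest()[:8]
        safe = url.replace("https://", "").replace("/", "_")
        return self.cache_dir / f"{safe}_{digest}.json"

    def _check_cache(self, url):
        path = self._path_for(url)
        if path.exists():
            return json.loads(path.read_text())  # plain dict, not a Response
        return None

    def _save_cache(self, url, response):
        record = {"url": url, "status_code": response.status_code, "text": response.text}
        path = self._path_for(url)
        path.write_text(json.dumps(record, indent=2))
        path.with_suffix(".html").write_text(response.text)  # browser-viewable copy
        return record

    def get(self, url):
        cached = self._check_cache(url)
        if cached is not None:
            return cached
        return self._save_cache(url, self.session.get(url))
```

Note that this returns the same dict shape on hits and misses, so callers read `record["text"]` instead of `response.text` - simpler, but it means my parsing code stops seeing real Response objects.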
Would this be more maintainable than fighting with requests_cache?
3. Should I just give up and keep the two separate systems? The duplication bothers me, but maybe it's not worth the complexity.
Any guidance appreciated!