Tracking removed HTML IDs

Hello,
There is a request for splitting up a docs page, and more will probably follow. I’d like to address an issue I see with such reorganization: preventing broken links.

And by links, I mean HTML IDs – the part in the URL after #, as in https://docs.python.org/3/reference/datamodel.html#special-method-names.
Sphinx sometimes adds these automatically, so it’s easy to remove one by mistake.

Ideally, if someone links to Python’s documentation, their link will continue to work forever.
Practically, we should at least make sure links are broken deliberately.

(So, please don’t use this topic to discuss if/when IDs should be (re)moved, or how to best move content while keeping links intact. Here, it’s just about detecting that we broke a possible link.)

We already have checks with known exceptions: for Sphinx warnings or C globals. I suggest doing something similar for removed HTML pages and IDs: check in CI, right after the docs are built, and ask people to add an entry to a file like Doc/data/removed-ids.txt if a removal was deliberate.
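For illustration, the exclusions file could record deliberate removals one per line in a `page: id` format, with `#` comments – this is the format the draft script at the end parses. The entries here are hypothetical examples:

```text
# Doc/data/removed-ids.txt (hypothetical example)
# Deliberately removed pages/IDs, one per line, as "page.html: id"
library/http.server.html: http.server.CGIHTTPRequestHandler
c-api/init.html: c.Py_GetPath
```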

To detect removals, we need a baseline – the set of IDs from a previous version of the docs. I suggest publishing the set of IDs with the docs, for example as an html-ids.json.gz file alongside objects.inv. That way, CI doesn’t need to download or build more than necessary.
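As a sketch of what I have in mind (the exact schema is of course up for discussion), the archive could be gzipped JSON mapping each page path to a list of its IDs – cheap to produce, and a one-liner to read back in CI. The page names and IDs here are just examples:

```python
# Sketch of a possible html-ids.json.gz format: gzipped JSON mapping
# each page path to a sorted list of its IDs.
import gzip
import io
import json

ids_by_page = {
    'c-api/complex.html': ['complex-numbers-as-python-objects'],
    'library/wave.html': ['wave.Wave_read.getmark', 'wave.Wave_read.getmarkers'],
}

# Write the archive to an in-memory buffer here; CI would write a real
# file next to objects.inv instead.
buffer = io.BytesIO()
with gzip.open(buffer, 'wt', encoding='utf-8') as zfile:
    json.dump(ids_by_page, zfile)

# Reading it back for the check is just as short.
buffer.seek(0)
with gzip.open(buffer, 'rt', encoding='utf-8') as zfile:
    assert json.load(zfile) == ids_by_page
```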

The baseline for main would be the latest stable version; for other branches (if enabled) it would be the current live docs for that branch.

Does that sound reasonable?


For reference:

IDs removed in 3.15, relative to 3.14 (auto-numbered IDs and whatsnew pages filtered out):
c-api/complex.html: complex-numbers-as-python-objects

c-api/extension-modules.html: initialization-function

c-api/import.html: c.PyImport_ImportModuleNoBlock

c-api/init.html: c.Py_GetExecPrefix
c-api/init.html: c.Py_GetPath
c-api/init.html: c.Py_GetPrefix
c-api/init.html: c.Py_GetProgramFullPath
c-api/init.html: c.Py_GetProgramName
c-api/init.html: c.Py_GetPythonHome

c-api/module.html: module-definitions
c-api/module.html: module-slots

c-api/stable.html: c-api-stability

c-api/sys.html: c.PySys_ResetWarnOptions

c-api/weakref.html: c.PyWeakref_GET_OBJECT
c-api/weakref.html: c.PyWeakref_GetObject

extending/extending.html: a-simple-example
extending/extending.html: back-to-the-example
extending/extending.html: backtoexample
extending/extending.html: compilation-and-linkage
extending/extending.html: extending-python-with-c-or-c
extending/extending.html: extending-simpleexample
extending/extending.html: intermezzo-errors-and-exceptions
extending/extending.html: methodtable
extending/extending.html: the-module-s-method-table-and-initialization-function

extending/index.html: creating-extensions-without-third-party-tools

howto/perf_profiling.html: python-support-for-the-linux-perf-profiler

library/dis.html: opcode-LOAD_CONST_IMMORTAL

library/ftplib.html: ftplib.FTP_TLS.ssl_version

library/http.server.html: cmdoption-http.server-cgi
library/http.server.html: http.server.CGIHTTPRequestHandler
library/http.server.html: http.server.CGIHTTPRequestHandler.cgi_directories
library/http.server.html: http.server.CGIHTTPRequestHandler.do_POST

library/importlib.html: importlib.abc.FileLoader.load_module
library/importlib.html: importlib.abc.InspectLoader.load_module
library/importlib.html: importlib.abc.Loader.load_module
library/importlib.html: importlib.abc.SourceLoader.load_module
library/importlib.html: importlib.machinery.SourceFileLoader.load_module
library/importlib.html: importlib.machinery.SourcelessFileLoader.load_module

library/pathlib.html: pathlib.PurePath.is_reserved

library/platform.html: java-platform
library/platform.html: platform.java_ver

library/profile.html: cmdoption-cProfile-m
library/profile.html: cmdoption-cProfile-o
library/profile.html: cmdoption-cProfile-s
library/profile.html: instant-user-s-manual
library/profile.html: introduction-to-the-profilers
library/profile.html: module-cProfile
library/profile.html: module-pstats
library/profile.html: profile-and-cprofile-module-reference
library/profile.html: profile-cli
library/profile.html: profile-instant
library/profile.html: profile-stats
library/profile.html: profiler-introduction
library/profile.html: pstats.Stats
library/profile.html: pstats.Stats.add
library/profile.html: pstats.Stats.dump_stats
library/profile.html: pstats.Stats.get_stats_profile
library/profile.html: pstats.Stats.print_callees
library/profile.html: pstats.Stats.print_callers
library/profile.html: pstats.Stats.print_stats
library/profile.html: pstats.Stats.reverse_order
library/profile.html: pstats.Stats.sort_stats
library/profile.html: pstats.Stats.strip_dirs
library/profile.html: the-python-profilers
library/profile.html: the-stats-class

library/typing.html: typing.no_type_check_decorator

library/wave.html: wave.Wave_read.getmark
library/wave.html: wave.Wave_read.getmarkers

library/zipimport.html: zipimport.zipimporter.load_module

reference/datamodel.html: module.__cached__

draft collection/checking script
from compression import gzip
import concurrent.futures
from pathlib import Path
import html.parser
import functools
import argparse
import json
import sys
import re


IGNORED_ID_RE = re.compile(r"""
    index-\d+
    | id\d+
    | [_a-z]+_\d+
""", re.VERBOSE)


class IDGatherer(html.parser.HTMLParser):
    def __init__(self, ids):
        super().__init__()
        self.__ids = ids

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'id':
                if not IGNORED_ID_RE.fullmatch(value):
                    self.__ids.add(value)


def get_ids_from_file(path):
    ids = set()
    gatherer = IDGatherer(ids)
    # Sphinx output is UTF-8; don't rely on the locale encoding.
    with path.open(encoding='utf-8') as file:
        while chunk := file.read(4096):
            gatherer.feed(chunk)
    gatherer.close()
    return ids


def gather_ids(htmldir, *, verbose_print):
    if not htmldir.joinpath('objects.inv').exists():
        raise ValueError(f'{htmldir!r} is not a Sphinx HTML output directory')

    ids_by_page = {}
    tasks = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for path in htmldir.glob('**/*.html'):
            relative_path = path.relative_to(htmldir)
            if '_static' in relative_path.parts:
                continue
            if 'whatsnew' in relative_path.parts:
                continue
            tasks[relative_path] = pool.submit(get_ids_from_file, path=path)

        for relative_path, future in tasks.items():
            verbose_print(relative_path)
            ids = future.result()
            ids_by_page[str(relative_path)] = ids
            verbose_print(f'    - {len(ids)} ids found')

    common = set.intersection(*ids_by_page.values())
    verbose_print(f'Filtering out {len(common)} common ids')
    for key, page_ids in ids_by_page.items():
        ids_by_page[key] = sorted(page_ids - common)

    return ids_by_page


def do_check(baseline, checked, excluded, *, verbose_print):
    successful = True
    for name, baseline_ids in sorted(baseline.items()):
        try:
            checked_ids = checked[name]
        except KeyError:
            successful = False
            print(f'{name}: (page missing)')
            print()
        else:
            missing_ids = set(baseline_ids) - set(checked_ids)
            if missing_ids:
                missing_ids = {
                    a for a in missing_ids
                    if not IGNORED_ID_RE.fullmatch(a)
                    and (name, a) not in excluded
                }
            if missing_ids:
                successful = False
                for missing_id in sorted(missing_ids):
                    print(f'{name}: {missing_id}')
                print()
    return successful


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '-v', '--verbose', action='store_true',
        help='print out more information')
    subparsers = parser.add_subparsers(dest='command', required=True)

    collect = subparsers.add_parser(
        'collect',
        help='collect IDs from a set of HTML files')
    collect.add_argument(
        'htmldir', type=Path,
        help='directory with HTML documentation')
    collect.add_argument(
        '-o', '--outfile', type=Path,
        help='file to save the result in; default <htmldir>/html-ids.json.gz')

    check = subparsers.add_parser(
        'check',
        help='check two archives of IDs')
    check.add_argument(
        'baseline_file', type=Path,
        help='file with baseline IDs')
    check.add_argument(
        'checked_file', type=Path,
        help='file with checked IDs')
    check.add_argument(
        '-x', '--exclude-file', type=Path,
        help='file with IDs to exclude from the check')

    args = parser.parse_args(argv[1:])

    if args.verbose:
        verbose_print = functools.partial(print, file=sys.stderr)
    else:
        def verbose_print(*args, **kwargs):
            """do nothing"""

    if args.command == 'collect':
        ids = gather_ids(args.htmldir, verbose_print=verbose_print)
        if args.outfile is None:
            args.outfile = args.htmldir / 'html-ids.json.gz'
        with gzip.open(args.outfile, 'wt') as zfile:
            json.dump(ids, zfile)

    if args.command == 'check':
        with gzip.open(args.baseline_file, 'rt') as zfile:
            baseline = json.load(zfile)
        with gzip.open(args.checked_file, 'rt') as zfile:
            checked = json.load(zfile)
        excluded = set()
        if args.exclude_file:
            with open(args.exclude_file, 'rt') as file:
                for line in file:
                    line = line.strip()
                    if line and not line.startswith('#'):
                        name, sep, excluded_id = line.partition(':')
                        if sep:
                            excluded.add((name.strip(), excluded_id.strip()))
        if do_check(baseline, checked, excluded, verbose_print=verbose_print):
            verbose_print('All OK')
        else:
            sys.stdout.flush()
            print(
                'ERROR: Removed IDs found',
                'The above HTML IDs were removed from the documentation, '
                + 'resulting in broken links. Please add them back.',
                sep='\n',
                file=sys.stderr)
            if args.exclude_file:
                print(f'Alternatively, add them to {args.exclude_file}.',
                      file=sys.stderr)
            sys.exit(1)


if __name__ == '__main__':
    main(sys.argv)


Looks great!

How many removals were there in older versions?


This looks very useful.

To clarify: is this discussion focused only on identifying the IDs/anchors in rendered HTML documentation, and not on how to actually handle them as redirects, such as with a new Sphinx extension or web-server configuration? I’m interested in both topics for other Sphinx projects myself, and I think identification is the first required step.


See this gist.

Yes, I’d like to keep this focused. Feel free to open a new topic :‍)


Just for clarification: do I understand correctly that you are referring to the URI fragment, as defined in RFC 3986?

As with any URI, use of a fragment identifier component does not
imply that a retrieval action will take place. A URI with a fragment
identifier may be used to refer to the secondary resource without any
implication that the primary resource is accessible or will ever be
accessed.

That appears to allow for it, though I am unsure how to implement it in a static site such as Sphinx normally produces.