Search quality for statements and built-in functions

nchammas · January 2, 2024, 6:09pm

I noticed that the results for the built-in documentation search seem to be off for two categories of items:

statements
built-in functions

It’s not a huge deal, but is there some relatively easy way to tweak how search works so that exact matches on these two categories are prioritized in the search results?

Here are some examples to illustrate the problem.

Example: `match`

Here are the results for match, which I searched for because I wanted to see how to use this relatively new bit of syntax.

The information I was looking for is nowhere on the first “page” of results:

It’s here under “More Control Flow Tools” on the second “page”:

Example: `open`

Here are the results for open, which I’ve searched for many times because I wanted to review this or that detail about encoding or newlines or what have you.

Information about the built-in function is buried beneath many other results that arguably should have less priority than the built-in function.

Example: `zip`

Here are the results for zip.

Not as bad a situation here, but the built-in function is again getting bumped down by other functions that are not as relevant.

tjreedy · January 3, 2024, 3:00am

The generic problem is that Python uses short everyday words to name things and the specific uses get lost among the hits corresponding to their everyday meanings. There have been previous complaints and suggestions about this and there may be one or more open issues on the tracker. But someone would have to write and maintain custom search code on top of whatever generic code we are using. Perhaps people are asking too much of a generic search tool instead of using the specific search aids we provide.

For syntax, look in the syntax (reference) manual. The ‘match statement’ is listed in the table of contents under the Compound Statements chapter.

The library manual (built-in) functions chapter has an index at the top. Both open and zip are easily clicked.

There are separate module and everything-else indexes. The latter includes a page indexing Python uses of symbols, which are otherwise notoriously hard to search for.

CAM-Gerlach · January 3, 2024, 4:30am

No need. Rather:

The first example is already fixed in recent Sphinx (used in Python 3.13+), as keyword objects are now indexed, and the matching section is returned as the first hit (albeit with a slightly imprecise title).
To fix the latter two (and many similar cases), we just need to tweak Sphinx’s index search result ordering algorithm to prioritize fully qualified matches before partially qualified or unqualified ones.
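To make the ordering idea concrete, here is a rough Python sketch of ranking exact full-name matches ahead of matches on only the last dotted part of a name. This is not Sphinx's actual code (its search runs in JavaScript); the level constants, function names, and sample entries are all made up for illustration.

```python
# Hypothetical sketch of "exact match before partial match" ordering.
# The constants and names here are illustrative, not Sphinx's real values.

EXACT_MATCH = 2    # query equals the full object name, e.g. "open"
PARTIAL_MATCH = 1  # query equals only the last dotted part, e.g. "os.open"
NO_MATCH = 0


def match_level(query: str, full_name: str) -> int:
    """Classify how well an index entry name matches the query."""
    query = query.lower()
    full_name = full_name.lower()
    if full_name == query:
        return EXACT_MATCH
    if full_name.rsplit(".", 1)[-1] == query:
        return PARTIAL_MATCH
    return NO_MATCH


def rank(query: str, names: list[str]) -> list[str]:
    """Return matching names, exact matches first, then partial ones."""
    hits = [(match_level(query, name), name) for name in names]
    hits = [(level, name) for level, name in hits if level > NO_MATCH]
    # Higher level first; alphabetical order breaks ties.
    hits.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in hits]


names = ["os.open", "webbrowser.open", "open", "io.open", "zipfile.ZipFile"]
print(rank("open", names))
# ['open', 'io.open', 'os.open', 'webbrowser.open']
```

With this ordering the built-in `open` comes first and the module-qualified functions follow, which is the behaviour being asked for in the `open` and `zip` examples.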

bryevdv · January 3, 2024, 6:31am

How can this be done? I have a similar issue on a Sphinx docs site I maintain.

pf_moore · January 3, 2024, 12:09pm

Nevertheless, the point @tjreedy made stands. I rarely have trouble finding information in the docs, because I tend to use the contents pages and indexes to navigate, and not rely on search. Google has trained us all to rely heavily on searching over other means of navigating through documents, but it’s not always the best way, and it’s entirely reasonable to point out to the OP that the problem here is less about how search results are ordered than about using search for something it’s not optimised for.

flyinghyrax · January 3, 2024, 6:27pm

I’ve experienced the same issue with the search results as OP. I tend to do something like Paul, where I navigate to a table of contents and ctrl+f for what I want from there. But personally I consider that a workaround, not a solution.

FWIW I’ve observed that people usually gravitate toward one method of navigating or another, e.g. I tend to navigate spatially while my partner always tries search first because it is more natural for them. This makes not sufficiently supporting both navigation methods a kind of accessibility problem (again, just my opinion).

I think it would be nice to have category filters on the search results page, so I could re-execute a search limited to the library docs, keywords, or builtins for example. That introduces those categories to users who don’t naturally seek out index pages to help them refine their results. I don’t know how possible or complex that is to implement, though.

CAM-Gerlach · January 5, 2024, 6:11am

Ideally, by fixing the Sphinx search result scoring algorithm upstream, but for now you can patch it locally by adding a scorer.js file of the appropriate form to your static assets as described here. Some months back I was starting to investigate fixing this myself, but then I got sidetracked into working on another issue and it slipped through the cracks.
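For the local patch, the Sphinx side is a setting in conf.py. A minimal sketch, assuming the custom scorer is saved as `_static/scorer.js` (that path is an assumption for illustration, not something from this thread); `html_search_scorer` is the Sphinx option that names the JavaScript file, relative to the configuration directory. The scorer file itself must define the `Scorer` object in the form Sphinx expects, which is not shown here.

```python
# conf.py (fragment)

# Directory holding custom static assets; "_static" is the conventional name.
html_static_path = ["_static"]

# JavaScript file that defines a custom search `Scorer` object.
# The path below is an assumed location, relative to the directory of conf.py.
html_search_scorer = "_static/scorer.js"
```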

I initially thought the root issue was the Sphinx scoring algorithm not having a separate level for fully-qualified exact index matches, but looking deeper into it now I see that it actually does, but for some reason it doesn’t appear to be working correctly, scoring everything as an exact match (Scorer.objNameMatch) rather than a partial match (Scorer.objPartialMatch). Therefore, resolving this should just be a matter of tracking down and fixing this apparent bug. @AA-Turner am I correct here? Should I open an issue on the Sphinx repo?

Right, but as of Sphinx 5.3.0 the search uses and in fact prioritizes those same indices (previously, it just used the “main” index of API/object names rather than every index term). There’s no reason users should have to manually dig through those indices themselves instead of being able to rely on the search to intelligently return results from them, modulo a few outstanding bugs in that department.

Yup, agreed. I understand there are ways around the current limitations, but my point is that those limitations are almost entirely fixable (and in some cases are already fixed in latest Sphinx) without requiring users to (know to) resort to such.

With some work, mostly on the JS and theme side, it should be possible to filter matches by result type, i.e. document name, object index, general index, document context, etc. (though note that this is already quasi-possible, because the scoring algorithm nominally sorts the results into most of these categories already). With some more work, it might even be possible to filter by object/index entry type (function, module, class, keyword, etc.), though this would probably require some changes to the search index storage. Filtering by bespoke categories like “library docs”, “language reference” or top-level (builtin) objects would require a bunch of custom site-specific changes, though, which may not be easy or maintainable.
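As a rough illustration of filtering by result type, here is a small Python sketch. The real work would be in Sphinx's JavaScript and theme; the `Result` record, the type labels, and the scores below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    title: str
    kind: str   # hypothetical labels: "object", "index", "title", "text"
    score: int


def filter_by_kind(results: list[Result], kind: str) -> list[Result]:
    """Keep only results of one type, best score first."""
    matching = [r for r in results if r.kind == kind]
    return sorted(matching, key=lambda r: -r.score)


# Made-up results for a search on "zip".
results = [
    Result("zipfile", "object", 20),
    Result("zip", "object", 26),
    Result("Data Compression and Archiving", "title", 15),
    Result("Looping Techniques", "text", 5),
]

print([r.title for r in filter_by_kind(results, "object")])
# ['zip', 'zipfile']
```

Filtering by entry type (function, module, keyword) would follow the same pattern, provided the search index stored that type for each entry.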
