Mark an HTML page to NOT parse it as distribution listing

uranusjr · August 2, 2019, 6:58pm

A recent exchange on distutils-sig got me thinking about how we can improve UX when the user supplies a wrong value as --find-links or --[extra-]index-url. I also remember an incident where a user got very confused why pip can’t find a package version because they incorrectly supplied pypi.org/project to --index-url (instead of the correct pypi.org/simple).

So the problem here is that find-links and index pages are defined very loosely (basically any HTML page works), and a tool has no choice but to consume whatever HTML the user passes in, leading to confusing behaviour or cryptic errors. This probably can never be fully fixed as long as we stick to the Simple API (backwards compatibility), but I wonder whether it’d be possible to at least introduce a way to mark an HTML page as not a listing page, and add that marker to pages commonly supplied by mistake (e.g. pypi.org/simple/ and pypi.org/project/<name>/), so pip can error out as soon as possible with a clear error message.

My idea right now:

1. Specify a custom attribute to mark an HTML page as non-listing (something like <html data-pypackage-not-listing="true">).
2. Add that attribute to pages commonly supplied by mistake.
3. (Optional) Amend PEP 503 to require the index root page to contain that attribute?
4. Implement pip so it stops reading/processing the page as soon as it encounters that attribute.
5. Error out (or emit a message like it does when Content-Type is not HTML) when 4. happens.
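
To make the detection and early-exit steps in the list above concrete, here is a rough sketch using only the standard library. The attribute name is just the example from this post, and NotListingDetector and is_marked_not_listing are made-up names for illustration; none of this is pip's actual code.

```python
from html.parser import HTMLParser

# Hypothetical attribute name, taken from the example in the proposal.
NOT_LISTING_ATTR = "data-pypackage-not-listing"


class _StopParsing(Exception):
    """Raised to stop parsing once the <html> tag has been inspected."""


class NotListingDetector(HTMLParser):
    """Looks at the <html> start tag only, then aborts parsing."""

    def __init__(self):
        super().__init__()
        self.not_listing = False

    def handle_starttag(self, tag, attrs):
        if tag != "html":
            return
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        self.not_listing = dict(attrs).get(NOT_LISTING_ATTR) == "true"
        raise _StopParsing


def is_marked_not_listing(page_html):
    """Return True if the page opts out of being treated as a listing."""
    detector = NotListingDetector()
    try:
        detector.feed(page_html)
    except _StopParsing:
        pass
    return detector.not_listing
```

A caller would check is_marked_not_listing(page) before collecting any links, and raise a clear error if it returns True.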

I can draft a PEP if that’s needed.

brettcannon · August 2, 2019, 7:14pm

Why not invert that and say that index pages must provide a custom attribute? It will be much easier to get index pages updated versus all other pages on the internet. And pip could raise an exception for a while to let people know their index page will be considered invalid in the future.

uranusjr · August 2, 2019, 7:43pm

I agree it’s much nicer in theory that way, but that’d need a looong (like in terms of 3+ years; I just read about an issue on pypa/pip on pip 9 the other day) transition period during which people would continue to suffer (although it’d indeed be slightly easier to debug the issue).

Or maybe we can do both? Blacklist the disable attribute right now, warn when the explicit enable attribute is missing for a while, and require the explicit whitelist eventually (plus retiring the blacklist attribute).
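
Roughly, the combined policy could look like the sketch below. Both attribute names and the classify_page function are hypothetical, and html_attrs is assumed to be a dict of the attributes found on the page's <html> tag.

```python
NOT_LISTING_ATTR = "data-pypackage-not-listing"  # hypothetical opt-out marker
LISTING_ATTR = "data-pypackage-listing"          # hypothetical opt-in marker


def classify_page(html_attrs, require_opt_in=False):
    """Decide how to treat a page given the attributes on its <html> tag.

    Returns "reject", "accept", or "warn". With require_opt_in=False this
    models the transition period; with require_opt_in=True it models the
    eventual state where the explicit opt-in marker is mandatory.
    """
    # The opt-out marker is honoured immediately.
    if html_attrs.get(NOT_LISTING_ATTR) == "true":
        return "reject"
    if html_attrs.get(LISTING_ATTR) == "true":
        return "accept"
    # Unmarked page: only a warning until the opt-in becomes required.
    return "reject" if require_opt_in else "warn"
```

Once the opt-in is mandatory the opt-out check becomes redundant, which is the point where the blacklist attribute could be retired.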

dstufft · August 3, 2019, 1:05pm

Requiring a custom attribute would mean that chucking some files in the file system and using any old web server would no longer work. We could look at stopping support for that, but it’s good to be explicit that that’s what it would mean.

brettcannon · August 6, 2019, 11:21pm

Why would that be? It’s still valid HTML at the end of the day so I don’t understand how that would prevent using http.server.

cjerdonek · August 7, 2019, 5:37am

Are there any heuristics that pip can use to detect likely mistakes, with a low chance of false positives? If so, pip could log a warning in those cases. What are the cryptic errors / confusing behavior that users currently see?

uranusjr · August 7, 2019, 5:53am

The problem is that the simple API allows virtually anything. We can identify characteristics of the common mistakes we know of, but there’s no way to know whether someone’s custom source has the same characteristic and simply goes unnoticed right now because everything works perfectly. Adding a new blacklisting attribute guarantees those dark matter cases continue to work.

pf_moore · August 7, 2019, 8:05am

No, the point is that you can run python -m http.server on any directory and get an index page served. That’s a valid (and recommended for simple cases) way of serving an index, but the actual index HTML page is generated by the webserver, not explicitly created. So there’s no custom attribute, unless the web server is modified to generate it.
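
For anyone who wants to see this for themselves, here is a small sketch that serves a directory the same way python -m http.server does and fetches the auto-generated listing. The function name fetch_generated_listing and the example file name are made up for illustration.

```python
import functools
import http.server
import pathlib
import tempfile
import threading
import urllib.request


def fetch_generated_listing(directory):
    """Serve `directory` on a loopback port and return the listing HTML."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    # Port 0 asks the OS for any free port.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/"
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url) as response:
            return response.read().decode("utf-8", errors="replace")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pathlib.Path(tmp, "example_pkg-1.0.tar.gz").touch()
        listing = fetch_generated_listing(tmp)
        # The file shows up as a link, but the page carries no custom marker.
        print("example_pkg-1.0.tar.gz" in listing)
        print("data-pypackage" in listing)
```

The listing HTML comes from the server's own template, so the only things the person serving the directory controls are the file names in it.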

leorochael · August 7, 2019, 2:24pm

pf_moore wrote: “No, the point is that you can run python -m http.server on any directory and get an index page served.”

One way to keep that simplicity but still start on a road to avoid dumb mistakes is to say that one of the links on the index page (e.g. one of the files in a directory served by any web server) is to a text file named:

0_THIS_IS_A_PYTHON_PACKAGING_INDEX.txt

This way, you could dump a bunch of packages in a directory, create (or just touch) that file and be done.

It could of course contain a recommended text explaining what those links mean (or in the case of a filesystem directory, what is the content of that directory), but no packaging tools would actually bother following that link or reading that file.

pip and other packaging tools could start warning that they didn’t find that link in a supposed index page, and with a switch outright refuse to work.

In the future the link could become mandatory.

You’re all welcome to bikeshed the name of the file to your hearts’ content.
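
As a rough sketch of the check a tool could do (LinkCollector and has_index_marker are made-up names, and the marker file name is the placeholder from above, not anything agreed on):

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit

# Placeholder name from this post; the real name is still up for discussion.
MARKER_NAME = "0_THIS_IS_A_PYTHON_PACKAGING_INDEX.txt"


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on the page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def has_index_marker(page_html):
    """Return True if any link on the page points at the marker file."""
    collector = LinkCollector()
    collector.feed(page_html)
    for href in collector.hrefs:
        # Compare only the final path segment, ignoring query and fragment.
        filename = unquote(urlsplit(href).path).rsplit("/", 1)[-1]
        if filename == MARKER_NAME:
            return True
    return False
```

The tool never fetches the marker file itself; it only looks for the link, so a directory listing generated by any web server works as long as the file exists in the directory.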

pradyunsg · August 7, 2019, 3:41pm

That’s a honking good idea @leorochael!

uranusjr · August 7, 2019, 4:58pm

Agreed, this sounds like a good practical solution to the problem.
