Fastly interfering with PyPI search

The Thonny Python IDE has a feature that allows free-text package search by making a simple GET request to PyPI (just like the HTML form did) and parsing the results.

This doesn’t seem to work anymore. When searching in Chrome, I briefly see something about “Fastly”.

Does this mean PyPI is discouraging machine use of its search feature? Is there an alternative? I would love to use a proper API instead. I know about the API that gives information about a distribution by its exact name, but I miss the free-text search capability.

I just noticed that today in my browsers. I think it’s new behavior. Does it cause problems for you? My browser shortcut works fine after the brief flash.

The following approach used to give an HTML listing of distributions mentioning “blah”:

from urllib.request import urlopen, Request

# Pretend to be a regular browser. Note: urllib does not transparently
# decompress responses, so asking for gzip/deflate here would break the
# utf-8 decode below if the server actually compressed the body.
req = Request(
    "https://pypi.org/search/?q=blah",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
        "Cache-Control": "no-cache",
    },
)

with urlopen(req) as fp:
    print(fp.read().decode("utf-8"))

Now it gives

...
    <div id="loading-error">
      A required part of this site couldn’t load. This may be due to a browser
      extension, network issues, or browser settings. Please check your
      connection, disable any ad blockers, or try using a different browser.
    </div>
...

Is it really necessary that the request is made via a real browser and ad-blockers are disabled?

Disabling the ad-blocker is not necessary, but NoScript is a problem, it seems, so you have to use a real browser with JavaScript enabled, at least the first time you search. After that, a cookie is set that makes search work even without JavaScript, and you can copy that cookie from your browser into your request headers to make it work from Python too.
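For illustration, a minimal sketch of that manual-cookie approach, assuming you have copied the relevant cookie from your browser’s developer tools (the cookie name and value below are placeholders, not the actual ones PyPI sets):

from urllib.request import Request, urlopen

# Reuse the session cookie the browser obtained after passing the
# JavaScript challenge (name/value are placeholders; copy yours from
# the browser's developer tools).
req = Request(
    "https://pypi.org/search/?q=blah",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
        "Cookie": "challenge-cookie-name=paste-value-from-browser",
    },
)

with urlopen(req) as fp:
    print(fp.read().decode("utf-8"))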

Alternatively, the Index API can be used to list the names of all packages. You can then filter by whatever criteria you want in your own code, and use the project API to get the metadata.

Something like this:

import json
from urllib.request import Request, urlopen

# Fetch the JSON listing of all project names from the Simple Index API (PEP 691)
simple_req = Request(
    "https://pypi.org/simple/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json"},
)

with urlopen(simple_req) as fp:
    simple_response = json.load(fp)

# Filter locally by whatever criteria you want (here: a simple prefix match)
key = "blah"
matches = [p["name"] for p in simple_response["projects"] if p["name"].startswith(key)]

# Get metadata for the first match from the project JSON API
if matches:
    req = Request(
        f"https://pypi.org/pypi/{matches[0]}/json",
        headers={"Accept": "application/json"},
    )

    with urlopen(req) as fp:
        response = json.load(fp)

    print(response["info"])

Yes, this is new behavior, which was put in place to protect PyPI from automated tools issuing large numbers of (expensive to respond to) searches. That search mechanism was only intended for use by humans using interactive tools, and those tools will need to support acting like a browser (including executing JavaScript). It’s unfortunate that such things are necessary, but service abuse on the Internet is a never-ending battle.

CC @EWDurbin

9 Likes

This less than ideal outcome is indeed the result of needing to protect PyPI against automated/scripted access/scraping.

Over the last week we saw a dramatic increase in this kind of activity against the (relatively) expensive-to-serve project, release, and search endpoints. Worse, this activity was coming from over 1000 unique IP addresses with randomized user-agents.

This caused two major outages and one minor one, coinciding with the floods.

Availability of a search API is a sore spot for PyPI, there’s no doubt. We used to have the XMLRPC search API, which was ultimately disabled due to having no way to communicate with the end-users who were flooding it.

It seems that the same kinds of automation were eventually pointed at the HTML search page at https://pypi.org/search/ and led to the same kinds of abuse.

Protecting the availability of PyPI for known/intended/supported use-cases is a priority for me and the team, and in this case browser validation was the only tool we had to reach for to ensure it. I’m grateful we have it, but I understand the impact it has on use-cases like Thonny’s.

I don’t think this work is “complete” and feedback like this is helpful in understanding use-cases. We flipped the switch on Friday to keep PyPI stable over the weekend, and will be looking at this and other feedback in the future to understand the impact and consider how we might support use-cases like this down the line.

19 Likes

I just realized that this breaks our scripts too. I’m not aware of a good way to get the list of all packages owned by a user, except, I think, the deprecated XMLRPC API. We had a script that scraped the website because that information is available via the web UI. That script is now broken, so I think the only option is to go back to using the XMLRPC API for now.

1 Like

To be clear, we weren’t using the general search page, but the https://pypi.org/user/myuser and https://pypi.org/project/myproject pages, and we were not running this script continuously, so I don’t believe we were contributing to the original impetus for the abuse. That said, I totally get why PyPI has to do this, and I want our scripts to be the best possible citizens.

FWIW it’s pretty trivial to use something like playwright-python or requests-html’s JavaScript support (and many others) to either grab the cookie values and then fall back to a smaller library, or to automate the whole process with those libraries.
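As a rough illustration of the cookie-grabbing variant (a sketch, assuming playwright is installed via pip install playwright and playwright install chromium; whether the challenge actually completes headlessly isn’t guaranteed, so you may need headless=False):

from urllib.request import Request, urlopen
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Let a real browser engine execute the JavaScript challenge once
    page.goto("https://pypi.org/search/?q=blah")
    page.wait_for_load_state("networkidle")
    cookies = page.context.cookies()
    browser.close()

# Fall back to plain urllib, reusing the cookies the browser collected
cookie_header = "; ".join(f"{c['name']}={c['value']}" for c in cookies)
req = Request(
    "https://pypi.org/search/?q=blah",
    headers={"Cookie": cookie_header, "User-Agent": "Mozilla/5.0"},
)
with urlopen(req) as fp:
    print(fp.read().decode("utf-8")[:500])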

The more you dive into this world, the more you will find quite mature libraries publicly available (and often hosted on PyPI!) designed to bypass bot detection: captcha solvers, user-agent generation based on real-world distributions, randomly redirecting traffic through different cloud providers, etc.

I will be interested to see if the measures currently taken remain effective, and if not how quickly they will be overcome. Clearly there is a real demand for this, and if there’s no blessed solution some actors will force a solution however they can.

5 Likes

It’s much worse than that. I’ve just realized I can’t use PyPI anymore using elinks, because the essential subpages now require JavaScript. This is a major accessibility problem, and a serious drawback for people who can’t use a fully featured browser for any reason, ranging from people working over an SSH connection who need to quickly look up package versions (especially given that pip search no longer works), to disabled people for whom special browsers provide a much better experience than your average Chrome/Firefox fork.

7 Likes

Thank you (and the team) for all the work on this.

In future it would be great to have the option to use the website without JS enabled :slight_smile:
Right now I can’t even open a project page by typing the URL directly.

5 Likes

You can still use either:

curl -L https://pypi.org/simple/packagename
curl -L https://pypi.org/pypi/packagename/json

if you’re completely stuck without a browser.

There’s also the pip index command for queries related to the index.
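For example (pip index is still marked experimental, so its interface may change):

pip index versions packagename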

5 Likes

This change has broken my Python app (aprsd), which shows users the available plugins and extensions for the app. I have a command that lists installed/available extensions and plugins, and this change has broken that ability.

I understand the need to stop the malicious attacks, but the community needs the ability to programmatically query PyPI for packages.

Can’t valid accounts get an auth ID or something to authenticate against the search page?

fwiw, my app does this to fetch the list of available plugins/extensions.

1 Like

I tried to use requests-html to hit the pypi.org/search page, and in order to support JavaScript it installs Chromium. This isn’t a viable option, especially for running on small devices. We need a real API for pypi.org for requesting packages and searching for packages, like “aprsd*”.

1 Like

If you know the package name, the simple API does the job (there’s a JSON version, scroll down a bit in that section to find it).
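For illustration, a small sketch of the per-project Simple API in JSON form (the package name here is just an example; the versions key comes from the newer PEP 700 revision of the API):

import json
from urllib.request import Request, urlopen

# Per-project endpoint of the Simple API, asking for the JSON serialization
req = Request(
    "https://pypi.org/simple/aprsd/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json"},
)

with urlopen(req) as fp:
    data = json.load(fp)

print(data["name"], data["versions"])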

A search API used to exist, but it (and subsequently the workaround of scraping the web UI as well) was removed because it was abused. See the post by @EWDurbin above for more details. Providing something is still important to the PyPI team, but offering a stable service is (quite rightly) more important.

Development resource for PyPI is in critically short supply, so while this might be a possibility, it’s not likely to happen in the short term (there are important security-related features that haven’t happened yet due to lack of resource, and honestly I hope they are given a higher priority than a new search API…)

3 Likes

If the problem is abuse, wouldn’t an expanding rate limit through a proxy in front of the backend do better? Say 30 requests a minute, with a 5-minute cool-down, and every time a new request is made within the cool-down window, another 5 minutes gets added? I don’t know if expanding cool-downs are a feature available in most reverse proxies, but rate limits certainly are. At least as a mid-term solution?
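To make the idea concrete, here is a toy in-memory sketch of such an expanding cool-down limiter (hypothetical, per-IP, not a description of PyPI’s actual infrastructure):

import time

class ExpandingCooldownLimiter:
    def __init__(self, rate=30, per=60.0, cooldown=300.0):
        self.rate = rate                # allowed requests...
        self.per = per                  # ...per this many seconds
        self.cooldown = cooldown        # base cool-down in seconds
        self.hits = {}                  # ip -> recent request timestamps
        self.blocked_until = {}         # ip -> time when the block expires

    def allow(self, ip):
        now = time.monotonic()
        until = self.blocked_until.get(ip, 0.0)
        if now < until:
            # A request during the cool-down extends the block by another period
            self.blocked_until[ip] = until + self.cooldown
            return False
        window = [t for t in self.hits.get(ip, []) if now - t < self.per]
        window.append(now)
        self.hits[ip] = window
        if len(window) > self.rate:
            self.blocked_until[ip] = now + self.cooldown
            return False
        return True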

1 Like

In case someone finds it useful: I created the following work-around in the Thonny IDE. Instead of turning to PyPI search, Thonny now downloads the list of the 5000 most popular PyPI packages and filters this list according to the search query. In order to allow some typos, I compare the words in the search query to the words in the package name using the Jaro similarity score. I also test the query string against the Simple API so that less popular packages can be found by entering the exact package name.
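For anyone curious, a rough sketch of that matching logic (not Thonny’s actual code; assumes popular_packages is the downloaded list of names and uses the third-party jellyfish library for the Jaro score):

import jellyfish

def search(query, popular_packages, threshold=0.8):
    query_words = query.lower().split()
    scored = []
    for name in popular_packages:
        # Compare each query word against each word of the package name
        name_words = name.lower().replace("-", " ").replace("_", " ").split()
        score = max(
            jellyfish.jaro_similarity(q, w)
            for q in query_words
            for w in name_words
        )
        if score >= threshold:
            scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]

print(search("reqests", ["requests", "numpy", "flask"]))  # tolerates the typo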

2 Likes

That still requires work and resources which are in short supply.

2 Likes

Additionally, grabbing data from the BigQuery datasets (see “BigQuery Datasets” in the PyPI docs) could be another workaround for those who don’t really need to display package versions released one second prior. Not sure exactly how much detail it contains, though. But it could be viable.