Any chance of an issue tracker for pypi.org operational problems?

While I am fully aware that there is no SLA for pypi.org, I cannot refrain from pointing out a few problems:

https://pypi.org/help/#feedback does not say who to contact or which issue tracker to use for infrastructure problems. The entire page is written in such a way that the only issue tracker linked is the warehouse one, which is clearly not the right place for tracking operational problems specific to the pypi.org deployment. I really doubt this was accidental, as a lot of effort was put into writing the about section.

Fastly is mentioned as a sponsor, but beyond that there is nothing more than a polite way of saying “if you are not pleased with pypi.org uptime, build your own”.

For the last ~5 hours, OpenStack has seen a huge number of bad results from the Fastly mirrors, which affected our CI, as visible on http://status.openstack.org/elastic-recheck/. Fastly does not even report any kind of issue; everything is green on their side! The same goes for status.pypi.org, which also reports everything green.

What happens is that the CDN reports 404 for packages that were published weeks ago.

After checking on IRC, I was asked to raise https://github.com/pypa/warehouse/issues/8260, which was closed shortly afterwards in a …questionable manner.

What makes it even more frustrating is that building your own mirror is suggested as a solution to the problem, yet there is a direct refusal to provide a non-CDN endpoint for those who do want to build a mirror. As one would expect, using the Fastly CDN as the source for a mirror would only produce a mirror that is less reliable than Fastly itself.

That is not necessarily true: you can use something like devpi that caches the distributions and serves stale ones if it can’t connect to the upstream server.
(I only have personal experience with it: it’s great for train/plane trips and other places with a sketchy connection.)
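
To illustrate the general idea of serving stale content when the upstream is unreachable, here is a minimal sketch. It is not devpi's implementation; the class name `StaleOnErrorCache` and the `fetch_upstream` callable are hypothetical names made up for this example, and the cache is a plain in-memory dict.

```python
class StaleOnErrorCache:
    """Illustrative only: return the last good copy when upstream fails."""

    def __init__(self, fetch_upstream):
        # fetch_upstream is a hypothetical callable: url -> bytes.
        # It is expected to raise OSError (which covers urllib's URLError
        # and ConnectionError) when the upstream cannot be reached.
        self._fetch_upstream = fetch_upstream
        self._store = {}

    def get(self, url):
        try:
            body = self._fetch_upstream(url)
        except OSError:
            # Upstream is unreachable: fall back to the stale copy if any.
            if url in self._store:
                return self._store[url]
            raise
        # Upstream answered: refresh the cached copy and return it.
        self._store[url] = body
        return body
```

A mirror built this way keeps answering for anything it has already fetched once, which is why it can end up more available than its upstream for the packages you actually use.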


All issues, including operational ones, can be opened on the Warehouse repository. The talk of having no SLA is to set the expectation that we don’t have an on-call rotation or people getting paged for downtime. Opening an issue means we’ll (hopefully) see it and take a look, but if some business process relies on PyPI having high uptime, you’re better off taking ownership of that uptime yourself.

That isn’t to say we don’t care about it, just that we don’t make any promises.

404s don’t sound like something that would show up in a status dashboard at this time, so that makes sense. Maybe we could improve our metrics so that they would, but currently they do not.

I don’t understand what the actual issue is here. The Launchpad bug that is linked from that page is 5 years old and suggests random networking issues, but you’re saying it’s 404 errors? The Kibana URL suggests the problem might lie with oslo.log, but I just checked all 15 pages of CDN nodes, and they all have the same 200 response cached for https://pypi.org/simple/oslo-log/.

If there’s some issue here, I suggest opening an issue with reproduction steps that don’t involve going through openstack’s infra if at all possible.
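
As a rough idea of what standalone reproduction steps could look like, here is a sketch that fetches a project's simple index page and reports every listed file that does not answer with a 200. The function names are hypothetical, and it assumes the index is served as HTML with `<a href>` file links and that the file host accepts HEAD requests. It uses only the standard library.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class _LinkCollector(HTMLParser):
    """Collect href values from <a> tags on an index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def parse_file_links(index_html, base_url):
    """Return absolute file URLs from the page, without #hash fragments."""
    collector = _LinkCollector()
    collector.feed(index_html)
    return [urljoin(base_url, h).split("#", 1)[0] for h in collector.hrefs]


def fetch_text(url, timeout=30):
    """GET a URL and return the body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def head_status(url, timeout=30):
    """Return the HTTP status of a HEAD request, including error codes."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code


def find_broken_links(index_url, fetch_index=fetch_text, status_of=head_status):
    """List (url, status) for every indexed file that is not a 200."""
    links = parse_file_links(fetch_index(index_url), index_url)
    statuses = [(link, status_of(link)) for link in links]
    return [(link, status) for link, status in statuses if status != 200]
```

Calling `find_broken_links("https://pypi.org/simple/oslo-log/")` from the affected network and pasting any non-empty result into the issue would give a concrete URL to look at. Network failures other than HTTP error responses are deliberately left to propagate.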

As we said, we can’t afford or manage to operate a public endpoint that isn’t behind a CDN; it would increase the scale of our operations significantly. The handful of endpoints we have that can somewhat bypass the CDN regularly cause problems for us, and they’re not nearly as popular as the repository API.


Thanks for the quick answer on this. I am still trying to get more info from the OpenDev infra folks, as I do not have direct access to the proxies to read the logs.

Debugging this kind of issue is a real PITA, as usually everything works fine when you try locally, but in some specific geographies you end up with errors, the kind where the index lists packages that are not available to download.

Regarding the ticket being old: that is because it is a tracking ticket. There is a regex that matches failures and links them to this bug, so it should not be taken as a regular one-off bug. As you can guess, the CDN is not the only thing that can produce this category of failure (it always needs a human to look at it).
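
To give an idea of how such regex-based matching works, here is a small sketch. It is not the actual elastic-recheck query; the labels and the `classify` helper are hypothetical, and the two patterns are only assumptions modelled on typical pip error lines.

```python
import re

# Hypothetical patterns, loosely modelled on pip error output.
FAILURE_PATTERNS = [
    ("missing-distribution",
     re.compile(r"No matching distribution found for (?P<name>\S+)")),
    ("http-404",
     re.compile(r"HTTP error 404 while getting (?P<url>\S+)")),
]


def classify(log_lines):
    """Return (label, matched_text) for each line that hits a pattern."""
    hits = []
    for line in log_lines:
        for label, pattern in FAILURE_PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((label, match.group(0)))
                break  # first matching pattern wins for this line
    return hits
```

Every hit gets attributed to the same tracking bug regardless of the underlying cause, which is why a human still has to look at each spike.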

Obviously, a small number of errors is expected, as some people may specify wrong dependencies, but a spike like this is clearly some kind of infra issue.

That sounds fine. Basically, once you have a URL on PyPI that is not giving the result you expect, opening an issue with that is the best path forward.