Thoth - an enhanced server-side resolution offered to the Python community

fridex · May 23, 2021, 9:22pm

Hi everyone,

tl;dr: we at Red Hat have developed a server-side resolver for Python packages and would like to offer it to the community. See our project page for more info.

Our mission is to offer an alternative to resolvers implemented in pip/pip-tools/Pipenv/Poetry and enhance the resolution process with “guidance” on which Python packages should be installed when developing Python applications. The guidance is done based on knowledge of packages - one can think of various issues that can arise when using packages, generally installation time issues or runtime issues (e.g. performance, security-based issues but also fatal runtime issues, …). Latest packages are not always the greatest choice and having some smart mechanism to guide the resolution process can save time-consuming debugging, project maintenance, or looking for the perfect set of packages to be installed for a specific application.

The whole resolution is based on reinforcement-learning techniques and has an extensible interface - the actual resolution process is seen as a resolution pipeline made out of pipeline units. These units can be written as Python classes. Alternatively, we also offer so-called “prescriptions” that declaratively (YAML) state how the resolver should behave in specific cases. The actual resolver implementation is available in the thoth-station/adviser repository.
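
To make the "pipeline of units" idea more concrete, here is a toy sketch in Python. It is not the adviser's actual interface: the class names `Unit`, `RejectKnownBad`, `PreferVersion` and the function `run_pipeline` are invented for this illustration, and the reinforcement-learning part is left out entirely. Each unit either rejects a candidate package or contributes to its score.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Package = Tuple[str, str]  # (name, version)


class Unit:
    """Hypothetical base class for a pipeline unit (not Thoth's real API)."""

    def run(self, package: Package) -> Optional[float]:
        """Return a score adjustment, or None to reject the package."""
        raise NotImplementedError


class RejectKnownBad(Unit):
    """Rejects packages listed as problematic."""

    def __init__(self, bad: Iterable[Package]) -> None:
        self.bad = set(bad)

    def run(self, package: Package) -> Optional[float]:
        return None if package in self.bad else 0.0


class PreferVersion(Unit):
    """Gives a score bonus to one specific release."""

    def __init__(self, name: str, version: str, bonus: float) -> None:
        self.target = (name, version)
        self.bonus = bonus

    def run(self, package: Package) -> Optional[float]:
        return self.bonus if package == self.target else 0.0


def run_pipeline(units: List[Unit], candidates: List[Package]) -> Dict[Package, float]:
    """Score every candidate; drop candidates that any unit rejects."""
    scored: Dict[Package, float] = {}
    for package in candidates:
        total = 0.0
        for unit in units:
            result = unit.run(package)
            if result is None:
                break  # rejected, do not record a score
            total += result
        else:
            scored[package] = total
    return scored
```

With a unit that rejects `("pkg", "2.0")` and one that prefers `("pkg", "1.1")`, the newest release is dropped and the preferred one ends up with the highest score, which is the "latest is not always greatest" point from above.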

We welcome contributions from the Python community and Python package maintainers to the open knowledge about open-source Python packages - either by reporting issues spotted or by opening a pull request directly against the thoth-station/prescriptions or thoth-station/adviser repositories. We hope that together we can create better and more sustainable software powered by the Python programming language.

Using this type of resolution process, we are able to resolve Python software stacks that require a certain set of ABI symbols to be available in the runtime environment, fix overpinning/underpinning issues, handle issues spotted after a release (a less strict form of yanking that applies only in some cases), perform security checks, and resolve across package indexes. We also tried to address the limited expressive power of wheel tags (using a configuration file maintained by the user). As the resolution process requires pre-aggregated data, we run a background package analysis for packages hosted on PyPI as well as for our optimized builds of TensorFlow.

The resolver is hosted on the publicly available Massachusetts Open Cloud, so anyone can try it using the Thamos CLI, the jupyterlab-requirements extension, or the OpenShift S2I container image build process. We are already using this type of resolution process at Red Hat, and we were able to resolve some Python software stacks for which Pipenv resolution failed; moreover, we try to give Python developers a better selection of packages. On the other hand, we still have things to improve - note that we are still in the development phase and fixing issues on multiple layers, so please bear with us. Nevertheless, we are open to any feedback from the Python community and to exchanging expertise in either direction. If you have any questions or would like to participate, feel free to reach out to us.

You can find more info on our project page thoth-station.ninja. We periodically update our Twitter account and release demos to the community on our YouTube channel.

On behalf of the Thoth team,

Fridolin

Blackward · May 23, 2021, 10:31pm

Howdy Fridolin,

great, thank you very very very much!

So true!!! But a truth, which a lot of people here might not like very much…
I like it! Correction: I love it…

Cool idea! And for all who do not know the term reinforcement-learning techniques: they are used to train neural networks…

That’s absolutely great! So I should definitely try it eventually, as I believe I have the ideal test setup:

blythooon · PyPI

The packages of Blythooon had to be / were assembled manually (a lot of criteria were involved, about which the current resolver has no specific information either) - the standard resolver failed / fails for many packages, both included and not yet included (e.g. try to install opencv-python with pip in Blythooon’s ‘PysideGui’ environment).

At the moment I am quite busy, but count on my feedback as soon as I have enough spare time!

There recently also was a discussion about vulnerabilities here:

Proposing a community maintained database of PyPI package vulnerabilities - Packaging - Discussions on Python.org

I would love it if you guys could somehow take that project into account - might it be useful for Thoth as well as for the people maintaining said vulnerabilities database?

One topic that interests me about Thoth: does it take user feedback such as “I tried package A version X in conjunction with package B version Y - they installed but do not work properly together in this combination” into account and, if so, how? I would love it if you could elaborate a little on that. How does the information flow…?

Cheers to you and the Thoth team, Dominik

fridex · May 24, 2021, 8:28am

Thanks Dominik. Having briefly checked the linked project, it looks like it is not available for the linux-x86_64 platform we currently support. But any feedback or use cases you can bring are valuable to us.

Definitely. As of now we are experimenting with pyup.io’s safety-db. The resolver takes into account the CVEs stated there and acts on them based on the recommendation type used (e.g. when asking for a secure stack, the resolver does not allow a package with a CVE in the resolved set of dependencies). Once the community-maintained database of package vulnerabilities is available, we can switch to it and use it as a source.
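
The behaviour described here can be sketched roughly as follows. This is an illustration only, not the resolver's code: the function name `allowed_versions`, the data layout, and the recommendation type strings are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Package = Tuple[str, str]  # (name, version)


def allowed_versions(
    candidates: Dict[str, List[str]],
    vulnerable: Set[Package],
    recommendation_type: str,
) -> Dict[str, List[str]]:
    """Filter candidate versions according to the requested recommendation type.

    The "security" string is a placeholder for this sketch; any other value
    keeps all candidates.
    """
    if recommendation_type != "security":
        return {name: list(versions) for name, versions in candidates.items()}
    # For a secure stack, drop every release with a known vulnerability.
    return {
        name: [v for v in versions if (name, v) not in vulnerable]
        for name, versions in candidates.items()
    }
```

Swapping the vulnerability source later only changes how the `vulnerable` set is built; the filtering step stays the same.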

We use Dependency Monkey + Amun to run experiments (so-called “inspections”) where we derive such knowledge. Another source is a database of such known issues similar to the vulnerability database linked - we call these “prescriptions” (see the thoth-station/prescriptions repository on GitHub). We also analyze builds happening in clusters we run internally to obtain such knowledge. If the Python community is open to contributing to such a database, we are open to incorporating such knowledge and providing guidance on the software packages used.

Thanks for your reply and interest,
Fridolin

uranusjr · May 24, 2021, 10:00am

Thanks for releasing this! Both pipenv and pip contributors have thrown around the idea of a cloud-based dependency resolution solution, and it’s really exciting to see it realised. What I’m wondering is, would it be possible to interact with the resolution without Thamos, but directly with the API instead? I see there is a Swagger available for Thoth, but the documentation seems sparse. That’s not necessarily a problem since I can always dig into the source of Thamos, but it somehow raises the doubt whether the API is intended to be used directly (yet).

fridex · May 24, 2021, 10:22am

Thanks for your interest!

To be honest, I (personally) am not very happy about the current API design and it could definitely be improved (it was created on the fly as the project evolved). As of now, we cannot promise its stability (actually, there is already a planned redesign, but that’s a long-term task as our priorities are elsewhere as of now).

Thamos offers the thamos.lib module, which is basically a Python interface on top of an automatically generated Python swagger client. Using it might be better than direct calls to the API, as we use it across other parts of the project and want to keep compatibility on this layer internally. But no promises even on this front.

Some more info can be found in docs:
https://thoth-station.ninja/docs/developers/adviser/integration.html#integrating-with-thoth

If you find anything that can be improved, even on the docs side, feel free to raise issues or open pull requests directly.

uranusjr · May 25, 2021, 2:04pm

Thanks. Opening up an API for public consumption is a very tricky thing (even more than programmatic APIs), so it’s definitely reasonable. thamos.lib unfortunately still depends on a lot of things (also very understandable). An alternative would be making the interface more Sans I/O so users can “bring their own network stack” to interact with Thoth, but that’s a lot more design work, unfortunately. There’s not really a perfect solution to this.

fridex · May 25, 2021, 2:46pm

Thanks for the suggestion. I understand your concern with thamos dependencies. Maybe to be more explicit from our side on this - the API is serving its purpose well from a feature point of view but could be fancier and things could be improved. We do not want to treat it as stable as of now, as the project is still evolving and we do not want to block ourselves from quickly extending the functionality. The current API works for us and we hope it will work for the Python community too (if not, let us know). If there is any big redesign or large backwards-incompatible change, we will introduce a new version (possibly a gRPC interface could also be good here). So no promises that the API will not change, but a silent promise not to break all the consumers is a friendly offer we can make from our side.

pf_moore · May 25, 2021, 3:00pm

Sorry to be dense here, but I’m not at all clear what this is intended to be. My naive expectation is that a resolver would take a set of requirements (essentially a requirements.txt file in pip terms) and return a list of exact package pins (name/version). That’s a hard, but deterministic, problem¹ so I’m not clear where an AI-based approach would help, or how “guidance” is involved.

Maybe I’m thinking of the wrong problem here? But you mentioned pip, so my immediate reaction was to think how something like this would fit into what pip’s trying to do.

¹ To find any solution, at least. Picking which is the “best” solution from a set of valid resolutions possibly has scope for intelligence, but I guess my expectations are low, and I’m happy just to get one solution
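
For readers unfamiliar with the "find any solution" problem described here, a minimal backtracking sketch follows. It is a simplification under stated assumptions, not pip's resolver: constraints are plain sets of acceptable version strings, versions are ordered by string comparison, and the names `Index` and `resolve` are made up for the example.

```python
from typing import Dict, Optional, Set

# index[name][version] -> {dependency name: set of acceptable versions}
Index = Dict[str, Dict[str, Dict[str, Set[str]]]]


def resolve(
    index: Index,
    requirements: Dict[str, Set[str]],
    pins: Optional[Dict[str, str]] = None,
) -> Optional[Dict[str, str]]:
    """Return one set of pins satisfying all requirements, or None."""
    pins = dict(pins or {})
    pending = [name for name in requirements if name not in pins]
    if not pending:
        return pins
    name = pending[0]
    # Try the newest version first (string ordering is a simplification).
    for version in sorted(index[name], reverse=True):
        if version not in requirements[name]:
            continue
        # Merge this candidate's dependencies into a copy of the requirements.
        merged = {n: set(allowed) for n, allowed in requirements.items()}
        conflict = False
        for dep, allowed in index[name][version].items():
            merged[dep] = merged.get(dep, set(index[dep])) & allowed
            if dep in pins and pins[dep] not in merged[dep]:
                conflict = True
        if conflict:
            continue
        result = resolve(index, merged, {**pins, name: version})
        if result is not None:
            return result
        # Otherwise fall through and try the next (older) version.
    return None


index: Index = {
    "app-a": {"1.0": {"lib": {"1.0"}}, "2.0": {"lib": {"2.0"}}},
    "app-b": {"1.0": {"lib": {"1.0"}}},
    "lib": {"1.0": {}, "2.0": {}},
}
pins = resolve(index, {"app-a": {"1.0", "2.0"}, "app-b": {"1.0"}})
```

In this toy index the newest `app-a` conflicts with `app-b` over `lib`, so the search backs off to `app-a` 1.0. The result is deterministic, and it is simply the first consistent set found, not necessarily the "best" one.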

fridex · May 25, 2021, 4:30pm

Yes - finding any solution should work fine in pip cases where, for example, a backtracking algorithm would do its job. But what if you would like to have the best (whatever that means - performance+secure+stable+abi requirements+gpu requirements+…) possible set of dependencies pinned? As the knowledge about packages grows, coming up with a good candidate is a matter of exploring the state space of possible candidates (dependencies pinned), observing which packages are good or bad, and learning from the steps taken in the resolution process. This resolution process is specific to the hardware/software available to the application (which are taken into account as well). A basic overview of this should be expressed in docs. Our experiments showed that there is a large number of ways to resolve software stacks in real-world applications. Having some smart, controlled, and centralized way of picking software packages with higher quality can help with the better overall quality of the applications shipped.

Besides this, the resolver offers a pluggable interface to perform actions on the dependency graph, which makes it possible to fix dependency issues or tell the resolver what the resolution rules should look like - either client-side through a configuration file (addressing the limitation of tags) or server-side by an authority that configures the resolver (see resolution pipeline units).
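
The difference between "any" and "best" can be shown with a deliberately naive sketch that enumerates every combination, keeps the valid ones, and picks the highest-scoring one. This is not how Thoth searches (the post describes a learned exploration precisely because the state space is too large to enumerate); the helper names and the penalty numbers used in the example are invented.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Optional

Stack = Dict[str, str]  # package name -> pinned version


def candidate_stacks(versions: Dict[str, List[str]]) -> Iterator[Stack]:
    """Yield every combination of one version per package."""
    names = sorted(versions)
    for combo in product(*(versions[name] for name in names)):
        yield dict(zip(names, combo))


def best_stack(
    versions: Dict[str, List[str]],
    is_valid: Callable[[Stack], bool],
    score: Callable[[Stack], float],
) -> Optional[Stack]:
    """Return the valid stack with the highest score, or None if none is valid."""
    valid = [stack for stack in candidate_stacks(versions) if is_valid(stack)]
    return max(valid, key=score) if valid else None


# Made-up knowledge: a=2 is penalised (say, a known issue), b=1 slightly so.
penalties = {("a", "2"): 5.0, ("b", "1"): 1.0}
best = best_stack(
    {"a": ["1", "2"], "b": ["1", "2"]},
    is_valid=lambda s: not (s["a"] == "2" and s["b"] == "1"),
    score=lambda s: -sum(penalties.get(item, 0.0) for item in s.items()),
)
```

Here the latest `a` is installable but scores worse than the older one, so the selected stack pins `a` to `1` and `b` to `2`.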

DavHau · May 25, 2021, 4:53pm

Thanks for this very interesting piece of technology.
I am interested in this, as I maintain mach-nix which is a declarative python environment manager focused on reproducible dependency resolution and installation.

I’m specifically interested in how you maintain your dependency database as I maintain a similar database myself (pypi-deps-db).
I keep this data updated by regularly crawling pypi. For wheel-releases the metadata is inspected. Though, sdist-releases must be fake-installed via a modified setuptools to extract their requirements.

I know that extracting dependency information from sdist-releases can be hard as their setup can run arbitrary code and therefore depend on arbitrary system packages.

How do you deal with these kinds of situations? Can you guarantee completeness of your data?
I have seen your package-extract project from which I conclude that your dependency data stems from inspecting container images.
Where do you get these container images from? If you build them yourselves, how do you successfully install/analyze some hard-to-build sdist packages?

I am also interested in the Thoth API. As I understand it, the advice given by the API is based on a database which is constantly updated. That means, when I issue the same query at different times, I might get different results. Because of this, reproducibility has to be solved on the client side by additional locking mechanisms. If the API supported a reproducible mode, for example by allowing the client to pass a date so that the resolver operates on a dependency snapshot for that date, then client-side locking would become a non-issue, which would be a huge advancement in my opinion.
Is this a feature that Thoth might be interested in providing in the future?
Is the database on which Thoth bases its advice publicly accessible so that developers could download it and run alternative resolution strategies?
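
The "reproducible mode" suggested above could look roughly like this on the data side: restrict the resolver's input to releases that existed on a given date. This is only a sketch of the idea, not an existing Thoth feature; the function name `snapshot`, the package name, and the dates are hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

Release = Tuple[str, date]  # (version, upload date)


def snapshot(releases: Dict[str, List[Release]], as_of: date) -> Dict[str, List[str]]:
    """Keep only the versions that had been published on or before `as_of`."""
    return {
        name: [version for version, uploaded in items if uploaded <= as_of]
        for name, items in releases.items()
    }


# Illustrative data only; the package and dates are invented.
releases = {
    "example-pkg": [
        ("1.0", date(2021, 1, 10)),
        ("1.1", date(2021, 4, 2)),
        ("2.0", date(2021, 6, 15)),
    ]
}
frozen = snapshot(releases, as_of=date(2021, 5, 25))
```

Note that this freezes only the set of releases. For fully reproducible advice, the knowledge about those releases (scores, known issues) would need to be pinned to the same date as well.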

Blackward · May 25, 2021, 5:46pm

An algorithm can only be as good as its source of information. If your source of information is JUST the requirements.txt files of the packages concerned, you only take the knowledge of the developers of said packages into account.

Let’s explain the problem by an example:

The METADATA file of the package pyqtgraph-0.11.1-py2.py3-none-any.whl contains just one “requirement” line:

Requires-Dist: numpy (>=1.8.0)
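
As a small illustration of how little a resolver learns from this file: METADATA uses e-mail-style headers, so the standard library can read it. In the sketch below only the `Requires-Dist` line is taken from the wheel as quoted above; the `Name` and `Version` lines follow from the wheel file name, and the `Metadata-Version` value is an assumption.

```python
from email.parser import Parser
from typing import List

# Abbreviated METADATA; only the Requires-Dist line is quoted from the wheel.
METADATA = """\
Metadata-Version: 2.1
Name: pyqtgraph
Version: 0.11.1
Requires-Dist: numpy (>=1.8.0)
"""


def requires_dist(metadata_text: str) -> List[str]:
    """Return all Requires-Dist entries from a METADATA document."""
    message = Parser().parsestr(metadata_text)
    return message.get_all("Requires-Dist") or []


requirements = requires_dist(METADATA)
```

The only entry is the numpy requirement. Nothing in this metadata tells a resolver which PySide versions work, which is exactly the gap discussed next.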

I tested with which PySide version said pyqtgraph-0.11.1 works fine (for my specific tasks) and wrote the following email to the developer:

Dear Mr. Campagnola,
 
 
thank you very much for pyqtgraph - I enjoy your excellent work very much! I hope the following information can help you.
 
PyQtGraph (0.10.0 as well as 0.11.1) seems to be incompatible with PySide 1.2.4 (or the other way round:) - it looks like signals are broken in this combination; interacting with plots for example does not work (completely/properly) - please have a look at the attached screenshot Combination-1* and ErrorExample_Combi-1* (ignore the green lines, which are commented-out by '#').
 
PyQtGraph 0.10.0 with PySide 1.2.2 seems to work properly (screenshot Combination-3*)!
 
But PyQtGraph 0.11.1 with the same PySide 1.2.2 does not - there seems to be an incompatibility with the default casting policy, in particular in ScatterPlots - please have a look at screenshots Combination-2* and ErrorExample_Combi2* for that.
 
By the way, the combination 0.11.1 with 1.2.4 shows BOTH aforementioned error types - so they seem to be somehow independent.
 
These tests were made with a Windows 7 system, using the 'official' pypi.org downloads. If you have questions, please do not hesitate to ask:)
 
 
Best Regards
Dominik

I did not get any answer, nor can I find said information anywhere in said package or anywhere else on the internet. So I must assume that my information is so far lost to the community, right?

So, my first point is: the information in the requirement files from the developers can obviously be incomplete or wrong; the quality of said files obviously depends on whether / how much the developer cares.

The normal reaction, when PySide 1.2.2 is involved in a discussion is: we are talking about an outdated package - why don’t you take a wrapper for Qt6?

Shortened answer: the license resp. terms and conditions have changed significantly:

So, my second point is: there are other arguments for and against package versions which cannot (at least not easily) be found in the requirements files of packages - such as license considerations.

Fridolin has named some more:

So, the best version for a specific purpose might not simply be the latest version of a package that its developer deems fitting or generally the best.

We all know that the weakness of security systems most often is not the system, but the human operating said system.

It is the same with the requirements system.

Artificial intelligence is a way to fill the gaps that humans leave open. In the next generation, purely deterministic solutions might be obsolete…

Let us welcome artificial intelligence as a way to improve/extend deterministic algorithms (not to replace them!)!

hooray

Cheers, Dominik

PS: @Fridolin: I am just a little sad that it will not serve the Windows party…

fridex · May 25, 2021, 9:45pm

We use thoth-solver, which was introduced in another topic. The tool can extract package metadata that is complete with respect to dependencies at the given point in time. Another component (we call it revsolver) then makes sure the data is in sync and up to date. thoth-solver runs in specific restricted container environments to aggregate information about packages (the containers are subsequently thrown away). The dependency information is then specific to the container environment in which thoth-solver was run.

package-extract is another component in Thoth; its focus is on analyzing the content of container images, not related to the dependency case discussed.

We check the already existing lock file in the project that the user uses (if any) and recommend a better one only if a better one is found.

Not yet, but we plan to release a limited dump to the community (even the limited dump is a few GiB):

github.com/thoth-station/datasets

Provide database dump to the community

opened 09:25PM - 10 May 21 UTC

closed 03:42PM - 01 Feb 22 UTC

fridex

kind/feature priority/important-soon triage/accepted

**Is your feature request related to a problem? Please describe.**
As an external Thoth contributor, I would like to set up my environment so that I can develop components of Thoth.

**Describe the solution you'd like**
Provide a minimal database dump that can be used to test changes in components.
- [ ] make sure the dump does not hold any sensitive information (e.g. GitHub repos we maintain)
- [ ] remove all solver results, except for rhel:8+py38

**Acceptance Criteria**
- [ ] database dump is provided in a public bucket, so that it can be accessed by anyone
- [ ] instructions how to access the public bucket are present in the thoth-station/storages and thoth-station/datasets README files

Blackward:

PyQtGraph (0.10.0 as well as 0.11.1) seems to be incompatible with PySide 1.2.4 (or the other way round:) - it looks like signals are broken in this combination; interacting with plots for example does not work (completely/properly) - please have a look at the attached screenshot Combination-1* and ErrorExample_Combi-1* (ignore the green lines, which are commented-out by '#').

This is an example of an issue we can tell the resolver to avoid by adjusting the prescriptions mentioned earlier. This way, the knowledge about the mentioned package is not lost, and the Python community could benefit from such an aggregated database of package issues.
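
As a rough sketch of what such recorded knowledge could look like once captured (in plain Python rather than the YAML prescription format, whose schema is not shown in this thread), the combinations reported as broken in the e-mail quoted earlier can be stored as data and checked against a resolved stack. The names `KNOWN_BAD` and `violations` are invented, and package names are lower-cased for simplicity.

```python
from typing import Dict, FrozenSet, List, Tuple

Pin = Tuple[str, str]  # (name, version)

# Combinations reported as broken in the e-mail quoted earlier in the thread.
KNOWN_BAD: List[FrozenSet[Pin]] = [
    frozenset({("pyqtgraph", "0.10.0"), ("pyside", "1.2.4")}),
    frozenset({("pyqtgraph", "0.11.1"), ("pyside", "1.2.4")}),
    frozenset({("pyqtgraph", "0.11.1"), ("pyside", "1.2.2")}),
]


def violations(stack: Dict[str, str]) -> List[FrozenSet[Pin]]:
    """Return every known-bad combination fully contained in the stack."""
    pins = set(stack.items())
    return [combo for combo in KNOWN_BAD if combo <= pins]
```

A resolver consulting such a list would reject or down-score any stack for which `violations` is non-empty, while the combination reported as working (pyqtgraph 0.10.0 with PySide 1.2.2) passes.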

fridex · November 26, 2021, 2:31pm

In case you are interested in this topic, we published an article explaining how the resolver works and what the benefits of using it are. Unlike pip’s current resolver, which implements resolution using backtracking, the resolver learns the correct resolution path and comes up with a lock file of the desired quality (so it can, for example, update or downgrade dependencies based on vulnerabilities). More info can be found at: