Drawing a line to the scope of Python packaging

uranusjr · February 13, 2019, 9:11am

Another topic in the Big Picture thread I found interesting

This problem is also not specific to data science (and Python) IMO. In web, Django packages provide front end stuff, and how should they be managed together with npm/yarn and various precompilers? In GUI land we got Qt plugins; PyQt (and PySide IIRC) bundle Qt in the module, but why can’t I link them with my already-installed Qt instead to save hundreds of MB (for each venv), like rust-qt? People in those areas seem to feel satisfied enough with pip, but I feel it is legitimate to draw parallels to the data science world.

In the end, every domain-specific package manager would need to draw a line somewhere to keep itself off the slippery slope of dependencies. The question is, then, where and how should the line be decided?

takluyver · February 13, 2019, 10:43am

Interesting question. Technically, the examples you mention are in two distinct directions:

Web frontend stuff hasn’t needed any special affordances in packaging, as far as I know. It’s just opaque data files from our perspective, which are valuable for all sorts of purposes. If it ever did, I’d be inclined to say no: JS has various package managers already which solve broadly similar problems to Python tooling. I’m sympathetic to tool fatigue in frontend development, but I don’t think it’s reasonable to ask the Python packaging ecosystem for extra work to avoid the JS ecosystem.

C and C++ (both for data science and GUIs) are a different story, for two reasons. First, the C API and the ease of using extension modules have always been a strength of CPython (the reference implementation and most widely used Python interpreter), and it’s crucial to be able to effectively distribute extension modules. Second, C/C++ doesn’t have a generally accepted standard package manager of its own. Package managers like apt and homebrew aren’t easy to integrate with, because they’re designed to install packages systemwide rather than for a specific environment.

Conda is the exception here. It looks like the holy grail of packaging: a cross-language, cross-platform package manager which knows about environments. The reluctance I’ve seen to use it comes from two angles: the perception that it’s for data science (somewhat self-reinforcing, as general purpose libraries may not be published for conda) and concerns about its tight connection to Anaconda, Inc. I have enough sympathy with this that I think the main Python packaging ecosystem should continue providing a practical alternative, not just point to conda for anything difficult.

I haven’t worked out exactly where the lines should be, but that rationale explains why I think they should be drawn further out in one direction than in another.

steve.dower · February 13, 2019, 2:32pm

Let’s not dive too deep down this hole ourselves. @willingc and @pzwang are both interested/actively looking at this area for all of Python, including packaging, and have experience with the various models used for it (e.g. Personas, customer development, etc.). This is a great opportunity to figure it out for all of the things we do, and having relative outsiders (from packaging) make the start is going to negate a lot of our biases.

dstufft · February 13, 2019, 2:43pm

Or introduce different biases.

steve.dower · February 13, 2019, 2:58pm

True, but since they can’t actually force us to do anything, another point of view won’t hurt.

dstufft · February 13, 2019, 5:56pm

Sure, different PoVs are fine, as long as we don’t pretend they’re not just differently biased, because if we pretend they’re going to somehow undo our own biases by being neutral, then we’re likely to just end up with a poor result.

willingc · February 13, 2019, 7:26pm

I am so glad to see @takluyver on this thread. He understands packaging so well from both a pip and conda perspective. He also has the respect of conda and pip maintainers. I hope you keep adding your insights.

teoliphant · April 6, 2019, 8:18am

Thanks for speaking up and discussing the issues that people might have with conda. I have been really confused as to why people don’t use and/or recommend conda (or other system packaging solutions like spack) as opposed to trying to make pip into a general packaging solution.

If I understand correctly, the two concerns are 1) conda is only for data-science, and 2) conda is seen as controlled by a company? Are there other concerns?

You mention that you have sympathy with these concerns. Do you have concerns with both of these or just with the latter? As far as conda only being for data-science, there are packages for many other domains provided both by Anaconda as well as the community-driven conda-forge.

Regarding the governance (or association concern), Conda is a BSD-licensed open-source project with many contributors. The governance (or better said stewardship) of conda is indeed currently held by Anaconda. I really don’t see the problem with that, though. Could you articulate what is the problem you are actually concerned with?

It seems to me that Anaconda’s current involvement with conda is actually a strength because it provides funding to ensure community-minded, and community-oriented people can work full-time to make sure conda continues to improve.

There is also a large and separate community (1200 contributors last I heard) of people that create conda packages via conda-forge. If (for some crazy reason that I cannot see happening) Anaconda ever does something to abuse its current role as steward of conda, it would be straightforward for the open and community-driven conda-forge community (which is a NumFOCUS fiscally sponsored project) to fork the conda project in order to pull control from Anaconda.

So, I really don’t understand the concern and I would welcome your help in understanding it better. I do, however, regularly see the long-term problems that organizations and people have right now because they have been led to believe pip will solve all their installation issues. From my perspective, pip is trying to solve a problem that it can’t ultimately solve without actually becoming a general-purpose packaging solution.

There are a lot of things pip helps with (developer workflow, source distributions, and getting started quickly), but it would go a long way to helping users of Python in organizations far and wide if it were emphasized that pip is not a general-purpose packaging solution and should not be used as one.

My view is that the PyPA should provide recommendations for integrations with general packaging solutions instead of trying to come up with different solutions for problems that are already solved using more general tools (like conda, spack, brew, etc.)

cjerdonek · April 6, 2019, 8:43am

Thanks a lot for chiming in! Can you elaborate on these two points so people can understand better? How are people being led to think pip will solve all their installation issues, and roughly what problems do you think pip is trying to solve that it can’t (and why)? Also, is the issue more that you think pip’s scope is off, or is it that even with a more correct scope, you think it has fundamental issues that can’t be solved?

pf_moore · April 6, 2019, 9:45am

For me, the big problem with conda is that it creates its own little “world” that’s independent of “standard” Python. So, for example, I’m unclear on how well tools like pipenv, pew, virtualenv or even pip will interact with a conda-based environment.
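
As a rough illustration of the two "worlds" in question, here is a minimal sketch of how a script might tell which kind of environment it is running in. It assumes two conventions: conda environments keep a `conda-meta` directory at their root, and venv-style environments make `sys.prefix` differ from `sys.base_prefix`. The function name `describe_environment` is made up for this example and is not part of pip, conda, or any other tool.

```python
import os
import sys


def describe_environment(prefix=None, base_prefix=None):
    """Classify an interpreter's environment (illustrative helper, not a real tool API)."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    kinds = []
    # Assumption: a conda environment has a "conda-meta" directory at its root.
    if os.path.isdir(os.path.join(prefix, "conda-meta")):
        kinds.append("conda")
    # Assumption: a venv-style environment makes sys.prefix differ from sys.base_prefix.
    if prefix != base_prefix:
        kinds.append("venv")
    return kinds or ["plain"]


if __name__ == "__main__":
    # Report on the interpreter that is running this script.
    print(describe_environment())
```

Both markers can be present at once (a venv created from a conda-provided Python), which is one small example of where the integration questions start.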

I’ve just recently had a colleague who is a relative beginner with Python, who had needed a new PC. He decided to try conda as an “all in one” solution, but then struggled because a lot of things he’d learned (about how to manage his PATH, and how to install packages) no longer applied. Ultimately he decided that learning a whole new ecosystem was not worth it, and went back to standard tools.

Yes, as above - it’s an independent ecosystem with its own rules and approaches. Integrating two ecosystems is never an easy task, so until basically “everything” is available under conda, there’s a risk of hitting some tool that you need which requires you to do that integration job.

Can you explain how you see that working? To give a concrete example, how do I use pipenv with conda? For the purposes of the example, say that the reason I want to use pipenv is that colleagues use it and there’s an amount of in-house knowledge that I want to rely on - so “use some conda replacement for pipenv” is not a realistic option.

takluyver · April 6, 2019, 9:29pm

There’s nothing that I’m aware of that technically restricts conda to data science. But its main usage is in the data science community, and that inevitably means that the packages available through conda directly skew towards data science. If you try using it in other domains, you’re more likely to end up needing pip for the long tail of less popular packages. The text on both anaconda.com and anaconda.org clearly highlights data science.

It’s a catch 22 situation: the available packages reflect what people are using it for, and the people who’ll want to use it reflect the available packages.

To be clear, I think Anaconda Inc is currently acting as a good citizen in the Python community. But companies can change - Google certainly doesn’t have the favourable impression it did 10 years ago, for instance. So while I don’t have any short term worries about conda, I wouldn’t ignore the possibility that in a few years time, what’s in Anaconda’s interest might not be in the community’s interests.

Here I must disagree. While anyone can technically make a fork of conda, maintaining a fork of a tool used by probably millions of people is a huge job, and building up the ecosystem and mindshare around it without corporate backing would be all but impossible. E.g. if the package formats diverged, the maintainers of a fork would have to try to convince packagers to build packages in their format rather than Anaconda Inc’s.

Forking a widely used project is never easy, especially for something with the kind of network effects that a package manager has.

I think discussion of this topic gets tangled up because people only consider some seismic event - your “crazy reason you cannot see happening” - where a company does something so obviously nefarious that the community rises up as one to fork its open source projects. This is almost never the way reality works. Rather, what I am afraid of is a series of less visible decisions, each of which can be justified. At every stage, a few people will complain, but others will defend the company, point to its past record, ask us to understand that a company must make money, and point out problems with the alternatives. After a while, the dissatisfied people drift away to something else, so at no point does the momentum build up to spin off a fork.

If the company doing this has spent a while beforehand sucking up all the available oxygen that would have gone to alternatives, the convenience of playing along with it will be very attractive for a long time after people stop thinking of it as unambiguously good.

PS edit: the end of this post got a bit dramatic, so I want to reiterate that I don’t think Anaconda Inc is currently doing anything like this. Hopefully it never will. But I hope it explains why I’m wary of the idea that we should limit the scope of the community maintained packaging tools because of the availability of conda.

njs · April 6, 2019, 11:33pm

I often recommend conda to folks when I think it’ll be useful to them, but I don’t use it myself, so I guess I can answer that part.

The reason is pretty simple… all the packages I need are distributed through PyPI first, and have wheels or are otherwise easy to install on my system. And, since my packages are also distributed through PyPI first, all my package metadata (e.g. dependencies) is expressed in terms of PyPI packages. So if I switched to conda, the only differences to my life would be that I’d have access to fewer packages, and they’d be more out-of-date, and I’d have to maintain two copies of my metadata (one for conda and one for PyPI).

These are inevitable downsides of conda acting as a middle-man. There are also lots of advantages to having a middle-man doing integration testing etc., but those are solving problems that I don’t personally have, so for me it’s all downsides.

You should also check out this post I wrote that’s ostensibly about lockfiles and workflow tools, but has a long digression about how pip and conda interrelate, why the two sides aren’t good at talking to each other, and how that could be fixed:

teoliphant · April 8, 2019, 5:06am

Thanks for the feedback. I really appreciate the dialogue. I should mention that I have been largely using and promoting the conda ecosystem since 2012, when Guido told me directly at the very first PyData conference that Python would not solve the packaging problem for the scientific ecosystem and that we would have to solve it ourselves. This is why we wrote conda in the first place (Continuum Analytics, which later became Anaconda, was not actually started to create conda). The conda system solved all the problems I was having with packaging when I asked Guido about it: I was able to ship easily and reproducibly things like scikit-learn with different versions of MKL or OpenBLAS, Numba with LLVM, NumPy, and all the other packages I used to really struggle with shipping to people reliably.

However, with the gains that “pip install” is making (and it truly has made amazing progress which I admire) — I’ve seen more people start to have problems installing these more complex collections of software. Worse, I see large companies recommend “pip install tensorflow” or “pip install pytorch” because they think it’s the only “standard” way to install Python things — and then struggle when the manylinux standard doesn’t quite work or multiple versions of software get installed, and then the same questions we used to see about installation show up again.

So, part of the reason I’m chiming in is that it’s rather sad to see people trying to solve, all over again, problems that have already been solved, with the solutions available in open-source communities.

Now, that’s not to say that all packaging problems have been solved. There is much that can be done to have “pip install” work better with general-purpose packaging solutions, but you have to first acknowledge that such a thing is a worthwhile goal, then go about doing it with a few of them, and then tell users to use those system package managers as the “community-recommended” approach.

Note that I don’t work on conda anymore, nor am I working at Anaconda. I participate as a member of that community, but don’t have any authority there. (I do still own shares in the company and am on the board, but I don’t speak for the company.) My purpose in jumping in here is to understand better and perhaps help if I can.

Again, my main concern is that I don’t see how a language-specific package manager can ever entirely solve the problem of installation. I think that is fundamental. Yet, I don’t see that argument being discussed enough. It seems that people just keep trying to make “pip install” work in all cases and with all situations, and I think this will never work until you decide to make “pip” a general-purpose package manager (i.e. use it to install R and Java and anything else, which is another can of worms that we “bit off” with conda). This is also not unique to Python. The Go community, the Rust community, the Julia community, and many other communities all try to figure out packaging from their perspective and end up in the same abyss if they have any notion of interoperability (well, Go is actually easier because it’s self-contained and so they don’t have the shared-library problem).

Comments about users “going back to standard tools” after they try conda prove my point. When the language-specific package manager is viewed as the only standard to consider, that is the problem. The PyPA should acknowledge that other standards exist and may in fact be what the user needs, rather than trying to solve all packaging problems using pip.

The PyPA is discussing packaging in an environment of other packaging systems. Python’s ability to glue other languages together means that at least some Python users will have to interact with installing tools written with other languages — and thus work with general packaging.

So, encouraging everyone to just “pip install” to get whatever they might need in Python land is eventually a disservice. This is my main complaint. In general, pip should acknowledge it lives in a world where other libraries may need to be installed before “pip install” even works — instead of trying to encourage users to bundle those things in as “data”.

On the other hand, I would say that general-purpose package managers (including conda) don’t put enough effort into how they integrate with the development experience (not just the deployment experience).

I have not kept up with all of the tools that have shown up around pip (largely because by using conda I have never found a need for them — though even as a conda user and developer I still use pip itself).

Thanks for reading this far. I appreciate time taken to respond to me as I’m trying to understand better.

teoliphant · April 8, 2019, 5:40am

I think this illustrates my point quite well. Why are “standard” tools not acknowledging that there is such a thing as a “general” package manager which may have different approaches that will help the user? In other words, why is Python documentation telling new users how to manage their PATH and how to install all packages without recognizing that some packages are simply not installable as only Python?

We will never solve “the packaging problem” until we acknowledge the different facets of that problem. Are pip and virtualenv tools for new users or experienced users? Are they for both? I actually don’t really know what pipenv or pew even do, as I’ve never needed them.

I agree with your point: conda is a separate packaging solution. There are others as well and they will all have their approaches. Which one should someone use? It really depends on their use-case. But this is what the PyPA should acknowledge. I believe the PyPA should focus on source-distributions and helping developers with integration (and binary installs should have a way to depend on things already installed that it does not manage). Leave production and new users to other systems (like conda, but there are and can be others).

Unless the PyPA is going to create a general-purpose (OS-style) packaging solution, then it will not be able to solve all the problems of packaging. Yet because it has the blessing of authority and ownership of the documentation pages, many new people come to the Python ecosystem and believe “pip install” is how everybody is doing it and they believe it is how to do packaging.

Right now, most of the people trying to fix packaging in Python would be better served by joining forces with conda-forge (or perhaps a similar tool like spack) to create more “community-focused” packaging.

Indeed, I think this is unavoidable. Trying to pretend that you won’t have an integration job is the disservice that “pip install all the things” does. The PyPA should bless a few additional general-purpose packaging solutions and then people will work to improve those.

I don’t agree with the framing. I would say why wasn’t pipenv told to explain how it would work with conda before it was accepted as a standard PyPA solution? The danger of having a PyPA is that you can over-specify solutions. If you don’t have consensus across computing as to how to package things, then the PyPA should be very careful about what it encourages.

I don’t have a concrete suggestion with respect to pipenv at this point and would have to study pipenv in more detail to understand what it actually does (like I said, I’ve never had a need for it).

My suggestions right now are: 1) Strongly discourage “vendored” libraries in wheels and instead have a way for pip to check for the existence of libraries that the wheel might need. 2) Message to users that “pip install” is not especially suited for installing large complex software with many non-Python components.
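
To make suggestion (1) a little more concrete, here is a minimal sketch of what "checking for the existence of libraries" could look like using only the standard library. This is not something pip does today, and the helper name `missing_libraries` is invented for the example. It relies on `ctypes.util.find_library`, whose behaviour is platform-dependent and which says nothing about library versions or ABI compatibility, so treat it as an illustration of the idea and not a proposal.

```python
from ctypes.util import find_library


def missing_libraries(names, finder=find_library):
    """Return the library names that the given lookup function cannot locate.

    `finder` defaults to ctypes.util.find_library; it is a parameter only so the
    lookup can be swapped out (hypothetical design choice for this sketch).
    """
    return [name for name in names if finder(name) is None]


if __name__ == "__main__":
    # "ssl" and "z" are example names only; the result varies by platform.
    print(missing_libraries(["ssl", "z", "surely-not-a-real-library"]))
```

Even in this toy form, the check only answers "is something with this name findable?", which is a much weaker guarantee than the one a bundled library in a wheel provides.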

teoliphant · April 8, 2019, 5:45am

Thanks for the explanation. I can definitely see that as a developer, conda may not help you, and pip has gotten to the point where it works for you. That sounds like a win.

Other users, of course, aren’t developers, and the fact that every documentation page tells people to “pip install” (even if they aren’t developers or are working in a system that needs what conda is providing), is the problem.

Thanks for providing the link. I had not seen that — it looks interesting. There surely are things that can be done to improve interop and I think that is a good place to work.

takluyver · April 8, 2019, 7:37am

I don’t see pip install as any more ‘developer’ centric than conda install. Conda will let you install some things that pip won’t, and pip will let you install some things that conda won’t.

pf_moore · April 8, 2019, 8:31am

I see your point, but you could equally say why wasn’t pipenv told to explain how it would work with Ubuntu’s packager, Fedora’s, Arch’s, with ActiveState’s package manager, etc. Multiply that by all the Python tools out there and you have a combinatoric explosion.

You could just as easily reverse the argument, and put the responsibility on the packaging solutions (that have already taken on a role of curation and integration) to ensure that they work with their chosen set of tools. And note that qualification - if conda (or Ubuntu, or…) doesn’t want to do that integration job, all they have to do is say “we don’t support using pipenv with our packager”. What frustrates me is that it seems like you’re suggesting that conda solves all the packaging problems, but then when faced with one (how to work with pipenv) that you don’t solve, you say that’s pipenv’s problem.

On pip’s part (or the PyPA’s, frame it how you want) pipenv works with existing Python tools (pip, virtualenv, etc) so the integration issue is simple for us. Conda replaces pip and virtualenv, though, so the problem lies there. If you want to argue that “pip and virtualenv¹ shouldn’t have special status”, then that’s a whole different debate (which should probably be a different thread, so I won’t go into it here).

Or to put it another way, the PyPA standards effort is hoping to avoid the combinatoric explosion problem by providing ways for tools to work together, without needing to dump the whole problem onto one party to solve for everyone. We’ve yet to get much inroad into the conda (or Linux distribution) ecosystems, because few people in those communities are participating, but that’s where I see progress being made here - not by different solutions trying to “stake off” areas where they want to be the only contender. (For example, we need to work out how to get user installs to work cleanly with system package managers on Linux; see pypa/pip issue #1668, “Default to --user”.)

¹ Technically, it’s venv that has special status, not virtualenv, but that’s a migration that’s still in progress, and won’t really be possible until Python 2 is no longer relevant.

With regard to (1), how do you propose that pip do that? That’s essentially the problem that everyone has to reinvent. How does conda check for the existence of shared libraries (on Windows, for example)? The answer (as I understand it) is that it doesn’t; rather, it ships its own infrastructure for managing shared libraries, which you seem to be expecting pip to know about. What about other managers, like Chocolatey? Everyone can say “use our tool” as a means of reducing complexity, and it really does - but only for the people willing to use that tool. For better or worse, PyPA is trying to support people for whom the existing integrated tools don’t work.

And your point (2) hits on this issue. You say that “pip install is not especially suited for installing large complex software”. OK, let’s take a specific example, tensorflow (which you mentioned in your post). I’d contend that the message to use a packaged solution like conda when it’s appropriate is out there - we could maybe push it harder, but saying that we don’t do anything is incorrect. So the question is, has the conda community looked at why some people who want to use tensorflow aren’t using conda? Because if those people have genuine reasons for not wanting to use conda, then they are left unsupported unless PyPA tries to offer even a partial solution for them.

In my own case, for example, I don’t use tensorflow, but I do make occasional use of various data science packages, and there are occasions when I’d have been interested to experiment with tensorflow. But I have a significant investment in the pip/virtualenv/pipenv toolsets, and there’s no practical way I would learn a whole new, fundamentally different toolset like conda, just to try out a possibly useful package. Are you saying that the Python packaging community should say that I’m out of scope and we don’t offer any solution for me? (Hint: As a member of the packaging community, I’m not going to agree to that.)

Let me reframe the question. Take someone who wants to “learn Python” (this is a precise description of a group of people at my work, so it’s not a theoretical example). Maybe they don’t yet have a particular use case in mind - automation, data analysis, database work are all possible and likely. They don’t think of “learning Conda” as that doesn’t appear on the python.org front page, so they download Python and start investing their time in learning. But there’s a lot of articles out there about data science, and it’s a hot topic within the company, so they want to do some data analysis on existing information. So they get involved in jupyter, numpy and pandas. At what point in that process should they have been directed towards conda (or whatever other solution you would view as being appropriate for their needs)? Note that for jupyter, numpy and pandas, pip is still working fine - but they are starting to get to the areas where they may hit issues. If it’s “at the beginning”, then you’re suggesting that conda get front billing on python.org, which isn’t a packaging question. If not, then at some point they would need to switch from the tools they started to learn, and the question becomes when they should be advised to make that switch. Remember that these are people who don’t have a strong initial interest in data science - someone wanting to work on data science, or science in general, already gets a strong message “use conda”. And that’s why people tend to approach conda as a “data science solution”, and not as a general Python distribution.

Sorry, this became quite long. Thanks for taking the time to read it - I hope it clarifies some of the areas we’re struggling to understand each other on.

jdemeyer · April 8, 2019, 12:51pm

+1 to this. The “problem” (between quotes since it’s arguable) with pip is that it creates an expectation that every Python package should ship wheels and that packages that don’t are somehow doing it wrong. For example, https://pythonwheels.com/ seems to reinforce that idea. But in some cases, packages just cannot ship wheels. One important reason, as described by Travis Oliphant, is having non-Python dependencies. pynativelib would help here, but I don’t know if that proposal is going anywhere.
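
For context on why non-Python dependencies make wheels hard: a wheel's filename declares the interpreter, ABI and platform it was built for, so anything compiled has to be built per platform. Below is a minimal sketch of reading those tags from a filename, assuming the usual `name-version[-build]-python-abi-platform.whl` layout. The function and the example package name `example_pkg` are made up for illustration; real tools use a proper parser.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components (illustrative only)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # Five components normally; six when the optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of components in {filename!r}")
    name, version = parts[0], parts[1]
    # The last three components are always the python, ABI and platform tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
        # A platform tag of "any" marks a wheel that is not tied to one platform.
        "platform_independent": platform_tag == "any",
    }


if __name__ == "__main__":
    # Hypothetical filenames, used only to show the two shapes.
    print(parse_wheel_filename("example_pkg-1.0-py3-none-any.whl"))
    print(parse_wheel_filename("example_pkg-1.0-cp37-cp37m-manylinux1_x86_64.whl"))
```

A package whose compiled parts depend on external shared libraries has to either bundle them into each platform-specific wheel or rely on something outside the wheel to provide them, which is the gap being discussed here.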

pf_moore · April 8, 2019, 3:02pm

My perspective is that there is a requirement from users that they can easily install packages (and from projects - it’s not like a project that can’t be installed is much use!). For better or worse, a lot of users (particularly in Windows environments¹) don’t have the tools available that are needed to build projects they want to use - and it’s not reasonable to expect them to, to be honest.

So, I don’t think it’s at all unreasonable for users to expect prebuilt packages. And let’s be honest, conda and Anaconda are proof of that: their business is built on supplying prebuilt packages.

I’m not sure there is an expectation that prebuilt packages means wheels. Rather, I think that the people complaining know that there are other options available (conda, distribution supplied packages, …) but that they are unsuitable for them - and I think we should be asking why that is, rather than simply dismissing the problem as just being one of inaccurate user expectations.

I’m not saying that there isn’t a perception that prebuilt binary = wheel, but rather I’m saying that characterising it as nothing more than a perception issue doesn’t give us a useful way of addressing it. Understanding the user requirements behind that expectation might do so - and it’s the projects that can’t ship wheels that have access to users who can answer those questions, so maybe they could do more to understand the problem and communicate it to the people designing the installation tools and standards? After all the “packaging community” by definition includes the people producing actual packages, so it’s not like their views would be unwelcome (at least I’d hope not!)

¹ I’d actually love to see a breakdown of how badly the “shared library” issue hits users, based on platform. My (biased) perception is that Windows mostly doesn’t have an issue, but Linux has huge problems. But that’s as a Windows user for whom everything just works fine, seeing bug reports from Linux users but never success reports from them. So I’m pretty sure my perspective is inaccurate - as, I suspect, would be any individual’s, so we need more objective measures to help us get data here.

willingc · April 8, 2019, 3:09pm

One thing that would be very helpful for conda would be to change the default channel to conda-forge instead of Anaconda channels. The burden of adding the community-maintained conda-forge channel should not be placed on the end user.