Another topic in the Big Picture thread I found interesting
This problem is also not specific to data science (and Python) IMO. In web, Django packages provide front end stuff, and how should they be managed together with npm/yarn and various precompilers? In GUI land we got Qt plugins; PyQt (and PySide IIRC) bundle Qt in the module, but why can't I link them with my already-installed Qt instead to save hundreds of MB (for each venv), like rust-qt? People in those areas seem to feel satisfied enough with pip, but I feel it is legitimate to draw parallels to the data science world.
In the end, every domain-specific package manager would need to draw a line somewhere to keep itself off the slippery slope of dependencies. The question is, then, where and how should the line be decided?
Interesting question. Technically, the examples you mention are in two distinct directions:
Web frontend stuff hasn't needed any special affordances in packaging, as far as I know. It's just opaque data files from our perspective, which are valuable for all sorts of purposes. If it ever did, I'd be inclined to say no: JS has various package managers already which solve broadly similar problems to Python tooling. I'm sympathetic to tool fatigue in frontend development, but I don't think it's reasonable to ask the Python packaging ecosystem for extra work to avoid the JS ecosystem.
C and C++ (both for data science and GUIs) is a different story, for two reasons. First, the C API and the ease of using extension modules have always been a strength of CPython (the reference implementation and most widely used Python interpreter), and it's crucial to be able to effectively distribute extension modules. Second, C/C++ doesn't have a generally accepted standard package manager of its own. Package managers like apt and homebrew aren't easy to integrate with, because they're designed to install packages systemwide rather than for a specific environment.
Conda is the exception here. It looks like the holy grail of packaging: a cross-language, cross-platform package manager which knows about environments. The reluctance I've seen to use it comes from two angles: the perception that it's for data science (somewhat self-reinforcing, as general purpose libraries may not be published for conda) and concerns about its tight connection to Anaconda, Inc. I have enough sympathy with this that I think the main Python packaging ecosystem should continue providing a practical alternative, not just point to conda for anything difficult.
I haven't worked out exactly where the lines should be, but that rationale explains why I think they should be drawn further out in one direction than in another.
Let's not dive too deep down this hole ourselves. @willingc and @pzwang are both interested/actively looking at this area for all of Python, including packaging, and have experience with the various models used for it (e.g. Personas, customer development, etc.). This is a great opportunity to figure it out for all of the things we do, and having relative outsiders (from packaging) make the start is going to negate a lot of our biases.
Sure, different PoVs are fine, as long as we don't pretend they're not just differently biased, because if we pretend they're going to somehow undo our own biases by being neutral, then we're likely to just end up with a poor result.
I am so glad to see @takluyver on this thread. He understands packaging so well from both a pip and conda perspective. He also has the respect of conda and pip maintainers. I hope you keep adding your insights.
Thanks for speaking up and discussing the issues that people might have with conda. I have been really confused as to why people don't use and/or recommend conda (or other system packaging solutions like spack) as opposed to trying to make pip into a general packaging solution.
If I understand correctly, the two concerns are 1) conda is only for data-science, and 2) conda is seen as controlled by a company? Are there other concerns?
You mention that you have sympathy with these concerns. Do you have concerns with both of these or just with the latter? As far as conda only being for data-science, there are packages for many other domains provided both by Anaconda as well as the community-driven conda-forge.
Regarding the governance (or association concern), Conda is a BSD-licensed open-source project with many contributors. The governance (or better said stewardship) of conda is indeed currently held by Anaconda. I really don't see the problem with that, though. Could you articulate what is the problem you are actually concerned with?
It seems to me that Anaconda's current involvement with conda is actually a strength because it provides funding to ensure community-minded, and community-oriented people can work full-time to make sure conda continues to improve.
There is also a large and separate community (1200 contributors last I heard) of people that create conda packages via conda-forge. If (for some crazy reason that I cannot see happening) Anaconda ever does something to abuse its current role as steward of conda, it would be straightforward for the open and community-driven conda-forge community (which is a NumFOCUS fiscally sponsored project) to fork the conda project in order to pull control from Anaconda.
So, I really don't understand the concern and I would welcome your help in understanding it better. I do, however, regularly see the long-term problems that organizations and people have right now because they have been led to believe pip will solve all their installation issues. From my perspective, pip is trying to solve a problem that it can't ultimately solve without actually becoming a general-purpose packaging solution.
There are a lot of things pip helps with (developer workflow, source distributions, and getting started quickly), but it would go a long way to helping users of Python in organizations far and wide if it were emphasized that pip is not a general purpose packaging solution and that people should not try to use it as one.
My view is that the PyPA should provide recommendations for integrations with general packaging solutions instead of trying to come up with different solutions for problems that are already solved using more general tools (like conda, spack, brew, etc.).
Thanks a lot for chiming in! Can you elaborate on these two points so people can understand better? How are people being led to think pip will solve all their installation issues, and roughly what problems do you think pip is trying to solve that it can't (and why)? Also, is the issue more that you think pip's scope is off, or is it that even with a more correct scope, you think it has fundamental issues that can't be solved?
For me, the big problem with conda is that it creates its own little "world" that's independent of "standard" Python. So, for example, I'm unclear on how well tools like pipenv, pew, virtualenv or even pip will interact with a conda-based environment.
I've just recently had a colleague who is a relative beginner with Python, who had needed a new PC. He decided to try conda as an "all in one" solution, but then struggled because a lot of things he'd learned (about how to manage his PATH, and how to install packages) no longer applied. Ultimately he decided that learning a whole new ecosystem was not worth it, and went back to standard tools.
Yes, as above - it's an independent ecosystem with its own rules and approaches. Integrating two ecosystems is never an easy task, so until basically "everything" is available under conda, there's a risk of hitting some tool that you need which requires you to do that integration job.
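To make the "two ecosystems" point a little more concrete, here is a minimal sketch of how a tool might tell which kind of environment it is running in. It assumes that a conda environment can be recognised by a `conda-meta` directory under its prefix, and that a venv makes `sys.prefix` differ from `sys.base_prefix`. `describe_environment` is a hypothetical helper written for this post, not something pip, pipenv or conda provides.

```python
import os
import sys


def describe_environment(prefix=None, base_prefix=None):
    """Classify an interpreter prefix as 'conda', 'venv' or 'system'.

    Both arguments default to the running interpreter's values; they are
    parameters only so the function can be tried against other prefixes.
    """
    if prefix is None:
        prefix = sys.prefix
    if base_prefix is None:
        # Assumption: a Python 3 interpreter, where sys.base_prefix exists.
        base_prefix = getattr(sys, "base_prefix", prefix)

    # Assumption: conda keeps its package records in <prefix>/conda-meta.
    if os.path.isdir(os.path.join(prefix, "conda-meta")):
        return "conda"
    # In a venv, sys.prefix points at the environment and
    # sys.base_prefix at the interpreter it was created from.
    if prefix != base_prefix:
        return "venv"
    return "system"


if __name__ == "__main__":
    print(describe_environment())
```

Even this tiny check needs a separate rule for each ecosystem, which gives a flavour of the integration job described above. Older virtualenv releases signal themselves differently again, and this sketch does not cover them.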
Can you explain how you see that working? To give a concrete example, how do I use pipenv with conda? For the purposes of the example, say that the reason I want to use pipenv is that colleagues use it and there's an amount of in-house knowledge that I want to rely on - so "use some conda replacement for pipenv" is not a realistic option.
There's nothing that I'm aware of that technically restricts conda to data science. But its main usage is in the data science community, and that inevitably means that the packages available through conda directly skew towards data science. If you try using it in other domains, you're more likely to end up needing pip for the long tail of less popular packages. The text on both anaconda.com and anaconda.org clearly highlights data science.
It's a catch 22 situation: the available packages reflect what people are using it for, and the people who'll want to use it reflect the available packages.
To be clear, I think Anaconda Inc is currently acting as a good citizen in the Python community. But companies can change - Google certainly doesn't have the favourable impression it did 10 years ago, for instance. So while I don't have any short term worries about conda, I wouldn't ignore the possibility that in a few years time, what's in Anaconda's interest might not be in the community's interests.
Here I must disagree. While anyone can technically make a fork of conda, maintaining a fork of a tool used by probably millions of people is a huge job, and building up the ecosystem and mindshare around it without corporate backing would be all but impossible. E.g. if the package formats diverged, the maintainers of a fork would have to try to convince packagers to build packages in their format rather than Anaconda Inc's.
Forking a widely used project is never easy, especially for something with the kind of network effects that a package manager has.
I think discussion of this topic gets tangled up because people only consider some seismic event - your "crazy reason you cannot see happening" - where a company does something so obviously nefarious that the community rises up as one to fork its open source projects. This is almost never the way reality works. Rather, what I am afraid of is a series of less visible decisions, each of which can be justified. At every stage, a few people will complain, but others will defend the company, point to its past record, ask us to understand that a company must make money, and point out problems with the alternatives. After a while, the dissatisfied people drift away to something else, so at no point does the momentum build up to spin off a fork.
If the company doing this has spent a while beforehand sucking up all the available oxygen that would have gone to alternatives, the convenience of playing along with it will be very attractive for a long time after people stop thinking of it as unambiguously good.
PS edit: the end of this post got a bit dramatic, so I want to reiterate that I don't think Anaconda Inc is currently doing anything like this. Hopefully it never will. But I hope it explains why I'm wary of the idea that we should limit the scope of the community maintained packaging tools because of the availability of conda.
I often recommend conda to folks when I think it'll be useful to them, but I don't use it myself, so I guess I can answer that part.
The reason is pretty simple... all the packages I need are distributed through PyPI first, and have wheels or are otherwise easy to install on my system. And, since my packages are also distributed through PyPI first, all my package metadata (e.g. dependencies) is expressed in terms of PyPI packages. So if I switched to conda, the only differences to my life would be that I'd have access to fewer packages, and they'd be more out-of-date, and I'd have to maintain two copies of my metadata (one for conda and one for PyPI).
These are inevitable downsides of conda acting as a middle-man. There are also lots of advantages to having a middle-man doing integration testing etc., but those are solving problems that I don't personally have, so for me it's all downsides.
You should also check out this post I wrote that's ostensibly about lockfiles and workflow tools, but has a long digression about how pip and conda interrelate, why the two sides aren't good at talking to each other, and how that could be fixed:
Thanks for the feedback. I really appreciate the dialogue. I should mention that I have largely been using and promoting the conda ecosystem since 2012, when Guido told me directly at the very first PyData conference that Python would not solve the packaging problem for the scientific ecosystem and that we would have to solve it ourselves. This is why we wrote conda in the first place (Continuum Analytics, which later became Anaconda, was not actually started to create conda). The conda system solved all the problems I was having with packaging when I asked Guido about it - I was able to easily and reproducibly ship things like scikit-learn with different versions of MKL or OpenBLAS, Numba with LLVM, NumPy, and all the other packages I used to really struggle with shipping to people reliably.
However, with the gains that "pip install" is making (and it truly has made amazing progress which I admire) - I've seen more people start to have problems installing these more complex collections of software. Worse, I see large companies recommend "pip install tensorflow" or "pip install pytorch" because they think it's the only "standard" way to install Python things - and then struggle when the manylinux standard doesn't quite work or multiple versions of software get installed, and then the same questions we used to see about installation show up again.
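For anyone unsure where manylinux enters the picture: a wheel declares what it claims to be compatible with in its filename. Below is a minimal sketch that splits a wheel filename into its tags, assuming the layout from the wheel specification (PEP 427): name, version, optional build tag, Python tag, ABI tag, platform tag. `parse_wheel_filename` and the example filename are made up for illustration and are not pip's implementation.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its components (sketch based on PEP 427)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % filename)

    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel filename layout: %r" % filename)

    return {
        "name": name,
        "version": version,
        "build": build,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


if __name__ == "__main__":
    # Hypothetical filename, used only as an example.
    info = parse_wheel_filename("example_pkg-1.0-cp37-cp37m-manylinux1_x86_64.whl")
    print(info["platform_tag"])
```

The platform tag is a single coarse label. It does not list the non-Python libraries a package expects to find on the machine, which is part of why complex installs can still go wrong even when a matching wheel exists.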
So, part of the reason I'm chiming in is that it's rather sad to see people try to solve problems again that have already been solved - and available in open-source communities.
Now, that's not to say that all packaging problems have been solved. There is much that can be done to have "pip install" work better with general-purpose packaging solutions, but you have to first acknowledge that such a thing is a worthwhile goal and then go about doing it with a few of them and then telling users to use those system package managers as the "community-recommended" approach.
Note that I don't work on conda anymore nor am I working at Anaconda. I participate as a member of that community, but don't have any authority there. (I do still own shares in the company and am on the board, but I don't speak for the company). My purpose in jumping in here is to understand better and perhaps help if I can.
Again, my main concern is that I don't see how a language-specific package manager can ever entirely solve the problem of installation. I think that is fundamental. Yet I don't see that argument being discussed enough. It seems that people just keep trying to make "pip install" work in all cases and with all situations, and I think this will never work until you decide to make "pip" a general-purpose package manager (i.e. use it to install R and Java and anything else - which is another can of worms, one we "bit off" with conda). This is also not unique to Python. The Go community, the Rust community, the Julia community, and many other communities all try to figure out packaging from their perspective and end up in the same abyss if they have any notion of interoperability (well, Go is actually easier because it's self-contained and so they don't have the shared-library problem).
Comments about users "going back to standard tools" after they try conda prove my point. When the language-specific package manager is viewed as the only standard to consider, that is the problem. The PyPA should acknowledge that other standards exist and may in fact be what the user needs, rather than trying to solve all packaging problems using pip.
The PyPA is discussing packaging in an environment of other packaging systems. Python's ability to glue other languages together means that at least some Python users will have to interact with installing tools written with other languages - and thus work with general packaging.
So, encouraging everyone to just "pip install" to get whatever they might need in Python land is eventually a disservice. This is my main complaint. In general, pip should acknowledge it lives in a world where other libraries may need to be installed before "pip install" even works - instead of trying to encourage users to bundle those things in as "data".
On the other hand, I would say that general-purpose package managers (including conda) don't put enough effort into how they integrate with the development experience (not just the deployment experience).
I have not kept up with all of the tools that have shown up around pip (largely because by using conda I have never found a need for them â though even as a conda user and developer I still use pip itself).
Thanks for reading this far. I appreciate the time taken to respond to me as I'm trying to understand better.
I think this illustrates my point quite well. Why are "standard" tools not acknowledging that there is such a thing as a "general" package manager which may have different approaches that will help the user? In other words, why is Python documentation telling new users how to manage their PATH and how to install all packages without recognizing that some packages are simply not installable as only Python?
We will never solve "the packaging problem" until we acknowledge the different facets of that problem. Are pip and virtualenv tools for new users or experienced users? Are they for both? I actually don't really know what pipenv or pew even do as I've never needed them.
I agree with your point: conda is a separate packaging solution. There are others as well and they will all have their approaches. Which one should someone use? It really depends on their use-case. But this is what the PyPA should acknowledge. I believe the PyPA should focus on source-distributions and helping developers with integration (and binary installs should have a way to depend on things already installed that it does not manage). Leave production and new users to other systems (like conda but there are and can be others).
Unless the PyPA is going to create a general-purpose (OS-style) packaging solution, it will not be able to solve all the problems of packaging. Yet because it has the blessing of authority and ownership of the documentation pages, many new people come to the Python ecosystem and believe "pip install" is how everybody is doing it and they believe it is how to do packaging.
Right now, most of the people trying to fix packaging in Python would be better served by joining forces with conda-forge (or perhaps a similar tool like spack) to perhaps create more "community-focused" packaging.
Indeed, I think this is unavoidable. Trying to pretend that you won't have an integration job is the disservice that "pip install all the things" does. The PyPA should bless a few additional general-purpose packaging solutions and then people will work to improve those.
I don't agree with the framing. I would say: why wasn't pipenv told to explain how it would work with conda before it was accepted as a standard PyPA solution? The danger of having a PyPA is that you can over-specify solutions. If you don't have consensus across computing as to how to package things, then the PyPA should be very careful about what it encourages.
I don't have a concrete suggestion with respect to pipenv at this point and would have to study pipenv in more detail to understand what it actually does (like I said, I've never had a need for it).
My suggestions right now are: 1) Strongly discourage "vendored" libraries in wheels and instead have a way for pip to check for the existence of libraries that the wheel might need. 2) Message to users that "pip install" is not especially suited for installing large complex software with many non-Python components.
Thanks for the explanation. I can definitely see that as a developer, conda may not help you, and pip has gotten to the point where it works for you. That sounds like a win.
Other users, of course, aren't developers, and the fact that every documentation page tells people to "pip install" (even if they aren't developers or are working in a system that needs what conda is providing), is the problem.
Thanks for providing the link. I had not seen that - it looks interesting. There surely are things that can be done to improve interop and I think that is a good place to work.
I don't see pip install as any more "developer" centric than conda install. Conda will let you install some things that pip won't, and pip will let you install some things that conda won't.
I see your point, but you could equally say: why wasn't pipenv told to explain how it would work with Ubuntu's packager, Fedora's, Arch's, with ActiveState's package manager, etc.? Multiply that by all the Python tools out there and you have a combinatoric explosion.
You could just as easily reverse the argument, and put the responsibility on the packaging solutions (that have already taken on a role of curation and integration) to ensure that they work with their chosen set of tools. And note that qualification - if conda (or Ubuntu, or...) doesn't want to do that integration job, all they have to do is say "we don't support using pipenv with our packager". What frustrates me is that it seems like you're suggesting that conda solves all the packaging problems, but then when faced with one (how to work with pipenv) that you don't solve, you say that's pipenv's problem.
On pip's part (or the PyPA's, frame it how you want) pipenv works with existing Python tools (pip, virtualenv, etc) so the integration issue is simple for us. Conda replaces pip and virtualenv, though, so the problem lies there. If you want to argue that "pip and virtualenv [1] shouldn't have special status", then that's a whole different debate (which should probably be a different thread, so I won't go into it here).
Or to put it another way, the PyPA standards effort is hoping to avoid the combinatoric explosion problem by providing ways for tools to work together, without needing to dump the whole problem onto one party to solve for everyone. We've yet to get much inroad into the conda (or Linux distribution) ecosystems, because few people in those communities are participating, but that's where I see progress being made here - not by different solutions trying to "stake off" areas where they want to be the only contender. (For example, we need to work out how to get user installs to work cleanly with system package managers on Linux - see Default to --user · Issue #1668 · pypa/pip · GitHub)
[1] Technically, it's venv that has special status, not virtualenv, but that's a migration that's still in progress, and won't really be possible until Python 2 is no longer relevant.
With regard to (1), how do you propose that pip do that? That's essentially the problem that everyone has to reinvent. How does conda check for the existence of shared libraries (on Windows, for example)? The answer (as I understand it) is that it doesn't; rather, it ships its own infrastructure for managing shared libraries, which you seem to be expecting pip to know about? What about other managers, like Chocolatey? Everyone can say "use our tool" as a means of reducing complexity, and it really does - but only for the people willing to use that tool. For better or worse, PyPA is trying to support people for whom the existing integrated tools don't work.
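To show how little the standard library offers here, this is a minimal sketch of the naive version of such a check, using `ctypes.util.find_library`. `missing_libraries` is a hypothetical helper, not something pip or conda does, and the library names in the usage line are placeholders.

```python
from ctypes.util import find_library


def missing_libraries(names):
    """Return the library names that find_library cannot locate.

    Each name is given without a 'lib' prefix or file suffix, e.g. 'm'
    rather than 'libm.so.6'.
    """
    return [name for name in names if find_library(name) is None]


if __name__ == "__main__":
    # Placeholder names; what is found depends entirely on the machine.
    print(missing_libraries(["m", "z", "no_such_library_for_illustration"]))
```

A lookup like this only says whether something with that name can be found by the platform's usual search. It says nothing about version or ABI compatibility, and the search rules differ per platform, which is the part everyone ends up reinventing.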
And your point (2) hits on this issue. You say that "pip install is not especially suited for installing large complex software". OK, let's take a specific example, tensorflow (which you mentioned in your post). I'd contend that the message to use a packaged solution like conda when it's appropriate is out there - we could maybe push it harder, but saying that we don't do anything is incorrect. So the question is, has the conda community looked at why some people who want to use tensorflow aren't using conda? Because if those people have genuine reasons for not wanting to use conda, then they are left unsupported unless PyPA tries to offer even a partial solution for them.
In my own case, for example, I don't use tensorflow, but I do make occasional use of various data science packages, and there are occasions when I'd have been interested to experiment with tensorflow. But I have a significant investment in the pip/virtualenv/pipenv toolsets, and there's no practical way I would learn a whole new, fundamentally different toolset like conda, just to try out a possibly useful package. Are you saying that the Python packaging community should say that I'm out of scope and we don't offer any solution for me? (Hint: As a member of the packaging community, I'm not going to agree to that.)
Let me reframe the question. Take someone who wants to "learn Python" (this is a precise description of a group of people at my work, so it's not a theoretical example). Maybe they don't yet have a particular use case in mind - automation, data analysis, database work are all possible and likely. They don't think of "learning Conda" as that doesn't appear on the python.org front page, so they download Python and start investing their time in learning. But there's a lot of articles out there about data science, and it's a hot topic within the company, so they want to do some data analysis on existing information. So they get involved in jupyter, numpy and pandas. At what point in that process should they have been directed towards conda (or whatever other solution you would view as being appropriate for their needs)? Note that for jupyter, numpy and pandas, pip is still working fine - but they are starting to get to the areas where they may hit issues. If it's "at the beginning", then you're suggesting that conda get front billing on python.org, which isn't a packaging question. If not, then at some point they would need to switch from the tools they started to learn, and the question becomes when they should be advised to make that switch. Remember that these are people who don't have a strong initial interest in data science - someone wanting to work on data science, or science in general, already gets a strong message "use conda". And that's why people tend to approach conda as a "data science solution", and not as a general Python distribution.
Sorry, this became quite long. Thanks for taking the time to read it - I hope it clarifies some of the areas we're struggling to understand each other on.
+1 to this. The "problem" (between quotes since it's arguable) with pip is that it creates an expectation that every Python package should ship wheels and that packages that don't are somehow doing it wrong. For example, https://pythonwheels.com/ seems to reinforce that idea. But in some cases, packages just cannot ship wheels. One important reason, as described by Travis Oliphant, is having non-Python dependencies. pynativelib would help here, but I don't know if that proposal is going anywhere.
My perspective is that there is a requirement from users that they can easily install packages (and from projects - it's not like a project that can't be installed is much use!). For better or worse, a lot of users (particularly in Windows environments [1]) don't have the tools available that are needed to build projects they want to use - and it's not reasonable to expect them to, to be honest.
So, I don't think it's at all unreasonable for users to expect prebuilt packages. And let's be honest, conda and Anaconda are proof of that: their business is built on supplying prebuilt packages.
I'm not sure there is an expectation that prebuilt packages means wheels. Rather, I think that the people complaining know that there are other options available (conda, distribution supplied packages, ...) but that they are unsuitable for them - and I think we should be asking why that is, rather than simply dismissing the problem as just being one of inaccurate user expectations.
I'm not saying that there isn't a perception that prebuilt binary = wheel, but rather I'm saying that characterising it as nothing more than a perception issue doesn't give us a useful way of addressing it. Understanding the user requirements behind that expectation might do so - and it's the projects that can't ship wheels that have access to users who can answer those questions, so maybe they could do more to understand the problem and communicate it to the people designing the installation tools and standards? After all, the "packaging community" by definition includes the people producing actual packages, so it's not like their views would be unwelcome (at least I'd hope not!)
[1] I'd actually love to see a breakdown of how badly the "shared library" issue hits users, based on platform. My (biased) perception is that Windows mostly doesn't have an issue, but Linux has huge problems. But that's as a Windows user for whom everything just works fine, seeing bug reports from Linux users but never success reports from them. So I'm pretty sure my perspective is inaccurate - as, I suspect, would be any individual's, so we need more objective measures to help us get data here.
One thing that would be very helpful for conda would be to change the default channel to conda-forge instead of Anaconda channels. The burden of adding the community-maintained conda-forge channel should not be placed on the end user.