I’ve been following and working on the search portion of PyPI for some time and noticed it’s a part of Warehouse that generates some frustration among users.
Some time ago I opened a meta issue to try to establish some use cases and narrow down what it is that users expect from searching on PyPI. I thought it would be good to open up this discussion to a wider audience.
Here are the use cases I distilled from the different issues in Warehouse:
Project name searches: users who have a vague recollection of the name of a package or want to make sure of the spelling before installing. I believe this is the main use case for pip search (#5506).
Solution searches: users who would like to know what the best package is for a particular task. This could be covered by a “popularity” metric; however, it’s hard to get right, as a lot of aggregated community wisdom is just not reflected in project metadata: names, classifiers and descriptions are sometimes lacking or misleading. (#3932, #3860)
Meta searches: users who would like to explore the project ecosystem based on project metadata such as interpreter versions, license, contributors, etc. (#727, #1971)
The goal would be to define these (or more) use cases and a set of requirements for each, so that we can start creating issues in Warehouse.
Personally, I only ever really want project name searches, and typically I either know the full project name, and just want to go to the project page, or I know a partial name and want to confirm the full actual name.
However, it’s possible that I’m “too close to the problem”, and because I know the limitations of the current search feature I’ve simply never tried it for anything more complex. But I do tend to head straight to Google for broader searches - e.g., “python library for extracting images from pdf” - mainly because if there isn’t a library, it will often give you useful references anyway.
Functionality search: users that are looking for a specific, precise functionality. For example “parsing ISO datetimes” or “RTMP protocol decoder” or “SAT solver”.
I think it’s worth mentioning that a lot of users assume packages with names similar or equal to a particular technology or service are somehow reserved for the “best package” for it.
This, unfortunately, is not necessarily the case anymore. One very flagrant case is aws while others are more subtle. Recently there was a PEP 541 case for grpc versus grpcio which was resolved but unfortunately not updated.
Regarding use-case 2, I really like the “ecosystem” section of the wikis in Marshmallow projects. It allows users to see a list of publicly-available functionality, and developers can update the list themselves.
Unfortunately, I can’t think of a way to integrate this intuitively with PyPI’s search itself.
+1 for functionality search - this is by far the most frequent and important search I do. When I search by name, I just use Google, since Google search is in my browser address bar, e.g. pypi bokeh
New user here, so I can only insert 2 links into my post, so apologies for the code-formatted links.
Speaking on behalf of the Dash community, we’d like to use PyPI search to better understand which PyPI packages are published by Dash community members. Here is our imperfect system right now:
We added a Dash framework classifier https://github.com/pypa/warehouse/issues/6273 and included that framework classifier in our cookiecutter plugin https://github.com/plotly/dash-component-boilerplate/pull/92 so that any Dash component packages / plugins created after October 28, 2019 will be searchable on PyPI: https://pypi.org/search/?q=&o=&c=Framework+%3A%3A+Dash. There are currently 49 projects here. This works pretty well, but it:
Excludes components that were created before October 28, 2019. Component authors need to opt in by adding the classifier to their setup.py
Excludes packages or libraries that don’t start from the cookiecutter, by authors who didn’t take the time to look up framework classifiers
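For reference, the opt-in boils down to one trove classifier in setup.py, and the search link above is that classifier URL-encoded into the c query parameter. A minimal sketch using only the standard library; the helper name classifier_search_url is made up for illustration and is not part of any PyPI API:

```python
from urllib.parse import urlencode

# The classifier a component author adds in setup.py, e.g.
#     classifiers=["Framework :: Dash"]
DASH_CLASSIFIER = "Framework :: Dash"


def classifier_search_url(classifier: str) -> str:
    """Build a pypi.org search URL filtered by a single trove classifier.

    Hypothetical helper: it only reproduces the query string of the link
    quoted in the post (empty q and o, classifier in c).
    """
    query = urlencode({"q": "", "o": "", "c": classifier})
    return f"https://pypi.org/search/?{query}"


print(classifier_search_url(DASH_CLASSIFIER))
```

Swapping in another classifier string gives the equivalent listing for a different framework, with the same opt-in limitation.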
FWIW, we face the same issue with GitHub project topics since it’s opt-in: https://github.com/topics/plotly-dash
In Dash-land, we had a convention early on to prefix our libraries with dash-, e.g. dash-core-components, dash-html-components, dash-renderer, dash-table. It seems this implicit naming convention has been adopted by some community members, and there are many more (over 1,000) packages on PyPI that start with "dash-" and seem related to our project: https://pypi.org/search/?q=dash-&o=. dash is a common name, though, so not all of these packages are related to Plotly Dash. It’s relatively easy for me to determine whether a package is related to Plotly Dash by reading the descriptions in the results.
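The manual check described here (name starts with dash-, description mentions Plotly Dash) could be roughly automated along these lines. This is only a sketch: the function name, the keyword list and the sample summaries are all assumptions for illustration, not real PyPI data, and a keyword match is a much cruder signal than a human reading the description.

```python
def likely_dash_packages(packages, prefix="dash-", keywords=("plotly", "dash component")):
    """Return names that carry the prefix AND mention a keyword in their summary.

    `packages` is an iterable of (name, summary) pairs. Both the prefix and
    the keyword list are assumptions; tune them to taste.
    """
    hits = []
    for name, summary in packages:
        if not name.lower().startswith(prefix):
            continue  # naming convention not followed
        text = summary.lower()
        if any(keyword in text for keyword in keywords):
            hits.append(name)
    return hits


# Hypothetical sample records; the summaries are invented for this sketch.
sample = [
    ("dash-table", "Table component for Plotly Dash"),
    ("dash-unrelated-example", "Punctuation helpers for dashes"),
    ("bokeh", "Interactive plots"),
]

print(likely_dash_packages(sample))
```

A filter like this still misses related packages that follow neither the naming convention nor the classifier, which is the gap described above.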
Besides searching for actual published packages, I use the search to find the GitHub repository of a package I’m interested in. For example, today I searched for dash-extensions:
I’m pretty much always interested in checking out the source code, so I immediately click on the “Homepage” when I discover one of these packages. Historically, I’ve found it a little confusing that Homepage is synonymous with GitHub repo. I think it would be nice if there was a link that explicitly said “Source Code” or “Repository”, but that’s not a big deal.
Historically, I’ve found it a little confusing that Homepage is synonymous with GitHub repo. I think it would be nice if there was a link that explicitly said “Source Code” or “Repository”, but that’s not a big deal.
“Homepage” is not synonymous with “GitHub repo”; the “Homepage” link is simply whatever URL the project author chose to best represent the project, which is often — but not always — a link to the public repository. Having the link’s label change based on the structure of the URL would lead to a confusing and inconsistent UI. If you really want a link in your project to be labelled “Source Code”, use the project_urls argument to setup() in setup.py like so:
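A minimal sketch of what such a setup.py might contain; the package name and every URL below are placeholders. The keyword arguments are kept in a plain dict here so the snippet can be read and checked without invoking setuptools:

```python
# setup.py (sketch; "example-package" and all URLs are placeholders)
SETUP_KWARGS = dict(
    name="example-package",
    version="0.1.0",
    # Rendered as the "Homepage" link on the project page.
    url="https://example.com/example-package",
    # Extra links; the dict keys are free-form labels chosen by the author.
    project_urls={
        "Source Code": "https://github.com/example/example-package",
        "Documentation": "https://example.com/example-package/docs",
    },
)

# In an actual setup.py these would be passed straight to setuptools:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
```

Because the labels in project_urls are chosen by the author, a project can expose both a “Homepage” and an explicitly named “Source Code” link.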
Transactional – Here the user wants to get to a website where there will be more interaction, e.g. buying something, downloading something, signing up or registering etc.
Informational – This is when the user is looking for a specific bit of information.
Navigational – The user is looking to reach a particular website. There’s only one likely destination that they’re looking to reach.
Google, in their human rater guidelines, call these three categories:
Do
Know
Go
That is a narrow view of what I use a search engine for, one that is imposed on us by Google’s business model of selling ads. In particular, the “informational” or “know” category can be extended much further. The words that come to my mind are “discovery” and “exploration”. The above article does extend the conversation into chatbots. Chatbots and the underlying AI / NLP tooling are too immature for the kind of things I use search engines or PyPI for, namely: unstructured, semi-structured and structured multi-document analysis and summaries. And if you think that is crazy, look at any financial application or travel search engine.
In the scope of the current PyPI, there are two use cases:
typo correction of package names
free-form text search engine
Other use cases have been mentioned, but they all fall into a much broader scope than a simple package index.
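To make the first of these concrete, typo correction over a list of known names can be sketched with the standard library alone. This is an illustration of the idea, not how Warehouse implements search; the function name, the cutoff and the tiny name list are assumptions.

```python
import difflib


def suggest_names(query, known_names, limit=3, cutoff=0.6):
    """Suggest known project names that are close to a possibly misspelled query.

    Comparison is case-insensitive. The cutoff is the minimum similarity
    ratio accepted by difflib (0.0 to 1.0); 0.6 is difflib's own default.
    """
    by_lowercase = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=limit, cutoff=cutoff
    )
    return [by_lowercase[match] for match in matches]


# A tiny stand-in for the index; a real one has hundreds of thousands of names.
names = ["bokeh", "grpcio", "marshmallow", "dash-table"]

print(suggest_names("marshmalow", names))
```

Comparing the query against every name does not scale to the full index, so a real implementation would need an index structure or the search backend's own fuzzy matching.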
If we extend the conversation into something more complex and possibly more useful, covering missing pieces not only in the Python ecosystem but also in the larger software engineering field, many ideas come to mind. The most interesting to me remains: reviews.
CodeMeta wraps the Schema.org type SoftwareApplication with additional properties and maps from {distutils, PKG-INFO} to JSON-LD, which search engines could support.
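As a rough illustration of that kind of mapping, the sketch below turns a few PKG-INFO fields into a schema.org JSON-LD document. The field-to-property table is hand-written for this example and is not the official CodeMeta crosswalk; the sample metadata is a placeholder.

```python
import json
from email.parser import HeaderParser

# Placeholder metadata in the RFC 822 style used by PKG-INFO files.
PKG_INFO = """\
Metadata-Version: 2.1
Name: example-package
Version: 0.1.0
Summary: A placeholder project used only for this sketch
Home-page: https://example.com/example-package
License: MIT
"""

# Assumed mapping from PKG-INFO fields to schema.org properties.
FIELD_MAP = {
    "Name": "name",
    "Version": "softwareVersion",
    "Summary": "description",
    "Home-page": "url",
    "License": "license",
}


def pkg_info_to_jsonld(text):
    """Convert PKG-INFO text into a schema.org SoftwareApplication dict."""
    headers = HeaderParser().parsestr(text)
    doc = {"@context": "https://schema.org", "@type": "SoftwareApplication"}
    for field, prop in FIELD_MAP.items():
        value = headers.get(field)
        if value:  # skip fields the package did not declare
            doc[prop] = value
    return doc


print(json.dumps(pkg_info_to_jsonld(PKG_INFO), indent=2))
```

Embedding a document like this in each project page is what would let general-purpose search engines pick up the package metadata in structured form.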