Use cases for search functionality in PyPI

Hello all,

I’ve been following and working on the search portion of PyPI for some time and noticed it’s a part of Warehouse that generates some frustration among users.

Some time ago I opened a meta issue to try to establish some use cases and narrow down what it is that users expect from searching in PyPI. I thought it would be good to open up this discussion to a wider audience.

Here are the use cases I distilled from the different issues in Warehouse:

  1. Project name searches: users that have a vague recollection of the name of a package or want to make sure of the spelling before installing. I believe this is the main use case for pip search (#5506).
  2. Solution searches: users that would like to know what the best package is for a particular task. This could be covered by a “popularity” metric; however, it’s hard to get right, as there’s a lot of aggregated community wisdom that is just not reflected in project metadata: names, classifiers and descriptions are sometimes lacking or misleading. (#3932, #3860)
  3. Meta searches: users that would like to explore the project ecosystem based on project metadata like interpreter versions, license, contributors, etc. (#727, #1971)
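
To make the first use case a bit more concrete, here is a minimal sketch of a partial-name lookup. It is not how Warehouse implements search; it only assumes the name normalization rule from PEP 503 (lowercase, with runs of "-", "_" and "." collapsed to a single "-"). The function names and the sample project list are made up for illustration.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_by_partial_name(query: str, project_names: list[str]) -> list[str]:
    """Return the projects whose normalized name contains the normalized query."""
    needle = normalize(query)
    return [name for name in project_names if needle in normalize(name)]


# Hypothetical sample of project names, not real search results.
SAMPLE_NAMES = ["scikit-learn", "Scikit_Image", "requests", "python.dateutil"]

print(find_by_partial_name("Scikit", SAMPLE_NAMES))
print(find_by_partial_name("python_date", SAMPLE_NAMES))
```

Normalizing both sides means a user who remembers "python_date" still finds "python.dateutil", which covers the "confirm the spelling" part of this use case.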

The goal would be to define these or more use cases and a set of requirements for them to start creating some issues in Warehouse.

Thanks in advance.

Personally, I only ever really want project name searches, and typically I either know the full project name, and just want to go to the project page, or I know a partial name and want to confirm the full actual name.

However, it’s possible that I’m “too close to the problem”, and because I know the limitations of the current search feature I’ve simply never tried it for anything more complex :slightly_smiling_face: But I do tend to head straight to Google for broader searches - e.g., “python library for extracting images from pdf” - mainly because if there isn’t a library, it will often give you useful references anyway.

Functionality search: users that are looking for a specific, precise functionality. For example “parsing ISO datetimes” or “RTMP protocol decoder” or “SAT solver”.

I think it’s worth mentioning that a lot of users assume packages with names similar or equal to a particular technology or service are somehow reserved for the “best package” for it.

This, unfortunately, is not necessarily the case anymore. One very flagrant case is aws while others are more subtle. Recently there was a PEP 541 case for grpc versus grpcio which was resolved but unfortunately not updated.

User @MiloslavPojman replied on Twitter with:

  1. Looking for available names for a new library.
  2. Checking correct spelling (e.g. sklearn vs. scikit-learn)

Regarding use-case 2, I really like the “ecosystem” section of the wikis in Marshmallow projects. It allows users to see a list of publicly-available functionality, and developers can update the list themselves.

Unfortunately, I can’t think of a way to integrate this intuitively with PyPI’s search itself.

+1 for functionality search - this is by far the most frequent and important search I do. When I search by name, I use Google, since it is in my browser address bar, e.g. pypi bokeh.

Hi @yeraydiazdiaz - Thanks for reaching out for feedback! Chris P here, author of Plotly Dash.

New user here, so I can only insert 2 links into my post, so apologies for the code-formatted links.

Speaking on behalf of the Dash community, we’d like to use PyPI search to better understand which PyPI packages are published by Dash community members. Here is our imperfect system right now:

  • We added a Dash framework classifier https://github.com/pypa/warehouse/issues/6273 and included that framework classifier in our cookiecutter plugin https://github.com/plotly/dash-component-boilerplate/pull/92 so that any Dash component packages / plugins created after October 28, 2019 will be searchable on PyPI: https://pypi.org/search/?q=&o=&c=Framework+%3A%3A+Dash. There are currently 49 projects here. This works pretty well, but it:
    • Excludes components that were created before October 28, 2019. Component authors need to opt in by adding the classifier to their setup.py.
    • Excludes packages or libraries that don’t start from the cookiecutter and whose authors didn’t take the time to look up framework classifiers.
    • FWIW, we face the same issue with GitHub project topics since it’s opt-in: https://github.com/topics/plotly-dash
  • In Dash-land, we had a convention early on to prefix our libraries with dash-, e.g. dash-core-components, dash-html-components, dash-renderer, dash-table. It seems this implicit naming convention has been adopted by some community members, and there are many more (over 1,000) packages on PyPI that start with "dash-" and seem related to our project: https://pypi.org/search/?q=dash-&o=. dash is a common name though, so not all of these packages are related to Plotly Dash. It’s relatively easy for me to determine whether one is related to Plotly Dash by reading the description in the results.
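
The two signals above (the framework classifier and the dash- prefix) could be combined roughly like this. It is only a sketch: it assumes metadata shaped like the "info" object of PyPI's JSON API ("name", "summary", "classifiers"), the sample records are invented, and the "plotly" keyword check is a crude stand-in for reading the description by hand.

```python
DASH_CLASSIFIER = "Framework :: Dash"


def looks_like_dash_plugin(info: dict) -> bool:
    """Heuristic: trust the classifier, else require the prefix plus a summary hint."""
    classifiers = info.get("classifiers") or []
    if DASH_CLASSIFIER in classifiers:
        return True
    name = (info.get("name") or "").lower()
    summary = (info.get("summary") or "").lower()
    # Assumption: unrelated "dash-" packages rarely mention Plotly in the summary.
    return name.startswith("dash-") and "plotly" in summary


# Hypothetical records, not real PyPI projects.
CANDIDATES = [
    {"name": "dash-widget-a", "summary": "A widget", "classifiers": ["Framework :: Dash"]},
    {"name": "dash-widget-b", "summary": "Plotly Dash component", "classifiers": []},
    {"name": "dash-cam-tools", "summary": "Tools for dashboard cameras", "classifiers": []},
]

for info in CANDIDATES:
    print(info["name"], looks_like_dash_plugin(info))
```

The classifier check stays first because it is the only signal the author opted into explicitly; the fallback will miss packages with terse summaries and still needs a human look.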

Besides searching for actual published packages, I use the search to find the GitHub repository of some package I’m interested in. For example, today I searched for dash-extensions.

I’m pretty much always interested in checking out the source code, so I immediately click on the “Homepage” when I discover one of these packages. Historically, I’ve found it a little confusing that Homepage is synonymous with GitHub repo. I think it would be nice if there was a link that explicitly said “Source Code” or “Repository”, but that’s not a big deal.

Hope this is helpful!

Historically, I’ve found it a little confusing that Homepage is synonymous with GitHub repo. I think it would be nice if there was a link that explicitly said “Source Code” or “Repository”, but that’s not a big deal.

“Homepage” is not synonymous with “GitHub repo”; the “Homepage” link is simply whatever URL the project author chose to best represent the project, which is often — but not always — a link to the public repository. Having the link’s label change based on the structure of the URL would lead to a confusing and inconsistent UI. If you really want a link in your project to be labelled “Source Code”, use the project_urls argument to setup() in setup.py like so:

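from setuptools import setup

setup(
    # ... other arguments such as name and version go here ...
    project_urls={
        # Each key is the label displayed for the link on the project page.
        "Source Code": "INSERT URL HERE",
    },
)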
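
Going the other way, a tool that wants the repository link can look for such a label before falling back to the homepage. A rough sketch, assuming metadata shaped like the "info" object of PyPI's JSON API ("project_urls", "home_page"); the label list and the set of code hosts are my own guesses, not anything PyPI defines.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed labels and hosts; projects are free to use other names.
SOURCE_LABELS = {"source", "source code", "repository", "code"}
KNOWN_FORGES = {"github.com", "gitlab.com", "bitbucket.org"}


def find_source_url(info: dict) -> Optional[str]:
    """Prefer an explicitly labelled link, then a homepage on a known code host."""
    project_urls = info.get("project_urls") or {}
    for label, url in project_urls.items():
        if label.strip().lower() in SOURCE_LABELS:
            return url
    home_page = info.get("home_page") or ""
    if urlparse(home_page).hostname in KNOWN_FORGES:
        return home_page
    return None


# Hypothetical metadata records.
print(find_source_url({"project_urls": {"Source Code": "https://example.com/repo"}}))
print(find_source_url({"project_urls": None, "home_page": "https://github.com/example/project"}))
print(find_source_url({"home_page": "https://example.com"}))
```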

I disagree with that definition. Being popular is neither necessary nor sufficient for being a good package.

Some of the discussion here reminds me of this article about taxonomy of search queries, maybe you’ll find it useful:

That is an extract from the previous page:

here is a quick recap of the three categories:

  • Transactional – Here the user wants to get to a website where there will be more interaction, e.g. buying something, downloading something, signing up or registering etc.
  • Informational – This is when the user is looking for a specific bit of information.
  • Navigational – The user is looking to reach a particular website. There’s only one likely destination that they’re looking to reach.

Google, in their human rater guidelines, call these three categories:

  • Do
  • Know
  • Go

That is a narrow view of what I use a search engine for, one that is inflicted upon us by Google’s business model of selling ads. In particular, the “informational” or “know” category can be extrapolated much further. The words that come to my mind are “discovery” and “exploration”. The above article does extend the conversation to chatbots. Chatbots and the underlying AI / NLP tooling are too immature for the kind of things I use search engines or PyPI for, namely unstructured, semi-structured and structured multi-document analysis and summaries. And if you think that is crazy, look at any financial application or travel search engine.

In the scope of the current PyPI, there are two use cases:

  • typo correction of package names
  • free form text search engine
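
For the typo-correction case, the standard library already gets you surprisingly far. A minimal sketch using difflib; the function name, the cutoff value and the list of known names are assumptions for illustration, and a real index would need something faster than comparing against every project name.

```python
import difflib


def suggest_names(query: str, known_names: list[str], limit: int = 3) -> list[str]:
    """Return up to `limit` known names that look similar to `query`."""
    # Compare case-insensitively but return names in their original spelling.
    lowered = {name.lower(): name for name in known_names}
    close = difflib.get_close_matches(query.lower(), list(lowered), n=limit, cutoff=0.6)
    return [lowered[name] for name in close]


# Hypothetical slice of the index.
KNOWN = ["requests", "flask", "numpy"]

print(suggest_names("reqeusts", KNOWN))
```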

Other use cases have been mentioned, but they all fall into a much broader scope than a simple package index.

If we extend the conversation into something more complex and possibly more useful, covering missing pieces not only in the Python ecosystem but also in the larger software engineering field, many ideas come to mind. The most interesting to me remains: reviews.

CodeMeta wraps the Schema.org SoftwareApplication type with additional properties and maps {distutils, PKG-INFO} metadata to JSON-LD, which search engines could support.

Here is the list of CodeMeta properties from Python PKG-INFO.csv (codemeta/codemeta on GitHub, master branch), with the matching PKG-INFO field where there is one:

CodeMeta property       PKG-INFO field
codeRepository
programmingLanguage
runtimePlatform
targetProduct
applicationCategory
applicationSubCategory
downloadUrl             Download-URL
fileSize
installUrl
memoryRequirements
operatingSystem
permissions
processorRequirements
releaseNotes
softwareHelp
softwareRequirements    Requires
softwareVersion
storageRequirements
supportingData
author                  Author
citation
contributor
copyrightHolder
copyrightYear
dateCreated
dateModified
datePublished
editor
encoding
fileFormat
funder
keywords                Keywords
license                 License
producer
provider
publisher
sponsor
version                 Version
isAccessibleForFree
isPartOf
hasPart
position
description             Summary / Description
identifier
name                    Name
sameAs
url                     Home-Page
relatedLink
givenName
familyName
email                   Author-email
affiliation
identifier
name
address
type
id
softwareSuggestions
maintainer
contIntegration
buildInstructions
developmentStatus
embargoDate
funding
issueTracker
referencePublication
readme

Schema.org type hierarchy: Thing > CreativeWork > SoftwareApplication.

Managing reviews would be a new expectation for PyPI project maintainers.

See: “Help compare Comment and Annotation services: moderation, spam, notifications, configurability”

Search cards with structured data (Linked Data) are one use case for Python package search.

CPE identifier metadata could enable vulnerability database search use cases.

The package detail template on Warehouse could include CodeMeta JSON-LD and/or RDFa.
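
As a rough idea of what that could look like, here is a sketch that turns PKG-INFO headers into a Schema.org JSON-LD document, using only the mapped rows from the crosswalk above. It is an illustration, not Warehouse code: the function name and the sample PKG-INFO text are invented, and I use the plain Schema.org context with the SoftwareApplication type rather than the full CodeMeta context.

```python
import json
from email import message_from_string

# Subset of the crosswalk: PKG-INFO header -> Schema.org property.
PKG_INFO_TO_SCHEMA = {
    "Name": "name",
    "Version": "version",
    "Summary": "description",
    "Home-Page": "url",
    "License": "license",
    "Keywords": "keywords",
    "Download-URL": "downloadUrl",
}


def pkg_info_to_jsonld(pkg_info_text: str) -> dict:
    """Build a JSON-LD dict from RFC 822 style PKG-INFO text."""
    headers = message_from_string(pkg_info_text)  # header lookup is case-insensitive
    doc = {"@context": "https://schema.org", "@type": "SoftwareApplication"}
    for header, prop in PKG_INFO_TO_SCHEMA.items():
        value = headers.get(header)
        if value:
            doc[prop] = value
    author = {"@type": "Person"}
    if headers.get("Author"):
        author["name"] = headers["Author"]
    if headers.get("Author-email"):
        author["email"] = headers["Author-email"]
    if len(author) > 1:  # only emit an author when a name or email was found
        doc["author"] = author
    return doc


# Made-up metadata, used only for this illustration.
SAMPLE = """\
Metadata-Version: 2.1
Name: example-package
Version: 1.0.0
Summary: A made-up package used only for this illustration
Home-Page: https://example.com/example-package
Author: Jane Doe
Author-email: jane@example.com
License: MIT
"""

print(json.dumps(pkg_info_to_jsonld(SAMPLE), indent=2))
```

The resulting JSON could be embedded in a script tag of type application/ld+json in the page template, which is the usual way structured data is exposed to search engines.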