What methods should we implement to detect malicious content?

I was pointed here by one of our users, who suggested I should probably mention it. You could implement a package review system (crev for Python/pip) and let users review their dependencies and share their findings. The Rust integration comes with libraries, and AFAIK the Rust<->Python bindings are easy to use, so you could probably reuse most of the language-independent logic and just add the actual Python/pip-specific parts. Alternatively, you could re-implement those too, since it’s all rather simple (that’s the route the developer working on the Node/npm integration is taking). If anyone is interested or has more questions, come join crev’s Matrix channel and I’ll be happy to help.

One approach might be to integrate code quality metrics into PyPI so that packages with docstrings and test coverage sort higher than packages without. While this may not stop diligent troublemakers from simply improving their code quality, having test coverage may make malicious packages easier to detect.
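
As a minimal sketch of what one such metric might look like, a hypothetical `docstring_coverage` score could be computed with nothing but the stdlib `ast` module. The function name and weighting here are illustrative assumptions, not an existing PyPI feature:

```python
import ast

def docstring_coverage(source: str) -> float:
    """Fraction of functions and classes in a module that carry a docstring.

    A toy quality signal; a real ranking would also weigh test coverage,
    release history, download counts, etc.
    """
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
    if not nodes:
        return 0.0
    documented = sum(1 for n in nodes if ast.get_docstring(n) is not None)
    return documented / len(nodes)
```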

Following that line of thought (it is easier to detect good content than bad), perhaps the Linux distros could be leveraged as curators.

$ apt-cache search ^python3\- | grep ^python3\- | wc -l

In addition to changing their search rank, it may also make sense to mark these as “featured” packages.

Curating packages is the only way to guarantee goodness, and this is exactly what distros do and why people use them.

For PyPI, it’s likely a very good start to scan sdists for suspicious setup.py files (e.g. network access, obfuscation) and flag them for manual review.
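
A heuristic scanner along those lines might look like the following sketch. The pattern lists are illustrative placeholders; a real scanner would need a curated rule set plus checks for obfuscation (string concatenation, base64 blobs, etc.):

```python
import ast

# Illustrative red flags only; tune against real malware samples.
SUSPICIOUS_CALLS = {"exec", "eval", "compile", "__import__"}
SUSPICIOUS_IMPORTS = {"socket", "urllib", "requests", "base64", "ctypes"}

def flag_setup_py(source: str) -> list:
    """Return human-readable findings for one setup.py's source text."""
    findings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPICIOUS_CALLS:
                findings.append(f"call to {node.func.id}() at line {node.lineno}")
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            if isinstance(node, ast.Import):
                mods = [a.name for a in node.names]
            else:
                mods = [node.module or ""]
            for mod in mods:
                if mod.split(".")[0] in SUSPICIOUS_IMPORTS:
                    findings.append(f"import of {mod} at line {node.lineno}")
    return findings
```

Anything returning a non-empty list would be queued for the manual review mentioned above.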

Maybe more can be done for actual library code, but until we have a good database of malware samples it’ll be very hard to flag anything reliably.

Well, a lot of distros are run by volunteers: anyone can become a packager and start curating. You don’t really get a guarantee of goodness.
Where distros* do better than PyPI is enabling audits: if you trust the distro’s build infrastructure, you know the binaries actually correspond to the source code.

* Fedora, at least – I’m not all that sure about others

I’m not sure that any of the results are malicious, but neither https://pypi.org/search/?q=yaml nor pip search yaml return https://pypi.org/project/PyYAML/, which IMHO is what most people are looking for.

Using something like SourceRank to order search results could be an improvement. Even having a sort by “Number of Dependent Packages” would be a huge UX improvement and would help steer users away from malicious packages.
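
The re-ranking itself is trivial once you have reverse-dependency counts (the kind of data libraries.io’s SourceRank aggregates); this sketch assumes a hypothetical `dependents` mapping and keeps PyPI’s existing order for ties:

```python
def rank_results(results, dependents):
    """Sort search results by number of dependent packages, descending.

    `results` is the existing relevance-ordered list of package names;
    `dependents` maps package name -> reverse-dependency count.
    Python's sort is stable, so ties keep their original order.
    """
    return sorted(results, key=lambda name: dependents.get(name, 0), reverse=True)
```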

$ apt-cache search python3 | grep yaml
python3-yaml - YAML parser and emitter for Python3
python3-yaml-dbg - YAML parser and emitter for Python3 (debug build)
python3-pretty-yaml - module to produce pretty and readable YAML-serialized data (Python 3)
python3-ruamel.yaml - roundtrip YAML parser/emitter (Python 3 module)

I think this is another area where the distros are doing better.

In part I agree: distros might be gamed in theory, but in practice I see them as a potential source of trust information that can be integrated even if it isn’t flawless. I’m not sure about the state of Windows package management, but if both Fedora and Debian maintain a consensus about which packages are important, that has to be worth something.

$ apt-cache show python3-yaml
Package: python3-yaml
Architecture: amd64
Version: 3.12-1build2
Priority: important
Section: python
Source: pyyaml
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian Python Modules Team <python-modules-team@lists.alioth.debian.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 459
Depends: python3 (<< 3.7), python3 (>= 3.6~), python3:any (>= 3.3.2-2~), libc6 (>= 2.14), libyaml-0-2
Filename: pool/main/p/pyyaml/python3-yaml_3.12-1build2_amd64.deb
Size: 109068
MD5sum: 6e4bc596601817de791c141d1af6605f
SHA1: 2c978e511284b2ff996efb704584219a7dc12b8e
SHA256: 6c893d278b4e5a4a02289633c1867cd64ae33fa9ce31b351d2b8e6c63f7d8449
Homepage: https://github.com/yaml/pyyaml
Description-en: YAML parser and emitter for Python3
 Python3-yaml is a complete YAML 1.1 parser and emitter for Python3.  It can
 parse all examples from the specification. The parsing algorithm is simple
 enough to be a reference for YAML parser implementors. A simple extension API
 is also provided.  The package is built using libyaml for improved speed.
Description-md5: 6b427841deb10f77a5f50e5f6b5a05d8
Task: minimal, ubuntu-core
Supported: 5y

I wonder if it would be worth talking to Debian et al and seeing if they would be willing to include a link to PyPI in the distro package metadata. Then PyPI can automatically scan APT/RPM to flag “featured” packages and possibly link back to the distro versions. APT could also be used as a source of dependency graph information for trust metrics or to figure out which developers would be most important to get buy-in from for a developer side code signing process.

I came here to suggest using some kind of quality metrics to rank search results in light of the recent RubyGems compromise where 10 of the 11 malicious gems discovered were copies of existing libraries uploaded with a new name and a cryptominer included.

The metrics in SourceRank seem like a good start. I wonder if there are other metrics we might consider using. Should E2E package signing be implemented, the presence or absence of signing could be a factor. But what else? Perhaps a ranking based on which links to quality indicators exist, such as code coverage reports, CI logs, etc. These items are often used in an attempt to gauge the quality of a project.

TUF and in-toto should go a long way to solve this problem. To use a pharmaceutical drug analogy, in-toto is the tool that tells you who made which ingredients, and how they were all put together, whereas TUF is the tool that tells you who to trust in the first place, wraps it all up, and delivers it in a trustworthy seal.

Disclosure: I am involved with both projects.

Cc @JustinCappos

As per the other thread, TUF doesn’t do this; it just tells you that the package hasn’t changed since it was uploaded, and, taken to its extreme, that the uploader hasn’t changed between uploads. So far, we haven’t had a problem that this would have prevented or notified us about.

Malicious content in new packages is a real issue that is being dealt with daily, and manually. Automating that process is important.


I meant curation in the general sense, and my first thoughts tend to go to ActiveState and Anaconda rather than the purely volunteer driven ones.

Either way, it’s much harder to fall victim to typosquatting when someone has to manually import the malicious package into a separate index first.

Hi everyone!

I hope you’ll pardon the long post. I’m excited about this effort, since it touches on a topic I’ve explored and thought about quite a bit in the past. I am looking forward to seeing the exciting results!

Note that my original post wasn’t allowed since new users can only include two links in their post. I’ve pasted the full post (including citations and links) in this gist.

When it comes to malware prevention, my thoughts can be divided into sections covering prevention and detection.

Preventing Malware

As we’ve seen, malware on package managers frequently comes from:

  • Hijacking existing packages through account compromise
  • Hijacking existing packages that have been abandoned or deleted
  • Registering typo-squatted packages

I’d like to take a look at what might be done to help mitigate each of these.

Hijacking via Account Compromise

Encourage 2FA Adoption

It’s very exciting to see the strides PyPI has already made in enabling 2FA for accounts, which is a great first step. But I would also consider, once 2FA is fully in production and stable, encouraging maintainers to turn on 2FA by showing a warning during package upload or login to PyPI if the account doesn’t have 2FA enabled.

Enforce 2FA for Maintainers

I’ve seen some package managers, like npm, offer owners of a package the ability to force other maintainers to enable 2FA in order to publish a new version of a package. This would be a useful addition to PyPI as well. I didn’t see anything on this thread that suggested this was in the works, but let me know if I’m missing something.

Monitoring for Leaked API Tokens

It’s exciting to see the work being done leveraging Macaroons as API tokens. As this becomes a more widely used feature, I would recommend signing up for Github’s Token Scanning service to identify and revoke API tokens that might be accidentally leaked in commits to Github. Since you’re using a prefix “pypi”, you should be able to craft a regex that reliably identifies the API tokens. It looks like this has been suggested in #6051, so consider this a +1.
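
The “pypi” prefix makes the matching pattern straightforward. A sketch of what such a regex might look like is below; the exact body character class and minimum length are assumptions to tune against real tokens, not PyPI’s official pattern:

```python
import re

# Assumed shape: "pypi-" prefix followed by a long base64url-ish macaroon body.
# The 16-character minimum is a guess to cut down false positives.
PYPI_TOKEN_RE = re.compile(r"\bpypi-[A-Za-z0-9_-]{16,}\b")

def find_candidate_tokens(text: str) -> list:
    """Return candidate PyPI API tokens found in a blob of text
    (e.g. the contents of a scanned commit)."""
    return PYPI_TOKEN_RE.findall(text)
```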

Abandoned Packages

After the left-pad incident a while back, npm created an unpublish policy which led to the following rules:

  • You can unpublish a package as long as it’s less than 72 hours old
  • Otherwise, deprecation is highly recommended. I think you can still unpublish by contacting support

I wasn’t able to find a similar policy for PyPI, but the one from npm seems reasonable. I like that it offers an org like PSF the chance to transfer the package to a holding space or otherwise find a middle-ground with the original author. That said, I don’t have metrics to indicate how many support tickets this would have caused in the past x months.

Registering Typo-Squatted Packages

There have been discussions around using metrics like Levenshtein distance to determine if a package being registered is too similar to an existing package. A response on a different thread suggests that this would result in too many false positives.

Instead, here’s an alternative approach that may be worth considering: there are already metrics on (roughly) the number of downloads for each package. Assuming you don’t have this already, adding internal metrics for the number of non-existent packages that people are attempting to download would give a prioritized list of things to consider blacklisting. My guess is that there will be entries that surface that would not have been caught using standard typo-squatting measures, like people trying to install a package called requirements.txt because the -r was missed.
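
Turning those miss counts into a review queue is simple. This sketch assumes a hypothetical feed of 404’d install names and an arbitrary threshold; both are placeholders for whatever PyPI’s logs actually provide:

```python
from collections import Counter

def blacklist_candidates(missing_requests, known_packages, min_hits=100):
    """Rank names of non-existent packages by how often installs were attempted.

    `missing_requests` is an iterable of package names that 404'd;
    `known_packages` filters out names that exist; `min_hits` drops noise.
    Returns (name, count) pairs, most-requested first.
    """
    counts = Counter(name.lower() for name in missing_requests)
    return [(name, n) for name, n in counts.most_common()
            if n >= min_hits and name not in known_packages]
```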

Hopefully some of these changes could raise the barrier required for malware to both be uploaded to PyPI and be effective. From here, I’d like to talk about detecting what makes it through the cracks.

Detecting Malware

Right now, there’s a fair bit of magic that goes into detecting malicious packages uploaded to package managers. In a post from a while back, I downloaded the metadata for all npm packages and essentially grep'd through the postinstall, preinstall, and install values. This is in line with the static analysis done by other folks to find malicious packages on PyPI. There have also been reports from people doing compelling work looking for specific syscalls during dynamic analysis of npm modules which looks promising.

But in general, I think it’s important to decide and enforce what’s in scope in terms of where PyPI wants to look for malware, and then what behavior is explicitly disallowed within that scope. Anything else will be a much more difficult task, and runs the risk of confusing users.

So let’s talk about what, in my opinion, should be in scope.

Where to Look for Malware

In my personal opinion (I’m very open to changing my mind here - this is strictly where my head is at), a good boundary to set is whatever behavior occurs at installation time without a user’s reasonable knowledge and without the user having an option to opt out.

While some languages have very clear places where malicious code could be executed during the installation process, with Python things are a bit less clear. Some malware has resorted to simply including executable code directly in the setup.py file, though it’s unclear if this executes during installation. Instead, it seems the “recommended” approach to get code execution during installation is by using the cmdclass flag to specify your own install class as shown here (with a blog post here). For example, this approach appears to have been used by the malicious colourama package here.

Alternatively, you could create your own eggsecutable script as mentioned here though I’m not exactly sure when that fires.

Just from the outset, I’d see value in more closely scrutinizing commands executing as part of the cmdclass overrides, since it seems to be a widely used method for existing malware. But more broadly, to find issues I’d probably consider leveraging dynamic analysis in a sandboxed installation, leading us to talk about what it is we’d look for.
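
As a first-pass filter for that scrutiny, a static check for `cmdclass` overrides could be as simple as the sketch below. It only catches the plain `setup(cmdclass=...)` form; a real scanner would also follow variables and inspect what the custom command classes actually do:

```python
import ast

def uses_cmdclass(source: str) -> bool:
    """True if a setup() call in this source passes a cmdclass= argument,
    i.e. the install-time code-execution hook discussed above."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "setup"):
            if any(kw.arg == "cmdclass" for kw in node.keywords):
                return True
    return False
```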

What to Look For

At a high-level, I think there should be some guidelines on what behavior is allowed (or, more likely, disallowed) during the installation process. Just recently, for example, npm decided that ads cannot be shown during installation as a response to a package using them as a potential source for OSS income. Some examples of things that might be considered are:

  • What data should the installation be allowed to access?
  • What data should the installation be allowed to modify?
  • Should network connections be allowed? If so, to where?

I don’t have all the answers, but defining what behavior is expected and allowed will set the tone for the larger project to identify what constitutes abuse of the platform.

Learning from Others

Last but far from least, I was happy to read in the RFI outline that there was a goal to review what other package managers are doing in this space. In these notes I’ve mentioned the work from npm a few times, but more broadly I’d highly encourage us to proactively reach out to the maintainers of other package managers to collaborate on solutions. For example, I really enjoyed this talk from Adam Baldwin at npm that discusses some of the ongoing work they’re doing in this space.

This is a problem space where package managers have many overlapping goals and many of the same problems, and as such they would benefit from learning and building together.


Steve: this is why I said TUF and in-toto. You use both to get transparent end-to-end authenticity and integrity of your packages, from the moment developers checked in source code, CI built a package and uploaded it to PyPI, to the moment users download it from PyPI. See this blog post for an example of how Datadog used both to secure the packaging and distribution of our Agent integrations. By using both, you get very strong guarantees that, unless the original developers went rogue, packages were developed and built correctly. Does this clarify my point?

Please see this thread where we are trying to lay the foundation for TUF on PyPI, so that we can integrate in-toto to detect malicious content in the future.

Aren’t most of the current issues with malicious packages from rogue developers? Are developers currently being targeted by MITM attacks when uploading to npm or wherever? I don’t understand how in-toto or TUF solves what looks to be the primary issue.

Without end-to-end signing, I’m also not sure how TUF and in-toto protect against cases where the PyPI account credentials are stolen. How is developer key rotation handled with in-toto? How are forgotten signing keys disambiguated from a key update from a compromised developer account?

I can see how it makes it easier to revoke multiple uploads signed by the same key and can improve auditability, but I don’t see how this really affects the cost of uploading malicious content.

TUF protects against compromises of the publishing infrastructure, which do happen (see this list of compromises).

While PEP 458 “only” protects against malicious CDNs/mirrors (artifacts are signed by PyPI), PEP 480 protects against compromises of PyPI too (artifacts are signed by developers before uploading and by PyPI).

in-toto extends end-to-end signing further up the supply chain and protects against rogue developers with signature thresholds for any step of the supply chain (release signing, building, packaging, etc…).

Rotation of in-toto keys that are authorized to sign for a step is baked into in-toto. It is done by releasing a new in-toto supply chain definition, which defines these steps, how they depend on each other, who is authorized to provide signed evidence for them, and how much evidence it needs (thresholds!).
Rotation of in-toto root of trust keys OTOH can be done with TUF, as Trishank describes in his blog post.

PEP 458 should be seen as a stepping stone toward PEP 480, which in turn could be seen as a stepping stone toward full supply chain protection with TUF and in-toto.
However, each of these additions not only paves the way for the next, but also provides meaningful security guarantees by itself.

Maybe I’m misunderstanding something. Who generates and signs the root.layout, PyPI or the developer?

I can see how TUF can be used for distribution of in-toto public keys from PyPI to end-users, I think I’m having trouble understanding the key rotation mechanism between developers and PyPI.

In Trishank’s blog post it looks like the wheels-signer key is signed by the snapshot key, in which case there is no issue updating the keys, as they are internal to the same organization. Rereading PEP 480, it looks like some of my misunderstanding is about PyPI running build infrastructure.

As for malicious content, I don’t understand how TUF or in-toto solves the problem of someone creating a new PyPI account and uploading notacryptominer/0.1.0, which as I understand it is the most common issue.

If administrators use offline keys to distribute, revoke, and rotate developers keys (which should also be offline themselves), then the problem is solved. For example, in the Datadog implementation, the top-level targets role uses offline keys to distribute public keys for the in-toto root.layout (these private keys are also offline). Is this clearer?

How does a developer revoke an Ed25519 key or authenticate a new key to PyPI?
What prevents an account from being compromised and having a new developer key uploaded to sign a malicious update?

I’m also still not sure I understand who signs the root.layout.

These questions do have good answers, but I feel like we are digressing here. These are best answered in a thread about PEP 458 / 480, not here.


I feel like VirusTotal could be useful here. Given that it’s probably the most comprehensive collection of malware in existence, it’s probably worth uploading source files to run them against multiple AVs to detect potentially malicious content. This would still require manual review, however, since VirusTotal only returns the number of detections, not a definitive yes/no.

Also, the VirusTotal public API is free but rate-limited, which might be a deal-breaker.
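
One way to stretch the rate limit is to look up artifacts by hash before uploading anything, since VirusTotal indexes files by digest. A sketch of the client-side pieces is below; the v3 file-report URL shape is my understanding of the current API, and the actual request would need an `x-apikey` header:

```python
import hashlib

def sha256_of(path: str) -> str:
    """SHA-256 digest of a local artifact, the key VirusTotal indexes files by."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def vt_report_url(digest: str) -> str:
    """URL of the (assumed) v3 file-report endpoint for a given hash.
    A GET here with an x-apikey header returns existing detections, if any."""
    return f"https://www.virustotal.com/api/v3/files/{digest}"
```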

On the topic of scope for this task, I think we want to distinguish it from these other goals (which are fine goals, just out of scope, I think):

  • detecting security vulnerabilities in the Python being published (although there is some prior art on this, and services that will scan an open-source project for free)
  • scoring the trustworthiness or risk factors of the Python being published (e.g., whether it carries an unacceptable amount of dependencies, makes ill-advised system configuration changes, etc.)
  • any issues related to the hijacking of existing packages, and typosquatting existing package names

What I think is in scope related to the last point is discovering that a package author’s account has been hijacked because we detected something suspicious about a package that the attacker attempted to publish. There are other ways to prevent and to detect account compromise and there are great suggestions for those in this thread, but I think only content analysis should be in scope for this Q4 RFP.

Unfortunately, even when reduced to the goal of analyzing content, I think this is too big of a problem to be solved categorically. We advise a heuristic approach for defining malice and detecting it in Python packages. We’ll have to think about what are the most impactful analysis heuristics that can be implemented in the timeframe and budget of this effort.

There are a bunch of strategies to choose from, and they each have pros and cons: a statistical classifier approach, a pattern/signature approach, a scoring-and-threshold approach, etc. There is an unavoidable maintenance cost in running a system for adversarial detection, because the adversary will keep adapting to evade it. I think proposers should be asked to honestly estimate the maintenance burden of their proposed solution.

I conferred with @woodruffw about what other package managers have done for malicious content detection and the answer was that we were not aware of much. What we are aware of is that the Google and Apple app stores have both invested heavily in runtime analysis sandboxes and static analysis approaches for detecting malice in their app stores. The difference there being, they can run their detections in secret, and adversaries can’t develop an evasion in advance without disclosing it in a submission. So that’s another question for any proposed solution: Will it be effective even if the methods are open-source? Or does it require partial secrecy? Will an attacker be able to lab-test their evasions in secret or will they have to risk detection by submitting them to a live service?

Lastly, we think a good question for proposed solutions is regarding the cost of handling false positives. Who reviews the alerts and approves packages in the case of a false positive? Can you put the burden on the package author to explain why their package is benign, or instruct them what to change in order not to be regarded as malicious?


Thanks to everyone for participating in this discussion! The RFI period has closed, and replies in this category have been disabled.

Based on the feedback, we’ll be updating our scope before opening the Request for Proposals period next week along with a new discussion category.

If you’re interested in participating in the RFP, sign up at https://forms.gle/redWdNhwMqzRG1jC8 to be notified when it launches.