Let’s discuss here: what methods should Warehouse use to detect malicious content?
Methods are going to depend a lot on what false positive and false negative rates are acceptable, how many resources are available for training and for live detection, and how much delay is acceptable. At least ballpark figures are needed.
I ran into typosquatting of Python packages at a customer site (local PyPI) last year, and I’ve gone from regex and soundex to building models to detect this, but I suspect something lightweight would be needed for real-time detection.
Another thing to consider is what data is available at training time and what data in real-time. When I searched the local repository, I didn’t have things like IP addresses, when a user registered etc.
Warehouse probably stores non-public information that could be used. Some can be estimated (e.g. geolocation based on IP address), but again there are restrictions in terms of CPU, memory, time to evaluate, etc.
Then there is the whole aspect of features derived from code. And when I say code, it’s not just Python.
One other point I haven’t brought up yet (was waiting a bit to see if there would be some feedback) is that of archive types and file types. Malware can hide in all kinds of places, and each requires different techniques to deal with. The original question is not just about typosquatting, but about malware detection, which is much broader. Hence, a much broader discussion is needed.
Pure Python module archive: (example: cpu-temperature-monitor) this should be easy in theory using a vectorizer, but the Python code itself might not hold any malware directly and might fetch it through what looks like an innocent “data” download from GitHub. Or the code might be obfuscated (but then, that would be a feature, just like use of encryption / decryption on files).
Python module with C or C++ extensions archive: (example: APNGlib) this is a bit trickier because there are far fewer code examples, and because of the use of pointers a vectorizer step might be left without helpful features (the goal being to compare and cluster “like” modules; if too many packages sit by themselves in whatever “distance” metric space is used, that is not a good discriminant).
Even though the above two typically do not include binary executable files, they can. Binary files in those should definitely be a feature (a risk component). Other binaries might not appear to be executable files at first (.jpg, .png, etc.) but might also hold an executable payload. Then, there might be executables in PE, ELF or Mach-O format (covering Windows, Linux, macOS). In the next paragraph I expand a bit on this.
Python wheel archives: (example: Camelot) These are often architecture-specific, and in that case will include binary files, sometimes a lot of them. This is basically designed to support C extensions. So, on Linux, this would presumably be some .so files. But a malware developer can once more hide the principal payload in all kinds of binary files included in the wheel file.
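As a rough illustration of the "binary files as a risk feature" idea, here is a minimal sketch that looks at the first bytes of every member of a zip-based archive (wheels are zip files) and reports those that start with a PE, ELF or Mach-O magic number, regardless of file extension. The function names and the idea of treating a hit as a feature rather than a verdict are my assumptions, not anything Warehouse does today.

```python
import io
import zipfile

# Leading magic bytes of common native executable formats.
# 0xCAFEBABE is shared by Mach-O universal binaries and Java class files,
# and any file may happen to start with "MZ", so a hit is only a hint.
MAGIC_PREFIXES = {
    b"MZ": "PE",
    b"\x7fELF": "ELF",
    b"\xfe\xed\xfa\xce": "Mach-O",
    b"\xfe\xed\xfa\xcf": "Mach-O",
    b"\xce\xfa\xed\xfe": "Mach-O",
    b"\xcf\xfa\xed\xfe": "Mach-O",
    b"\xca\xfe\xba\xbe": "Mach-O universal (or Java class)",
}


def sniff_format(head: bytes):
    """Return a format label if the leading bytes look executable, else None."""
    for magic, label in MAGIC_PREFIXES.items():
        if head.startswith(magic):
            return label
    return None


def executable_members(archive_bytes: bytes):
    """Map each zip member that starts with an executable magic number to its label."""
    hits = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as fh:
                label = sniff_format(fh.read(4))
            if label:
                hits[info.filename] = label
    return hits
```

A member named logo.png whose content starts with the ELF magic would show up in the result, which is exactly the "does not look executable at first" case. Payloads appended to a valid image, or packed inside another container, would not be caught by a header check like this.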
Those are the static code features. Other features that would be useful, but expensive to work through, would be dependency and reverse dependency graphs for all modules, Levenshtein distance, binary fingerprinting, code disassembly / vectorizing, etc. At the end of the day, this is going to be an incremental process that will have to cover the basics first, then add features as time goes by, and it will require some pipeline to retrain the model(s), supervised or unsupervised.
Finally, going back to the original question as to methods, each new feature added will probably impact the methods used. What I mean by that is for one, we have to deal with the curse of dimensionality, and with the no free lunch theorem:
One-hot encoders and vectorizers will explode the feature space (and some algorithms do not scale well at all, especially given how many rows (Python packages) there are)
One method might do well for a set of features and data subset, but might do poorly on a slightly different set of features (especially when dealing with one hot encoding)
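To make the feature-space point a bit more concrete, here is a small pure-Python sketch. It counts how many columns a one-hot or count vectorizer over character trigrams would need for a set of package names, and contrasts that with a fixed-width hashed vector (the "hashing trick"). The choice of trigrams, the bucket count and the use of CRC32 as the hash are all arbitrary assumptions for illustration.

```python
import zlib


def char_ngrams(name: str, n: int = 3):
    """Return the character n-grams of a package name, padded with '^' and '$'."""
    padded = f"^{name.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def vocabulary_size(names, n: int = 3) -> int:
    """Number of columns a one-hot / count vectorizer would need for these names."""
    return len({gram for name in names for gram in char_ngrams(name, n)})


def hashed_vector(name: str, n_buckets: int = 64, n: int = 3):
    """Fixed-width count vector: each n-gram is hashed into one of n_buckets."""
    vec = [0] * n_buckets
    for gram in char_ngrams(name, n):
        vec[zlib.crc32(gram.encode("utf-8")) % n_buckets] += 1
    return vec
```

vocabulary_size keeps growing as new names arrive, while hashed_vector always returns n_buckets columns, at the price of collisions between unrelated n-grams. Whether that trade-off helps depends on the downstream algorithm, which is the no-free-lunch point above.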
Well, that’s a start.
I was pointed here by one of our users, suggesting I should probably mention it. You could implement a package reviewing system, crev for Python/pip, and let users review their dependencies and share their findings. The Rust integration comes with libraries, and AFAIK the Rust<->Python bindings are easy to use, so you could probably reuse most of the language-independent logic and just add the actual Python/pip-specific stuff. Alternatively you could re-implement these too, since it’s all rather simple (that’s the route the developer working on Node/npm integration is taking). If anyone is interested or has more questions, come join crev’s matrix channel and I’m happy to help.
One approach might be to integrate code quality metrics into PyPI so that packages with docstrings and test coverage might sort higher than those packages without. While this may not prevent diligent trouble makers from improving their code quality, having test coverage may make it easier to detect malicious packages.
Further along the line of thought, that it is easier to detect good content than bad, perhaps the Linux distros can be leveraged as curators.
$ apt-cache search ^python3\- | grep ^python3\- | wc -l
2352
In addition to changing their search rank, it may also make sense to mark these as “featured” packages.
Curating packages is the only way to guarantee goodness, and this is exactly what distros do and why people use them.
For PyPI, it’s likely a very good start to scan sdists for suspicious setup.py files (eg. network access, obfuscation) and flag them for manual review.
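A minimal sketch of what such a setup.py scan could look like, using the standard-library ast module so that the file is parsed but never executed. The watch lists below are hypothetical and deliberately short; the output is meant as input for manual review, not as a verdict.

```python
import ast

# Hypothetical watch lists; a real deployment would tune these.
SUSPICIOUS_MODULES = {"socket", "urllib", "http", "requests", "subprocess", "base64", "ctypes"}
SUSPICIOUS_CALLS = {"exec", "eval", "compile", "__import__"}


def suspicious_findings(source: str):
    """Return sorted (line, description) pairs for imports and calls worth a manual look."""
    findings = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in SUSPICIOUS_MODULES:
                    findings.add((node.lineno, f"import {root}"))
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root in SUSPICIOUS_MODULES:
                findings.add((node.lineno, f"import {root}"))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPICIOUS_CALLS:
                findings.add((node.lineno, f"call {node.func.id}"))
    return sorted(findings)
```

Plenty of legitimate setup.py files import subprocess or urllib, so this would need a review queue behind it. It also only sees direct names, so anything reached through getattr or string building slips past.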
Maybe more can be done for actual library code, but until we have a good database of malware samples it’ll be very hard to flag anything reliably.
Well, a lot of distros are run by volunteers: anyone can become a packager and start curating. You don’t really get a guarantee of goodness.
Where distros* do better than PyPI is enabling audits: if you trust the distro’s build infrastructure, you know the binaries actually correspond to the source code.
* Fedora, at least; I’m not all that sure about others
I’m not sure that any of the results are malicious, but neither https://pypi.org/search/?q=yaml nor pip search yaml returns https://pypi.org/project/PyYAML/, which IMHO is what most people are looking for.
Using something like the SourceRank to order search results could be an improvement. Even having a sort by “Number of Dependent packages” would be a huge UX improvement and would help steer users away from malicious packages.
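As a sketch of the "sort by number of dependent packages" idea: given a mapping from each package to the names it depends on, count the reverse edges and use that count as the sort key for search hits. The mapping shape and function names are assumptions for illustration; this is not how SourceRank or Warehouse computes anything.

```python
from collections import Counter


def dependent_counts(requirements):
    """Count, for each package, how many distinct packages list it as a dependency."""
    counts = Counter()
    for package, deps in requirements.items():
        for dep in set(deps):
            if dep != package:
                counts[dep] += 1
    return counts


def rank_by_dependents(hits, requirements):
    """Order search hits by dependent count (descending), then by name."""
    counts = dependent_counts(requirements)
    return sorted(hits, key=lambda name: (-counts[name], name))
```

With a ranking like this, a freshly uploaded look-alike with no dependents would sort below the established package it imitates, at least until someone manages to game the dependency counts.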
$ apt-cache search python3 | grep yaml
python3-yaml - YAML parser and emitter for Python3
python3-yaml-dbg - YAML parser and emitter for Python3 (debug build)
python3-pretty-yaml - module to produce pretty and readable YAML-serialized data (Python 3)
python3-ruamel.yaml - roundtrip YAML parser/emitter (Python 3 module)
I think this is another area where the distros are doing better.
In part I agree that distros might be gamed in theory, but in practice I see them as a potential source of trust information that can be integrated even if it isn’t flawless. I’m not sure about the state of Windows package management, but if both Fedora and Debian maintain consensus about which packages are important, that has to be worth something.
$ apt-cache show python3-yaml
Package: python3-yaml
Architecture: amd64
Version: 3.12-1build2
Priority: important
Section: python
Source: pyyaml
Origin: Ubuntu
Maintainer: Ubuntu Developers <email@example.com>
Original-Maintainer: Debian Python Modules Team <firstname.lastname@example.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 459
Depends: python3 (<< 3.7), python3 (>= 3.6~), python3:any (>= 3.3.2-2~), libc6 (>= 2.14), libyaml-0-2
Filename: pool/main/p/pyyaml/python3-yaml_3.12-1build2_amd64.deb
Size: 109068
MD5sum: 6e4bc596601817de791c141d1af6605f
SHA1: 2c978e511284b2ff996efb704584219a7dc12b8e
SHA256: 6c893d278b4e5a4a02289633c1867cd64ae33fa9ce31b351d2b8e6c63f7d8449
Homepage: https://github.com/yaml/pyyaml
Description-en: YAML parser and emitter for Python3
 Python3-yaml is a complete YAML 1.1 parser and emitter for Python3. It can parse all examples from the specification. The parsing algorithm is simple enough to be a reference for YAML parser implementors. A simple extension API is also provided. The package is built using libyaml for improved speed.
Description-md5: 6b427841deb10f77a5f50e5f6b5a05d8
Task: minimal, ubuntu-core
Supported: 5y
I wonder if it would be worth talking to Debian et al and seeing if they would be willing to include a link to PyPI in the distro package metadata. Then PyPI can automatically scan APT/RPM to flag “featured” packages and possibly link back to the distro versions. APT could also be used as a source of dependency graph information for trust metrics or to figure out which developers would be most important to get buy-in from for a developer side code signing process.
I came here to suggest using some kind of quality metrics to rank search results in light of the recent RubyGems compromise where 10 of the 11 malicious gems discovered were copies of existing libraries uploaded with a new name and a cryptominer included.
The metrics in SourceRank seem like a good start. I wonder if there are other metrics we might consider using. Should E2E package signing be implemented, the presence or absence of signing could be a factor. But what else? Perhaps a ranking based on which links to quality indicators exist, such as code coverage reports, CI logs, etc. These items are often used in an attempt to gauge the quality of a project.
TUF and in-toto should go a long way toward solving this problem. To use a pharmaceutical drug analogy, in-toto is the tool that tells you who made which ingredients and how they were all put together, whereas TUF is the tool that tells you who to trust in the first place, wraps it all up, and delivers it in a trustworthy seal.
Disclosure: I am involved with both projects.
As per the other thread, TUF doesn’t do this, it just tells you that the package hasn’t changed since it was uploaded, and taken to its extreme that the uploader hasn’t changed between uploads. So far, we haven’t had a problem that this would have prevented or notified us about.
Contents of new packages being malicious is a real issue that is being dealt with daily and manually. Automating that process is important.
I meant curation in the general sense, and my first thoughts tend to go to ActiveState and Anaconda rather than the purely volunteer driven ones.
Either way, it’s much harder to fall victim to typosquatting when someone has to manually import the malicious package into a separate index first.
I hope you’ll pardon the long post. I’m excited about this effort, since it touches on a topic I’ve explored and thought about quite a bit in the past. I am looking forward to seeing the exciting results!
Note that my original post wasn’t allowed since new users can only include two links in their post. I’ve pasted the full post (including citations and links) in this gist.
When it comes to malware prevention, my thoughts can be divided into sections covering prevention and detection.
As we’ve seen, malware on package managers frequently comes from:
- Hijacking existing packages through account compromise
- Hijacking existing packages that have been abandoned or deleted
- Registering typo-squatted packages
I’d like to take a look at what might be done to help mitigate each of these.
Hijacking via Account Compromise
Encourage 2FA Adoption
It’s very exciting to see the strides PyPI has already made in enabling 2FA for accounts, which is a great first step. But I would also consider - after 2FA is both fully in production and stable - encouraging maintainers to turn on 2FA by prompting a warning during a package upload or login to PyPI if the account doesn’t have 2FA enabled.
Enforce 2FA for Maintainers
I’ve seen some package managers, like npm, offer owners of a package the ability to force other maintainers to enable 2FA in order to publish a new version of a package. This would be a useful addition to PyPI as well. I didn’t see anything on this thread that suggested this was in the works, but let me know if I’m missing something.
Monitoring for Leaked API Tokens
It’s exciting to see the work being done leveraging Macaroons as API tokens. As this becomes a more widely used feature, I would recommend signing up for GitHub’s Token Scanning service to identify and revoke API tokens that might be accidentally leaked in commits to GitHub. Since you’re using a prefix “pypi”, you should be able to craft a regex that reliably identifies the API tokens. It looks like this has been suggested in #6051, so consider this a +1.
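For illustration, a regex along these lines might look like the sketch below. The only thing taken from the discussion is the "pypi" prefix; the hyphen separator, the URL-safe base64 alphabet and the minimum body length are my assumptions and would need to be checked against the actual token format.

```python
import re

# Assumption: tokens look like "pypi-" followed by a long URL-safe base64 body.
# The minimum length of 32 is an arbitrary guess to cut down on noise.
TOKEN_RE = re.compile(r"\bpypi-[A-Za-z0-9_-]{32,}")


def find_candidate_tokens(text: str):
    """Return substrings that look like leaked API tokens."""
    return TOKEN_RE.findall(text)
```

Anything this matches is only a candidate. A scanning service would presumably report candidates back to PyPI, which can check whether the token is real before revoking it.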
After the left-pad incident a while back, npm created an unpublish policy which led to the following rules:
- You can unpublish a package as long as it’s less than 72 hours old
- Otherwise, deprecation is highly recommended. I think you can still unpublish by contacting support
I wasn’t able to find a similar policy for PyPI, but the one from npm seems reasonable. I like that it offers an org like PSF the chance to transfer the package to a holding space or otherwise find a middle-ground with the original author. That said, I don’t have metrics to indicate how many support tickets this would have caused in the past x months.
Registering Typo-Squatted Packages
There have been discussions around using metrics like Levenshtein distance to determine if a package being registered is too similar to an existing package. A response on a different thread suggests that this would result in too many false positives.
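For reference, this is roughly what a Levenshtein-based check amounts to. It is a plain dynamic-programming implementation with a hypothetical near_matches helper and an arbitrary default threshold of one edit.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def near_matches(candidate: str, existing, max_distance: int = 1):
    """Existing names within max_distance edits of candidate, excluding exact matches."""
    candidate = candidate.lower()
    return sorted(
        name for name in existing
        if name.lower() != candidate
        and levenshtein(candidate, name.lower()) <= max_distance
    )
```

The false-positive concern is easy to see from here: any two short, legitimate names that differ by a single character are flagged at max_distance=1. Meanwhile a swapped pair of letters such as "reqeusts" is two edits from "requests" and is missed unless the threshold is raised, which flags even more legitimate names.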
Instead, here’s an alternative approach that may be worth considering: there are already metrics on (roughly) the number of downloads for each package. Assuming you don’t have this already, adding internal metrics for the number of non-existent packages that people are attempting to download would give a prioritized list of things to consider blacklisting. My guess is that there will be entries that surface that would not have been caught using standard typo-squatting measures, like people trying to install a package called requirements.txt because the -r was missed.
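A minimal sketch of that metric, assuming request records are available as (project name, HTTP status) pairs. That record shape is hypothetical; I don’t know what Warehouse or its CDN actually logs.

```python
from collections import Counter


def top_missing_names(requests_log, limit: int = 10):
    """Rank project names that were requested but do not exist.

    requests_log is an iterable of (project_name, http_status) pairs; this
    record shape is hypothetical, not Warehouse's actual log format.
    """
    misses = Counter(
        name.lower() for name, status in requests_log if status == 404
    )
    return misses.most_common(limit)
```

The top of that list is the set of names an attacker would most like to register, so it doubles as a prioritized list of names to reserve.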
Hopefully some of these changes could raise the barrier required for malware to both be uploaded to PyPI and be effective. From here, I’d like to talk about detecting what makes it through the cracks.
Right now, there’s a fair bit of magic that goes into detecting malicious packages uploaded to package managers. In a post from a while back, I downloaded the metadata for all npm packages and essentially grep'd through the install values. This is in line with the static analysis done by other folks to find malicious packages on PyPI. There have also been reports from people doing compelling work looking for specific syscalls during dynamic analysis of npm modules which looks promising.
But in general, I think it’s important to decide and enforce what’s in scope in terms of where PyPI wants to look for malware, and then what behavior is explicitly disallowed within that scope. Anything else will be a much more difficult task, and runs the risk of confusing users.
So let’s talk about what, in my opinion, should be in scope.
Where to Look for Malware
In my personal opinion (I’m very open to changing my mind here - this is strictly where my head is at), a good boundary to set is what behavior occurs at the time of installation without a user’s reasonable knowledge and without the user having an option to opt-out.
While some languages have very clear places where malicious code could be executed during the installation process, with Python things are a bit less clear. Some malware has resorted to simply including executable code directly in the setup.py file, though it’s unclear if this executes during installation. Instead, it seems the “recommended” approach to get code execution during installation is by using the cmdclass flag to specify your own install class as shown here (with a blog post here). For example, this approach appears to have been used by the malicious colourama package here.
Alternatively, you could create your own eggsecutable script as mentioned here though I’m not exactly sure when that fires.
Just from the outset, I’d see value in more closely scrutinizing commands executing as part of the cmdclass overrides, since it seems to be a widely used method for existing malware. But more broadly, to find issues I’d probably consider leveraging dynamic analysis in a sandboxed installation, leading us to talk about what it is we’d look for.
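As a starting point for scrutinizing cmdclass overrides, here is a sketch that statically lists which commands a setup.py overrides, again using ast so nothing is executed. It only understands a literal dict passed directly to a call named setup; anything built dynamically is reported as "<dynamic>". The function name and that convention are mine.

```python
import ast


def cmdclass_overrides(setup_source: str):
    """Return the command names overridden via setup(cmdclass={...}).

    Only handles a literal dict passed directly to a call named `setup`;
    anything built dynamically is reported as "<dynamic>".
    """
    overrides = []
    for node in ast.walk(ast.parse(setup_source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        func_name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if func_name != "setup":
            continue
        for kw in node.keywords:
            if kw.arg != "cmdclass":
                continue
            if isinstance(kw.value, ast.Dict):
                for key in kw.value.keys:
                    if isinstance(key, ast.Constant) and isinstance(key.value, str):
                        overrides.append(key.value)
                    else:
                        overrides.append("<dynamic>")
            else:
                overrides.append("<dynamic>")
    return overrides
```

A package that overrides install (or has a "<dynamic>" entry) is not necessarily malicious, since plenty of legitimate packages customize build steps, but it seems like a reasonable signal for deciding which uploads get the more expensive sandboxed analysis first.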
What to Look For
At a high-level, I think there should be some guidelines on what behavior is allowed (or, more likely, disallowed) during the installation process. Just recently, for example, npm decided that ads cannot be shown during installation as a response to a package using them as a potential source for OSS income. Some examples of things that might be considered are:
- What data should the installation be allowed to access?
- What data should the installation be allowed to modify?
- Should network connections be allowed? If so, to where?
I don’t have all the answers, but defining what behavior is expected and allowed will set the tone for the larger project to identify what constitutes abuse of the platform.
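To show how such guidelines could become something checkable, here is a sketch of a policy object and a function that compares it against events observed during a sandboxed install. Everything here is hypothetical: the event shape, the glob-based path rules and the host allow-list are just one way to encode answers to the three questions above.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class InstallPolicy:
    """Hypothetical allow-lists for behavior observed during a sandboxed install."""
    readable_globs: list = field(default_factory=list)
    writable_globs: list = field(default_factory=list)
    allowed_hosts: set = field(default_factory=set)


def violations(policy: InstallPolicy, events):
    """Return the events that fall outside the policy.

    Each event is a (kind, target) pair with kind in {"read", "write", "connect"}.
    """
    bad = []
    for kind, target in events:
        if kind == "read":
            ok = any(fnmatch(target, g) for g in policy.readable_globs)
        elif kind == "write":
            ok = any(fnmatch(target, g) for g in policy.writable_globs)
        elif kind == "connect":
            ok = target in policy.allowed_hosts
        else:
            ok = False  # unknown event kinds are treated as violations
        if not ok:
            bad.append((kind, target))
    return bad
```

The hard part is still deciding what goes into the allow-lists. The code only shows that once that decision is written down, enforcing it is mechanical.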
Learning from Others
Last but far from least, I was happy to read in the RFI outline that there was a goal to review what other package managers are doing in this space. In these notes I’ve mentioned the work from npm a few times, but more broadly I’d highly encourage us to proactively reach out to the maintainers of other package managers to collaborate on solutions. For example, I really enjoyed this talk from Adam Baldwin at npm that discusses some of the ongoing work they’re doing in this space.
This is a problem where package managers have many overlapping goals, many of the seemingly same problems, and as such would benefit from learning and building together.
Steve: this is why I said TUF and in-toto. You use both to get transparent end-to-end authenticity and integrity of your packages, from the moment developers checked in source code, CI built a package and uploaded it to PyPI, to the moment users download it from PyPI. See this blog post for an example of how Datadog used both to secure the packaging and distribution of our Agent integrations. By using both, you get very strong guarantees that, unless the original developers went rogue, packages were developed and built correctly. Does this clarify my point?
Please see this thread where we are trying to lay the foundation for TUF on PyPI, so that we can integrate in-toto to detect malicious content in the future.
Aren’t most of the current issues with malicious packages from rogue developers? Are developers currently being targeted by MiTM attacks when uploading to NPM or wherever? I don’t understand how in-toto or TUF solves what looks to be the primary issue.
Without end-to-end signing, I’m also not sure how TUF and in-toto protect against cases where the PyPI account credentials are stolen. How is developer key rotation handled with in-toto? How are forgotten signing keys disambiguated from a key update from a compromised developer account?
I can see how it makes it easier to revoke multiple uploads signed by the same key and can improve auditability, but I don’t see how this really affects the cost of uploading malicious content.
TUF protects against compromises of the publishing infrastructure, which do happen (see this list of compromises).
While PEP 458 “only” protects against malicious CDNs/mirrors (artifacts are signed by PyPI), PEP 480 protects against compromises of PyPI too (artifacts are signed by developers before uploading and by PyPI).
in-toto extends end-to-end signing further up the supply chain and protects against rogue developers with signature thresholds for any step of the supply chain (release signing, building, packaging, etc…).
Rotation of in-toto keys that are authorized to sign for a step is baked into in-toto. It is done by releasing a new in-toto supply chain definition, which defines these steps, how they depend on each other, who is authorized to provide signed evidence for them, and how much evidence it needs (thresholds!).
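To illustrate just the threshold idea, here is a toy model. It does not use the in-toto library, performs no cryptographic verification, and the class and field names are made up for this example. A step is satisfied only when enough distinct authorized keys have provided evidence for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Toy stand-in for one supply-chain step; not the in-toto data model."""
    name: str
    authorized_keyids: frozenset
    threshold: int


def step_satisfied(step: Step, signer_keyids) -> bool:
    """True if enough distinct authorized keys vouched for the step.

    signer_keyids is assumed to contain only key IDs whose signatures were
    already verified elsewhere; no cryptography happens here.
    """
    valid = set(signer_keyids) & step.authorized_keyids
    return len(valid) >= step.threshold
```

With a threshold of 2, a single rogue or compromised key cannot satisfy the step on its own. In this toy model, rotation corresponds to publishing a new Step with a different authorized_keyids set.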
Rotation of in-toto root of trust keys OTOH can be done with TUF, as Trishank describes in his blog post.
PEP 458 should be seen as a stepping stone for PEP 480, which could be seen as a stepping stone for full supply chain protection with TUF and in-toto.
However, each of these additions not only paves the way for the next one, but also provides meaningful security guarantees by itself.
Maybe I’m misunderstanding something. Who generates and signs the root.layout: PyPI or the developer?
I can see how TUF can be used for distribution of in-toto public keys from PyPI to end-users; I think I’m having trouble understanding the key rotation mechanism between developers and PyPI.
In Trishank’s blog post it looks like the wheels-signer key is signed by the snapshot key, in which case there is no issue updating the keys, as they are internal to the same organization. Rereading PEP 480, it looks like some of my misunderstanding is about PyPI running build infrastructure.
As for malicious content, I don’t understand how TUF or in-toto solves the problem of someone creating a new PyPI account and uploading notacryptominer/0.1.0, which as I understand it is the most common issue.
If administrators use offline keys to distribute, revoke, and rotate developers’ keys (which should also be offline themselves), then the problem is solved. For example, in the Datadog implementation, the top-level targets role uses offline keys to distribute public keys for the in-toto root.layout (these private keys are also offline). Is this clearer?
How does a developer revoke an Ed25519 key or authenticate a new key to PyPI?
What prevents an account from being compromised and having a new developer key uploaded to sign a malicious update?
I’m also still not sure I understand who signs the