Typosquatting, dependency confusion, supply chain attack: call it what you wish

Firstly, I am quite a beginner with Python, and to this day I am still confused by the fact that Python scripts found on the web come with import statements like “import foo”, where “foo” is a dependency whose origin has to be investigated and guessed when no requirements.txt is provided. Why can’t we just write scripts that refer to modules by their PyPI package names, e.g. pypi.mypackage.foo?

Secondly, this week a workmate who is far more skilled than me with Python wanted to do a “quick and dirty” debugging job on an embedded device using a Python script of his own. I could watch the whole process over his shoulder. He wanted to access the device over a serial link, so he started his script with “import serial”. He then got an exception because the dependency was missing from his system. He rushed to “pip install serial”, only to end up with unexpected errors in his 10-line script. Then he recalled that the proper PyPI package is not named “serial” but “pyserial”. He immediately uninstalled the serial package, and after a new “pip install pyserial” everything was fine. The debugging was done. Hooray for Python and its incredible agility, allowing a quick debugging session in less than 15 minutes!

But then I opened the following discussion with my workmate: what about the fact that you ran a random package named serial on your company computer? Are you sure your computer is not compromised now? His answer was… well, I’m sure packages on PyPI are properly reviewed by the community because it is open source, and I am quite confident that it should be OK. I was not convinced.

Later on I did my own research, only to find out that the PyPI package “serial” is hosted on Bitbucket and that the project and its sources are not public (anymore?), meaning that it is not possible to know whether malicious content is hidden in the package that was unintentionally run on my workmate’s computer.

Now, I am not here to ask for help investigating this particular suspicious package. I am more worried about the fact that in the past I have indeed installed PyPI packages using lucky guesses when I was missing a dependency. And I suddenly realized that some people actually count on people like me to propagate their malicious payloads. Even though I am now aware of this risk and will be careful in the future, I still feel vulnerable, since even an experienced engineer like my workmate can easily be misled by the mismatch between the package name and the import statement.

From the forum I could see that the issue is known as “typosquatting”: Improving risks and consequences against typosquatting on pypi

Or even “dependency confusion” (which better fits with my example of pyserial vs serial) :

Now I would like some thoughts and guidelines on this issue. Is there an official working group I can follow? What are the official recommendations to mitigate the risks? For example, can we be informed when malicious code has been discovered in a PyPI package that was once installed on a computer? Is there a book, help file, or article teaching what needs to be known?

In the meantime, what should we do to avoid our company network being compromised by a new intern who is unaware of this threat, as I was (or even by my own human weaknesses)? Block PyPI at our firewall, maybe? Set up some whitelist mechanism for PyPI packages?

There’s a recent standard for that and it’s starting to be supported by tools like Hatch and pipx.

That is an utter misconception. There is no review of packages published on PyPI. “It is open source” guarantees that you can read the source code. You still have to do it!

What PyPI does have is a posteriori mitigation: when malware is found and reported, it is deleted from the index. That’s what it is: a mitigation.

There’s also a mitigation against packages with too similar names. I suspect pyserial and serial were both created before that was put in place.
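
To give a feel for what a “too similar names” check can look like, here is a minimal sketch using the standard library’s difflib. It is not PyPI’s actual rule, which is not reproduced here; the function name `similar_known_names`, the list of known names, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
import difflib


def similar_known_names(candidate, known_names, cutoff=0.8):
    """Return well-known names that look confusingly close to `candidate`.

    An exact match is not reported: asking for the real package is fine.
    """
    candidate = candidate.lower()
    others = [name for name in known_names if name.lower() != candidate]
    return difflib.get_close_matches(candidate, others, n=5, cutoff=cutoff)


if __name__ == "__main__":
    # Hypothetical list of names a team actually relies on.
    known = ["pyserial", "requests", "numpy"]
    print(similar_known_names("serial", known))
```

A check like this could run before a “pip install” as a reminder to double-check the name; it only flags look-alikes and says nothing about whether a package is malicious.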

That’s the best you get. Curation like manually verifying each newly created project is provided by other distributors (e.g., Anaconda and conda-forge), but as a counterpart, you get fewer packages. There is no silver bullet for this problem.

Of course it is possible: get the package contents from serial · PyPI and inspect them. (The .whl file is a ZIP archive in disguise.)
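
Since a wheel is a ZIP archive, it can be opened with the standard library without installing or importing anything from it. The following is a minimal sketch, assuming the .whl file has already been downloaded by hand from the project’s page on PyPI; the function names are illustrative.

```python
import zipfile


def list_wheel_files(wheel):
    """Return (filename, size_in_bytes) pairs for every member of a wheel.

    `wheel` is a path or a binary file object. Nothing is imported or
    executed: the archive is only read as a ZIP file.
    """
    with zipfile.ZipFile(wheel) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def read_wheel_member(wheel, member):
    """Return the text of one file inside the wheel, for manual review."""
    with zipfile.ZipFile(wheel) as archive:
        return archive.read(member).decode("utf-8", errors="replace")
```

Calling `list_wheel_files` on the downloaded file gives the list of modules to read, and `read_wheel_member` returns the source of any one of them as text.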

Well, dependency confusion is a bit different, it’s rather about getting a package with a given name from an index you didn’t expect, when you use several indices at the same time (e.g., PyPI plus an internal corporate index). Typosquatting is the appropriate term here.
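
The difference can be illustrated with a deliberately simplified model. It assumes, as I understand pip’s behaviour with several indexes, that candidates from all indexes are pooled and the highest version wins regardless of where it comes from. The index names and version numbers below are hypothetical.

```python
def pick_candidate(candidates):
    """Pick the candidate with the highest version, ignoring its index.

    `candidates` is a list of (index_name, version_tuple) pairs.
    """
    return max(candidates, key=lambda candidate: candidate[1])


# Hypothetical situation: an internal package also exists, under the same
# name, on the public index with a much higher version number.
candidates = [
    ("internal-index", (1, 4, 0)),
    ("public-index", (99, 0, 0)),
]
chosen_index, chosen_version = pick_candidate(candidates)
print(chosen_index, chosen_version)
```

In this model the public upload is selected even though the internal one was intended, which is the confusion the term refers to. No misspelled name is involved, unlike typosquatting.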

On that specific topic, not that I know. These issues are handled by the general PyPI staff (which is extremely scarce, unfortunately). You might want to read the PyPI blog.

There is no automated way at present; see “Pip must notify people that they have been compromised by a malicious package”. EDIT: Wrong, have a look at pypa/pip-audit on GitHub, which audits Python environments, requirements files and dependency trees for known security vulnerabilities, and can automatically fix them.

Yes, you can block PyPI and require all installs to be done through an internal index, run using, e.g., pypiserver or simpleindex.

As @jeanas mentioned, there’s been work on a system for this - specially formatted comments in a source file will tell an external tool which packages are needed from PyPI. It would still be up to the user to name those packages; they can’t be inferred from the import statements.

Making it so that “referring to modules” in the import statement uses PyPI names, is a complete non-starter - for several reasons:

  • A PyPI package (distribution package) can provision more than one name to import in the code (import package). It can provision individual modules, too.
  • A PyPI package name can have hyphens in it, which wouldn’t be valid Python syntax there.
  • When you install something, you may also have to specify a version (or range of acceptable versions) for the package, and that again wouldn’t fit in the existing Python syntax.
  • Installing a package can be long and time consuming and produce pages and pages of debugging output. It would be disruptive to trigger that from within a script.
  • Pip, and PyPI, aren’t actually all that privileged in the Python ecosystem. Python came first, and wasn’t originally designed with any of those systems in mind. In fact, one of the big original selling points of Python was that the standard library was considered quite comprehensive.
  • Developers get to choose a name to use on PyPI for the project, and are more or less stuck with it. If they decide for whatever reason that the importable package name should change, that won’t change the PyPI name.
  • Sometimes, projects get forked, or the maintenance gets taken over by another person or organization that wants to use a different PyPI name in order to make clear who’s providing the code. In these cases, users may want a choice of which PyPI package to use, and this way they don’t have to change the import in the code. For example, something like this happened with PIL, which is why you pip install pillow now (the original fork disappeared into the ether many years ago).

Complex topic.

Some random notes from me.

Yes, hosting your own server or mirror or proxy is a viable solution:

Seems like there are commercial solutions available with curated packages (at least that is what I understood from the descriptions, not endorsement from me):

There are some tools you could add to your infrastructure (CI/CD pipelines for example), just to name a few:

As far as I can tell, the major code forges (GitHub, GitLab) have built-in tools and tooling to warn against potential security issues in your code:

There is (was) a proposal to strengthen Python packaging ecosystem against “dependency confusion attacks”:

Thank you all for your valuable answers.

Well, this is a big relief. I didn’t know that! Thank you.

I understand that the idea of this new feature is to include the “requirements.txt” in the header comments of the script (as metadata). This looks acceptable, even though I find it quite unsexy compared with the rest of the language. I would have preferred an improved “import” statement capable of explicitly mapping PyPI packages like a “pip install” does. Well, @kknechtel has clearly explained in his answer that we are far from getting this feature in Python :frowning: .

My own belief is that if a package is actively used by thousands of other users, then it is probably safe (safe beyond what can be reviewed by an average user). Pyserial, for example, has 3109 stars on GitHub and many forks, while Serial is at release 0.0.97 and has a private repo hidden on Bitbucket. The latter would probably require a proper review prior to using it (or should just be kept away). Yet I have to admit that now that you have taught me that PyPI religiously keeps a version of the source code, my panic level has decreased. This brings some comfort to my mind, and I guess time will tell if this package gets reported some day :slight_smile: .

I tend to doubt that conda-forge packages are checked for malware. My understanding was that the focus is more on the consistency of a project and its dependencies. So I am guessing that a typosquatting package with no dependencies would join conda-forge as easily as it joins PyPI. Only a personal opinion (as a newbie).

That says it all. Thank you.

I will check this out.

Thank you for your answer. I understand that this limitation is perfectly known.
Yet I keep the faith that someday it will become possible despite all the difficulties you’ve mentioned :slight_smile:
I may even try to contribute to the effort some day, since dependency management in Python is truly a pain point for me even though I love the language.

I am a simple Python user in a big organization. I don’t even use Python to deliver software to our customers, only to cover my own needs for automating simple tasks. I am tempted to forward your suggestions to my IT department for them to evaluate the threat. But I fear that their next move could be to completely ban PyPI with no other options provided.

Python is a powerful tool and, as Uncle Ben said, “with great power comes great responsibility” :slight_smile: .
I’ve decided that my first step should be to become a more responsible Python user and do what can be done at my level first. Still, tomorrow I will go and have a talk with the new interns…

If the language were being designed from scratch today, the choice might have been different (or not), but we’re building on an existing ecosystem where package names have been allowed to differ from import names forever. There are further arguments in this section of PEP 723.

Well, PyPI cannot verify that the GitHub repository associated with the package metadata is the correct one (search for “starjacking”).

All it has is download counts. These are not necessarily representative and can be manipulated easily.
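
For a quick look at what a project declares about itself before installing it, here is a minimal sketch that reads PyPI’s public JSON metadata. The field names (`info`, `project_urls`, `releases`, `upload_time_iso_8601`) reflect my understanding of that API, and the function names are illustrative. As said above, the declared repository link is self-reported and proves nothing on its own.

```python
import json
import urllib.parse
import urllib.request


def fetch_project_json(name):
    """Download a project's public JSON metadata from PyPI (needs network access)."""
    url = f"https://pypi.org/pypi/{urllib.parse.quote(name)}/json"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def summarize_project(data):
    """Extract a few fields worth reading before installing anything."""
    info = data.get("info", {})
    releases = data.get("releases", {})
    uploads = [
        file["upload_time_iso_8601"]
        for files in releases.values()
        for file in files
        if "upload_time_iso_8601" in file
    ]
    return {
        "name": info.get("name"),
        "summary": info.get("summary"),
        "declared_urls": info.get("project_urls") or {},  # self-declared, unverified
        "release_count": len(releases),
        # Timestamps in the same ISO 8601 format compare correctly as strings.
        "first_upload": min(uploads) if uploads else None,
        "last_upload": max(uploads) if uploads else None,
    }
```

Something like `summarize_project(fetch_project_json("pyserial"))` gives a starting point for a manual look; a project with no declared URLs, or with a single recent upload, deserves a closer read of its actual files.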

I think it would not have been, simply because I’m not aware of any other languages working this way (including ones that were developed quite recently). It’s essentially saying that everyone is allowed to contribute to the standard library without vetting, and it also causes problems when, say, a program must be denied an Internet connection for security reasons.

There’s generally enough review of new contributions to conda-forge to rule out typosquatting attacks. Probably even the long-standing contributors to the repository wouldn’t be able to add a new recipe without someone else having to review it.

And once you rule out typosquatting (and other packages you never intended to install), you’ve avoided 99.9% of the malicious code on PyPI. The only other issues I’m aware of are 1-2 instances of credential theft (turn on MFA and use a reliable email provider!) and “protest-ware”,[1] which some consumers treat the same as malicious.

If you’re in an environment where you need further protection, Anaconda and ActiveState are two companies that will sell you support and a guarantee to check for malware before providing you packages. PyPI will never have this level of curation - at best, automated scans will block uploads, but that just means uploaders will tweak their code until it’s allowed and bypass the scans :man_shrugging:


  1. Essentially code that injects ads, but for a cause rather than a profit. ↩︎

I think the change under discussion was just “package names must match import names”, not anything as far-reaching as “automatically download packages on import”.

I believe it’s how Go works - you do something like import "github.com/google/uuid". You still (I think) need to manually add the module to the project and trigger the download, but there’s a direct correspondence between what you type in your source code and what gets downloaded.

I’m not personally convinced it’s a good idea - it presumably makes re-hosting a project a nightmare - but it does exist as an idea.

Actually, doesn’t JavaScript do something similar?

OMG! This PEP 723 is gold! It perfectly targets all my concerns, especially the “single file scripts” requirements. From now on, I will comply with any existing or future recommendations arising from it without discussion!

I did, and I’ve learnt: 3-minute YouTube content here
Frightening!

That’s quite a convincing point… and disappointing at the same time. However, if it wasn’t for PEP 723, I would still not believe it :slight_smile:

I could have suggested something like
import serial availablein ["PyPi/pyserial/", "condaforge/pyserial", "myserver/pyserial"] withversion ">3.5"
But I won’t, since PEP 723 covers it all and I’ve said “no discussions”.

OK. I guess I will start to learn using conda then. I guess that sticking with Anaconda would be overkill for my usage.

Thank you all for your answers.
