First, I am quite a beginner with Python, and to this day I am still confused by the fact that Python scripts found on the web come with import statements like “import foo”, where “foo” is a dependency whose origin has to be investigated and guessed when no requirements.txt is provided. Why can’t scripts simply refer to modules by their PyPI package names, e.g. pypi.mypackage.foo?
Second, this week a workmate who is far more skilled with Python than I am wanted to do a “quick and dirty” debugging job on an embedded device using a Python script of his own. I watched the whole process over his shoulder. He wanted to access the device over a serial link, so he started his script with “import serial”. He then got an exception because the dependency was missing from his system. He rushed to run “pip install serial”, only to get unexpected errors from his 10-line script. Then he recalled that the proper PyPI package is not named “serial” but “pyserial”. He immediately uninstalled the serial package, and after a “pip install pyserial” everything was fine. The debugging was done. Hooray for Python and its incredible agility, allowing a quick debugging session in less than 15 minutes!
But then I opened the following discussion with my workmate: what about the fact that you just ran a random package named serial on your company computer? Are you sure your computer is not compromised now? His answer was roughly: “Well, I’m sure packages on PyPI are properly reviewed by the community because it is open source, so I am quite confident that it should be OK.” I was not convinced.
Later on I did my own research, only to find out that the PyPI package “serial” is hosted on Bitbucket and that the project and its sources are not public (anymore?). This means it is not possible to check, from the project page at least, whether malicious content is hidden in the package that was unintentionally run on my workmate’s computer.
Now, I am not here to ask for help investigating this particular suspicious package. I am more worried about the fact that in the past I have indeed installed PyPI packages based on lucky guesses when I was missing a dependency. And I suddenly realized that some people actually count on people like me to propagate their malicious payloads. Even though I am now aware of this risk and will be careful in the future, I still feel vulnerable, since even an experienced engineer like my workmate can easily be caught out by the mismatch between a package name and the name used in the import statement.
From the forum I can see that the issue is known as “typosquatting”: Improving risks and consequences against typosquatting on pypi
Or even “dependency confusion” (which seems to fit my example of pyserial vs serial better):
Now I would like to hear some thoughts and guidelines on this issue. Is there an official working group I can follow? What are the official recommendations to mitigate the risks? For example, can we be informed when malicious code has been discovered in a PyPI package that was once installed on a computer? Is there a book, help file, or article that teaches what needs to be known?
In the meantime, what should we do to avoid our company network being compromised by a new intern who is unaware of this threat, as I was (or even by my own human weaknesses)? Block PyPI at our firewall, maybe? Set up some mechanism for an allowlist of approved PyPI packages?
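To make the allowlist idea concrete, here is a minimal sketch of a check that could run before any “pip install -r requirements.txt”. Everything in it is an assumption for illustration: the ALLOWED set, the helper names, and the simplification that each requirement line starts with a plain project name (no URLs or local paths). Names are normalized the way PEP 503 describes, so “PySerial” and “pyserial” compare equal.

```python
import re

# Hypothetical allowlist of approved projects, stored in normalized form.
ALLOWED = {"pyserial", "requests"}


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line):
    """Extract the project name from a simple requirements.txt line, or None.

    Comments, blank lines and pip option lines (starting with '-') are skipped.
    """
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("-"):
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return normalize(match.group(0)) if match else None


def rejected(lines, allowed=ALLOWED):
    """Return the sorted project names that are not on the allowlist."""
    names = (requirement_name(line) for line in lines)
    return sorted({name for name in names if name and name not in allowed})


if __name__ == "__main__":
    sample = ["pyserial==3.5", "serial", "# a comment", "Requests>=2"]
    print(rejected(sample))
```

With the sample above, only “serial” is flagged. A check like this is only a convenience on top of whatever is enforced at the network or package-index level; it does nothing against someone typing “pip install” by hand.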