Help with malicious repository

I need to create a project in Python which checks the maliciousness of a repository.

Define “maliciousness”.

I need to check whether the repository contains any harmful code.
It is malicious if it contains malicious code.

This is a very complex task. What is your experience?

  1. There is no simple and clear definition of harmful code. It even differs based on your requirements.
  2. Even if you resolve point 1, it is practically impossible to detect “any harmful code”, because this is a very fast-moving target (there are always new platforms, languages, libraries, APIs, vulnerabilities, obfuscation techniques…).
  3. Because of point 2, antimalware programs combine many techniques, including pattern matching, heuristics, AI, sandboxing and behaviour analysis.

Based on your question I am guessing you are not going to develop your own antimalware. Then your first step would be to check the existing solutions for malware and vulnerability detection in source code.

Unfortunately, I think that almost none of the existing solutions aim to detect intentional maliciousness. They are instead intended for detecting vulnerabilities caused by programmers’ negligence, lack of knowledge, and use of vulnerable components.

Here are interesting lists I was able to find:

For open source software focused on identifying known malware
payloads, I recommend ClamAV.

If you’re not looking for copies of malware someone else has already
identified though, it is indeed a tall order and probably not
something you’re going to do unless you already happen to do it
professionally after many years of hands-on experience taking apart
and understanding malware samples.

I have only minimal experience with ClamAV, but I think its ability to detect malware in source code is almost zero. I think that by “repository” @farhaan710 meant a source code repository.

Not all programming languages are compiled languages. I can think of
at least one off the top of my head, where the source code is
interpreted at runtime.

But you’re correct, basically all malware scanners are focused
primarily on signatures for compiled payloads, because malware is
generally only dangerous once it’s in a runnable form and most
malware is written in compiled languages.

It’s easy enough to check whether ClamAV can detect hostile source code.

Create a file called “malicious-do-not-run-this.py”:

# Seriously don't run this code.
# ESPECIALLY not as root, but even as a regular user it will do bad things to your system.
import os
os.system('@rm -rf /')  # Really don't run this.

Now delete the @ symbol from the file, save the file, and run ClamAV over it. If ClamAV identifies that as malware, I will be impressed and surprised.

Does GitHub not do this kind of operation with its codeql-analysis.yml, or is that something else entirely?

The bickering over what tools are appropriate is sort of pointless,
since the original question is beyond vague and so not all of us
agree on what was meant by “repository.” Some people are answering
based on an assumption that the question was about identifying
source code repositories (e.g. Git) for software which does
malicious things, while I assumed the question was about identifying
instances of known malware payloads served from a software package
repository (e.g. PyPI). But really, without a much more detailed
question, we’re all filling the vacuum with our own assumptions.

Since some software is both source and executable at the same time,
it stands to reason that malware scanners like ClamAV may contain
signatures for the source code of known malware which is in
circulation, though I agree that the amount of that is likely to be
quite low for a variety of reasons. As to whether it’s possible to
trivially create new malware which doesn’t match existing signatures
in such a scanner, well… duh. It’s designed to look for copies of
known malware already in broad circulation, not for identifying new
malware.
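To illustrate the point about signatures, here is a minimal sketch of substring-style signature matching. The `SIGNATURES` table, the `match_signatures` helper and the marker string are all made up for this example, and this is not how ClamAV stores or evaluates its signatures; it only shows why a scanner that looks for known byte patterns misses a trivially altered copy.

```python
# Hypothetical signature table: name -> byte pattern to look for.
# Real scanners use far more elaborate signature formats; this is illustration only.
SIGNATURES = {"demo-sample": b"harmless_marker_string_12345"}


def match_signatures(data, signatures=SIGNATURES):
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


# An exact copy of the "known" sample contains the pattern.
known_copy = b"print('harmless_marker_string_12345')\n"
# A trivially edited copy behaves the same when run, but the bytes no longer match.
modified_copy = b"print('harmless_marker_' + 'string_12345')\n"

print(match_signatures(known_copy))     # ['demo-sample']
print(match_signatures(modified_copy))  # []
```

The second result is empty even though both snippets print the same text, which is the limitation described above.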

GitHub is not the only code hosting site.

GitHub does allow the repo owner to scan their own code for security vulnerabilities, but that is not the same as scanning other people’s repos for malicious code.

If one can identify malicious Python code, then it wouldn’t really matter whether you were looking at a source repo like GitLab or GitHub, or a package repo like PyPI.

Not necessarily; some malware scanners use non-signature based heuristics which may (allegedly) be able to detect polymorphic malware and new, unknown attacks.

To get back to the original poster’s question: Python may be useful for this, as it can

  • connect to websites, including software repos;
  • download files;
  • read the files;
  • parse them looking for signatures;
  • and analyse them for non-signature based threat detection.

The hard parts are deciding what to look for and avoiding false positives.
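As a rough sketch of the “read, parse, analyse” steps, the standard library’s `ast` module can parse Python source without executing it and report calls that might deserve a closer look. The `SUSPICIOUS_CALLS` watch list and the helper names below are assumptions chosen for illustration, not an established rule set, and such a list will produce false positives, because plenty of legitimate code calls these functions.

```python
import ast

# Hypothetical watch list; a real tool would need a much richer, tuned set.
SUSPICIOUS_CALLS = {"eval", "exec", "os.system", "subprocess.Popen", "subprocess.run"}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_suspicious_calls(source):
    """Return sorted (line number, call name) pairs for calls on the watch list."""
    findings = []
    # ast.parse only builds a syntax tree; the scanned code is never run.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)


sample = "import os\nos.system('echo hello')\nprint(eval('1 + 1'))\n"
print(find_suspicious_calls(sample))  # [(2, 'os.system'), (3, 'eval')]
```

This is easy to evade (for example by aliasing the import or building the name with `getattr`), so it is at best one signal among many.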

@vbrozik by repositories I mean ones like PyPI, npm, etc.

@steven.daprano I have made a program that checks repos for files with the extensions ‘.exe’, ‘.dll’, ‘.sys’, ‘.doc’, ‘.docx’, ‘.xls’, ‘.xlsx’, ‘.py’, ‘.xml’, ‘.cfg’, ‘.txt’, ‘.ppt’, ‘.pptx’, ‘.hwp’.
I edited password123456/malwarescanner (a simple malware scanner written in Python, hosted on GitHub),
but it does not work with .py files.
It reports “0 files scanned”.

By Farhaan Ustad Syed via Discussions on Python.org at 26Jul2022 19:34:

@steven.daprano I have made a program that checks repos with ‘.exe’,
‘.dll’, ‘.sys’, ‘.doc’, ‘.docx’, ‘.xls’, ‘.xlsx’, ‘.py’, ‘.xml’,
‘.cfg’, ‘.txt’, ‘.ppt’, ‘.pptx’, ‘.hwp’
I have edited GitHub - password123456/malwarescanner: Simple Malware Scanner written in python
but it does not work with py.
it reads as “0 files scanned”

That will be because Python programme files end in “.py”, which is not
one of the extensions listed above. I’ve run my eye over the code in
malwarescanner, and it simply ignores files with other extensions.

For others on this thread, malwarescanner is a very basic scanner for
files in a local file tree, which checks them against SHA256 hashes
which appear to be obtained from
https://bazaar.abuse.ch/export/txt/sha256/full/.

So this is a pure checksum approach with no code analysis.
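For readers who want to see what such a checksum approach looks like, here is a minimal sketch. It is not the code from malwarescanner; the function names, the `SCAN_EXTENSIONS` set and the demo files are assumptions for illustration. Note that a file whose suffix is not in the extension set is never counted, which is one way a scan can end up reporting fewer files than expected.

```python
import hashlib
import tempfile
from pathlib import Path

# Assumption: only files with these suffixes are scanned; all others are skipped.
SCAN_EXTENSIONS = {".exe", ".dll", ".py"}


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_tree(root, known_bad_hashes, extensions=SCAN_EXTENSIONS):
    """Return (number of files scanned, list of paths whose hash is known bad)."""
    scanned = 0
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue  # directories and unlisted extensions are ignored
        scanned += 1
        if sha256_of(path) in known_bad_hashes:
            matches.append(path)
    return scanned, matches


# Demo with harmless stand-in files in a temporary directory.
payload = b"print('pretend payload')\n"
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bad.py").write_bytes(payload)
    (Path(tmp) / "notes.md").write_bytes(payload)  # same bytes, unlisted suffix
    known_bad = {hashlib.sha256(payload).hexdigest()}
    scanned, matches = scan_tree(tmp, known_bad)
    match_names = [p.name for p in matches]

print(scanned, match_names)  # 1 ['bad.py']
```

In practice `known_bad_hashes` would be a set of lowercase hex digests loaded from a published hash list; changing a single byte of a file changes its hash, so this only finds exact copies.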

Cheers,
Cameron Simpson cs@cskk.id.au
