Introduction:
This blog post looks at process relationships, specifically those found in malware sandbox execution data. One of the essential functions of a malware sandbox is to gather and display process execution. For example, if winword launches powershell, you'd want to know that.
One issue that I run into while doing research is that many of the public/free malware sandboxes don't allow me to search based on process relationships. For example, if I have a sample on an endpoint that executed whoami, nslookup, and systeminfo, I would like to be able to search sandbox reports to see which malware families or samples do that.
The other thing I'm interested in as a researcher is trends over time in initial access execution, either for specific malware families or in general. One of the Twitter accounts I follow is https://twitter.com/pr0xylife, and they post information about how malware such as qakbot does command/process execution on a system.
I find this information interesting, and the threat actors have obviously made changes over time. Maybe they are using new LOLBINs more than before.
Finally, process relationship data is useful for spotting new or rare executions. If you're collecting the data, you can search for executions that rarely show up.
All of this research should also help with detection engineering, or with emulation if you're trying to match a specific threat.
POC Implementation:
As a proof-of-concept, I decided to implement a searchable database that lets me collect data from malware sandbox reports and search for parent-child process relationships.
I acquired my data from the Hybrid-Analysis Public Feed, which gives you a JSON file with around 250 recent malware analysis results. I also got data from the Zero2Auto CAPE sandbox (https://zero2auto.com/ Thanks for letting me use the data!)
I initially looked at graph databases, but querying them seemed annoying to me, so I didn't look into them too much.
The second thing I tried was joining the process execution data manually in Python, which was a horrible idea. The code turned out horrible and the dataset wasn't fun to work with. (https://github.com/BoredHackerBlog/sandbox_process_relationships/blob/main/hybrid-analysis_public_feed.py)
CAPE and Hybrid-Analysis record process execution data differently, but one thing they have in common is a process list JSON object. Each process object has process metadata, a parent process id, and of course the process id.
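To make the idea of a process list concrete, here is a minimal sketch of flattening one report into rows. The report layout and field names (`report_id`, `processes`, `pid`, `parent_pid`, `name`, `command_line`) are assumptions for illustration only. They are not the real CAPE or Hybrid-Analysis schema, so each sandbox would need its own parser.

```python
import json

# Hypothetical report layout. The field names are assumptions for
# illustration, not the actual CAPE or Hybrid-Analysis schema.
SAMPLE_REPORT = json.loads("""
{
  "report_id": "report-1",
  "processes": [
    {"pid": 100, "parent_pid": 50, "name": "winword.exe",
     "command_line": "winword.exe invoice.doc"},
    {"pid": 200, "parent_pid": 100, "name": "powershell.exe",
     "command_line": "powershell.exe -nop"}
  ]
}
""")


def extract_process_rows(report):
    """Flatten one report into (report_id, pid, parent_pid, name, command_line) tuples."""
    rows = []
    for proc in report.get("processes", []):
        rows.append((
            report["report_id"],
            proc["pid"],
            proc["parent_pid"],
            proc.get("name", ""),
            proc.get("command_line", ""),
        ))
    return rows


rows = extract_process_rows(SAMPLE_REPORT)
for row in rows:
    print(row)
```

Keeping the report id on every row matters because process ids are only unique within a single sandbox run.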
I decided to use duckdb to analyze the data. (Usually I'd use sqlite, but I wanted to try out duckdb, and it worked fine.)
I created a table with:
Then I loaded the results from CAPE or Hybrid-Analysis into the table. I'm loading the same type of data from both, but parsing their JSON reports is obviously different.
Finally, I created a view with a self-join, where I ensure that the report id is the same and that the child's parent process id matches the parent's process id.
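The table and self-join view described above can be sketched roughly like this. It is not the schema from the repo: the table, view, and column names are assumptions, and the example uses Python's built-in `sqlite3` instead of duckdb so it runs without extra packages. The SQL is plain enough that it should largely carry over to duckdb.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One row per process per report. Column names are assumptions.
con.execute("""
    CREATE TABLE processes (
        report_id    TEXT,
        pid          INTEGER,
        parent_pid   INTEGER,
        name         TEXT,
        command_line TEXT
    )
""")

con.executemany(
    "INSERT INTO processes VALUES (?, ?, ?, ?, ?)",
    [
        ("report-1", 100, 50, "winword.exe", "winword.exe invoice.doc"),
        ("report-1", 200, 100, "powershell.exe", "powershell.exe -nop"),
        # Same pid as above but a different report, so it must not join.
        ("report-2", 100, 4, "explorer.exe", "explorer.exe"),
    ],
)

# Self-join: a child row pairs with the parent row from the same report
# whose pid equals the child's parent_pid.
con.execute("""
    CREATE VIEW process_relationships AS
    SELECT
        child.report_id,
        parent.name         AS parent_name,
        parent.command_line AS parent_command_line,
        child.name          AS child_name,
        child.command_line  AS child_command_line
    FROM processes AS child
    JOIN processes AS parent
      ON child.report_id = parent.report_id
     AND child.parent_pid = parent.pid
""")

relationships = con.execute(
    "SELECT report_id, parent_name, child_name FROM process_relationships"
).fetchall()
print(relationships)
```

The report id condition in the join is what keeps a pid reused across different sandbox runs from producing false parent-child pairs.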
The resulting view contains:
Gathering data from the sandbox reports and putting it in the database allows me to ask questions like these:
The results look kinda like this:
If you have a large enough dataset, you can extract more info, such as malware or campaign names, and keep track of trends.
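As a rough illustration of the kinds of questions a joined view can answer, the sketch below runs two queries: which reports show a given parent launching a given child, and which parent-child pairs are rarest across reports. The schema, names, and sample rows are made up for the example, and `sqlite3` is used in place of duckdb so it runs with only the standard library.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE processes (report_id TEXT, pid INTEGER, parent_pid INTEGER, name TEXT);
    CREATE VIEW process_relationships AS
    SELECT child.report_id, parent.name AS parent_name, child.name AS child_name
    FROM processes AS child
    JOIN processes AS parent
      ON child.report_id = parent.report_id AND child.parent_pid = parent.pid;
""")

# Made-up sample data: three reports, two distinct parent-child pairs.
con.executemany("INSERT INTO processes VALUES (?, ?, ?, ?)", [
    ("report-1", 100, 50, "winword.exe"),
    ("report-1", 200, 100, "powershell.exe"),
    ("report-2", 300, 50, "winword.exe"),
    ("report-2", 400, 300, "powershell.exe"),
    ("report-3", 500, 50, "wscript.exe"),
    ("report-3", 600, 500, "whoami.exe"),
])

# Question 1: which reports show winword.exe launching powershell.exe?
matches = con.execute(
    """
    SELECT report_id FROM process_relationships
    WHERE parent_name = ? AND child_name = ?
    ORDER BY report_id
    """,
    ("winword.exe", "powershell.exe"),
).fetchall()

# Question 2: how many reports contain each parent-child pair, rarest first?
pair_counts = con.execute("""
    SELECT parent_name, child_name, COUNT(DISTINCT report_id) AS reports
    FROM process_relationships
    GROUP BY parent_name, child_name
    ORDER BY reports ASC, parent_name ASC
""").fetchall()

print(matches)
print(pair_counts)
```

If each report also carried a family name or submission date (not shown here), grouping on those columns would be one way to follow trends over time.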
Other solutions:
If you are already running malware in your own sandbox, check whether it lets you search based on process relationships.
You could also have a backend database that you can query, for example MongoDB or Elasticsearch, although I personally don't know about the join capabilities of those databases.
Alternatively, if your sandbox supports pushing data out to Splunk, Elasticsearch, or any other place, you could try to work with that data. You may also be able to intercept that data and send it to a webhook or lambda for additional processing.
If you have a system that supports pulling data, maybe through an API, that's also a solution. You could have a script that pulls reports, parses the data, and processes it.
You can store the processed data in whatever database you feel comfortable using. I would personally use ClickHouse or PostgreSQL if I were doing this.
Links/Resources:
Code: https://github.com/BoredHackerBlog/sandbox_process_relationships
https://courses.zero2auto.com/
https://www.hybrid-analysis.com/
Also check out Grapl - https://github.com/grapl-security/grapl