When malware source code is leaked into the wild, opportunistic malware authors are often quick to analyze and repurpose the code to create new variants of their own malware, giving them another avenue to escape detection. This post, condensed from a SANS webcast featuring SANS Analyst Jake Williams and VMRay Threat Analysis Team Lead Tamas Boczan, discusses how leaked malware source code is co-opted and adapted by malware authors, explores new strategies for proactively identifying and mitigating new malware variants, and examines how to operationalize these source code leaks.
Studying and categorizing malware families might seem like an arcane academic exercise that would be of marginal utility to a busy security researcher, but SANS Analyst Jake Williams would likely disagree. As discussed in our joint SANS webcast last March (“Family Matters: Practical Malware Family Identification for Incident Responders”), identifying and tracking the activity and behavior of these malware families not only accelerates the analysis of individual samples but also helps incident responders think systematically about incoming attacks, achieving a more accurate high-level view of the threats they face.
Every security researcher knows that malware authors are continuously adapting their wares to evade detection and gain a foothold in the network. And when the source code of an established piece of malware is leaked into the public domain, savvy operators will often act quickly to copy and steal the useful parts of the leaked code to improve their own. For this reason, Jake advises that by investing the time to understand the family of origin of a particular malware strain, you’ll be better prepared when a new variant strikes.
Just like we see in nature, evolution happens incrementally. As Jake notes, “they’re adding capabilities on a piecemeal basis and this is useful because once the malware’s been discovered and triaged, you’re going to be able to reverse engineer it for analysis. Other organizations have likely been hit with some of these variants before so if you’re able to detect these slight drifts in these malware families then you’re able to be more proactive in your defense posture.”
Tamas weighs in on how important it is to know when a new variant comes out: it is key not only to detecting the variant more reliably but also to executing it correctly in a malware sandbox, so that you can observe all of its behaviors and, critically, stay ahead of new techniques that an attacker might adopt. Says Jake, “you want to make sure that everything that the malware is going to do on a target machine is also reflected in the sandbox, or else you might run into false negatives.”
YARA is one of the most important tools in every malware researcher’s toolbox, providing a rule-based approach to create descriptions of malware families based on text or binary patterns. As Jake explains, “once the reverse engineer analyst identifies unique patterns in the malware, YARA is the way we create and apply these rules around those patterns in the malware itself.” He notes that these rules shouldn’t be conflated with a simple file hash, since any slight modification to the malware will render a previous file hash useless.
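The difference between a file hash and a pattern-based rule can be illustrated with a minimal Python sketch. This is not YARA itself, only a stand-in for the idea: the byte pattern, the sample bytes, and the function names are all hypothetical and chosen purely for illustration.

```python
import hashlib

# Hypothetical byte pattern standing in for a family-specific marker
# that a reverse engineer might have identified.
FAMILY_PATTERN = b"beacon_v1::init"


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the sample as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_pattern(data: bytes, pattern: bytes = FAMILY_PATTERN) -> bool:
    """Pattern-style check: is the marker present anywhere in the sample?"""
    return pattern in data


# A made-up "sample" and a variant that differs by one appended byte.
original = b"\x4d\x5a" + b"\x00" * 16 + FAMILY_PATTERN + b"\x90" * 8
variant = original + b"\x01"

# The hash comparison fails for the variant, the pattern check does not.
hash_matches = sha256_hex(original) == sha256_hex(variant)
pattern_matches = matches_pattern(variant)

print("hash still matches:", hash_matches)
print("pattern still matches:", pattern_matches)
```

A real rule would typically combine several strings or byte sequences with a condition, but the same principle applies: the match depends on what is inside the file, not on the file as a whole.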
“We’re talking about an attacker grabbing a chunk of code that’s been leaked from somewhere and trying to bring it into another malware sample because ‘why write from scratch what I can steal for free?’” By reverse-engineering the malware, we can instead build a signature for the specific new capability and write a new YARA rule for it. Of course, as Jake points out, YARA rules are not a perfect solution, as malware authors will work to obfuscate the malware using a packer.
Manually unpacking malware is a tedious, time-consuming task. VMRay Analyzer aims to automate unpacking by using heuristics to dump the right memory regions at the right time during the malware’s execution. This approach produces memory dumps containing the deobfuscated malware, which YARA and VMRay Analyzer’s built-in anti-virus software can then detect.
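To see why scanning memory dumps helps where scanning the packed file does not, consider the toy sketch below. It assumes a single-byte XOR as a stand-in for a packer, which is far simpler than anything used in practice, and it does not reflect how VMRay Analyzer selects or dumps memory regions. The marker, the key, and the function names are hypothetical.

```python
from typing import Iterable, List

# Hypothetical marker for a capability copied from leaked code.
MARKER = b"grab_form_data"
XOR_KEY = 0x5A  # toy "packer" key, for illustration only


def xor_bytes(data: bytes, key: int) -> bytes:
    """Apply a single-byte XOR, used here as a stand-in for packing."""
    return bytes(b ^ key for b in data)


def scan_regions(regions: Iterable[bytes], pattern: bytes) -> List[int]:
    """Return the indexes of the dumped regions that contain the pattern."""
    return [i for i, region in enumerate(regions) if pattern in region]


payload = b"\x90\x90" + MARKER + b"\xc3"
packed_on_disk = xor_bytes(payload, XOR_KEY)

# Simulated dumps: region 0 is still packed, region 1 was captured
# after the sample unpacked itself in memory.
memory_dumps = [packed_on_disk, xor_bytes(packed_on_disk, XOR_KEY)]

static_hit = MARKER in packed_on_disk
dump_hits = scan_regions(memory_dumps, MARKER)

print("pattern found in packed file:", static_hit)
print("regions with a match:", dump_hits)
```

The pattern is invisible in the file on disk and only appears in the region captured after unpacking, which is why the timing of the dump matters.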
Jake offers a useful analogy about analyzing a de-obfuscated malware sample as being similar to having to go through a secondary screening when going through a customs or TSA inspection: “What we’re talking about here is somebody going through and taking everything out of your suitcase and trying to see if something is hidden in the lining and digging through all your stuff. They don’t do this to everyone because they are resource-constrained which is the same reason why malware analysts don’t run a full analysis on every sample.”
In the last section of his presentation, Jake talks about how we can operationalize these practices and apply them to take a more proactive defensive stance in the fight against malware. Again he notes the challenge that all but the most well-funded security teams face: resource allocation, priorities, and determining whether the return on investment makes economic sense.
Jake concludes his section with the following advice: “by capitalizing on reverse engineer generated signatures, this is going to help analysts discover related samples – there’s just no question about it – and this is going to put you in the driver’s seat for proactive protection.”
In the second half of the webinar, Tamas Boczan, VMRay’s Threat Analysis Team Lead, explores the challenging case of tracking malware after its source code leaks. He shares research on how variants can still be tracked even though they are based on the same source code, and how we can better defend against them in the future.
Tamas opens his portion of the webcast by discussing the interesting relationship that malware-as-a-service has to open-source software in general. “Most of the builders and the panels, the core components of the malware are still closed source. As Jake mentioned, it’s very typical to write a YARA signature for the closed source components that are generated by the builder.”
But threat actors leverage many open-source components as well. When a sample is unpacked, it is very typical to find that one of its stages or components is based on open source, though it’s the closed-source components that are usually tracked.
“It’s sometimes worth noting which malware uses which open-source component internally such as which open source privilege escalation tool is used by each malware because they don’t change this very often. Tracking based on closed source components works well until the malware authors have a breach and their source code leaks – then what we assume to be ‘closed-source’ is not anymore and it’s treated as an open-source project which can be used by anyone to make their own forks,” explains Tamas.
So how do you track and distinguish dozens or hundreds of slight variants from one another to the point where you’re not overwhelmed? As Tamas notes, “it’s really challenging because the code that we see is almost entirely the same and it’s often the client-side is almost completely identical except for some configurations. And it’s possible that only the configurations and the server-side changes but we still want to distinguish them from each other.” This is why Tamas argues that a different methodology than what we use to track closed source code is required.
To understand why a new tracking methodology is required, Tamas discusses his research into tracking the many variants of the Ursnif malware family. He notes that the individual names for these variants aren’t all that important – what matters is how the samples are organized into groups.
Tamas observes that Ursnif makes for a good case study example since the source code has been on GitHub since 2015, remarking that while its protracted availability has had many negative consequences for security teams, it’s been positive from the perspective of this research as we can observe its evolution over a substantial time horizon. Much like a commercial product, customer-facing changes drive the divergence from the leaked source code and consequently, we can orient our threat hunting based on these ‘customer feature requests.’
Tamas shows how certain modules of Ursnif were modified over time to respond to new security measures such as device fingerprinting and two-factor authentication. These modified modules facilitate browser attacks, keylogging, screen recording, and stealing data from email, instant messages, and FTP.
Based on his research, Tamas found that good data points for distinguishing Ursnif variants are the format strings used to create the network beacon at runtime and, even better, configuration items. Using VMRay Analyzer, the format strings can be observed during the malware’s execution and can be detected with YARA rules. The malware sample’s configuration can be extracted from the memory dumps and contains data useful for classification: the method by which the configuration is stored, the cryptographic keys used to communicate with the server and to encrypt certain parts of the configuration, the domains used to connect to the C2 server, and the identifiers used to store each configuration item.
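As a rough illustration of the first data point, the sketch below pulls beacon-style format strings out of a memory dump. The regular expressions, the threshold, and the example format string are assumptions invented for this sketch; they are not taken from Ursnif or from Tamas’s research.

```python
import re
from typing import List

# Runs of at least 8 printable ASCII characters.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{8,}")
# A "name=%spec" field such as "id=%u" or "key=%08x".
FORMAT_FIELD = re.compile(rb"[A-Za-z_]+=%[0-9]*[a-zA-Z]")


def beacon_format_strings(dump: bytes, min_fields: int = 2) -> List[str]:
    """Return printable strings that look like key=value format templates."""
    results = []
    for run in PRINTABLE_RUN.findall(dump):
        if len(FORMAT_FIELD.findall(run)) >= min_fields:
            results.append(run.decode("ascii"))
    return results


# Made-up dump content: one beacon-like template and one unrelated string.
dump = (
    b"\x00\x01"
    + b"ver=%u&id=%u&grp=%u&key=%08x"
    + b"\x00\xff"
    + b"Hello, world! %s"
    + b"\x00"
)

found = beacon_format_strings(dump)
print(found)
```

Small differences in the field names or field order of such a template are the kind of detail that can separate one variant from another even when the rest of the code is identical.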
Analyzing all of these individual elements generates a lot of data points. Tamas points out that “running this on thousands of samples we finally have all the data that we need to track the variants, and to distinguish them from each other. However, the issue now is that we have a bit too much data so we need to select which data points are reliable for distinguishing the variants and for this, visualization is very useful.”
The colorful results look like something one might see in the Museum of Modern Art. But these visualizations of more than 4,000 malware samples tell a compelling story about how these samples are both connected and separate from one another. Notes Tamas: “what’s interesting to us is that although these samples all look and behave very similarly, based on this data they formed completely distinct clusters that do not overlap at all. On this graph, it means that within these two clusters no cryptographic keys are shared – not even the ones that are used internally.”
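The idea of clusters that share no cryptographic keys can be sketched as a connected-components problem: two samples belong to the same group if they share at least one key, directly or through other samples. The snippet below is a minimal sketch using a union-find structure. The sample names and key labels are invented, and this is not a description of the tooling used to produce the visualizations in the webcast.

```python
from collections import defaultdict
from typing import Dict, List, Set


def cluster_by_shared_keys(sample_keys: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group samples that are linked by at least one shared key."""
    parent = {sample: sample for sample in sample_keys}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first sample seen for each key and merge later ones into it.
    first_owner: Dict[str, str] = {}
    for sample, keys in sample_keys.items():
        for key in keys:
            if key in first_owner:
                union(first_owner[key], sample)
            else:
                first_owner[key] = sample

    clusters = defaultdict(set)
    for sample in sample_keys:
        clusters[find(sample)].add(sample)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical extracted keys per sample.
samples = {
    "sample_a": {"key_1", "key_2"},
    "sample_b": {"key_2"},
    "sample_c": {"key_3"},
    "sample_d": {"key_3", "key_4"},
    "sample_e": {"key_5"},
}

clusters = cluster_by_shared_keys(samples)
for cluster in clusters:
    print(sorted(cluster))
```

With real data, the same grouping could be repeated for other configuration items, such as C2 domains, to check which data points separate the variants most reliably.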
From this point, the data set becomes usable information: by discerning the slight variations between samples, it can be applied to track existing variants and to proactively detect novel malware strains.
To learn more about malware source code leaks, watch the full webcast: When Malware Source Code Leaks: Challenges & Solutions for Tracking New Variants