In the past few years, a Python library for .NET file analysis has been developed internally at Palo Alto Networks. As the library is now stable, we decided it is mature enough to open source it and share it with the research community.
This blog post will serve as an introduction to the dotnetfile library. dotnetfile facilitates the extraction of vital information from .NET Portable Executable (PE) files. In addition to basic parsing, several advanced features are implemented by dotnetfile, which may help both automatic and manual analysis tasks. We also describe a new original fingerprinting technique called MemberRef Hash that is included in the library.
Those of you who wish to jump directly into the deep end are invited to review the code and the documentation.
Executive Summary
.NET Background
dotnetfile
Fields and Structure Parsing
ImplMap & ModuleRef - the hidden import table
Advanced Features
MemberRef Hash
TypeRef Hash re-implementation
Entry Point Discovery
Anti-Metadata
Conclusion
.NET is a free open source software framework, primarily developed by Microsoft. .NET is a managed software framework, meaning that code execution is managed by a runtime component and the original code isn’t executed directly by the CPU. While most of the programming languages that are supported are high-level, .NET offers an interface to directly interact with the operating system. All of the above make .NET an excellent platform for software development. While there are many legitimate uses for such a platform, malware authors often take advantage of good software development resources, and the .NET platform is no exception.
The diagram below schematically represents the .NET compilation and execution process. A programming language supported by .NET gets compiled by an appropriate compiler into a platform-neutral Common Intermediate Language (CIL) bytecode. The CIL is essentially the .NET binary instruction set. In runtime, a platform-specific Common Language Runtime (CLR) component compiles the CIL into native code, such as X86 or X64 Assembly. Besides translating the CIL bytecode into native instructions, the CLR is also in charge of executing the .NET program in a managed manner, handling memory management, type verification and much more.
In terms of file format, the common .NET PE file differs from its native PE counterpart. The difference begins at the .NET data directory, also called the COR20 directory. The directory points to the CLR header, which in turn points to the .NET metadata header. These two headers hold global information about the executable. The .NET metadata header usually points to five streams:
Any other stream is not allowed by .NET. If another string is present, it might have been inserted by obfuscators and protectors as “junk” to break parsers.
Various metadata tables under #~ point to offsets into the CIL code. One such table is the "Method" metadata. Each entry in this table will point to the starting offset of the method it describes in the CIL code. The CIL code references information held in the streams. For instance, this would occur when the code makes use of a string from the user strings heap (#US).
This section only provides a brief overview of .NET. The curious reader may check out ntcore's article or the official CLI spec.
Generally speaking, the overview is accurate, however, a disclaimer is appropriate. There are valid edge cases that divert from this “high-level” representation.
The dotnetfile library was named after the legendary pefile library. The usage is also very similar. In fact, the main object – DotNetPE – inherits from pefile's PE object. The dotnetfile library is a pure Python library, meaning there is no need to compile anything or rely on third-party .NET executables. We hope that fellow researchers will find the library useful for automating their file analysis workflows as well as for speeding up the manual investigation of samples.
dotnetfile extracts almost any field and structure from the .NET PE file. Among the parsed information, one can find:
In the most common case, the import table of a .NET PE file will only have entries referencing mscoree.dll, the .NET runtime library. Furthermore, these entries won’t reveal much about the program’s functionality, as oftentimes the same closed set of imports will be present in the import table.
The ImplMap metadata table stores information about unmanaged methods that can be reached from managed code, while the ModuleRef table contains the respective module information. P/Invoke relies on this information. Had the sample planned to use Windows API, the declared API will be stored in these tables. It is worth noting that accessing Windows API, or any other native code, can be achieved by resolving the API in runtime. Hence, in such cases, the ImplMap table won't necessarily store the API information.
dotnetfile provides a means to easily list unmanaged methods.
dotnetfile includes a few features that were developed in addition to the basic parsing ability that allows it to extract valuable information. We named this group of features "Advanced Features'' since they include logic and heuristics created by us. This logic conveys our own higher-level understanding of the .NET PE file and is not just straightforward parsing.
We believe that researchers can benefit from these features in a variety of tasks and applications.
MemberRef hash is an innovative new fingerprinting technique specifically developed for .NET samples. Like any other fingerprinting technique, it may help us to group, cluster and detect samples.
To better understand the technique, we first need to have a deeper look into the MemberRef metadata table. The MemberRef table contains mostly .NET runtime constructs such as methods, properties, fields and so on. We have found that this table can't be easily obfuscated, meaning the info in this table can help us with grouping obfuscated samples.
The MemberRef hash is computed over the MemberRef table using the following process:
The following diagram depicts the MemberRef hash computation process.
Let’s review an in-the-wild example. The naked eye can tell that the two samples shown in Figure 4 are very similar. However, it gets trickier from an automation point of view.
Fortunately, these two samples share the same MemberRef hash. That means that all of the members under the MemberRef table had the same name and the same table ID in both of the samples. Furthermore, the entries appear in the same order.
TypeRef hash is a fingerprinting technique for .NET samples that was invented by GDATA. The idea is to compute a hash over all of the referenced .NET types. To some degree, TypeRef hash resembles imphash, but there are some key differences.
The dotnetfile library includes implementation of the TypeRef hash functionality with a few “tweaks” compared to GDATA’s original implementation.
.NET DLLs don't need to have a defined entry point, in contrast to native PE DLLs that always have the DllMain function. Without knowledge of the real entry point, attempting to execute a .NET DLL will likely fail. In order to perform dynamic analysis of these files and fully detonate them, the appropriate entry point method(s) must be invoked.
The dotnetfile library provides an interface to list probable entry points based on tried and true heuristics. These heuristics are not bulletproof but work pretty well.
Looking at the analysis hash below, only five possible entry points were listed. This methodology drastically narrows down the search space for the right entry point. Upon manual examination of these five methods, it turns out that the second method is the real entry point.
Packers and protectors often try to break parsing by employing various techniques. The dotnetfile library comes with logic to identify such anomalies in the .NET metadata structures. This collection of anomaly detection methods was named Anti-Metadata. The anomalies that the library detects include:
This logic is still experimental but can point out anomalies in .NET PE samples, and it can be used as a feature for detection applications.
.NET is a widely used software development framework. Malware authors have adopted .NET, and this fact is reflected in the present threat landscape.
The dotnetfile library is a pure Python library that parses .NET PE files and extracts all sorts of useful information. Furthermore, dotnetfile provides higher-level logic on top of basic parsing functionality. We are proud to share it with the research community and we hope researchers will make use of it.
dotnetfile Github repository
dotnetfile Documentation
Sign up to receive the latest news, cyber threat intelligence and research from us