timwhitez starred Luwak

Table of Content

Overview
Background
Setup
- 1. Setup Virtual Env
- 2. Download and Merge Model Artifacts
Demo

Overview

Luwak TTP Extractor uses pre-trained models to extract Tactics, Techniques and Procedures (TTPs) from unstructured threat reports. It uses ERNIE pre-trained models to infer common TTPs from threat reports. Currently, we only open source the fine-tuned model for English content. It is expected that in H2, we will have an external TTP extraction service for trial use. It supports Chinese and English content. When the time comes, we will update the URL here.

Background

MITRE ATT&CK is a framework which uses TTPs to describe the operation modes in campaigns of threat actors. TTPs are valuable to Breach and Attack Simulation (BAS) system to assess defense capabilities, and are the most important parts of TTP-based Knowledge Graph. Most TTPs of threat actors in security community exist in unstructured threat reports, such as malware blogs and white papers.

The pre-trained language model is pre-trained on a large-scale corpus, thus it can learn a general language representation, and has excellent results in many downstream natural language processing tasks. Extracting TTPs from unstructured reports is essentially a text multi-classification task. Therefore, using the pre-trained language model for downstream TTP extraction task can achieve good results.

Setup

Key Requirements

Python (64 bit) >= 3.7
pip (64 bit) >= 20.2.2
PaddlePaddle (64 bit) >= 2.0
paddlenlp
nltk
Processor arch: x86_64 (arm64 is not supported)

1. Setup Virtual Env

Clone the repository and setup a virtual env with virthenv:

pip3 install virtualenv
mkdir <VENV_NAME>
cd <VENV_NAME>
virtualenv -p python3 .
source ./bin/activate
cd <path to this notebook>
pip install -r requirements.txt

2. Download and Merge Model Artifacts

Download nltk punkt and merge pretrained model:

python -c "import os, nltk; nltk_data_path = os.path.join(os.getcwd(),'nltk_data'); nltk.download('punkt', nltk_data_path);"

./merge_model.sh

Demo

Load the model and import predict function:

import inference
from inference import predict_text

Put the text to extract TTPs in the text variable:

text = """ACNSHELL is sideloaded by a legitimate executable.
It will then create a reverse shell via ncat.exe to the server closed.theworkpc.com"""

Then call the predict_text function to infer TTPs:

o_text, ttps = predict_text(text)

The o_text variable stores the original text which is the input of predict_text function.

The ttps variable stores the predicted TTPs, it looks like:

[{'sent': 'ACNSHELL is sideloaded by a legitimate executable.',
  'tts': [{'tactic_name': 'Persistence',
    'tactic_id': 'TA0003',
    'technique_name': 'DLLSide-Loading',
    'technique_id': 'T1574.002',
    'score': 0.9799978}]},
 {'sent': 'It will then create a reverse shell via ncat.exe to the server closed.theworkpc.com',
  'tts': [{'tactic_name': 'Execution',
    'tactic_id': 'TA0002',
    'technique_name': 'WindowsCommandShell',
    'technique_id': 'T1059.003',
    'score': 0.98543674}]}]

Every sentence (specified by the sent) may maps zero or serveral tactics and techniques (specified by tts). For each predicted tactic and technique, the fields tactic_name and tactic_id indicate the name and id of tactic, the fields technique_name and technique_id indicate the name and id of technique, the field score gives the model's score for the tactic and technique.