This blog post introduces a tool that extracts stolen credentials from text files in varying formats, in order to address CTI and Red Teaming needs.
Information stealers (more commonly "infostealers" or "stealers") are malware designed to collect sensitive data from infected systems, such as credentials, cookies, browser autofill data, and information about the machine itself.
In most cases, infostealers are sold as Malware-as-a-Service (MaaS). It is therefore no surprise that they are simple to use. The most famous ones include RedLine, its successor META, Raccoon, and Stealc.
Infostealer attacks are motivated by financial gain. Stolen data, also called "logs" in this context, is put up for sale on the dark web or on messaging platforms such as Telegram. Some threat actors build communities ("logs cloud" or "cloud of logs") to share archived logs as ZIP or RAR files. Because there are as many paid-access channels as free ones, log quality and freshness can vary a lot.
Automating infostealer log processing could prove useful for two of our teams. Scripts for this are easy to find online, for instance on GitHub, but they do not handle enough file formats. Developing our own tool therefore seemed a reasonable choice.
First, LEXFO has expertise in data breach detection. Our specialized team searches for client information exposed on the different layers of the web and notifies clients when a leak is confirmed. Done manually, this task is very time-consuming, as it involves many steps: browsing hundreds of channels, downloading tons of zipped archives, searching them for specific domain names and keywords, and so on. Automation would therefore be a huge help.
Next, this kind of tool could provide valuable assets to Red Teamers. Since logs contain credentials, it would be interesting to build a search engine similar to Intelligence X or DeHashed.
To meet these requirements, we have to scrape the logs clouds for archives, extract the credentials and system information they contain, and store everything in a searchable database.
This blog post focuses only on the extraction part. It will detail how to create a stealer logs parser.
The first step is to create a Telegram account that will join logs clouds. The scraper uses it to connect and, thanks to an event listener, download every log archive received.
Note that sometimes the archives are password-protected. The password can be found in the corresponding Telegram message's text content.
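Below is a minimal sketch of such a listener, using the Telethon library. The session name and API credentials are placeholders, and the two helpers at the end are hypothetical, not the actual implementation:

```python
from telethon import TelegramClient, events

# Placeholder API credentials (obtained from https://my.telegram.org).
client = TelegramClient("logs_scraper", api_id=12345, api_hash="0123456789abcdef")

@client.on(events.NewMessage)
async def handle_new_message(event):
    file = event.message.file
    # Download every ZIP/RAR archive shared in the joined channels.
    if file and file.name and file.name.lower().endswith((".zip", ".rar")):
        path = await event.message.download_media(file="archives/")
        # If the archive is protected, its password is usually in the message text.
        password = extract_password(event.message.text)  # hypothetical helper
        process_archive(path, password)                  # hypothetical helper

client.start()
client.run_until_disconnected()
```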
One archive can contain thousands of directories and subdirectories. Each stands for a compromised system and often looks like this:
├── Autofills
│ ├── Google_[Chrome]_Default.txt
│ └── Microsoft_[Edge]_Default.txt
├── Cookies
│ ├── Firefox_123456.default-release.txt
│ ├── Google_[Chrome]_Default Network.txt
│ └── Microsoft_[Edge]_Default Network.txt
├── DomainDetects.txt
├── FileGrabber # Desktop files
│ └── Users
│ └── ...
│ └── Desktop
│ └── ...
├── ImportantAutofills.txt
├── InstalledBrowsers.txt
├── InstalledSoftware.txt
├── Passwords.txt # Stolen credentials
├── ProcessList.txt
├── Screenshot.jpg
└── UserInformation.txt # Information about the compromised system
Every stealer has its own directory structure and text file formatting. Although similar, they differ enough to require more than a ten-line script.
Credentials
In every system directory, there is usually a text file that groups the stolen credentials. Most of the time, it is named as follows:
password.txt
Password.txt
All passwords.txt
AllPasswords_list.txt
_AllPasswords_list.txt
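These candidate files can be collected while walking an extracted archive. Here is a minimal sketch; the filename set simply mirrors the list above, compared case-insensitively:

```python
import os

# Known credential file names (lowercased; covers "password.txt" and "Password.txt").
CREDENTIAL_FILENAMES = {
    "password.txt",
    "all passwords.txt",
    "allpasswords_list.txt",
    "_allpasswords_list.txt",
}

def find_credential_files(root: str) -> list[str]:
    """Return the paths of every credential file under an extracted archive."""
    return [
        os.path.join(dirpath, name)
        for dirpath, _, filenames in os.walk(root)
        for name in filenames
        if name.lower() in CREDENTIAL_FILENAMES
    ]
```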
The document layout depends on the stealer that harvested the data. Here are five examples from five different sources:
Soft: Google Chrome [Default]
Host: https://example.com
Login: ••••••
Password: ••••••

["Chrome" = "Default"]
Hostname: https://example.com
Username: ••••••
Password: ••••••

browser: Google Chrome
profile: Default
url: https://example.com
login: ••••••
password: ••••••

SOFT: Chrome (foo.default)
URL: https://example.com
USER: ••••••
PASS: ••••••

URL: https://example.com
Username: ••••••
Password: ••••••
Application: Google_[Chrome]_Profile 1
These examples show that the stolen information is mostly the same; only the field names that prefix it change.
Information | Prefixes
---|---
URL or hostname | Host, Hostname, URL, url
application (web browser, email client) | Soft, SOFT, browser, Application
username | Login, login, Username, USER
password | Password, password, PASS
It is important to note that the prefix list above is only part of a larger one. Moreover, there is little doubt new ones will be discovered in the future.
In order to avoid struggling with a forest of `if` statements and to write maintainable, extensible code, the first step is to create a grammar, beginning with the tokens. They will be used to split the file's text content before parsing it.
%token WORD
%token NEWLINE
%token SPACE
%token SOFT_PREFIX
/* 'Soft:' | 'SOFT:' | 'Browser:' | 'Application:' | 'Storage:' */
%token SOFT_NO_PREFIX
/* '["Browser" = "Profile"]' */
%token HOST_PREFIX
/* 'Host:' | 'Hostname:' | 'URL:' | 'UR1:' */
%token USER_PREFIX
/* 'USER LOGIN:' | 'Login:' | 'Username:' | 'USER:' | 'U53RN4M3:' */
%token PASSWORD_PREFIX
/* 'USER PASSWORD:' | 'Password:' | 'PASS:' | 'P455W0RD:' */
%token SELLER_PREFIX
/* 'Seller:' | 'Log Tools:' | 'Free Logs:' */
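As an illustration, here is one way this token list could be turned into a lexer in Python. This is a minimal sketch, not the tool's actual implementation, and the prefix alternatives are abridged:

```python
import re

# Order matters: specific prefixes must be tried before the generic WORD token.
TOKEN_SPEC = [
    ("SOFT_PREFIX",     r"(?:Soft|SOFT|Browser|Application|Storage):"),
    ("SOFT_NO_PREFIX",  r'\[".+?" = ".+?"\]'),
    ("HOST_PREFIX",     r"(?:Hostname|Host|URL|UR1):"),
    ("USER_PREFIX",     r"(?:USER LOGIN|Login|Username|USER|U53RN4M3):"),
    ("PASSWORD_PREFIX", r"(?:USER PASSWORD|Password|PASS|P455W0RD):"),
    ("SELLER_PREFIX",   r"(?:Seller|Log Tools|Free Logs):"),
    ("NEWLINE",         r"\n"),
    ("SPACE",           r"[ \t]+"),
    ("WORD",            r"\S+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC))

def tokenize(text: str) -> list[tuple[str, str]]:
    """Split a file's text content into (token name, value) pairs."""
    return [(match.lastgroup, match.group()) for match in TOKEN_RE.finditer(text)]
```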
Grammar rules come next. They are defined from the various log formats encountered.
After many iterations, these are the rules obtained:
%start passwords
%%
passwords : NEWLINE
| user_block
| seller_block
| header_line
;
header_line : WORD NEWLINE
| SPACE NEWLINE
| WORD header_line
;
seller_block : SELLER_PREFIX SPACE entry
| host_line
| seller_block NEWLINE
;
user_block : soft_line host_line user_line password_line
| host_line user_line password_line
| soft_line user_line password_line
;
soft_line : SOFT_PREFIX NEWLINE
| SOFT_PREFIX SPACE NEWLINE
| SOFT_PREFIX SPACE entry NEWLINE
| soft_line profile_line NEWLINE
| SOFT_NO_PREFIX NEWLINE
;
profile_line : 'profile:' SPACE WORD
;
host_line : HOST_PREFIX NEWLINE
| HOST_PREFIX SPACE NEWLINE
| HOST_PREFIX SPACE entry NEWLINE
;
user_line : USER_PREFIX NEWLINE
| USER_PREFIX SPACE NEWLINE
| USER_PREFIX SPACE entry NEWLINE
;
password_line : PASSWORD_PREFIX NEWLINE
| PASSWORD_PREFIX SPACE NEWLINE
| PASSWORD_PREFIX SPACE entry NEWLINE
| multiline_entry NEWLINE
;
multiline_entry : WORD
| multiline_entry NEWLINE WORD
;
entry : WORD
| entry SPACE WORD
;
However, keep in mind that new, unseen, or malformed formats may still turn up, so this grammar is never really finished. As they say, there is always room for improvement.
The grammar is available in our GitHub repository.
About the host
In addition to the credentials, the system directory contains a text file describing the infected machine (country, IP address, user, and so forth). Most common names include:
system_info.txt
System Info.txt
information.txt
Information.txt
UserInformation.txt

Here is an example:
Network Info:
- IP: ••••••••••••
- Country: FR
System Summary:
- UID: ••••••••••••
- HWID: ••••••••••••
- OS: Windows 10 Pro
- Architecture: x64
- User Name: ••••••
- Computer Name: ••••••••••••
- Local Time: 2023/6/21 20:58:2
- UTC: 0
- Language: fr-FR
- Keyboards: French (France)
- Laptop: FALSE
- Running Path: ••••••••••••
- CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
- Cores: 4
- Threads: 8
- RAM: 16348 MB
- Display Resolution: 1920x1080
- GPU:
-NVIDIA GeForce GTX 1650
-NVIDIA GeForce GTX 1650
-NVIDIA GeForce GTX 1650
-NVIDIA GeForce GTX 1650
Good to Know: The IP address may sometimes be found in a separate file named ip.txt.
Below is the token list produced:
%token WORD
%token NEWLINE
%token SPACE
%token DASH
%token UID_PREFIX
/* 'UID:' | 'MachineID:' */
%token COMPUTER_NAME_PREFIX
/* 'Computer:' | 'ComputerName:' | 'Computer Name:' | 'PC Name:' | 'Hostname:' | 'MachineName:' */
%token HWID_PREFIX
/* 'HWID:' */
%token USERNAME_PREFIX
/* 'User Name:' | 'UserName:' | 'User:' */
%token IP_PREFIX
/* 'IP:' | 'Ip:' | 'IPAddress:' | 'IP Address:' | 'LANIP:' */
%token COUNTRY_PREFIX
/* 'Country:' | 'Country Code:' */
%token LOG_DATE_PREFIX
/* 'Log date:' | 'Last seen:' | 'Install Date:' */
%token OTHER_PREFIX
/* 'User Agents:' | 'Installed Apps:' | 'Current User:' | 'Process List:' */
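The regex-based lexer sketched earlier extends naturally to these tokens. The patterns below are again illustrative and abridged:

```python
# Additional token patterns for system-information files
# (same caveat as before: longer alternatives and prefixes come first).
SYSTEM_TOKEN_SPEC = [
    ("UID_PREFIX",           r"(?:UID|MachineID):"),
    ("COMPUTER_NAME_PREFIX", r"(?:Computer Name|ComputerName|Computer|PC Name|Hostname|MachineName):"),
    ("HWID_PREFIX",          r"HWID:"),
    ("USERNAME_PREFIX",      r"(?:User Name|UserName|User):"),
    ("IP_PREFIX",            r"(?:IP Address|IPAddress|IP|Ip|LANIP):"),
    ("COUNTRY_PREFIX",       r"(?:Country Code|Country):"),
    ("LOG_DATE_PREFIX",      r"(?:Log date|Last seen|Install Date):"),
    ("DASH",                 r"-"),
    ("NEWLINE",              r"\n"),
    ("SPACE",                r"[ \t]+"),
    ("WORD",                 r"\S+"),
]
```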
The grammar rules are constructed in the same manner as for credentials.
Find the full grammar in our GitHub repository.
In order to implement all of this, two more things are required: a lexer that splits the text into the tokens defined above, and a parser that applies the grammar rules to the resulting token stream.
If we take a quick look at the grammar, a line containing hostname or URL information is parsed by the host_line rule:
host_line : HOST_PREFIX NEWLINE
| HOST_PREFIX SPACE NEWLINE
| HOST_PREFIX SPACE entry NEWLINE
;
# The entry rule is repeated here to make host_line easier to understand
entry : WORD
| entry SPACE WORD
;
Translated to Python, this becomes:
def parse_host_line(parser: LogsParser, credential: Credential) -> bool:
    """Parse host data (website visited when user's data was stolen).

    host_line : HOST_PREFIX NEWLINE
              | HOST_PREFIX SPACE NEWLINE
              | HOST_PREFIX SPACE entry NEWLINE
    """
    if parser.eat("HOST_PREFIX"):
        if parser.eat("SPACE"):
            credential.host = parse_entry(parser)
            extract_credential_domain_name(credential)
        parser.eat("NEWLINE")
        return True
    return False
The result is stored in a Credential object.
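Such an object can be as simple as a dataclass. Here is a plausible shape, with fields inferred from the parsing rules rather than taken from the repository's exact definition:

```python
from dataclasses import dataclass

@dataclass
class Credential:
    """Container filled in by the parse_*_line functions."""
    soft: str = ""      # application (web browser, email client...)
    host: str = ""      # URL or hostname
    domain: str = ""    # domain name extracted from host
    username: str = ""
    password: str = ""
```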
Every other rule follows this logic.
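For instance, the entry rule used above could translate as follows. This is a sketch: the last_value attribute, holding the text of the most recently eaten token, is an assumption about the parser's internals:

```python
def parse_entry(parser: LogsParser) -> str:
    """Parse a value made of words separated by single spaces.

    entry : WORD
          | entry SPACE WORD
    """
    words = []
    while parser.eat("WORD"):
        words.append(parser.last_value)  # assumed attribute: text of the eaten token
        if not parser.eat("SPACE"):
            break
    return " ".join(words)
```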
Since the data is structured, PostgreSQL was the chosen solution. A Python API built with FastAPI performs data validation and lets users search the leaks through conveniently documented routes.
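To give an idea of what such a route can look like, here is a sketch; the route path, response model, and query_credentials helper are illustrative, not the actual API:

```python
from fastapi import FastAPI, Query
from pydantic import BaseModel

app = FastAPI(title="Stealer logs search")

class CredentialOut(BaseModel):
    host: str
    username: str
    password: str

@app.get("/credentials", response_model=list[CredentialOut])
async def search_credentials(domain: str = Query(..., min_length=3)):
    """Return the leaked credentials recorded for a given domain name."""
    return query_credentials(domain)  # hypothetical database-access helper
```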
A CSV file is sent daily to analysts so they get an overview of the clients' sensitive data that leaked in the previous 24 hours.
A closer look at the log files shows that they do not differ significantly. Writing a grammar therefore seemed the cleanest and most extensible choice: whenever a new prefix is encountered, all we have to do is add it to the token list.
If you want to test the tool, a part of its code is available here.