Hidden between the tags: Insights into spammers’ evasion techniques in HTML Smuggling

Wednesday, July 10, 2024 08:00

Cisco Talos has spotted several malicious email campaigns over the past few months that disguise JavaScript code within HTML email attachments, a technique commonly known as “HTML Smuggling.”
Cisco Talos has noticed that some industry verticals were targeted more than others by email threats using the HTML smuggling technique during the observed time window. For example, companies in the human resources, insurance and healthcare domains were targeted the most, while legal, supply chain and e-commerce companies were among those targeted the least.
The senders of these emails use a wide range of evasion techniques to get around email gateways and even more advanced detections. These techniques range from various encoding mechanisms to encryption and obfuscation.
These adversaries use simple methods to increase their chances of success, such as manipulating email attachments, as well as more advanced techniques that combine different evasion methods or apply a single evasion method multiple times.
Talos is releasing a new list of CyberChef recipes that enable faster and easier reversal of encoded JavaScript code contained in the observed HTML attachments. This may assist in creating automation to process and identify such emails for more effective long-term security measures.

Introduction to HTML smuggling

HTML smuggling is a technique used by attackers to embed encoded or encrypted JavaScript code within HTML attachments or web pages. This technique has been used extensively in spear phishing email campaigns over the past few months. HTML smuggling is quite effective in bypassing perimeter security controls such as email gateways and web proxies for two main reasons: It abuses the legitimate features of HTML5 and JavaScript, and it leverages different forms of encoding and encryption.

Threat actors start by sending one or more emails with URLs or HTML attachments to their targets. When the recipient clicks on the URL or opens the attachment, the browser decodes and runs all encoded JavaScript code automatically, which will eventually download and deliver the malware to the victim’s device, or alternatively, redirect the user to the final phishing page. In some cases, the code for the malware is embedded in the HTML attachment, and the JavaScript code simply reconstructs and runs it without needing additional downloads.

Reversing the HTML attachments

Security researchers should continuously monitor changes in threat actors’ techniques and update their detection logic and processes so that customers stay protected. Reverse engineering tools help security analysts better understand attackers’ techniques, especially those used to stay under the radar. The resulting insights can be used to update the logic in detection rules or the feature extraction processes of more advanced detection solutions that rely on machine learning.

By sharing the insights gained from reverse engineering with the broader cybersecurity community, organizations can contribute to collective threat intelligence efforts, helping others to prepare for and defend against similar attacks.

CyberChef is a powerful open-source web application developed by Government Communications Headquarters (GCHQ) that facilitates the decoding and decryption of JavaScript code in HTML attachments. It provides a variety of modules (or functions) for decoding and decryption that can be combined to build up a “recipe.” These recipes can then be exported in different formats and loaded later to be used elsewhere. A snippet of an HTML attachment is shown below.

A snippet of an HTML attachment with base64-encoded string within a script tag.

This attachment contains an encoded string in JavaScript that can be decoded using a base64-decoding function. The URL is defanged to avoid being opened by the readers by mistake.

An encoded string and its decoded equivalent using the “From Base64” function in CyberChef.
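
For automation outside CyberChef, the same “From Base64” step can be approximated in a few lines of Python. The sketch below is illustrative only: it assumes the encoded payload appears as a quoted literal passed to the browser's atob() function, which is one common pattern but not the only one, and the function name and sample string are hypothetical.

```python
import base64
import re

# Assumption: the payload is a quoted base64 literal inside an atob(...) call.
ATOB_CALL = re.compile(r"""atob\(\s*["']([A-Za-z0-9+/=]+)["']\s*\)""")


def decode_atob_literals(html: str) -> list[str]:
    """Return the decoded text of every atob("...") literal found in html."""
    decoded = []
    for literal in ATOB_CALL.findall(html):
        try:
            raw = base64.b64decode(literal, validate=True)
        except ValueError:
            continue  # matched the pattern but is not valid base64; skip it
        decoded.append(raw.decode("utf-8", errors="replace"))
    return decoded


# Hypothetical sample; real attachments carry far longer strings.
sample = '<script>var u = atob("aHR0cHM=");</script>'
print(decode_atob_literals(sample))
```

Strings that are assembled from several fragments before decoding will not match this pattern, so a production pipeline would likely need additional extraction logic.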

Talos is releasing new recipes that security researchers can use to reverse encoded and/or encrypted JavaScript code in HTML attachments. Alternatively, these recipes could be integrated into automated feature extraction processes to improve the detection of emails containing HTML attachments. We will also share recipes that are not referenced in this blog post but have been used frequently to reverse HTML attachments. The combinations captured by these recipes show which evasion techniques threat actors use most often.

A dive into evasion techniques

Talos has been closely monitoring email campaigns that leverage HTML smuggling over the past several months. Threat actors use various evasion techniques to bypass email gateways, ranging from different encoding mechanisms to encryption. In some instances, evasion techniques are chained together, but in others, a single method is employed multiple times to make detection harder. Additionally, obfuscation has been applied to the encoded JavaScript code to further complicate its detection.

Playing around with attachments

Talos has witnessed various attempts by threat actors to manipulate email attachments to take advantage of software engineering oversights and evade detection systems.

One common technique involves using alternative or similar file extensions for attachments to bypass message filtering mechanisms. For example, XHTML, an older and stricter version of HTML that follows XML syntax, has been frequently used in HTML smuggling. SHTML, an extension of HTML that allows for the inclusion of dynamic content, has also been used in the wild.

Another frequent pattern adds dots to the end of HTML file extensions (e.g., “html.”, “htm.”, “htm…”) instead of using an alternative file extension. This aims to bypass email parsers that rely on the Content-Type header to determine the type of an email attachment. Adding at least one dot to the end of the file extension changes the Content-Type of the attachment from text/html to application/octet-stream (see the examples below).

The Content-Type of the HTML attachment of an example email.

The Content-Type of the “htm.” attachment of an example email.

In some cases, the attachment’s file extension is repeated multiple times, or the attachment lacks any file extension (see the examples below). We have also observed attempts to combine different file extensions (e.g., “.pdf.html” or “.xls.html”), which may confuse the file type identification logic of the detection code. This can affect how files are passed to downstream modules for further assessment.

An example email with “.html .html” file extension.

A snippet of the HTML attachment of the above email with JavaScript code.

An example email with an HTML attachment that lacks the file extension.
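
The filename tricks described above (trailing dots, repeated or combined extensions, and missing extensions) suggest normalizing attachment names before deciding how to route them. The following is a minimal sketch of that idea, not a description of any product's logic: the extension set, the function names and the content markers are all assumptions chosen for illustration.

```python
# Assumption: these are the HTML-family extensions worth treating alike.
HTML_EXTENSIONS = {"html", "htm", "xhtml", "shtml"}


def has_html_extension(filename: str) -> bool:
    """True if the name ends in an HTML-family extension once trailing
    dots, ellipsis characters and whitespace are ignored."""
    cleaned = filename.strip().rstrip(". \u2026").lower()
    # Split on dots and trim each piece so ".html .html" still resolves.
    parts = [part.strip() for part in cleaned.split(".")]
    parts = [part for part in parts if part]
    return len(parts) > 1 and parts[-1] in HTML_EXTENSIONS


def looks_like_html(content: bytes, window: int = 2048) -> bool:
    """Fallback for attachments without an extension: look for HTML
    markers near the start of the content instead of trusting the name."""
    head = content[:window].lower()
    markers = (b"<!doctype html", b"<html", b"<script")
    return any(marker in head for marker in markers)


for name in ["invoice.htm.", "report.pdf.html", "doc.html .html", "statement"]:
    print(name, has_html_extension(name))
```

Checking the content as well as the name means a declared Content-Type of application/octet-stream does not, by itself, keep an HTML file away from the HTML analysis path.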

Other popular techniques frequently observed include enclosing HTML attachments in ZIP archives and attaching multiple similar HTML attachments to a single email. The latter method has been identified as an attempt to increase the chances of success. The following email provides an example in which multiple SHTML attachments with identical content have been attached to a spear phishing email. The goal of the threat actors is to offer multiple employment benefits and trick the victim into engaging with any one of them. The more attractive these benefits are to an employee, the higher the chance of success for the attackers.

A spear phishing example email with multiple SHTML attachments.

A snippet of the SHTML attachment of the above email with JavaScript code, and the decoded phishing URL.

Obfuscation

Code obfuscation is used extensively in HTML smuggling attacks to make their detection more challenging and expensive. One of the most popular techniques applied to JavaScript code is identifier renaming, which replaces identifiers such as variable and function names with meaningless strings. This technique is popular because it’s offered by most free and open-source obfuscators and doesn’t change the logic of the code. Two examples are provided below. In the first example, the phishing URL is stored as an array of integers in an obfuscated variable. The array is decoded on the fly, and a “click” method then redirects the victim to the phishing page.

A spear phishing example email with obfuscated JavaScript within HTML attachment.

A snippet of the HTML attachment of the above email with JavaScript variables that are obfuscated via identifier renaming method, and the decoded phishing URL.
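
To illustrate the integer-array encoding from the first example, the sketch below converts a list of integers back into text. It assumes each integer is a plain character code, similar to what JavaScript's String.fromCharCode would consume; the exact decoding routine in the observed sample may differ, and the array shown here is hypothetical.

```python
def decode_char_codes(codes: list[int]) -> str:
    """Join a list of character codes into the string they spell out."""
    return "".join(chr(code) for code in codes)


# Hypothetical array standing in for the obfuscated variable's contents.
hidden = [104, 116, 116, 112, 115]
print(decode_char_codes(hidden))  # https
```

Because identifier renaming does not change the values themselves, a decoder like this works regardless of what the variable holding the array is called.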

In the second example, the function name is also obfuscated. Here, an obfuscated string is initially decoded through a series of replacements. Subsequently, the decoded string is passed to the “eval” method, which executes it.

An example email with obfuscated JavaScript within HTML attachment.

A snippet of the HTML attachment of the above email with JavaScript variables and functions that are obfuscated via the identifier renaming method.

Using a single evasion technique multiple times

The following case provides an example in which one of our customers received an email with an "html." attachment in October 2023. In this instance, threat actors leveraged the base64 encoding method twice, a technique also known as double encoding, to evade detection systems that likely rely on single-stage decoding procedures before analyzing scripts.

A spear phishing example email with an HTML attachment that contains double encoded JavaScript string variables.

The content of the HTML attachment is shown in the figure below. This attachment contains four hidden input fields. Hidden input fields are not visible on the webpage, but they can still hold values that are sent to the server when the form is submitted. Initially, the JavaScript code retrieves the values of these hidden input fields. It then uses the "substr" method to extract substrings from the values and concatenates them. Finally, two URLs are generated from these hidden fields on the fly (the values of the hidden fields and the variables that follow them can be decoded via the Base64_Decode_DecimalUnicode2String_Base64_Decode recipe).

A snippet of the HTML attachment of the above email with double encoded JavaScript string variables.
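
Double encoding defeats analysis that decodes only once, so one defensive option is to keep decoding until the result stops looking like base64. The sketch below shows that idea under simplifying assumptions: it handles directly nested base64 only and does not model the intermediate decimal-to-string step that the recipe named above includes. The function name and the layer limit are hypothetical.

```python
import base64


def peel_base64(value: str, max_layers: int = 5) -> tuple[str, int]:
    """Base64-decode value repeatedly; return the final text and the
    number of layers that were removed."""
    current = value
    layers = 0
    while layers < max_layers:
        try:
            decoded = base64.b64decode(current, validate=True).decode("utf-8")
        except ValueError:
            break  # not valid base64, or the bytes are not UTF-8 text
        if not decoded or not decoded.isprintable():
            break  # decoded to something that is not readable text
        current = decoded
        layers += 1
    return current, layers


# Hypothetical, defanged value encoded twice for demonstration.
inner = "hxxps://example[.]test/login"
twice = base64.b64encode(base64.b64encode(inner.encode())).decode()
print(peel_base64(twice))
```

The layer count is worth keeping as a detection feature in its own right, since legitimate pages rarely nest encodings this way.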

The final phishing page (https[:]//dompr[.]arrogree[.]park/login.php), constructed on the fly and stored in the 'trc' variable, is automatically shown to the victim once the HTML page is fully loaded, using the 'onload' event handler. As soon as the user enters the credentials, the 'SFiegrt' method sends them to another remote server (https[:]//cpsvr[.]hiominsa[.]com/POST/genofcatch.php) using an asynchronous HTTP POST request.

A snippet of the HTML attachment of the above email with double-encoded JavaScript string variables and the decoded URLs.

Chaining different evasion techniques

Talos has observed continuous efforts by email campaigns to combine different encoding and/or encryption techniques in HTML attachments to evade detection. In addition, threat actors typically obfuscate the embedded JavaScript code to increase their chances of success.

In the example below, you can find an email with an "htm" attachment that was sent to one of our customers in December 2023.

A spear phishing example email with an HTML attachment that combines identifier renaming obfuscation, base64 encoding and Caesar encryption to bypass detection.

The embedded JavaScript code is shown in the figure below. This email was clearly tailored for a specific recipient (see the masked email address).

A snippet of the HTML attachment of the above email with obfuscated JavaScript variables and a base64-encoded string variable.

The script block contains an obfuscated variable named '_0x5da6a8' that holds a base64-encoded string. When decoded (via the Base64_Decode_2 recipe), it yields the main JavaScript code, which includes the final phishing URL and the de-obfuscation function. The de-obfuscation function takes the "link" string variable as input and iterates over its characters. For each character, it takes the Unicode code point, subtracts five from that decimal value and converts the result back to a character. The converted characters are then joined into the output string. Under the hood, this function effectively replicates the functionality of a Caesar cipher decryption method (see the Caesar_Decrypt recipe).

The decoded JavaScript code of the above base64-encoded string variable, the Caesar decryption function, and the phishing URL.
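
The de-obfuscation routine described above amounts to shifting every character code down by five. A minimal Python equivalent is sketched below; the function name and the sample input are hypothetical, and the shift of five is taken from the description of this one sample rather than being a general constant.

```python
def shift_decode(text: str, shift: int = 5) -> str:
    """Subtract shift from each character's code point and rebuild the string."""
    return "".join(chr(ord(ch) - shift) for ch in text)


# Hypothetical input: "https" with every character shifted up by five.
print(shift_decode("myyux"))  # https
```

Unlike a classic Caesar cipher, this shift is applied to raw code points and does not wrap around the alphabet, so punctuation and digits are shifted as well. An input containing characters with code points below the shift value would raise a ValueError in this sketch.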

So, threat actors have used encoding, Caesar encryption and obfuscation all together in this case to evade detection.

The example email below shows how threat actors have combined encoding and AES encryption in an HTML attachment to evade detection.

A spear phishing example email with an HTML attachment that combines base64 encoding and AES encryption to bypass detection.

A snippet of the HTML attachment of the above email with a base64-encoded input field that is hidden.

As can be seen from the HTML attachment snippet, there is a base64-encoded string in this file. This string is first decoded using the 'atob' method (or the Base64_Decode_1 recipe). Once decoded, it yields a JSON string with four keys (a-d), as shown below. The value of the 'a' key is the encrypted string, which is decrypted on the fly. The values of the 'b' and 'd' keys are the passphrase and salt, respectively, for the PBKDF2 key derivation function used to create the decryption key. The value of the 'c' key is the Initialization Vector (IV) for the AES decryption function.

The decoded JSON string is obtained from the encoded input field in the above HTML attachment.

With the above values, the decryption key is derived on the fly (using the Derive_Key_PBKDF2 recipe). Then, with this key and the IV parameter, the value of 'a' is decrypted (using the Base64_Decode_AES_CBC_Decrypt recipe) to retrieve the URL, as shown below. Note that the inner URL is in HEX format (which can be decoded using the Hex_Decode recipe).

The final phishing URL is obtained from AES decryption.
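
The key derivation and hex decoding steps of this walkthrough can be sketched with Python's standard library. This is an illustration under several assumptions that the attachment itself would have to confirm: the salt is treated as hex-encoded, and the hash function, iteration count and key length are placeholders rather than values from the observed sample. The AES-CBC decryption step is left out because it requires a third-party cryptography library.

```python
import hashlib
import json


def derive_key(blob: str, hash_name: str = "sha256",
               iterations: int = 1000, key_len: int = 32) -> bytes:
    """Derive an AES key from the decoded JSON blob using PBKDF2.

    hash_name, iterations and key_len are placeholders; read the real
    values from the attachment's script before relying on the result.
    """
    fields = json.loads(blob)
    passphrase = fields["b"].encode("utf-8")  # 'b' holds the passphrase
    salt = bytes.fromhex(fields["d"])         # assumption: 'd' is hex-encoded
    return hashlib.pbkdf2_hmac(hash_name, passphrase, salt,
                               iterations, dklen=key_len)


def hex_to_text(hex_string: str) -> str:
    """Decode a hex string, such as the inner URL, into readable text."""
    return bytes.fromhex(hex_string).decode("utf-8", errors="replace")


# Hypothetical blob with the same four keys as the decoded JSON string.
blob = '{"a": "", "b": "passphrase", "c": "00", "d": "0011aabb"}'
print(len(derive_key(blob)))
print(hex_to_text("68747470"))
```

With the derived key and the IV from the 'c' key, the value of 'a' can then be decrypted in CyberChef or with an AES implementation of choice.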

Once this URL is created, a fetch request with a POST method is sent to it, with a JSON body containing "{ "lettuce": "friendliness" }". Then, the response from the fetch request is read as text. The text is decrypted again by calling the pea function, and finally, the second decrypted result is written to the document using "document.write", replacing all current content. The walkthrough above gives an idea of how threat actors combine different evasion techniques and how challenging it is for defenders to detect such threats.

HTML smuggling in the wild

The number of unique emails that used HTML smuggling between Oct. 1, 2023, and May 31, 2024.

The above chart indicates that we observed a peak in the number of emails leveraging this technique on Feb. 2, 2024, particularly those employing advanced encoding and obfuscation in their HTML attachments. The subsequent pie chart shows that our American customers were targeted significantly more than our European customers by email threats utilizing this technique.

The percentage of emails that used HTML smuggling against customers in different geographical regions.

We noticed that some industry verticals were targeted more than others by email threats using the HTML smuggling technique during the observed time window. For example, companies in the human resources, insurance and healthcare domains were targeted the most, while legal, supply chain and e-commerce companies were among those targeted the least.

The number of convictions for email threats that used HTML smuggling across different industry verticals.

Our observations show that ".shtml" was the most used file extension in email threats that leveraged the HTML smuggling technique, followed by ".htm" and ".html". The fourth and fifth most widely used file extensions were ".htm...." and ".xhtml". Other file extensions we found interesting included ".pdf.shtml", ".pdf.htm", and ".xlsx.html". We have identified these as potential techniques to exploit software engineering oversights and bypass detection mechanisms.

Top HTML attachment file extensions observed in email campaigns that used HTML smuggling.

How to protect against email threats that use HTML Smuggling?

As outlined earlier, HTML smuggling poses significant challenges to traditional security solutions and rule-based detection engines. The extensive deployment of encoding, encryption, and obfuscation renders the detection of emails utilizing this technique challenging, necessitating enhancements in various aspects of defense. Several of these areas are detailed below.

Multi-layered defense

Swiftly blocking an attack minimizes its potential impact on our customers' business operations. Thus, it is advantageous to neutralize threats at initial stages, such as at the email gateway and web filtering level. However, if a message passes through these perimeter security controls and lands in a user’s inbox, retrospective detections and endpoint protection controls should either detect these messages at later stages and pull them out of the user’s inbox or prevent the execution of such malicious JavaScript on the victim’s device.

Improved security engineering

Despite typically having insufficient information about the architecture and inner workings of commercial security solutions, threat actors occasionally exploit software engineering oversights to bypass detection mechanisms. A notable instance of this is their manipulation of email attachments to evade scrutiny. Therefore, monitoring these evasion techniques and revising the code accordingly could significantly improve the efficacy of existing detection engines. Talos closely monitors changes in adversaries’ techniques to make sure our customers stay protected.

Advanced detection methods

The HTML smuggling technique enables emails to effectively evade traditional perimeter security measures, including email gateways, and presents significant detection challenges for rule-based systems. Consequently, the deployment of more sophisticated message processing and security solutions is imperative. Additionally, enhancing endpoint security measures to prevent the execution of such scripts on user devices is essential.

Talos continuously monitors the changes in attackers' techniques and updates our detection capabilities to ensure our customers stay protected. Learn more about Cisco Security Email Threat Defense here.

Indicators of compromise (IOCs)

Indicators of compromise (IOCs) associated with this blog post can be found here.