[Cyware] How Polyglot Files Enable Cyber Attack Chains and Methods for Detection & Disarmament

Summary: This article discusses the concept of polyglot files, which are files that can be interpreted as multiple file types simultaneously, and the potential security implications they pose.

Threat Actor: N/A
Victim: N/A

Key Points:

  • Polyglot files are designed to exploit the way different file formats are interpreted by different software, allowing them to bypass security measures and potentially execute malicious code.
  • These files can be used in various cyber attacks, including phishing campaigns, malware distribution, and data exfiltration.


Luke Koch (kochlr@ornl.gov), Oak Ridge National Laboratory, Oak Ridge, TN, USA
Sean Oesch (oeschts@ornl.gov), Oak Ridge National Laboratory, Oak Ridge, TN, USA
Amul Chaulagain (chaulagaina@ornl.gov), Oak Ridge National Laboratory, Oak Ridge, TN, USA
Jared Dixon (dixonjm@ornl.gov), Oak Ridge National Laboratory, Oak Ridge, TN, USA
Matthew Dixon (dixsonmk@ornl.gov), Oak Ridge National Laboratory, Oak Ridge, TN, USA
Mike Huettal (huettelmr@ornl.gov), Oak Ridge National Laboratory, Oak Ridge, TN, USA
Amir Sadovnik (sadovnika@ornl.gov), Oak Ridge National Laboratory, Oak Ridge, TN, USA
Cory Watson (watsoncl1@ornl.gov), Oak Ridge National Laboratory, Oak Ridge, TN, USA
Brian Weber (weberb@ornl.gov), Oak Ridge National Laboratory, Oak Ridge, TN, USA
Jacob Hartman (hartmanj@ainfosec.com), Assured Information Security, Rome, New York, USA
Richard Patulski (patulskir@ainfosec.com), Assured Information Security, Rome, New York, USA

Abstract

A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files
to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild—the first of its kind—we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20% for polyglot detection and 99.47% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.

Keywords File-format Identification, Malware Detection, Polyglot Files, Machine Learning, APT Survey, Content Disarmament and Reconstruction

1 Introduction
Figure 1: Functionality of a polyglot file is determined by the calling program, which can be explicitly provided or automatically determined by the operating system’s auto-launch settings.

A polyglot file simultaneously conforms to two or more file-format specifications. This means the polyglot file can exhibit two completely different sets of behavior depending on the calling program, as depicted in Figure 1. This dual nature poses a threat to endpoint detection and response (EDR) tools and file-upload systems that rely on format identification prior to analysis. As shown in Figure 2, a polyglot can evade correct classification by first evading format identification. If only one format is detected, then the sample may not be routed to the correct feature-extraction routine (in the case of machine learning-based detectors) or compared to the correct subset of malware signatures (in the case of signature-based malware detection).
As evidence that existing commercial off-the-shelf (COTS) endpoint detection and response tools are vulnerable to polyglots, we point to Bridges et al. [Bridges et al.(2020)], who demonstrated that 4 competitive COTS tools detected 0% of the malicious polyglots in the test data.

Standardized formats for files play a key role in cybersecurity.
By first identifying the format of an unknown sample, malware detection tools can extract the most discriminative and robust features from it. This allows the detection tool to discard unimportant bytes that can be manipulated to alter classification in an adversarial attack [Kolosnjaji et al.(2018), Demetrio et al.(2021)].
However, this feature-extraction process introduces a vulnerability; the correct format must be detected in order to route the file to the correct feature extractor.
Even when a detector does not use machine learning and instead relies upon signatures for detection, the need to maintain a high throughput encourages EDR tools to only search for signatures that correspond to the detected format [Jana and Shmatikov(2012)].

As prior researchers [Jana and Shmatikov(2012), Albertini(2015), Clunie(2019), Ortiz(2019), Desjardins et al.(2020), Popescu(2012b)] have demonstrated, polyglot files can be crafted that are fully valid (execute as intended) in multiple formats.
To date, however, no comprehensive study of polyglot usage by malicious actors in the wild and/or methods of detecting said polyglots has been undertaken.
In this paper, we set out to answer four key research questions related to polyglot usage and mitigation:

RQ1: How are polyglots currently used by threat actors in the wild? This includes the role the polyglot fills, the formats of the donor files, and the combination method used to fuse the donors together.

RQ2: Can we train a detector to effectively filter or reroute polyglots prior to ingestion by a malware detection system?

RQ3: Does this detector outperform existing file type detection, file carving, and polyglot-aware analysis tools at detecting polyglot files?

RQ4: Given the prevalence of image-based polyglots in adversary usage and the relative simplicity of image formats, what tools can we provide to defenders to address image-based polyglots in their existing workflows?

To address RQ1, we reviewed open-source intelligence feeds (see Section 3.1 for methods) that detail adversary tactics, techniques and procedures (TTP), finding that polyglots have played an important role in a number of malicious campaigns by well-known advanced persistent threat (APT) groups.
Polyglot files allowed the malicious actors to covertly execute malicious activity and extract sensitive data by masquerading as innocuous formats.
In Section 3, we provide an overview of the different roles polyglots played in each campaign, detail the file combinations used, and provide a detailed description of several high profile examples.
To address RQ2-RQ4, we first created a tool, Fazah, for generating polyglots that mimic the examples seen in the wild. Although there are other possible format combinations, our goal with this tool was to mimic, as closely as possible, the formats and combination methods used by real-world threat actors.
Using this tool, we then created a data set of polyglot and normal (referred to hereafter as monoglot) files for training and testing. See Section 4 for a full description of the data set.

To address RQ2, we tested machine learning models to solve both the binary and the multi-label classification problems, achieving an F1 score of 99.20% for binary classification and 99.47% for multi-label classification with our deep learning model PolyConv.
To address RQ3, we evaluated five commonly used format identification tools on this dataset: file [Darwin et al.(2019)], binwalk [ReFirmLabs(2021)], TrID, polydet, and polyfile. These tools were selected because they are used in existing cybersecurity tools or claim to detect polyglot files.
We evaluated the performance of these tools at both binary and multi-label classification.
In our context, binary classification determines whether a file is a polyglot or a monoglot. Multi-label classification, on the other hand, identifies all formats to which the file conforms.
We found that existing tools did not exceed an F1 score of 93.32% at binary classification and 83.74% at multi-label classification.

See Section 5 for details regarding our ML based approaches and Section 6 for a comparison of ML-based approaches to existing file-format identification tools.

As detailed in Section 7, to address RQ4 we developed and tested a CDR tool for sanitizing image-based polyglots since these were the most common vector for polyglot malware. We also tested YARA rules for detecting extraneous content in image files.
We found that the YARA rule approach did not generalize well to all formats that can be combined with an image, especially the more flexible scripting formats like PowerShell or JavaScript. However, the rules may be of use in high-throughput cases where deploying a deep learning model is not feasible.
A more effective approach is to strip all extraneous content from images using a content disarmament and reconstruction (CDR) tool. Our CDR tool, ImSan, sanitized every file in a random subset of our image polyglots; a subset was used so we could manually verify the results.

The following provides a summary of our contributions:

  • RQ1: The first, to our knowledge, survey of polyglot usage by malicious actors in the wild, demonstrating that polyglot files are an actively used TTP by well-known malicious actors.
    Utilizing the results of this study, we created a tool, Fazah, to generate polyglots using formats and combination methods exploited by malware authors in the wild.
    We then used Fazah to generate a dataset of polyglots and monoglots to evaluate existing detection methods and train polyglot detection models.

  • RQ2: Utilizing this novel dataset, we trained a deep learning model, PolyConv, that can distinguish between polyglots and monoglots with an AUC score over 0.999.
    We also created a multi-label model that reports all of the detected formats in monoglot and polyglot files, enabling analysts to quickly determine the nature of a threat or route the suspicious file to multiple format-specific detection systems.

  • RQ3: We provide a comparison of our polyglot detection models with existing file-format identification and carving tools, some of which are polyglot aware.
    This evaluation shows that existing methods for detecting file type manipulation are inadequate and often fail to detect polyglot files, even with special flags set that are meant to ensure multiple file types are detected.

  • RQ4: For image-based polyglots, which are common in the wild, we explored YARA rules and content disarmament and reconstruction (CDR) tools, finding that our ImSan CDR tool was 100% effective while the YARA rules did not compete with our deep learning detector. They may, however, be of use in high-throughput situations.

Figure 2: Since polyglot files simultaneously conform to multiple formats, they can evade correct format identification. This in turn allows them to evade format-specific feature extraction or signature matching, thereby evading malware detection. Therefore, some preprocessing should be done to either filter/quarantine polyglot files prior to feature extraction or route them to multiple format-specific malware detectors so all functional components of the polyglot are analyzed.

3 RQ1: Polyglot Exploitation in the Wild

Thanks to Bridges et al. [Bridges et al.(2020)], we know polyglots can evade detection by COTS tools. However, the extent to which malicious actors employ polyglots has never, to our knowledge, been published before. Do malicious actors use polyglots in their attack chains? What role do polyglots play within an attack chain? What file formats and combination methods were utilized in these attacks?
To address these questions we conducted a survey of threat intelligence feeds, collecting file hashes of polyglot samples and information on the roles played by these files within attack chains.
For the file hashes and a list of the sources used in this survey, see Table 3 and Table 4 in Section 9.

3.1 Survey Methods

The survey, performed between November 2022 and January 2023, focused on identifying the role of a polyglot file within a threat actor’s cyber-attack chain.
We used publicly available independent sources, general search engines, and threat intelligence feeds (e.g., ORKL, X) to gather a wide range of information security reports and articles. Those sources were searched using the following terms: polyglot, combined, and contained. We found that the term polyglot is not always utilized in reports. We therefore had to manually distinguish between reports of true polyglots (two or more valid formats in one file) and other forms of digital steganography. A number of reports described malware that contained a valid format along with an oft-encrypted set of malicious instructions. We do not consider these files polyglots because the malicious instructions can only be correctly interpreted when passed as input to another component of the malware rather than to a parser conforming to a published standard.

For each true polyglot found, we used our knowledge of threat operations to determine the role the polyglot played in the cyber-attack chain. Lastly, the online malware databases, VirusTotal and MalwareBazaar, were used to obtain the actual polyglot samples whenever hashes of the polyglot were provided in a report. The file hashes and sources from our survey of open-source intelligence can be found in the appendix in Tables 3 and  4, respectively.

3.2 Role of Polyglot Files in Cyber Attack Chains

The survey discovered fifteen examples of a threat actor using a polyglot file in their cyber-attack chain, along with 30 distinct polyglot files.
According to MITRE’s Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework, polyglots are primarily utilized for Defense Evasion (MITRE ATT&CK TA0005). Polyglot files also fall under the Obfuscated Files or Information (MITRE ATT&CK T1027) heading since these files conceal hidden functionality by appearing to conform to only one file format. We obtained 30 polyglot samples from VirusTotal and MalwareBazaar using the file hashes specified in the reports.

For the purpose of establishing a formal taxonomy for polyglot files, we refer to polyglots as having an overt format and a covert format. The overt format is the format the file presents as (e.g., matches the extension) while the covert format is not apparent without analysis. In most cases, a polyglot consists of a malicious file combined with a benign one; however, in some cases we found that both file formats play a role in advancing the malicious attack chain, as in the HTA+CHM polyglot utilized by IcedID in Section 3.2.1. Therefore, we instead refer to polyglots as combining an overt format with a covert format. A summary of the found-in-the-wild samples is provided in Table 1. In Appendix 9.2 we discuss the capabilities of interest that each file format provides to the malware author (camouflage, non-standard execution path, etc.) to understand why these combinations exist in the wild and how they fill a desired role in attack chains.

We selected two cyber attack chains to demonstrate how well-known APTs utilize polyglots to reach the next step in their cyber attack chains. A third attack chain can be found in Appendix 9.1. CVE numbers and MITRE ATT&CK references are provided where applicable.

3.2.1 IcedID

IcedID is a banking trojan that, according to Check Point’s Global Threat Index, was the fourth most widespread malware variant in 2022 [Team(2022a)]. The trojan uses an evolving variety of methods to establish initial access. One of these methods relies on a polyglot formed by combining a CHM and an HTA file.

The attack chain is illustrated in Figure 3. It begins with a password-protected Zip file attached to a phishing email. The Zip contains an ISO file which exploits CVE-2022-41091 to evade flagging by Microsoft’s alternate data stream (ADS) defensive mechanism [Hegt(2020)].

The ISO file in turn contains two files: a DLL (hidden by default on Windows) and a CHM+HTA polyglot. The polyglot masquerades as a CHM file which presents a benign decoy window when executed. The Microsoft Compiled HTML (CHM) format is used for software documentation. Each CHM file consists of a number of HTML pages organized into a document and compressed into a binary stream. As with any HTML page, CHM files may download/execute other files or run PowerShell/JavaScript commands when viewed.

In the background, this CHM file starts a MSHTA.exe process with itself as the input. This new process executes the malicious component of the polyglot, the HTA file, which in turn launches the hidden DLL file that contains the actual IcedID payload.

Figure 3: IcedID Attack Chain

3.2.2 Andariel/Lazarus

Lazarus (of which Andariel is a subgroup) is an advanced threat group that has operated out of North Korea since 2009 [Htet(2017)]. In 2021, attack chains connected to this group utilized polyglots to infect systems with a Remote Access Trojan (RAT) [Jazi(2021), Park(2021)]; this process is illustrated in Figure 4.

This attack chain typically begins with a phishing email that has an attached malicious Microsoft Word Document (DOC) file (MITRE ATT&CK T1566). When the DOC file is launched, a macro begins execution (MITRE ATT&CK T1204.002). First, the macro drops a PNG file to the Temp directory. The image data in the PNG file is a compressed polyglot file.

Next, the DOC macro converts the PNG file to a BMP file, which has the intended side effect of decompressing the contents (MITRE ATT&CK T1140). The DOC Macro does this by leveraging the Windows Image Acquisition (WIA) Automation Layer Objects: ImageFile and ImageProcess [Microsoft(2018a), Microsoft(2018b)].

After conversion, the DOC Macro saves the BMP as a zip file by giving it a zip extension. However, the file is actually a BMP+HTA polyglot, with the HTA covert contents appended to the end of the overt BMP data. Finally, the DOC Macro executes the polyglot file as an HTA file using the MSHTA application via the Windows Management Instrumentation (WMI) Service (MITRE ATT&CK T1059, T1047).

WMI is used so that the resulting process does not appear to be a child of the DOC process. The HTA file drops its payload, a hidden PE file, into a hidden folder. Finally, the HTA file launches the PE file which provides a foothold on the target system for future exploitation.

Figure 4: Andariel/Lazarus Attack Chain

4 Wild Polyglots: A Polyglot Data Set Based on Malicious Usage in the Wild

This section describes how we created our data set based on our survey of polyglot usage in the wild (RQ1) using the Fazah tool in order to address RQ2-RQ4.

4.1 Fazah: A Polyglot Generation Framework

Having uncovered which formats have been used in real-world malicious polyglots, we created a data set consisting of monoglot and polyglot files conforming to these formats. Our first step was to create a framework for generating polyglots by combining donor files. Our goal for this tool was to mimic format and combination methods found in the wild rather than demonstrate all possible combinations.

Table 1: Polyglot Formats Deployed Maliciously in the Wild

Covert Format    Overt Format(s)
HTA              JPEG, PNG, BMP, GIF, LNK, PE, MSI, RAR, Zip, TTF, CHM, PDF
PHP              JPEG, PNG, BMP, GIF, TTF, RAR, Zip, LNK, PDF
PHAR             JPEG, PNG, BMP, GIF
JavaScript       GIF, BMP
PowerShell       JPEG, BMP, GIF
Zip              JPEG, PNG, GIF, PDF
JAR              JPEG, PNG, GIF, PDF, MSI
RAR              JPEG, PNG, BMP, GIF
BMP              Zip, JAR

The Fazah framework is a modular tool written in Python that can currently generate 46 format combinations using 8 covert formats. The combination method—stack and a variety of parasites—is derived from reports of malicious use in our survey and varies by covert format. As discussed in the survey, malicious actors use polyglots either to disguise malicious content using a less suspicious format (images) or to add hidden functionality (scripts). Since image formats typically use comment markers, parasites are commonly used by malicious actors. Stacks, meanwhile, are the simplest and easiest method for malicious actors to implement, working well with script and archive formats. Files with distinct comment markers (necessary for zippers) are quite rare. Of the common (but by no means exhaustive) set of formats we tested, only DCM combined with either PDF/GIF/ISO could result in a zipper. Similarly, we found that only ISO paired with PE/PNG/GIF yielded cavities. This does not preclude their use in malicious campaigns, but places them beyond the scope of our goal of emulating known attack chains. Table 1 provides the format pairings that Fazah can turn into polyglots.
Given the possibility for malicious abuse of the framework, Fazah will not be published publicly at this time.

4.1.1 Wild Polyglots Data Set Creation and Contents

We collected benign files conforming to 13 common formats using Github’s search API: BMP, EXE, GIF, HTA, JAR, JPG, JS, MSI, PHP, PNG, PS1, RAR, ZIP. Using a held-out set of donor files, we created 32 types of polyglots organized according to which 2 types of donor files were combined to create the polyglot file. We kept all donor files separate from the train and test set to ensure that the models did not cheat by learning that data added to a monoglot in the training set is a polyglot. Table 2 provides an overview of the Wild Polyglots data set. Figures 5 and 6 break down the formats contained in the monoglot and polyglot training sets, respectively.

Figure 5: File counts for the monoglot formats in the Wild Polyglots training data.

Since our objective was to train a polyglot detector rather than a malware detector, we only utilized benign files. We first scanned the files we scraped for malware and removed any suspicious samples. Next, we removed any scraped file whose extension did not match its contents (e.g., a JPEG with a .png extension) or that could not be parsed by an appropriate utility (e.g., Pillow for images). We erred on the side of inclusion for highly flexible scripting language formats like HTA. Since MSHTA.exe is tolerant of a high degree of malformation, we felt it unwise to exclude malformed HTA from our training data.
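
For illustration, the following is a minimal sketch of this kind of validation check, shown for image formats only; it is not our exact implementation, and the helper name and the jpg-to-jpeg aliasing are illustrative assumptions.

```python
# Sketch: keep a scraped image only if Pillow can parse it and the extension
# matches the detected format. Not the authors' released filtering code.
from pathlib import Path
from PIL import Image

def keep_image(path: Path) -> bool:
    try:
        with Image.open(path) as img:
            img.verify()                      # integrity check of the encoded stream
        with Image.open(path) as img:         # re-open: verify() exhausts the file
            detected = (img.format or "").lower()
    except Exception:
        return False                          # unparsable files are dropped
    ext = path.suffix.lower().lstrip(".")
    return {"jpg": "jpeg"}.get(ext, ext) == detected
```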

Table 2: Wild Polyglots Data Set Contents

           Train      Test
Monoglot   25,192     9,975
Polyglot   1,148,604  213,109

5 RQ2: Using Machine Learning for Polyglot Detection

This section explores using machine learning to detect polyglot files. Section 5.1 chronicles our development process as we tested different ML model architectures and experimented with improvements to the feature space. Section 5.2 presents the results from our best-performing models.

Figure 6: File counts for each of the 32 polyglot combinations in the Wild Polyglots training data.

5.1 ML-based Detection Development

Our first objective was to determine which machine learning architecture and feature set were most effective at detecting polyglots. Toward this end, we created a small (∼70,000 files) initial data set using the mitra tool (described in Section 2.2) prior to the development of our Fazah tool. On this preliminary data set, we tested a Support Vector Machine, Random Forest, GradBoost, CatBoost, LightGBM, and MalConv. With the exception of MalConv [Raff et al.(2018)], these models used the byte histogram as their only feature. The byte histogram is a vector of length 256 where the value stored at each index corresponds to the number of times that byte value occurs in the input file. This feature vector is agnostic with respect to file formats since all digitally stored files are a string of bytes. We found that, on this preliminary data set, MalConv and CatBoost were the top performers.
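
For clarity, a minimal sketch of the byte-histogram feature follows; the helper name is illustrative.

```python
# Sketch: the 256-length byte histogram, a format-agnostic feature because
# every digitally stored file is just a string of bytes.
import numpy as np

def byte_histogram(path: str) -> np.ndarray:
    with open(path, "rb") as fh:
        data = np.frombuffer(fh.read(), dtype=np.uint8)
    return np.bincount(data, minlength=256)   # counts of each byte value 0-255
```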

We focused further development on MalConv and CatBoost, labeling our improved versions PolyConv and PolyCat, respectively. At this point, we trained and tested both models on our survey-informed Wild Polyglots data set; all results and figures reported in this paper refer to the Wild Polyglots data set. We found that, for PolyCat but not PolyConv, adding the mime-type output of the file utility improved results. Although file was not competitive at detecting polyglots (see Section 6.1) or at identifying both formats contained within, it was extremely accurate at identifying the first format contained in the file. Therefore, we augmented PolyCat’s feature space with a 1-hot encoding of the mime-type output from file. We found further improvement by adding the 8,000 most common bigrams and trigrams extracted from each file using an overlapping window. Thus, the final feature space for PolyCat consisted of the byte histogram, the 1-hot encoding of the mime-type from file, and the most common bigrams and trigrams.
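
The following sketch illustrates the two auxiliary PolyCat features described above; it is not our exact implementation. The mime-type vocabulary and n-gram vocabulary would be fixed from the training data and are passed in here as placeholders.

```python
# Sketch: 1-hot mime-type from the `file` utility plus overlapping-window
# bigram/trigram counts against a fixed vocabulary of common n-grams.
import subprocess
from collections import Counter

def mime_one_hot(path, mime_vocab):
    out = subprocess.run(["file", "--brief", "--mime-type", path],
                         capture_output=True, text=True)
    mime = out.stdout.strip()
    return [1 if mime == m else 0 for m in mime_vocab]

def ngram_counts(data: bytes, ngram_vocab):
    grams = Counter(data[i:i + n] for n in (2, 3) for i in range(len(data) - n + 1))
    return [grams[g] for g in ngram_vocab]
```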

MalConv is an oft-cited deep learning classifier designed to detect malware [Raff et al.(2018)]. We trained the model from scratch to identify polyglots rather than to identify malware. None of the polyglots in our data set were malicious in order to guarantee that the model learned to detect multiple formats rather than malicious content. Since the model is trained on raw bytes rather than format-specific features (e.g., the EMBER feature set for PE files [Anderson and Roth(2018)]), MalConv’s architecture is well-suited to the polyglot detection problem which requires a format-agnostic approach. In lieu of a fixed feature-extraction routine, the model takes in raw bytes and learns an encoding (first layer) as well as a set of filters (the convolution layers) to recognize significant byte patterns. MalConv also features an attention and gating mechanism intended to filter out extraneous information in the raw bytes.

We experimented with changes to the architecture in order to make it more effective at our novel task, yielding the PolyConv model mentioned above. The original architecture of MalConv is presented in Figure 7 while PolyConv’s architecture is presented in Figure 8.

The changes we made to MalConv consist of the following:

  • Decreasing the window and stride from 512 bytes to 16 and 8 bytes, respectively, in order to capture the byte patterns of very short (in terms of bytes) script files hidden within larger files

  • Removing the attention and gating mechanism as they did not seem to improve the results on our task

  • Increasing the number of kernels in the remaining convolution layer to 512 in order to learn enough byte patterns to distinguish the wide variety of distinct formats upon which we trained the model

  • Increasing the number of fully connected layers to 3 as a result of experimenting with different layer counts

  • Increasing the number of nodes in each fully connected layer to 512, 512, and 128 as a result of experimenting with different node configurations
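
For reference, the sketch below shows a PolyConv-like network consistent with the changes listed above; the embedding width, maximum input length, pooling choice, and output head are illustrative assumptions rather than the exact released architecture.

```python
# Sketch of a PolyConv-like model (binary head shown). Inputs are byte values
# in [0, 255], padded/truncated to max_len with the padding index 256.
import torch
import torch.nn as nn

class PolyConvSketch(nn.Module):
    def __init__(self, emb_dim: int = 8, n_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(257, emb_dim, padding_idx=256)       # 256 byte values + padding
        self.conv = nn.Conv1d(emb_dim, 512, kernel_size=16, stride=8)  # small window/stride, 512 kernels
        self.pool = nn.AdaptiveMaxPool1d(1)                            # global max over positions
        self.fc = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.embed(x).transpose(1, 2)   # (batch, emb_dim, length)
        z = torch.relu(self.conv(z))        # learned byte-pattern filters
        z = self.pool(z).squeeze(-1)        # (batch, 512)
        return self.fc(z)
```

For the multi-label variant, the final layer would instead emit one sigmoid output per supported format.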

Figure 7: MalConv Architecture
Figure 8: PolyConv Architecture

5.2 Comparing ML-based Polyglot Detection Approaches

We trained and tested PolyConv, MalConv, PolyCat, and CatBoost on our Wild Polyglots data set. For this comparison, we evaluated binary label (polyglot or monoglot) versions of the models. Since our data set is imbalanced, we used the precision-recall curve rather than the ROC curve to score our models. Therefore, our top model is the one with the highest PR-AUC on the Wild Polyglots test set. PolyConv scored a PR-AUC of 0.99998, the highest score for all the models we evaluated. MalConv—when trained on this novel task—scored a slightly lower PR-AUC of 0.99989, outperforming both PolyCat and CatBoost.
The model results are summarized in Figure 9.
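
For completeness, a minimal sketch of the PR-AUC computation used to rank the binary models follows; y_true and y_score are hypothetical placeholders for test labels (1 = polyglot) and model scores, not data from the paper.

```python
# Sketch: area under the precision-recall curve with scikit-learn.
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

def pr_auc(y_true, y_score) -> float:
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

# average_precision_score(y_true, y_score) gives a closely related summary.
```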

Figure 9: Precision-Recall AUC Scores: Our deep learning model, PolyConv, slightly outperformed the stock version of MalConv upon which it is based, as well as CatBoost and PolyCat.

6 RQ3: Comparison to Existing Signature-based File-format Identification Tools

This section compares our best-performing polyglot detection model, PolyConv, to existing tools for format identification to determine which approach is best suited to identifying polyglot files and labeling their contents correctly.

Within the context of cybersecurity, there are two complementary questions of paramount importance: detection and analysis. We trained two versions of our best-performing model, PolyConv, that differ only in the final layer to suit detection and analysis needs.

The first version is a binary classifier (polyglot or monoglot) for use in filtering out polyglots on an endpoint. This is intended for file upload services that only want to allow uploads of known formats, e.g., images.

The second version is a multi-label classifier that identifies all of the formats detected within a file. This provides two benefits. First, the labels can be used to route files to all applicable file-format feature extraction or signature-matching routines rather than a single format-specific model or signature subset. This means that the remainder of an existing EDR tool’s extraction and detection routines do not need to be altered. Second, the labels provide an analyst with introspection, revealing not only that the file is a polyglot but also which format-specific tools/routines they should use to examine the covert format(s) hidden in the polyglot. This is intended to reduce the response time for security operations center (SOC) analysts that must handle a high volume of alerts.

6.1 Tools Tested

We established a baseline for performance by testing existing file-format identification tools on the Wild Polyglots data set: file [Darwin et al.(2019)], binwalk [ReFirmLabs(2021)], TrID [Pontello(2020)], and polydet [pol(2018)]. We also evaluated polyfile [of Bits(2022)], a DARPA-funded tool developed by Trail of Bits for detecting unusual files.
Of the aforementioned tools, file and TrID are well-established signature-based utilities for file-format identification. VirusTotal, a widely used anti-virus aggregator (www.virustotal.com), utilizes TrID when reporting detected formats. Binwalk is a file-carving tool that has been used by analysts to find and extract hidden files. We selected these tools to establish a baseline because of their widespread use (file), cybersecurity application (binwalk, TrID), and polyglot awareness (polyfile, polydet). We leave as future work a comparison to FiFTy [Mittal et al.(2020)] and Sceadan [Beebe et al.(2013)], as these detectors do not appear to be polyglot-aware, but might be re-trained in order to properly label polyglot files. Since file outputs labels and not probabilities, the precision-recall curve is not an appropriate metric when comparing our deep learning model to existing tools. Instead, we used the F1 score, which balances recall and precision. For any cybersecurity system deployable in the real world, the ability to detect malware/polyglots (recall) must be tempered by a low probability of false positives (precision) to prevent red-flag fatigue. Therefore, we use F1 to provide a balanced evaluation.

6.1.1 Binary Comparison

Figure 10 considers the performance of each tool in a binary context, determining if the tool detects the presence of two or more formats in one file. TrID aggressively speculates as to which formats are present in a file, assigning a percent score to each possibility. We therefore omitted the performance of TrID as a multi-label detector as this behavior put it at a disadvantage compared to the other tools.
As Figure 10 demonstrates, none of the existing tools approached the F1 score, precision, or recall of our PolyConv deep learning model. All of the tools had a relatively high precision and low recall, indicating that false negatives were the primary cause of the low F1 scores.

The precision for file was lower than expected, as the tool reported multiple formats when examining BMP, EXE, HTA, and PHP monoglots. The EXE false positives may have been caused by the presence of other files embedded as resources. Although it was outperformed by our PolyConv model, polyfile was the best binary performer among the existing set of tools by F1 score.

Figure 10: Binary Performance vs Existing Tools: PolyConv exceeded the F1 score, precision, and recall of all existing tools by a large margin.

6.1.2 Multi-label Comparison

Figure 11 considers the performance of each tool in a multi-label context where a true positive means the tool correctly identified both the count and the exact formats present in each file. None of the tools performed well compared to the multi-label version of PolyConv.

Of the existing tools, polydet outperformed the other tools in all three metrics by a noticeable margin. With regard to the remaining tools, file’s precision is unusually low given its widespread use and long development history. Upon examination, we found that file did not differentiate between PowerShell and JavaScript files; instead, it applied the generic label of ASCII or Unicode text. This behavior almost exclusively accounted for the lower precision.

The lack of required signatures for script files makes signature-based detection difficult for these script formats. Upon further inspection we found that polyfile and polydet share file’s dependence on Libmagic, which labels PowerShell and JavaScript as either ASCII or Unicode text. While it might seem unfair to expect Libmagic to differentiate between different forms of ASCII or Unicode text, we consider it important for analysts to be aware of this opaque label. A harmless log file of unstructured ASCII text presents a very different level of danger compared to a functional JavaScript file.
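
This behavior can be illustrated with the python-magic bindings to Libmagic (an illustrative choice; the tools above invoke Libmagic through their own wrappers), where script files receive only a generic text label. The file paths are hypothetical.

```python
# Illustration of the Libmagic labeling behavior described above.
import magic

print(magic.from_file("sample.js"))              # typically a generic "ASCII text" style label
print(magic.from_file("sample.ps1", mime=True))  # typically "text/plain" rather than a script type
```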

Figure 11: Multi-label Performance vs Existing Tools: PolyConv also proved more adept at correctly identifying all of the formats contained within a file. Of the existing tools, polydet provided the most reliable file-format identification.

7 RQ4: Methods for Addressing Image-based Polyglots

Given the prevalence of image-based polyglots in adversary usage and the relative simplicity of image formats, we developed tools for detecting and remediating polyglots that employ an overt image format.

We first tested YARA rules in the hope that the comment markers/delimiters present in image files would allow for rule-based detection of extraneous content. However, we found that their recall of 82.08% and F1 score of 90.15% were too low to be useful except in situations where high throughput is paramount. We then turned to the content disarmament and reconstruction approach.
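
For concreteness, the sketch below is a simplified Python analogue of the trailing-content condition such rules target (bytes appearing after an image format’s end marker). It is an illustration of the idea, not the rules used in our evaluation, and like them it does not generalize to every format.

```python
# Sketch: flag image files with bytes after the format's end marker.
# Can false-positive on images with legitimate trailing padding.
END_MARKERS = {
    b"\xff\xd8": b"\xff\xd9",              # JPEG: end-of-image marker
    b"GIF8":     b"\x3b",                  # GIF: trailer byte
    b"\x89PNG":  b"IEND\xae\x42\x60\x82",  # PNG: IEND chunk type + CRC
}

def has_trailing_content(path: str) -> bool:
    with open(path, "rb") as fh:
        data = fh.read()
    for magic_bytes, end in END_MARKERS.items():
        if data.startswith(magic_bytes):
            last = data.rfind(end)
            return last == -1 or last + len(end) < len(data)
    return False  # format not covered by this simple check
```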

7.1 ImSan, a Content Disarmament and Reconstruction Tool for Image-based Polyglots

Content disarmament and reconstruction (CDR) tools present an alternative to the pre-processing filtering approach for which we have provided solutions. CDR tools allow an end user to strip all but the most trustworthy content from certain formats. As highly flexible formats like PDF have proliferated, these tools have emerged to enable the secure use of files whose format flexibility could otherwise be abused.

Although we have not exhaustively examined this approach, we have developed an image sanitization tool to demonstrate the potential of CDR in disarming polyglots.
Our tool, ImSan, disarms image-based polyglots by stripping away all file contents that are not required to display the image. The process is quite straightforward:

  1. The image file is loaded into Pillow, a fork of the Python Imaging Library.

  2. The image contents are then written to a new file with the option to strip all metadata activated.

  3. The new image file has no extraneous content before/after the image contents (stack/cavity polyglots) or inserted into comment areas (parasite/zipper polyglots).

ImSan can disarm any of the formats that are fully supported (read/write) by Pillow: BLP, BMP, DDS, DIB, EPS, GIF, ICNS, ICO, IM, JPEG, JPEG 2000, MSP, PCX, PNG, PPM, SGI, SPIDER, TGA, TIFF, WebP, XBM.
Note that ImSan should be run in an isolated environment to ensure that no vulnerability in Pillow (two CVEs reported in 2022) could allow a malicious image to gain execution when the image is parsed.
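
A minimal sketch of the re-encoding step that ImSan performs with Pillow follows; the function name is illustrative and this is not the released tool. As noted above, it should run in an isolated environment because decoding untrusted images is itself risky.

```python
# Sketch: re-encode an image with Pillow so that only the decoded pixel data
# survives; appended, cavity, and comment-parasite content is dropped.
from PIL import Image

def sanitize_image(in_path: str, out_path: str) -> None:
    with Image.open(in_path) as img:
        img.load()                    # force a full decode of the pixel data
        fmt = img.format              # preserve the original container format
        clean = img.copy()            # pixel data and mode only
        clean.info = {}               # drop parsed metadata (comments, EXIF, ICC, ...)
        clean.save(out_path, format=fmt)
```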

ImSan disarmed 100% of the files in a subset (n=392) of image polyglots drawn randomly from the benign Wild Polyglots data set.
A small subset was chosen so we could manually verify disarmament through visual inspection of the image’s code rather than relying on one of our detectors.
An evaluation of commercial CDR tools against polyglots (including those that are not image based) and the potential methods of circumventing CDR solutions, while out of scope for this work, would be a valuable direction for future work to explore.

Acknowledgements

This work was funded by a key intelligence community partner’s Laboratory for Advanced Cybersecurity Research under a Memorandum of Agreement.

Source: https://arxiv.org/html/2407.01529v1

