HexLasso performs byte coverage analysis through analyzers and transformers.

An analyzer is a dedicated routine that performs pattern recognition and prediction on a given block. The result is the byte coverage: how many bytes are covered by the pattern or predicted. The minimum possible value is zero; the maximum is the size of the block in bytes.

The analyzer reports the byte coverage directly, but it may call transformers to transform the data prior to analysis. A transformation can be needed because the transformed data may fit the pattern recognition and prediction models better.

The analyzer reports only on the bytes covered by the pattern recognition or prediction. For example, if there is a long sequence of ASCII bytes, the analyzer may report a string regardless of what the uncovered bytes are.
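
To make the idea concrete, here is a minimal sketch of an analyzer reporting byte coverage. The function name and the ASCII-run heuristic are assumptions for illustration, not HexLasso's actual implementation.

    # Hypothetical string analyzer: counts bytes that belong to runs of
    # printable ASCII. The 4-byte run threshold is an arbitrary choice.
    def string_analyzer_coverage(block: bytes) -> int:
        covered = 0
        run = 0
        for b in block:
            if 0x20 <= b <= 0x7E:        # printable ASCII byte
                run += 1
            else:
                if run >= 4:             # only long runs count as a string
                    covered += run
                run = 0
        if run >= 4:
            covered += run
        return covered                   # 0 <= covered <= len(block)

    block = b"Hello, world!" + bytes(3) + b"PADDING" * 6 + bytes(6)
    print(string_analyzer_coverage(block), "of", len(block))  # 55 of 64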

There are a number of analyzers, and each analyzer runs on the given block.

If the string analyzer reports covering, say, 60 bytes out of 64, the result is likely meaningful to the analyst.

However, if the string analyzer reports covering, say, only 5 bytes out of 64, a string is probably not a meaningful description of the block. Ideally, another analyzer covers more bytes; the analyzer that covers the most bytes describes the block.

HexLasso maintains a priority list of analyzers in which each analyzer is given a distinct priority. If two analyzers report the same byte coverage, the analyzer with the higher priority describes the block.

When comparing two analyzers, the one that detects the redundancy more accurately is given the higher priority. For example, the string analyzer is given higher priority than any of the match analyzers.
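
A sketch of the tie-breaking rule, assuming each result is a coverage/priority pair; the analyzer names and priority values below are made up for illustration.

    # Highest coverage wins; on equal coverage, higher priority wins.
    results = [
        ("string",     60, 90),   # (name, byte coverage, priority)
        ("word_match", 60, 40),
        ("byte00",      4, 80),
    ]
    best = max(results, key=lambda r: (r[1], r[2]))
    print("block described by:", best[0])  # string wins the tie on priority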


Here is a list of five random things about entropy.

1. The order of bytes in the data does not matter when calculating its entropy. The entropy is always the same regardless of the order of the bytes. That has consequences; here are a few (see the sketch after this list):

  • High entropy does not necessarily mean random data (although random data always has high entropy).
  • A high-entropy passkey does not by itself make the passkey a good choice for authentication.
  • You can raise the entropy of data to an arbitrary value. By appending content so that all byte values become equally distributed, the resulting data reaches the maximum entropy of 8.
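
A small demonstration of the order independence, using the standard Shannon entropy over byte frequencies:

    import math
    import random
    from collections import Counter

    # Only the frequency counts enter the formula, so byte order cannot
    # affect the result.
    def byte_entropy(data: bytes) -> float:
        counts, n = Counter(data), len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    data = bytes(range(256)) * 4          # equal distribution of all bytes
    shuffled = bytes(random.sample(data, len(data)))

    print(byte_entropy(data))      # 8.0, the maximum
    print(byte_entropy(shuffled))  # 8.0 as well, despite the reordering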

2. Let's assume data 1000 bytes in length has an entropy of 4 out of the maximum 8. From a data compression standpoint, this means each byte can on average be stored in 4 bits, so the data can be compressed to about 500 bytes. An entropy coder can approach this ratio, and a better ratio can likely be achieved depending on what is known about the data.
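
The arithmetic behind the figure, as a quick check:

    # 1000 bytes at 4 bits per byte is the theoretical size after
    # entropy coding, converted from bits to bytes.
    n_bytes, entropy_bits = 1000, 4.0
    print(n_bytes * entropy_bits / 8)   # 500.0 bytes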

3. Entropy analysis is often used to detect redundancy or anomalies in data. It is possible that a stream of data has an entropy of around 7 throughout its length, yet contains a structural change that does not noticeably show up as a change in entropy. In such a case, another approach is needed to detect the structural change, such as match analysis.
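
A sketch of why windowed entropy can miss such a change: below, the second half of the stream is an exact repeat of the first, which is a structural redundancy that match analysis would catch, yet every window reports a similarly high entropy. The window size is an arbitrary choice for illustration.

    import math
    import os
    from collections import Counter

    def byte_entropy(data: bytes) -> float:
        counts, n = Counter(data), len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    half = os.urandom(4096)
    stream = half + half                     # second half repeats the first

    for off in range(0, len(stream), 1024):  # 1 KiB windows
        print(off, round(byte_entropy(stream[off : off + 1024]), 2))
    # Every window prints roughly the same high value (about 7.8);
    # the repetition never shows up in the entropy profile.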

4. Publicly available tools usually calculate entropy at the byte level, in which case the maximum entropy is 8 because a byte has 8 bits. However, it is also possible to calculate entropy at the nibble level and at the word level, in which cases the maximum entropies are 4 and 16, respectively.
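
A sketch of calculating entropy at the three symbol widths; the nibble and word groupings below are assumptions for illustration (high/low nibble split, little-endian 2-byte words).

    import math
    from collections import Counter

    # Maximum entropy is log2(number of possible symbols):
    # 4 bits for nibbles, 8 for bytes, 16 for words.
    def entropy(symbols) -> float:
        counts, n = Counter(symbols), len(symbols)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    data = bytes(range(256))

    nibbles = [x for b in data for x in (b >> 4, b & 0x0F)]
    words = [int.from_bytes(data[i : i + 2], "little")
             for i in range(0, len(data) - 1, 2)]

    print(entropy(nibbles), entropy(data), entropy(words))  # 4.0, 8.0, 7.0
    # The word entropy here is capped at 7.0 by the 128-word sample,
    # not by the 16-bit maximum.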

5. The higher the entropy, the lower the redundancy, and the lower the redundancy, the higher the entropy. The two are inversely related.


Match analyzers in HexLasso look for matching byte sequences in a block of data and return the byte coverage of the found matches.

It is possible that no match is found in the block, and it is also possible that all of the bytes belong to matches.

HexLasso implements three types of matching algorithms; the difference between them is the width of the matches.

If the block contains QWord matches, it consequently contains DWord matches, which in turn means it contains Word matches: an 8-byte match necessarily contains matching 4-byte sequences, and those contain matching 2-byte sequences.
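
A sketch of width-based match coverage that exhibits this nesting; the definition of a match here (a width-sized, width-aligned sequence occurring more than once in the block) is an assumption for illustration.

    from collections import Counter

    def match_coverage(block: bytes, width: int) -> int:
        # Bytes covered by width-sized chunks that occur more than once.
        chunks = [block[i : i + width]
                  for i in range(0, len(block) - width + 1, width)]
        counts = Counter(chunks)
        return sum(width for c in chunks if counts[c] > 1)

    block = bytes.fromhex("DEADBEEFDEADBEEF") * 4 + bytes(range(32))
    for width, name in ((2, "Word"), (4, "DWord"), (8, "QWord")):
        print(name, match_coverage(block, width), "of", len(block))
    # The QWord matches in the first half imply DWord and Word matches
    # over the same bytes, so all three report 32 of 64 here.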

For example, by following the links you can see the Word, DWord, and QWord matches of a given block.

The presence of matches in the block indicates some sort of redundancy. However, the match analyzers are given lower priorities than many other analyzers because they cannot be specific about what the redundancies are.

There are many other analyzers that can be more specific about the redundancies. For example, the Byte00 analyzer returns the byte coverage of 00 bytes in the block; therefore, it is given a higher priority than the match analyzers. As a consequence, if the Byte00 analyzer and a match analyzer return the same byte coverage on a given block, the Byte00 analyzer will be reported.

Match analyzers are thus a fallback mechanism for when the specifics of the redundancy are not well known.


Researchers who work with binary data have eyes trained to recognize patterns in hexdumps.

A pattern consists of bytes with redundancy in them. Such patterns include arrays of correlated values, fixed-length structures, runs of bytes, textual data, machine instruction fragments, and so on.

Researchers may look for patterns in the hexdump of a file or of memory content. That covers many more specific content types, such as various file formats, malware samples, firmware images, process memory dumps, and so on.

There are many reasons to analyze a hexdump, and they vary by situation. Here are a few:

  • Improve the compression ratio of a data compression algorithm.
  • Improve the data mutation efficiency of a fuzzer.
  • Get an idea of how a program parses and processes the data, without reverse engineering the program (which may not even be available).
  • Improve decision making when telling whether a sample is malware and, if so, which family it can be associated with.
  • Recover data artifacts from a corrupted or unknown sample. Recover concealed data hidden in a sample by steganography techniques.
  • Hypothesize about the layout of the data.
  • Find anomalies in data, such as in network traffic.
  • Decide whether data is random (when no redundancies are observed).

Being able to automate the manual pattern recognition task would have two important advantages:

  1. A tool would allow anyone to assist in recognizing patterns in a hexdump.
  2. Analyses could be done in batches and at scale.

The core strategy for automated hexdump analysis is as follows.

The automation splits the input stream into blocks. I approximated the number of bytes I can manually process at a time when looking at a hexdump, and arrived at a block size of 64 bytes, which is just large enough to contain patterns.
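
A sketch of the blocking step, assuming a simple fixed-size split:

    BLOCK_SIZE = 64        # approximates what an analyst scans at once

    # Yield (offset, block) pairs; the final block may be shorter if the
    # stream length is not a multiple of 64.
    def blocks(stream: bytes):
        for offset in range(0, len(stream), BLOCK_SIZE):
            yield offset, stream[offset : offset + BLOCK_SIZE]

    stream = bytes(range(256))           # stand-in for real file content
    for offset, block in blocks(stream):
        print(f"{offset:08x}: {len(block)} bytes")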

Pattern recognition is done through analyzers.

There is a set of analyzers for different patterns, and each analyzer runs on the block. The results of the analyzers are compared, and the analyzer with the best result, that is, the highest score, describes the block.

When comparing the results of two analyzers, the one that covers more bytes has the higher score. A higher score means more certainty; if the score is the absolute maximum, it means 100% certainty.

The name of the analyzer should be as descriptive as possible of every byte incorporated in the pattern recognition.

Each analyzer is given a distinct priority. If two analyzers come back with the same score for a given block, the analyzer with the higher priority describes the block.

The reason to maintain a priority list for the analyzers is that some analyzers are more specific, and therefore more meaningful, than others.
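
Putting the strategy together, here is a sketch of the per-block selection; the analyzers, names, and priority values are illustrative assumptions, not HexLasso's actual set.

    # Run every analyzer on the block, keep the highest score, and break
    # ties by priority.
    def zero_coverage(block: bytes) -> int:
        return sum(1 for b in block if b == 0x00)

    def ascii_coverage(block: bytes) -> int:
        return sum(1 for b in block if 0x20 <= b <= 0x7E)

    ANALYZERS = [                 # (name, priority, routine)
        ("byte00", 80, zero_coverage),
        ("string", 90, ascii_coverage),
    ]

    def describe(block: bytes) -> str:
        scored = [(fn(block), prio, name) for name, prio, fn in ANALYZERS]
        score, _, name = max(scored)
        return f"{name} ({score}/{len(block)})"

    print(describe(b"Hello, HexLasso!" + bytes(48)))   # byte00 (48/64)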

The file notepad.exe opened in the prototype version of HexLasso Desktop, which implements the core strategy of hexdump analysis.


This is an important article to read if you use HexLasso CLI to analyze files in a Windows installation.

Symptoms

When you use HexLasso CLI to analyze files in a Windows installation, you may find that the analysis result is unexpected for one or more files.

Cause

You may see unexpected results if you run HexLasso CLI on files that are subject to file system redirection.

File system redirection is a feature of 64-bit Windows that redirects file access by 32-bit processes for backward compatibility reasons.

HexLasso CLI is not aware of this redirection. Therefore, if you intend to analyze C:\Windows\System32\wermgr.exe, Windows will redirect the file access to C:\Windows\SysWOW64\wermgr.exe, and the latter file will be analyzed instead.

Workaround

  1. Copy the files of the Windows installation into a temporary folder using a copy utility. Most copy utilities can handle file system redirection.
  2. Run HexLasso CLI on the files in the temporary folder.

Remarks

Although Microsoft provides an API function, Wow64DisableWow64FsRedirection, to disable file system redirection for the calling thread, using it would require calling a native function from the otherwise fully managed code. Looking ahead, keeping the code fully managed is preferred over addressing this platform-specific issue with a code change.
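
For reference, here is a hedged sketch of what calling the native API looks like, shown in Python with ctypes rather than managed code; it is illustration only, not a change HexLasso makes. The call matters only inside a 32-bit process on 64-bit Windows, since a 64-bit process is never redirected.

    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    old_value = ctypes.c_void_p()

    # Disable redirection for the calling thread, read the real
    # System32 file, then restore the previous redirection state.
    if kernel32.Wow64DisableWow64FsRedirection(ctypes.byref(old_value)):
        try:
            with open(r"C:\Windows\System32\wermgr.exe", "rb") as f:
                data = f.read()
        finally:
            kernel32.Wow64RevertWow64FsRedirection(old_value)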