Posted | Modified
Author

Building on the concept of the prediction table, the algorithm has been extended with a gap parameter.

Now, the algorithm predicts the most likely byte that comes after the current byte, allowing for a gap between them.

In practice, this means the algorithm can find frequent 16-bit patterns with gaps. The example below shows a gap of 3 bytes between two 00 bytes.

00 +(3) 00

The tool FF-16 (Find Frequent 16-bit) utilizes this algorithm.

FF-16 processes a file by splitting it into multiple 256-byte blocks, running the algorithm independently on each block. While the analysis occurs at the block level, the results are displayed at the bucket level, where a bucket consists of multiple blocks. This approach ensures that the user is not overwhelmed with excessive detail, making the output more practical to interpret.

Map View

When running FF-16 on notepad.exe the output looks like this.

BucketSize: 2816         0               1               2               3               4               5               6               7
Pattern     Ascii Blocks 0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF01234567
00 00        |..| 542   |***********************************************************    ***** **** ********                       ********* * ***|
-               - 309   |                                                           ****                  *************************              |
CC CC        |..| 91    | ********* ******** ******                **  * ****** *                                                                |
48 8B        |H.| 68    |     ****  *** ****  * * *   **   ****     *   *  ******                                                                |
00 +(3) 00   |..| 42    |                                                              ****       *   *****                            * **  **  |
FF +(3) FF   |..| 42    |                                                                                                          ** *** ****   |
00 +(1) 00   |..| 41    |                                                          **  *  ** **      *   *                             *      *  |
00 +(7) 00   |..| 40    |                                                       ***      ****  **   ***** *                          *  *        |
00 +(11) 00  |..| 17    |                        *                                     *     *   **    *                                         |
00 +(15) 00  |..| 17    |                  *                                      *       ***  *     ***  *                                      |
02 00        |..| 15    |       ** * **   * ** *   *                                              *                                              |
01 00        |..| 12    |                  *      **     * *         *** *                                                                       |
FF +(11) FF  |..| 12    |                                                                                                                 **     |
00 +(19) 00  |..| 12    |                              *          *                      * *                                                     |
41 8B        |A.| 11    |                             * **      **                                                                               |
48 89        |H.| 10    |                              **        **                                                                              |
48 +(5) 00   |H.| 8     |              **    * * * **                                                                                            |
<LIST CUT HERE>

The map lists frequent patterns and marks buckets with * if they contain at least one block with a frequent pattern. The patterns occurring in more blocks are listed first. The map presents an overview of the file layout, showing the dominant patterns in different sections of the file.

Posted | Modified
Author

What is this screenshot about?

This is a prediction table which tells what is the most likely (second most likely, third most likely) byte which comes after the current byte.

The prediction table has 4 columns. Each column contains the current byte, the ASCII representation of the current byte, the top three bytes most likely come after the current byte, their ASCII representation, and the number of occurrences of the top three bytes in the context of the current byte.

The current byte ranges from 0 to 255, inclusive.

How the prediction table is useful for static binary analysis?

The biggest usefulness of what the prediction table provides is the reduction of the data while aiming to keep the key redundancies of the input data visible. The prediction table is always fixed in size regardless of the size of the input data, and it fits on a single screen. Therefore, it’s easier to analyze the resulting prediction table and easier to make automation on it.

The generation of the prediction table is a very straightforward and single-pass process, and therefore it’s a pretty fast one. And because of the simplicity, the algorithm is low risk to have errors that would lead to inaccurate output. It is simple to understand if the output is correct.

Redundancies which can intuitively be picked by the human eye when skimming through the data might be seen in the prediction table. Below is an example of how to read the prediction table of notepad.exe.

What can specifically be understood from the prediction table?

The screenshot of the prediction table is annotated by numbers in red.

Number 1

Quite often 00 comes after 00. Specifically, the occurrence of 00 in the context of 00 is 17502 which is the largest occurrence number in the table. It is commonly seen redundancy in EXE files because zeroes are used to fill slack space (to align section boundaries), used in header fields, and in x86 instructions.

Number 2

The most likely byte which comes after 0D is 0A. This sequence indicates a line break of CR LF. This kind of line break can be found in textual data handled by Windows applications. It is common to see that EXE files contain textual data like notepad.exe.

Number 3

Space (20) comes after space, or in other words, runs of spaces. Usually, this pattern is seen when there is indented textual data and for the indentation, spaces are used rather than tabs. In this case, there is an embedded XML-based document with space indentation.

Number 4-6

In most cases, an ASCII printable letter comes after an ASCII printable letter. That indicates the presence of text.

In many cases, 00 comes after ASCII printable letter. That indicates the presence of Unicode text.

Of course, both can be found in notepad.exe.

Number 5

Often, 8B comes after 55.

That’s a part of the sequence 55 8B EC push ebp; mov ebp, esp which is part of the x86 function epilogue and is commonly seen in EXE files.

Number 7

8B is a part of mov instruction which is one of the most often occurring instructions in x86 code in EXE files. And that’s why the top three occurrence numbers in the context of 8B are high compared to the occurrence numbers of many other contexts.

Number 8

These indicate sequences of nop (90) instructions sometimes followed by mov (8B) instructions. Again, note the occurrence number in the context of 90.

Number 9

The sequences FF 15 and FF 35 are parts of call and push instructions commonly found in x86 code in EXE files.

Note

There are other patterns that are not listed here nevertheless can be seen in the prediction table that confirm the presence of x86 code in EXE file. The goal here is to demonstrate that solely, prediction table is useful to make quick decisions on the content of the file.

Are there redundancies that this prediction table doesn’t pick up?

I used the simplest possible implementation of the prediction table generator. It makes sense to start with that.

A prediction table made by an enhanced generator could pick up more interesting patterns and redundancies, for example when the correlated byte is not the following byte to the current byte instead the correlated byte is at an arbitrarily chosen distance from the current byte.

Posted | Modified
Author

Since HexLasso can also be run on small chunk of data to provide accurate results it can be used to analyze the characteristic of network packets.

In the experiment, 2593 TCP packets were captured using Wireshark with the filter of tcp.payload and !tls. The captured packets were exported and HexLasso was run on the TCP payload section of each packet.

The smallest packet has the size of 16 bytes, the largest one has the size of 1420 bytes.

The below table summarizes the result.

Analyzer Chart
AsciiByte PNG
ExtAsciiByte PNG
SpPredictedByte PNG
PredictedByte PNG
SpByteMulOf4 PNG
ByteMulOf4 PNG
SymmetricByteSeq PNG
SpSameByteSeq PNG
PredictedByteSeq PNG
SpIncByteSeq PNG
SpDecByteSeq PNG
SpSameByteDiffSeq PNG
IncByteSeq PNG
DecByteSeq PNG
SameByteDiffSeq PNG
SameByteSeq PNG
SameAsciiByteSeq PNG
SameDWordSeq PNG
X86Fragment PNG
ArmFragment PNG
SpAsciiString PNG
UnicodeString PNG
AsciiString PNG
AsciiStringOfDigits PNG
AsciiStringOfSpecial PNG
WordMatch PNG
DWordMatch PNG
QWordMatch PNG

The analyzer is the functionality that HexLasso used for analyzing the TCP payload.

The chart shows the result of the analysis. The Y axis tells how much data is covered in each packet by the analyzer. The Y axis represents percentage. The X axis lists all the packets.

Posted | Modified
Author

This is a pre-recorded presentation of HexLasso.

You can download the slides from here.

Transcript

Hello, I’m Attila Suszter. This is my talk about a tool called HexLasso.

HexLasso has a story which is like this. Many years ago when I worked with binaries from various sources, I used hex editor to view the content of the binaries. I peaked or skimmed through the binary data to make decisions based on what I saw. When you regularly do this task you get used to recognizing and memorizing patterns in binaries. I thought it would be a good idea to automate this manual process.

Besides having a goal to automate the pattern recognition process, it was also important to be able to show concise analysis results to the user.

Automation has its advantages. You’ll get precision because all the bytes in the data can be practically processed, unlike when you manually examine the binary. Also, automation is very fast compared to manual examination.

However, there is a challenge. The manual examination might be influenced by good intuition. When we recognize patterns we might make unconscious decisions. To create pattern recognizers, the manual process needs to be understood in more detail than you might initially think.

Now, let’s talk about what is the result of the analysis. When the process is manual and you’re looking for a particular thing in the binary, the result is whether or not you found that particular thing in the binary. In other situations, you might look for various things in the binary and you might want to know the proportion of the things found in that binary. For example, you might find out the contents of the binary is 1/3 of text, 1/3 of other redundant data and 1/3 of random data. And you make decision based on that information. When the process is automatic the result should be similar and like this. You need to know the contents of the binary. You need to know the proportion of the contents. And you just want to get the right level of details. That is you want a filtered result that does not contain noise and also does not miss important things.

The following patterns have been added to HexLasso. Patterns are systematically categorized and named. You might get an idea of what each pattern means by reading their names.

It is possible that different patterns cover the same data. For example, the string which consists of runs of ‘A’-s is covered, at least, by three patterns. It’s covered by QWordMatch, it’s covered by AsciiString and it’s covered by SameAsciiByteSeq. Now the question is: what should be reported to the user? There is a solution which involves maintaining a relevance list of patterns. So patterns are sorted in order of their relevance. A pattern is more relevant than another pattern if it’s more accurate, that is if it tells more about the data covered. So in the above example, SameAsciiByteSeq is reported because it’s more meaningful than AsciiString or QWordMatch.

The inner working of HexLasso is straightforward. There is an internal map which can be projected to the data, or the data can be projected to the map, as they have the same size. The pattern recognizer algorithms run one-by-one and in each iteration, the map is being updated by the matching patterns. When the map is fully populated, the content can be queried from the map. This metadata content can be represented in a CSV format.

In this example, a CSV file was converted to pie charts detailing the content of the input data. The left pie chart is the content of the input data, the right pie chart details the Other slice from the left pie chart. I’m gonna talk about the left pie chart for a bit. WordMatch covers 35% of the data. WordMatch means there are repeating byte sequences with size 2. This indicates some sort of redundancy. We have 5% of DWordMatch and 3% of QWordMatch which indicates redundancy, and the longer the repeating byte sequence the higher the redundancy is, so QWordMatch indicates more redundancy than DWordMatch which indicates more redundancy than WordMatch. 12% of the data is SameByteSeq, meaning runs of bytes. This could be, for example, runs of zeroes. 7% of the data is X86Fragment which indicates the presence of x86 code. We also have strings like AsciiString and UnicodeString in the data. PredictedByte means bytes that are predictable, that is the current byte successfully predicts the following byte. This indicates redundancy. AsciiByte and ExtAsciiByte are last resort recognizers. The presence of these means that there are bytes that are not in other contexts. Usually, some parts of these bytes are part of a random blob.

That’s all I have for now. Thank you.

Posted | Modified
Author

Below is the summary of analyzers supported by HexLasso.

Category Analyzer
Byte AsciiByte
ExtAsciiByte
SpPredictedByte
PredictedByte
SpByteMulOf4
ByteMulOf4
ByteSeq SymmetricByteSeq
SpSameByteSeq
PredictedByteSeq
SpIncByteSeq
SpDecByteSeq
SpSameByteDiffSeq
IncByteSeq
DecByteSeq
SameByteDiffSeq
SameByteSeq
SameAsciiByteSeq
DWordSeq SameDWordSeq
Code X86Fragment
ArmFragment
String SpAsciiString
UnicodeString
AsciiString
AsciiStringOfDigits
AsciiStringOfSpecial
Match WordMatch
DWordMatch
QWordMatch
Abbreviations
Ext Extended
Sp Sparse
Seq Sequence
Inc Increasing
Dec Decreasing
Mul Multiple
Diff Difference