File layout analysis of a Windows executable in 40 chunks

Posted 23 June 2026 | Modified 23 June 2026
Author Attila Suszter

I ran FF-16 on notepad.exe from Windows twice, each time with different parameters.

With the -cpf 40 parameter, the result of the run is shown on the left side of the image.

With the -cpf 40 -z 2 parameter, the result of the run is shown on the right side of the image.

CPF stands for Chunks Per File. -cpf 40 requests that the analysis result be summarized into 40 chunks (i.e., 40 lines). The reason for requesting a specific number of chunks is to control the length of the output: it trades detail for compactness.
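To make the trade-off concrete, here is a minimal sketch of how a chunk count could translate into a chunk size. It assumes the 256-byte block size that FF-16 uses, and it assumes that blocks are spread evenly across chunks with ceiling division. The actual rounding inside FF-16 may differ, and the function name chunk_layout is hypothetical.

```python
import math

BLOCK_SIZE = 256  # FF-16 analyzes a file in 256-byte blocks


def chunk_layout(file_size: int, chunks_per_file: int) -> tuple[int, int]:
    """Return (blocks_per_chunk, chunk_size_in_bytes).

    Assumption: blocks are distributed evenly using ceiling division.
    This is an illustration, not FF-16's actual rounding rule.
    """
    total_blocks = math.ceil(file_size / BLOCK_SIZE)
    blocks_per_chunk = max(1, math.ceil(total_blocks / chunks_per_file))
    return blocks_per_chunk, blocks_per_chunk * BLOCK_SIZE


if __name__ == "__main__":
    blocks, size = chunk_layout(file_size=1_000_000, chunks_per_file=40)
    print(f"{blocks} blocks per chunk, {size} bytes per chunk")
```

A larger -cpf value gives smaller chunks and therefore more lines of output with finer detail.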

The -z 2 parameter runs a filter to remove 00 patterns. 00 bytes are ubiquitous in binary files. They can reveal the length of structures, but they also contribute to noise. In the left analysis, many 00 patterns are present, and I wanted to see whether more meaningful frequent patterns could be uncovered.

A quick side-by-side comparison of the two outputs shows that meaningful patterns appear in both analyses, so I decided to create a merged output that contains the meaningful patterns from both results.

Below is the result of the merge.

From the merge, seven different layouts can be identified.

Layout 1: Patterns resembling x86 mov / int 3 instructions.

Layout 2: In chunk 00029900, patterns of 00 ?? 00 suggest a sequence of 16-bit values.

The blocks from chunk 00029900 are listed below. When not using the -cpf parameter, FF-16 by default displays block-level results.

Below is a dump of the block at the first occurrence of 00 +(1) 00 in the chunk of 0002A900.

The dump confirms the presence of UTF-16 strings.
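The same check can be done programmatically. Below is a minimal sketch that extracts runs of printable ASCII characters stored as UTF-16LE, which is what a 00 ?? 00 pattern looks like in a dump. The function name and the minimum length of 4 characters are assumptions made for illustration and are not part of FF-16.

```python
import re


def utf16le_strings(data: bytes, min_chars: int = 4) -> list[tuple[int, str]]:
    """Return (offset, text) for runs of printable ASCII stored as UTF-16LE.

    Each character is one printable byte (0x20-0x7E) followed by a 00 byte.
    """
    pattern = re.compile(rb"(?:[\x20-\x7E]\x00){%d,}" % min_chars)
    return [
        (m.start(), m.group().decode("utf-16-le"))
        for m in pattern.finditer(data)
    ]


if __name__ == "__main__":
    sample = b"\xCC\xCC" + "Notepad".encode("utf-16-le") + b"\x00\x00\xFF"
    for offset, text in utf16le_strings(sample):
        print(f"{offset:08X} {text}")
```

The sketch only covers characters in the ASCII range. Text in other scripts has a non-zero high byte and would need a different test.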

Layout 3: The - in the Pattern column indicates that no statistically significant pattern was found. It suggests random, encrypted, or compressed data.
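One way to cross-check this reading is to measure the byte entropy of the blocks in question. The sketch below computes the Shannon entropy of a block. It is a complementary check and not something FF-16 is stated to do; the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a block in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print(shannon_entropy(bytes(256)))         # a block of zero bytes
    print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

High entropy is consistent with random, encrypted, or compressed data, while low entropy points to structure or padding.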

Layout 4: The chunk summary indicates that a sequence of DWord values with byte 00 dominates.

Layout 5: The chunk summary indicates that a sequence of 16-byte structures dominates.

Layout 6: The chunk summary indicates that consecutive zero bytes dominate.

Layout 7: The chunk summary indicates that a sequence of DWord values with byte FF dominates.

References

Analysis summarizing results into chunks using -cpf 40 parameter: ff-16_-cpf_40_notepad_20-Jun-2026.log
Analysis filtering out zero bytes and summarizing results into chunks using -cpf 40 -z 2 parameter: ff-16_-cpf_40_-z_2_notepad_20-Jun-2026.log
Analysis listing results for all blocks (no parameter): ff-16_notepad_20-Jun-2026.log
FF-16: Snapshot

Categories FF-16

FF-16 (Find Frequent 16-bit) is now released

Posted 15 June 2026 | Modified 16 June 2026
Author Attila Suszter

What does FF-16 do?

FF-16 is a static analysis tool that finds frequently occurring local 16-bit patterns across the entire file. It can help to locate structures from frequent patterns and understand file layout.

Command line usage

ff-16 [filename] [-d <filename>] [<-bpc <1..256>|-cpf <1..65536>>] [-g <0..127>] [-t <1..255>]
  <filename>      Target file
  -d <filename>   Dictionary file  (Default: dict.csv)
  -bpc <1..256>   Blocks per chunk (Default: 1)
  -cpf <1..65536> Chunks per file  (Default: not specified)
  -g <0..127>     Max gaps         (Default: 31)
  -t <1..255>     Freq threshold   (Default: 5)

The long description is available in the Readme file. The project is now public on GitHub.

Categories FF-16

Byte Prediction with Gap and FF-16

Posted 15 February 2025 | Modified 14 June 2026
Author Attila Suszter

Building on the concept of the prediction table, the algorithm has been extended with a gap parameter.

Now, the algorithm predicts the most likely byte that comes after the current byte, allowing for a gap between them.

In practice, this means the algorithm can find frequent 16-bit patterns with gaps. The example below shows a gap of 3 bytes between two 00 bytes.

00 +(3) 00

The tool FF-16 (Find Frequent 16-bit) utilizes this algorithm.

FF-16 processes a file by splitting it into multiple 256-byte blocks, running the algorithm independently on each block. While the analysis occurs at the block level, the results are displayed at the bucket level, where a bucket consists of multiple blocks. This approach ensures that the user is not overwhelmed with excessive detail, making the output more practical to interpret.
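The idea of counting byte pairs with a gap inside one block can be sketched as follows. This is a simplified illustration, not FF-16's implementation: it assumes that the gap is the number of bytes between the two pattern bytes, that the threshold is a plain occurrence count, and it borrows the defaults of the -g and -t parameters (31 and 5). The function names are hypothetical.

```python
from collections import Counter

BLOCK_SIZE = 256  # block size used by FF-16


def frequent_pairs(block: bytes, max_gap: int = 31, threshold: int = 5):
    """Count (first byte, gap, second byte) triples within one block.

    Returns the triples whose count reaches the threshold,
    most frequent first.
    """
    counts = Counter()
    for gap in range(max_gap + 1):
        step = gap + 1  # distance between the two bytes of the pattern
        for i in range(len(block) - step):
            counts[(block[i], gap, block[i + step])] += 1
    hits = [(key, n) for key, n in counts.items() if n >= threshold]
    return sorted(hits, key=lambda item: -item[1])


def format_pattern(first: int, gap: int, second: int) -> str:
    """Format a triple in the notation used above, e.g. 00 +(3) 00."""
    if gap == 0:
        return f"{first:02X} {second:02X}"
    return f"{first:02X} +({gap}) {second:02X}"


if __name__ == "__main__":
    block = bytes([0x41, 0x00, 0x00, 0x00]) * (BLOCK_SIZE // 4)
    for (first, gap, second), n in frequent_pairs(block, max_gap=3)[:5]:
        print(format_pattern(first, gap, second), n)
```

Running the function on each 256-byte block separately keeps the patterns local, which is what allows different regions of a file to show different dominant patterns.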

Map View

When running FF-16 on notepad.exe, the output looks like this.

BucketSize: 2816         0               1               2               3               4               5               6               7
Pattern     Ascii Blocks 0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF01234567
00 00        |..| 542   |***********************************************************    ***** **** ********                       ********* * ***|
-               - 309   |                                                           ****                  *************************              |
CC CC        |..| 91    | ********* ******** ******                **  * ****** *                                                                |
48 8B        |H.| 68    |     ****  *** ****  * * *   **   ****     *   *  ******                                                                |
00 +(3) 00   |..| 42    |                                                              ****       *   *****                            * **  **  |
FF +(3) FF   |..| 42    |                                                                                                          ** *** ****   |
00 +(1) 00   |..| 41    |                                                          **  *  ** **      *   *                             *      *  |
00 +(7) 00   |..| 40    |                                                       ***      ****  **   ***** *                          *  *        |
00 +(11) 00  |..| 17    |                        *                                     *     *   **    *                                         |
00 +(15) 00  |..| 17    |                  *                                      *       ***  *     ***  *                                      |
02 00        |..| 15    |       ** * **   * ** *   *                                              *                                              |
01 00        |..| 12    |                  *      **     * *         *** *                                                                       |
FF +(11) FF  |..| 12    |                                                                                                                 **     |
00 +(19) 00  |..| 12    |                              *          *                      * *                                                     |
41 8B        |A.| 11    |                             * **      **                                                                               |
48 89        |H.| 10    |                              **        **                                                                              |
48 +(5) 00   |H.| 8     |              **    * * * **                                                                                            |
<LIST CUT HERE>

The map lists frequent patterns and marks buckets with * if they contain at least one block with a frequent pattern. The patterns occurring in more blocks are listed first. The map presents an overview of the file layout, showing the dominant patterns in different sections of the file.
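The marking rule can be sketched in a few lines. The example assumes that the per-block results are already available as a list of booleans for one pattern (True if the block contains the pattern). The function name is hypothetical, and the bucket of 11 blocks is derived from the BucketSize of 2816 bytes shown above (2816 / 256 = 11).

```python
def map_row(block_hits: list[bool], blocks_per_bucket: int) -> str:
    """Render one map row: '*' if any block in the bucket has the pattern."""
    cells = []
    for start in range(0, len(block_hits), blocks_per_bucket):
        bucket = block_hits[start:start + blocks_per_bucket]
        cells.append("*" if any(bucket) else " ")
    return "|" + "".join(cells) + "|"


if __name__ == "__main__":
    # Three buckets of 11 blocks; only one block in the second bucket matches.
    hits = [False] * 11 + [True] + [False] * 10 + [False] * 11
    print(map_row(hits, blocks_per_bucket=11))
```

Because a single matching block is enough to mark a bucket, the map shows where a pattern occurs but not how dense it is inside the bucket.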

Categories FF-16

The Simplest Possible Prediction Table

Posted 19 October 2021 | Modified 19 October 2021
Author Attila Suszter

What is this screenshot about?

This is a prediction table that shows the most likely (and the second and third most likely) byte to come after the current byte.

The prediction table has 4 columns. Each column contains the current byte, its ASCII representation, the top three bytes most likely to come after the current byte, their ASCII representations, and the number of occurrences of each of the top three bytes in the context of the current byte.

The current byte ranges from 0 to 255, inclusive.
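A table of this kind can be generated in a single pass over the data. The following is a minimal sketch of that idea, not the code that produced the screenshot; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def prediction_table(data: bytes, top: int = 3) -> dict[int, list[tuple[int, int]]]:
    """Map each current byte to its most frequent following bytes.

    Each entry is a list of (following byte, occurrence count) pairs,
    most frequent first, with at most `top` entries.
    """
    followers = defaultdict(Counter)
    for current, following in zip(data, data[1:]):
        followers[current][following] += 1
    return {cur: cnt.most_common(top) for cur, cnt in followers.items()}


if __name__ == "__main__":
    table = prediction_table(b"\r\n\r\n\r\nAB\x00\x00\x00")
    for current in sorted(table):
        row = ", ".join(f"{b:02X} ({n})" for b, n in table[current])
        print(f"{current:02X}: {row}")
```

The sketch only returns rows for byte values that actually occur in the input, whereas the table in the screenshot has a row for every value from 0 to 255.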

How is the prediction table useful for static binary analysis?

The main benefit of the prediction table is that it reduces the data while keeping the key redundancies of the input visible. The prediction table is fixed in size regardless of the size of the input data, and it fits on a single screen. Therefore, it is easier to analyze and easier to build automation on.

Generating the prediction table is a straightforward, single-pass process, so it is fast. Because the algorithm is simple, it carries a low risk of errors that would lead to inaccurate output, and it is easy to verify that the output is correct.

Redundancies that the human eye can intuitively pick out when skimming through the data may also be visible in the prediction table. Below is an example of how to read the prediction table of notepad.exe.

What can specifically be understood from the prediction table?

The screenshot of the prediction table is annotated by numbers in red.

Number 1

Quite often 00 comes after 00. Specifically, 00 occurs 17502 times in the context of 00, which is the largest occurrence number in the table. This is a commonly seen redundancy in EXE files because zeroes are used to fill slack space (to align section boundaries), in header fields, and in x86 instructions.

Number 2

The most likely byte to come after 0D is 0A. This sequence is a CR LF line break, which is found in textual data handled by Windows applications. It is common for EXE files such as notepad.exe to contain textual data.

Number 3

Space (20) comes after space; in other words, there are runs of spaces. Usually, this pattern is seen when textual data is indented with spaces rather than tabs. In this case, there is an embedded XML-based document with space indentation.

Number 4-6

In most cases, a printable ASCII letter comes after a printable ASCII letter. That indicates the presence of text.

In many cases, 00 comes after a printable ASCII letter. That indicates the presence of Unicode (UTF-16) text.

Of course, both can be found in notepad.exe.

Number 5

Often, 8B comes after 55.

That’s a part of the sequence 55 8B EC (push ebp; mov ebp, esp), which is the x86 function prologue and is commonly seen in EXE files.

Number 7

8B is part of the mov instruction, which is one of the most frequently occurring instructions in x86 code in EXE files. That’s why the top three occurrence numbers in the context of 8B are high compared to the occurrence numbers of many other contexts.

Number 8

These indicate sequences of nop (90) instructions sometimes followed by mov (8B) instructions. Again, note the occurrence number in the context of 90.

Number 9

The sequences FF 15 and FF 35 are parts of call and push instructions commonly found in x86 code in EXE files.

Note

There are other patterns, not listed here, that can nevertheless be seen in the prediction table and that confirm the presence of x86 code in the EXE file. The goal here is to demonstrate that the prediction table alone is useful for making quick decisions about the content of a file.

Are there redundancies that this prediction table doesn’t pick up?

I used the simplest possible implementation of the prediction table generator. It makes sense to start with that.

A prediction table made by an enhanced generator could pick up more interesting patterns and redundancies, for example when the correlated byte is not the byte immediately following the current byte but lies at an arbitrarily chosen distance from it.

Running HexLasso on Packet Payloads

Posted 18 August 2021 | Modified 18 August 2021
Author Attila Suszter

Because HexLasso can also produce accurate results on small chunks of data, it can be used to analyze the characteristics of network packets.

In the experiment, 2593 TCP packets were captured using Wireshark with the filter tcp.payload and !tls. The captured packets were exported, and HexLasso was run on the TCP payload section of each packet.

The smallest packet is 16 bytes and the largest is 1420 bytes.

The table below summarizes the results.

Analyzer	Chart
AsciiByte	PNG
ExtAsciiByte	PNG
SpPredictedByte	PNG
PredictedByte	PNG
SpByteMulOf4	PNG
ByteMulOf4	PNG
SymmetricByteSeq	PNG
SpSameByteSeq	PNG
PredictedByteSeq	PNG
SpIncByteSeq	PNG
SpDecByteSeq	PNG
SpSameByteDiffSeq	PNG
IncByteSeq	PNG
DecByteSeq	PNG
SameByteDiffSeq	PNG
SameByteSeq	PNG
SameAsciiByteSeq	PNG
SameDWordSeq	PNG
X86Fragment	PNG
ArmFragment	PNG
SpAsciiString	PNG
UnicodeString	PNG
AsciiString	PNG
AsciiStringOfDigits	PNG
AsciiStringOfSpecial	PNG
WordMatch	PNG
DWordMatch	PNG
QWordMatch	PNG

The Analyzer column names the HexLasso functionality used to analyze the TCP payload.

The chart shows the result of the analysis. The Y axis shows, as a percentage, how much of the data in each packet is covered by the analyzer. The X axis lists all the packets.
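To illustrate what a per-packet coverage percentage means, here is a minimal sketch for a deliberately simple stand-in analyzer that covers printable ASCII bytes. It does not reproduce HexLasso's AsciiByte analyzer or any other HexLasso analyzer; the function name and the byte range 0x20-0x7E are assumptions.

```python
def ascii_coverage(payload: bytes) -> float:
    """Percentage of payload bytes in the printable ASCII range 0x20-0x7E."""
    if not payload:
        return 0.0
    covered = sum(1 for b in payload if 0x20 <= b <= 0x7E)
    return 100.0 * covered / len(payload)


if __name__ == "__main__":
    payloads = [b"GET / HTTP/1.1", b"AB\x00\x00", b"\x16\x03\x01\x02"]
    for number, payload in enumerate(payloads, start=1):
        print(f"packet {number}: {ascii_coverage(payload):.1f}%")
```

Plotting one such value per packet gives a chart of the kind described above, with the packets on the X axis and the coverage percentage on the Y axis.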