Researchers who work with binary data have eyes trained to recognize patterns in hexdumps.
A pattern consists of bytes that contain redundancy. Such patterns may include arrays of correlated values, fixed-length structures, runs of bytes, textual data, machine instruction fragments, etc.
Researchers may look for patterns in a hexdump of a file or of memory content. That covers many more specific content types, such as various file formats, malware samples, firmware images, process memory dumps, etc.
There are many reasons to analyze a hexdump, and they vary by situation. Here are a few:
- Improve the compression ratio of a compression algorithm.
- Improve the mutation efficiency of a fuzzer.
- Get an idea of how a program parses and processes the data without reverse engineering the program (which may not even be available).
- Improve decisions on whether a sample is malware and, if so, which family it can be associated with.
- Recover data artifacts from a corrupted or unknown sample, or recover data concealed in a sample by steganography techniques.
- Hypothesize about the layout of the data.
- Find anomalies in data, such as network traffic.
- Decide whether data is random (when no redundancy is observed).
Being able to automate the manual pattern recognition task would have two important advantages:
- A tool would assist anyone in recognizing patterns in a hexdump.
- Analyses could be done in batches and at scale.
The core strategy for automated hexdump analysis is as follows.
The automation splits the input stream into blocks. I estimated how many bytes I can process manually at a time when looking at a hexdump, and settled on a block size of 64 bytes, which is just large enough to contain a pattern.
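The block splitting can be sketched in a few lines of Python. This is a minimal illustration, not HexLasso's actual code: the names `BLOCK_SIZE` and `split_into_blocks` are hypothetical, and the handling of a final block shorter than 64 bytes is an assumption (here it is simply kept as a short block).

```python
from typing import Iterator

BLOCK_SIZE = 64  # block size stated in the text


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive blocks of `data`; the last block may be shorter (assumption)."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


blocks = list(split_into_blocks(bytes(150)))
print([len(b) for b in blocks])  # [64, 64, 22]
```

A generator is used so that a large input does not have to be copied into a list of blocks up front.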
Pattern recognition is done by analyzers.
There is a set of analyzers for different patterns. Each analyzer runs on every block. The results of the analyzers are compared, and the analyzer with the best result, that is, the highest score, describes the block.
When the results of two analyzers are compared, the one that covers more bytes has the higher score. A higher score means more certainty, and the maximum possible score means 100% certainty.
The name of an analyzer should be as relevant as possible to every single byte included in the recognized pattern.
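The comparison can be illustrated with a small sketch. The two analyzers below (`run_analyzer` and `ascii_analyzer`) are hypothetical stand-ins, not HexLasso's real analyzers, and the score is assumed to be simply the number of bytes in the block that the analyzer accounts for, so the maximum score equals the block length.

```python
from typing import Callable, Dict, Tuple

Analyzer = Callable[[bytes], int]  # returns the number of bytes covered


def run_analyzer(block: bytes) -> int:
    """Hypothetical analyzer: bytes covered by runs of 4 or more identical bytes."""
    covered = 0
    i = 0
    while i < len(block):
        j = i
        while j < len(block) and block[j] == block[i]:
            j += 1
        if j - i >= 4:  # the threshold of 4 is an arbitrary assumption
            covered += j - i
        i = j
    return covered


def ascii_analyzer(block: bytes) -> int:
    """Hypothetical analyzer: number of printable ASCII bytes."""
    return sum(1 for b in block if 0x20 <= b <= 0x7E)


ANALYZERS: Dict[str, Analyzer] = {"run": run_analyzer, "ascii": ascii_analyzer}


def describe(block: bytes) -> Tuple[str, float]:
    """Return the best-scoring analyzer's name and its certainty (0.0 to 1.0).

    Assumes a non-empty block. Equal scores are left to max(), which keeps
    the first entry it sees.
    """
    scores = {name: analyze(block) for name, analyze in ANALYZERS.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best] / len(block)


print(describe(b"\x00" * 60 + b"ABCD"))   # ('run', 0.9375)
print(describe(b"Hello, hexdump! " * 4))  # ('ascii', 1.0)
```

In the first block the run analyzer covers 60 of 64 bytes, so it describes the block with less than full certainty. In the second block every byte is printable text, so the certainty is 100%.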
Each analyzer is given a distinct priority. If two analyzers come back with the same score for a given block, the analyzer with the higher priority will describe the block.
The reason for maintaining a priority list is that some analyzers are more specific, and therefore more meaningful, than others.
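The tie-breaking rule can be expressed by comparing `(score, priority)` pairs. The following is a minimal sketch, assuming each analyzer has already produced a score for the block. The `Result` type, the analyzer names, and the priority values are hypothetical, with a larger number meaning a higher priority.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str      # analyzer name (hypothetical)
    score: int     # bytes covered in the block
    priority: int  # distinct per analyzer; larger means higher priority (assumption)


def pick(results: List[Result]) -> Result:
    """Highest score wins; equal scores fall back to the higher priority."""
    return max(results, key=lambda r: (r.score, r.priority))


results = [
    Result("byte-run", score=64, priority=1),
    Result("zero-run", score=64, priority=2),  # more specific, so higher priority
    Result("ascii", score=12, priority=3),
]
print(pick(results).name)  # zero-run
```

Because priorities are distinct, the `(score, priority)` key never produces a tie, so exactly one analyzer describes each block. Priority only matters when scores are equal: the `ascii` result has the highest priority here but loses on score.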
The file notepad.exe is opened in the prototype version of HexLasso Desktop, which implements the core strategy of hexdump analysis.