Overview of HexLasso

This is a pre-recorded presentation of HexLasso.

You can download the slides from here.

Transcript

Hello, I’m Attila Suszter. This is my talk about a tool called HexLasso.

HexLasso has a story behind it. Many years ago, when I worked with binaries from various sources, I used a hex editor to view their content. I peeked at or skimmed through the binary data to make decisions based on what I saw. When you do this task regularly, you get used to recognizing and memorizing patterns in binaries. I thought it would be a good idea to automate this manual process.

Besides the goal of automating the pattern recognition process, it was also important to be able to show concise analysis results to the user.

Automation has its advantages. You get precision, because practically every byte in the data can be processed, unlike when you manually examine the binary. Automation is also very fast compared to manual examination.

However, there is a challenge. Manual examination is often guided by good intuition, and when we recognize patterns we may make unconscious decisions. To create pattern recognizers, the manual process needs to be understood in more detail than you might initially think.

Now, let’s talk about what the result of the analysis should be. When the process is manual and you’re looking for one particular thing in the binary, the result is simply whether or not you found that thing. In other situations, you might look for various things in the binary and want to know the proportion of the things found. For example, you might find that the contents of the binary are one third text, one third other redundant data and one third random data, and you make a decision based on that information. When the process is automatic, the result should be similar: you need to know the contents of the binary, you need to know their proportions, and you want the right level of detail. That is, you want a filtered result that does not contain noise and also does not miss important things.

The following patterns have been added to HexLasso. Patterns are systematically categorized and named, so you can get an idea of what each pattern means from its name.

It is possible that different patterns cover the same data. For example, a string that consists of runs of ‘A’s is covered by at least three patterns: QWordMatch, AsciiString and SameAsciiByteSeq. Now the question is: what should be reported to the user? One solution is to maintain a relevance list of patterns, so patterns are sorted in order of their relevance. A pattern is more relevant than another pattern if it is more accurate, that is, if it tells more about the data it covers. So in the above example, SameAsciiByteSeq is reported because it is more meaningful than AsciiString or QWordMatch.
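To make the idea concrete, here is a minimal sketch of relevance-based reporting. The pattern names come from the talk, but the ordering list and the helper function are hypothetical illustrations, not HexLasso’s actual code.

```python
# Hypothetical sketch of relevance-based reporting; not HexLasso's real code.
# Patterns earlier in the list are considered more relevant (more specific).
RELEVANCE_ORDER = [
    "SameAsciiByteSeq",  # most specific: a run of one ASCII byte
    "AsciiString",       # less specific: any printable ASCII string
    "QWordMatch",        # least specific here: a repeating 8-byte sequence
]

def most_relevant(matching_patterns):
    """Pick the single pattern to report for data covered by several patterns."""
    for name in RELEVANCE_ORDER:
        if name in matching_patterns:
            return name
    return None

# A run of 'A' bytes is matched by all three patterns,
# but only the most meaningful one is reported.
print(most_relevant({"QWordMatch", "AsciiString", "SameAsciiByteSeq"}))
# -> SameAsciiByteSeq
```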

The inner workings of HexLasso are straightforward. There is an internal map which can be projected onto the data, or the data can be projected onto the map, as they have the same size. The pattern recognizer algorithms run one by one, and in each iteration the map is updated with the matching patterns. Once the map is fully populated, its content can be queried. This metadata can be represented in CSV format.
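As an illustration, a minimal sketch of that map-based flow might look like the following. The example recognizer, the first-writer-wins update rule and the CSV columns are assumptions made for this sketch; they are not HexLasso’s real internals.

```python
import csv
import io

# Hypothetical sketch of a byte-to-pattern map; not HexLasso's real implementation.
# A recognizer yields (offset, length, pattern_name) for each region it matches.
def recognize_same_byte_seq(data, min_run=4):
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        if j - i >= min_run:
            yield (i, j - i, "SameByteSeq")
        i = j

def build_map(data, recognizers):
    # One map slot per input byte; in this sketch, the first matching
    # recognizer to claim a byte keeps it.
    pattern_map = [None] * len(data)
    for recognize in recognizers:
        for offset, length, name in recognize(data):
            for k in range(offset, offset + length):
                if pattern_map[k] is None:
                    pattern_map[k] = name
    return pattern_map

def to_csv(pattern_map):
    # Summarize the populated map as pattern,bytes rows.
    counts = {}
    for name in pattern_map:
        key = name or "Unclassified"
        counts[key] = counts.get(key, 0) + 1
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["pattern", "bytes"])
    for name, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        writer.writerow([name, count])
    return out.getvalue()

data = b"\x00" * 16 + b"hello world!"
print(to_csv(build_map(data, [recognize_same_byte_seq])))
```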

In this example, a CSV file was converted to pie charts detailing the content of the input data. The left pie chart shows the content of the input data; the right pie chart details the Other slice from the left pie chart. I’m going to talk about the left pie chart for a bit. WordMatch covers 35% of the data. WordMatch means repeating byte sequences of size 2, which indicates some sort of redundancy. We have 5% of DWordMatch and 3% of QWordMatch, which also indicate redundancy, and the longer the repeating byte sequence, the higher the redundancy, so QWordMatch indicates more redundancy than DWordMatch, which indicates more redundancy than WordMatch. 12% of the data is SameByteSeq, meaning runs of bytes; this could be, for example, runs of zeroes. 7% of the data is X86Fragment, which indicates the presence of x86 code. We also have strings in the data, like AsciiString and UnicodeString. PredictedByte means bytes that are predictable, that is, the current byte successfully predicts the following byte; this indicates redundancy. AsciiByte and ExtAsciiByte are last-resort recognizers. Their presence means there are bytes that do not fall into any other context. Usually, some of these bytes are part of a random blob.
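If you wanted to reproduce that kind of breakdown yourself, a sketch like the one below could compute the percentages behind the pie slices from a CSV summary. The pattern,bytes column layout and the hexlasso_output.csv filename are assumptions for this example, not the tool’s documented output format.

```python
import csv

# Hypothetical sketch: turn a pattern,bytes CSV (as in the previous sketch)
# into the percentage breakdown shown on the pie charts.
def proportions(csv_path):
    with open(csv_path, newline="") as f:
        rows = [(r["pattern"], int(r["bytes"])) for r in csv.DictReader(f)]
    total = sum(count for _, count in rows)
    return {name: round(100.0 * count / total, 1) for name, count in rows}

# Print "WordMatch: 35.0%" style lines, largest slice first.
for name, pct in sorted(proportions("hexlasso_output.csv").items(),
                        key=lambda kv: -kv[1]):
    print(f"{name}: {pct}%")
```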

That’s all I have for now. Thank you.