Spotting Redundancies in High Entropy Data

Posted | Modified
Author

It’s more straightforward to understand redundant data than non-redundant data. So when analyzing binaries we look for redundancy.

Low entropy score indicates the presence of redundant data. High entropy score, however, does not necessarily indicate the presence of non-redundant data. High entropy data may or may not be redundant.

To decide on if a high entropy data is redundant, additional analysis is required. There is no all-in-one solution to tell where the redundancies are.

The following examples describe organic samples that contain high entropy blocks and explain the redundancies in those blocks.

High Match Coverage in High Entropy Data

I created a plot with HexLasso by selecting ENTROPY and MATCH_COVERAGE_DWORD analyzers.

According the ENTROPY analyzer (red line), the block at offset 7168 has an entropy of over 90% which is more than 7.2 (out of 8).

The MATCH_COVERAGE_DWORD analyzer (green line) reports match coverage of over 90% for the same block.


The HexLasso plot of a sample showing high entropy (red) and high match coverage (green) between the data offsets 7168 and 8192.

After viewing the hexdump of the block, the pattern in the data becomes obvious.

The first part of the data is a sequence of bytes, in incremental order, from 00 to FF. The sequence is repeated till more than half of the block.

The second part of the data contains text with some words repeating few times.


The hexdump showing a block of high entropy data taken from a sample at offset 7168 (1C00h). Matches can be seen all over.

High Coverage for Runs of Bytes in High Entropy Data

I created a plot with HexLasso by selecting ENTROPY and RUNS_OF_BYTES_MINLEN_4 analyzers.

According the ENTROPY analyzer (red line), the block at offset 44032 has an entropy of about 99% which is about 7.9 (out of 8).

The RUNS_OF_BYTES_MINLEN_4 analyzer (green line) reports runs-of-bytes coverage of about 99% for the same block.


The HexLasso plot of a sample showing high entropy (red) and high coverage for runs of bytes (green) between the data offsets 44032 and 45056.

After viewing the hexdump of the block, the pattern in the data becomes obvious.

Most of the data (apart from the first 8 bytes) can be described as a sequence of varying DWORDs, and each byte in a DWORD is the same.


The hexdump showing a block of high entropy data taken from a sample at offset 44032 (AC00h). Runs of bytes can be seen all over.