Recognizing CRC32 Value-block Pairs in Binary Data

Posted | Modified

Numerous data formats specify CRC32 (Cyclic Redundancy Check 32-bit) checksum for some part of the data for detecting accidental data corruption. For example, the PNG (Portable Network Graphics) file type is such a format. In addition to documented data formats, undocumented formats can also contain CRC32 checksum of some kind.

These data formats, among many constructs, contain a block of data and a field for the CRC32 value for that block of data.

One of the recognizers implemented in BinCovery is to find CRC32 value-block pairs.

This is the description of the algorithm for finding CRC32 value-block pairs.

The recognizer starts reading UInt32 values from the binary data from offset 0. After each read, it moves the data pointer towards the end of window. It creates a list and adds each UInt32 value to that list.

The recognizer calculates the CRC32 checksum for all continuous blocks within the current window, starting from data offset 0. Then it starts matching each of these CRC32 checksums to each UInt32 value from the list. If there is a match, we have found a CRC32 value-block pair. The result will be saved, and the matching resumes to find further pairs.

The recognizer can be configured to find CRC32 values on data offset that is a multiple of four (alignment). Also, it can be configured to calculate the checksum for continuous blocks that match minimum and maximum size criteria. Using these criteria improves the processing speed and reduces the false positive result.

When the scan is finished in the current window, the window is moved forward to continue processing the rest of the data.

The recognizer can be configured to find little and big endian CRC32 values. Also, it can perform CRC32 calculation with various polynomials.

Practical Importance

Reverse engineering and digital forensics. Numerous data formats guard an internal header by a checksum. A header usually carries key information regarding the way to access the data. Examples of such information include offset and size. When you want to locate key information, you may look for a header in the data. You can try looking for such headers by finding CRC32 value-block pairs.

Data mining. You can run the recognizer on a set of files to separate files with CRC32 blocks from files without such blocks. When necessary, you can set the recognizer to use a particular CRC polynomial. You can specify the location, minimum and maximum length of CRC block to find the files with the best matches.

Data compression. The data guarded by a checksum is likely to be different than other parts of the data. Separating different kinds of data allows the best compression method to be applied for each part. This approach can lead to a better overall compression ratio. Additionally, running a pre-processor on certain parts of the data — rather than on the whole data — can also improve the compression ratio.

Fuzz testing. Mutating the checksum-guarded-data can be a good idea when you update the checksum field in the data. Excluding the data guarded by checksum from fuzzing and focusing on a different part of the data is also an approach one may consider.