Motivation
- The more you know about your data, the better you can compress it
- The more you know about your data, the better you can examine it
- The more you know about your data, the better you can mutate it for fuzzing purposes
- The more you know about your data, the more you know about the software that parses it
Background
Back in 2011 I worked on a tool for the analysis of data formats. At the time, I mentioned it in a blog post: The forensic example.
It wasn’t until this year that I started drafting a more organized concept for that tool.
The tool has three layers of abstraction.
Layer 1
Layer 1 takes binary data as input and runs recognizers on it. The tool currently implements more than 50 recognizers. The main idea behind the recognizers is to find blocks with specific characteristics in the data.
The result of the analysis is the list of recognized blocks. A block is described by its description, offset, and size fields. Here is an example of a recognized block.

Description: Entropy drops after performing single-channel delta encoding
Offset: 1000
Size: 4000
None of the recognizers rely on using magic bytes.
Each recognizer is meant to retain backward and forward compatibility. Each recognizer runs independently of the others.
Layer 2
Layer 2 takes the analysis result of Layer 1. It provides a facility to filter – and if required, suppress – specific items from the analysis result.
For example, it handles overlapping blocks and eliminates potential false positives.
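One plausible overlap-handling policy is to suppress any block that lies entirely inside another, larger block. This is only a sketch of that idea; the post does not specify Layer 2's actual filtering rules, and the `Block` type here is assumed.

```python
from typing import List, NamedTuple

class Block(NamedTuple):
    description: str
    offset: int
    size: int

def suppress_contained(blocks: List[Block]) -> List[Block]:
    """Drop blocks fully contained within another, larger block
    (one possible overlap policy, not necessarily the tool's)."""
    kept: List[Block] = []
    # Sort so containers come before their contents:
    # smaller offset first, and at equal offsets, wider blocks first.
    for b in sorted(blocks, key=lambda x: (x.offset, -x.size)):
        contained = any(k.offset <= b.offset and
                        b.offset + b.size <= k.offset + k.size
                        for k in kept)
        if not contained:
            kept.append(b)
    return kept
```

Partially overlapping blocks survive under this policy; only full containment triggers suppression, which keeps the coarsest description of each region.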
Layer 3
Layer 3 implements practical ways to utilize the analysis result of Layer 2, such as the following.
- File finder
- File classifier
- File comparer
- Visualizer for the layout of a file
- General-purpose file mutator for fuzzing purposes
- General-purpose lossy but reversible decompressor