
Introduction

Compression algorithms benefit from delta encoding, which transforms data into a more compressible form. Delta encoding correlated ASCII decimal strings increases the redundancy in the data and thereby improves the overall compression ratio.

ASCII Decimal String

An ASCII decimal string consists of a sequence of printable characters in the range 0x30 to 0x39, representing the digits 0 through 9.

Delta Encoding ASCII Decimal Strings

Textual and semi-textual data can contain ASCII decimal strings that have the same digit length and are correlated. For example, the 8-digit decimal strings in Figure 1 increase incrementally.

12345678 12345679 12345680 12345681

Figure 1: Correlated Decimal Strings

This understanding is useful for delta encoding decimal strings. The first decimal string is kept as the base value, and each subsequent decimal string is replaced by its difference from the previous string. The difference is computed digit by digit, modulo 10, rather than over the whole integer; this is why 12345680 minus 12345679 yields 00000011 rather than 00000001. The encoding can produce results similar to the one shown in Figure 2.

12345678 00000001 00000011 00000001

Figure 2: Delta-encoded Decimal Strings
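
For illustration, here is a minimal sketch of the digit-wise delta step in Python (the function name is illustrative, not the coder's actual API):

  def delta_encode_digits(values):
      """Keep the first string; replace each subsequent string with its
      per-digit difference (mod 10) from the previous string."""
      out = [values[0]]
      for prev, cur in zip(values, values[1:]):
          out.append(''.join(str((int(c) - int(p)) % 10)
                             for p, c in zip(prev, cur)))
      return out

  print(delta_encode_digits(['12345678', '12345679', '12345680', '12345681']))
  # ['12345678', '00000001', '00000011', '00000001']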

In real-world data, there may be other content between the decimal strings, such as the short strings shown in Figure 3.

ID 12345678 ID 12345679 ID 12345680 ID 12345681

Figure 3: Correlated Decimal Strings with Strings In Between

The delta coder recognizes the decimal strings regardless of their location and ignores the other strings during encoding. The result can look similar to Figure 4.

ID 12345678 ID 00000001 ID 00000011 ID 00000001

Figure 4: Delta-encoded Decimal Strings with Strings In Between

Real-world data may also contain decimal strings of different lengths between the correlated ones, as shown in Figure 5.

ID99 12345678 ID100 12345679 ID101 12345680 ID102 12345681

Figure 5: Correlated Decimal Strings with Decimal Strings In Between

The delta coder performs multiple passes over the data, encoding decimal strings of each length separately; a sketch of one such pass follows Figure 6. In Figure 6, the coder has encoded the 8-digit decimal strings. Strings of other lengths would be encoded independently in their own passes.

ID99 12345678 ID100 00000001 ID101 00000011 ID102 00000001

Figure 6: Delta-encoded Decimal Strings with Strings and Decimal Strings In Between
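
A minimal sketch of how one such pass might locate fixed-length decimal strings (the regular expression and function name are illustrative):

  import re

  def find_decimal_strings(data: bytes, length: int):
      """Find runs of ASCII digits of exactly `length` characters that
      are not adjacent to further digits."""
      pattern = re.compile(rb'(?<![0-9])[0-9]{%d}(?![0-9])' % length)
      return [(m.start(), m.group()) for m in pattern.finditer(data)]

  print(find_decimal_strings(b'ID99 12345678 ID100 12345679', 8))
  # [(5, b'12345678'), (20, b'12345679')]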

Delta encoding is beneficial for compression algorithms because it increases the occurrence of identical characters, in this example the zero character, which many algorithms can compress more efficiently. However, if the decimal strings are not correlated, delta encoding may not improve the compression ratio and can even worsen it.

Experimental Analysis

In this experiment, the efficiency of delta encoding is tested on a real-world PDF (Portable Document Format) file, a semi-textual format containing ASCII decimal strings with varying degrees of correlation.

Ten delta encoding methods are applied to the sample, ranging from encoding 1-digit decimal strings to encoding 10-digit decimal strings. An eleventh, combined method applies the five individually strongest delta encodings together. Seven compression methods, each set to its maximum compression ratio, are applied to the original and delta-encoded files.

Results

Combined delta encoding contributes compression improvements ranging from 3.5 percent to 18.8 percent, as shown in Table 1. PPMd, BZip2 and RAR achieve 18.8 percent, 15.8 percent and 14.7 percent improvements respectively. ZPAQ shows the smallest improvement, though its 3.5 percent gain is still noteworthy. Without delta encoding, ZPAQ produces the smallest file, followed by LZMA and Brotli. With combined delta encoding, ZPAQ still produces the smallest file, but Brotli now beats LZMA. Among the individual methods, delta encoding with 2, 4, 5, 6 and 10 digits consistently improves the compression ratio, encoding with 1, 3 and 9 digits deteriorates it, and encoding with 7 and 8 digits gives mixed results.

Table 1: PDF Sample Results
Delta encoding method ZIP:Deflate BZip2 7z:PPMd 7z:LZMA 7z:Brotli RAR ZPAQ
None (original sample) 6,798,779 6,769,657 7,013,053 6,040,012 6,151,065 6,750,897 5,692,034
1-digit 6,804,274 6,788,146 7,042,361 6,057,956 6,167,812 6,773,170 5,704,985
2-digit 6,753,716 6,721,548 6,977,163 6,034,762 6,134,957 6,722,578 5,680,818
3-digit 6,807,662 6,765,354 7,029,890 6,076,672 6,170,911 6,808,052 5,718,458
4-digit 6,747,452 6,734,268 6,942,803 6,032,576 6,134,263 6,708,992 5,685,910
5-digit 6,369,822 6,252,086 6,397,184 5,863,166 5,884,288 6,283,388 5,631,217
6-digit 6,649,723 6,609,305 6,804,199 5,975,727 6,055,567 6,596,946 5,668,890
7-digit 6,798,901 6,769,197 7,012,649 6,039,961 6,150,558 6,751,004 5,692,130
8-digit 6,798,790 6,769,490 7,012,362 6,041,421 6,150,680 6,750,935 5,692,051
9-digit 6,798,795 6,769,657 7,013,069 6,040,028 6,151,081 6,750,913 5,692,042
10-digit 6,591,545 6,456,421 6,579,591 5,943,496 5,987,127 6,452,858 5,598,654
Combined (2, 4, 5, 6, 10 digits) 5,920,912 (12.9%) 5,699,267 (15.8%) 5,696,300 (18.8%) 5,668,295 (6.2%) 5,580,474 (9.3%) 5,760,317 (14.7%) 5,494,643 (3.5%)

The size of the original PDF file is 20,642,355 bytes. In Table 1, a file size smaller than the corresponding value in the None row indicates that the compression ratio improved after that individual delta encoding.

Reliability

The delta encoding presented in this analysis is lossless: the delta encoder used in this experiment can reverse the encoding and restore the original data exactly.
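
As an illustration, a minimal sketch of the inverse transform, matching the encoder sketch above:

  def delta_decode_digits(values):
      """Restore each string by adding its digits (mod 10) to the
      previously restored string."""
      out = [values[0]]
      for delta in values[1:]:
          out.append(''.join(str((int(p) + int(d)) % 10)
                             for p, d in zip(out[-1], delta)))
      return out

  print(delta_decode_digits(['12345678', '00000001', '00000011', '00000001']))
  # ['12345678', '12345679', '12345680', '12345681']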

Conclusion

The experiment demonstrates the benefit of delta encoding correlated ASCII decimal strings: it increases the occurrence of identical decimal characters and improves the overall compression ratio.

Categories Data Compression


Introduction

Compression algorithms benefit from pre-processing, also known as filtering, which makes data more compressible. Pre-processing printable Unicode text measurably improves the overall compression ratio.

Pre-processing Unicode Text

Unicode text, here UTF-16LE as commonly used on Windows, is characterized by a pattern that can be visualized in a hex editor, as shown in Figure 1. Each printable character is followed by a zero (non-printable) byte. Thus, even-offset bytes are printable characters, while odd-offset bytes are zeroes.

48 00 65 00 6C 00 6C 00 6F 00 20 00 77 00 6F 00  H.e.l.l.o. .w.o. 
72 00 6C 00 64 00 21 00                          r.l.d.!.

Figure 1: Unicode Text

This understanding is useful for reorganizing textual data. The bytes at even offsets and the bytes at odd offsets can be grouped into two sequential runs. Placing the printable characters before the non-printable zeroes results in the layout shown in Figure 2.

48 65 6C 6C 6F 20 77 6F 72 6C 64 21 00 00 00 00  Hello world!.... 
00 00 00 00 00 00 00 00                          ........

Figure 2: Pre-processed Unicode Text
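
A minimal sketch of the de-interleaving step, assuming UTF-16LE input of even length (the function name is illustrative):

  def split_even_odd(data: bytes) -> bytes:
      """Group the even-offset bytes (printable characters) before the
      odd-offset bytes (zeroes)."""
      return data[0::2] + data[1::2]

  text = 'Hello world!'.encode('utf-16-le')
  print(split_even_odd(text))
  # b'Hello world!' followed by twelve zero bytes

The step is trivially reversible by re-interleaving the two halves.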

Compression algorithms benefit from this better organization of the data. LZ (Lempel-Ziv) based algorithms, for instance, can find matches more easily in the contiguous text, while the run of zeroes is highly compressible because RLE (Run-Length Encoding) applies to it effectively.
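
The effect can be approximated with Deflate via Python's zlib; a quick demonstration on random printable UTF-16LE text (the exact figures vary with the input, but the pre-processed form compresses noticeably better, in line with the random-data results below):

  import random
  import string
  import zlib

  random.seed(0)
  # Random printable ASCII characters encoded as UTF-16LE (char, zero, ...).
  text = ''.join(random.choices(string.printable, k=100000)).encode('utf-16-le')
  pre = text[0::2] + text[1::2]  # de-interleave as shown in Figure 2

  print(len(zlib.compress(text, 9)))  # interleaved original
  print(len(zlib.compress(pre, 9)))   # pre-processed, smaller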

Experimental Analysis

In this experiment, the efficiency of the pre-processor is tested using real-world and random data. The real-world data comes from a Windows registry (.REG) file, while the random data consists of random printable characters. Both files have the same size. Six compression methods, each manually set to its maximum compression ratio, are applied to the original and pre-processed samples.

Real-world Data Results

Pre-processing contributes compression improvements ranging from 3.72 percent to 43.87 percent, as shown in Table 1. The RAR algorithm shows the largest gain on the pre-processed data, with ZPAQ and ZIP:Deflate also achieving very good gains. The 7z:LZMA algorithm is the most effective on the original data and still achieves a good result on the pre-processed data.

Table 1: Real-world Data Results
Uncompressed ZIP:Deflate BZip2 7z:PPMd 7z:LZMA RAR ZPAQ
Original 82,097,412 2,923,818 1,926,411 1,607,819 1,116,793 2,080,856 1,205,937
Pre-processed 82,097,412 2,328,555 1,854,655 1,469,189 1,007,911 1,167,924 694,266
Gain n/a 20.36% 3.72% 8.62% 9.75% 43.87% 42.43%

Random Data Results

Findings for the random data show that ZIP and RAR achieve 16.4 percent and 11.53 percent improvements respectively, while pre-processing provides only negligible improvements for BZip2 and ZPAQ.

Table 2: Random Data Results
Uncompressed ZIP:Deflate BZip2 7z:PPMd 7z:LZMA RAR ZPAQ
Original 82,097,412 40,812,580 33,951,889 34,784,486 35,666,811 38,877,871 33,679,718
Pre-processed 82,097,412 34,102,961 33,947,916 34,593,752 34,268,387 34,394,031 33,661,096
Gain n/a 16.4% 0.01% 0.55% 3.92% 11.53% 0.06%

Conclusion

The experiment demonstrates the benefit of pre-processing printable Unicode text: the better-organized data compresses more efficiently, improving the overall compression ratio.

Categories Data Compression


Numerous data formats specify a CRC32 (Cyclic Redundancy Check, 32-bit) checksum over some part of the data to detect accidental corruption; the PNG (Portable Network Graphics) file type is one such format. In addition to documented data formats, undocumented formats can also contain CRC32 checksums of some kind.

Among their many constructs, these data formats contain a block of data and a field holding the CRC32 value of that block.

One of the recognizers implemented in BinCovery finds such CRC32 value-block pairs.

The algorithm for finding CRC32 value-block pairs works as follows.

The recognizer reads UInt32 values from the binary data starting at offset 0, advancing the data pointer towards the end of the window after each read, and collects every UInt32 value read into a list.

The recognizer calculates the CRC32 checksum of every contiguous block within the current window, starting from data offset 0. It then matches each of these CRC32 checksums against each UInt32 value in the list. A match means a CRC32 value-block pair has been found; the result is saved and matching resumes to find further pairs.

The recognizer can be configured to look for CRC32 values only at data offsets that are multiples of four (alignment). It can also be configured to calculate checksums only for contiguous blocks that satisfy minimum and maximum size criteria. These criteria improve processing speed and reduce false positives.

When the scan is finished in the current window, the window is moved forward to continue processing the rest of the data.

The recognizer can be configured to find little-endian and big-endian CRC32 values. It can also perform the CRC32 calculation with various polynomials.
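
A simplified sketch of the idea within a single window, assuming the standard CRC32 polynomial, little-endian values and byte granularity (the names and defaults are illustrative; the real recognizer is more elaborate):

  import struct
  import zlib

  def find_crc32_pairs(window: bytes, min_len: int = 4, max_len: int = 64):
      """Match the CRC32 of every contiguous block against every UInt32
      read from the window; report (value_offset, block_offset, block_len)."""
      # Collect candidate UInt32 values and where they occur.
      values = {}
      for off in range(len(window) - 3):
          values.setdefault(struct.unpack_from('<I', window, off)[0], []).append(off)
      pairs = []
      for start in range(len(window)):
          for length in range(min_len, min(max_len, len(window) - start) + 1):
              crc = zlib.crc32(window[start:start + length]) & 0xFFFFFFFF
              for voff in values.get(crc, []):
                  pairs.append((voff, start, length))
      return pairs

  data = b'HEAD' + struct.pack('<I', zlib.crc32(b'HEAD') & 0xFFFFFFFF)
  print(find_crc32_pairs(data))  # includes (4, 0, 4): the value at offset 4 covers block [0, 4)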

Practical Importance

Reverse engineering and digital forensics. Numerous data formats guard an internal header with a checksum. A header usually carries key information about how to access the data, such as offsets and sizes. To locate this key information, you can search for candidate headers by finding CRC32 value-block pairs.

Data mining. You can run the recognizer on a set of files to separate files containing CRC32-guarded blocks from files without them. When necessary, you can set the recognizer to use a particular CRC polynomial, and you can specify the location and the minimum and maximum length of the CRC block to find the files with the best matches.

Data compression. The data guarded by a checksum is likely to differ from other parts of the data. Separating different kinds of data allows the best compression method to be applied to each part, which can lead to a better overall compression ratio. Additionally, running a pre-processor on certain parts of the data, rather than on the whole, can also improve the compression ratio.

Fuzz testing. Mutating the checksum-guarded data can be a good idea, provided you also update the checksum field in the data. Alternatively, excluding the checksum-guarded data from fuzzing and focusing on other parts of the data is an approach one may consider.

Categories Binary Data Analysis


Motivation

  • The more you know about your data the better you can compress it
  • The more you know about your data the better you can examine it
  • The more you know about your data the better you can mutate it for fuzzing purposes
  • The more you know about your data the more you know about the software that parses it

Background

Back in 2011 I worked on a tool for the analysis of data formats. At the time, I mentioned it in a blog post: The forensic example.

It wasn’t until this year that I started drafting a more organized concept for that tool.

The tool has three layers of abstraction.

Layer 1

Layer 1 takes binary data as input and runs the recognizers on the data. It currently includes more than 50 recognizers. The main idea of the recognizers is to find blocks with specific characteristics in the data.

The result of the analysis is the list of recognized blocks. Description, offset and size fields together describe a block. Here is an example of a recognized block.

Description: Entropy drops after performing single-channel delta encoding
Offset: 1000
Size: 4000
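
In code, such a record might look like the following sketch (the type and field names are illustrative, not BinCovery's actual API):

  from dataclasses import dataclass

  @dataclass
  class RecognizedBlock:
      """A block found by a Layer 1 recognizer."""
      description: str  # the characteristic the recognizer found
      offset: int       # start of the block within the input data
      size: int         # length of the block in bytes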

None of the recognizers rely on magic bytes.

Each recognizer is meant to retain backward and forward compatibility, and each runs independently of the other recognizers.

Layer 2

Layer 2 takes the analysis result of Layer 1. It can filter, and if required suppress, specific items from the analysis result.

For example, it handles overlapped blocks and eliminates potential false positive items.
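
As an illustration only (not BinCovery's actual policy), overlap handling could keep the largest block among overlapping candidates, reusing the RecognizedBlock sketch above:

  def suppress_overlaps(blocks):
      """Keep the largest blocks; drop any candidate that overlaps an
      already kept block (illustrative policy)."""
      kept = []
      for b in sorted(blocks, key=lambda b: b.size, reverse=True):
          if all(b.offset + b.size <= k.offset or k.offset + k.size <= b.offset
                 for k in kept):
              kept.append(b)
      return kept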

Layer 3

Layer 3 implements practical approaches to utilize the analysis result of Layer 2, such as the following.

  • File finder
  • File classifier
  • File comparer
  • Visualizer for the layout of file
  • General-purpose file mutator for fuzzing purposes
  • General-purpose lossy but reversible decompressor

Categories Binary Data Analysis


The Reversing on Windows blog started back in 2009. Until 2016, the posts were hosted on Blogger. The content has now been exported to a PDF document, which is available for download. The document consists of 72 pages and over 80 posts.

Here is a non-exhaustive list of topics the document covers.

  • About hard-coded pointers
  • Approaches to track data-flow
  • Research involving various x86 instructions
  • Analysis of binary data formats
  • Examples of writing a Pintool to detect software defects
  • Security research on Flash Player, Firefox and Internet Explorer

Since 2017, the blog has been continued on suszter.com/ReversingOnWindows.