Pre-processing Printable Unicode Text to Achieve Better Compression Ratio



Compression algorithms benefit from pre-processing, also known as filtering, which rearranges data so that it becomes more compressible. Pre-processing printable Unicode text measurably improves the compression ratio of several common compression methods.

Pre-processing Unicode Text

Unicode text stored as UTF-16 (little-endian) shows a recognizable pattern when its characters fall in the ASCII range, as can be seen in a hex editor in Figure 1. A zero (non-printable) byte follows each printable character. Thus, bytes at even offsets are printable characters, while bytes at odd offsets are zeroes.

48 00 65 00 6C 00 6C 00 6F 00 20 00 77 00 6F 00  H.e.l.l.o. .w.o. 
72 00 6C 00 64 00 21 00                          r.l.d.!.

Figure 1: Unicode Text

This pattern suggests a way to reorganize the data: the bytes at even offsets can be stored first, followed by the bytes at odd offsets. Placing the printable characters before the zero bytes gives the layout shown in Figure 2.

48 65 6C 6C 6F 20 77 6F 72 6C 64 21 00 00 00 00  Hello world!.... 
00 00 00 00 00 00 00 00                          ........

Figure 2: Pre-processed Unicode Text
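
The reordering from Figure 1 to Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names preprocess and restore are hypothetical, and the handling of odd-length input (the extra byte goes to the even-offset half) is an assumption.

```python
def preprocess(data: bytes) -> bytes:
    """Move even-offset bytes to the front and odd-offset bytes to the back."""
    return data[0::2] + data[1::2]


def restore(data: bytes) -> bytes:
    """Invert preprocess() by re-interleaving the two halves."""
    # Assumption: for odd-length input the even-offset half holds the extra byte.
    split = (len(data) + 1) // 2
    out = bytearray(len(data))
    out[0::2] = data[:split]   # printable characters return to even offsets
    out[1::2] = data[split:]   # zero bytes return to odd offsets
    return bytes(out)


sample = "Hello world!".encode("utf-16-le")  # the bytes shown in Figure 1
filtered = preprocess(sample)                # the bytes shown in Figure 2
print(filtered.hex(" "))
```

Because restore() undoes the reordering exactly, the filter is lossless and can be applied in front of any compressor.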

Compression algorithms benefit from this layout. LZ (Lempel-Ziv) algorithms can find matches more easily in the contiguous text, for instance, while the block of zeroes is highly compressible because RLE (Run-Length Encoding) can be applied to it effectively.
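
The effect on run-length encoding can be illustrated with the 24 bytes from the figures. The sketch below is only an illustration: run_lengths is a hypothetical helper that lists (byte value, run length) pairs, not a real RLE file format.

```python
from itertools import groupby


def run_lengths(data: bytes) -> list[tuple[int, int]]:
    """Return (byte value, run length) pairs for consecutive equal bytes."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(data)]


original = "Hello world!".encode("utf-16-le")       # interleaved text and zeroes
preprocessed = original[0::2] + original[1::2]      # text first, zeroes last

print(len(run_lengths(original)), "runs in the original layout")
print(len(run_lengths(preprocessed)), "runs in the pre-processed layout")
print("last run:", run_lengths(preprocessed)[-1])
```

In the original layout no two neighbouring bytes are equal, so every byte is its own run. After pre-processing, all zeroes collapse into a single run.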

Experimental Analysis

In this experiment, the efficiency of the pre-processor is tested on real-world and random data. The real-world sample is a Windows registry (.REG) file, while the random sample consists of random printable characters. Both files have the same size. Six compression methods are applied to the original and pre-processed samples. Each method is manually set to its maximum compression level.
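
A small-scale version of the random-data test can be sketched with Python's standard library. This is not the setup used for the tables: zlib, bz2 and lzma stand in for the archivers, the sample size of 100,000 characters and the seed are arbitrary, and the names make_random_sample and compare are hypothetical.

```python
import bz2
import lzma
import random
import string
import zlib


def preprocess(data: bytes) -> bytes:
    """Even-offset bytes first, odd-offset bytes last."""
    return data[0::2] + data[1::2]


def make_random_sample(n_chars: int, seed: int = 0) -> bytes:
    """Random printable ASCII characters, stored as UTF-16 little-endian."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    text = "".join(rng.choices(alphabet, k=n_chars))
    return text.encode("utf-16-le")


# Stand-in compressors, each at its highest standard level (an assumption;
# the article's tools and settings are not reproduced here).
COMPRESSORS = {
    "zlib (Deflate)": lambda d: zlib.compress(d, 9),
    "bz2": lambda d: bz2.compress(d, 9),
    "lzma": lambda d: lzma.compress(d, preset=9),
}


def compare(sample: bytes) -> dict[str, tuple[int, int]]:
    """Map each compressor name to (original size, pre-processed size)."""
    filtered = preprocess(sample)
    return {name: (len(fn(sample)), len(fn(filtered)))
            for name, fn in COMPRESSORS.items()}


results = compare(make_random_sample(100_000))
for name, (orig, pre) in results.items():
    print(f"{name:15} original={orig:>8,}  pre-processed={pre:>8,}")
```

The absolute numbers will differ from the tables because the sample and the compressors differ. The sketch only shows how the two variants of one sample can be compared.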

Real-world Data Results

Pre-processing improves compression for every method, with gains ranging from 3.72 percent (BZip2) to 43.87 percent (RAR), as shown in Table 1. RAR benefits the most from pre-processing, followed by ZPAQ (42.43 percent) and ZIP:Deflate (20.36 percent). ZPAQ produces the smallest output for the pre-processed data. 7z:LZMA is the most effective method for the original data and still achieves the second smallest output for the pre-processed data.

Table 1: Real-world Data Results
Data           Uncompressed  ZIP:Deflate       BZip2     7z:PPMd     7z:LZMA         RAR        ZPAQ
Original         82,097,412    2,923,818   1,926,411   1,607,819   1,116,793   2,080,856   1,205,937
Pre-processed    82,097,412    2,328,555   1,854,655   1,469,189   1,007,911   1,167,924     694,266
Gain                      -       20.36%       3.72%       8.62%       9.75%      43.87%      42.43%
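
The Gain row is consistent with the relative size reduction, (original - pre-processed) / original. The article does not state the formula, so this reading is an assumption. The sketch below recomputes the row from the sizes in Table 1, and gain_percent is a hypothetical helper name.

```python
def gain_percent(original: int, preprocessed: int) -> float:
    """Relative size reduction, in percent, of the compressed output."""
    return (original - preprocessed) / original * 100


# Compressed sizes copied from Table 1: (original, pre-processed)
table1 = {
    "ZIP:Deflate": (2_923_818, 2_328_555),
    "BZip2": (1_926_411, 1_854_655),
    "7z:PPMd": (1_607_819, 1_469_189),
    "7z:LZMA": (1_116_793, 1_007_911),
    "RAR": (2_080_856, 1_167_924),
    "ZPAQ": (1_205_937, 694_266),
}

for name, (orig, pre) in table1.items():
    print(f"{name:12} {gain_percent(orig, pre):6.2f}%")
```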

Random Data Results

For the random data, ZIP:Deflate and RAR achieve gains of 16.4 percent and 11.53 percent respectively, and 7z:LZMA gains 3.92 percent, while pre-processing provides only negligible improvements for BZip2, 7z:PPMd and ZPAQ (0.55 percent or less), as shown in Table 2.

Table 2: Random Data Results
Data           Uncompressed  ZIP:Deflate       BZip2     7z:PPMd     7z:LZMA         RAR        ZPAQ
Original         82,097,412   40,812,580  33,951,889  34,784,486  35,666,811  38,877,871  33,679,718
Pre-processed    82,097,412   34,102,961  33,947,916  34,593,752  34,268,387  34,394,031  33,661,096
Gain                      -        16.4%       0.01%       0.55%       3.92%      11.53%       0.06%


The experiment shows that pre-processing printable Unicode text organizes the data better and improves the compression ratio of all six methods tested, although the size of the gain depends strongly on the method and on the data.