Introduction
Compression algorithms benefit from pre-processing, also known as filtering, which makes data more compressible. Pre-processing printable Unicode text measurably improves the overall compression ratio.
Pre-processing Unicode Text
Unicode text (here, UTF-16 little-endian text made up of ASCII-range characters) is characterized by a pattern, which can be viewed in a hex editor as shown in Figure 1. Each printable character is followed by a zero (non-printable) byte. Thus, bytes at even offsets are printable characters, while bytes at odd offsets are zeroes.
48 00 65 00 6C 00 6C 00 6F 00 20 00 77 00 6F 00 H.e.l.l.o. .w.o.
72 00 6C 00 64 00 21 00 r.l.d.!.
Figure 1: Unicode Text
This pattern suggests a way to reorganize the data: the bytes at even offsets and the bytes at odd offsets can be stored as two consecutive runs. Placing the printable characters before the zero bytes gives the layout shown in Figure 2.
48 65 6C 6C 6F 20 77 6F 72 6C 64 21 00 00 00 00 Hello world!....
00 00 00 00 00 00 00 00 ........
Figure 2: Pre-processed Unicode Text
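The rearrangement from Figure 1 to Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the pre-processor used in the experiment: the function names `preprocess` and `restore` are hypothetical, and the input is assumed to be UTF-16 little-endian text as in Figure 1.

```python
def preprocess(data: bytes) -> bytes:
    """Move even-offset bytes to the front and odd-offset bytes to the back."""
    return data[0::2] + data[1::2]


def restore(data: bytes) -> bytes:
    """Invert preprocess() by re-interleaving the two halves."""
    half = (len(data) + 1) // 2  # the even-offset bytes were stored first
    out = bytearray(len(data))
    out[0::2] = data[:half]
    out[1::2] = data[half:]
    return bytes(out)


# Same sample as Figure 1 (assumed encoding: UTF-16 little-endian).
sample = "Hello world!".encode("utf-16-le")
packed = preprocess(sample)
print(packed)
```

The transformation is lossless: `restore(preprocess(x))` returns `x` for any byte string, including odd-length input, so a decompressor can undo it after decoding.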
Compression algorithms benefit from this layout. LZ (Lempel-Ziv) algorithms, for instance, can find matches more easily in the contiguous text, while the run of zeroes compresses well under RLE (Run-Length Encoding).
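One way to observe this effect is to compress a sample before and after the rearrangement. The sketch below uses the codecs in Python's standard library (`zlib`, `bz2`, `lzma`) as stand-ins; they are not the archivers or settings used in the experiment, so the sizes it prints are not comparable to the tables in this article. The helper names and the sample text are assumptions made for illustration.

```python
import bz2
import lzma
import zlib


def preprocess(data: bytes) -> bytes:
    """Even-offset bytes first, then odd-offset bytes (hypothetical helper)."""
    return data[0::2] + data[1::2]


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes for each stand-in codec."""
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data, preset=9)),
    }


# Illustrative sample only: repeated English text encoded as UTF-16 LE.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("utf-16-le")

before = compressed_sizes(text)
after = compressed_sizes(preprocess(text))
for name in before:
    print(f"{name}: original={before[name]} pre-processed={after[name]}")
```

How much each codec gains, if anything, depends on the input and on the codec, which is what the experiment below measures on larger samples.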
Experimental Analysis
In this experiment, the effectiveness of the pre-processor is tested on real-world and random data. The real-world sample is a Windows registry (.REG) file, while the random sample consists of random printable characters. Both files are the same size. Six compression methods are applied to the original and pre-processed samples, each manually set to its maximum compression level.
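The article does not say how the random sample was produced. As a rough sketch of one possibility, the snippet below draws printable ASCII characters at random and encodes them as UTF-16 little-endian so that the result has the same byte pattern as Figure 1. The function name, the alphabet, and the seed are all assumptions.

```python
import random
import string


def random_printable_utf16(n_chars: int, seed: int = 0) -> bytes:
    """Return n_chars random printable ASCII characters as UTF-16 LE bytes."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    # Assumed alphabet: letters, digits, punctuation, and the space character.
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    text = "".join(rng.choices(alphabet, k=n_chars))
    return text.encode("utf-16-le")


data = random_printable_utf16(1000)
print(len(data))
```

Matching the size of a real-world file would then be a matter of choosing `n_chars` as half of that file's size in bytes.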
Real-world Data Results
Pre-processing improves compression to varying degrees, with gains ranging from 3.72 percent to 43.87 percent as shown in Table 1. RAR gains the most from pre-processing (43.87 percent), followed by ZPAQ (42.43 percent) and ZIP:Deflate (20.36 percent). ZPAQ produces the smallest output for the pre-processed data. The 7z:LZMA algorithm is the most effective for the original data and still achieves a good result for the pre-processed data.
Table 1: Real-world Data Results
| | Uncompressed | ZIP:Deflate | BZip2 | 7z:PPMd | 7z:LZMA | RAR | ZPAQ |
|---|---|---|---|---|---|---|---|
| Original | 82,097,412 | 2,923,818 | 1,926,411 | 1,607,819 | 1,116,793 | 2,080,856 | 1,205,937 |
| Pre-processed | 82,097,412 | 2,328,555 | 1,854,655 | 1,469,189 | 1,007,911 | 1,167,924 | 694,266 |
| Gain | | 20.36% | 3.72% | 8.62% | 9.75% | 43.87% | 42.43% |
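The Gain figures are consistent with the relative size reduction, (original - pre-processed) / original, expressed as a percentage. The article does not state this formula explicitly, so treat it as an inference from the numbers. The short check below reproduces the Table 1 gains from the compressed sizes; the function name `gain` is hypothetical.

```python
def gain(original: int, preprocessed: int) -> float:
    """Relative size reduction from pre-processing, in percent."""
    return (original - preprocessed) / original * 100


# Compressed sizes copied from Table 1: (original, pre-processed).
table1 = {
    "ZIP:Deflate": (2_923_818, 2_328_555),
    "BZip2": (1_926_411, 1_854_655),
    "7z:PPMd": (1_607_819, 1_469_189),
    "7z:LZMA": (1_116_793, 1_007_911),
    "RAR": (2_080_856, 1_167_924),
    "ZPAQ": (1_205_937, 694_266),
}

gains = {name: round(gain(o, p), 2) for name, (o, p) in table1.items()}
for name, value in gains.items():
    print(f"{name}: {value}%")
```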
Random Data Results
For the random data, ZIP:Deflate and RAR achieve gains of 16.4 percent and 11.53 percent respectively, as shown in Table 2, while pre-processing provides negligible improvement for BZip2 and ZPAQ.
Table 2: Random Data Results
| | Uncompressed | ZIP:Deflate | BZip2 | 7z:PPMd | 7z:LZMA | RAR | ZPAQ |
|---|---|---|---|---|---|---|---|
| Original | 82,097,412 | 40,812,580 | 33,951,889 | 34,784,486 | 35,666,811 | 38,877,871 | 33,679,718 |
| Pre-processed | 82,097,412 | 34,102,961 | 33,947,916 | 34,593,752 | 34,268,387 | 34,394,031 | 33,661,096 |
| Gain | | 16.4% | 0.01% | 0.55% | 3.92% | 11.53% | 0.06% |
Conclusion
The experiment shows that pre-processing printable Unicode text makes the data better organized and improves the compression ratio for all six methods tested, with gains ranging from 0.01 percent to 43.87 percent depending on the algorithm and the data.