Pre-processing Printable Unicode Text to Achieve Better Compression Ratio

Posted 9 February 2018 | Modified 21 February 2018
Author Attila Suszter

Introduction

Compression algorithms benefit from pre-processing, also known as filtering, which rearranges data into a more compressible form. This article shows that pre-processing printable Unicode text measurably improves the compression ratio achieved by several common compression methods.

Pre-processing Unicode Text

Printable Unicode text stored as UTF-16 little-endian follows a pattern that can be seen in a hex editor, as shown in Figure 1. A zero (non-printable) byte follows each printable character. Thus, bytes at even offsets are printable characters, while bytes at odd offsets are zeroes.

48 00 65 00 6C 00 6C 00 6F 00 20 00 77 00 6F 00  H.e.l.l.o. .w.o. 
72 00 6C 00 64 00 21 00                          r.l.d.!.

Figure 1: Unicode Text

This pattern suggests a simple way to reorganize the data: write out the bytes at even offsets first, followed by the bytes at odd offsets. Placing the printable characters before the zero bytes yields the layout shown in Figure 2.

48 65 6C 6C 6F 20 77 6F 72 6C 64 21 00 00 00 00  Hello world!.... 
00 00 00 00 00 00 00 00                          ........

Figure 2: Pre-processed Unicode Text
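
The transformation from Figure 1 to Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation; the function names preprocess and restore are hypothetical, and the sketch assumes the input is a byte string such as UTF-16 little-endian text.

```python
def preprocess(data: bytes) -> bytes:
    """Move even-offset bytes to the front and odd-offset bytes to the back."""
    return data[0::2] + data[1::2]


def restore(data: bytes) -> bytes:
    """Inverse of preprocess(): re-interleave the two halves."""
    # The even-offset bytes come first; they get the extra byte if the length is odd.
    half = (len(data) + 1) // 2
    out = bytearray(len(data))
    out[0::2] = data[:half]
    out[1::2] = data[half:]
    return bytes(out)


# The sample from Figure 1, encoded as UTF-16 little-endian.
sample = "Hello world!".encode("utf-16-le")
packed = preprocess(sample)

print(packed.hex(" "))            # printable bytes first, then the zeroes
print(restore(packed) == sample)  # the step is lossless
```

Because the step is a pure permutation of bytes, the output has the same length as the input and can be reversed after decompression.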

Compression algorithms benefit from this better organization of the data. LZ (Lempel-Ziv) based algorithms, for instance, can find matches more easily among the contiguous printable characters, while the zeroes become more compressible because RLE (Run-Length Encoding) can be applied to them effectively.
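
To illustrate the RLE point, the sketch below counts runs in the Figure 1 and Figure 2 layouts. The rle helper is a hypothetical, naive run-length encoder written for this illustration; real compressors use more elaborate schemes.

```python
from itertools import groupby


def rle(data: bytes) -> list[tuple[int, int]]:
    """Naive run-length encoding: a list of (byte value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(data)]


original = "Hello world!".encode("utf-16-le")        # layout of Figure 1
preprocessed = original[0::2] + original[1::2]       # layout of Figure 2

runs_before = rle(original)
runs_after = rle(preprocessed)

print(len(runs_before), "runs in the original layout")
print(len(runs_after), "runs in the pre-processed layout")
print("last run:", runs_after[-1])
```

In the original layout every run has length one, so RLE gains nothing. After pre-processing, all the zeroes collapse into a single run.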

Experimental Analysis

In this experiment, the efficiency of the pre-processor is tested using real-world and random data. The real-world data comes from a Windows registry (.REG) file, while the random data consists of random printable characters. Both files have the same size. Six compression methods are applied to the original and pre-processed samples. Each method is manually set to its maximum compression ratio.
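
A scaled-down version of the random-data test can be sketched with Python's standard library. This is only an approximation of the setup described above: zlib, bz2 and lzma stand in for the archivers used in the article, the sample is far smaller, the character set and the seed are assumptions, and lzma is left at its default preset to keep memory use modest.

```python
import bz2
import lzma
import random
import string
import zlib


def preprocess(data: bytes) -> bytes:
    """Even-offset bytes first, then odd-offset bytes (hypothetical helper)."""
    return data[0::2] + data[1::2]


def compressed_sizes(data: bytes) -> dict[str, int]:
    """Compressed size in bytes for three standard-library compressors."""
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }


# Random printable characters; the alphabet and the seed are assumptions.
rng = random.Random(0)
alphabet = string.ascii_letters + string.digits + string.punctuation + " "
text = "".join(rng.choice(alphabet) for _ in range(20000))
original = text.encode("utf-16-le")

before = compressed_sizes(original)
after = compressed_sizes(preprocess(original))

for name in before:
    gain = (before[name] - after[name]) / before[name] * 100
    print(f"{name}: {before[name]} -> {after[name]} bytes ({gain:.2f}% gain)")
```

The absolute numbers will differ from the tables below, since the tools, settings and sample size are not the same.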

Real-world Data Results

Pre-processing improves compression by 3.72 percent to 43.87 percent, as shown in Table 1. RAR shows the largest gain from pre-processing. ZPAQ and ZIP:Deflate also achieve very good gains, and ZPAQ produces the smallest output for the pre-processed data. The 7z:LZMA algorithm is the most effective for the original data and still achieves a good result for the pre-processed data.

Table 1: Real-world Data Results
	Uncompressed	ZIP:Deflate	BZip2	7z:PPMd	7z:LZMA	RAR	ZPAQ
Original	82,097,412	2,923,818	1,926,411	1,607,819	1,116,793	2,080,856	1,205,937
Pre-processed	82,097,412	2,328,555	1,854,655	1,469,189	1,007,911	1,167,924	694,266
Gain		20.36%	3.72%	8.62%	9.75%	43.87%	42.43%
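
The Gain row can be checked with a short calculation. The sketch below assumes that gain is the size reduction relative to the compressed original, expressed as a percentage; the article does not state the formula, but this definition is consistent with the figures in Table 1.

```python
def gain_percent(original_size: int, preprocessed_size: int) -> float:
    """Relative size reduction, in percent, from pre-processing."""
    return (original_size - preprocessed_size) / original_size * 100


# Compressed sizes in bytes from Table 1: (original, pre-processed).
table1 = {
    "ZIP:Deflate": (2_923_818, 2_328_555),
    "BZip2": (1_926_411, 1_854_655),
    "7z:PPMd": (1_607_819, 1_469_189),
    "7z:LZMA": (1_116_793, 1_007_911),
    "RAR": (2_080_856, 1_167_924),
    "ZPAQ": (1_205_937, 694_266),
}

gains = {name: round(gain_percent(o, p), 2) for name, (o, p) in table1.items()}

for name, gain in gains.items():
    print(f"{name}: {gain}%")
```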

Random Data Results

Findings for the random data in Table 2 show that ZIP:Deflate and RAR achieve gains of 16.4 percent and 11.53 percent respectively, while pre-processing provides negligible improvements for BZip2 and ZPAQ.

Table 2: Random Data Results
	Uncompressed	ZIP:Deflate	BZip2	7z:PPMd	7z:LZMA	RAR	ZPAQ
Original	82,097,412	40,812,580	33,951,889	34,784,486	35,666,811	38,877,871	33,679,718
Pre-processed	82,097,412	34,102,961	33,947,916	34,593,752	34,268,387	34,394,031	33,661,096
Gain		16.4%	0.01%	0.55%	3.92%	11.53%	0.06%

Conclusion

The experiment shows that pre-processing printable Unicode text organizes the data better and improves the compression ratio for every method tested, although the size of the gain varies widely with the algorithm and the data.

Categories Data Compression