An overview of different compression algorithms and their application to compressing inverted files. Alternative compression algorithms: arithmetic coding; Huffman coding (character-based and word-based); dictionary-based coding (the Ziv-Lempel family). Pros and cons of the different algorithms.
An Overview of Different Compression Algorithms: Their Application to Compressing Inverted Files
Alternative Compression Algorithms
• Arithmetic coding
• Huffman coding: character-based or word-based
• Dictionary-based coding: the Ziv-Lempel family
Choosing a Compression Algorithm for Inverted Files
• Factors to consider: compression ratio, speed, and random access.
• In modern IR systems, word-based Huffman coding is commonly used.
• There is a lot of research on the Ziv-Lempel family of coding to see whether it can be applied to index compression.
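To make word-based Huffman coding concrete, here is a minimal sketch in Python. It assumes the text is tokenized on whitespace and that separators are not encoded; the function names (`word_huffman_codes`, `encode`, `decode`) are illustrative and do not come from the slides or from any particular IR system.

```python
import heapq
from collections import Counter

def word_huffman_codes(text):
    """Build a Huffman code whose symbols are whole words, not characters."""
    freq = Counter(text.split())
    if not freq:
        return {}
    if len(freq) == 1:
        return {word: "0" for word in freq}
    # Heap entries: (weight, unique tie-breaker, {word: code built so far}).
    heap = [(count, i, {word: ""}) for i, (word, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[w] for w in text.split())

def decode(bits, codes):
    inverse = {c: w for w, c in codes.items()}
    words, current = [], ""
    for b in bits:
        current += b
        if current in inverse:  # codes are prefix-free, so the first hit is the word
            words.append(inverse[current])
            current = ""
    return " ".join(words)
```

Frequent words receive short codes, which is where the compression comes from. A production coder would also have to store the vocabulary and handle separators, both of which this sketch leaves out.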
An Improved Sliding-Window Ziv-Lempel Algorithm
• Conventional LZ-family compression algorithms use a sliding-window approach based on the longest match length (m-length).
• Bender and Wolf propose an improved sliding-window LZ algorithm.
• Instead of the m-length, the improved algorithm is based on the offset of the length (o-length) and the differential of the length (Δ-length).
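For orientation, the conventional longest-match (m-length) approach can be sketched as follows. This is only the baseline sliding-window scheme, not Bender and Wolf's improved algorithm; the token layout, the `window` and `min_match` parameters, and the brute-force search are assumptions chosen for readability.

```python
def lz_sliding_window(data, window=32, min_match=2):
    """Encode a string as tokens: ("lit", char) or ("copy", offset, length)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Try every start position inside the window and keep the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            tokens.append(("copy", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz_decode(tokens):
    out = []
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, offset, length = tok
            # Copy one symbol at a time so overlapping matches work.
            for _ in range(length):
                out.append(out[-offset])
    return "".join(out)
```

The brute-force search costs time proportional to the window size at every position; practical implementations index the window with a hash table or tree instead.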
Benefits of the Improved Algorithm
• A better compression ratio in the experiment.
• Compression and searching remain linear: O(n).
• Limitation: it does not provide an LZ algorithm that supports random access.
Another Modified LZ Algorithm
• Proposed by Williams.
• Uses literal and copy items.
• At each step, it transmits the original data if the item is a literal, or a pointer if the item is a copy.
• Aimed at faster compression and a smaller memory footprint.
• Better suited to embedded systems where real-time compression is required.
• Inappropriate for index compression.
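The literal/copy idea can be sketched as below. This is a simplified illustration in the spirit of Williams' scheme, not his actual algorithm or output format: the single-probe hash table, the hash function, the limits (`min_len`, `max_len`, `max_offset`), and the tuple representation of items are all assumptions made for the example.

```python
def lit_copy_compress(data, table_size=4096, min_len=3, max_len=16, max_offset=4095):
    """Turn bytes into items: (0, byte) for a literal, (1, offset, length) for a copy."""
    table = [-1] * table_size  # hash slot -> last position seen with that 3-byte prefix
    items = []
    i, n = 0, len(data)
    while i < n:
        length, cand = 0, -1
        if i + 3 <= n:
            # Illustrative hash of the next three bytes; one probe, no chaining.
            slot = ((data[i] << 8) ^ (data[i + 1] << 4) ^ data[i + 2]) % table_size
            cand = table[slot]
            table[slot] = i
            if cand >= 0 and i - cand <= max_offset:
                while (length < max_len and i + length < n
                       and data[cand + length] == data[i + length]):
                    length += 1
        if length >= min_len:
            items.append((1, i - cand, length))  # copy item: pointer into earlier data
            i += length
        else:
            items.append((0, data[i]))           # literal item: the original byte
            i += 1
    return items

def lit_copy_decompress(items):
    out = bytearray()
    for item in items:
        if item[0] == 0:
            out.append(item[1])
        else:
            _, offset, length = item
            for _ in range(length):  # byte-by-byte so overlapping copies work
                out.append(out[-offset])
    return bytes(out)
```

Checking a single hash-table candidate per position, instead of searching for the longest match, trades compression ratio for speed and a small fixed memory footprint, which matches the goals listed above.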
Conclusion
• To date, the best practical compression algorithm for indices is still word-based Huffman coding.
• There are theoretical studies of Ziv-Lempel family coding. None of them is practically applicable to our problem, but they can be used in other areas.
References
• An Improved Data Compression Algorithm Based on Ziv-Lempel Data Compression Algorithm, Paul Edward Bender and Jack Keil Wolf.
• An Extremely Fast Ziv-Lempel Data Compression Algorithm, Ross N. Williams.
• Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto.