Data Compression and Huffman Trees (HW 4)
Representing Text (ASCII) • A way of representing characters as bits. • Characters include ‘a’, ‘b’, ‘1’, ‘%’, ‘@’, ‘\n’, ‘\t’… • Each character is represented by a unique 7-bit code, so there are 2^7 = 128 possible characters. • STATIC (FIXED) LENGTH ENCODING: every character uses the same number of bits. • To encode a long text, we encode it character by character.
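To make the fixed-length idea concrete, here is a minimal Java sketch that writes each ASCII character as exactly 7 bits and decodes by cutting the bit string into 7-bit groups. The class and method names (`AsciiFixedLength`, `encode`, `decode`) are illustrative assumptions and not part of the homework.

```java
// Sketch only: fixed-length (7-bit) encoding of ASCII text.
// Class and method names are hypothetical, chosen for illustration.
class AsciiFixedLength {

    /** Encodes each character as exactly 7 bits, most significant bit first. */
    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c > 127) {
                throw new IllegalArgumentException("not a 7-bit ASCII character: " + c);
            }
            for (int i = 6; i >= 0; i--) {
                bits.append((c >> i) & 1);   // appends the digit 0 or 1
            }
        }
        return bits.toString();
    }

    /** Decodes by reading the bit string in groups of 7 bits. */
    static String decode(String bits) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 7 <= bits.length(); i += 7) {
            text.append((char) Integer.parseInt(bits.substring(i, i + 7), 2));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String bits = encode("e&");
        // 'e' and '&' each cost 7 bits, regardless of how common they are.
        System.out.println(bits + " (" + bits.length() + " bits)");
        System.out.println(decode(bits));
    }
}
```

Because every code has the same length, the decoder never has to guess where one character ends and the next begins.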
Inefficiency of ASCII • Observation: In many natural-language files, we are far more likely to see the letter ‘e’ than the character ‘&’, yet both are encoded using 7 bits! • Solution: Use variable-length encoding! The encoding for ‘e’ should be shorter than the encoding for ‘&’.
Variable Length Coding • Assume we know the distribution of characters (‘e’ appears 1000 times, ‘&’ appears 1 time). • The more frequent a character is, the fewer bits its code should use (made precise later). • We need a ‘prefix-free’ encoding: if ‘e’ = 001, then we cannot assign ‘&’ = 0011. Since codes have variable length, the decoder needs to know where each code stops; if no code is a prefix of another, the end of each code is unambiguous.
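The prefix-free condition can be checked mechanically. The sketch below compares every pair of codes; it assumes codes are given as strings of '0' and '1' characters, and `PrefixFreeCheck` / `isPrefixFree` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: tests whether a set of binary code words is prefix free.
class PrefixFreeCheck {

    /** Returns true if no code word is a prefix of a different code word. */
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;   // codes.get(i) is a prefix of codes.get(j)
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The slide's counterexample: 001 is a prefix of 0011.
        System.out.println(isPrefixFree(Arrays.asList("001", "0011")));
        // A valid assignment: no code starts with another code.
        System.out.println(isPrefixFree(Arrays.asList("0", "10", "11")));
    }
}
```

The quadratic pairwise comparison is chosen for clarity; it is fine for an alphabet of 128 characters.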
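Encoding Trees • Think of the encoding as an (unbalanced) binary tree. • Data is in leaf nodes only (prefix free). • ‘e’ = 0, ‘a’ = 10, ‘b’ = 11 • How to decode ‘01110’? [Figure: encoding tree. From the root, edge 0 leads to leaf ‘e’ and edge 1 leads to an internal node, whose edge 0 leads to leaf ‘a’ and edge 1 leads to leaf ‘b’.]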
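Decoding walks the tree from the root, following one edge per bit, and emits a character each time a leaf is reached. The following is a minimal sketch for the three-letter tree on this slide; the `Node` class here is a simplified stand-in, not the `HuffmanNode` required by the homework.

```java
// Sketch only: decoding a bit string with the slide's tree (e = 0, a = 10, b = 11).
class DecodeDemo {

    static class Node {
        final char symbol;      // meaningful only in leaves
        final Node zero, one;   // children along bit 0 and bit 1; null in leaves

        Node(char symbol) { this.symbol = symbol; this.zero = null; this.one = null; }
        Node(Node zero, Node one) { this.symbol = '\0'; this.zero = zero; this.one = one; }

        boolean isLeaf() { return zero == null && one == null; }
    }

    /** Builds the example tree: root -0-> 'e', root -1-> (-0-> 'a', -1-> 'b'). */
    static Node exampleTree() {
        return new Node(new Node('e'), new Node(new Node('a'), new Node('b')));
    }

    /** Follows one edge per bit; on reaching a leaf, emits it and restarts at the root. */
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur.isLeaf()) {
                out.append(cur.symbol);
                cur = root;
            }
        }
        if (cur != root) {
            throw new IllegalArgumentException("bit string ends in the middle of a code");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(exampleTree(), "01110"));   // prints "eba"
    }
}
```

For the slide's question, ‘01110’ splits as 0 | 11 | 10, which decodes to "eba".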
Cost of a Tree • For each character c_i, let f_i be its frequency in the file. • Given an encoding tree T, let d_i be the depth of c_i in the tree (the number of bits needed to encode the character). • The length of the file after encoding it with the coding scheme defined by T will be C(T) = Σ_i d_i f_i.
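As a sanity check of the cost formula, the sketch below sums depth times frequency over all characters, using the length of each code as its depth. The codes are those of the earlier three-letter tree (‘e’ = 0, ‘a’ = 10, ‘b’ = 11); the frequencies 5, 2, 1 are made up for illustration, and `TreeCost` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: computes the sum over all characters of depth * frequency.
class TreeCost {

    /** The depth of a character equals the length of its code. */
    static long cost(Map<Character, String> codes, Map<Character, Integer> freq) {
        long total = 0;
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            int depth = codes.get(e.getKey()).length();
            total += (long) depth * e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('e', "0");
        codes.put('a', "10");
        codes.put('b', "11");

        Map<Character, Integer> freq = new HashMap<>();   // illustrative frequencies
        freq.put('e', 5);
        freq.put('a', 2);
        freq.put('b', 1);

        // 1*5 + 2*2 + 2*1 = 11 bits, versus 7 * 8 = 56 bits in 7-bit ASCII.
        System.out.println(cost(codes, freq));
    }
}
```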
Creating an Optimal T • Problem: Find a tree T with minimal C(T). • Solution (Huffman 1952): • Create a single-node tree for each character. The weight W(T) of the tree is the frequency of the character. • Repeat n − 1 times (n = number of chars), until only one tree remains: • Select and remove the two trees T’, T’’ with the lowest weights. Merge them under a new root to form T. • Set W(T) = W(T’) + W(T’’). • Implement using a min-heap. • What is the running time?
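The merging loop can be sketched in a few lines of Java. This sketch uses `java.util.PriorityQueue` as a stand-in min-heap (the homework asks for a hand-written `BinaryHeap` instead), and the names `HuffmanSketch`, `Node`, `build` and `codes` are illustrative assumptions. Ties between equal weights may be broken differently by another heap, so the exact bit patterns can differ while the code lengths stay optimal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch only: Huffman's algorithm with java.util.PriorityQueue as the min-heap.
class HuffmanSketch {

    static class Node implements Comparable<Node> {
        final int weight;
        final char symbol;        // used only in leaves
        final Node left, right;   // null in leaves

        Node(char symbol, int weight) {
            this.symbol = symbol; this.weight = weight; this.left = null; this.right = null;
        }
        Node(Node left, Node right) {
            this.symbol = '\0'; this.weight = left.weight + right.weight;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }

        @Override public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
    }

    /** Starts with one tree per character and merges the two lightest until one remains. */
    static Node build(Map<Character, Integer> freq) {
        PriorityQueue<Node> heap = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            heap.add(new Node(e.getKey(), e.getValue()));
        }
        while (heap.size() > 1) {          // runs n - 1 times for n characters
            Node a = heap.poll();
            Node b = heap.poll();
            heap.add(new Node(a, b));      // W(T) = W(T') + W(T'')
        }
        return heap.poll();                // null if freq was empty
    }

    /** Reads the code of each character off the tree: left edge = 0, right edge = 1. */
    static Map<Character, String> codes(Node root) {
        Map<Character, String> out = new HashMap<>();
        collect(root, "", out);
        return out;
    }

    private static void collect(Node n, String prefix, Map<Character, String> out) {
        if (n == null) return;
        if (n.isLeaf()) {
            // A one-character alphabet still needs at least one bit per character.
            out.put(n.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        collect(n.left, prefix + "0", out);
        collect(n.right, prefix + "1", out);
    }
}
```

With a binary heap, each insertion and each removal costs O(log n), and there are n − 1 merges, so this approach runs in O(n log n) time for n characters.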
Optimality Intuition • Need to show that Huffman’s algorithm indeed results in a tree T with optimal C(T) = Σ_i d_i f_i. • The two least-weight letters should be at the bottom as siblings (otherwise we could improve the cost by swapping them with heavier letters that sit deeper). • Intuitively, when we combine two trees we can think of the result as a new letter with the combined weight.
Homework • Implement: • public class HuffmanTree • public class HuffmanNode • public class BinaryHeap • Read a file ‘huff.txt’ which includes letters and frequencies: • A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2 • Create a Huffman Tree using the discussed algorithm (book 389-395) • Print “legend”: the code of each character
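One possible way to read the letter/frequency pairs is to consume whitespace-separated tokens two at a time. This is a sketch under the assumption that ‘huff.txt’ contains alternating letter and count tokens separated by spaces or newlines, as in the sample line above; `FrequencyReader` and `parse` are hypothetical names, not part of the required classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

// Sketch only: reads alternating "letter count" tokens into a frequency table.
class FrequencyReader {

    /** Consumes tokens in pairs: a letter followed by its integer frequency. */
    static Map<Character, Integer> parse(Scanner in) {
        Map<Character, Integer> freq = new LinkedHashMap<>();   // keeps file order
        while (in.hasNext()) {
            char letter = in.next().charAt(0);
            int count = in.nextInt();
            freq.put(letter, count);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Parses the sample data from a string; a file would be read the same way.
        Scanner in = new Scanner("A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2");
        System.out.println(parse(in));
    }
}
```

To read the actual file, the scanner could be created with `new Scanner(new java.io.File("huff.txt"))`, which throws `FileNotFoundException` if the file is missing.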