Text Compression
This overview explores the principles of text compression, focusing on Huffman coding. It highlights how texts can be encoded for efficient storage and retrieval, allowing for reconstruction from the compressed form. Additionally, it utilizes an example from Henry Wadsworth Longfellow to illustrate communication signaling methods. The document also compares various encoding formats, such as ASCII and Unicode, detailing their efficiency. By analyzing data through variable-length codes and constructing a Huffman tree, we uncover how frequently used characters are optimally compressed.
Text Compression
E N D
Presentation Transcript
Text Compression • Input: String S • Output: String S • Shorter • S can be reconstructed from S
Text Compression • … He said to his friend, "If the British marchBy land or sea from the town to-night,Hang a lantern aloft in the belfry archOf the North Church tower as a signal light,--One if by land, and two if by sea;And I on the opposite shore will be,Ready to ride and spread the alarmThrough every Middlesex village and farm,For the country folk to be up and to arm." … Henry Wadsworth Longfellow (1807-1882) By land 1 By sea 2
1 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 a a b b 0 1 e e c c d d Text Compression: Examples • “abcde” in the different formats • ASCII: 01100001011000100110001101100100… • Fixed: • 000001010011100 • Var: • 000110100110 • Encodings ASCII: 8 bits/character Unicode: 16 bits/character Unicode adds 8 more 0’s to left of ASCII
13 7 3 3 s s * * 1 1 2 2 6 6 g g o o 4 4 3 3 3 3 2 2 2 2 2 2 p p p e e e h h h r r r 1 1 1 1 1 1 1 1 1 1 1 1 Huffman coding: go go gophers ASCII(7bits) 3 bits Huffman g 103 1100111 3000 00 o 111 1101111 3001 01 p 112 1110000 1010 1100 h 104 1101000 1011 1101 e 101 1100101 1100 1110 r 114 1110010 1101 1111 s 115 1110011 1110 100 sp. 32 10000002 111 101 • Encoding uses tree: • 0 left/1 right • How many bits? 37!! • Savings? Worth it?
Huffman Coding • D.A Huffman in early 1950’s • Before compressing data, analyze the input stream • Represent data using variable length codes • Variable length codes though Prefix codes • Each letter is assigned a codeword • Codeword is for a given letter is produced by traversing the Huffman tree • Property: No codeword produced is the prefix of another • Letters appearing frequently have short codewords, while those that appear rarely have longer ones • Huffman coding is optimalper-character coding method
Building a Huffman tree • Begin with a forest of single-node trees (leaves) • Each node/tree/leaf is weighted with character count • Node stores two values: character and count • There are n nodes in forest, n is size of alphabet? • Repeat until there is only one node left: root of tree • Remove two minimally weighted trees from forest • Create new tree with minimal trees as children, • New tree root's weight: sum of children (character ignored) • Does this process terminate? How do we get minimal trees? • Remove minimal trees, hmmm……
R N C E F P U L S G T O D B A M I 11 1 2 1 2 4 2 2 5 4 1 3 5 3 3 6 3 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
F A S E N G C B M U O R L P T D I 11 2 6 5 5 1 1 1 2 2 3 2 2 2 4 3 3 4 3 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
L G M E N C F T S P U B R O D A I 11 2 1 1 1 3 5 6 2 5 2 3 2 2 3 2 3 3 2 4 4 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
M E S F T C O P N A R L B D G U I 11 2 2 1 1 3 5 5 6 2 1 4 3 3 4 2 4 3 3 3 2 2 4 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
D O N F S C T P E M R L A B G U I 11 2 2 2 1 1 6 5 5 3 2 1 4 4 3 4 2 3 4 3 4 3 3 4 2 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
D N E F P C S U T M L A G B O R I 11 2 2 2 2 1 6 1 5 5 3 5 1 2 3 5 4 3 4 4 3 4 2 4 4 3 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
P N F B C T U S E L M D G A O R I 11 2 2 2 2 1 6 1 5 5 3 6 1 3 4 6 2 4 2 4 4 4 4 5 5 3 3 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
G F E B C P U R N S D M O A T L I 11 5 1 1 3 1 2 6 2 2 3 2 5 2 6 5 6 4 5 3 4 3 2 4 4 4 4 6 6 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
U B M T O S G D L R P C A F N E I 11 2 3 8 2 2 2 2 1 3 1 1 5 5 6 3 3 3 4 8 6 4 6 6 6 5 4 5 4 4 4 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
U B M T O S G D L R P C A F N E I 11 2 3 8 2 2 2 2 1 3 1 1 5 5 6 3 3 3 5 6 8 4 8 6 8 6 4 6 4 4 5 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O R T S M A B N E U F G C D P L I 11 10 10 4 2 2 1 2 1 3 2 3 5 3 5 3 6 1 6 2 3 4 4 5 5 6 6 4 8 8 6 8 8 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O L E S M N A R B T C P G U D F I 11 11 11 10 10 4 4 2 2 2 3 1 3 1 3 1 3 5 5 8 2 2 3 4 4 5 6 6 8 6 8 8 2 6 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O D M N S E F A L B P T U R G C I 12 11 11 10 12 10 11 1 4 1 3 1 3 2 2 3 2 2 4 3 5 5 2 4 4 5 6 8 8 6 8 8 6 2 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O G F N E S C M D A U B R T L P I 12 11 16 12 10 11 10 16 11 4 1 4 2 3 3 2 3 2 3 1 2 2 3 2 4 5 6 8 6 8 1 5 5 6 4 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O G F N E S C M D A U B R T L P I 16 11 21 12 21 11 10 16 12 4 1 4 2 3 3 2 3 2 3 1 2 2 3 2 4 5 6 8 6 8 1 5 5 6 4 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
S M U A B T O E N F D C L P R G I 21 23 11 11 16 12 16 23 21 10 1 1 4 2 2 2 3 3 3 2 1 2 5 2 6 4 3 6 4 6 5 4 8 8 3 5 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
S M U A B T O E N F D C L P R G I 21 23 11 11 37 12 16 37 23 10 1 1 4 2 2 2 3 3 3 2 1 2 5 2 6 4 3 6 4 6 5 4 8 8 3 5 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
S M U A B T O E N F D C L P R G I 21 23 11 11 37 12 16 60 60 10 1 1 4 2 2 2 3 3 3 2 1 2 5 2 6 4 3 6 4 6 5 4 8 8 3 5 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
Encoding • Count occurrence of all occurring characters O( ) • Build priority queueO( ) • Build Huffman treeO( ) • Create Table of codes from treeO( ) • Write Huffman tree and coded data to fileO( ) N A A log A A log A N
Properties of Huffman coding • Want to minimize weighted path length L(T)of tree T • wi is the weight or count of each codeword i • di is the leaf corresponding to codeword i • How do we calculate character (codeword) frequencies? • Huffman coding creates pretty full bushy trees? • When would it produce a “bad” tree? • How do we produce coded compressed data from input efficiently?
Writing code out to file • How do we go from characters to encodings? • Build Huffman tree • Root-to-leaf path generates encoding • Need way of writing bits out to file • Platform dependent? • Complicated to write bits and read in same ordering • See BitInputStream and BitOutputStream classes • Depend on each other, bit ordering preserved • How do we know bits come from compressed file? • Store a magic number
O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 01100000100001001101
O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 1100000100001001101
O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 100000100001001101
O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 00000100001001101
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 0000100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 000100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 00100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 0100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 00001001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 0001001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 001001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 1001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 001101 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01101 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 1101 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 101 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 1 GOOD
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01100000100001001101 GOOD
Decoding • Read in tree data O( ) • Decode bit string with tree O( )
Huffman Tree 2 • “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” • E.g. “ A SIMPLE” “10101101001000101001110011100000”