1 / 59

Text Compression

Text Compression. Input: String S Output: String S  Shorter S can be reconstructed from S . Text Compression. …

tavita
Télécharger la présentation

Text Compression

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Text Compression • Input: String S • Output: String S • Shorter • S can be reconstructed from S

  2. Text Compression • … He said to his friend, "If the British marchBy land or sea from the town to-night,Hang a lantern aloft in the belfry archOf the North Church tower as a signal light,--One if by land, and two if by sea;And I on the opposite shore will be,Ready to ride and spread the alarmThrough every Middlesex village and farm,For the country folk to be up and to arm." … Henry Wadsworth Longfellow (1807-1882) By land 1 By sea 2

  3. 1 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 a a b b 0 1 e e c c d d Text Compression: Examples • “abcde” in the different formats • ASCII: 01100001011000100110001101100100… • Fixed: • 000001010011100 • Var: • 000110100110 • Encodings ASCII: 8 bits/character Unicode: 16 bits/character Unicode adds 8 more 0’s to left of ASCII

  4. 13 7 3 3 s s * * 1 1 2 2 6 6 g g o o 4 4 3 3 3 3 2 2 2 2 2 2 p p p e e e h h h r r r 1 1 1 1 1 1 1 1 1 1 1 1 Huffman coding: go go gophers ASCII(7bits) 3 bits Huffman g 103 1100111 3000 00 o 111 1101111 3001 01 p 112 1110000 1010 1100 h 104 1101000 1011 1101 e 101 1100101 1100 1110 r 114 1110010 1101 1111 s 115 1110011 1110 100 sp. 32 10000002 111 101 • Encoding uses tree: • 0 left/1 right • How many bits? 37!! • Savings? Worth it?

  5. Huffman Coding • D.A Huffman in early 1950’s • Before compressing data, analyze the input stream • Represent data using variable length codes • Variable length codes though Prefix codes • Each letter is assigned a codeword • Codeword is for a given letter is produced by traversing the Huffman tree • Property: No codeword produced is the prefix of another • Letters appearing frequently have short codewords, while those that appear rarely have longer ones • Huffman coding is optimalper-character coding method

  6. Building a Huffman tree • Begin with a forest of single-node trees (leaves) • Each node/tree/leaf is weighted with character count • Node stores two values: character and count • There are n nodes in forest, n is size of alphabet? • Repeat until there is only one node left: root of tree • Remove two minimally weighted trees from forest • Create new tree with minimal trees as children, • New tree root's weight: sum of children (character ignored) • Does this process terminate? How do we get minimal trees? • Remove minimal trees, hmmm……

  7. R N C E F P U L S G T O D B A M I 11 1 2 1 2 4 2 2 5 4 1 3 5 3 3 6 3 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  8. F A S E N G C B M U O R L P T D I 11 2 6 5 5 1 1 1 2 2 3 2 2 2 4 3 3 4 3 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  9. L G M E N C F T S P U B R O D A I 11 2 1 1 1 3 5 6 2 5 2 3 2 2 3 2 3 3 2 4 4 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  10. M E S F T C O P N A R L B D G U I 11 2 2 1 1 3 5 5 6 2 1 4 3 3 4 2 4 3 3 3 2 2 4 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  11. D O N F S C T P E M R L A B G U I 11 2 2 2 1 1 6 5 5 3 2 1 4 4 3 4 2 3 4 3 4 3 3 4 2 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  12. D N E F P C S U T M L A G B O R I 11 2 2 2 2 1 6 1 5 5 3 5 1 2 3 5 4 3 4 4 3 4 2 4 4 3 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  13. P N F B C T U S E L M D G A O R I 11 2 2 2 2 1 6 1 5 5 3 6 1 3 4 6 2 4 2 4 4 4 4 5 5 3 3 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  14. G F E B C P U R N S D M O A T L I 11 5 1 1 3 1 2 6 2 2 3 2 5 2 6 5 6 4 5 3 4 3 2 4 4 4 4 6 6 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  15. U B M T O S G D L R P C A F N E I 11 2 3 8 2 2 2 2 1 3 1 1 5 5 6 3 3 3 4 8 6 4 6 6 6 5 4 5 4 4 4 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  16. U B M T O S G D L R P C A F N E I 11 2 3 8 2 2 2 2 1 3 1 1 5 5 6 3 3 3 5 6 8 4 8 6 8 6 4 6 4 4 5 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  17. O R T S M A B N E U F G C D P L I 11 10 10 4 2 2 1 2 1 3 2 3 5 3 5 3 6 1 6 2 3 4 4 5 5 6 6 4 8 8 6 8 8 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  18. O L E S M N A R B T C P G U D F I 11 11 11 10 10 4 4 2 2 2 3 1 3 1 3 1 3 5 5 8 2 2 3 4 4 5 6 6 8 6 8 8 2 6 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  19. O D M N S E F A L B P T U R G C I 12 11 11 10 12 10 11 1 4 1 3 1 3 2 2 3 2 2 4 3 5 5 2 4 4 5 6 8 8 6 8 8 6 2 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  20. O G F N E S C M D A U B R T L P I 12 11 16 12 10 11 10 16 11 4 1 4 2 3 3 2 3 2 3 1 2 2 3 2 4 5 6 8 6 8 1 5 5 6 4 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  21. O G F N E S C M D A U B R T L P I 16 11 21 12 21 11 10 16 12 4 1 4 2 3 3 2 3 2 3 1 2 2 3 2 4 5 6 8 6 8 1 5 5 6 4 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  22. S M U A B T O E N F D C L P R G I 21 23 11 11 16 12 16 23 21 10 1 1 4 2 2 2 3 3 3 2 1 2 5 2 6 4 3 6 4 6 5 4 8 8 3 5 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  23. S M U A B T O E N F D C L P R G I 21 23 11 11 37 12 16 37 23 10 1 1 4 2 2 2 3 3 3 2 1 2 5 2 6 4 3 6 4 6 5 4 8 8 3 5 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  24. S M U A B T O E N F D C L P R G I 21 23 11 11 37 12 16 60 60 10 1 1 4 2 2 2 3 3 3 2 1 2 5 2 6 4 3 6 4 6 5 4 8 8 3 5 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

  25. Encoding • Count occurrence of all occurring characters O( ) • Build priority queueO( ) • Build Huffman treeO( ) • Create Table of codes from treeO( ) • Write Huffman tree and coded data to fileO( ) N A A log A A log A N

  26. Properties of Huffman coding • Want to minimize weighted path length L(T)of tree T • wi is the weight or count of each codeword i • di is the leaf corresponding to codeword i • How do we calculate character (codeword) frequencies? • Huffman coding creates pretty full bushy trees? • When would it produce a “bad” tree? • How do we produce coded compressed data from input efficiently?

  27. Writing code out to file • How do we go from characters to encodings? • Build Huffman tree • Root-to-leaf path generates encoding • Need way of writing bits out to file • Platform dependent? • Complicated to write bits and read in same ordering • See BitInputStream and BitOutputStream classes • Depend on each other, bit ordering preserved • How do we know bits come from compressed file? • Store a magic number

  28. O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 01100000100001001101

  29. O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 1100000100001001101

  30. O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 100000100001001101

  31. O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 00000100001001101

  32. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 0000100001001101 G

  33. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 000100001001101 G

  34. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 00100001001101 G

  35. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 0100001001101 G

  36. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 100001001101 G

  37. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 00001001101 GO

  38. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 0001001101 GO

  39. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 001001101 GO

  40. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01001101 GO

  41. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 1001101 GO

  42. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 001101 GOO

  43. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01101 GOO

  44. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 1101 GOO

  45. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 101 GOO

  46. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01 GOO

  47. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 1 GOOD

  48. T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01100000100001001101 GOOD

  49. Decoding • Read in tree data O( ) • Decode bit string with tree O( )

  50. Huffman Tree 2 • “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” • E.g. “ A SIMPLE”  “10101101001000101001110011100000”

More Related