compress!


  1. From the theoretical viewpoint... block Huffman codes achieve the best efficiency.

  for one symbol:
  symbol  A    B
  prob.   0.8  0.2
  cdwd    0    1

  for three symbols:
  block  AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
  prob.  0.512  0.128  0.128  0.032  0.128  0.032  0.032  0.008
  cdwd   0      100    101    11100  110    11101  11110  11111
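As a small check of the efficiency claim (assuming a memoryless source with p(A) = 0.8, which is what the block probabilities in the table imply), the average codeword length per source symbol of the two codes can be computed directly:

```python
# Average codeword length per source symbol for the two codes above.
one_sym = {'A': (0.8, '0'), 'B': (0.2, '1')}
blocks = {'AAA': '0', 'AAB': '100', 'ABA': '101', 'ABB': '11100',
          'BAA': '110', 'BAB': '11101', 'BBA': '11110', 'BBB': '11111'}
p = {'A': 0.8, 'B': 0.2}

# one symbol at a time: always 1 bit per symbol, regardless of the bias
avg1 = sum(prob * len(cw) for prob, cw in one_sym.values())

# three symbols at a time: average bits per block, divided by 3
avg3 = sum(p[b[0]] * p[b[1]] * p[b[2]] * len(cw)
           for b, cw in blocks.items()) / 3

print(avg1)  # 1.0 bit/symbol
print(avg3)  # 0.728 bit/symbol, much closer to the entropy H ≈ 0.722
```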

  2. problem of block Huffman codes
  From the practical viewpoint... block Huffman codes have some problems:
  • a large table is needed for the encoding/decoding
    → run-length Huffman code
    → arithmetic code
  • probabilities must be known in advance
    → Lempel-Ziv codes
  three coding techniques

  3. 1/3 run-length Huffman code
  a coding scheme which is good for "biased" sequences
  • we focus on a binary information source
  • alphabet = {A, B}, with p(A) much larger than p(B)
  • used for data compression in the facsimile system

  4. run and run-length
  run = a sequence of consecutive identical symbols
  A B B A A A A A B A A A B
  runs of A (each terminated by a B): length 1, length 0, length 5, length 3
  The message is recovered if the lengths of the runs are given.
  → encode the lengths of the runs, not the pattern itself
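The run extraction above can be sketched as a short function (a minimal sketch for messages that, as in the example, end with a B; the name `run_lengths` is ours):

```python
def run_lengths(message):
    """Lengths of the runs of 'A', each run terminated by a 'B'."""
    runs = []
    count = 0
    for symbol in message:
        if symbol == 'A':
            count += 1
        else:  # a 'B' ends the current run of A's (possibly of length 0)
            runs.append(count)
            count = 0
    return runs

print(run_lengths('ABBAAAAABAAAB'))  # [1, 0, 5, 3]
```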

  5. upper-bound the run-length
  small problem? ... there can be a very, very, very long run
  → put an upper-bound limit: run-length limited (RLL) coding
  upper-bound = 3
  • ABBAAAAABAAAB =
    • one "A" followed by B
    • zero "A"s followed by B
    • three or more "A"s
    • two "A"s followed by B
    • three or more "A"s
    • zero "A"s followed by B

  run length      0  1  2  3    4    5    6      7      ...
  representation  0  1  2  3+0  3+1  3+2  3+3+0  3+3+1  ...
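The table above can be expressed as a tiny routine (a sketch; the name `rll` is ours) that caps each run length at the limit, where an emitted value equal to the limit means "that many or more, the run continues":

```python
def rll(runs, limit=3):
    """Represent each run length with values capped at `limit`;
    an emitted `limit` means 'limit or more A's, run continues'."""
    out = []
    for r in runs:
        while r >= limit:
            out.append(limit)  # "3 or more"
            r -= limit
        out.append(r)
    return out

print(rll([1, 0, 5, 3]))  # [1, 0, 3, 2, 3, 0]  i.e. 1, 0, 3+, 2, 3+, 0
```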

  6. run-length Huffman code
  ... is a Huffman code defined to encode the lengths of runs
  • effective when there is a bias of the symbol probabilities
  p(A) = 0.9, p(B) = 0.1

  run length     0    1     2      3 or more
  block pattern  B    AB    AAB    AAA
  prob.          0.1  0.09  0.081  0.729
  codeword       10   110   111    0

  • ABBAAAAABAAAB: 1, 0, 3+, 2, 3+, 0 ⇒ 110 10 0 111 0 10
  • AAAABAAAAABAAB: 3+, 1, 3+, 2, 2 ⇒ 0 110 0 111 111
  • AAABAAAAAAAAB: 3+, 0, 3+, 3+, 2 ⇒ 0 10 0 0 111
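Putting the RLL step and the codeword table together gives a complete encoder sketch (the function name `encode` is ours; the message is assumed to end with a B, as in the slide's examples):

```python
CODEWORD = {0: '10', 1: '110', 2: '111', 3: '0'}  # 3 means "3 or more"

def encode(message, limit=3):
    """Run-length Huffman encoding of an A/B message ending with 'B'."""
    bits = []
    count = 0
    for symbol in message:
        if symbol == 'A':
            count += 1
            if count == limit:          # "3 or more": emit, keep counting
                bits.append(CODEWORD[limit])
                count = 0
        else:                           # a 'B' closes the current run
            bits.append(CODEWORD[count])
            count = 0
    return ' '.join(bits)

print(encode('ABBAAAAABAAAB'))   # 110 10 0 111 0 10
print(encode('AAAABAAAAABAAB'))  # 0 110 0 111 111
```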

  7. comparison
  • p(A) = 0.9, p(B) = 0.1
  • the entropy of X: H(X) = –0.9 log2 0.9 – 0.1 log2 0.1 = 0.469 bit
  • code 1: a naive Huffman code
    symbol    A    B
    prob.     0.9  0.1
    codeword  0    1
    average codeword length = 1 bit/symbol
  • code 2: blocked (3 symbols)
    block     AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
    prob.     0.729  0.081  0.081  0.009  0.081  0.009  0.009  0.001
    codeword  0      100    110    1010   1110   1011   11110  11111
    average codeword length = 1.661 bits / 3 symbols = 0.55 bit/symbol
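The entropy and the average length of code 2 can be recomputed from the table (a check, assuming the block probabilities are products of p(A) = 0.9 and p(B) = 0.1):

```python
from math import log2
from itertools import product

p = {'A': 0.9, 'B': 0.1}
entropy = -sum(q * log2(q) for q in p.values())
print(round(entropy, 3))   # 0.469 bit

codewords = {'AAA': '0', 'AAB': '100', 'ABA': '110', 'ABB': '1010',
             'BAA': '1110', 'BAB': '1011', 'BBA': '11110', 'BBB': '11111'}
avg = sum(p[a] * p[b] * p[c] * len(codewords[a + b + c])
          for a, b, c in product('AB', repeat=3))
print(round(avg, 3))       # 1.661 bits per 3-symbol block
print(round(avg / 3, 2))   # 0.55 bit per symbol
```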

  8. comparison (cont'd)
  • code 3: run-length Huffman (upper-bound = 8)

  length    0    1     2     3     4     5     6     7+
  prob.     0.1  0.09  0.081 0.073 0.066 0.059 0.053 0.478
  codeword  110  1000  1001  1010  1011  1110  1111  0

  consider typical runs...
  • before encoding: 5.215 source symbols (A's and B's) per run on average
  • after encoding: 2.466 code bits (0's and 1's) per run on average
  → the average codeword length per symbol = 2.466 / 5.215 = 0.47
  RLL is a small trick, but it fully utilizes the Huffman coding technique
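The per-run averages can be rechecked as follows (a sketch, assuming p(A) = 0.9, so a run of length k < 7 occurs with probability 0.9^k · 0.1 and "7 or more" with probability 0.9^7; the slide's 2.466 and 5.215 use the rounded probabilities from the table, so the exact values differ in the third decimal):

```python
# Run-length distribution and per-run costs for code 3.
probs = [0.9**k * 0.1 for k in range(7)] + [0.9**7]
bits = [3, 4, 4, 4, 4, 4, 4, 1]            # codeword lengths from the table
symbols = [k + 1 for k in range(7)] + [7]  # source symbols consumed per run

avg_bits = sum(p * b for p, b in zip(probs, bits))
avg_syms = sum(p * s for p, s in zip(probs, symbols))
print(round(avg_bits, 3), round(avg_syms, 3))  # ≈ 2.465  ≈ 5.217
print(round(avg_bits / avg_syms, 2))           # 0.47 bit per symbol
```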

  9. 2/3 arithmetic code
  a coding scheme which does not use a translation table
  • table-lookup is replaced by "on-the-fly" computation
  • the translation table is not needed
  • slightly complicated computation is needed
  • it is proved that its average codeword length is almost optimum (close to the entropy)

  10. preliminary
  • the n-th order extended source with n = 3
  • we encode one of the 2^3 = 8 patterns, listed in the dictionary order
  • P(x): prob. that x occurs (here p(A) = 0.7, p(B) = 0.3)
  • A(x): accumulation of the probs. of the patterns before x

  #     0      1      2      3      4      5      6      7
  x     AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
  P(x)  0.343  0.147  0.147  0.063  0.147  0.063  0.063  0.027
  A(x)  0      0.343  0.490  0.637  0.700  0.847  0.910  0.973

  11. illustration of probabilities
  • the 8 data patterns define a partition of the interval [0, 1):

  0      0.343  0.490  0.637  0.700  0.847  0.910  0.973  1.0
  | AAA  | AAB  | ABA  | ABB  | BAA  | BAB  | BBA  | BBB  |

  • x occupies the interval [A(x), A(x) + P(x))
    e.g. A(BAA) = A(ABB) + P(ABB) = 0.637 + 0.063 = 0.700
  basic idea:
  • represent x by a value v(x) inside the interval of x
    (P(x) and A(x) are the size & left-end of the interval)
  problem to solve:
  • need a translation between x and v(x)
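The table of interval sizes P(x) and left ends A(x) can be rebuilt in a few lines (a sketch, assuming the memoryless source with p(A) = 0.7 that reproduces the slide's numbers):

```python
from itertools import product

p = {'A': 0.7, 'B': 0.3}

P, A = {}, {}
acc = 0.0
for t in product('AB', repeat=3):       # dictionary order: AAA, AAB, ...
    x = ''.join(t)
    P[x] = p[x[0]] * p[x[1]] * p[x[2]]  # memoryless source
    A[x] = acc                          # probs. of all patterns before x
    acc += P[x]

print(round(A['ABB'], 3), round(P['ABB'], 3))  # 0.637 0.063
print(round(A['BAA'], 3))                      # 0.7
```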

  12. about the translation
  two directions of the translation:
  • [encode] the translation from x to v(x)
  • [decode] the translation from v(x) to x
  ... use recursive computation instead of a static table
  "a land of a parent is divided & inherited to two children":
  • P(wA) = P(w)p(A),  P(wB) = P(w)p(B)
  • A(wA) = A(w),      A(wB) = A(w) + P(wA)

  13. [encode] the translation from x to v(x)
  recursively determine P(w) and A(w) for prefixes w of x:
  • P(λ) = 1, A(λ) = 0 (λ is a null string)
  • for w followed by A: P(wA) = P(w)p(A), A(wA) = A(w)
  • for w followed by B: P(wB) = P(w)p(B), A(wB) = A(w) + P(w)p(A)
  the interval of ABB?
  λ → A → AB → ABB: ABB inherits [0.637, 0.637 + 0.063)
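The recursion over prefixes translates directly into code (a sketch with p(A) = 0.7 as before; the function name `interval` is ours):

```python
def interval(x, p={'A': 0.7, 'B': 0.3}):
    """Left end A(x) and size P(x) of the interval of pattern x,
    computed recursively over the prefixes of x (no table needed)."""
    A, P = 0.0, 1.0                # the null string owns [0, 1)
    for c in x:
        if c == 'A':
            P *= p['A']            # wA keeps the left end of w
        else:
            A += P * p['A']        # wB starts right after wA's interval
            P *= p['B']
    return A, P

A, P = interval('ABB')
print(round(A, 3), round(P, 3))  # 0.637 0.063
```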

  14. [encode] the translation from x to v(x) (cont'd)
  We know the interval [A(x), A(x) + P(x)); which v(x) in it should we choose?
  • v(x) should have the shortest binary representation
  • choose a value just above A(x), trimmed after about –log2 P(x) places:

      0.aa...aaa...a   (= A(x))
    + 0.00...01b...b   (< P(x))
      --------------
      0.aa...acc...c   → trim to 0.aa...ac0...0

  the length of v(x) ≈ –log2 P(x)
  = most significant non-zero place of P(x)
  almost ideal!
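The "shortest value inside the interval" can be found by rounding A(x) up at k binary places, increasing k until the result still fits inside the interval (a sketch; the name `shortest_v` is ours):

```python
from math import ceil, log2

def shortest_v(A, P):
    """Value with the fewest binary fraction places inside [A, A + P)."""
    k = 1
    while True:
        v = ceil(A * 2**k) / 2**k   # smallest k-place value >= A
        if v < A + P:
            return v, k
        k += 1

v, k = shortest_v(0.637, 0.063)     # the interval of ABB
print(v, k)                         # 0.6875 4  (0.1011 in binary)
print(ceil(-log2(0.063)))           # 4, agreeing with length ≈ -log2 P(x)
```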

  15. choice of v(x) (sketch in decimal notation)
  Find v(x) in [0.123456, 0.123456 + 0.003087) that is the shortest in decimal.

      0.123456
    + 0.003087
    ----------
      0.126543

  round off some digits of 0.126543, but not too many...
  0.126543 → 0.12654 → 0.1265 → 0.126 (still inside the interval)
  → 0.12 would fall below 0.123456, outside the interval

  # of fraction places that v(x) must have
  = the most significant nonzero place of P(x) = 0.003087
  = 3

  16. [decode] the translation from v(x) to x
  given v(x), determine the leaf node x whose interval contains v(x)
  • almost the same as the first half of the encoding translation
  • compute the threshold value A(wB) = A(w) + P(wA), compare,
    and move to the left or the right child
  example: v = 0.600
  λ → A (0.600 < 0.7) → AB (0.600 ≥ 0.49) → ABA (0.600 < 0.637)
  0.600 is contained in the interval of ABA ... decoding completed
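The walk down the interval tree can be sketched as follows (again assuming p(A) = 0.7; the name `decode` is ours):

```python
def decode(v, n, p={'A': 0.7, 'B': 0.3}):
    """At each level, compare v with the threshold between the two
    children and descend left (A) or right (B); n symbols in total."""
    A, P = 0.0, 1.0
    x = ''
    for _ in range(n):
        t = A + P * p['A']        # threshold = left end of the B-child
        if v < t:
            x += 'A'
            P *= p['A']
        else:
            x += 'B'
            A = t
            P *= p['B']
    return x

print(decode(0.600, 3))   # ABA
print(decode(0.6875, 3))  # ABB
```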

  17. performance, summary
  an n-symbol pattern x with probability P(x)
  → encoded to a codeword with length ≈ –log2 P(x)
  • the average codeword length per symbol is almost the entropy
  • almost optimum coding without using a translation table
  however...
  • we need much computation with good precision (→ use approximation?)

  18. 3/3 Lempel-Ziv codes
  a coding scheme which does not need the probability distribution
  • the encoder learns the statistical behavior of the source
  • the translation table is constructed in an adaptive manner
  • works fine even for information sources with memory

  19. probability in advance?
  so far, we assumed that the probabilities of symbols are known...
  in the real world...
  • the symbol probabilities are often not known in advance
  • scan the data twice?
    • first scan ... count the number of symbol occurrences
    • second scan ... Huffman coding
    → delay of the encoding operation...
    → overhead to transmit the translation table...

  20. Lempel-Ziv algorithms
  for information sources whose symbol probability is not known...
  • LZ77: lha, gzip, zip, zoo, etc.
  • LZ78: compress, arc, stuffit, etc.
  • LZW: GIF, TIFF, etc.
  work fine for any information source → universal coding

  21. LZ77
  • proposed by A. Lempel and J. Ziv in 1977
  • represent a data substring by using a substring which has occurred previously
  algorithm overview:
  • process the data from the beginning
  • partition the data into blocks in a dynamic manner
  • represent a block by a three-tuple (i, j, s):
    "rewind i symbols, copy j symbols, and append s"

  22. encoding example of LZ77
  • consider encoding ABCBCDBDCBCD

  symbols  history                              codeword
  A        first time                           (0, 0, A)
  B        first time                           (0, 0, B)
  C        first time                           (0, 0, C)
  B C D    B, C = (here) – 2;  D ≠ (here) – 2   (2, 2, D)
  B D      B = (here) – 3;  D ≠ (here) – 3      (3, 1, D)
  C B C D  C, B, C, D = (here) – 6              (6, 4, *)
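A greedy encoder along these lines can be sketched as follows (the name `lz77_encode` is ours; `*` marks the end of the data, overlapping copies are allowed, and the search is brute-force rather than the windowed search a real implementation would use):

```python
def lz77_encode(data, end='*'):
    """Greedy LZ77: at each position find the longest earlier match,
    emit (rewind, copy-length, next-symbol)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        best_len, best_back = 0, 0
        for back in range(1, i + 1):
            length = 0
            # overlapping matches are allowed (source may run into the copy)
            while i + length < n and data[i + length - back] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_back = length, back
        if i + best_len < n:
            out.append((best_back, best_len, data[i + best_len]))
            i += best_len + 1
        else:                      # match reaches the end: append the marker
            out.append((best_back, best_len, end))
            i += best_len
    return out

print(lz77_encode('ABCBCDBDCBCD'))
# [(0, 0, 'A'), (0, 0, 'B'), (0, 0, 'C'), (2, 2, 'D'), (3, 1, 'D'), (6, 4, '*')]
```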

  23. decoding example of LZ77
  • decode (0, 0, A), (0, 0, B), (0, 0, C), (2, 2, D), (3, 1, D), (6, 4, *)
    → A, B, C, BC+D, B+D, CBCD = ABCBCDBDCBCD
  possible problem:
  • a large block is good, because we can copy more symbols
  • a large block is bad, because a codeword contains a large integer
  ... the trade-off degrades the performance.
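Decoding simply replays the tuples; copying symbol by symbol makes overlapping copies work automatically (a sketch; the name `lz77_decode` is ours):

```python
def lz77_decode(codes, end='*'):
    """Reverse of the LZ77 encoding: rewind, copy symbol by symbol,
    then append the new symbol (unless it is the end marker)."""
    out = []
    for back, length, s in codes:
        for _ in range(length):
            out.append(out[-back])   # symbol-by-symbol: overlap-safe
        if s != end:
            out.append(s)
    return ''.join(out)

codes = [(0, 0, 'A'), (0, 0, 'B'), (0, 0, 'C'),
         (2, 2, 'D'), (3, 1, 'D'), (6, 4, '*')]
print(lz77_decode(codes))  # ABCBCDBDCBCD
```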

  24. LZ78
  • proposed by A. Lempel and J. Ziv in 1978
  • represent a block by a two-tuple (i, s):
    "copy the i-th block before, and append s"

  25. encoding example of LZ78
  • consider encoding ABCBCBCDBCDE

  block #  symbols   history                 codeword
  1        A         first time              (0, A)
  2        B         first time              (0, B)
  3        C         first time              (0, C)
  4        B C       B = 2 blocks before     (2, C)
  5        B C D     BC = 1 block before     (1, D)
  6        B C D E   BCD = 1 block before    (1, E)
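An encoder for this scheme can be sketched as follows (the name `lz78_encode` is ours; note the slides use relative references, "i blocks before", rather than the absolute dictionary indices of textbook LZ78, and this sketch assumes the data ends exactly at a block boundary, as in the example):

```python
def lz78_encode(data):
    """LZ78 with relative references: (i, s) means 'copy the block
    i blocks before, then append symbol s' (i = 0: no copy)."""
    blocks, out = [], []
    i = 0
    while i < len(data):
        best_len, best_back = 0, 0
        for back in range(1, len(blocks) + 1):
            b = blocks[-back]                  # block `back` blocks before
            if data.startswith(b, i) and len(b) > best_len:
                best_len, best_back = len(b), back
        s = data[i + best_len]                 # the freshly appended symbol
        blocks.append(data[i:i + best_len] + s)
        out.append((best_back, s))
        i += best_len + 1
    return out

print(lz78_encode('ABCBCBCDBCDE'))
# [(0, 'A'), (0, 'B'), (0, 'C'), (2, 'C'), (1, 'D'), (1, 'E')]
```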

  26. decoding example of LZ78
  • decode (0, A), (0, B), (0, C), (2, C), (1, D), (1, E)
    → A, B, C, B+C, BC+D, BCD+E = ABCBCBCDBCDE
  advantage against LZ77:
  • a large block is good, because we can copy more symbols
  • is there anything wrong with large blocks?
  → the performance is slightly better than LZ77

  27. summary of LZ algorithms
  in LZ algorithms, the translation table is constructed adaptively
  • applicable to information sources with unknown symbol probabilities
  • applicable to information sources with memory
  • LZW: good material to learn about intellectual property (知的財産)
    • UNISYS, CompuServe, GIF format, ...

  28. summary of today’s class Huffman codes are good, but not practical sometimes... • run-length Huffman code • simple but effective for certain types of sources • arithmetic code • not so practical, but has strong back-up from theory • LZ codes • practical, practical, practical
