
Text compression


Presentation Transcript


  1. Text compression

  2. Text compression. Goal: to find a scheme that encodes (that is, that represents, or replaces, in a fashion that permits us to recover the original symbols) a string of symbols, together with methods for doing this in which a symbol of probability p is represented by approximately −log2 p bits.
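A quick sanity check of that figure of merit, in Python (a minimal sketch; the probabilities are just illustrative):

    import math

    # A symbol of probability p ideally costs about -log2(p) bits.
    for p in (0.5, 0.25, 0.125, 0.014):
        print(f"p = {p}: about {-math.log2(p):.2f} bits")

So a symbol occurring half the time costs about one bit, and rarer symbols cost proportionally more.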

  3. Huffman encoding. An algorithm that produces an encoding scheme approximating the information-theoretic optimum:

  4. Step 1: Lay out the symbols in decreasing order of frequency. Step 2: Take the two items with the lowest frequency and make them a “constituent”; its frequency, of course, is the sum of the frequencies of the items that comprise it. Repeat Step 2 as necessary, treating “constituents” the same as original items. Step 3: Use the tree thereby constructed to assign 0’s and 1’s to the leaves of the tree, based on the direction of branching.
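A minimal sketch of Steps 1–3 in Python (not from the slides; it uses a heap rather than a sorted layout, and the names are mine):

    import heapq
    import itertools

    def huffman_codes(freqs):
        """Build a Huffman code from a {symbol: frequency} dict."""
        tie = itertools.count()  # tie-breaker so the heap never compares subtrees
        heap = [(f, next(tie), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # the two lowest-frequency items...
            f2, _, right = heapq.heappop(heap)
            # ...become one "constituent" whose frequency is the sum of its parts
            heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: 0 left, 1 right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"      # lone-symbol edge case
        walk(heap[0][2], "")
        return codes

    print(huffman_codes({"a": .670, "b": .096, "c": .096, "d": .096,
                         "e": .014, "f": .014, "g": .014, "h": .002}))

Ties may be broken differently than in the slides’ worked example, so the exact codewords can differ while the average code length stays optimal.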

  5. Symbol, probability, and codeword:
      a  .670  0
      b  .096  100
      c  .096  101
      d  .096  110
      e  .014  11100
      f  .014  11101
      g  .014  11110
      h  .002  11111

  6.–13. [Figures: step-by-step construction of the Huffman tree over the leaves a .670, b .096, c .096, d .096, e .014, f .014, g .014, h .002. At each step the two lowest-frequency nodes are merged into a new “constituent”: g + h give .016, e + f give .028, those two give .044, and further merges (weights .130, .192, and .322 as printed on the slides) continue until a single tree remains.]

  14. Encode with the tree:
  • Each left branch: assign a 0
  • Each right branch: assign a 1
      a  .670  0
      b  .096  100
      c  .096  101
      d  .096  110
      e  .014  11100
      f  .014  11101
      g  .014  11110
      h  .002  11111

  15. [Figure: the completed tree with each branch labeled ‘0’ (left) or ‘1’ (right); reading the labels from the root down to a leaf yields exactly the codewords in the table above.]
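How close does that code come to the −log2 p ideal? A quick check against the slides’ table (Python):

    import math

    table = {"a": (.670, "0"),     "b": (.096, "100"),
             "c": (.096, "101"),   "d": (.096, "110"),
             "e": (.014, "11100"), "f": (.014, "11101"),
             "g": (.014, "11110"), "h": (.002, "11111")}

    entropy = sum(p * -math.log2(p) for p, _ in table.values())
    avg_len = sum(p * len(code) for p, code in table.values())
    print(f"entropy ~{entropy:.3f} bits/symbol; Huffman average ~{avg_len:.3f}")

The Huffman average comes out close to, but a little above, the entropy; that remaining gap is what arithmetic coding, next, closes.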

  16. Arithmetic encoding

  17. Arithmetic encoding • Conceptually simpler than Huffman encoding, and gets you to the theoretical limit of compression much faster and more easily. • Remember, these encoding schemes are shared between the sender and the receiver ahead of time. • Reference: Text Compression, by Bell, Cleary, and Witten (1990).

  18. The probabilities of all possible messages add up to one. If we imagine laying a set of intervals side by side, one for each possible message, they would cover a span exactly 1.0000 in length. Suppose we have such a chart already laid out. Then the easiest way to identify a particular message is to send a (binary, if you wish) number that lies strictly within the interval you care about. Choose the number whose expansion takes the fewest bits…

  19. That number of bits will be no more than −log2(size of interval) + 1. • Imagine a choice of two messages, a and b. a’s probability is .55, b’s is .45. To send the message a, just send 0; to send b, send .11 (that’s a base-2 number between 0 and 1). • Suppose the choices are…

  20. 0.25, 0.125, 0.125, …, 0.125 (six of them). • Then the first can be sent with 0; • the second with .01 (= 0.25); • the third with .011 (= 0.375); • the fourth with .1 (= 0.5); • the fifth with .101; etc.
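One way to check those numbers: lay the seven intervals side by side and confirm that each suggested binary fraction lands inside its message’s interval (a sketch; the last two codewords are my fill-in for the slide’s “etc.”):

    probs = [0.25] + [0.125] * 6                 # the seven messages above
    codes = ["0", "01", "011", "1", "101", "110", "111"]

    def bin_frac(bits):
        """Value of a binary fraction: '011' -> 0.375."""
        return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

    low = 0.0
    for p, code in zip(probs, codes):
        x = bin_frac(code)
        assert low <= x < low + p                # the code names this interval
        print(f"[{low:.3f}, {low + p:.3f})  .{code} = {x:.4f}")
        low += p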

  21. Then you could transmit any number inside a given letter’s range, and the receiver would know which letter you were compressing. • The less common letters have a smaller range, and therefore you’ll need a relatively longer binary fraction to specify a number in that range.

  22. Assume we wish to compress a message in a language with three words: a, b, c. (In a complete treatment, we’d include a symbol meaning end-of-message, so the receiver knows where a finite message stops.) prob(a) = 0.5 = 0.1 binary; prob(b) = 0.25 = 0.01 binary; prob(c) = 0.25 = 0.01 binary. The unit interval is divided accordingly: a = [0.0, 0.1), b = [0.1, 0.11), c = [0.11, 1.0), endpoints in binary.

  23. Basic idea: if we have a distribution assigned to the k symbols we will encode, we can think of it geometrically as a stretchable template T that can be placed over [0,1), and equally well over any subinterval. So think of T being placed over [0,1) (this will be used to encode the first symbol of the message), and a shrunken copy of T being placed over each of the k subintervals of T (one of them will be used to encode the second symbol)… and shrunken copies placed over all of the k² sub-sub-intervals (used to encode the 3rd symbol)… and so on.
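That template idea translates almost directly into code. A minimal sketch (Python; the function name and argument layout are mine), narrowing [0, 1) one symbol at a time:

    def encode_interval(message, probs):
        """Return the [low, high) interval identifying `message`.
        `probs` maps symbols to probabilities; its insertion order
        fixes the left-to-right layout of the template T."""
        low, high = 0.0, 1.0
        for sym in message:
            width = high - low
            cum = 0.0
            for s, p in probs.items():           # place the shrunken copy of T
                if s == sym:
                    low, high = low + cum * width, low + (cum + p) * width
                    break
                cum += p
        return low, high

    print(encode_interval("aa", {"a": 0.5, "b": 0.25, "c": 0.25}))  # (0.0, 0.25)

A production coder would do the same narrowing with renormalized integer arithmetic to avoid floating-point underflow on long messages, but the geometry is exactly this shrinking-template picture.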

  24. The template over the unit interval: a = [0.0, 0.1), b = [0.1, 0.11), c = [0.11, 1.0). Any message that starts with “a…” will be represented as a point inside a’s interval. Any number in there starts with 0, so 0 encodes ‘a’ perfectly well.

  25. Message is “aa…”. The template over a = [0, 0.1) subdivides it: aa = [0, 0.01), ab = [0.01, 0.011), ac = [0.011, 0.1). The encoding must start… how?

  26. Message starts, “aac…”. Subdividing aa = [0, 0.01): aaa = [0, 0.001), aab = [0.001, 0.0011), aac = [0.0011, 0.01).

  27. Message starts, “aacc…”. Subdividing aac = [0.0011, 0.01): aaca = [0.0011, 0.00111), aacb = [0.00111, 0.001111), aacc = [0.001111, 0.01). Subdividing aacc: aacca = [0.001111, 0.0011111), aaccb = [0.0011111, 0.00111111), aaccc = [0.00111111, 0.01).

  28. So the number must be at least 0.0011111 but less than 0.00111111. So the shortest one is 0.0011111.
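A quick check of that interval with the shrinking-template arithmetic, assuming (as the printed endpoints suggest) that the message being encoded is “aaccb”:

    cum = {"a": (0.0, 0.5), "b": (0.5, 0.75), "c": (0.75, 1.0)}

    low, high = 0.0, 1.0
    for sym in "aaccb":                          # the message built up above
        width = high - low
        lo, hi = cum[sym]
        low, high = low + lo * width, low + hi * width

    def to_bin(x, places=8):
        """Binary expansion of a fraction in [0, 1)."""
        bits = ""
        for _ in range(places):
            x *= 2
            bits += str(int(x))
            x -= int(x)
        return "0." + bits

    print(to_bin(low), to_bin(high))             # 0.00111110 0.00111111

The printed endpoints match the interval on the slide, and the left endpoint 0.0011111 is the shortest number inside it.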

  29. Compression: replacing real text by pointers to the words • This is often the best way to think of compression: each word is replaced by a shorter representation, which is a pointer to that word on a list somewhere. • How “big” is a word? Each letter is about 5 bits (2^5 = 32), and a typical word is 5-6 letters long, i.e., about 25-30 bits. • How many words can you have on a list where no pointer is longer than 30 bits?
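The back-of-the-envelope arithmetic behind that question (a sketch):

    bits_per_letter = 5                  # 2**5 = 32 covers the alphabet
    word_bits = bits_per_letter * 6      # a 5-6 letter word: about 25-30 bits
    print(2 ** word_bits)                # 1073741824 distinct 30-bit pointers

So a 30-bit pointer, no longer than a typical spelled-out word, can index a list of about a billion entries, far more words than any text will contain.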
