
Text compression


Presentation Transcript


  1. Text compression

  2. Text compression. Goal: to find a scheme that encodes (that is, that represents, or replaces, in a fashion that permits us to recover the original symbols) a string of symbols, together with methods for doing this in which a symbol of probability p is represented by approximately −log2 p bits.
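A quick sanity check of that figure of merit, in Python (a minimal sketch; the probabilities are just illustrative):

    import math

    # A symbol of probability p ideally costs about -log2(p) bits.
    for p in (0.5, 0.25, 0.125, 0.014):
        print(f"p = {p}: about {-math.log2(p):.2f} bits")

So a symbol occurring half the time costs about one bit, and rarer symbols cost proportionally more.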

  3. Huffman encoding. An algorithm that produces an encoding scheme approximating the information-theoretic optimum:

  4. Step 1: Lay out the symbols in decreasing order of frequency. Step 2: Take the two items with the lowest frequency and make them a “constituent”; its frequency, of course, is the sum of the frequencies of the items that comprise it. Repeat Step 2 as necessary, treating “constituents” the same as original items. Step 3: Use the tree thereby constructed to assign 0’s and 1’s to the leaves of the tree, based on the direction of branching.
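A minimal sketch of Steps 1–3 in Python (not from the slides; it uses a heap rather than a sorted layout, and the names are mine):

    import heapq
    import itertools

    def huffman_codes(freqs):
        """Build a Huffman code from a {symbol: frequency} dict."""
        tie = itertools.count()  # tie-breaker so the heap never compares subtrees
        heap = [(f, next(tie), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # the two lowest-frequency items...
            f2, _, right = heapq.heappop(heap)
            # ...become one "constituent" whose frequency is the sum of its parts
            heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: 0 left, 1 right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"      # lone-symbol edge case
        walk(heap[0][2], "")
        return codes

    print(huffman_codes({"a": .670, "b": .096, "c": .096, "d": .096,
                         "e": .014, "f": .014, "g": .014, "h": .002}))

Ties may be broken differently than in the slides’ worked example, so the exact codewords can differ while the average code length stays optimal.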

  5. Symbol, probability, and codeword:
      a  .670  0
      b  .096  100
      c  .096  101
      d  .096  110
      e  .014  11100
      f  .014  11101
      g  .014  11110
      h  .002  11111

  6.–13. [Figures: step-by-step construction of the Huffman tree over the leaves a .670, b .096, c .096, d .096, e .014, f .014, g .014, h .002. At each step the two lowest-frequency nodes are merged into a new “constituent”: g + h give .016, e + f give .028, those two give .044, and further merges (weights .130, .192, and .322 as printed on the slides) continue until a single tree remains.]

  14. Encode with the tree:
  • Each left branch: assign a 0
  • Each right branch: assign a 1
      a  .670  0
      b  .096  100
      c  .096  101
      d  .096  110
      e  .014  11100
      f  .014  11101
      g  .014  11110
      h  .002  11111

  15. [Figure: the completed tree with each branch labeled ‘0’ (left) or ‘1’ (right); reading the labels from the root down to a leaf yields exactly the codewords in the table above.]
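How close does that code come to the −log2 p ideal? A quick check against the slides’ table (Python):

    import math

    table = {"a": (.670, "0"),     "b": (.096, "100"),
             "c": (.096, "101"),   "d": (.096, "110"),
             "e": (.014, "11100"), "f": (.014, "11101"),
             "g": (.014, "11110"), "h": (.002, "11111")}

    entropy = sum(p * -math.log2(p) for p, _ in table.values())
    avg_len = sum(p * len(code) for p, code in table.values())
    print(f"entropy ~{entropy:.3f} bits/symbol; Huffman average ~{avg_len:.3f}")

The Huffman average comes out close to, but a little above, the entropy; that remaining gap is what arithmetic coding, next, closes.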

  16. Arithmetic encoding

  17. Arithmetic encoding • Conceptually simpler than Huffman encoding, and gets you to the theoretical limit of compression much faster and more easily. • Remember, these encoding schemes are shared between the sender and the receiver ahead of time. • Reference: Text Compression, by Bell, Cleary, and Witten (1990).

  18. The probabilities of all possible messages add up to one. If we imagine laying a set of intervals side by side, one for each possible message, they would cover a span exactly 1.0000 in length. Suppose we have such a chart already laid out. Then the easiest way to identify a particular message is to send a (binary, if you wish) number that lies strictly within the interval you care about. Choose the number whose expansion takes the fewest bits…

  19. That number of bits will be no more than −log2(size of interval) + 1. • Imagine a choice of two messages, a and b. a’s probability is .55, b’s is .45. To send the message a, just send 0; to send b, send .11 (that’s a base-2 number between 0 and 1). • Suppose the choices are…

  20. 0.25, 0.125, 0.125, …, 0.125 (six of them). • Then the first can be sent with 0; • the second with .01 (= 0.25); • the third with .011 (= 0.375); • the fourth with .1 (= 0.5); • the fifth with .101; etc.
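One way to check those numbers: lay the seven intervals side by side and confirm that each suggested binary fraction lands inside its message’s interval (a sketch; the last two codewords are my fill-in for the slide’s “etc.”):

    probs = [0.25] + [0.125] * 6                 # the seven messages above
    codes = ["0", "01", "011", "1", "101", "110", "111"]

    def bin_frac(bits):
        """Value of a binary fraction: '011' -> 0.375."""
        return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

    low = 0.0
    for p, code in zip(probs, codes):
        x = bin_frac(code)
        assert low <= x < low + p                # the code names this interval
        print(f"[{low:.3f}, {low + p:.3f})  .{code} = {x:.4f}")
        low += p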

  21. Then you could transmit any number inside a given letter’s range, and the receiver would know which letter you were compressing. • The less common letters have a smaller range, and therefore you’ll need a relatively longer binary fraction to specify a number in that range.

  22. Assume we wish to compress a message in a language with three words: a, b, c. (In a complete treatment, we’d include a symbol meaning end-of-message, so the receiver knows where a finite message stops.) prob(a) = 0.5 = 0.1 binary; prob(b) = 0.25 = 0.01 binary; prob(c) = 0.25 = 0.01 binary. The unit interval is divided accordingly: a = [0.0, 0.1), b = [0.1, 0.11), c = [0.11, 1.0), endpoints in binary.

  23. Basic idea: if we have a distribution assigned to the k symbols we will encode, we can think of it geometrically as a stretchable template T that can be placed over [0,1), and equally well over any subinterval. So think of T being placed over [0,1) (this will be used to encode the first symbol of the message), and a shrunken copy of T being placed over each of the k subintervals of T (one of them will be used to encode the second symbol)… and shrunken copies placed over all of the k² sub-sub-intervals (used to encode the 3rd symbol)… and so on.
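That template idea translates almost directly into code. A minimal sketch (Python; the function name and argument layout are mine), narrowing [0, 1) one symbol at a time:

    def encode_interval(message, probs):
        """Return the [low, high) interval identifying `message`.
        `probs` maps symbols to probabilities; its insertion order
        fixes the left-to-right layout of the template T."""
        low, high = 0.0, 1.0
        for sym in message:
            width = high - low
            cum = 0.0
            for s, p in probs.items():           # place the shrunken copy of T
                if s == sym:
                    low, high = low + cum * width, low + (cum + p) * width
                    break
                cum += p
        return low, high

    print(encode_interval("aa", {"a": 0.5, "b": 0.25, "c": 0.25}))  # (0.0, 0.25)

A production coder would do the same narrowing with renormalized integer arithmetic to avoid floating-point underflow on long messages, but the geometry is exactly this shrinking-template picture.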

  24. The template over the unit interval: a = [0.0, 0.1), b = [0.1, 0.11), c = [0.11, 1.0). Any message that starts with “a…” will be represented as a point inside a’s interval. Any number in there starts with 0, so 0 encodes ‘a’ perfectly well.

  25. Message is “aa…”. The template over a = [0, 0.1) subdivides it: aa = [0, 0.01), ab = [0.01, 0.011), ac = [0.011, 0.1). The encoding must start… how?

  26. Message starts, “aac…”. Subdividing aa = [0, 0.01): aaa = [0, 0.001), aab = [0.001, 0.0011), aac = [0.0011, 0.01).

  27. Message starts, “aacc…”. Subdividing aac = [0.0011, 0.01): aaca = [0.0011, 0.00111), aacb = [0.00111, 0.001111), aacc = [0.001111, 0.01). Subdividing aacc: aacca = [0.001111, 0.0011111), aaccb = [0.0011111, 0.00111111), aaccc = [0.00111111, 0.01).

  28. So the number must be at least 0.0011111 but less than 0.00111111. So the shortest one is 0.0011111.
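A quick check of that interval with the shrinking-template arithmetic, assuming (as the printed endpoints suggest) that the message being encoded is “aaccb”:

    cum = {"a": (0.0, 0.5), "b": (0.5, 0.75), "c": (0.75, 1.0)}

    low, high = 0.0, 1.0
    for sym in "aaccb":                          # the message built up above
        width = high - low
        lo, hi = cum[sym]
        low, high = low + lo * width, low + hi * width

    def to_bin(x, places=8):
        """Binary expansion of a fraction in [0, 1)."""
        bits = ""
        for _ in range(places):
            x *= 2
            bits += str(int(x))
            x -= int(x)
        return "0." + bits

    print(to_bin(low), to_bin(high))             # 0.00111110 0.00111111

The printed endpoints match the interval on the slide, and the left endpoint 0.0011111 is the shortest number inside it.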

  29. Compression: replacing real text by pointers to the words • This is often the best way to think of compression: each word is replaced by a shorter representation, which is a pointer to that word on a list somewhere. • How “big” is a word? Each letter is about 5 bits (2^5 = 32), and a typical word is 5-6 letters long, i.e., about 25-30 bits. • How many words can you have on a list where no pointer is longer than 30 bits?
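The back-of-the-envelope arithmetic behind that question (a sketch):

    bits_per_letter = 5                  # 2**5 = 32 covers the alphabet
    word_bits = bits_per_letter * 6      # a 5-6 letter word: about 25-30 bits
    print(2 ** word_bits)                # 1073741824 distinct 30-bit pointers

So a 30-bit pointer, no longer than a typical spelled-out word, can index a list of about a billion entries, far more words than any text will contain.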
