
Basics of Data Compression


Presentation Transcript


  1. Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa

  2. Uniquely Decodable Codes A variable length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It can be parsed either as 1·011 = ad or as 101·1 = ca, so this code is ambiguous. A uniquely decodable code is one whose encoded sequences can always be uniquely decomposed into codewords.

  3. Prefix Codes A prefix code is a variable length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11. It can be viewed as a binary trie (figure: edges labelled 0/1, with a, b, c, d at the leaves).

  4. Average Length For a code C with codeword lengths L[s], the average length is defined as La(C) = ∑s p(s) · L[s]. Example: p(A) = .7 with codeword 0, p(B) = p(C) = p(D) = .1 with 3-bit codewords each, so La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits). We say that a prefix code C is optimal if La(C) ≤ La(C’) for all prefix codes C’.

  5. Entropy (Shannon, 1948) For a source S emitting symbols with probability p(s), the self-information of s is i(s) = log2 (1/p(s)) bits: lower probability → higher information. Entropy is the weighted average of i(s): H(S) = ∑s p(s) · log2 (1/p(s)). It holds 0 ≤ H ≤ log2 |S|: H → 0 for a skewed distribution, H is maximum for the uniform distribution. Replacing p(s) with the empirical frequency of s in a string T gives the 0-th order empirical entropy of T.

  6. Performance: Compression ratio Compression ratio = #bits in output / #bits in input. Compression performance: we relate entropy against compression ratio, i.e. the empirical entropy H (Shannon’s bound) versus the average codeword length achieved in practice. Example: p(A) = .7, p(B) = p(C) = p(D) = .1 gives H ≈ 1.36 bits, while Huffman uses ≈ 1.5 bits per symbol. An optimal code is one whose average length gets as close as possible to the entropy.
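
  As a quick check of the numbers above, using the entropy formula of the previous slide: H = .7 · log2(1/.7) + 3 · (.1 · log2(1/.1)) ≈ .7 · 0.515 + .3 · 3.322 ≈ 1.36 bits per symbol, which is the value Huffman’s ≈ 1.5 bits per symbol is compared against.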

  7. Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper

  8. γ-code for integer encoding The γ-code of an integer x > 0 is the unary encoding of (Length − 1), i.e. Length − 1 zeros, followed by the binary representation of x, where Length = ⌊log2 x⌋ + 1. e.g., 9 is represented as <000, 1001>. • The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal). • Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
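
  A minimal sketch in C of the γ-encoder just described (emit_bit is a hypothetical bit-output routine, not part of the slides):

  #include <stdint.h>

  void emit_bit(int b);   /* hypothetical bit sink, assumed to exist elsewhere */

  /* gamma-code of x > 0: (Length - 1) zeros, then the Length bits of x */
  void gamma_encode(uint64_t x) {
      int len = 0;                                      /* Length = floor(log2 x) + 1 */
      for (uint64_t t = x; t > 0; t >>= 1) len++;
      for (int i = 0; i < len - 1; i++) emit_bit(0);    /* unary part  */
      for (int i = len - 1; i >= 0; i--)                /* binary part */
          emit_bit((int)((x >> i) & 1));
  }

  For x = 9 this emits 000 followed by 1001, matching the <000, 1001> example above.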

  9. It is a prefix-free encoding… • Given the following sequence of γ-coded integers, reconstruct the original sequence: 0001000001100110000011101100111 • Decoded left to right, this yields 8, 6, 3, 59, 7.

  10. δ-code for integer encoding • Use γ-coding to reduce the length of the first field: the δ-code of x is the γ-code of Length followed by the binary representation of x. • Useful for medium-sized integers. e.g., 19 is represented as <00, 101, 10011> (γ-code of Length = 5, then binary of 19). • δ-coding x takes about ⌊log2 x⌋ + 2⌊log2 log2 x⌋ + 2 bits. • Optimal for Pr(x) = 1/(2x (log x)²), and i.i.d. integers.
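
  A sketch of the δ-encoder in the same spirit, reusing the hypothetical gamma_encode and emit_bit from the previous sketch (this follows the slide’s variant: γ-code of Length, then the full binary representation of x):

  /* delta-code of x > 0: gamma-code of Length, then the Length bits of x */
  void delta_encode(uint64_t x) {
      int len = 0;
      for (uint64_t t = x; t > 0; t >>= 1) len++;
      gamma_encode((uint64_t)len);                 /* e.g. 19: gamma(5) = 00 101 */
      for (int i = len - 1; i >= 0; i--)           /* then binary(19) = 10011    */
          emit_bit((int)((x >> i) & 1));
  }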

  11. Variable-byte codes [10.2 bits per TREC12] • Wish to get very fast (de)compression → byte-aligned codes • Given the binary representation of an integer: • Append 0s to the front, to get a multiple-of-7 number of bits • Form groups of 7 bits each • Append to the last group the bit 0, and to the other groups the bit 1 (tagging) e.g., v = 2^14 + 1 → binary(v) = 100000000000001 → 10000001 10000000 00000001 Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it encodes also the value 0 !!
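
  A sketch in C of the byte-aligned encoder just described; the tagging convention (bit 1 on every byte except the last) follows the slide’s example, while buf and the returned byte count are illustrative assumptions:

  #include <stdint.h>

  /* Variable-byte encode v into buf; returns the number of bytes written. */
  int vbyte_encode(uint64_t v, uint8_t *buf) {
      uint8_t tmp[10];
      int n = 0;
      do {                                   /* split into 7-bit groups,   */
          tmp[n++] = (uint8_t)(v & 0x7F);    /* least significant first    */
          v >>= 7;
      } while (v > 0);
      for (int i = n - 1; i >= 0; i--)       /* output most significant group first,  */
          buf[n - 1 - i] =                   /* tag 1 on all groups but the last one  */
              (uint8_t)(tmp[i] | (i > 0 ? 0x80 : 0x00));
      return n;
  }

  For v = 2^14 + 1 this produces the three bytes 10000001 10000000 00000001 of the example, and for v = 0 it produces the single byte 00000000.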

  12. PForDelta coding Use b (e.g. 2) bits to encode each of 128 numbers, or create exceptions. (Figure: a block such as 3 42 2 3 3 1 1 … 3 3 23 1 2 is stored as the 2-bit codes 11 10 11 11 01 01 … 11 11 01 10, with the large values 42 and 23 stored separately as exceptions; a block of 128 numbers = 256 bits = 32 bytes.) Translate the data: [base, base + 2^b − 1] → [0, 2^b − 1]. Encode exceptions via ESC symbols or pointers. Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b yields more exceptions.
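
  A small sketch of the b-selection step just mentioned: pick the smallest b so that at least 90% of the block fits in b bits, leaving the rest as exceptions (the translation by base and the exception layout are omitted; the 90% threshold is the one quoted on the slide):

  #include <stdint.h>

  /* Return the smallest b (1..32) such that at least 90% of the n values
     in block[] fit in b bits; the remaining values become exceptions. */
  int choose_b(const uint32_t *block, int n) {
      for (int b = 1; b <= 32; b++) {
          int fit = 0;
          for (int i = 0; i < n; i++)
              if (b == 32 || block[i] < (1u << b)) fit++;
          if (fit * 10 >= n * 9) return b;   /* >= 90% of the values fit */
      }
      return 32;
  }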

  13. Index construction: Compression of documents Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79

  14. Raw docs are needed

  15. Various Approaches Statistical coding • Huffman codes • Arithmetic codes Dictionary coding • LZ77, LZ78, LZSS,… • Gzip, zippy, snappy,… Text transforms • Burrows-Wheeler Transform • bzip

  16. Document Compression Huffman coding

  17. Huffman Codes Invented by Huffman as a class assignment in ‘50. Used in most compression algorithms • gzip, bzip, jpeg (as an option), fax compression,… Properties: • Generates optimal prefix codes • Cheap to encode and decode • La(Huff) = H if probabilities are powers of 2 • Otherwise, La(Huff) < H + 1, i.e. less than 1 extra bit per symbol on average!!

  18. Running Example p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5. Repeatedly merge the two nodes with smallest probability: a(.1) + b(.2) → (.3), then (.3) + c(.2) → (.5), then (.5) + d(.5) → (1). Resulting codewords: a = 000, b = 001, c = 01, d = 1. There are 2^(n−1) “equivalent” Huffman trees (the 0/1 labels of each internal node can be swapped). What about ties (and thus, tree depth)?

  19. Encoding and Decoding Encoding: emit the root-to-leaf path leading to the symbol to be encoded, e.g. abc… → 000 001 01 = 00000101. Decoding: start at the root and take the branch corresponding to each bit received; when at a leaf, output its symbol and return to the root, e.g. 101001… → dcb…
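
  A minimal sketch of the bit-by-bit decoding loop just described, assuming a binary trie whose leaves hold the symbols (the Node type and the next_bit routine are illustrative, not from the slides):

  #include <stdio.h>

  typedef struct Node {
      struct Node *child[2];   /* child[0] = branch for bit 0, child[1] for bit 1 */
      int symbol;              /* meaningful only at leaves (both children NULL)  */
  } Node;

  int next_bit(void);          /* hypothetical bit source */

  /* Decode n symbols: walk from the root on each received bit,
     output the leaf's symbol, then restart from the root. */
  void huffman_decode(const Node *root, int n) {
      for (int k = 0; k < n; k++) {
          const Node *cur = root;
          while (cur->child[0] != NULL)     /* internal nodes have both children */
              cur = cur->child[next_bit()];
          putchar(cur->symbol);
      }
  }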

  20. Huffman in practice The compressed file of n symbols consists of: • Preamble: tree encoding + symbols in the leaves • Body: compressed text of n symbols Preamble = Θ(|S| log |S|) bits Body is at least nH and at most nH + n bits The extra +n is bad for very skewed distributions, namely ones for which H → 0. Example: p(a) = 1/n, p(b) = (n−1)/n

  21. There are better choices T = aaaaaaaaab • Huffman = {a,b}-encoding + 10 bits • RLE = <a,9><b,1> = γ(9) + γ(1) + {a,b}-encoding = 0001001 1 + {a,b}-encoding So RLE saves 2 bits over Huffman, because it is not a prefix code: it does not map each symbol to a fixed codeword as Huffman does, rather the mapping may change along the text and, moreover, it effectively uses fractions of bits per symbol.

  22. Idea on Huffman? Goal: reduce the impact of the +1 bit. Solution: • Divide the text into blocks of k symbols • The +1 is spread over k symbols • So the loss is 1/k per symbol Caution: the alphabet becomes S^k, so the preamble gets larger. At the limit, the preamble is 1 block equal to the whole input text, and the compressed text is 1 bit only: no compression!

  23. Document Compression Arithmetic coding

  24. Introduction It uses “fractional” parts of bits!! Gets nH(T) + 2 bits vs. nH(T) + n of Huffman (this is the ideal performance; in practice the overhead is about 0.02 · n). Used in JPEG/MPEG (as an option) and in bzip. More time-costly than Huffman, but the integer implementation is not too bad.

  25. Symbol interval Assign each symbol an interval in the range from 0 (inclusive) to 1 (exclusive). e.g. with p(a) = .2, p(b) = .5, p(c) = .3: cum[a] = .0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7, so a → [0.0, 0.2), b → [0.2, 0.7), c → [0.7, 1.0). The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2, .7)).

  26. Sequence interval Coding the message sequence bac: start from [0, 1); b narrows it to [0.2, 0.7), of size 0.5; a narrows it to [0.2, 0.2 + (0.7 − 0.2) · 0.2) = [0.2, 0.3), of size (0.7 − 0.2) · 0.2 = 0.1; c narrows it to [0.2 + (0.3 − 0.2) · 0.7, 0.3), of size (0.3 − 0.2) · 0.3 = 0.03. The final sequence interval is [.27, .3).

  27. The algorithm To code a sequence of symbols T1…Tn with probabilities p(Ti), maintain an interval [li, li + si), starting from l0 = 0, s0 = 1, and for each symbol Ti update: li = li−1 + si−1 · cum[Ti] and si = si−1 · p(Ti). (Figure: with p(a) = .2, p(b) = .5, p(c) = .3, the interval [0.2, 0.3) is split and its sub-interval [0.27, 0.3) corresponds to c.)

  28. The algorithm Each symbol narrows the interval by a factor p(Ti). The final interval size is sn = ∏i=1..n p(Ti). Sequence interval = [ln, ln + sn). To encode, take a number inside this interval.
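
  A sketch in C of the interval-narrowing recurrence above, with the exact arithmetic replaced by doubles purely for illustration (a real coder would use the integer implementation mentioned in slide 24; p[] and cum[] are indexed by symbol as in slide 25):

  /* Narrow [l, l+s) for each symbol of the message.
     p[c] = probability of symbol c, cum[c] = sum of probabilities of the symbols before c. */
  void sequence_interval(const int *msg, int n,
                         const double *p, const double *cum,
                         double *l_out, double *s_out) {
      double l = 0.0, s = 1.0;
      for (int i = 0; i < n; i++) {
          int c = msg[i];
          l = l + s * cum[c];   /* left end moves inside the current interval */
          s = s * p[c];         /* size shrinks by a factor p(T_i)            */
      }
      *l_out = l;               /* final sequence interval is [l, l+s)        */
      *s_out = s;
  }

  Running it on the message bac with p(a) = .2, p(b) = .5, p(c) = .3 gives l = 0.27, s = 0.03, i.e. the interval [.27, .3) of the previous slide.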

  29. Decoding Example Decoding the number .49, knowing the message is of length 3 (with p(a) = .2, p(b) = .5, p(c) = .3): .49 falls in [0.2, 0.7) → b; within that interval, .49 falls in [0.3, 0.55) → b; within that, .49 falls in [0.475, 0.55) → c. The message is bbc.

  30. How do we encode that number? If x = v/2^k (a dyadic fraction), then the encoding is equal to bin(v) over k digits (possibly padded with 0s in front).

  31. How do we encode that number? Binary fractional representation: FractionalEncode(x): repeat { x = 2 * x; if x < 1 output 0; else { output 1; x = x − 1 } } Example with x = 1/3: 2 * (1/3) = 2/3 < 1, output 0; 2 * (2/3) = 4/3 ≥ 1, output 1 and set x = 4/3 − 1 = 1/3; and so on. Incremental generation of the bits.
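
  A runnable version of the FractionalEncode loop above, bounded to nbits output bits (a non-dyadic x such as 1/3 would otherwise generate bits forever):

  #include <stdio.h>

  /* Emit the first nbits bits of the binary fractional expansion of x in [0,1). */
  void fractional_encode(double x, int nbits) {
      for (int i = 0; i < nbits; i++) {
          x = 2 * x;
          if (x < 1) {
              putchar('0');
          } else {
              putchar('1');
              x = x - 1;
          }
      }
  }

  For example, fractional_encode(1.0/3.0, 6) prints 010101, matching the hand computation on the slide.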

  32. Which number do we encode? Take the midpoint ln + sn/2 of the sequence interval and truncate its binary encoding to the first d = ⌈log2 (2/sn)⌉ bits. Truncation gets a smaller number… how much smaller? By less than 2^−d ≤ sn/2, so the truncated number still lies inside [ln, ln + sn). Truncation → compression.

  33. Bound on code length Theorem: for a text T of length n, the Arithmetic encoder generates at most ⌈log2 (2/sn)⌉ < 1 + log2 (2/sn) = 1 + (1 − log2 sn) = 2 − log2 (∏i=1..n p(Ti)) = 2 − log2 (∏s p(s)^occ(s)) = 2 − ∑s occ(s) · log2 p(s) ≈ 2 + ∑s (n · p(s)) · log2 (1/p(s)) = 2 + n H(T) bits. Example: T = acabc, sn = p(a) · p(c) · p(a) · p(b) · p(c) = p(a)² · p(b) · p(c)².

  34. Document Compression Dictionary-based compressors

  35. LZ77 Algorithm’s step: • Output <dist, len, next-char> • Advance by len + 1 A buffer “window” of fixed length slides over the text; the dictionary consists of all substrings starting inside the window. (Figure: two successive steps on the text a a c a a c a b c a a a a a a c, emitting the triples <6,3,a> and <3,4,c>.)
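
  A naive sketch of the parsing step just described: scan the window for the longest match, output <dist, len, next-char>, advance by len + 1. The window size, the output format and the brute-force search are simplifying assumptions; real encoders speed up the search with hash tables, as slide 37 notes:

  #include <stdio.h>

  #define WINDOW 4096   /* illustrative window size, not taken from the slides */

  /* Parse txt[0..n) into <dist, len, next-char> triples (naive quadratic search). */
  void lz77_encode(const char *txt, int n) {
      int cursor = 0;
      while (cursor < n) {
          int best_len = 0, best_dist = 0;
          int start = (cursor > WINDOW) ? cursor - WINDOW : 0;
          for (int j = start; j < cursor; j++) {           /* candidate match start */
              int len = 0;
              while (cursor + len < n - 1 && txt[j + len] == txt[cursor + len])
                  len++;                                   /* match may run past the cursor */
              if (len > best_len) { best_len = len; best_dist = cursor - j; }
          }
          printf("<%d,%d,%c>\n", best_dist, best_len, txt[cursor + best_len]);
          cursor += best_len + 1;                          /* advance by len + 1 */
      }
  }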

  36. LZ77 Decoding The decoder keeps the same dictionary window as the encoder: • it finds the referenced substring and inserts a copy of it. What if len > d, i.e. the copy overlaps the text still to be decompressed? • E.g. seen = abcd, next codeword is (2,9,e) • Simply copy starting at the cursor: for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i] • The output is correct: abcdcdcdcdcdce
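
  The same copy loop, packaged as a small self-contained function; it handles the overlapping case (len > d) correctly precisely because it copies one character at a time:

  /* Expand one <d, len, next> triple into out[], starting at *cursor. */
  void lz77_decode_step(char *out, int *cursor, int d, int len, char next) {
      for (int i = 0; i < len; i++)                     /* character-by-character copy:  */
          out[*cursor + i] = out[*cursor - d + i];      /* works even when len > d       */
      out[*cursor + len] = next;
      *cursor += len + 1;
  }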

  37. LZ77 Optimizations used by gzip LZSS: output one of the following formats: (0, position, length) or (1, char). Typically the second format is used if length < 3. Special greedy parsing: possibly use a shorter match so that the next match is better. A hash table speeds up the search for matching triplets. The emitted triples are then coded with Huffman’s code.

  38. You find this at: www.gzip.org/zlib/

  39. Google’s solution
