
Some aspects of information theory for a computer scientist



Presentation Transcript


  1. Some aspects of information theory for a computer scientist Eric Fabre http://people.rennes.inria.fr/Eric.Fabre http://www.irisa.fr/sumo 11 Sep. 2014

  2. Outline 1. Information: measure and compression 2. Reliable transmission of information 3. Distributed compression 4. Fountain codes 5. Distributed peer-to-peer storage

  3. 1 Information: measure and compression

  4. Let’s play… One card is drawn at random in the following set. Guess the color (i.e. the suit) of the card with a minimum of yes/no questions. • One strategy • is it hearts ? • if not, is it clubs ? • if not, is it diamonds ? • Wins in • 1 guess, with probability ½ • 2 guesses, with prob. ¼ • 3 guesses, with prob. ¼ • 1.75 questions on average. Is there a better strategy ?
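
As a quick sanity check (not part of the slides), here is a small Python sketch. It assumes the deck is biased so that hearts appear with probability ½, clubs ¼, and diamonds and spades ⅛ each, which is consistent with the winning probabilities quoted above.

```python
# Hypothetical card distribution, consistent with the stated winning probabilities.
probs = {"hearts": 1/2, "clubs": 1/4, "diamonds": 1/8, "spades": 1/8}

# Number of yes/no questions used by the strategy "hearts? clubs? diamonds?".
questions = {"hearts": 1, "clubs": 2, "diamonds": 3, "spades": 3}

expected = sum(p * questions[suit] for suit, p in probs.items())
print(expected)  # 1.75 questions on average
```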

  5. Observation The strategy amounts to the prefix code 1, 01, 001, 000 (one bit per question asked). • Lessons • more likely means easier to guess (carries less information) • the amount of information depends only on the log-likelihood of an event • guessing with yes/no questions = encoding with bits = compressing

  6. • Important remark: • codes like the one on the slide, with codewords 1, 0, 11, 00, are not permitted • they cannot be uniquely decoded if one transmits sequences of encoded values of X, e.g. the sequence 11 can encode “Diamonds” or “Hearts, Hearts” • one would need one extra symbol to separate “words”

  7. Entropy Source of information = random variable notation: variables X, Y, … taking values x, y, … information carried by the event “X=x” : h(x) = −log2 P(X=x) average information carried by X : H(X) = −Σx P(X=x) log2 P(X=x) H(X) measures the average difficulty to encode/describe/guess random outcomes of X
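
A minimal helper implementing this definition (the card probabilities reused here are the assumed ones from the sketch above):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H(X) = -sum_x P(X=x) * log2 P(X=x), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# For the assumed card distribution, H(X) = 1.75 bits: the 1.75 questions/card
# of the guessing strategy above is therefore the best possible average.
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75
```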

  8. Properties • H(X) ≥ 0, with equality iff X is not random • H(X,Y) ≤ H(X) + H(Y), with equality iff X and Y are independent (i.e. P(X,Y) = P(X)·P(Y)) • H(X) ≤ log2 K for an alphabet of K values, with equality iff the distribution of X is uniform • Bernoulli distribution: H(p) = −p log2 p − (1−p) log2(1−p), maximal (1 bit) at p = ½

  9. Conditional entropy H(Y|X) = Σx P(X=x) H(Y | X=x) = H(X,Y) − H(X) uncertainty left on Y when X is known Property : H(Y|X) ≤ H(Y), with equality iff Y and X are independent

  10. Example : X = color, Y = value compute H(Y|X) as the average over the colors x of H(Y | X=x) recall H(X), so one checks H(X,Y) = H(X) + H(Y|X) Exercise : check that H(X|Y) = H(X,Y) − H(Y)

  11. A visual representation

  12. Data compression CoDec for source X, with R bits/sample on average rate R is achievable iff there exist CoDec pairs (fn, gn) of rate R with vanishing error probability : P[ gn(fn(X1…Xn)) ≠ X1…Xn ] → 0 as n → ∞ • Theorem (Shannon, ‘48) : • a lossless compression scheme for source X must have a rate R ≥ H(X) bits/sample on average • the rate H(X) is (asymptotically) achievable Usage: there was no better strategy for our card game !

  13. Proof Necessity : if R is achievable, then R ≥ H(X), quite easy to prove Sufficiency : for R > H(X), one must build a lossless coding scheme using R bits/sample on average • Solution 1 • use a known optimal lossless coding scheme for X : the Huffman code • then prove that its average length L satisfies H(X) ≤ L < H(X) + 1 • over n independent symbols X1,…,Xn, one has nH(X) ≤ Ln < nH(X) + 1, i.e. an average rate per symbol that tends to H(X) Solution 2 : encode only “typical sequences” (next slide)
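
For illustration, a compact (non-optimized) Huffman construction in Python; the card distribution is again the assumed one, for which the Huffman code reaches H(X) exactly since all probabilities are powers of ½.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code for a dict {symbol: probability}.
    Returns {symbol: bitstring}. Minimal sketch, no tie-breaking subtleties."""
    tick = count()  # unique tie-breaker so heapq never compares the code dicts
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # two least probable subtrees
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]

probs = {"hearts": 1/2, "clubs": 1/4, "diamonds": 1/8, "spades": 1/8}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)      # codeword lengths 1, 2, 3, 3
print(avg_len)   # 1.75 = H(X) here, since all probabilities are powers of 1/2
```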

  14. Typical sequences Let X1,…,Xn be independent, with the same law as X By the law of large numbers, one has the a.s. convergence −(1/n) log2 P(X1…Xn) = −(1/n) Σi log2 P(Xi) → H(X) A sequence x1…xn is typical iff | −(1/n) log2 P(x1…xn) − H(X) | ≤ ε or equivalently 2^(−n(H(X)+ε)) ≤ P(x1…xn) ≤ 2^(−n(H(X)−ε)) Set of typical sequences : A(n,ε)

  15. AEP : asymptotic equipartition property • one has P(A(n,ε)) → 1 as n → ∞ • and |A(n,ε)| ≤ 2^(n(H(X)+ε)) So non-typical sequences count for 0, and there are approximately 2^(nH(X)) typical sequences, each of probability ≈ 2^(−nH(X)), among the K^n = 2^(n log2 K) possible sequences (K = alphabet size) • Optimal lossless compression • encode a typical sequence with nH(X) bits • encode a non-typical sequence with n log2 K bits • add 0 / 1 as prefix to mean typ. / non-typ.
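
The AEP is easy to observe empirically. The small simulation below (an arbitrarily chosen biased coin, not from the slides) shows the per-symbol log-probability concentrating around H(X) as n grows:

```python
import random
from math import log2

# Biased coin with P(1) = 0.2: H(X) ≈ 0.722 bits.
p = 0.2
H = -(p * log2(p) + (1 - p) * log2(1 - p))

def per_symbol_surprise(n):
    """Draw X1..Xn i.i.d. and return -(1/n) log2 P(X1..Xn)."""
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    logp = sum(log2(p) if x == 1 else log2(1 - p) for x in xs)
    return -logp / n

for n in (10, 100, 10000):
    print(n, round(per_symbol_surprise(n), 3), "vs H(X) =", round(H, 3))
```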

  16. Practical coding schemes • Encoding by typicality is impractical ! • Practical codes : • Huffman code • arithmetic coding (adapted to data flows) • etc. • All require knowing the distribution of the source to be efficient. • Universal code: • does not need to know the source distribution • for long sequences X1…Xn, converges to the optimal rate H(X) bits/symbol • example: Lempel-Ziv algorithm (used in ZIP, Compress, etc.)
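
As a rough illustration of universality (my own sketch, not from the slides): zlib’s DEFLATE algorithm, an LZ77-based cousin of Lempel-Ziv, is fed a biased binary source without being told its distribution, and its rate lands within a modest overhead of H(X).

```python
import random
import zlib
from math import log2

p = 0.1
H = -(p * log2(p) + (1 - p) * log2(1 - p))     # ≈ 0.469 bits/symbol

n = 200_000
bits = [1 if random.random() < p else 0 for _ in range(n)]
# Pack the bits 8 per byte so the compressor can see the redundancy.
data = bytes(sum(b << i for i, b in enumerate(bits[k:k + 8]))
             for k in range(0, n, 8))
rate = 8 * len(zlib.compress(data, 9)) / n     # output bits per source bit
print(round(rate, 3), "bits/symbol vs H(X) =", round(H, 3))
```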

  17. 2 Reliable transmission of information

  18. Mutual information I(X;Y) = H(X) + H(Y) − H(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) measures how many bits X and Y have in common (on average) Properties I(X;Y) = I(Y;X) ≥ 0, with equality iff X and Y are independent

  19. Noisy channel Channel = input alphabet A, output alphabet B, transition probability P(B|A) observe that the input distribution P(A) is left free Capacity : C = max over P(A) of I(A;B), in bits / use of channel the maximizing input distribution maximizes the coupling between input and output letters and favors the letters that are the least altered by noise

  20. Example The erasure channel : a proportion p of the bits are erased (input A ∈ {0,1}, output B ∈ {0, 1, E}) Define the erasure variable E = f(B) with E=1 when an erasure occurred, and E=0 otherwise Then H(B|A) = H(p) and H(B) = H(E) + H(B|E) = H(p) + (1−p) H(A) So I(A;B) = H(B) − H(B|A) = (1−p) H(A), maximized by a uniform input : C = 1 − p
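
A short numerical check of this formula (a sketch with an arbitrary p): sweeping the input distribution P(A=1) = q of a binary erasure channel and maximizing I(A;B) indeed gives C = 1 − p, reached at q = ½.

```python
from math import log2

def mutual_info_erasure(q, p):
    """I(A;B) for a binary erasure channel with erasure probability p
    and input distribution P(A=1) = q."""
    def h(probs):
        return -sum(pr * log2(pr) for pr in probs if pr > 0)
    HB = h([(1 - q) * (1 - p), q * (1 - p), p])   # output takes values 0, 1, E
    HB_given_A = h([1 - p, p])                    # same for both inputs
    return HB - HB_given_A

p = 0.3
best = max(mutual_info_erasure(q / 100, p) for q in range(101))
print(round(best, 4), "vs 1 - p =", 1 - p)        # 0.7 in both cases
```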

  21. Protection against errors Idea: add extra bits to the message, to augment its inner redundancy (this is exactly the converse of data compression) Coding scheme : X → fn → A1…An → noisy channel → B1…Bn → gn → estimate of X • X takes values in { 1, 2, … , M = 2^(nR) } • rate of the codec : R = log2(M) / n transmitted bits / channel use • R is achievable iff there exists a series of (fn, gn) CoDecs of rate R such that P( gn(B1…Bn) ≠ X ) → 0 as n → ∞, where A1…An = fn(X)

  22. Error correction (for a binary channel) Repetition • useful bit U sent 3 times : A1=A2=A3=U • decoding by majority • detects and corrects one error… but R’ = R/3 Parity checks • X = k useful bits U1…Uk, expanded into n bits A1…An • rate R = k/n • for example: add extra redundant bits Vk+1…Vn that are linear combinations of the U1…Uk • examples: • ASCII code k=7, n=8 • ISBN • social security number • credit card number Questions: how ??? and how many extra bits ???

  23. How ? The Hamming code • 4 useful bits U1…U4 • 3 redundant bits V1…V3 • rate R = 4/7 • detects and corrects 1 error (exercise…) • trick : 2 codewords differ by at least 3 bits (figure: the Venn-diagram arrangement of U1…U4 and V1…V3) Generating matrix (of a linear code) : [U1 … U4] · G = [U1 … U4 V1 … V3], with G = [ 1 0 0 0 0 1 1 ; 0 1 0 0 1 0 1 ; 0 0 1 0 1 1 0 ; 0 0 0 1 1 1 1 ] Almost all channel codes are linear : Reed-Solomon, Reed-Muller, Golay, BCH, cyclic codes, convolutional codes… They use finite field theory, and algebraic decoding techniques.
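
A minimal Python sketch of this code, using the generator matrix above and the corresponding parity-check matrix H = [Pᵀ | I3]; it corrects any single bit flip by syndrome decoding.

```python
# Hamming(7,4): generator matrix G = [I4 | P] as on the slide; arithmetic is mod 2.
G = [[1, 0, 0, 0, 0, 1, 1],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 1, 1, 0],
     [0, 0, 0, 1, 1, 1, 1]]
# Parity-check matrix H = [P^T | I3]: every codeword c satisfies H c^T = 0.
H = [[0, 1, 1, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [1, 1, 0, 1, 0, 0, 1]]

def encode(u):
    """Encode 4 information bits into a 7-bit codeword."""
    return [sum(u[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def decode(r):
    """Correct at most one flipped bit, then return the 4 information bits."""
    s = [sum(H[i][j] * r[j] for j in range(7)) % 2 for i in range(3)]
    if any(s):
        # the syndrome equals the column of H at the error position
        j = next(j for j in range(7) if [H[i][j] for i in range(3)] == s)
        r = r[:j] + [1 - r[j]] + r[j + 1:]
    return r[:4]

u = [1, 0, 1, 1]
c = encode(u)          # [1, 0, 1, 1, 0, 1, 0]
c[2] ^= 1              # flip one bit during transmission
print(decode(c) == u)  # True: the single error is corrected
```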

  24. How much ? (figure: error probability vs. rate — what people believed before ‘48, and what Shannon proved in ‘48) • Theorem (Shannon, ‘48) : • any achievable transmission rate R must satisfy R ≤ C transmitted bits / channel use • any transmission rate R < C is achievable Usage: measures the efficiency of an error correcting code for some channel

  25. Proof Necessity: if a coding is (asymptotically) error-free, then its rate satisfies R ≤ C, rather easy to prove Sufficiency: any rate R < C is achievable, which demands to build a coding scheme ! Idea = random coding ! • take the best (capacity-achieving) distribution P(A) on the input alphabet of the channel • build a random codeword w = a1…an by drawing letters according to P(A) (w is a typical sequence) • sending w over the channel yields an output w’ = b1…bn which is a typical sequence for P(B) • and the pair (w, w’) is jointly typical for P(A,B)

  26. (figure: the M typical input codewords w1 … wM over A1…An, and for each one the “cone” of output sequences over B1…Bn jointly typical with it) M typical sequences as codewords, about 2^(nH(B)) possible typical sequences at output, about 2^(nH(B|A)) outputs jointly typical with each wi • if M is small enough, the output cones do not overlap (with high probability) • maximal number of input codewords : M ≈ 2^(nH(B)) / 2^(nH(B|A)) = 2^(nI(A;B)) = 2^(nC), which proves that any R < C is achievable !

  27. Perfect coding Perfect code = error-free and achieves capacity. What does it look like ? • by the data processing inequality, nR = H(X) = I(X;X) ≤ I(A1…An ; B1…Bn) ≤ nC • if R = C, then I(A1…An ; B1…Bn) = nC • possible iff the letters Ai of the codeword are independent, and each I(Ai;Bi) = C, i.e. each Ai carries R = C bits For a binary channel: R = k / n, so a perfect code spreads the information of the k useful bits uniformly over the n transmitted bits

  28. In practice • Random coding is impractical: it relies on a (huge) codebook for coding/decoding • Algebraic (linear) codes were preferred for long : more structure, coding/decoding with algorithms • But in practice, they remained much below optimal rates ! • Things changed in 1993 when Berrou & Glavieux invented the turbo-codes • followed by the rediscovery of the low-density parity-check codes (LDPC), invented by Gallager in his PhD… in 1963 ! • both code families behave like random codes… but come with low-complexity coding/decoding algorithms

  29. Can feedback improve capacity ? • Principle • the outputs of the channel are revealed to the sender • the sender can use this information to adapt its next symbol Theorem: Feedback does not improve channel capacity. But it can greatly simplify coding, decoding, and transmission protocols.

  30. 2nd PART Information theory was designed for point-to-point communications, which was soon considered as a limitation… • broadcast channel: each user has a different channel • multiple access channel: interferences Spread information: which structure for this object ? how to regenerate / transmit it ?

  31. What is the capacity of a network ? Are network links just pipes, with capacity, in which information flows like a fluid ? (figure: the butterfly network — sources A and B, receivers C and D, bottleneck link E—F) How many transmissions to broadcast a from A to C,D and b from B to C,D ? By network coding (sending the combination a+b over the bottleneck), one transmission over link E—F can be saved. Médard & Koetter, 2003

  32. Outline 1. Information: measure and compression 2. Reliable transmission of information 3. Distributed compression 4. Fountain codes 5. Distributed peer-to-peer storage

  33. 3 Distributed source coding

  34. Collecting spread information • X, Y are two distant but correlated sources • transmit their values to a unique receiver (perfect channels) • no communication between the encoders (figure: X → encoder 1 at rate R1, Y → encoder 2 at rate R2, no communication between the two distant encoders, a joint decoder outputs X,Y) • Naive solution = ignore the correlation, compress and send each source separately : rates R1 = H(X), R2 = H(Y) • Can one do better, and take advantage of the correlation of X and Y ?

  35. Example • X = weather in Brest, Y = weather in Quimper (each sunny or rainy) • the probability that the weathers are identical is 0.89 • one wishes to send the observed weather of 100 days in both cities (figure: the 2×2 joint probability table of X and Y over {sun, rain}) • One has H(X) = 1 = H(Y), so naïve encoding requires 200 bits • I(X;Y) = 0.5, so not sending the “common information” saves 50 bits
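
The joint table itself is not legible in this transcript, but the quoted numbers can be checked under the natural assumption that each city is sunny or rainy with probability ½ and the two weathers agree with probability 0.89:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Assumed joint law: P(sun,sun) = P(rain,rain) = 0.445, P(sun,rain) = P(rain,sun) = 0.055
joint = [0.445, 0.445, 0.055, 0.055]

HX = HY = H([0.5, 0.5])           # 1 bit each
HXY = H(joint)                    # ≈ 1.5 bits
I = HX + HY - HXY                 # I(X;Y) = H(X) + H(Y) - H(X,Y) ≈ 0.5 bits
print(round(HXY, 3), round(I, 3))
```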

  36. Necessary conditions A pair (R1, R2) is achievable if there exist separate encoders fnX and fnY of sequences X1…Xn and Y1…Yn resp., and a joint decoder gn, that are asymptotically error-free. Question: what are the best possible achievable transmission rates ? • Jointly, both coders must transmit the full pair (X,Y), so • R1 + R2 ≥ H(X,Y) • Each coder alone must transmit the private information that is not accessible through the other variable, so • R1 ≥ H(X|Y) and R2 ≥ H(Y|X)

  37. Result • Theorem (Slepian & Wolf, ‘75) : • The achievable region is defined by • R1 ≥ H(X|Y) • R2 ≥ H(Y|X) • R1 + R2 ≥ H(X,Y) (figure: the achievable region in the (R1, R2) plane, with corner points (H(X), H(Y|X)) and (H(X|Y), H(Y))) The achievable region is easily shown to be convex, upper-right closed.

  38. Compression by random binning • encode only the typical sequences w = x1…xn • throw them at random into 2^(nR) bins, with R > H(X) • the codeword is the bin number, on R bits/symbol Encoding of w = the number b of the bin where w lies Decoding : if w is the unique typical sequence in bin number b, output w; otherwise, output “error” Error probability : vanishes as n → ∞, since the ≈ 2^(nH(X)) typical sequences are spread over 2^(nR) > 2^(nH(X)) bins

  39. Proof of Slepian-Wolf • fX and fY are two independent random binnings of rates R1 and R2, for x = x1…xn and y = y1…yn resp. • to decode the pair of bin numbers (bX, bY) = (fX(x), fY(y)), g outputs the unique pair (x, y) of jointly typical sequences in box (bX, bY), or “error” if there is more than one such pair. (figure: the 2^(nR1) × 2^(nR2) grid of boxes, with the jointly typical pairs (x, y) scattered in it) • R2 > H(Y|X) : given x, there are 2^(nH(Y|X)) sequences y that are jointly typical with x • R1 + R2 > H(X,Y) : the number of boxes 2^(n(R1+R2)) must be greater than 2^(nH(X,Y))

  40. Example X = color, Y = value (figure: H(X|Y) = 1.25, H(Y|X) = 1.25, I(X;Y) = 0.5) Questions: 1. Is there an instantaneous* transmission protocol for the rates RX = 1.25 = H(X|Y), RY = 1.75 = H(Y) ? • send Y (always) : 1.75 bits • what about X ? (caution: the code for X should be uniquely decodable; the slide proposes the codewords 0, 10, 110, 111) 2. What about RX = RY = 1.5 ? (*) i.e. for sequences of length n = 1

  41. In practice • The Slepian-Wolf theorem extends to N sources. • It long remained an academic result, since no practical coders existed. • Beginning of the 2000s, practical coders and applications appeared • compression of correlated images (e.g. same scene, 2 angles) • sensor networks (e.g. measure of a temperature field) • case of a channel with side information • acquisition of structured information, without communication

  42. 4 Fountain codes

  43. Network protocols TCP/IP (transmission control protocol) (figure: packets 1…7 sent over the network, an erasure channel; lost packets are retransmitted upon acknowledgements) Drawbacks • slow for huge files over long-range connexions (e.g. cloud backups…) • feedback channel… but feedback does not improve capacity ! • repetition code… the worst rate among error correcting codes ! • designed by engineers who ignored information theory ? :o) However • the erasure rate of the channel (thus the capacity) is unknown / changing • feedback makes protocols simpler • there exist faster protocols (UDP) for streaming feeds

  44. A fountain of information bits… • How to quickly and reliably transmit K packets of b bits ? • Fountain code: • from the K packets, generate and send a continuous flow of packets • some get lost, some go through ; no feedback • as soon as about K(1+ε) of them are received, whichever they are, decoding becomes possible Fountain codes are examples of rateless codes (no predefined rate), or universal codes : they adapt to the channel capacity.

  45. Random coding… Packet tn sent at time n is a random linear combination of the K packets s1…sK to transmit : tn = Σk Gn,k · sk (bitwise XOR), where the Gn,k are random IID binary variables. (figure: the K source packets of b bits, the K’ sent packets, and the K’ × K random binary matrix G relating them)

  46. Decoding Some packets are lost, and N out of the K’ sent packets are received. This is equivalent to another random code : the received packets r1…rN satisfy r = G’ · s, where G’ is the sub-matrix of G formed by the rows of the received packets. How big should N be to enable decoding ?

  47. Decoding One has r = G’ · s, where G’ is a random N × K binary matrix. If G’ is invertible (N = K), one can decode by s = G’⁻¹ · r. • For N = K, what is the probability that G’ is invertible ? Answer: it converges quickly to ≈ 0.289 (as soon as K > 10). • What about N = K + E ? What is the probability P that at least one K × K sub-matrix of G’ is invertible ? Answer: P = 1 − δ(E) where δ(E) ≤ 2^(−E) (δ(E) < 10^(−6) for E = 20) : exponential convergence to 1 with E, regardless of K. Complexity • K/2 operations per generated packet, so O(K²) for encoding • decoding: O(K³) for the matrix inversion • one would like better complexities… linear ?
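
The 0.289 figure is easy to reproduce by simulation. The sketch below estimates the probability that a random K × K binary matrix has full rank over GF(2); the limiting value is the product of (1 − 2^(−k)) for k ≥ 1, about 0.2888.

```python
import random

def rank_gf2(rows, ncols):
    """Rank over GF(2) of a binary matrix whose rows are given as integer bitmasks."""
    rank = 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if (rows[i] >> col) & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and (rows[i] >> col) & 1:
                rows[i] ^= rows[rank]      # eliminate this column from the other rows
        rank += 1
    return rank

def prob_invertible(K, trials=2000):
    """Estimate P(a random K x K binary matrix is invertible over GF(2))."""
    hits = sum(rank_gf2([random.getrandbits(K) for _ in range(K)], K) == K
               for _ in range(trials))
    return hits / trials

for K in (10, 30, 100):
    print(K, round(prob_invertible(K), 3))   # all close to 0.289
```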

  48. LT codes Invented by Michael Luby (2003), and inspired by LDPC codes (Gallager, 1963). Idea : the linear combinations of packets should be “sparse” • Encoding • for each packet tn, randomly select a “degree” dn according to some distribution ρ(d) on degrees • choose at random dn packets among s1…sK and take as tn the (XOR) sum of these dn packets • some nodes have low degree, others have high degree: this makes the bipartite graph between s1…sK and t1…tN a small world

  49. Decoding LT codes Idea = a simplified version of turbo-decoding (Berrou) that resembles crossword solving (figure: example, decoding step by step with received bits 1 0 1 1)

  50. Decoding LT codes (continued) (figure: the next step of the same example — a degree-1 packet reveals a source bit, which is then removed from the other combinations)
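
The slides illustrate this graphically; as a sketch (not the slides’ own code), here is the corresponding “peeling” decoder: repeatedly find a received packet of degree 1, read off that source bit, and XOR it out of every other packet.

```python
def lt_peel(received, K):
    """Peeling decoder for an LT code over single bits.
    `received` = list of (set of source indices, XOR of those source bits).
    Returns the K decoded bits, or None if decoding gets stuck."""
    source = [None] * K
    packets = [(set(idx), val) for idx, val in received]
    progress = True
    while progress:
        progress = False
        for idx, val in packets:
            if len(idx) == 1:                     # degree-1 packet: reveals one bit
                k = next(iter(idx))
                if source[k] is None:
                    source[k] = val
                    progress = True
        remaining = []
        for idx, val in packets:                  # XOR known bits out of the rest
            known = {k for k in idx if source[k] is not None}
            val ^= sum(source[k] for k in known) % 2
            idx = idx - known
            if idx:
                remaining.append((idx, val))
        packets = remaining
    return source if all(b is not None for b in source) else None

# Toy usage: 4 source bits, 5 received combinations (enough to peel everything here).
src = [1, 0, 1, 1]
combos = [{0}, {0, 1}, {1, 2, 3}, {2, 3}, {0, 2}]
rx = [(c, sum(src[k] for k in c) % 2) for c in combos]
print(lt_peel(rx, 4))   # [1, 0, 1, 1]
```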
