
COT 5611 Operating Systems Design Principles Spring 2012






Presentation Transcript


  1. COT 5611 Operating Systems Design Principles Spring 2012 Dan C. Marinescu Office: HEC 304 Office hours: M-Wd 5:00-6:00 PM

  2. Lecture 17 – Wednesday March 14, 2012
  • Reading assignment:
    • Chapter 8 from the on-line text
    • Claude Shannon’s paper
  • Last time - Information Theory
    • Information theory - a statistical theory of communication
    • Random variables, probability density functions (PDF), cumulative distribution functions (CDF)
    • Thermodynamic entropy
    • Shannon entropy
    • Joint and conditional entropy
    • Mutual information
    • Shannon’s source coding theorem
    • Channel capacity
  Lecture 17

  3. Today
  • Information Theory
    • Applications of information theory
    • Properties of Shannon’s entropy
    • Joint and conditional entropy
    • Mutual information
    • Shannon’s source coding theorem
    • Channel capacity
    • Error detection and error correction
  Lecture 17

  4. Applications of information theory
  • Error detection and error correction → increase redundancy to protect the message.
  • Data compression → remove redundancy.
  • Encryption → transform information to protect it.
  Lecture 17

  5. Properties of the binary Shannon entropy
  • H(X) > 0 for 0 < p < 1;
  • H(X) is symmetric about p = 0.5;
  • lim H(X) as p → 0 and as p → 1 is 0;
  • H(X) is increasing for 0 < p < 0.5, decreasing for 0.5 < p < 1, and has a maximum at p = 0.5.
  • The binary entropy is a concave function of p, the probability of an outcome.
  Note: A function f(x) is convex over an interval (a,b) if f[kx1 + (1-k)x2] ≤ kf(x1) + (1-k)f(x2) for all x1, x2 in (a,b) and 0 ≤ k ≤ 1. A function is concave over an interval (a,b) if -f(x) is convex over (a,b).
  Lecture 17
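
A small Python sketch (not part of the original slides; the function name binary_entropy is mine) that evaluates the binary entropy at a few values of p, illustrating the symmetry about p = 0.5, the maximum at p = 0.5, and a midpoint concavity check:

import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Symmetry about p = 0.5 and maximum at p = 0.5
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"H({p}) = {binary_entropy(p):.4f}")

# Concavity check at the midpoint of p1 = 0.1 and p2 = 0.6:
# H((p1+p2)/2) should be >= (H(p1) + H(p2)) / 2
p1, p2 = 0.1, 0.6
assert binary_entropy((p1 + p2) / 2) >= (binary_entropy(p1) + binary_entropy(p2)) / 2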

  7. Shannon entropy is a concave function of p Lecture 17

  8. Joint and conditional entropy; mutual information Lecture 17
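
The formulas for this slide exist only as an image in the original deck; for reference, the standard definitions (written here in LaTeX, using the notation of these slides) are:

  H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)                      (joint entropy)
  H(X \mid Y) = -\sum_{x,y} p(x,y) \log_2 p(x \mid y)             (conditional entropy)
  I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\, p(y)}    (mutual information)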

  9. Lecture 17

  10. Properties of joint and conditional entropy
  • H(X,Y) = H(Y,X) → symmetry of joint entropy
  • H(X,Y) ≥ 0 → non-negativity of joint entropy
  • H(X | Y) ≥ 0; H(Y | X) ≥ 0 → non-negativity of conditional entropy
  • H(X | Y) = H(X,Y) - H(Y) → conditional and joint entropy relation
  • H(X,Y) ≥ H(Y) → joint entropy vs. entropy of a single rv
  • H(X,Y) ≤ H(X) + H(Y) → subadditivity
  • H(X,Y,Z) + H(Y) ≤ H(X,Y) + H(Y,Z) → strong subadditivity
  • H(X | Y) ≤ H(X) → reduction of uncertainty by conditioning
  • H(X,Y,Z) = H(X) + H(Y | X) + H(Z | X,Y) → chain rule for joint entropy
  • H(X,Y | Z) = H(Y | X,Z) + H(X | Z) → chain rule for conditional entropy
  Lecture 17
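
The following Python sketch (not from the slides; the joint distribution is a hypothetical example chosen only for illustration) computes the entropies of a small two-variable distribution and checks several of the properties above:

import math

# Toy joint distribution p(x, y) over X in {0,1}, Y in {0,1}
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    """Shannon entropy in bits of a distribution given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

px = {x: sum(p for (a, _), p in pxy.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in pxy.items() if b == y) for y in (0, 1)}

H_XY = H(pxy)
H_X, H_Y = H(px), H(py)
H_X_given_Y = H_XY - H_Y            # conditional and joint entropy relation

print(f"H(X,Y) = {H_XY:.4f}, H(X) = {H_X:.4f}, H(Y) = {H_Y:.4f}, H(X|Y) = {H_X_given_Y:.4f}")
assert H_XY >= H_Y                  # joint entropy vs. entropy of a single rv
assert H_XY <= H_X + H_Y + 1e-12    # subadditivity
assert H_X_given_Y <= H_X + 1e-12   # reduction of uncertainty by conditioning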

  11. Properties of mutual information
  • I(X;Y) = I(Y;X) → symmetry of mutual information
  • I(X;Y) = H(X) - H(X | Y) → mutual information, entropy, and conditional entropy
  • I(X;Y) = H(Y) - H(Y | X) → mutual information, entropy, and conditional entropy
  • I(X;X) = H(X) → mutual self-information and entropy
  • I(X;X) ≥ 0 → non-negativity of mutual self-information
  • I(X;Y) = H(X) + H(Y) - H(X,Y) → mutual information, entropy, and joint entropy
  • I(X;Y | Z) = H(X | Z) - H(X | Y,Z) → conditional mutual information and conditional entropy
  • I(X,Y;Z) = I(X;Z | Y) + I(Y;Z) → chain rule for mutual information
  • I(X;Z) ≤ I(X;Y) if X → Y → Z forms a Markov chain → data processing inequality
  Lecture 17
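
Continuing the sketch above (same illustrative joint distribution, again not from the slides), mutual information can be computed in two equivalent ways and checked to be non-negative:

import math

pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(p for (a, _), p in pxy.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in pxy.items() if b == y) for y in (0, 1)}

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# I(X;Y) from the identity H(X) + H(Y) - H(X,Y)
I_from_entropies = H(px) + H(py) - H(pxy)

# I(X;Y) directly from the definition: sum of p(x,y) log2[ p(x,y) / (p(x)p(y)) ]
I_direct = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

assert abs(I_from_entropies - I_direct) < 1e-12
assert I_direct >= 0                 # mutual information is non-negative
print(f"I(X;Y) = {I_direct:.4f} bits")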

  12. Shannon’s source coding theorem
  Informally, Shannon’s source coding theorem states that a message containing n independent, identically distributed samples of a random variable X with entropy H(X) can be compressed to a length
      l_X(n) = nH(X) + o(n).
  The justification of this theorem is based on the weak law of large numbers: the mean of a large number of independent, identically distributed random variables x_i approaches their expected value μ with high probability when n is large. For arbitrary ε > 0 and δ > 0 there is an n0 such that for all n ≥ n0
      Pr( | (x_1 + x_2 + … + x_n)/n − μ | < ε ) ≥ 1 − δ.
  Lecture 17

  13. Shannon’s source coding theorem
  When the source has an alphabet A with m symbols and messages consist of n independently selected symbols from this alphabet, a large number of these sequences are typical. There are about 2^{nH(A)} typical strings, therefore we need log2 2^{nH(A)} = nH(A) bits to encode all possible typical strings; this is the upper bound for the data compression provided by Shannon's source coding theorem.
  Lecture 17
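
A small Python sketch (not from the slides) illustrating the counting behind this bound for a binary alphabet: the number of n-bit strings with about pn ones, C(n, pn), grows roughly like 2^{nH(p)}, so about nH(p) bits suffice to index the typical strings instead of n.

import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# For a binary source with Prob(1) = p, typical n-bit strings have close to p*n ones.
n, p = 1000, 0.2
k = int(p * n)
log2_typical_count = math.log2(math.comb(n, k))    # exact count of strings with k ones
print(f"log2 C({n},{k})    = {log2_typical_count:.1f} bits")
print(f"n * H({p})         = {n * binary_entropy(p):.1f} bits")
print(f"n (no compression) = {n} bits")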

  14. Binary erasure channel Lecture 17

  15. Channel capacity
  • Discrete memoryless channel: C = max over p(x) of I(X;Y), the maximum of the mutual information between the input X and the output Y over all input distributions p(x).
  • The capacity of a noisy channel. The noisy binary symmetric channel: p is the probability of error; q = Prob(X=0).
    I(X;Y) = H(Y) - H(Y | X)
    H(Y | X) = -{ q [p log p + (1-p) log(1-p)] + (1-q) [p log p + (1-p) log(1-p)] } = -[p log p + (1-p) log(1-p)] = H_b(p)
    We maximize I(X;Y) by making H(Y) = 1 → C = 1 - H_b(p) = 1 + p log p + (1-p) log(1-p)
    p = 1/2 → C = 0 because the output is independent of the input; p = 0 or p = 1 → C = 1, we have a noiseless channel.
  • The capacity of the binary erasure channel with p_e the probability of erasure: C_e = 1 - p_e.
  Lecture 17
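
A minimal Python sketch (the function names are mine, not from the slides) computing the two capacities:

import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

def bec_capacity(pe):
    """Capacity of the binary erasure channel with erasure probability pe."""
    return 1.0 - pe

for p in (0.0, 0.1, 0.25, 0.5):
    print(f"BSC  p  = {p}: C = {bsc_capacity(p):.4f} bits/use")
for pe in (0.0, 0.1, 0.5):
    print(f"BEC  pe = {pe}: C = {bec_capacity(pe):.4f} bits/use")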

  16. Error detection and error correction (ECC)
  • Error detection and error correction are based on schemes that increase the redundancy of a message.
  • A crude analogy is to bubble wrap a fragile item and place it into a box to reduce the chance that the item will be damaged during transport. Redundant information plays the role of the packing materials; it increases the amount of data transmitted, but it also increases the chance that we will be able to restore the original contents of a message distorted during communication.
  • Coding corresponds to the selection of both the packing materials and the strategy to optimally pack the fragile item subject to the obvious constraints: use the least amount of packing materials and the least amount of effort to pack and unpack.
  • Error detection → compare what you received with the code words from the common dictionary; if there is no match, one or more errors have occurred.
  • Error correction → map the received message to a valid code word.
  Lecture 17

  17. Examples and limitations of ECC
  • A trivial example of an error detection scheme is the addition of a parity check bit to a word of a given length.
  • This is a simple but powerful scheme; it allows us to detect an odd number of errors, but fails if an even number of errors occur. For example, consider a system that enforces even parity for an eight-bit word. Given the string 10111011, we add one more bit to ensure that the total number of 1s is even, in this case a 0, and we transmit the nine-bit string 101110110. The error detection procedure is to count the number of 1s; we decide that the string is in error if this number is odd.
  • This example also hints at the limitations of error detection mechanisms. A code is designed with certain error detection or error correction capabilities and fails to detect or correct error patterns not covered by the original design of the code.
  • In the previous example we transmit 101110110; when two errors occur, in the 4th and the 7th bits, we receive 101010010.
  • This tuple has even parity (an even number of 1s) and our scheme for error detection fails.
  Lecture 17
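
A short Python sketch (not part of the slides) reproducing the parity example: the parity bit is computed for 10111011, and flipping the 4th and 7th bits of the transmitted word yields 101010010, which still passes the even-parity check.

def add_even_parity(bits):
    """Append a parity bit so that the total number of 1s is even."""
    return bits + str(bits.count("1") % 2)

def parity_check_ok(bits):
    """Even parity passes when the number of 1s is even."""
    return bits.count("1") % 2 == 0

word = "10111011"
sent = add_even_parity(word)                # -> "101110110", as on the slide
print(sent, parity_check_ok(sent))          # True: no error detected

# Flip the 4th and 7th bits (two errors): 101110110 -> 101010010
received = list(sent)
for i in (3, 6):                            # 0-based indices of the 4th and 7th bits
    received[i] = "0" if received[i] == "1" else "1"
received = "".join(received)
print(received, parity_check_ok(received))  # still True: the two errors go undetected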

  18. Code
  • n-tuple → a sequence of n symbols from an alphabet A. Example: A = {0,1,2} and n = 6 → 000000, 211101, 111122, etc. A = {0,1} (binary alphabet) and n = 3 → 000, 001, 010, 100, 110, 101, 011, 111.
  • Code → a set of n-tuples. Example: a binary code C → select 2^k code words from the 2^n possible binary n-tuples. The sender and the receiver share the knowledge of all the code words in C.
  • Hamming distance → the number of positions in which two code words differ.
  • Distance d of a code C → the minimum distance between any pair of code words of C.
  • Hamming sphere of radius d around a code word w → the set of all n-tuples at distance at most d from w.
  Lecture 17

  19. Block codes
  • A block code C = [n,M] consists of code words of length n and allows the encoding of M messages.
  • Example: consider binary [n,M] codes with n = 6 and M = 4. The code C = {c0, c1, c2, c3} with c0 = 000000, c1 = 101101, c2 = 010110, c3 = 111011. Out of the 2^6 possible binary 6-tuples we have selected 4 as code words.
  • Hamming distance of two code words: the number of bit positions in which they differ, e.g., d(c1,c3) = 3.
  • The Hamming distance of the code C is the minimum distance between any pair of code words: d(C) = 3. Indeed d(c0,c1) = 4, d(c0,c2) = 3, d(c0,c3) = 5, d(c1,c2) = 5, d(c1,c3) = 3, d(c2,c3) = 4.
  • To compute the Hamming distance of an [n,M] code, it is necessary to compute the distance between all C(M,2) = M(M-1)/2 pairs of code words and then to find the pair with the minimum distance.
  Lecture 17
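
A Python sketch (not from the slides) that computes all pairwise Hamming distances of this code and its minimum distance d(C):

from itertools import combinations

def hamming_distance(a, b):
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

# The [6, 4] code from the slide
code = {"c0": "000000", "c1": "101101", "c2": "010110", "c3": "111011"}

for (n1, w1), (n2, w2) in combinations(code.items(), 2):
    print(f"d({n1},{n2}) = {hamming_distance(w1, w2)}")

d_C = min(hamming_distance(w1, w2) for w1, w2 in combinations(code.values(), 2))
print("d(C) =", d_C)   # minimum over all C(4,2) = 6 pairs, here 3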

  20. Encoding
  • Encoding → map k information symbols into n = k + r symbols by adding r redundancy symbols.
  • Example: repetition code. Encode 0 → 000 and 1 → 111. Then the two code words are 000 and 111; the other 3-tuples are 100, 010, 001, 011, 101, 110.
  • Decode any received 3-tuple with one error as follows:
    100, 010, 001 → 0
    011, 101, 110 → 1
  • These two sets are the Hamming spheres of radius 1 around 000 and 111.
  Lecture 17
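
A minimal Python sketch (my own function names, not from the slides) of the 3-bit repetition code with majority-vote decoding; it corrects any single error but, as expected, fails when two bits are flipped.

def encode_repetition(bit, n=3):
    """Repeat a single bit n times (the repetition code of the slide, n = 3)."""
    return bit * n

def decode_repetition(word):
    """Majority-vote decoding: corrects any single error in a 3-bit word."""
    return "1" if word.count("1") > len(word) // 2 else "0"

print(encode_repetition("0"))          # 000
print(decode_repetition("010"))        # one error -> decoded as 0
print(decode_repetition("101"))        # one error -> decoded as 1
print(decode_repetition("110"))        # two errors in a sent 000 -> wrongly decoded as 1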

  21. Lecture 17

  22. Lecture 17

  23. Errors Lecture 17

  24. Decoding in the presence of errors
  • Send c; receive v = c + e, where e is the error pattern.
  • Minimum distance or nearest neighbor decoding. If an n-tuple v is received and there is a unique code word c such that d(v,c) is the minimum over all code words of C, then correct v as the code word c. If no such c exists, report that errors have been detected, but no correction is possible. If multiple code words are at the same minimum distance from the received n-tuple, select one of them at random and decode v as that code word.
  • Maximum likelihood decoding. Under this decoding policy, of all possible code words c, the n-tuple v is decoded to the code word c which maximizes the probability P(v,c) that v is received, given that c is sent.
  Lecture 17
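
A small Python sketch (not from the slides) of minimum-distance (nearest-neighbor) decoding for the code of slide 19:

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def min_distance_decode(v, code):
    """Nearest-neighbor decoding: return a closest code word, its distance, and whether there was a tie."""
    distances = {c: hamming_distance(v, c) for c in code}
    best = min(distances.values())
    closest = [c for c, d in distances.items() if d == best]
    # With a unique nearest code word we correct to it; a tie would call for a random choice.
    return closest[0], best, len(closest) > 1

code = ["000000", "101101", "010110", "111011"]
decoded, dist, tie = min_distance_decode("111111", code)
print(decoded, dist, tie)   # 111011 is at distance 1, no tie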

  25. Example of maximum likelihood decoding
  • Consider the same code C = {c0 = 000000, c1 = 101101, c2 = 010110, c3 = 111011} and assume the probability of a bit being in error is p = 0.15.
  • When we receive v = 111111 we decode it as 111011:
    p(v, 000000) = (0.15)^6 ≈ 0.000011
    p(v, 101101) = (0.15)^2 x (0.85)^4 ≈ 0.011745
    p(v, 010110) = (0.15)^3 x (0.85)^3 ≈ 0.002073
    p(v, 111011) = (0.15)^1 x (0.85)^5 ≈ 0.066556
  Lecture 17
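
The likelihoods can be recomputed with a short Python sketch (mine, not from the slides); note that for a binary symmetric channel with p < 0.5, maximum likelihood decoding picks the same code word as nearest-neighbor decoding.

def likelihood(v, c, p=0.15):
    """P(v received | c sent) over a binary symmetric channel with bit-error probability p."""
    errors = sum(x != y for x, y in zip(v, c))
    return (p ** errors) * ((1 - p) ** (len(v) - errors))

code = ["000000", "101101", "010110", "111011"]
v = "111111"
for c in code:
    print(f"p(v,{c}) = {likelihood(v, c):.6f}")

ml_decoded = max(code, key=lambda c: likelihood(v, c))
print("decoded as", ml_decoded)   # 111011, the same answer as nearest-neighbor decoding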

  26. Error detecting and error correcting codes
  • The error detection and error correction capabilities of a code are determined by the distance d of the code (the minimum Hamming distance between any pair of code words).
  • To detect e errors → d ≥ e + 1
  • To correct e errors → d ≥ 2e + 1
  • For example, the code C above with d = 3 can detect up to 2 errors and correct a single error.
  Lecture 17

  27. Lecture 17
