This text introduces the concepts of information theory and complexity, covering topics such as entropy, self-information, and probability spaces. It explores the relationship between information and complexity in complex systems.
 
                
NECSI Summer School 2008, Week 3: Methods for the Study of Complex Systems. Information Theory. Hiroki Sayama (sayama@binghamton.edu)
Four approaches to complexity Nonlinear Dynamics Complexity = No closed-form solution, Chaos Information Complexity = Length of description, Entropy Computation Complexity = Computational time/space, Algorithmic complexity Collective Behavior Complexity = Multi-scale patterns, Emergence
Information? • Matter: known since ancient times • Energy: known since the 19th century (industrial revolution) • Information: known since the 20th century (world wars, rise of computers)
An informal definition of information Aspects of some physical phenomenon that can be used to select a smaller set of options out of the original set of options (Things that reduce the number of possibilities) • An observer or interpreter involved • A default set of options needed
Quantitative definition of information • If something is expected to occur almost certainly, its occurrence should carry nearly zero information • If something is expected to occur very rarely, its occurrence should carry a very large amount of information • If an event is expected to occur with probability p, the information produced by its occurrence (self-information) is given by I(p) = - log p
Quantitative definition of information I(p) = - log p • 2 is often used as the base of log • Unit of information is bit (binary digit) [Figure: plot of I(p) against p for p between 0 and 1]
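The definition I(p) = - log p with base 2 can be sketched in a few lines of Python. The function name `self_information` is chosen here for illustration and does not come from the slides.

```python
import math

def self_information(p, base=2.0):
    """Self-information I(p) = -log(p) of an event with probability p.

    With base 2 the result is in bits. Written as log(1/p) so that
    p = 1 gives 0.0 rather than -0.0.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must satisfy 0 < p <= 1")
    return math.log(1.0 / p) / math.log(base)

# Near-certain events carry little information, rare events carry a lot.
for p in (1.0, 0.5, 0.25, 0.01):
    print(f"p = {p:<5} I(p) = {self_information(p):.3f} bits")
```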
Why log? • To fulfill the additivity of information • For independent events A and B: Self-information of “A happened”: $I(p_A)$ • Self-information of “B happened”: $I(p_B)$ • Self-information of “A and B happened”: $I(p_A p_B) = I(p_A) + I(p_B)$ • “I(p) = - log p” satisfies this additivity
Exercise • You draw a card from a well-shuffled deck of cards (without jokers): • How much self-information does the event “the card is a spade” have? • How much self-information does the event “the card is a king” have? • How much self-information does the event “the card is the king of spades” have?
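One way to check answers to this exercise numerically is the sketch below. It assumes a standard 52-card deck (13 spades, 4 kings) and uses base-2 logarithms, so results are in bits. Because suit and rank are independent, the third value should equal the sum of the first two.

```python
import math
from fractions import Fraction

def bits(p):
    """Self-information in bits of an event with probability p."""
    return math.log2(1.0 / float(p))

# Assumed standard deck: 52 cards, 13 spades, 4 kings, 1 king of spades.
p_spade = Fraction(13, 52)
p_king = Fraction(4, 52)
p_king_of_spades = Fraction(1, 52)

i_spade = bits(p_spade)
i_king = bits(p_king)
i_king_of_spades = bits(p_king_of_spades)

print(f"spade:          {i_spade:.3f} bits")
print(f"king:           {i_king:.3f} bits")
print(f"king of spades: {i_king_of_spades:.3f} bits")
# Additivity: suit and rank are independent, so the informations add up.
print(f"spade + king:   {i_spade + i_king:.3f} bits")
```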
Some terminology • Event: An individual outcome (or a set of outcomes) to which a probability of occurrence can be assigned • Sample space: The set of all possible individual events • Probability space: A combination of a sample space and a probability distribution (i.e., probabilities assigned to individual events)
Probability distribution and expected self-information • Probability distribution in probability space A: $p_i$ ($i = 1, \dots, n$; $\sum_i p_i = 1$) • Expected self-information H(A) when one of the individual events happens: $H(A) = \sum_i p_i I(p_i) = -\sum_i p_i \log p_i$
What does H(A) mean? • Average amount of self-information the observer could obtain by one observation • Average “newsworthiness” the observer should expect for one event • Ambiguity of knowledge the observer had about the system before observation • Amount of “ignorance” the observer had about the system before observation
What does H(A) mean? • Amount of “ignorance” the observer had about the system before observation • It quantitatively shows the lack of information (not the presence of information) before observation Information Entropy
Information entropy • Similar to thermodynamic entropy both conceptually and mathematically • Entropy is zero if the system state is uniquely determined with no fluctuation • Entropy increases as the randomness increases within the system • Entropy is maximal if the system is completely random (i.e., if every event is equally likely to occur)
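The properties listed above can be checked with a small sketch of H(A). The helper name `entropy` and the three example distributions are illustrative assumptions. Terms with zero probability are skipped, following the usual convention that 0 log 0 = 0.

```python
import math

def entropy(probs):
    """Information entropy H = -sum(p * log2(p)) in bits.

    Terms with p == 0 are skipped (convention: 0 * log 0 = 0).
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

certain = entropy([1.0, 0.0, 0.0, 0.0])      # state uniquely determined
skewed = entropy([0.7, 0.1, 0.1, 0.1])       # some randomness
uniform = entropy([0.25, 0.25, 0.25, 0.25])  # completely random

print(f"certain: {certain:.3f} bits")
print(f"skewed:  {skewed:.3f} bits")
print(f"uniform: {uniform:.3f} bits")
```

For four equally likely events the result equals log2(4) = 2 bits, the largest value any four-event distribution can reach.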
Exercise • Prove the following: Entropy is maximal if the system is completely random (i.e., if every event is equally likely to occur) • Show that $f(p_1, p_2, \dots, p_n) = -\sum_{i=1}^{n} p_i \log p_i$ (with $\sum_{i=1}^{n} p_i = 1$) takes its maximum when $p_i = 1/n$ • Remove one variable using the constraint • Or use the method of Lagrange multipliers
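A possible outline of the Lagrange-multiplier route is sketched below. It uses the natural logarithm (another base only rescales f by a constant), and the multiplier symbol $\mu$ is introduced here for illustration.

```latex
\begin{align*}
\mathcal{L}(p_1, \dots, p_n, \mu)
  &= -\sum_{i=1}^{n} p_i \ln p_i + \mu \left( \sum_{i=1}^{n} p_i - 1 \right) \\
\frac{\partial \mathcal{L}}{\partial p_i}
  &= -\ln p_i - 1 + \mu = 0
  \quad\Longrightarrow\quad p_i = e^{\mu - 1} \quad \text{for every } i \\
\sum_{i=1}^{n} p_i = 1
  &\quad\Longrightarrow\quad n\, e^{\mu - 1} = 1
  \quad\Longrightarrow\quad p_i = \frac{1}{n} \\
f\!\left(\tfrac{1}{n}, \dots, \tfrac{1}{n}\right)
  &= -\sum_{i=1}^{n} \frac{1}{n} \ln \frac{1}{n} = \ln n
\end{align*}
```

Since each term $-p_i \ln p_i$ is concave in $p_i$, this stationary point is a maximum rather than a minimum.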
Entropy and complex systems • Entropy shows how much information would be needed to fully specify the system’s state in every single detail • Ordered -> low information entropy • Disordered -> high information entropy • May not be consistent with the usual notion of “complexity” • Multiscale views are needed to address this issue
Probability of composite events • Probability of composite event (x, y): p(x, y) = p(y, x) = p(x | y) p(y) = p(y | x) p(x) • p(x | y): Conditional probability for x to occur when y already occurred • p(x | y) = p(x) if X and Y are independent from each other
Exercise: Bayes’ theorem • Define p(x | y) using p(y | x) and p(x) • Use the following formulas as needed • $p(x) = \sum_y p(x, y)$ • $p(y) = \sum_x p(x, y)$ • p(x, y) = p(y | x) p(x) = p(x | y) p(y)
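One way to arrive at the result from the listed formulas is sketched here. The primed summation index $x'$ is notation added for this sketch.

```latex
\begin{align*}
p(x \mid y)\, p(y) &= p(y \mid x)\, p(x) \\
p(x \mid y) &= \frac{p(y \mid x)\, p(x)}{p(y)}
             = \frac{p(y \mid x)\, p(x)}{\sum_{x'} p(x', y)}
             = \frac{p(y \mid x)\, p(x)}{\sum_{x'} p(y \mid x')\, p(x')}
\end{align*}
```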
Product probability space • Prob. space X: {x1, x2}, {p(x1), p(x2)} • Prob. space Y: {y1, y2}, {p(y1), p(y2)} • Product probability space XY: {(x1, y1), (x1, y2), (x2, y1), (x2, y2)}, {p(x1, y1), p(x1, y2), p(x2, y1), p(x2, y2)} Composite events
Joint entropy • Entropy of product probability space XY: $H(XY) = -\sum_x \sum_y p(x, y) \log p(x, y)$ • H(XY) = H(YX) • If X and Y are independent: H(XY) = H(X) + H(Y) • If Y completely depends on X: H(XY) = H(X) ( >= H(Y) )
Conditional entropy • Expected entropy of Y when a specific event occurred in X: $H(Y \mid X) = \sum_x p(x) H(Y \mid X = x) = -\sum_x p(x) \sum_y p(y \mid x) \log p(y \mid x) = -\sum_x \sum_y p(y, x) \log p(y \mid x)$ • If X and Y are independent: H(Y | X) = H(Y) • If Y completely depends on X: H(Y | X) = 0
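Joint and conditional entropy can be computed directly from a table of p(x, y). The sketch below stores the table as a dictionary keyed by (x, y); that representation, the function names, and the two example distributions are assumptions made for illustration.

```python
import math
from collections import defaultdict

def joint_entropy(joint):
    """H(XY) = -sum_x sum_y p(x, y) log2 p(x, y), in bits."""
    return sum(p * math.log2(1.0 / p) for p in joint.values() if p > 0)

def conditional_entropy(joint):
    """H(Y | X) = -sum_x sum_y p(x, y) log2 p(y | x), with p(y | x) = p(x, y) / p(x)."""
    px = defaultdict(float)
    for (x, _y), p in joint.items():
        px[x] += p  # marginal p(x) = sum_y p(x, y)
    return sum(p * math.log2(px[x] / p) for (x, _y), p in joint.items() if p > 0)

# Independent case: X is a fair coin, Y is a biased coin, p(x, y) = p(x) p(y).
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.25, 1: 0.75}
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

# Fully dependent case: Y is always equal to X.
dependent = {(0, 0): 0.5, (1, 1): 0.5}

print("independent: H(XY) =", joint_entropy(independent),
      " H(Y|X) =", conditional_entropy(independent))
print("dependent:   H(XY) =", joint_entropy(dependent),
      " H(Y|X) =", conditional_entropy(dependent))
```

In the independent case H(Y | X) matches the entropy of the biased coin alone, and in the dependent case it drops to zero, in line with the two special cases on the slide.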
Exercise • Prove the following: H(Y | X) = H(YX) - H(X) • Hint: Use Bayes’ theorem
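A possible proof outline, starting from the definition of conditional entropy and substituting p(y | x) = p(x, y) / p(x):

```latex
\begin{align*}
H(Y \mid X)
  &= -\sum_x \sum_y p(x, y) \log p(y \mid x) \\
  &= -\sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)} \\
  &= -\sum_x \sum_y p(x, y) \log p(x, y)
     + \sum_x \Bigl( \sum_y p(x, y) \Bigr) \log p(x) \\
  &= H(YX) + \sum_x p(x) \log p(x) \\
  &= H(YX) - H(X)
\end{align*}
```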
Mutual information • Conditional entropy measures how much ambiguity still remains on Y after observing an event on X • Reduction of ambiguity on Y by one observation on X can be written as: I(Y; X) = H(Y) – H(Y | X) (mutual information)
Symmetry of mutual information I(Y; X) = H(Y) – H(Y | X) = H(Y) + H(X) – H(YX) = H(X) + H(Y) – H(XY) = I(X; Y) Mutual information is symmetric in terms of X and Y
Exercise • Prove the following: • If X and Y are independent: I(X; Y) = 0 • If Y completely depends on X: I(X; Y) = H(Y)
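Before proving these two statements, they can be checked numerically. The sketch below computes I(X; Y) = H(X) + H(Y) – H(XY) from a joint table stored as a dictionary keyed by (x, y). The function names and the two example systems (two independent fair coins; X uniform on {0, 1, 2, 3} with Y = X mod 2) are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (zero terms skipped)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(XY) for a joint table {(x, y): p(x, y)}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p  # marginal p(x)
        py[y] += p  # marginal p(y)
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

# Independent: two fair coins.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

# Y completely depends on X: X uniform on {0, 1, 2, 3}, Y = X mod 2.
dependent = {(x, x % 2): 0.25 for x in range(4)}

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
h_y_dependent = entropy([0.5, 0.5])  # Y is 0 or 1 with probability 1/2 each

print("independent:", mi_independent)
print("dependent:  ", mi_dependent, "(H(Y) =", h_y_dependent, ")")
```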
Exercise • Measure the mutual information between the two systems on the right:
Use of mutual information • Mutual information can be used to measure how much interaction exists between two subsystems in a complex system • Correlation only works for quantitative measures and detects only linear relationships • Mutual information works for qualitative (discrete, symbolic) measures and nonlinear relationships as well
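The contrast with correlation can be illustrated with a small constructed example: X uniform on {-1, 0, 1} and Y = X². The relationship is fully deterministic but not linear. The example and the function names are assumptions made for this sketch. Mutual information is computed here in the equivalent form $\sum p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}$.

```python
import math
from collections import defaultdict

def marginals(joint):
    """Return the marginal tables p(x) and p(y) of a joint table {(x, y): p}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return px, py

def mutual_information(joint):
    """I(X; Y) in bits, as sum of p(x, y) * log2(p(x, y) / (p(x) p(y)))."""
    px, py = marginals(joint)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def covariance(joint):
    """Cov(X, Y) = E[XY] - E[X] E[Y]; zero covariance means zero linear correlation."""
    e_xy = sum(p * x * y for (x, y), p in joint.items())
    e_x = sum(p * x for (x, _y), p in joint.items())
    e_y = sum(p * y for (_x, y), p in joint.items())
    return e_xy - e_x * e_y

# Y = X**2 with X uniform on {-1, 0, 1}: deterministic but nonlinear.
joint = {(x, x * x): 1.0 / 3.0 for x in (-1, 0, 1)}

cov = covariance(joint)
mi = mutual_information(joint)
print("covariance:        ", cov)
print("mutual information:", mi, "bits")
```

The covariance vanishes, so a linear correlation measure sees no relationship, while the mutual information is clearly positive.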
Information source • Sequence of values of a random variable that obeys some probabilistic rules • Sequence may be over time or space • Values (events) may or may not be independent from each other • Example: • Repeated coin tosses • Sound • Visual image
Memoryless and Markov information sources 01010010001011011001101000110 Memoryless information source p(0) = p(1) = 1/2 01000000111111001110001111111 Markov information source p(1|0) = p(0|1) = 1/4
Markov information source • Information source whose probability distribution at time t depends only on its immediate past value $X_{t-1}$ (or past n values $X_{t-1}, X_{t-2}, \dots, X_{t-n}$) • Cases n>1 can be converted into n=1 form by defining composite events • Probabilistic rules are given as a set of conditional probabilities, which can be written in the form of a transition probability matrix (TPM)
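State-transition diagram 01000000111111001110001111111 Markov information source p(1|0) = p(0|1) = 1/4 [Figure: two states, 0 and 1; each state returns to itself with probability 3/4 and moves to the other state with probability 1/4]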
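Matrix representation 01000000111111001110001111111 Markov information source p(1|0) = p(0|1) = 1/4 $\begin{pmatrix} p_0 \\ p_1 \end{pmatrix}_{t} = \begin{pmatrix} 3/4 & 1/4 \\ 1/4 & 3/4 \end{pmatrix} \begin{pmatrix} p_0 \\ p_1 \end{pmatrix}_{t-1}$ (probability vector at time t = TPM × probability vector at time t-1)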
Exercise abcaccaabccccaaabc aaccacaccaaaaabcc • Consider the above sequence as a Markov information source and create its state-transition diagram and matrix representation
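A matrix representation can be estimated from an observed sequence by counting consecutive pairs. The sketch below is one simple approach (relative-frequency estimates), not necessarily the procedure intended by the exercise. The function name `estimate_tpm` is hypothetical, and it is demonstrated on the binary sequence from the earlier slides. The result is indexed as `tpm[y][x]` = estimated p(y | x), so each column x holds the probabilities of leaving state x.

```python
from collections import Counter

def estimate_tpm(sequence):
    """Estimate p(y | x) from consecutive pairs (x followed by y).

    Returns a nested dict tpm[y][x] (row y, column x). A state that never
    occurs before another symbol gets an all-zero column.
    """
    pair_counts = Counter(zip(sequence, sequence[1:]))  # (x, y) -> count
    from_counts = Counter(sequence[:-1])                # x -> times x is followed by something
    states = sorted(set(sequence))
    return {
        y: {
            x: (pair_counts[(x, y)] / from_counts[x]) if from_counts[x] else 0.0
            for x in states
        }
        for y in states
    }

sequence = "01000000111111001110001111111"
tpm = estimate_tpm(sequence)
for y in sorted(tpm):
    print(y, [round(tpm[y][x], 3) for x in sorted(tpm[y])])
```

With only 28 transitions the estimates differ noticeably from the nominal value 1/4, which is expected for such a short sample. Passing the a/b/c sequence as a string gives a 3 × 3 table in the same way.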
Review: Convenient properties of transition probability matrix • The product of two TPMs is also a TPM • All TPMs have eigenvalue 1 • $|\lambda| \le 1$ for all eigenvalues $\lambda$ of any TPM • If the transition network is strongly connected, the TPM has one and only one eigenvalue 1 (no degeneracy)
Review: TPM and asymptotic probability distribution • $|\lambda| \le 1$ for all eigenvalues $\lambda$ of any TPM • If the transition network is strongly connected, the TPM has one and only one eigenvalue 1 (no degeneracy) → This eigenvalue is a unique dominant eigenvalue and the probability vector will eventually converge to its corresponding eigenvector
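Exercise • Calculate the asymptotic probability distribution of the following: 01000000111111001110001111111 Markov information source p(1|0) = p(0|1) = 1/4 $\begin{pmatrix} p_0 \\ p_1 \end{pmatrix}_{t} = \begin{pmatrix} 3/4 & 1/4 \\ 1/4 & 3/4 \end{pmatrix} \begin{pmatrix} p_0 \\ p_1 \end{pmatrix}_{t-1}$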
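A numerical check for this exercise is to apply the TPM repeatedly to a starting probability vector until it stops changing. The sketch below assumes the TPM is stored as a list of rows with `tpm[y][x]` = p(y | x). The function names, the tolerance, and the starting vector are arbitrary choices for illustration.

```python
def step(tpm, p):
    """One update p(t) = TPM * p(t-1), where tpm[y][x] = p(y | x)."""
    n = len(p)
    return [sum(tpm[y][x] * p[x] for x in range(n)) for y in range(n)]

def asymptotic_distribution(tpm, p0, tol=1e-12, max_steps=10_000):
    """Iterate the update until successive vectors differ by less than tol."""
    p = list(p0)
    for _ in range(max_steps):
        nxt = step(tpm, p)
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    raise RuntimeError("probability vector did not converge")

# TPM of the binary source with p(1|0) = p(0|1) = 1/4.
tpm = [[0.75, 0.25],
       [0.25, 0.75]]

q = asymptotic_distribution(tpm, [1.0, 0.0])  # start with state 0 for certain
print("asymptotic distribution:", [round(v, 6) for v in q])
```

The same vector can be obtained analytically as the eigenvector of the TPM for eigenvalue 1, normalized so that its elements sum to 1.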
Review: Information entropy • Expected information H(A) when one of the individual events happens: $H(A) = \sum_i p_i I(p_i) = -\sum_i p_i \log p_i$ • This applies only to memoryless information sources, in which events are independent from each other
Generalizing information entropy • For other types of information source where events are not independent, information entropy is defined as: $H\{X\} = \lim_{k \to \infty} H(X_{k+1} \mid X_1 X_2 \dots X_k)$, where $X_k$ is the k-th value of random variable X
Calculating information entropy of Markov information source (1) $H\{X\} = \lim_{k \to \infty} H(X_{k+1} \mid X_1 X_2 \dots X_k)$ • This means the expected entropy of the (k+1)-th value given a specific history of the past k values • All that matters is the last value of the history, so let’s focus on $X_k$
Calculating information entropy of Markov information source (2) • $p(X_k = x)$: Probability for the last (k-th) value to be x • $H(X_{k+1} \mid X_1 X_2 \dots X_k) = \sum_x p(X_k = x) H(X_{k+1} \mid X_k = x) = -\sum_x p(X_k = x) \sum_y a_{yx} \log a_{yx} = \sum_x p(X_k = x) h(a_x)$ • $a_{yx}$: y-th row, x-th column element of the TPM • $h(a_x)$: Entropy of the x-th column vector of the TPM
Calculating information entropy of Markov information source (3) $H(X_{k+1} \mid X_1 X_2 \dots X_k) = \sum_x p(X_k = x) h(a_x)$ • If the information source has only one asymptotic probability distribution q: $\lim_{k \to \infty} p(X_k = x) = q_x$ (q’s x-th element) • $H\{X\} = \lim_{k \to \infty} H(X_{k+1} \mid X_1 X_2 \dots X_k) = h \cdot q$ • h: A row vector whose x-th element is $h(a_x)$
Calculating information entropy of Markov information source (4) $H\{X\} = \lim_{k \to \infty} H(X_{k+1} \mid X_1 X_2 \dots X_k) = h \cdot q$ • Information entropy of a Markov information source is given by the average of the entropies of its TPM’s column vectors, weighted by its asymptotic probability distribution • This holds if the information source has only one asymptotic probability distribution
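The weighted average h·q can be sketched as follows for the binary source with p(1|0) = p(0|1) = 1/4. The function names are illustrative, the TPM is stored as a list of rows with `tpm[y][x]` = p(y | x), and the asymptotic distribution q = (1/2, 1/2) is supplied by hand here (it follows from the symmetry of this particular TPM) rather than computed.

```python
import math

def column_entropy(tpm, x):
    """h(a_x): entropy in bits of the x-th column of the TPM."""
    return sum(a * math.log2(1.0 / a) for a in (row[x] for row in tpm) if a > 0)

def entropy_rate(tpm, q):
    """H{X} = h . q: column entropies weighted by the asymptotic distribution q."""
    return sum(q[x] * column_entropy(tpm, x) for x in range(len(q)))

tpm = [[0.75, 0.25],
       [0.25, 0.75]]
q = [0.5, 0.5]  # asymptotic distribution, given by symmetry for this TPM

rate = entropy_rate(tpm, q)
print(f"entropy of the Markov source: {rate:.4f} bits per symbol")
```

The result is below the 1 bit per symbol of the memoryless source with p(0) = p(1) = 1/2, reflecting that each value is partly predictable from the previous one.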
Exercise • Calculate the information entropy of the following Markov information sources we discussed earlier: 01000000111111001110001111111 abcaccaabccccaaabc aaccacaccaaaaabcc
Summary • Complexity of a system may be characterized using information • Length of description • Entropy (ambiguity of knowledge) • Mutual information quantifies the coupling between two components within a system • Entropy may be measured for Markov information sources as well