Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian
Motivation Information Theory is relevant to all of humanity... -- Abstruse Goose (177) Information Theory for Data Management - Divesh & Suresh
Background • Many problems in data management need precise reasoning about information content, transfer and loss • Structure Extraction • Privacy preservation • Schema design • Probabilistic data ? Information Theory for Data Management - Divesh & Suresh
Information Theory • First developed by Shannon as a way of quantifying capacity of signal channels. • Entropy, relative entropy and mutual information capture intrinsic informational aspects of a signal • Today: • Information theory provides a domain-independent way to reason about structure in data • More information = interesting structure • Less information linkage = decoupling of structures Information Theory for Data Management - Divesh & Suresh
Tutorial Thesis Information theory provides a mathematical framework for the quantification of information content, linkage and loss. This framework can be used in the design of data management strategies that rely on probing the structure of information in data. Information Theory for Data Management - Divesh & Suresh
Tutorial Goals • Introduce information-theoretic concepts to DB audience • Give a ‘data-centric’ perspective on information theory • Connect these to applications in data management • Describe underlying computational primitives. Illuminate when and how information theory might be of use in new areas of data management. Information Theory for Data Management - Divesh & Suresh
Outline Part 1 Introduction to Information Theory Application: Data Anonymization Application: Database Design Part 2 Review of Information Theory Basics Application: Data Integration Computing Information Theoretic Primitives Open Problems Information Theory for Data Management - Divesh & Suresh
Histograms And Discrete Distributions Column of data X (8 values: x1 ×4, x2 ×2, x3 ×1, x4 ×1) → aggregate counts → histogram f(X): x1 → 4, x2 → 2, x3 → 1, x4 → 1 → normalize → probability distribution p(X): x1 → 0.5, x2 → 0.25, x3 → 0.125, x4 → 0.125 Information Theory for Data Management - Divesh & Suresh
Histograms And Discrete Distributions (weighted case) Column of data X → aggregate counts → histogram f(X): x1 → 4, x2 → 2, x3 → 1, x4 → 1 → reweight and normalize → probability distribution p(X): x1 → 0.667, x2 → 0.2, x3 → 0.067, x4 → 0.067 Information Theory for Data Management - Divesh & Suresh
From Columns To Random Variables • We can think of a column of data as “represented” by a random variable: • X is a random variable • p(X) is the column of probabilities p(X = x1), p(X = x2), and so on • Also known (in unweighted case) as the empirical distribution induced by the column X. • Notation: • X (upper case) denotes a random variable (column) • x (lower case) denotes a value taken by X (field in a tuple) • p(x) is the probability p(X = x) Information Theory for Data Management - Divesh & Suresh
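A minimal sketch of this step (illustrative Python written for this writeup, not code from the tutorial): building the empirical distribution p(X) from a column of values by aggregating counts and normalizing.

```python
from collections import Counter

def empirical_distribution(column):
    """Aggregate counts, then normalize so the probabilities sum to 1."""
    counts = Counter(column)
    n = len(column)
    return {value: count / n for value, count in counts.items()}

# A column containing x1 four times, x2 twice, and x3, x4 once each gives
# p(x1) = 0.5, p(x2) = 0.25, p(x3) = p(x4) = 0.125, as in the slide above.
print(empirical_distribution(["x1", "x1", "x2", "x1", "x3", "x4", "x2", "x1"]))
```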
Joint Distributions • Discrete joint distribution p(X, Y, Z) • Marginals are obtained by summing out the other variables: p(Y) = ∑x p(X=x, Y) = ∑x ∑z p(X=x, Y, Z=z) Information Theory for Data Management - Divesh & Suresh
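A small illustration of marginalization (assumed code, not from the slides): a joint distribution stored as a dict keyed by (x, y, z) can be summed down to p(Y).

```python
from collections import defaultdict

def marginal(joint, keep_index):
    """Sum the joint probabilities over all variables except the kept one."""
    result = defaultdict(float)
    for outcome, prob in joint.items():
        result[outcome[keep_index]] += prob
    return dict(result)

# p(Y = y) = sum_x sum_z p(X = x, Y = y, Z = z); the joint below is a made-up example.
joint = {("x1", "y1", "z1"): 0.25, ("x1", "y2", "z1"): 0.25,
         ("x2", "y1", "z2"): 0.25, ("x2", "y2", "z2"): 0.25}
print(marginal(joint, keep_index=1))  # {'y1': 0.5, 'y2': 0.5}
```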
Entropy Of A Column • Let h(x) = log2 1/p(x); h(X) is the column of h(x) values • H(X) = EX[h(x)] = ∑x p(x) log2 1/p(x) • Two views of entropy: it captures uncertainty in data (high entropy, more unpredictability); it captures information content (higher entropy, more information) • For the example column, H(X) = 1.75 < log2 |X| = 2 Information Theory for Data Management - Divesh & Suresh
Examples • X uniform over [1, ..., 4]. H(X) = 2 • Y is 1 with probability 0.5, uniform over [2, 3, 4] otherwise • H(Y) = 0.5 log 2 + 0.5 log 6 ≈ 1.8 < 2 • Y is more sharply defined, and so has less uncertainty • Z uniform over [1, ..., 8]. H(Z) = 3 > 2 • Z spans a larger range, and captures more information Information Theory for Data Management - Divesh & Suresh
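A short sketch of the entropy computation (illustrative code, not part of the slides), checking H(X) = 1.75 for the example column and the values of H(Y) and H(Z) above.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a distribution given as a dict of probabilities."""
    return sum(px * log2(1 / px) for px in p.values() if px > 0)

print(entropy({"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}))  # 1.75 < log2 |X| = 2
print(entropy({1: 0.5, 2: 1/6, 3: 1/6, 4: 1/6}))                   # ~1.79, the H(Y) example
print(entropy({z: 1/8 for z in range(1, 9)}))                      # 3.0, the H(Z) example
```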
Comparing Distributions • How do we measure the difference between two distributions? • Kullback-Leibler divergence: • dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi) (Figure: prior belief → inference mechanism → resulting belief) Information Theory for Data Management - Divesh & Suresh
Comparing Distributions • Kullback-Leibler divergence: • dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi) • dKL(p, q) >= 0 • Captures the extra information needed to describe p given q • Is asymmetric! dKL(p, q) != dKL(q, p) • Is not a metric (does not satisfy the triangle inequality) • There are other measures: • ℓ2-distance, variational distance, f-divergences, … Information Theory for Data Management - Divesh & Suresh
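A minimal sketch of the divergence (assumed helper, not from the tutorial), which also makes the asymmetry visible on a toy pair of distributions.

```python
from math import log2

def kl_divergence(p, q):
    """d_KL(p, q) = sum_i p_i log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}
print(kl_divergence(p, q), kl_divergence(q, p))  # two different values: d_KL is asymmetric
```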
Conditional Probability • Given a joint distribution on random variables X, Y, how much information about X can we glean from Y ? • Conditional probability: p(X|Y) • p(X = x1 | Y = y1) = p(X = x1, Y = y1)/p(Y = y1) Information Theory for Data Management - Divesh & Suresh
Conditional Entropy • Let h(x|y) = log2 1/p(x|y) • H(X|Y) = Ex,y[h(x|y)] = ∑x ∑y p(x,y) log2 1/p(x|y) • H(X|Y) = H(X,Y) – H(Y); in the example, H(X|Y) = 2.25 – 1.5 = 0.75 • If X, Y are independent, H(X|Y) = H(X) Information Theory for Data Management - Divesh & Suresh
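A sketch of H(X|Y) computed directly from a joint distribution (illustrative code; the joint below is a constructed example with H(X,Y) = 2.25 and H(Y) = 1.5, not the actual table from the slide).

```python
from collections import defaultdict
from math import log2

def conditional_entropy(joint):
    """joint maps (x, y) -> p(x, y); returns H(X|Y) in bits."""
    p_y = defaultdict(float)
    for (x, y), pxy in joint.items():
        p_y[y] += pxy
    # H(X|Y) = sum_{x,y} p(x,y) log2( 1 / p(x|y) ) = sum_{x,y} p(x,y) log2( p(y) / p(x,y) )
    return sum(pxy * log2(p_y[y] / pxy) for (x, y), pxy in joint.items() if pxy > 0)

joint = {("x1", "y1"): 0.25, ("x2", "y1"): 0.25, ("x1", "y2"): 0.25,
         ("x1", "y3"): 0.125, ("x2", "y3"): 0.125}
print(conditional_entropy(joint))  # 0.75 = H(X,Y) - H(Y) = 2.25 - 1.5
```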
Mutual Information • Mutual information captures the difference between the joint distribution on X and Y and the product of the marginal distributions on X and Y. • Let i(x;y) = log p(x,y)/(p(x)p(y)) • I(X;Y) = Ex,y[i(x;y)] = ∑x ∑y p(x,y) log p(x,y)/(p(x)p(y)) Information Theory for Data Management - Divesh & Suresh
Mutual Information: Strength of linkage • I(X;Y) = H(X) + H(Y) – H(X,Y) = H(X) – H(X|Y) = H(Y) – H(Y|X) • If X, Y are independent, then I(X;Y) = 0: • H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) – H(X,Y) = 0 • I(X;Y) <= min(H(X), H(Y)) • Suppose Y = f(X) (deterministically) • Then H(Y|X) = 0, and so I(X;Y) = H(Y) – H(Y|X) = H(Y) • Mutual information captures higher-order interactions: • Covariance captures “linear” interactions only • Two variables can be uncorrelated (covariance = 0) and have nonzero mutual information: • X uniform on [-1, 1], Y = X². Cov(X,Y) = 0, I(X;Y) = H(Y) > 0 Information Theory for Data Management - Divesh & Suresh
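A sketch of I(X;Y) from a joint distribution (assumed code), including a discrete version of the last bullet: X uniform over {-1, 0, 1} and Y = X², which has zero covariance but positive mutual information.

```python
from collections import defaultdict
from math import log2

def mutual_information(joint):
    """joint maps (x, y) -> p(x, y); returns I(X;Y) in bits."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), pxy in joint.items():
        p_x[x] += pxy
        p_y[y] += pxy
    return sum(pxy * log2(pxy / (p_x[x] * p_y[y]))
               for (x, y), pxy in joint.items() if pxy > 0)

joint = {(x, x * x): 1 / 3 for x in (-1, 0, 1)}   # X uniform on {-1, 0, 1}, Y = X^2
print(mutual_information(joint))                  # ~0.918 = H(Y) > 0, though Cov(X, Y) = 0
```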
Information Theory: Summary • We can represent data as discrete distributions (normalized histograms) • Entropy captures uncertainty or information content in a distribution • The Kullback-Leibler divergence captures the difference between distributions • Mutual information and conditional entropy capture linkage between variables in a joint distribution Information Theory for Data Management - Divesh & Suresh
Outline Part 1 Introduction to Information Theory Application: Data Anonymization Application: Database Design Part 2 Review of Information Theory Basics Application: Data Integration Computing Information Theoretic Primitives Open Problems Information Theory for Data Management - Divesh & Suresh
Data Anonymization Using Randomization Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals’ sensitive attributes Key ideas: Randomize numerical data: add noise from known distribution Reconstruct original data distribution using published noisy data Issues: How can the original data distribution be reconstructed? What kinds of randomization preserve privacy of individuals? Information Theory for Data Management - Divesh & Suresh
Data Anonymization Using Randomization Many randomization strategies proposed [AS00, AA01, EGS03] Example randomization strategies: X in [0, 10] R = X + μ (mod 11), μ is uniform in {-1, 0, 1} R = X + μ (mod 11), μ is in {-1 (p = 0.25), 0 (p = 0.5), 1 (p = 0.25)} R = X (p = 0.6), R = μ, μ is uniform in [0, 10] (p = 0.4) Question: Which randomization strategy has higher privacy preservation? Quantify loss of privacy due to publication of randomized data Information Theory for Data Management - Divesh & Suresh
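A quick illustration of the three strategies (simulation code written for this writeup; the function names are mine, and the return value is the published value R).

```python
import random

def randomize_a(x):
    """R = X + mu (mod 11), mu uniform in {-1, 0, 1}."""
    return (x + random.choice([-1, 0, 1])) % 11

def randomize_b(x):
    """R = X + mu (mod 11), mu = -1 / 0 / +1 with probabilities 0.25 / 0.5 / 0.25."""
    return (x + random.choices([-1, 0, 1], weights=[0.25, 0.5, 0.25])[0]) % 11

def randomize_c(x):
    """R = X with probability 0.6, otherwise a uniform value in [0, 10]."""
    return x if random.random() < 0.6 else random.randint(0, 10)

print([randomize_a(5) for _ in range(5)],
      [randomize_b(5) for _ in range(5)],
      [randomize_c(5) for _ in range(5)])
```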
Data Anonymization Using Randomization X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1} Information Theory for Data Management - Divesh & Suresh
Data Anonymization Using Randomization X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1} → Information Theory for Data Management - Divesh & Suresh
Data Anonymization Using Randomization X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1} → Information Theory for Data Management - Divesh & Suresh
Reconstruction of Original Data Distribution X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1} Reconstruct distribution of X using knowledge of R1 and μ EM algorithm converges to MLE of original distribution [AA01] Information Theory for Data Management - Divesh & Suresh
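A rough sketch of this reconstruction step (my paraphrase of the EM-style update referenced from [AA01], not the authors' code): estimate p(X) from the observed distribution of R and the known noise channel p(r | x).

```python
from collections import defaultdict

def reconstruct(observed_r, channel, x_values, iterations=500):
    """observed_r: dict r -> empirical probability of seeing r.
    channel: dict (r, x) -> p(r | x).  Returns an estimate of p(X)."""
    p_x = {x: 1.0 / len(x_values) for x in x_values}          # start from uniform
    for _ in range(iterations):
        new_p = defaultdict(float)
        for r, p_r in observed_r.items():
            norm = sum(channel.get((r, x), 0.0) * p_x[x] for x in x_values)
            if norm == 0.0:
                continue
            for x in x_values:
                # accumulate p(r) * p(x | r) under the current estimate of p(X)
                new_p[x] += p_r * channel.get((r, x), 0.0) * p_x[x] / norm
        p_x = {x: new_p[x] for x in x_values}
    return p_x

# Channel for R1 = X + mu (mod 11), mu uniform in {-1, 0, 1}
x_values = list(range(11))
channel = {((x + mu) % 11, x): 1 / 3 for x in x_values for mu in (-1, 0, 1)}
# If X were uniform on {5, 6}, R1 would have the distribution below; the estimate
# should put (almost) all of its mass back on 5 and 6.
observed_r = {4: 1/6, 5: 1/3, 6: 1/3, 7: 1/6}
estimate = reconstruct(observed_r, channel, x_values)
print({x: round(p, 3) for x, p in estimate.items() if p > 0.01})
```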
Analysis of Privacy [AS00] X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1} If X is uniform in [0, 10], privacy is determined by the range of μ Information Theory for Data Management - Divesh & Suresh
Analysis of Privacy [AA01] X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1} If X is uniform in [0, 1] ∪ [5, 6], privacy is smaller than the range of μ In some cases, the sensitive value is revealed Information Theory for Data Management - Divesh & Suresh
Quantify Loss of Privacy [AA01] Goal: quantify loss of privacy based on mutual information I(X;R) Smaller H(X|R) ⇒ more loss of privacy in X by knowledge of R Larger I(X;R) ⇒ more loss of privacy in X by knowledge of R I(X;R) = H(X) – H(X|R); I(X;R) is used to capture the correlation between X and R p(X) is the prior knowledge of the sensitive attribute X p(X, R) is the joint distribution of X and R Information Theory for Data Management - Divesh & Suresh
Quantify Loss of Privacy [AA01] Goal: quantify loss of privacy based on mutual information I(X;R) X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1} I(X;R) = 0.33 Information Theory for Data Management - Divesh & Suresh
Quantify Loss of Privacy [AA01] Goal: quantify loss of privacy based on mutual information I(X;R) X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1} I(X;R1) = 0.33, I(X;R2) = 0.5 R2 is a bigger privacy risk than R1 Information Theory for Data Management - Divesh & Suresh
Quantify Loss of Privacy [AA01] Equivalent goal: quantify loss of privacy based on H(X|R) X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1} Intuition: we know more about X given R2 than about X given R1 H(X|R1) = 0.67, H(X|R2) = 0.5 R2 is a bigger privacy risk than R1 Information Theory for Data Management - Divesh & Suresh
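A sketch that reproduces these numbers (illustrative code, not from the tutorial): build the joint distribution of (X, R) for each randomization and compute I(X;R) = H(X) + H(R) – H(X,R); H(X|R) is then H(X) – I(X;R).

```python
from collections import defaultdict
from math import log2

def entropy(p):
    return sum(v * log2(1 / v) for v in p.values() if v > 0)

def joint_x_r(p_x, noise):
    """noise maps mu -> probability; returns the joint (x, r) -> p for R = X + mu (mod 11)."""
    joint = defaultdict(float)
    for x, px in p_x.items():
        for mu, pmu in noise.items():
            joint[(x, (x + mu) % 11)] += px * pmu
    return joint

def mutual_information(joint):
    p_x, p_r = defaultdict(float), defaultdict(float)
    for (x, r), p in joint.items():
        p_x[x] += p
        p_r[r] += p
    return entropy(p_x) + entropy(p_r) - entropy(joint)

p_x = {5: 0.5, 6: 0.5}                                                 # X uniform on {5, 6}
print(mutual_information(joint_x_r(p_x, {-1: 1/3, 0: 1/3, 1: 1/3})))   # ~0.33 for R1
print(mutual_information(joint_x_r(p_x, {0: 0.5, 1: 0.5})))            # 0.5  for R2
```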
Quantify Loss of Privacy Example: X is uniform in [0, 1] R3 = e (p = 0.9999), R3 = X (p = 0.0001) R4 = X (p = 0.6), R4 = 1 – X (p = 0.4) Is R3 or R4 a bigger privacy risk? Information Theory for Data Management - Divesh & Suresh
Worst Case Loss of Privacy [EGS03] Example: X is uniform in [0, 1] R3 = e (p = 0.9999), R3 = X (p = 0.0001) R4 = X (p = 0.6), R4 = 1 – X (p = 0.4) I(X;R3) = 0.0001 << I(X;R4) = 0.028 But R3 has a larger worst case risk Information Theory for Data Management - Divesh & Suresh
Worst Case Loss of Privacy [EGS03] Goal: quantify worst case loss of privacy in X by knowledge of R Use max KL divergence, instead of mutual information Mutual information can be formulated as expected KL divergence I(X;R) = ∑x ∑r p(x,r) log2( p(x,r) / (p(x) p(r)) ) = KL( p(X,R) || p(X) p(R) ) I(X;R) = ∑r p(r) ∑x p(x|r) log2( p(x|r) / p(x) ) = ER[ KL( p(X|r) || p(X) ) ] [AA01] measure quantifies expected loss of privacy over R [EGS03] propose a measure based on worst case loss of privacy IW(X;R) = maxr [ KL( p(X|r) || p(X) ) ] Information Theory for Data Management - Divesh & Suresh
Worst Case Loss of Privacy [EGS03] Example: X is uniform in [0, 1] R3 = e (p = 0.9999), R3 = X (p = 0.0001) R4 = X (p = 0.6), R4 = 1 – X (p = 0.4) IW(X;R3) = max{0.0, 1.0, 1.0} > IW(X;R4) = max{0.028, 0.028} Information Theory for Data Management - Divesh & Suresh
Worst Case Loss of Privacy [EGS03] Example: X is uniform in [5, 6] R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1} R2 = X + μ (mod 11), μ is uniform in {0, 1} IW(X;R1) = max{1.0, 0.0, 0.0, 1.0} = IW(X;R2) = max{1.0, 0.0, 1.0} = 1.0 Unable to capture that R2 is a bigger privacy risk than R1 Information Theory for Data Management - Divesh & Suresh
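A sketch of the worst-case measure (assumed code): IW(X;R) = max over r of KL(p(X|r) || p(X)), checked on the R1/R2 example above.

```python
from collections import defaultdict
from math import log2

def worst_case_privacy_loss(joint):
    """joint maps (x, r) -> p(x, r); returns max_r KL( p(X|r) || p(X) )."""
    p_x, p_r = defaultdict(float), defaultdict(float)
    for (x, r), p in joint.items():
        p_x[x] += p
        p_r[r] += p
    losses = []
    for r, pr in p_r.items():
        posterior = {x: joint.get((x, r), 0.0) / pr for x in p_x}
        losses.append(sum(q * log2(q / p_x[x]) for x, q in posterior.items() if q > 0))
    return max(losses)

# X uniform on {5, 6}; R1 adds mu uniform in {-1, 0, 1}, R2 adds mu uniform in {0, 1} (mod 11)
joint_r1 = {(x, (x + mu) % 11): 0.5 / 3 for x in (5, 6) for mu in (-1, 0, 1)}
joint_r2 = {(x, (x + mu) % 11): 0.5 / 2 for x in (5, 6) for mu in (0, 1)}
print(worst_case_privacy_loss(joint_r1), worst_case_privacy_loss(joint_r2))  # 1.0 and 1.0
```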
Data Anonymization: Summary Randomization techniques useful for microdata anonymization Randomization techniques differ in their loss of privacy Information theoretic measures useful to capture loss of privacy Expected KL divergence captures expected privacy loss [AA01] Maximum KL divergence captures worst case privacy loss [EGS03] Both are useful in practice Information Theory for Data Management - Divesh & Suresh
Outline Part 1 Introduction to Information Theory Application: Data Anonymization Application: Database Design Part 2 Review of Information Theory Basics Application: Data Integration Computing Information Theoretic Primitives Open Problems Information Theory for Data Management - Divesh & Suresh
Information Dependencies [DR00] Goal: use information theory to examine and reason about information content of the attributes in a relation instance Key ideas: Novel InD measure between attribute sets X, Y based on H(Y|X) Identify numeric inequalities between InD measures Results: InD measures are a broader class than FDs and MVDs Armstrong axioms for FDs derivable from InD inequalities MVD inference rules derivable from InD inequalities Information Theory for Data Management - Divesh & Suresh
Information Dependencies [DR00] Functional dependency: X → Y FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y])) Information Theory for Data Management - Divesh & Suresh
Information Dependencies [DR00] Result: FD X → Y holds iff H(Y|X) = 0 Intuition: once X is known, no remaining uncertainty in Y (In the example instance shown on the slide, H(Y|X) = 0.5, so the FD does not hold there) Information Theory for Data Management - Divesh & Suresh
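A small sketch of this characterization (illustrative code written for this writeup): compute H(Y|X) under the empirical distribution of a relation instance and test whether it is zero.

```python
from collections import Counter
from math import log2

def conditional_entropy(rows, x_attrs, y_attrs):
    """rows: list of dicts; returns H(Y|X) = H(X,Y) - H(X) under the empirical distribution."""
    def entropy_of(attrs):
        counts = Counter(tuple(row[a] for a in attrs) for row in rows)
        n = len(rows)
        return sum((c / n) * log2(n / c) for c in counts.values())
    return entropy_of(x_attrs + y_attrs) - entropy_of(x_attrs)

rows = [{"X": 1, "Y": "a"}, {"X": 1, "Y": "a"}, {"X": 2, "Y": "b"}, {"X": 3, "Y": "b"}]
print(conditional_entropy(rows, ["X"], ["Y"]))  # 0.0: the FD X -> Y holds on this instance

rows[1]["Y"] = "c"                              # now X = 1 maps to two different Y values
print(conditional_entropy(rows, ["X"], ["Y"]))  # 0.5 > 0: the FD is violated
```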
Information Dependencies [DR00] Multi-valued dependency: X →→ Y MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z) Information Theory for Data Management - Divesh & Suresh
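A small sketch of this definition (assumed code): test whether the instance equals the natural join of its projections onto (X, Y) and (X, Z).

```python
from itertools import product

def mvd_holds(rows, x_attrs, y_attrs, z_attrs):
    """rows: list of dicts keyed by attribute name; tests R(X,Y,Z) = R(X,Y) join R(X,Z)."""
    def proj(attrs):
        return {tuple(row[a] for a in attrs) for row in rows}
    xy, xz = proj(x_attrs + y_attrs), proj(x_attrs + z_attrs)
    nx = len(x_attrs)
    # natural join of the two projections on the shared X attributes
    joined = {xy_t + xz_t[nx:] for xy_t, xz_t in product(xy, xz) if xy_t[:nx] == xz_t[:nx]}
    original = {tuple(row[a] for a in x_attrs + y_attrs + z_attrs) for row in rows}
    return joined == original

rows = [{"X": 1, "Y": "a", "Z": "p"}, {"X": 1, "Y": "a", "Z": "q"},
        {"X": 1, "Y": "b", "Z": "p"}, {"X": 1, "Y": "b", "Z": "q"}]
print(mvd_holds(rows, ["X"], ["Y"], ["Z"]))      # True: X ->-> Y holds
print(mvd_holds(rows[:3], ["X"], ["Y"], ["Z"]))  # False: the join adds the tuple (1, b, q)
```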