
# 5. independence

##### Presentation Transcript

1. CSE 312, 2012 Autumn, W.L.Ruzzo 5. independence

2. independence Defn: Two events E and F are independent if P(EF) = P(E) P(F); otherwise, they are called dependent. If P(F) > 0, independence is equivalent to P(E|F) = P(E) (proof below).

3. independence • Roll two dice, yielding values D1 and D2 • 1) E = { D1 = 1 }, F = { D2 = 1 } • P(E) = 1/6, P(F) = 1/6, P(EF) = 1/36 • P(EF) = P(E)•P(F) ⇒ E and F independent • Intuitive; the two dice are not physically coupled • 2) G = {D1 + D2 = 5} = {(1,4),(2,3),(3,2),(4,1)} • P(E) = 1/6, P(G) = 4/36 = 1/9, P(EG) = 1/36 • P(EG) = 1/36 ≠ 1/54 = P(E)•P(G): not independent! • E, G are dependent events • The dice are still not physically coupled, but “D1 + D2 = 5” couples them mathematically: info about D1 constrains D2. (But dependence/independence is not always intuitively obvious; “use the definition, Luke”.)
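The two dice checks above can be reproduced by brute-force enumeration. The following is a minimal sketch, not part of the original slides; the helper names `prob` and `independent` are illustrative choices, and exact fractions are used so that the equality test in the definition is exact.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely ordered pairs (d1, d2).
OUTCOMES = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on (d1, d2)."""
    hits = sum(1 for o in OUTCOMES if event(o))
    return Fraction(hits, len(OUTCOMES))

def independent(a, b):
    """Apply the definition directly: P(AB) == P(A) * P(B)."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)

E = lambda o: o[0] == 1          # D1 = 1
F = lambda o: o[1] == 1          # D2 = 1
G = lambda o: o[0] + o[1] == 5   # D1 + D2 = 5

print("E, F independent:", independent(E, F))
print("E, G independent:", independent(E, G))
```

Using `Fraction` instead of floats avoids rounding questions when comparing a joint probability with a product.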

4. independence Two events E and F are independent if P(EF) = P(E) P(F); otherwise, they are called dependent. If P(F) > 0, independence is equivalent to P(E|F) = P(E). Three events E, F, G are independent if P(EF) = P(E) P(F), P(EG) = P(E) P(G), P(FG) = P(F) P(G), and P(EFG) = P(E) P(F) P(G). Example: Let X, Y each be ±1 with equal probability, independently. E = {X = 1}, F = {Y = 1}, G = {XY = 1}. P(EF) = P(E)P(F), P(EG) = P(E)P(G), P(FG) = P(F)P(G), but P(EFG) = 1/4 ≠ 1/8 = P(E)P(F)P(G) (because P(G|EF) = 1).
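The pairwise-but-not-mutually-independent example can be verified by enumerating the four outcomes. This is a small sketch under the assumption that X and Y are chosen independently and uniformly from {-1, 1}; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes (x, y), assuming X and Y are independent.
OUTCOMES = list(product((-1, 1), repeat=2))

def prob(event):
    return Fraction(sum(1 for o in OUTCOMES if event(o)), len(OUTCOMES))

def both(a, b):
    return lambda o: a(o) and b(o)

E = lambda o: o[0] == 1
F = lambda o: o[1] == 1
G = lambda o: o[0] * o[1] == 1

# Every pair satisfies the product rule...
pairwise = all(prob(both(a, b)) == prob(a) * prob(b)
               for a, b in [(E, F), (E, G), (F, G)])

# ...but the triple does not.
p_efg = prob(lambda o: E(o) and F(o) and G(o))
triple_product = prob(E) * prob(F) * prob(G)

print(pairwise, p_efg, triple_product)
```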

5. independence In general, events E1, E2, …, En are independent if for every subset S of {1,2,…, n}, we have $P\left(\bigcap_{i \in S} E_i\right) = \prod_{i \in S} P(E_i)$ (Sometimes this property holds only for small subsets S. E.g., E, F, G on the previous slide are pairwise independent, but not fully independent.)

6. independence Theorem: E, F independent ⇒ E, F^c independent. Proof: E = EF ∪ EF^c is a disjoint union, so P(EF^c) = P(E) – P(EF) = P(E) – P(E) P(F) = P(E) (1 – P(F)) = P(E) P(F^c). Theorem: if P(E)>0, P(F)>0, then E, F independent ⇔ P(E|F) = P(E) ⇔ P(F|E) = P(F). Proof: Note P(EF) = P(E|F) P(F), regardless of in/dep. Assume independent. Then P(E)P(F) = P(EF) = P(E|F) P(F) ⇒ P(E|F) = P(E) (÷ by P(F)). Conversely, P(E|F) = P(E) ⇒ P(E)P(F) = P(EF) (× by P(F)). The equivalence with P(F|E) = P(F) follows by symmetry.

7. biased coin Suppose a biased coin comes up heads with probability p, independent of other flips. P(n heads in n flips) = $p^n$. P(n tails in n flips) = $(1-p)^n$. P(exactly k heads in n flips) = $\binom{n}{k} p^k (1-p)^{n-k}$. Aside: note that the probability of some number of heads = $\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = (p + (1-p))^n = 1$, as it should, by the binomial theorem.

8. biased coin • Suppose a biased coin comes up heads with probability p, independent of other flips • P(exactly k heads in n flips) = $\binom{n}{k} p^k (1-p)^{n-k}$ • Note when p=1/2, this is the same result we would have gotten by considering n flips in the “equally likely outcomes” scenario. But p≠1/2 makes that inapplicable. Instead, the independence assumption allows us to conveniently assign a probability to each of the $2^n$ outcomes, e.g.: • Pr(HHTHTTT) = $p^2(1-p)\,p\,(1-p)^3 = p^{\#H}(1-p)^{\#T}$
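As a quick numeric check of the biased-coin formulas, the sketch below computes the probability of one specific flip sequence and the probability of exactly k heads, then confirms that the latter sums to 1 over k. The function names and the values p = 0.3, n = 7 are illustrative assumptions.

```python
from math import comb, isclose

def sequence_prob(seq, p):
    """Probability of one specific sequence of 'H'/'T' under independent flips."""
    return p ** seq.count("H") * (1 - p) ** seq.count("T")

def binom_pmf(k, n, p):
    """P(exactly k heads in n flips): comb(n, k) sequences, each equally probable."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p, n = 0.3, 7  # illustrative values, not from the slides
total = sum(binom_pmf(k, n, p) for k in range(n + 1))

print(sequence_prob("HHTHTTT", p))
print(isclose(total, 1.0))
```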

9. hashing • A data structure problem: fast access to small subset of data drawn from a large space. • A solution: hash function h:D→{0,...,n-1} crunches/scrambles names from large space into small one. E.g., if x is integer: • h(x) = x mod n • Good hash functions approximately randomize placement. • [Figure: h maps an item x in D, the (large) space of potential data items, say names or SSNs, only a few of which are actually used, to slot i = h(x) in R = {0, ..., n-1}, the (small) hash table containing actual data.]

10. indp hashing m strings hashed (uniformly) into a table with n buckets. Each string hashed is an independent trial. E = at least one string hashed to first bucket. What is P(E)? Solution: $F_i$ = string i not hashed into first bucket (i=1,2,…,m). $P(F_i)$ = 1 – 1/n = (n-1)/n for all i=1,2,…,m. Event $(F_1 F_2 \cdots F_m)$ = no strings hashed to first bucket. P(E) = $1 - P(F_1 F_2 \cdots F_m) = 1 - P(F_1) P(F_2) \cdots P(F_m) = 1 - ((n-1)/n)^m \approx 1 - \exp(-m/n)$
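To see how close the exponential approximation is, the exact expression and 1 - exp(-m/n) can be compared numerically. This is a small sketch; the function names and the sample sizes are illustrative.

```python
from math import exp

def p_first_bucket_hit(m, n):
    """Exact: 1 - ((n-1)/n)^m, using independence of the m hashes."""
    return 1 - ((n - 1) / n) ** m

def p_first_bucket_hit_approx(m, n):
    """Approximation 1 - exp(-m/n), reasonable when n is large."""
    return 1 - exp(-m / n)

for m, n in [(10, 10), (100, 100), (1000, 1000)]:
    print(m, n, p_first_bucket_hit(m, n), p_first_bucket_hit_approx(m, n))
```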

11. hashing m strings hashed (non-uniformly) to table w/ n buckets. Each string hashed is an independent trial, with probability $p_i$ of getting hashed to bucket i. E = at least 1 of buckets 1 to k gets ≥ 1 string. What is P(E)? Solution: $F_i$ = at least one string hashed into i-th bucket. P(E) = $P(F_1 \cup \cdots \cup F_k) = 1 - P((F_1 \cup \cdots \cup F_k)^c) = 1 - P(F_1^c F_2^c \cdots F_k^c)$ = 1 – P(no strings hashed to buckets 1 to k) = $1 - (1 - p_1 - p_2 - \cdots - p_k)^m$
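For small cases the closed form can be checked against a brute-force sum over all n^m bucket assignments. This is a sketch with made-up bucket probabilities; note that the code indexes buckets from 0, so "buckets 1 to k" in the slide corresponds to indices 0 to k-1 here.

```python
from itertools import product
from math import prod

def p_hit_formula(probs, k, m):
    """1 - (1 - p_1 - ... - p_k)^m."""
    return 1 - (1 - sum(probs[:k])) ** m

def p_hit_bruteforce(probs, k, m):
    """Sum the probability of every assignment that touches one of the first k buckets."""
    total = 0.0
    for assignment in product(range(len(probs)), repeat=m):
        if any(bucket < k for bucket in assignment):
            # Independent trials: the assignment's probability is a product.
            total += prod(probs[bucket] for bucket in assignment)
    return total

probs = [0.5, 0.2, 0.2, 0.1]  # hypothetical bucket probabilities, summing to 1
print(p_hit_formula(probs, 2, 3), p_hit_bruteforce(probs, 2, 3))
```

The brute-force version is exponential in m and is only meant as a sanity check.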

12. hashing • Let D0 ⊆ D be a fixed set of m strings, R = {0,...,n-1}. A hash function h:D→R is perfect for D0 if h:D0→R is injective (no collisions). How hard is it to find a perfect hash function? • Fix h; pick the m elements of D0 independently at random from D • Suppose h maps ≈ (1/n)th of D to each element of R. This is like the birthday problem: • P(h is perfect for D0) ≈ $\frac{n(n-1)\cdots(n-m+1)}{n^m}$

13. hashing • Let D0 ⊆ D be a fixed set of m strings, R = {0,...,n-1}. A hash function h:D→R is perfect for D0 if h:D0→R is injective (no collisions). How hard is it to find a perfect hash function? • Fix D0; pick h at random • E.g., if m = |D0| = 23 and n = 365, then there is ~50% chance that h is perfect for this fixed D0. If it isn’t, pick h’, h’’, etc. With high probability, you’ll quickly find a perfect one! • “Picking a random function h” is easier said than done, but, empirically, picking among a set of functions like • h(x) = (a•x + b) mod n • where a, b are random 64-bit ints is a start. Caution: this analysis is heuristic, not rigorous, but still useful.
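The ~50% figure for m = 23, n = 365 comes from the birthday-problem product n(n-1)⋯(n-m+1)/n^m. The sketch below evaluates that product and the implied expected number of tries, under the same heuristic assumption as the slide (each try behaves like an independent, uniformly random function); the function name is illustrative.

```python
def p_perfect(m, n):
    """Birthday-style probability that m keys land in m distinct buckets out of n."""
    p = 1.0
    for i in range(m):
        p *= (n - i) / n  # the (i+1)-th key must avoid the i buckets already used
    return p

p = p_perfect(23, 365)
print(p)      # close to one half
print(1 / p)  # expected number of independent tries (geometric), about 2
```

Since each try succeeds independently with probability p under this model, the number of tries until success is geometric with mean 1/p.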

14. network failure Consider the following parallel network [Figure: n routers in parallel, labeled p1, p2, …, pn]: n routers, ith has probability $p_i$ of failing, independently. P(there is functional path) = 1 – P(all routers fail) = $1 - p_1 p_2 \cdots p_n$

15. network failure Contrast: a series network [Figure: n routers in series, labeled p1, p2, …, pn]: n routers, ith has probability $p_i$ of failing, independently. P(there is functional path) = P(no routers fail) = $(1 - p_1)(1 - p_2) \cdots (1 - p_n)$
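The parallel and series formulas translate directly into code. A minimal sketch, with illustrative function names and failure probabilities:

```python
from math import prod

def p_path_parallel(fail_probs):
    """Parallel network: a path exists unless every router fails."""
    return 1 - prod(fail_probs)

def p_path_series(fail_probs):
    """Series network: a path exists only if no router fails."""
    return prod(1 - p for p in fail_probs)

fail_probs = [0.1, 0.2, 0.3]  # hypothetical per-router failure probabilities
print(p_path_parallel(fail_probs))
print(p_path_series(fail_probs))
```

Both functions rely on the independence assumption; with correlated failures neither product would be valid.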

16. deeper into independence Recall: Two events E and F are independent if P(EF) = P(E) P(F) If E & F are independent, does that tell us anything about P(EF|G), P(E|G), P(F|G), when G is an arbitrary event? In particular, is P(EF|G) = P(E|G) P(F|G) ? In general, no.

17. deeper into independence Roll two 6-sided dice, yielding values D1 and D2. E = { D1 = 1 }, F = { D2 = 6 }, G = { D1 + D2 = 7 }. E and F are independent. P(E|G) = 1/6 and P(F|G) = 1/6, but P(EF|G) = 1/6, not 1/36, so E|G and F|G are not independent!
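The conditional probabilities in this dice example can be confirmed by restricting the sample space to G. This is a sketch with illustrative helper names; `prob(event, given)` simply counts within the outcomes where `given` holds.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = list(product(range(1, 7), repeat=2))  # (d1, d2), equally likely

def prob(event, given=lambda o: True):
    """P(event | given), computed by counting inside the conditioning event."""
    pool = [o for o in OUTCOMES if given(o)]
    return Fraction(sum(1 for o in pool if event(o)), len(pool))

E = lambda o: o[0] == 1
F = lambda o: o[1] == 6
G = lambda o: o[0] + o[1] == 7
EF = lambda o: E(o) and F(o)

print(prob(EF) == prob(E) * prob(F))           # unconditional product rule
print(prob(EF, G), prob(E, G) * prob(F, G))    # conditional: 1/6 vs 1/36
```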

18. conditional independence • Definition: • Two events E and F are called conditionally independent given G, if • P(EF|G) = P(E|G) P(F|G) • Or, equivalently (assuming P(FG)>0), • P(E|FG) = P(E|G)

19. do CSE majors get fewer A’s? • Say you are in a dorm with 100 students • 10 are CS majors: P(C) = 0.1 • 30 get straight A’s: P(A) = 0.3 • 3 are CS majors who get straight A’s • P(CA) = 0.03 • P(CA) = P(C) P(A), so C and A independent • At faculty night, only CS majors and A students show up • So 37 students arrive • Of 37 students, 10 are CS ⇒ • P(C | C or A) = 10/37 ≈ 0.27 • Of 37 students, 30 get straight A’s, but only 3 of the 10 CS majors do ⇒ • P(A | C, C or A) = 3/10 = 0.3 < 0.81 ≈ 30/37 = P(A | C or A) • Seems CS major lowers your chance of straight A’s ☹ • Weren’t they supposed to be independent? • In fact, C and A are conditionally dependent at faculty night
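The dorm example can be replayed from the stated head counts (100 students, 10 CS majors, 30 straight-A students, 3 who are both). This is a sketch; the tuple encoding `(is_cs, has_straight_as)` and the helper names are illustrative.

```python
from fractions import Fraction

# (is_cs, has_straight_as): 3 both, 7 CS only, 27 A only, 63 neither.
students = ([(True, True)] * 3 + [(True, False)] * 7 +
            [(False, True)] * 27 + [(False, False)] * 63)

def prob(event, given=lambda s: True):
    pool = [s for s in students if given(s)]
    return Fraction(sum(1 for s in pool if event(s)), len(pool))

C = lambda s: s[0]
A = lambda s: s[1]
CA = lambda s: s[0] and s[1]
attends = lambda s: s[0] or s[1]   # who shows up at faculty night

print(prob(CA) == prob(C) * prob(A))                                   # whole dorm
print(prob(CA, attends) == prob(C, attends) * prob(A, attends))        # attendees only
print(prob(A, attends), prob(A, lambda s: attends(s) and C(s)))        # 30/37 vs 3/10
```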

20. conditioning can also break DEPENDENCE Randomly choose a day of the week. A = { It is not a Monday }, B = { It is a Saturday }, C = { It is the weekend }. P(A) = 6/7, P(B) = 1/7, P(AB) = 1/7 ≠ 6/49 = P(A) P(B), so A and B are dependent events. Now condition both A and B on C: P(A|C) = 1, P(B|C) = ½, P(AB|C) = ½. P(AB|C) = P(A|C) P(B|C) ⇒ A|C and B|C independent. Dependent events can become independent by conditioning on additional information! Another reason why conditioning is so useful.
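The day-of-week example is small enough to check by listing the seven days. A sketch with illustrative names, assuming each day is equally likely:

```python
from fractions import Fraction

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def prob(event, given=lambda d: True):
    pool = [d for d in DAYS if given(d)]
    return Fraction(sum(1 for d in pool if event(d)), len(pool))

A = lambda d: d != "Mon"
B = lambda d: d == "Sat"
C = lambda d: d in ("Sat", "Sun")
AB = lambda d: A(d) and B(d)

dependent = prob(AB) != prob(A) * prob(B)
cond_independent = prob(AB, C) == prob(A, C) * prob(B, C)
print(dependent, cond_independent)
```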

21. independence: summary • Events E & F are independent if • P(EF) = P(E) P(F), or, equivalently P(E|F) = P(E) (if P(F)>0) • More than 2 events are indp if, for all subsets, joint probability = product of separate event probabilities • Independence can greatly simplify calculations • For fixed G, conditioning on G gives a probability measure, P(E|G) • But “conditioning” and “independence” are orthogonal: • Events E & F that are (unconditionally) independent may become dependent when conditioned on G • Events that are (unconditionally) dependent may become independent when conditioned on G

22. CSE 312, 2012 Autumn, W.L.Ruzzo 6. random variables

23. random variables • A random variable is some numeric function of the outcome, not the outcome itself. (Technically, neither random nor a variable, but...) • Ex. • Let H be the number of Heads when 20 coins are tossed • Let T be the total of 2 dice rolls • Let X be the number of coin tosses needed to see 1st head • Note: even if the underlying experiment has “equally likely outcomes,” the associated random variable may not

24. numbered balls

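25. first head (memorize me!) • Flip a (biased) coin repeatedly until 1st head observed • How many flips? Let X be that number. • P(X=1) = P(H) = p • P(X=2) = P(TH) = (1-p)p • P(X=3) = P(TTH) = $(1-p)^2 p$ • ... • In general, $P(X=i) = (1-p)^{i-1} p$ • Check that it is a valid probability distribution (for 0 < p ≤ 1): • 1) $P(X=i) \ge 0$ for all i ≥ 1 • 2) $\sum_{i \ge 1} (1-p)^{i-1} p = p \cdot \frac{1}{1-(1-p)} = 1$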
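The validity check can also be done numerically: the first N terms of P(X=i) = (1-p)^(i-1) p sum to 1 - (1-p)^N, which tends to 1. A small sketch, with p = 0.3 as an arbitrary illustrative value:

```python
from math import isclose

def first_head_pmf(i, p):
    """P(X = i): i-1 tails followed by one head."""
    return (1 - p) ** (i - 1) * p

def partial_sum(n_terms, p):
    return sum(first_head_pmf(i, p) for i in range(1, n_terms + 1))

p = 0.3  # illustrative bias
for n_terms in (1, 5, 20, 100):
    # The partial sum equals 1 minus the chance of n_terms tails in a row.
    print(n_terms, partial_sum(n_terms, p), 1 - (1 - p) ** n_terms)
```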

26. probability mass functions

27. head count [Figure: plots for n = 2 and n = 8]

28. cumulative distribution function [Figure: pmf and cdf plots] NB: for discrete random variables, be careful about “≤” vs “<”

29. why random variables • Why use random variables? • A. Often we just care about numbers • If I win \$1 per head when 20 coins are tossed, what are my average winnings? What is the most likely number? What is the probability that I win < \$5? ... • B. It cleanly abstracts away from unnecessary detail about the experiment/sample space; PMF is all we need. • Flip 7 coins, roll 2 dice, and throw a dart; if dart landed in sector = dice roll mod #heads, then X = ...

30. expectation

31. expectation average of random values, weighted by their respective probabilities

32. expectation average of random values, weighted by their respective probabilities

33. expectation average of random values, weighted by their respective probabilities

34. first head How much would you pay to play?

35. how many heads How much would you pay to play?

36. expectation of a function of a random variable E[X] = $\sum_i i\,p(i)$ = 252/36 = 7. E[Y] = $\sum_j j\,q(j)$ = 72/36 = 2.

37. expectation of a function of a random variable E[Y] = $\sum_j j\,q(j)$ = 72/36 = 2. E[g(X)] = $\sum_i g(i)\,p(i)$ = 72/36 = 2.

38. expectation of a function of a random variable [Figure: g maps each value $x_i$ of X to a value $y_j$ of Y; several $x_i$ may map to the same $y_j$.] Note that the sets $S_j = \{ x_i \mid g(x_i) = y_j \}$ form a partition of the domain of g. (BT pg. 84-85)
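The two ways of computing E[g(X)] can be compared in code. The slides as transcribed do not say what X and g are, so the following is a sketch under assumptions: X is taken to be the sum of two fair dice (which gives 252/36 = 7) and g(x) = x mod 5 is a hypothetical choice that happens to reproduce 72/36 = 2.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# ASSUMPTION: X is the sum of two fair dice.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
p = {x: Fraction(c, 36) for x, c in counts.items()}   # pmf of X

def g(x):
    return x % 5   # ASSUMPTION: hypothetical g, chosen to match the slide's numbers

# Route 1: sum over values of X, weighting g(x) by p(x).
E_X = sum(x * px for x, px in p.items())
E_gX = sum(g(x) * px for x, px in p.items())

# Route 2: build the pmf q of Y = g(X) by grouping the x's that share a g-value
# (the sets S_j in the partition argument), then sum over values of Y.
q = {}
for x, px in p.items():
    q[g(x)] = q.get(g(x), Fraction(0)) + px
E_Y = sum(y * qy for y, qy in q.items())

print(E_X, E_gX, E_Y)
```

Both routes agree because grouping by g-value only rearranges the terms of the same finite sum.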

39. properties of expectation • A & B each bet \$1, then flip 2 coins: • Let X be A’s net gain: +1, 0, -1, with probabilities 1/4, 1/2, 1/4, resp. • What is E[X]? • E[X] = 1•1/4 + 0•1/2 + (-1)•1/4 = 0 • What is $E[X^2]$? • $E[X^2] = 1^2 \cdot 1/4 + 0^2 \cdot 1/2 + (-1)^2 \cdot 1/4 = 1/2$ • Note: $E[X^2] \ne E[X]^2$

40. properties of expectation • Linearity of expectation, I • For any constants a, b: E[aX + b] = aE[X] + b • Proof: $E[aX+b] = \sum_x (ax+b)\,p(x) = a\sum_x x\,p(x) + b\sum_x p(x) = aE[X] + b$ • Example: • Q: In the 2-person coin game above, what is E[2X+1]? • A: E[2X+1] = 2E[X]+1 = 2•0 + 1 = 1
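The coin-game numbers can be recomputed from the pmf implied by the E[X] calculation (values +1, 0, -1 with probabilities 1/4, 1/2, 1/4). A minimal sketch; the `expect` helper is an illustrative name.

```python
from fractions import Fraction

# pmf of A's net gain X, as used in the E[X] computation.
pmf = {1: Fraction(1, 4), 0: Fraction(1, 2), -1: Fraction(1, 4)}

def expect(g, pmf):
    """E[g(X)] = sum over x of g(x) * p(x)."""
    return sum(g(x) * px for x, px in pmf.items())

E_X = expect(lambda x: x, pmf)
E_X2 = expect(lambda x: x * x, pmf)
E_lin = expect(lambda x: 2 * x + 1, pmf)

print(E_X, E_X2, E_lin)
print(E_lin == 2 * E_X + 1)   # linearity
print(E_X2 == E_X ** 2)       # squaring does not commute with expectation
```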

41. properties of expectation • Linearity, II • Let X and Y be two random variables derived from outcomes of a single experiment. Then E[X+Y] = E[X] + E[Y]. True even if X, Y dependent. • Proof: Assume the sample space S is countable. (The result is true without this assumption, but I won’t prove it.) Let X(s), Y(s) be the values of these r.v.’s for outcome s∈S. Claim: $E[X] = \sum_{s \in S} X(s)\,p(s)$ • Proof of claim: similar to that for “expectation of a function of an r.v.,” i.e., the events “X=x” partition S, so the sum above can be rearranged to match the definition of E[X]. • Then: $E[X+Y] = \sum_{s \in S} (X(s) + Y(s))\,p(s) = \sum_{s \in S} X(s)\,p(s) + \sum_{s \in S} Y(s)\,p(s) = E[X] + E[Y]$

42. properties of expectation • Example • X = # of heads in one coin flip, where P(X=1) = p. • What is E[X]? • E[X] = 1•p + 0•(1-p) = p • Let $X_i$, 1 ≤ i ≤ n, be # of H in flip of coin with $P(X_i=1) = p_i$ • What is the expected number of heads when all are flipped? • $E[\sum_i X_i] = \sum_i E[X_i] = \sum_i p_i$ • Special case: $p_1 = p_2 = \cdots = p$: • E[# of heads in n flips] = np • ☜ Compare to slide 35
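Linearity gives the expected head count as the sum of the individual probabilities without any enumeration. For a handful of coins this can be checked against the full sum over all 2^n outcomes. The sketch below assumes the flips are independent (needed only to assign outcome probabilities in the brute-force version, not for linearity itself); the biases are made up.

```python
from itertools import product
from math import prod

def expected_heads_bruteforce(ps):
    """Sum (#heads * probability) over all 2^n outcomes, assuming independent flips."""
    total = 0.0
    for flips in product((0, 1), repeat=len(ps)):
        pr = prod(p if f else 1 - p for f, p in zip(flips, ps))
        total += sum(flips) * pr
    return total

ps = [0.1, 0.5, 0.9, 0.25]  # hypothetical per-coin head probabilities
print(expected_heads_bruteforce(ps), sum(ps))
```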

43. properties of expectation • Note: • Linearity is special! • It is not true in general that • E[X•Y] = E[X] • E[Y] • $E[X^2] = E[X]^2$ (← counterexample above) • E[X/Y] = E[X] / E[Y] • E[asinh(X)] = asinh(E[X])

44. variance

45. risk • Alice & Bob are gambling (again). X = Alice’s gain per flip: • E[X] = 0 • . . . Time passes . . . • Alice (yawning) says “let’s raise the stakes” • E[Y] = 0, as before. • Are you (Bob) equally happy to play the new game?

46. E[X] measures the “average” or “central tendency” of X. • What about its variability? • If E[X] = μ, then E[|X-μ|] seems like a natural quantity to look at: how much do we expect X to deviate from its average. Unfortunately, it’s a bit inconvenient mathematically; the following is easier/more common. • Definition • The variance of a random variable X with mean E[X] = μ is • $Var[X] = E[(X-\mu)^2]$, often denoted $\sigma^2$. • The standard deviation of X is $\sigma = \sqrt{Var[X]}$

47. what does variance tell us? • The variance of a random variable X with mean E[X] = μ is • $Var[X] = E[(X-\mu)^2]$, often denoted $\sigma^2$. • I: Square always ≥ 0, and exaggerated as X moves away from μ, so Var[X] emphasizes deviation from the mean. • II: Numbers vary a lot depending on exact distribution of X, but typically X is • within μ ± σ ~66% of the time, and • within μ ± 2σ ~95% of the time. • (We’ll see the reasons for this soon.)

48. mean and variance • μ = E[X] is about location; σ = √Var(X) is about spread • [Figure: two pmf plots with μ marked: # heads in 20 flips, p=.5 (σ≈2.2); # heads in 150 flips, p=.5 (σ≈6.1)] • (and note σ bigger in absolute terms in second ex., but smaller as a proportion of max.)

49. risk • Alice & Bob are gambling (again). X = Alice’s gain per flip: • E[X] = 0 Var[X] = 1 • . . . Time passes . . . • Alice (yawning) says “let’s raise the stakes” • E[Y] = 0, as before. Var[Y] = 1,000,000 • Are you (Bob) equally happy to play the new game?

50. example • Two games: • a) flip 1 coin, win Y = \$100 if heads, -\$100 if tails • b) flip 100 coins, win Z = (#(heads) - #(tails)) dollars • Same expectation in both: E[Y] = E[Z] = 0 • Same extremes in both: max gain = \$100; max loss = \$100 • But variability is very different: $\sigma_Y = 100$, $\sigma_Z = 10$
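The two standard deviations can be computed straight from the definition Var[X] = E[(X-μ)²] applied to each game's pmf. This is a sketch assuming fair coins; `mean_and_sd` is an illustrative helper, and game b's pmf is built from the binomial count of heads h via Z = h - (100 - h) = 2h - 100.

```python
from math import comb, sqrt

def mean_and_sd(pmf):
    """Return (mu, sigma) for a pmf given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    return mu, sqrt(var)

# Game a: one fair coin, +100 or -100.
game_a = {100: 0.5, -100: 0.5}

# Game b: 100 fair coins, winnings = #heads - #tails = 2h - 100.
n = 100
game_b = {2 * h - n: comb(n, h) / 2 ** n for h in range(n + 1)}

print(mean_and_sd(game_a))
print(mean_and_sd(game_b))
```

Both games share the same mean and the same extreme values, so the pmf-level computation is what exposes the difference in spread.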