Sequential Learning

Sequential Learning James B. Elsner and Thomas H. Jagger Department of Geography, Florida State University Some material based on notes from a one-day short course on Bayesian modeling and prediction given by David Draper (http://www.ams.ucsc.edu/~draper)

HIV Screening As we discussed, the three definitions of probability are: classical, frequentist, and Bayesian. Consider the problem of screening for HIV. Widespread screening for HIV has been proposed by some people in some countries (e.g., the U.S.) Recall: Two blood tests that screen for HIV are widely available: ELISA—relatively inexpensive (roughly $20) and fairly accurate. Western Blot (WB)—considerably more accurate, but cost quite a bit more (about $100).

A new patient comes to You, a physician, with symptoms that suggest he may be HIV positive (You = a generic person making assessments of uncertainty). Questions: • Is it appropriate to use the language of probability to quantify Your uncertainty about the proposition A = {This patient is HIV positive}? • If so, what kinds of probability are appropriate, and how would You assess P(A) in each case? • What strategy (e.g., ELISA, WB, both?) should You employ to decrease Your uncertainty about A? If You decide to run a screening test, how should Your uncertainty be updated in light of the test results?

Let’s say that, with this patient’s values of relevant demographic variables, the prevalence of HIV estimated from the medical literature, P(A) = P(he’s HIV positive), in his recognizable subpopulation is about 1/100 = 0.01 (1%). To improve this estimate by gathering data specificto this patient, You decide to take some blood and get a result from ELISA. Suppose the test comes back positive—what is Your updated P(A)? Bayesian probability has that name because of the simple updating rule attributed to Thomas Bayes who was one of the first to define conditional probability and make calculations on it.

The conditional probability of an event B in relationship to an event A is the probability of that event B occurs given that A has already occurred. P(B|A): conditional probability of B given A. P(B|A) = P(A and B)/P(A) P(A and B): probability of events A and B both occurring. P(A) P(B) P(A and B)

In New England, 84% of the houses have a garage and 65% of the houses have a garage and a backyard. What is the probability that a house has a backyard given it has a garage? Answer: P(backyard | garage) = P(garage and backyard)/P(garage) = 0.65/0.84 = 0.77 or 77%. Simple…, lets try another. In a certain 3-box game, each box contains 2 chips. The first has 2 red chips, the second has 2 green chips, and the third has 1 green and 1 red chip. We do not know which box contains which chips. We take one chip out of one box without looking inside. It is a green chip. What are the chances that the second chip is also green? Answers?

Let, Event A = {second chip in the box is green} Event B = {first chip in the box is green} P(A|B) = P(A and B)/P(B) P(A and B) = 1/3 since only 1 box in 3 have both green chips. P(B) = 3/6 = 1/2 since there are 3 green chips out of a total of 6. P(A|B) = 1/3 divided by 1/2 = 2/3. Another? Monty Hall Game Simulator http://www.shodor.org/interactivate/activities/monty3/index.html

Problems involving conditional probability often lead to results that are unexpected. In fact, many people have a hard time accepting results for these problems, as the results may seem counterintuitive. That may be one reason for the reluctance to embrace Bayesian thinking? Bayes’ theorem is derived from the definition of conditional probability. As we’ve seen: P(A|B) = P(A and B)/P(B) (1) But, we can also write: P(B|A) = P(B and A)/P(A) Multiplying through yields: P(A) P(B|A) = P(B and A) (2) Since: P(A and B) = P(B and A), we can write eq (2) as: P(A) P(B|A) = P(A and B) (3) Then substituting eq (3) into eq (1) yields P(A|B) = P(A) P(B|A)/P(B)

Bayes’ Theorem for propositions: P(A|D) = P(A) P(D|A) / P(D) In the usual application, A is an unknown quantity (such as the truth of some proposition) and D stands for some data relevant to Your uncertainty about A. P(unknown|data) = P(unknown) P(data|unknown)/c, where c = normalizing constant posterior = cX prior X likelihood The terms prior and posterior emphasize the sequential nature of the learning process: P(unknown) was Your uncertainty assessment before the data arrived (Your prior).

Your prior is updated multiplicatively on the probability scale by the likelihood P(data|unknown), and renormalized so that the total probability remains 1 (100%). Writing Bayes’ Theorem both for A and (not A) and combining gives a (perhaps even more) useful version: Bayes’ Theorem in odds form P(A|data) / P(not A|data) = P(A) / P(not A) XP(data|A)/P(data|not A) posterior odds = prior odds X Bayes’ factor Other names for Bayes’ factor are the data odds and the likelihood ratio, since this factor measures the relative plausibility of the data given A and (not A).

Applying Bayes’ Theorem to the HIV example requires additional information about ELISA obtained by screening the blood of people with known HIV status. sensitivity = P(ELISA positive|HIV positive) specificity = P(ELISA negative|HIV negative) These are called ELISA’s operating characteristics. They are rather good—sensitivity of about 0.95, and specificity of about 0.98. Thus, you might well expect that P(this patient HIV positive|ELISA positive) would be close to 1.

Here the updating produces a surprising result: Bayes factor comes out as B = sensitivity/(1 – specificity) = 0.95/0.02 = 47.5 which sounds like strong evidence that this patient is HIV positive. But, the prior odds are quite a bit stronger the other way prior odds = P(A)/[ 1 - P(A)] = 99 to 1 against HIV Thus multiplying prior odds by Bayes factor we get the posterior odds of 99/47.5 = 2.08 against HIV. To turn odds into probability we write P(HIV positive|data) = 1 / (1+odds) = 0.32 (32%). Bayesian calculator.

The reason Your posterior probability that Your patient is HIV positive is only 32% in light of the strong evidence from ELISA is that ELISA was designed to have a vastly better false negative rate—P(HIV positive|ELISA negative); P(HIV positive|ELISA negative)=5/9707 = 0.00052 or 1 in 1941, in comparison to its false positive rate—P(HIV negative|ELISA positive); P(HIV negative|ELISA positive)=198/293 = 0.68 or 2 in 3. This in turn is because ELISA’s developers judged it far worse to tell someone who’s HIV positive that they’re not than the other way around (reasonable for using ELISA for blood bank screening for instance).

The false positive rate would make widespread screening for HIV based only on ELISA a truly bad idea. Formalizing the consequences of the two types of error in diagnostic screening would require quantifying misclassification costs, which shifts the focus from (scientific) inference (the acquisition of knowledge for its own sake: Is this patient really HIV positive?) to decision making (putting that knowledge to work to answer a public policy or business question, e.g.: What use of ELISA and Western Blot would yield the optimal screening strategy?) Bayesian calculator.

From which bowl is the cookie? To illustrate, suppose there are two bowls full of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1? Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let H1 corresponds to bowl #1, and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The datum D is the observation of a plain cookie. From the contents of the bowls, we know that P(D | H1) = 30/40 = 0.75 and P(D | H2) = 20/40 = 0.5.

In the courtroom • Bayesian inference can be used to coherently assess additional evidence of guilt in a court setting. • Let G be the event that the defendant is guilty. • Let E be the event that the defendant's DNA matches DNA found at the crime scene. • Let p(E | G) be the probability of seeing event E assuming that the defendant is guilty. (Usually this would be taken to be unity.) • Let p(G | E) be the probability that the defendant is guilty assuming the DNA match event E • Let p(G) be the probability that the defendant is guilty, based on the evidence other than the DNA match. • Bayesian inference tells us that if we can assign a probability p(G) to the defendant's guilt before we take the DNA evidence into account, then we can revise this probability to the conditional probability p(G | E), since • p(G | E) = p(G) p(E | G) / p(E) • Suppose, on the basis of other evidence, a juror decides that there is a 30% chance that the defendant is guilty. Suppose also that the forensic evidence is that the probability that a person chosen at random would have DNA that matched that at the crime scene was 1 in a million, or 10-6. • The event E can occur in two ways. Either the defendant is guilty (with prior probability 0.3) and thus his DNA is present with probability 1, or he is innocent (with prior probability 0.7) and he is unlucky enough to be one of the 1 in a million matching people. • Thus the juror could coherently revise his opinion to take into account the DNA evidence as follows: • p(G | E) = (0.3 × 1.0) /(0.3 × 1.0 + 0.7 × 10-6) = 0.99999766667. • In the United Kingdom, Bayes' theorem was explained by an expert witness to the jury in the case of Regina versus Dennis John Adams. The case went to Appeal and the Court of Appeal gave their opinion that the use of Bayes' theorem was inappropriate for jurors.

Sequential Learning

Sequential Learning

Presentation Transcript

Sequential Learning

SEQUENTIAL LOGIC

A Review of Sequential Supervised Learning

Sequential Flexibility

Sequential Arithmetic

Max-margin sequential learning methods

Sequential Logic

Sequential Networks

Sequential Statements

Revisiting Output Coding for Sequential Supervised Learning

Sequential Circuit

Stacked Sequential Learning

SEQUENTIAL LOGIC

Sequential Circuits

Sequential Logic

Sequential and Spatial Supervised Learning

Sequential Logic

Sequential Learning with Dependency Nets

Sequential Paragraph

Sequential Off-line Learning with Knowledge Gradients

Sequential Logic

Synchronous Sequential Logic Sequential Circuits

Sea Ice

Sea Ice