350 likes | 509 Vues
Primer on Probability. Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 25 th , 2012. BMI/CS 576. Definition of probability.
E N D
Primer on Probability Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 25th, 2012 BMI/CS 576
Definition of probability • frequentist interpretation: the probability of an event from a random experiment is the proportion of the time events of same kind will occur in the long run, when the experiment is repeated • examples • the probability my flight to Chicago will be on time • the probability this ticket will win the lottery • the probability it will rain tomorrow • always a number in the interval [0,1] 0 means “never occurs” 1 means “always occurs”
Sample spaces • sample space: a set of possible outcomes for some event • event: a subset of sample space • examples • flight to Chicago: {on time, late} • lottery: {ticket 1 wins, ticket 2 wins,…,ticket n wins} • weather tomorrow: {rain, not rain} or {sun, rain, snow} or {sun, clouds, rain, snow, sleet} or…
Random variables • random variable: a function associating a value with an attribute of the outcome of an experiment • example • X represents the outcome of my flight to Chicago • we write the probability of my flight being on time as P(X = on-time) • or when it’s clear which variable we’re referring to, we may use the shorthand P(on-time)
Notation • uppercase letters and capitalized words denote random variables • lowercase letters and uncapitalized words denote values • we’ll denote a particular value for a variable as follows • we’ll also use the shorthand form • for Boolean random variables, we’ll use the shorthand
0.3 0.2 0.1 sun rain sleet snow clouds Probability distributions • if X is a random variable, the function given by P(X = x)for each x is the probability distribution of X • requirements:
Joint distributions • joint probability distribution: the function given by P(X = x, Y = y) • read “X equals xandY equals y” • example probability that it’s sunny and my flight is on time
Marginal distributions • the marginal distribution of X is defined by “the distribution of X ignoring other variables” • this definition generalizes to more than two variables, e.g.
Marginal distribution example joint distribution marginal distribution for X
Conditional distributions • the conditional distribution of Xgiven Y is defined as: “the distribution of X given that we know the value of Y”
Conditional distribution example conditional distribution for X givenY=on-time joint distribution
Independence • two random variables, X and Y, are independent if
Independence example #1 joint distribution marginal distributions Are X and Y independent here? NO.
Independence example #2 joint distribution marginal distributions Are X and Y independent here? YES.
Conditional independence • two random variables X and Y are conditionally independent given Z if • “once you know the value of Z, knowing Y doesn’t tell you anything about X” • alternatively
Conditional independence example Are Fever andHeadache independent? NO.
Conditional independence example Are Fever and Vomitconditionally independent given Flu: YES.
Chain rule of probability • for two variables • for three variables • etc. • to see that this is true, note that
Bayes theorem • this theorem is extremely useful • there are many cases when it is hard to estimate P(x| y) directly, but it’s not too hard to estimate P(y| x) andP(x)
Bayes theorem example • MDs usually aren’t good at estimating P(Disorder| Symptom) • they’re usually better at estimating P(Symptom| Disorder) • if we can estimate P(Fever| Flu) and P(Flu) we can use Bayes’ Theorem to do diagnosis
Expected values • the expected value of a random variable that takes on numerical values is defined as: this is the same thing as the mean • we can also talk about the expected value of a function of a random variable
Expected value examples • Suppose each lottery ticket costs $1 and the winning ticket pays out $100. The probability that a particular ticket is the winning ticket is 0.001.
The binomial distribution • distribution over the number of successes in a fixed number n of independent trials (with same probability of success p in each) • e.g. the probability of x heads in ncoin flips p=0.5 p=0.1 P(X=x) x x
The multinomial distribution • k possible outcomes on each trial • probability pifor outcome xi in each trial • distribution over the number of occurrences xifor each outcome in a fixed number n of independent trials • e.g. with k=6 (a six-sided die) and n=30 vector of outcome occurrences
Statistics of alignment scores Q: How do we assess whether an alignment provides good evidence for homology? A: determine how likely it is that such an alignment score would result from chance. What is “chance”? • real but non-homologous sequences • real sequences shuffled to preserve compositional properties • sequences generated randomly based upon a DNA/protein sequence model
Model forunrelatedsequences • we’ll assume that each position in the alignment is sampled randomly from some distribution of amino acids • let be the probability of amino acid a • the probability of an n-character alignment of x and y is given by
Model forrelatedsequences • we’ll assume that each pair of aligned amino acids evolved from a common ancestor • let be the probability that evolution gave rise to amino acid a in one sequence and b in another sequence • the probability of an alignment of x and y is given by
taking the log, we get Probabilistic model of alignments • How can we decide which possibility (U or R) is more likely? • one principled way is to consider the relative likelihood of the two possibilities
Probabilistic model of alignments • the score for an alignment is thus given by: • the substitution matrix score for the pair a, b should thus be given by:
Scores from random alignments • suppose we assume • sequence lengths m and n • a particular substitution matrix and amino-acid frequencies • and we consider generating random sequences of lengths m and n and finding the best alignment of these sequences • this will give us a distribution over alignment scores for random pairs of sequences
The extreme value distribution • but we’re picking thebest alignments, so we want to know what the distribution of max scores for alignments against a random set of sequences looks like • this is given by an extreme value distribution
Distribution of scores • the expected number of alignments, E, with score at least S is given by: • S is a given score threshold • m and n are the lengths of the sequences under consideration • K and are constants that can be calculated from • the substitution matrix • the frequencies of the individual amino acids
Statistics of alignment scores • to generalize this to searching a database, have n represent the summed length of the sequences in the DB (adjusting for edge effects) • the NCBI BLAST server does just this • theory for gapped alignments not as well developed • computational experiments suggest this analysis holds for gapped alignments (but K and must be estimated from data)