100 likes | 199 Vues
Explore the mathematical modeling of sequence analysis using a coin toss analogy to predict the average length of runs and alignment scores. Understand the intricacies of local and random alignment, unit distributions, Poisson distribution estimates, and the importance of lambda and K factors in refining the model. Dive into the treasure trove of probabilities inherent in analyzing unrelated sequences and discover the loose ends in gap parameters.
E N D
Pairwise Sequence Analysis-V • Relatedness • “Not just important, but everything” • Modeling Alignment Scores • Coin Tosses • Unit Distributions • Extreme Value Distribution • Lambda and K revealed • Loose Ends Lecture 6 CS566
Modeling Expectation • Reduced model: Coin tosses • Given: • N coin tosses • Probability of heads p • Problem: • What is the average number of longest run of heads? • Solution: • Experimental: Perform several repetitions and count • Theoretical: E(Runmax) = log1/pN • For example, for fair coin and 64 tosses, E(Runmax) = 6 Lecture 6 CS566
Random alignment as Coin tosses • Head = Match • Assume • Score = Run of matches • Maximum score = Longest run of matches • Therefore • Same model of expectation • For example: For DNA sequences of length N, • E(matchlengthmax) = Expected longest run of matches = log1/pN Lecture 6 CS566
Local alignment as Coin tosses • Assume • Score in local alignment = Run of matches • Maximum score = Longest run of matches • Therefore • Similar model of expectation • For DNA sequences of length n & m • E(Matchlengthmax) ~ log1/p(nm) (Why not just n or m?) ~ log1/p(K’nm) • Var(Matchlengthmax) = C (i.e., Independent of sample space) Lecture 6 CS566
Refining Model • S = AS matrix based scoring between unrelated sequences • E(S) ~ log1/p(K’nm) ~ [ln(Knm)]/ (where = loge 1/p) • Holy Grail: Need P(S > x), probability of a score between unrelated sequences exceeding x Lecture 6 CS566
Poisson distribution estimate of P(S > x) • Consider Coin Toss Example Given [x >> E(Runmax)] Define Success = (Runmax x) Define Pn = Probability of n successes Define y = E[Success],i.e., Average no. of successes Then, probability of n successes follows Poisson dist. Pn = (e-yyn)/n! Probability of 0 successes (No score exceeding x) is given by P0 = e-y. Then, probability of at least one score exceeding x, P(S > x) = i 0 Pi = (1 - P0) = 1 - e-y For Poisson distribution, y = Kmne-x. Therefore, P(S > x) = 1 – exp (-Kmne-x) Lecture 6 CS566
Unit Distributions • Normalize Gaussian and EVD • Area under curve = 1 • Curve maximum at 0 • Then • For Gaussian • Mean = 0; SD = 1 • P(S > x) = 1 – exp (-e-x) • For EVD • Mean = 0.577 (Euler cons); Variance = 2/6 = 1.645 • P(S > x) = 1 – exp (-e-(x-u)) • Z-score representation in terms of SDs • P (Z > z) = 1 – exp(-e-1.2825z – 0.5772) Lecture 6 CS566
Lambda and K • = Scale factor for scoring system • Effectively converts AS matrix values to actual natural log likelihoods • K = Scale factor that reduces search space to compensate for non-independence of local alignments • Esimated by fitting to Poisson approximation or equation for E(S) Lecture 6 CS566
Treasure Trove of Probabilities • Probability distribution of scores between unrelated sequences P(Sunrel) • Probability distribution of number of scores from P(Sunrel) exceeding some cut-off, mean represents number of scores exceeding cut-off observed on average • Probability of observing score x occurring between unrelated sequences P(S x) Lecture 6 CS566
Loose Ends • What about gap parameters? • Short answer: No formal theory • Long answer: • Found empirically • Choice of parameters can be used to convert local alignment algorithm into a global alignment • What about gapped alignment? • Not formally proven, but simulations show statistical behavior similar to ungapped alignment • Effective sequence length n’ = n – E(matchLengthmax) Lecture 6 CS566