1 / 30

Statistical significance for trace recalling alignments

Statistical significance for trace recalling alignments. Brief TR review. Generate ambiguity sequence from trace (up to 2 bases per position) Align ambiguity sequence to genomic sequence (primary alignment) Use aligned genome sequence to “subtract out” one template sequence

nizana
Télécharger la présentation

Statistical significance for trace recalling alignments

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Statistical significance for trace recalling alignments

  2. Brief TR review • Generate ambiguity sequence from trace (up to 2 bases per position) • Align ambiguity sequence to genomic sequence (primary alignment) • Use aligned genome sequence to “subtract out” one template sequence • Remaining sequence comes from other template

  3. Brief TR review • Can align this sequence to genome to differentiate real secondary trace from noise (secondary alignment)

  4. What constitutes a good alignment? • Expect primary alignment to have a very high percent identity • Represents stronger of 2 templates present • Ambiguity sequence so generally expect more matches than DNA vs. DNA alignment • Expect few (non-intron) gaps

  5. What constitutes a good alignment? • Expect secondary alignment to have lower percent identity • Weaker of 2 signals in trace • DNA + N only • Still don’t expect many gaps (N’s are used as placeholders)

  6. What constitutes a good alignment? • EST_GENOME doesn't report statistical significance of alignments • Have been using mainly percent identity and length of alignment to determine if procedure “worked” • Decent heuristic but not really the right way to determine significance (and can be misleading)

  7. What is the right way to determine significance? • Would like something like a BLAST E-value • Mathematically sound • Single number • No arbitrary cutoffs

  8. What is the BLAST E-value? • For a given search: • Scoring system (matrix and gap penalties) • Lengths of query and database • For a given MSP: • Score = S • E = expected frequency of chance occurrences of an MSP with score S or greater

  9. What is the BLAST E-value? • So if E = 5 you expect to see 5 MSPs with the same or better score and your MSP is not significant • If E = 10-50 you really don’t expect to see an alignment this good by chance

  10. How is E defined? • E = kmne-λS where: • m = length of query • n = length of database • S = score of MSP • k & λ are constants

  11. How is E defined? • For ungapped alignments there are closed form solutions for k & λ given the scoring matrix • For gapped alignments they must be estimated empirically

  12. What’s going on behind the scenes here? • The BLAST statistics paper (Karlin & Altschul, 1990) states the formula for E without proof • References another paper (Karlin, Dembo & Kawabata, 1990) in Annals of Statistics which contains the proof

  13. What’s going on behind the scenes here? • The BLAST statistics paper (Karlin & Altschul, 1990) states the formula for E without proof • References another paper (Karlin, Dembo & Kawabata, 1990) in Annals of Statistics which contains the proof 

  14. Yikes

  15. What’s going on behind the scenes here (take 2)? • MSP scores are distributed by an extreme value or Gumbel distribution • Looks like a normal distribution with probability density skewed to the right

  16. Extreme value distribution

  17. Why an extreme value distribution? • You get an EVD when the random variable you’re looking at is the maximum of another set of random variables • Might help to see this in another domain • Classic example: Floods • Record height of river every day (Xi) over many years • Define annual flood to be maximum river level for each calendar year (Yj)

  18. Why an extreme value distribution? • The Yj values will be distributed by an EVD • 100 year levees • Distribution of Yj values sensitive to distribution of Xi but not as much as you might suspect • Similar EVD if Xi is exponentially or normally distributed!

  19. Why an extreme value distribution? • MSP scores in an alignment are analogous to annual flood levels • Highest scoring pair between 2 sequences is maximum of several high scoring pairs

  20. What do k (or u) & λ do? • u = ln(kmn)/λ for alignment • EVD cdf  F(x) = e-e-λ(x-u) • EVF pdf  f(x) = λe-λ(x-u) e-e-λ(x-u) • λ controls the spread and u (or k) controls the location

  21. u controls position

  22. λ controls spread

  23. Back to trace recalling • Said earlier that for gapped alignment k & λ must be estimated • One method to get parameters is to generate many random sequences, align them to get optimal score and fit an EVD to them • u & λ can be obtained from mean and standard deviation of this set of scores

  24. Parameters • By method of moments • λ = π/sqrt(6*V) • u = μ – γ/ λ where γ = Euler’s constant = 0.577

  25. Histogram of optimal scores and fit EVD

  26. Back to trace recalling • Secondary alignment is straightforward (DNA vs. DNA alignments) • Primary alignment is a bit trickier • As originally implemented λ and u depended on percentage of ambiguous positions in ambiguity sequence

  27. Primary alignment parameters

  28. Why was this happening? • Gave same score to match between ambiguous positions and non-ambiguous positions • Random alignments to 100% ambiguous sequence were ~2x as long and high scoring as 0% ambiguous sequence

  29. How was it corrected? • Setting match score for ambiguous base at half score of non-ambiguous base fixed problem • This actually makes more sense anyway • No matter what percentage ambiguity positions get λ = 0.66 and k = 0.477

  30. In conclusion • Have added this to trace recalling pipeline • Now in addition to checking for alternate splices by comparing alignments can screen out bad alignments by E-value

More Related