1 / 26

Fast Local Alignment Methods

Fast Local Alignment Methods. Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan E-mail: kmchao@csie.ntu.edu.tw WWW: http://www.csie.ntu.edu.tw/~kmchao. FASTA.

chamblee
Télécharger la présentation

Fast Local Alignment Methods

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Fast Local Alignment Methods Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan E-mail: kmchao@csie.ntu.edu.tw WWW: http://www.csie.ntu.edu.tw/~kmchao

  2. FASTA • Find runs of identities, and identify regions with the highest density of identities. • Re-score using PAM matrix, and keep top scoring segments. • Eliminate segments that are unlikely to be part of the alignment. • Optimize the alignment in a band.

  3. FASTA Step 1: Find runes of identities, and identify regions with the highest density of identities. Sequence B Sequence A

  4. FASTA Step 2: Re-score using PAM matrix, andkeep top scoring segments.

  5. FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.

  6. FASTA Step 4: Optimize the alignment in a band.

  7. BLAST • Basic Local Alignment Search Tool(by Altschul, Gish, Miller, Myers and Lipman) • The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

  8. The maximal segment pair measure • A maximal segment pair (MSP) is defined to be the highest scoring pair of identical length segments chosen from 2 sequences.(for DNA: Identities: +5; Mismatches: -4) • The MSP score may be computed in time proportional to the product of their lengths. (How?) An exact procedure is too time consuming. • BLAST heuristically attempts to calculate the MSP score. the highest scoring pair

  9. BLAST • Build the hash table for Sequence A. • Scan Sequence B for hits. • Extend hits.

  10. BLAST Step 1: Build the hash table for Sequence A. (3-tuple example) For protein sequences: Seq. A = ELVISAdd xyz to the hash table if Score(xyz, ELV) ≧ T;Add xyz to the hash table if Score(xyz, LVI) ≧ T;Add xyz to the hash table if Score(xyz, VIS) ≧ T; For DNA sequences: Seq. A = AGATCGAT 12345678 AAAAAC..AGA 1..ATC 3..CGA 5..GAT 2 6..TCG 4..TTT

  11. BLAST Step2: Scan sequence B for hits.

  12. BLAST Step2: Scan sequence B for hits. Step 3: Extend hits. BLAST 2.0 saves the time spent in extension, and considers gapped alignments. hit Terminate if the score of the sxtension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)

  13. Remarks • Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments. • The idea of filtration was used in both FASTA and BLAST.

  14. PatternHunter The following slides are based on Bin Ma’s illustrations.

  15. A homology between mouse and human genomes GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC || ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA---- || ||||||||| |||||| | ||||| |||||||| ||| |||||||| | | | || GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG ||||||| |||| | | |||| ||||| || ||||| || |||||| ||||||||||||||| GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG Smith-Waterman is the most accurate method. Time complexity:O(mn).

  16. BLAST finds a “hit” and then extends GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC || ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA---- || ||||||||| |||||| | ||||| |||||||| ||| |||||||| | | | || GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG ||||||| |||| | | |||| ||||| || ||||| || |||||| ||||||||||||||| GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG Seed match = hit

  17. Example of missing a target • Fail: (Word size 11) GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT || ||||||||| |||||| | |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT • Dilemma • Sensitivity – needs shorter seeds • the success rate of finding a homology • Speed – needs longer seeds • Mega-BLAST uses seeds of length 28.

  18. PatternHunter uses “spaced seeds” • 111010010100110111 (called a model) • Eleven required matches (weight=11) • Seven “don’t care” positions GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT… || ||||||||| ||||| ||||||| |||||| GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT… 111010010100110111 • Hit = all the required matches are satisfied. • BLAST seed model = 11111111111

  19. Simulated sensitivity curves

  20. Weight of a seed • Lemma: The expected number of hits of a weight W length M seed model within a length L region with similarity p is (L-M+1)pW • Proof: There are (L-M+1) positions a hit can occur. At each position, pW hit is expected. Q.E.D. • Seed models with the same weight generate approximately the same amount of hits. • Speed is approximately the same. • Sensitivity is not necessarily the same. • num of hits v.s. num of regions that contain hits. GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT || ||||||||| ||||| ||||||| |||||| GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT 111010010100110111

  21. Why spaced seeds are better? TTGACCTCACC? |||||||||||? TTGACCTCACC? 11111111111 11111111111 CAA?A??A?C??TA?TGG? |||?|??|?|??||?|||? CAA?A??A?C??TA?TGG? 111010010100110111 111010010100110111 • BLAST’s seed usually uses more than one hits to detect one homology (redundant) • Spaced seeds uses fewer hits to detect one homology (efficient)

  22. PH’s seed does not overlap heavily • PH’s seed do not overlap heavily when shifts: 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 ...... • The hits at different positions are independent. • The probability of having the second hit is 5*p6 + … • compare to BLAST’s model p + p2 + p3 + p4 + …

  23. Non-overlapping For optimized spaced seed, 111010010100110111 Non overlap Prob 111010010100110111 6 p6 111010010100110111 6 p6 111010010100110111 6 p6 111010010100110111 7 p7 ….. • For spaced seed: the divisor is 1+p6+p6+p6+p7+ … • For BLAST seed: the divisor is bigger: 1+ p + p2 + p3 + … Ming Li’s CPM’05 illustration.

  24. In a region of length 64 with p=0.7 • Hit more using less number of hits: • Pr(BLAST seed hits)=0.3 E(# of hits by BLAST seed)=1.07 • Pr(optimal spaced seed hits)=0.466, 50% more E(# of hits by spaced seed)=0.93, 14% less • the average number of hits in that region is • 2.0 for PH’s weight-11 seed • 3.6 for contiguous weight-11 seed.

  25. PatternHunter I performance Blastn MB28 PH • E.coli (4.7M) v.s. H.inf (1.8M) • 716s /158M 5s/561M 34s/78M • Arabidopsis chr2 (19.6M) v.s. chr4 (17.5M) • -- 21720s/1087M 5020s/279M • Human chr21 (26.2M) v.s. chr22 (35M) • -- -- 14512s/419M • All used a 700MHZ PentiumIII PC with 1G byte memory. • Human (3G) v.s. Mouse (3G)* • Using 2-hit, weight 12 seed, PH used 6 days with a 1GHZ PentiumIII PC with 2G byte memory. • With Blast, it would otherwise take months with parallel computers to finish.

  26. Homology search vs Google search • Internet search • Size limit: 5 billion people x homepage size • Supercomputing power used: ½ million CPU-hours/day • Query frequency: Google --- 112 million/day • Query type: exact keyword search --- easy to do • Homology search • Size limit: 5 billion people x 3 billion basepairs + millions of species x billion bases • 10% (?) of world’s supercomputing power • Query frequency: NCBI BLAST -- 150,000/day, 15% increase/month • Query type: approximate search Ming Li’s CPM’05 illustration.

More Related