1 / 34

Alignment of Genomic Sequences

Alignment of Genomic Sequences. Wen-Hsiung Li Ecology & Evolution Univ. of Chicago. Sequence Alignment. (1) pairs of matched bases (2) pairs of mismatched bases (3) pairs consisting of a base from one sequence and a gap (null base) from the other sequence.

eugene
Télécharger la présentation

Alignment of Genomic Sequences

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Alignment of Genomic Sequences Wen-Hsiung Li Ecology & Evolution Univ. of Chicago

  2. Sequence Alignment (1) pairs of matched bases(2) pairs of mismatched bases(3) pairs consisting of a base from one sequence and a gap (null base) from the other sequence

  3. Alignment as an Evolutionary Hypothesis TCAGA** *TC-GT

  4. A: TCAGACGATTGLA = 11B: TCGGAGCTGLB = 9

  5. Alignment ITCAG-ACG-ATTG|| | | | | |TC-GGA-GC-T-GMatches = 7Gaps = 6

  6. Alignment IITCAGACGATTG|| || TCGGAGCTG--Matches = 4Gaps = 1

  7. Alignment IIITCAG-ACGATTG|| | | | | TC-GGA-GCTG-Matches = 6Gaps = 4

  8. Which alignment is best?

  9. Gap and Mismatch Penalties • Gap penalty - a factor by which gap values are multiplied to make the gaps equivalent to mismatches • Mismatch penalty - an assessment of how frequently substitutions occur

  10. Similarity Index S = x - wkzk X : number of matches Zk : number of gaps of length k wk : positive number representing penalty for gaps of length k

  11. Distance (Dissimilarity) Index D = y + w'kzk y : number of mismatches zk : number of gaps of length k w'k: positive number representing penalty for gaps of length k

  12. Gap penalty systems • Fixed - no gap extension penalty • Affine or Linear - has two componenets gap opening penalty and gap extension penalty • Logarithmic - also has two components but the cost increases more slowly allowing longer gaps than the latter system

  13. Gap penalty systems Linear Logarithmic Fixed Gap penalty Gap length

  14. BEST Gap opening cost = 2 Gap opening cost = 3 Gap extension cost = 6 Gap extension cost = 0 TCAG-ACG-ATTG|| | | | | | S = -5 S = -11 TC-GGA-GC-T-G TCAGACGATTG|| || S = -4 S = 1 TCGGAGCTG-- TCAG-ACGATTG|| | | | | S = -2 S = -6TC-GGA-GCTG-

  15. Dynamic programming • Large searches are divided into succession of small stages: • solution of the initial search stage is trivial • each partial solution in a later stage can be calculated by reference to only a small number of solutions of the earlier stage • the final stage contains overall solution

  16. A T G C G A 1 0 0 0 0 T 0 2 1 1 1 C 0 1 2 3 2 C 0 1 2 3 3 G 0 1 3 2 4 C 0 1 2 4 3 Pointer values and paths connecting the pointers

  17. Traceback A T G C G A 1 0 0 0 0 T 0 2 1 1 1 C 0 1 2 3 2 C 0 1 2 3 3 G 0 1 3 2 4 C 0 1 2 4 3 ATGCG- || || ATCCGC AT--GCG || || ATCCGC-

  18. Similarity Index S = x - wkzk x - number of matches zk - number of gaps of length k wk - a positive number representing penalty for gaps of length k

  19. TCAGACGAGTG x = 6 (I) | | | | | | a gap of 2 bp TCGGA - - GCTG S = 6 - (a + 2b) TCAGACGAGTGx = 7 (II) | | | | | | | 2 gaps of 1 bp TCGGA -GC - TG S = 7 - 2(a + b) TCAGACGAGTG x = 7 (III) | | | | | | | 2 gaps of 1 bp TCGGA -G - CTG S = 7 - 2(a + b) TCAGACGAG - TG x = 8 (IV) | | | | | | | | 2 1-bp gaps; 1 2-bp gaps TC - G - - GAGCTGS = 8 - 2(a + b) - (a + 2b)

  20. How to align twolong genomic sequences?

  21. Traditional Seq. Alignment • The seqs. are usually known (coding or non-coding) and are homologous • They are not very long, usually < 10,000 base pairs (bp) • They contain no inversions • Relies on dynamic programming: The time and space required are O(N2), where N is the sequence length.

  22. The Human Genome • Genome size: ~3.2 billion bp • Only ~1.5% is coding. • Contains numerous repetitive elements (more than 4 million). • Introns are usually longer than exons. • Non-coding regions evolve fast and are not well conserved.

  23. Genomic Seq. Alignment • The seqs. can be > one million bp (Mb); e.g., the genome size of Mycobacteriumtuberculosis is about 4 Mb. Long time to align. Large computer memory. • May contain inversions and many tandem repeats. • May containnon-alignable (too divergent) segments.

  24. Genomic Seq. Alignment Strategy: Search for anchors that can divide the sequences into subregions. The gaps between anchors can then be aligned by a local alignment algorithm.

  25. The System of Delcher et al. (1999) • Three ideas: (1) Suffix trees; (2) the Longest Increasing Subsequence (LIS); and (3) the local alignment method of Smith and Waterman (1981) • Two closely homologous long sequences or genomes (A and B).

  26. Step 1: Perform a Maximum Unique Match (MUM) decomposition of the two sequences A MUM is a subsequence that occurs once in sequence A and once in sequence B, and is not contained in any longer such sequence.

  27. Max. Unique Matches (MUMs) MUM1 Seq. A tcgatcaAGCTCACTGATatgtaccat Seq. B cgagcgAGCTCACTGATcctgcatca MUM2 -acgctgaATCGACGTAGTCCATGtactgta agtgc-agATCGACGTAGTCCATGatgaat

  28. Suffix Trees A suffix is a subseq. that begins at any position in the seq. & extends to the seq. end. g a a c c g a c c t 1 2 3 4 5 6 7 8 9 10 A suffix: c c g a c c t A suffix tree is a compact representation that stores all possible suffixes of a seq.

  29. g a a c c g a c c t 1 2 3 4 5 6 7 8 9 10 Root o a t c ga 1 1 2 10 gacct accgacct accgacct cc t cct c 2 3 5 2 9 1 6 gacct t gacct t 3 7 4 8

  30. g a a c c g a c c t# g a a c c t a c c t* 1 2 3 4 5 6 7 8 9 10 Root o a t c ga 1 1 2 10 accgacct# gacct# acc cc t cct c 5 3 2 5 2 9 6 t gacct# gacct# gacct# t# 4 tacct* 3 7 4 8 1 1 7

  31. Step 2: Sort the MUMs After finding the MUMs, we sort them according to their positions in genome A. See figure. Longest Increasing Sequence (LIS): If the order of B positions is given by the sequence [1,2,10,4,5,8,6,7,9,3], the LIS is [1,2,4,5,6,7,9]. The LIS gives a global MUM-alignment.

  32. 3 5 6 1 2 4 7 Genome A: 3 6 5 Genome B: 1 2 4 7 6 1 2 4 7 Genome A: Genome B: 6 1 2 4 7

  33. Step 3: Close the gaps between MUMs Use the Smith-Waterman algorithm to close the gaps between MUMs. Some regions may be very difficult to align. These regions are ignored and considered as non-alignable parts. Default: If the gap between 2 MUMs is 10 kb, no local alignment is attempted.

More Related