400 likes | 660 Vues
Intro to Bioinformatics ? Sequence Alignment. 2. Sequence Alignments. Cornerstone of bioinformaticsWhat is a sequence?Nucleotide sequenceAmino acid sequencePairwise and multiple sequence alignmentsWe will focus on pairwise alignmentsWhat alignments can helpDetermine function of a newly discov
                
                E N D
1. Sequence Alignments 
2. Intro to Bioinformatics  Sequence Alignment 2 Sequence Alignments Cornerstone of bioinformatics
What is a sequence?
Nucleotide sequence
Amino acid sequence
Pairwise and multiple sequence alignments
We will focus on pairwise alignments
What alignments can help
Determine function of a newly discovered gene sequence
Determine evolutionary relationships among genes, proteins, and species
Predicting structure and function of protein 
3. Intro to Bioinformatics  Sequence Alignment 3 DNA Replication Prior to cell division, all the genetic instructions must be copied so that each new cell will have a complete set
DNA polymerase is the enzyme that copies DNA
Reads the old strand in the 3 to 5 direction 
4. Intro to Bioinformatics  Sequence Alignment 4 Over time, genes accumulate mutations 
5. Intro to Bioinformatics  Sequence Alignment 5 Codon deletion:ACG ATA GCG TAT GTA TAG CCG
Effect depends on the protein, position, etc.
Almost always deleterious
Sometimes lethal
Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG  ACG ATA GCG ATG TAT AGC CG?
Almost always lethal Deletions 
6. Intro to Bioinformatics  Sequence Alignment 6 Indels Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known:ACGTCTGATACGCCGTATCGTCTATCTACGTCTGAT---CCGTATCGTCTATCT 
7. Intro to Bioinformatics  Sequence Alignment 7 The Genetic Code 
8. Intro to Bioinformatics  Sequence Alignment 8 Comparing Two Sequences Point mutations, easy:ACGTCTGATACGCCGTATAGTCTATCTACGTCTGATTCGCCCTATCGTCTATCT
Indels are difficult, must align sequences:ACGTCTGATACGCCGTATAGTCTATCTCTGATTCGCATCGTCTATCTACGTCTGATACGCCGTATAGTCTATCT----CTGATTCGC---ATCGTCTATCT 
9. Intro to Bioinformatics  Sequence Alignment 9 Why Align Sequences? The draft human genome is available
Automated gene finding is possible
Gene:  AGTACGTATCGTATAGCGTAA
What does it do?
One approach:  Is there a similar gene in another species?
Align sequences with known genes
Find the gene with the best match 
10. Intro to Bioinformatics  Sequence Alignment 10 Gaps or No Gaps Examples 
11. Intro to Bioinformatics  Sequence Alignment 11 Scoring a Sequence Alignment Match score:		+1
Mismatch score:	+0
Gap penalty:		1ACGTCTGATACGCCGTATAGTCTATCT    ||||| |||   || ||||||||----CTGATTCGC---ATCGTCTATCT
Matches:  18  (+1)
Mismatches:  2  0
Gaps: 7  ( 1) 
12. Intro to Bioinformatics  Sequence Alignment 12 Origination and Length Penalties We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to you?ACGTCTGATACGCCGTATAGTCTATCTACGTCTGAT-------ATAGTCTATCTACGTCTGATACGCCGTATAGTCTATCTAC-T-TGA--CG-CGT-TA-TCTATCT
We can achieve this by penalizing more for a new gap, than for extending an existing gap 
13. Intro to Bioinformatics  Sequence Alignment 13 Scoring a Sequence Alignment (2) Match/mismatch score:		+1/+0
Origination/length penalty:	2/1ACGTCTGATACGCCGTATAGTCTATCT    ||||| |||   || ||||||||----CTGATTCGC---ATCGTCTATCT
Matches:  18  (+1)
Mismatches:  2  0
Origination: 2  (2)
Length: 7  (1) 
14. Intro to Bioinformatics  Sequence Alignment 14 How can we find an optimal alignment? Finding the alignment is computationally hard:ACGTCTGATACGCCGTATAGTCTATCTCTGAT---TCG-CATCGTC--T-ATCT
C(27,7) gap positions = ~888,000 possibilities
Its possible, as long as we dont repeat our work!
Dynamic programming: The Needleman & Wunsch algorithm 
15. Intro to Bioinformatics  Sequence Alignment 15 Dynamic Programming Technique of solving optimization problems
Find and memorize solutions for subproblems
Use those solutions to build solutions for larger subproblems
Continue until the final solution is found  
Recursive computation of cost function in a non-recursive fashion 
16. Intro to Bioinformatics  Sequence Alignment 16 Global Sequence Alignment Needleman-Wunsch algorithm
Suppose we are aligning:	A with A
 
17. Intro to Bioinformatics  Sequence Alignment 17 Dynamic Programming (DP) Concept Suppose we are aligning:	CACGA
    	CCGA
 
18. Intro to Bioinformatics  Sequence Alignment 18 DP  Recursion Perspective Suppose we are aligning:ACTCGACAGTAG
Last position choices: 
19. Intro to Bioinformatics  Sequence Alignment 19 What is the optimal alignment? ACTCGACAGTAG
Match: +1
Mismatch: 0
Gap: 1 
20. Intro to Bioinformatics  Sequence Alignment 20 Needleman-Wunsch:  Step 1 Each sequence along one axis
Mismatch penalty multiples in first row/column
0 in [1,1] (or [0,0] for the CS-minded) 
21. Intro to Bioinformatics  Sequence Alignment 21 Needleman-Wunsch:  Step 2 Vertical/Horiz. move:  Score + (simple) gap penalty
Diagonal move:  Score + match/mismatch score
Take the MAX of the three possibilities 
22. Intro to Bioinformatics  Sequence Alignment 22 Needleman-Wunsch:  Step 2 (contd) Fill out the rest of the table likewise 
23. Intro to Bioinformatics  Sequence Alignment 23 Needleman-Wunsch:  Step 2 (contd) Fill out the rest of the table likewise 
24. Intro to Bioinformatics  Sequence Alignment 24 But what is the optimal alignment To reconstruct the optimal alignment, we must determine of where the MAX at each step came from 
25. Intro to Bioinformatics  Sequence Alignment 25 A path corresponds to an alignment 	= GAP in top sequence
	= GAP in left sequence
	= ALIGN both positions
One path from the previous table:
Corresponding alignment (start at the end):	AC--TCG	ACAGTAG
 
26. Intro to Bioinformatics  Sequence Alignment 26 Algorithm Analysis Brute force approach 
If the length of both sequences is n, number of possibility = C(2n, n) = (2n)!/(n!)2 ? 22n / (?n)1/2, using Sterlings approximation of n! = (2?n)1/2e-nnn.
O(4n)
Dynamic programming
O(mn), where the two sequence sizes are m and n, respectively  
O(n2), if m is in the order of n  
27. Intro to Bioinformatics  Sequence Alignment 27 Practice Problem Find an optimal alignment for these two sequences:	GCGGTT	GCGT
Match: +1
Mismatch: 0
Gap: 1
 
28. Intro to Bioinformatics  Sequence Alignment 28 Practice Problem Find an optimal alignment for these two sequences:	GCGGTT	GCGT
 
29. Intro to Bioinformatics  Sequence Alignment 29 Semi-global alignment Suppose we are aligning:GCGGGCG
Which do you prefer?G-CG		-GCGGGCG		GGCG
Semi-global alignment allows gaps at the ends for free. 
30. Intro to Bioinformatics  Sequence Alignment 30 Semi-global alignment 
31. Intro to Bioinformatics  Sequence Alignment 31 Local alignment Global alignments  score the entire alignment
Semi-global alignments  allow unscored gaps at the beginning or end of either sequence
Local alignment  find the best matching subsequence
CGATGAAATGGA
This is achieved by allowing a 4th alternative at each position in the table:  zero. 
32. Intro to Bioinformatics  Sequence Alignment 32 Local Sequence Alignment  Why local sequence alignment?
Subsequence comparison between a DNA sequence and a genome
Protein function domains
Exons matching
Smith-Waterman algorithm 
33. Intro to Bioinformatics  Sequence Alignment 33 Local alignment Score: Match = 1, Mismatch = -1, Gap = -1 
34. Intro to Bioinformatics  Sequence Alignment 34 Local alignment Another example 
35. Intro to Bioinformatics  Sequence Alignment 35 More Example Align 
ATGGCCTC
ACGGCTC
Mismatch ? = -3
Gap           ? = -4
 
36. Intro to Bioinformatics  Sequence Alignment 36 More Example  
 
37. Intro to Bioinformatics  Sequence Alignment 37 Scoring Matrices for DNA Sequences Transition: A ?? G  C ?? T
Transversion: a purine (A or G) is replaced by a pyrimadine (C or T) or vice versa 
38. Intro to Bioinformatics  Sequence Alignment 38 Scoring Matrices for Protein Sequence  PAM (Percent Accepted Mutations) 250 
39. Intro to Bioinformatics  Sequence Alignment 39 Scoring Matrices for Protein Sequence  BLOSUM (BLOcks SUbstitution Matrix) 62 
40. Intro to Bioinformatics  Sequence Alignment 40 Using Protein Scoring Matrices Divergence
BLOSUM 80 	BLOSUM 62 		BLOSUM 45
PAM 1 		PAM 120 		PAM 250
Closely related 				Distantly related
Less divergent 				More divergent
Less sensitive 				More sensitive
Looking for
Short similar sequences ? use less sensitive matrix
Long dissimilar sequences ? use more sensitive matrix
Unknown ? use range of matrices
Comparison
PAM  designed to track evolutionary origin of proteins
BLOSUM  designed to find conserved regions of proteins