Sequence Alignment techniques
Definition • A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationship between the sequences.
Why sequence alignments? 1. I have just sequenced something. What is known about the thing I sequenced? 2. I have a unique sequence. Is it similar to another gene that has a known function? 3. I found a new protein in a lower organism. Is it similar to a protein from another species? 4. I have decided to work on a new gene. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform RT-PCR. 5. I wish to perform molecular modeling of a protein sequence that has significant similarity to the sequence of a protein for which the 3D structure is available.
Sequence alignments • Pairwise • Multiple
Pairwise protein sequence alignment • definition: compare pairs of sequences and search for series of characters that are in the same order • sequences are written in rows, with identical (or similar) characters in the same columns and non-identical (non-similar) characters either in the same column (mismatch) or opposite a gap
Local - most similar sub-regions of sequences aligned (islands of similarity)
----TFGK------
    |||
----TFGR------
Global - entire sequences aligned up to both ends
HCKLTFGKWFTSEW
 |  |||     |
KCGPTFGRIACGEM
Methods of pairwise sequence alignment… • dot matrix - all possible matches between sequence residues are found; used to compare two sequences to look for regions where they may align; very useful for finding indels and repeats in sequences; can be used as a first pass to see if there is any similarity between sequences • dynamic programming - mathematically guaranteed to find the optimal alignment (global or local) between a pair of sequences for a given scoring scheme; computationally expensive for long sequences - the number of steps grows with the product of the two sequence lengths
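Dot matrix method 1 - one sequence listed along top of page and second sequence listed along the side 2 - move across row and put dot in any column where the character is the same 3 - continue for each row until all possible character matches between the sequences are represented by dots 4 - diagonal rows of dots reveal sequence similarity (can also find repeats and inverted repeats off the main diagonal) 5 - isolated dots represent random similarity unrelated to the alignment
(• = match, . = no match)
  H C G E T F G R W F T P E W
K . . . . . . . . . . . . . .
C . • . . . . . . . . . . . .
G . . • . . . • . . . . . . .
P . . . . . . . . . . . • . .
T . . . . • . . . . . • . . .
F . . . . . • . . . • . . . .
G . . • . . . • . . . . . . .
R . . . . . . . • . . . . . .
I . . . . . . . . . . . . . .
A . . . . . . . . . . . . . .
C . • . . . . . . . . . . . .
G . . • . . . • . . . . . . .
E . . . • . . . . . . . . • .
M . . . . . . . . . . . . . .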
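The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `dot_matrix` and `render` are made up for this example, and the two sequences are the ones shown on the slide.

```python
def dot_matrix(top, side):
    """Return one row per residue of `side`; True marks a matching column in `top`."""
    return [[s == t for t in top] for s in side]


def render(top, side):
    """Draw the matrix as text: top sequence as header, side sequence down the left."""
    lines = ["  " + " ".join(top)]
    for residue, row in zip(side, dot_matrix(top, side)):
        lines.append(residue + " " + " ".join("•" if hit else "." for hit in row))
    return "\n".join(lines)


top = "HCGETFGRWFTPEW"   # sequence along the top of the page
side = "KCGPTFGRIACGEM"  # sequence along the side
matrix = dot_matrix(top, side)
print(render(top, side))

# Positions on the main diagonal where both sequences carry the same residue
diagonal_hits = [i for i in range(min(len(top), len(side))) if matrix[i][i]]
print(diagonal_hits)
```

Runs of consecutive indices in `diagonal_hits` correspond to the diagonal rows of dots described in step 4; real dot plot tools usually add a sliding window and a score threshold to suppress the isolated dots of step 5.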
Protein sequence Alignments… • The dot matrix method is not a convenient method • Manual alignment of sequences? • For sequences of length N, about $2^{2N}/\sqrt{2N}$ alignments are possible (for N = 300, about $10^{179}$ alignments!) • Mathematical solution: dynamic programming (the name has nothing to do with computer programming!)
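The size of that number can be checked with a short calculation. The sketch below assumes the estimate on the slide reads 2^(2N) / sqrt(2N); the function name is hypothetical. It works with logarithms so that the huge value never has to be stored as a float.

```python
import math


def log10_alignment_estimate(n):
    """log10 of 2^(2n) / sqrt(2n), the estimate as read from the slide (an assumption)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(2 * n)


exponent = math.floor(log10_alignment_estimate(300))
print(exponent)  # prints 179, i.e. on the order of 10^179 alignments
```

Even at a billion alignments per second, enumerating on the order of 10^179 candidates is out of reach, which is why an exhaustive search is not an option.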
Protein sequence Alignments… • In naturally occurring conserved proteins, certain amino acid replacements are favored in the process of natural selection. • Based on these observed substitutions, substitution matrices have been generated. • For example: • BLOSUM (BLOcks SUbstitution Matrix) matrices: BLOSUM40, BLOSUM60 etc. • PAM (Point Accepted Mutation) matrices: PAM80, PAM120, PAM250 • These matrices are used by various protein sequence alignment algorithms.
Dynamic programming • a dot matrix shows regions of similarity but not the path that connects disjointed regions, i.e. the optimal alignment, which is the ultimate goal of pairwise sequence comparison • dynamic programming was applied to sequence alignment by Needleman & Wunsch to achieve this end • dynamic programming is a general class of optimization methods that find the best solution by breaking down a large intractable problem into smaller subproblems and solving those • ultimately a sequence or ‘path’ of subproblem scores that yields the highest overall score is chosen as the optimal solution for the entire problem
Dynamic programming & sequence alignment • overall problem is broken down into subproblems of aligning each residue of one sequence to each residue of the other • choose the best solution to each subproblem among the three options of (1) - aligning residues (2) - introducing gap in sequence 1 or (3) - introducing gap in sequence 2 • each high scoring choice rules out two low scoring choices - this is critical in reducing the overall space of alignments needed to be evaluated (essence of time saving) • the algorithm uses a matrix similar to the dot matrix with sequences on the top and left axes • at each position in the matrix the algorithm computes the best score and stores a pointer to the previous position from which the highest score was derived • finally a ‘trace back’ step is performed where the highest scoring path along the pointers is traced - this represents the optimal alignment
Dynamic programming & sequence alignment: Steps… • Two sequences are arranged in a matrix table. • Initial gap penalties (d) are listed in the first row and column. • Substitution scores ($S_{i,j}$) are first filled in the table using a substitution matrix. • The simple matrix table is converted to a dynamic programming table using the following mathematical equation. • $H_{i,j} = \max\{\, H_{i-1,j-1} + S_{i,j},\; H_{i-1,j} - d,\; H_{i,j-1} - d \,\}$
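A minimal sketch of these steps in Python follows, assuming a linear gap penalty d and a toy scoring function (+1 for a match, -1 for a mismatch) instead of a real substitution matrix. The names `needleman_wunsch` and `toy_score` are made up for this example, and when several choices tie only one optimal alignment is reported.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup S(i, j)."""
    return 1 if a == b else -1


def needleman_wunsch(seq1, seq2, score, d):
    """Global alignment with linear gap penalty d; returns (score, row1, row2)."""
    n, m = len(seq1), len(seq2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Initial gap penalties go in the first column and first row
    for i in range(1, n + 1):
        H[i][0], ptr[i][0] = -d * i, "up"
    for j in range(1, m + 1):
        H[0][j], ptr[0][j] = -d * j, "left"
    # Fill the table with H[i][j] = max(diagonal + S, up - d, left - d)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (H[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]), "diag"),
                (H[i - 1][j] - d, "up"),
                (H[i][j - 1] - d, "left"),
            ]
            H[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
    # Trace back along the stored pointers from the bottom-right corner
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row1.append(seq1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1]); j -= 1
    return H[n][m], "".join(reversed(row1)), "".join(reversed(row2))


result = needleman_wunsch("ACGT", "AGT", toy_score, d=2)
print(result)  # (1, 'ACGT', 'A-GT')
```

The table has (n + 1) × (m + 1) cells and each cell looks at three neighbours, so the work grows with the product of the two sequence lengths. For protein sequences, `toy_score` would be replaced by a lookup in a BLOSUM or PAM matrix.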
[Figure: dynamic programming matrix; residues shown: H G S A Q V K T E A E M] • $H_{i,j} = \max\{\, H_{i-1,j-1} + S_{i,j},\; H_{i-1,j} - d,\; H_{i,j-1} - d \,\}$
sequence 1    M    -    N    A    L    S    D    R    T
sequence 2    M    G    S    D    R    T    T    E    T
score         6  -12    1    0   -3    1    0   -1    3   = -5

sequence 1    M    N    A    -    L    S    D    R    T
sequence 2    M    G    S    D    R    T    T    E    T
score         6    0    1  -12   -3    1    0   -1    3   = -5
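The two totals can be reproduced by summing the column scores. The sketch below is only an illustration: `PAIR_SCORES` holds just the residue pairs that occur in this example, with values copied from the slide (the slide does not say which substitution matrix they come from), and the gap penalty of 12 is read off the -12 column.

```python
GAP_PENALTY = 12  # d, taken from the -12 gap column in the example

# Only the residue pairs used in the example; values copied from the slide
PAIR_SCORES = {
    ("M", "M"): 6, ("N", "S"): 1, ("A", "D"): 0, ("N", "G"): 0, ("A", "S"): 1,
    ("L", "R"): -3, ("S", "T"): 1, ("D", "T"): 0, ("R", "E"): -1, ("T", "T"): 3,
}


def alignment_score(row1, row2):
    """Sum the column scores of two aligned rows of equal length ('-' marks a gap)."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total -= GAP_PENALTY
        else:
            total += PAIR_SCORES[(a, b)]
    return total


first = alignment_score("M-NALSDRT", "MGSDRTTET")
second = alignment_score("MNA-LSDRT", "MGSDRTTET")
print(first, second)  # -5 -5
```

Both placements of the gap give the same total, so under this scoring scheme the two alignments are equally good.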
Which matrix to use? • PAM120 for general use • PAM60 for close relations • PAM250 for distant relations • BLOSUM62 for general use • BLOSUM80 for close relations • BLOSUM45 for distant relations • When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices.
Global alignment algorithms : Needleman and Wunsch • Local Alignment algorithms: Smith-Waterman local alignment http://www.ebi.ac.uk/Tools/emboss/align/
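To illustrate how the local variant differs from the global one, here is a hedged sketch of the Smith-Waterman scoring step: every cell is floored at zero, and the best local score is the largest value anywhere in the table. The function name and the match/mismatch/gap values are assumptions chosen for this example, not values from the slides; a real protein alignment would use a substitution matrix, and the traceback is omitted for brevity.

```python
def smith_waterman_score(seq1, seq2, match=2, mismatch=-1, d=2):
    """Best local alignment score with a linear gap penalty d (score only)."""
    best = 0
    prev = [0] * (len(seq2) + 1)  # previous row of the table
    for a in seq1:
        curr = [0]  # first column stays 0: a local alignment may start anywhere
        for j, b in enumerate(seq2, start=1):
            s = match if a == b else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] - d, curr[j - 1] - d)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared island TFG scores 3 matches x 2 = 6; the flanking residues do not help
island = smith_waterman_score("XXTFGYY", "ZZTFGWW")
print(island)  # 6
print(smith_waterman_score("HCKLTFGKWFTSEW", "KCGPTFGRIACGEM"))
```

Keeping only two rows in memory is enough when only the score is needed; recovering the aligned sub-regions requires the full table and a traceback that starts at the highest-scoring cell and stops at the first zero.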
Needleman S.B. and Wunsch C.D. 1970. J. Mol. Biol. 48: 443-453
Smith T.F. and Waterman M.S. 1981. J. Mol. Biol. 147: 195-197
Eddy S.R. 2004. Nature Biotechnology 22: 909-910
Clustal • One of the most widely used programs for multiple sequence alignment (MSA) • Available in different forms: ClustalW, ClustalX • Different output formats • Apart from the standalone version, it is also available in: • BIOEDIT, GCG, EMBOSS, Macvector etc.
ClustalW • Input formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF. • Output formats: same as above + Phylip
Web server: http://www.ebi.ac.uk/clustalw/index.html Align a few sequences using the default parameters. Change parameters such as the gap penalties and note the changes in the alignment outputs. Exercise: Make a dynamic programming matrix for a protein sequence of length 7. Use the BLOSUM40 matrix to generate the dynamic programming matrix using the mathematical equation given in the presentation. Trace back the path of maximum scores and obtain the optimal alignment(s).
Exercise to be submitted by Thursday • Go to http://expasy.org/tools/randseq.html • Generate a random protein sequence of length 25 amino acids of average amino acid composition • Draw a dot plot. Identify regions of similarities, repeats, inverted repeats. • Submit the record to me.