460 likes | 893 Vues
Multiple sequence alignment (MSA). Bioe C144/244 Fall 2010. Topics Covered. Uses of multiple sequence alignments (MSAs) Global, glocal and local alignments Structural vs sequence alignments Progressive, iterative and master-slave alignment methods ClustalW, MAFFT
E N D
Multiple sequence alignment(MSA) Bioe C144/244 Fall 2010
Topics Covered • Uses of multiple sequence alignments (MSAs) • Global, glocal and local alignments • Structural vs sequence alignments • Progressive, iterative and master-slave alignment methods • ClustalW, MAFFT • Benchmark datasets & scoring functions • Alignment formats (aligned FASTA, UCSC a2m etc)
What is an MSA • An MSA is an assertion of homology • > 2 nucleotide or amino acid sequences • Characters in the columns have descended from an ancestral character except for indel characters that represent insertions and deletions • An MSA is a matrix • Mi,j = the character for sequence i at column j. • Lower-case characters may have different meanings from upper-case characters • Dot (.) and dash (-) are indel characters (understand UCSC a2m format)
Uses of sequence alignment Phylogenetic tree Active and binding site prediction GFKLP GYKLP GFRVP GF-LP Homology Models And more… Profiles/HMM construction Domain Prediction Substitution Matrices Function prediction by phylogenomic analysis Secondary structure prediction Subfamily identification …
Local, glocal and global-global • Local-local • Global-local (aka glocal) • Global-global Best for boosting remote homolog detection, identification of evolutionary domains. Default protocol of BLAST and PSI-BLAST. Global to the query, potentially local to the hit. Best for gathering homologs to a structural domain. Restrict sequences to those appearing to have the same domain architecture. Recommended for phylogenomic inference of molecular function. Default protocol of FlowerPower and PhyloBuilder.
Structural alignment is the gold standard • Structural superposition of two PDB structures provides correspondences/equivalences between residues • Since primary sequence diverges more rapidly than 3D structure, structural alignment is the gold standard against which sequence alignment is assessed • Not all structural aligners agree on all pairs • However, clearly superposable pairs (within 2.5 Angstroms) are normally agreed upon by structural aligners • Example structural aligners include: CE, DALI, VAST, Structal
Sequence and structural divergence are correlated Accuracy of sequence alignment relative to structural alignment Left three columns show results of structural alignment %ID: Structure pairs have been placed into bins based on sequence identity given the structural alignment #pair: number of pairs in each bin %Superpos: percent positions that are within ~3Angstroms RMSD (between backbone C-alpha carbons) Right three columns give Cline Shift scores for pairwise sequence alignments relative to the structural alignment. The best CS score possible is 1; negative scores indicate incorrect over-alignment with very few (or no) correctly aligned residue pairs.
Structural alignment example ID EC Function 1E9Y 3.5.1.5 Urease 1J79 3.5.2.3 Dihydroorotase Identity 9.8% Equivalent Residues 40%
VAST Structural Alignment at NCBI Type in the PDB structure ID of interest. http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
Then select PDB structures for which you want to see a structural alignment
VAST alignment of selected structures VAST multiple alignment based on structure superposition: Non-equivalent positions are in lower-case 1SN4 Scorpion neurotoxin (colored manually using PhyloFacts JMOL viewer to display non-equivalent positions)
Basic MSA method classes • Progressive: • Sequences are aligned to each other starting with the most similar sequences, until all sequences are included (typically using a form of agglomerative clustering) • A guide tree (based on pairwise sequence similarity) may be used to determine the join order of the sequences (e.g., ClustalW) • Key features: once two sequences are aligned to each other, their respective alignment will not change • Examples: ClustalW, SATCHMO • Iterative • The MSA is adjusted iteratively, allowing individual sequences or groups of sequences to change their respective alignments • Key feature: alignments of sequences may be adjusted based on some objective function • Examples: MUSCLE, MAFFT • Master-slave • One sequence (or profile/HMM) is the master • Other sequences are aligned to it • The MSA of the set is based on the alignment to the master • Examples: BLAST, PSI-BLAST, PFAM (full) MSAs
Some alignments are easy Note: no gaps, high sequence identity.
Alignment Editing Belvu allows: • Coloring columns according to characteristics • Changing sequence order (by %ID, tree topology) • Deleting columns (specified range or characteristics) • Deleting sequences individually, or according to characteristics (fraction gaps, low %ID) • And more…
Caveats • Sequence “signal” guides the alignment • If the signal is weak, the alignment is likely to be poor • As proteins diverge from a common ancestor, their structures and functions can change • All methods perform poorly when pairs are included with <25% identity • Even structural superposition can be challenging! • Unequal numbers of repeats, domain shuffling, large insertions or deletions introduce significant alignment errors • Take care in selecting sequences
Lecture notes on ClustalW from Per Kraulis http://www.avatar.se/lectures/molbioinfo2001/multali-clustal.html
Lecture notes on ClustalW from Per Kraulis http://www.avatar.se/lectures/molbioinfo2001/multali-clustal.html
ClustalW example Pevsner, Jonathan. Bioinformatics and Functional Genomics. 2nd. Baltimore Maryland: Wiley-Blackwell, 2009. 179-207. Print.
ClustalW example Pevsner, Jonathan. Bioinformatics and Functional Genomics. 2nd. Baltimore Maryland: Wiley-Blackwell, 2009. 179-207. Print.
ClustalW example Pevsner, Jonathan. Bioinformatics and Functional Genomics. 2nd. Baltimore Maryland: Wiley-Blackwell, 2009. 179-207. Print.
Summary of ClustalW method • For years, ClustalW was the best method available. • It’s still a solid performer, provided that sequences are closely related and do not have significant structural differences • Distinguishing characteristics: • Progressive alignment based on a guide tree • Gap parameters informed by hydrophobicity of amino acids and by previously inserted gaps • Amino acid substitution matrices derived from observed sequence divergence (different matrices for different groups)
MAFFT overview MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform Kazutaka Katoh, Kazuharu Misawa,1 Kei-ichi Kuma, and Takashi Miyataa MAFFT is an example of an iterative algorithm The basic steps of the MAFFT algorithm Input sequences are progressively aligned following the branching order of sequences in the guide tree. This method requires a guide tree based on the all pairwise comparison distance matrix The alignment obtained by is subjected to further improvement, in which the alignment is divided into two groups and realigned using a technique called tree-dependent restricted partitioning. This process is iterated until no better scoring alignment is obtained http://nar.oxfordjournals.org/content/30/14/3059.full
Tree-dependant partitioning used in the MAFFT iterative step http://nar.oxfordjournals.org/content/30/14/3059.full
Additional issues • MSA methods vary in computational complexity • Some are very slow (e.g., ProbCons, T-Coffee, FSA, SATCHMO) • Some are very fast (e.g., MAFFT, MUSCLE) • Is the extra time worth the trouble? • Some methods are optimized for local alignment, while others are optimized for global alignment • Most methods assume input sequences are globally alignable • Restrict sequences to the homologous regions • Selection of sequences, alignment method and subsequent editing protocols must be guided by the intended use • All methods have serious problems in accuracy when sequence identities drop below 25% • Iterative methods can have problems when outlier (orphan) sequences are included
Benchmark datasets • BAliBase(Benchmark Alignment Database) • PREFAB (Protein Reference Alignment database) • SABmark(Sequence and structure Alignment Benchmark)
BAliBASE Julie Thompson, Frédéric Plewniak and Olivier Poch (1999) Bioinformatics, 15, 87-88
BAliBase http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE/
Motivation (paraphrased from the authors) • The alignment of protein sequences is a crucial tool in molecular biology and genome analysis. • Historically, the quality of new alignment programs has been compared to previous methods using a small number of test cases selected by the program author. [italics mine] • Comparisons have been done using a set of alignments selected from structural databases. • These databases assemble proteins into homologous families, but alignments are not classified specifically for the systematic evaluation of multiple alignment programs. http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE/
BAliBASE Motivation (cont’d) • It has been shown (McClure et al., 1994) that the performance of alignment programs depends on • the number of sequences, • the degree of similarity between sequences and • the number of insertions in the alignment. • Other factors may also affect alignment quality • length of the sequences • existence of large insertions • N/C-terminal extensions • over-representation of some members of the protein family. • We have constructed BAliBASE (Benchmark Alignment dataBASE) containing high-quality, documented alignments to identify the strong and weak points of the numerous alignment programs now available. http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE/
Reference alignment construction • The sequences included in the database are selected from alignments in either the FSSP or HOMSTRAD structural databases, or from manually constructed structural alignments taken from the literature. • When sufficient structures are not available, additional sequences are included from the HSSP database (Schneider et al., 1997). • The VAST Web server (Madej, 1995) is used to confirm that the sequences in each alignment are structural neighbours and can be structurally superimposed. • Functional sites are identified using the PDBsum database (Laskowski et al., 1997) • Alignments are manually verified and adjusted, in order to ensure that conserved residues are aligned as well as the secondary structure elements. http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE/
Reference alignments • Reference 1 contains alignments of < 6 equidistant sequences (i.e. the pairwise identity is within a specified range). All the sequences are of similar length, with no large insertions or extensions. • Reference 2 aligns up to three "orphan" sequences (less than 25% identical) from reference 1 with a family of at least 15 closely related sequences. • Reference 3 consists of up to 4 subgroups, with less than 25% residue identity between sequences from different groups. The alignments are constructed by adding homologous family members to the more distantly related sequences in reference 1. • Reference 4 is divided into two sub-categories containing alignments of up to 20 sequences including N/C-terminal extensions (up to 400 residues), and insertions (up to 100 residues). http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE/
Reference alignments http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE2/
Reference alignments http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE2/
BAliBASE Summary • BAliBASE was the first benchmark dataset with specific reference alignments to assess relative method performance under different types of conditions • While many of the sequences included in the reference alignments have been aligned structurally, alignments have been refined manually • It has been extended by the authors to include additional types of alignment tasks • It remains one of the most commonly used benchmark datasets for evaluating multiple sequence alignments • Nevertheless, some issues with the dataset exist (to be covered in a later silde)
PREFAB The proctocol used to make the PREFAB dataset Two proteins are aligned by a structural method that does not incorporate sequence similarity. Each sequence is used to query a database, from which high‐scoring hits are collected. The queries and their hits are combined and aligned by a multiple sequence method. Accuracy is assessed on the original pair alone, by comparison with their structural alignment. The structural aligned proteins were selected from the FSSP database. Re‐aligned each pair of structures using the CE aligner and retained only those pairs for which FSSP and CE agreed on 50 or more positions. The full‐chain sequence of each structure was used to make a PSI‐BLAST search of the NCBI non‐redundant protein sequence database keeping locally aligned regions of hits with e‐values below 0.01. Hits were filtered to 80% maximum identity (including the query), and 24 selected at random. Finally, each pair of structures and their remaining hits were combined to make sets of ≤50 sequences. The final set, PREFAB version 3.0, has 1932 alignments averaging 49 sequences of length 240, of which 178 positions in the structure pair are found in the consensus of FSSP and CE Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5), 1792-97.
SABmark SABmark is designed to assess the performance of both multiple and pairwise (protein) sequence alignment algorithms. The database contains 2 sets, each consists of a number of subsets with related sequences. Its main features are: - Covers the entire known fold space (SCOP classification), with subsets provided by the ASTRAL compendium - All structures have high quality, with 100% resolved residues - Structure alignments have been derived carefully, using both SOFI and CE, and Relaxed Transitive Alignment - At most 25 sequences in each subset to avoid overrepresentation of large 1. The Twilight Zoneset is divided into sequence groups that each represent a SCOP fold. Sequence similarity is very low, between 0-25% identity, and a (traceable) common evolutionary origin cannot be established between most pairs even though their structures are (distantly) similar. This set therefore represents the worst case scenario for sequence alignment, which unfortunately is also the most frequent one, as most related sequences share less than 25% identity. - Twilight Zone set (with false positives): 209 groups, 1740 (3280) sequences, 10667 (44056) related pairs 2.The Superfamilies set consists of groups that each represent a SCOP superfamily, and therefore contain sequences with a (putative) common evolutionary origin. However, they share at most 50% identity, which is still challenging for any sequence alignment algorithm. Frequently, alignments are performed to establish whether or not sequences are related. To benchmark this, a second version of both the Twilight Zone and the Superfamilies set is provided, in which to each alignment problem a number of false positives, i.e. sequences not related to the original set, are added. - Superfamilies set (with false positives): 425 groups, 3280 (6526) sequences, 19092 (79095) related pairs http://bioinformatics.vub.ac.be/databases/databases.html
Benchmark datsets http://www.ncbi.nlm.nih.gov/pubmed/20047958
Assessing sequence alignment methods using the PREFAB benckmark dataset The Q_Developer score measures recall (sensitivity). In other words, it measures the fraction of the reference alignment that is correctly predicted by the sequence alignment Q_Developer = TP / (TP+FN) TP = # of correctly aligned residue pairs in the sequence alignment (i.e., agree with the reference alignment). FN = # of aligned residue pairs in the reference alignment which are not in the sequence alignment (i.e., they are missed by the sequence alignment) SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction," Hagopian et al, Nucleic Acids Research 2010 http://makana.berkeley.edu/satchmo/supplementary/webserver/
Assessing sequence alignment methods using the PREFAB benckmark dataset The Q_Modeler score measures precision (selectivity). Q_Modeler = TP / (TP + FP) TP = # of correctly aligned residue pairs in the sequence alignment (i.e, that agree with the reference) FP = # of incorrectly aligned residue pairs in the sequence alignment (i.e., pairs in the sequence alignment that are not aligned in the reference) SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction," Hagopian et al, Nucleic Acids Research 2010 http://makana.berkeley.edu/satchmo/supplementary/webserver/
Assessing sequence alignment methods using the PREFAB benckmark dataset The Cline Shift score includes a small positive score for being close to the reference alignment, and a small penalty for overalignment. SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction," Hagopian et al, Nucleic Acids Research 2010 http://makana.berkeley.edu/satchmo/supplementary/webserver/
UCSC a2m format • The UCSC SAM HMM software uses a specialized format for alignments, to describe how a sequence was emitted by an HMM (or, equivalently, aligns to an HMM). a2m format MSA columns are of two types: • Columns consisting of upper-case characters and dashes correspond to nodes in the HMM representing the consensus structure • Dashes are placed to indicate passage through an HMM skip/delete state • I.e., a dash indicates a sequence does not have the consensus structure at that position • Columns consisting of lower-case characters and dots correspond to residues emitted in HMM insert states, representing inserts between positions in the consensus structure • Dots are inserted post-hoc so that all sequences in the MSA have the same number of characters. • Dots in one sequence indicate that another sequence has inserted characters using an insert state at that position
References • Pevsner, Jonathan. Bioinformatics and Functional Genomics. 2nd. Baltimore Maryland: Wiley-Blackwell, 2009. 179-207. Print. • Katoh, Kazutaka, Kuma Kei-ichi, Hiroyuki Toh, and Takahashi Miyata. "MAFFT:a novel method for rapid multiple sequence alignment based on fast Fourier transform." Nucleic Acids Research. 30.14 (2002). • SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction," Hagopian et al, Nucleic Acids Research 2010