Introduction to Sequence Analysis. Protein Sequence Analysis Part II [ web page ] Osvaldo Graña CNIO Bioinformatics Unit ograna@cnio.es March 2013. Introduction to Sequence Analysis. Introduction.
 
                
Introduction to Sequence Analysis Protein Sequence Analysis Part II [web page] Osvaldo Graña CNIO Bioinformatics Unit ograna@cnio.es March 2013
Introduction to Sequence Analysis Introduction • Determining protein and peptide sequences is a basic requirement of biomedical research, for example in cancer research, and is essential for characterising and identifying proteins or peptides. The UniProt Knowledgebase (UniProtKB) is a central database of protein sequence and function. It consists of two sections: one containing manually annotated records, with information extracted from the literature and curator-evaluated computational analysis, and one containing computationally analysed records that await full manual annotation. The two sections are referred to as "UniProtKB/Swiss-Prot" (reviewed, manually annotated) and "UniProtKB/TrEMBL" (unreviewed, automatically annotated), respectively. For more information about UniProtKB, see http://www.uniprot.org/help/uniprotkb
Introduction to Sequence Analysis Searching against a protein sequence database with NCBI-BLAST2 We are going to search a protein database with a nucleotide query (sequence 2, http://bioinfo.cnio.es/people/ograna/public_html/cursos/Sequence_analysis_course_data/sequence2.txt) using NCBI BLAST2, looking for similar peptide/protein sequences in UniProtKB/Swiss-Prot. The protein encoded by this sequence is a real entry in the database, so we expect to find a perfect match to our test sequence. We also expect to find similar peptide/protein sequences, perhaps from closely related animals or from closely related proteins. To search a protein database with a nucleotide query, we select the BLASTX option and the Swiss-Prot database: http://www.ebi.ac.uk/Tools/services/web/toolform.ebi?tool=ncbiblast&context=protein
Introduction to Sequence Analysis Searching against a protein sequence database with NCBI-BLAST2 Database: choose here the databases you wish to search with your query sequence.
Introduction to Sequence Analysis Selecting Blast parameters to search
Introduction to Sequence Analysis Searching against a protein sequence database with NCBI-BLAST2
Matrix: You may choose from a list of substitution matrices covering various evolutionary constraints, because substitutions accumulate in sequences through genetic diversity during evolution. Each matrix is tailored to a particular evolutionary distance. The default matrix for BLAST is BLOSUM62 (Blocks Substitution Matrix, 62% identity), which is among the best of the available matrices for detecting weak protein similarities. PAM (Point Accepted Mutation) matrices are also traditionally used for amino acid sequences. A PAM matrix with a larger number suits alignments of sequences separated by larger evolutionary distances, whereas a BLOSUM matrix with a larger number suits sequences with a higher percentage identity.
Expected threshold: The expected threshold sets the statistical significance required for a database match to be reported. The default value is 10, meaning that 10 matches are expected to be found merely by chance. Lower thresholds are more stringent, so fewer chance matches are reported. Raising the threshold shows less stringent matches and is recommended for searches with short sequences: a short query is more likely to occur by chance in the database than a longer one, so even a perfect match (no gaps) can have low statistical significance and may not be reported. Raising the expected threshold lets you look farther down the hit list and see matches that would normally be discarded because of low statistical significance. Generally a value of up to 1000 is enough to see results.
[Figure: E-value scale for inferring homology, with marks at 0, 10^-5 and 10^-2 and regions labelled Risky, Reliable and Very Reliable.]
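To see why a short query can miss the threshold even with a perfect match, a rough sketch of the E-value calculation may help. It assumes the standard Karlin-Altschul relation E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score. This formula is not stated on the slide, the database size and bit scores below are hypothetical, and the length corrections applied by BLAST are ignored.

```python
def expected_hits(query_len, db_len, bit_score):
    """Approximate E-value: chance hits expected with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical database of 2e8 residues (not the real size of Swiss-Prot).
DB_LEN = 200_000_000

# A short query can only reach a modest bit score, even if it matches perfectly.
short_hit = expected_hits(query_len=10, db_len=DB_LEN, bit_score=22.0)
# A longer query can accumulate a much higher bit score.
long_hit = expected_hits(query_len=300, db_len=DB_LEN, bit_score=120.0)

for label, e_value in (("short query", short_hit), ("long query", long_hit)):
    verdict = "reported" if e_value <= 10 else "discarded"
    print(f"{label}: E = {e_value:.3g} ({verdict} at the default threshold of 10)")
```

With these made-up numbers the short hit falls above the default threshold of 10 but below 1000, which is the situation in which raising the expected threshold brings the match back into the report.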
Introduction to Sequence Analysis Searching against a protein sequence database with NCBI-BLAST2
Filter: If set to true, the filter masks segments of the query sequence that are non-specific for sequence similarity searches. Filtering can eliminate statistically significant but biologically uninteresting reports from the output, for example hits against common acidic-, basic- or proline-rich regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is applied only to the query sequence, not to database sequences. For nucleotide query sequences the program used is DUST, written by Tatusov, R. L., and Lipman, D. J. For protein query sequences, low-complexity regions are filtered with the SEG program, written by Wootton, J. C., and Federhen, S. The default is true.
Default filters (when Filter is set to true):
• BLASTp: SEG
• BLASTx: SEG
• BLASTn: DUST
Drop off: This is the amount a score must drop before extension of word hits is halted.
Introduction to Sequence Analysis Searching against a protein sequence database with NCBI-BLAST2
Open gap: The gap open penalty is the score subtracted when a gap is opened in the alignment. To make a match more significant you can try a larger gap open penalty: it decreases the number of gaps, and a good alignment without many gaps will have a higher Z-score. The default is 11.
Extend gap: The gap extension penalty is added to the gap open penalty for each base or residue in the gap; this is how long gaps are penalised. If you do not want long gaps, increase the gap extension penalty. Usually you expect a few long gaps rather than many short ones, so the gap extension penalty should be lower than the gap open penalty. An exception is when one or both sequences are single reads with possible sequencing errors, in which case you would expect many single-base gaps. You can allow for this by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. The default is 1.
Introduction to Sequence Analysis Searching against a protein sequence database with NCBI-BLAST2
Gap align: This true/false option tells the program whether to perform optimised alignments within regions involving gaps. If set to true, the program performs an alignment using gaps. If set to false, it reports only individual HSPs where the two sequences match each other, and so does not produce alignments with gaps. The default is true. (N.B. HSP means High-scoring Segment Pair: a local alignment with no gaps that achieves one of the top alignment scores in a given search.)
Introduction to Sequence Analysis Substitution matrices Alignment of protein sequences can take into account the differential rates at which amino acids substitute for each other. These rates are captured by two types of matrices: PAM and BLOSUM.
PAM (Point Accepted Mutation, also expanded as Percent Accepted Mutation): on the basis of comparisons among many pairs of very similar protein sequences (at least 85% identical, i.e. homologous sequences), Margaret Dayhoff constructed a mutation probability matrix giving the empirical frequencies with which one amino acid is replaced by another during evolution. Examples are PAM1, PAM10, PAM25, PAM50, PAM100, PAM125 and PAM250, where
$\mathrm{PAM10} = (\mathrm{PAM1})^{10}, \;\ldots,\; \mathrm{PAM250} = (\mathrm{PAM1})^{250}$
That is, the PAM1 matrix can be multiplied by itself N times to give transition matrices for comparing sequences with lower and lower levels of similarity, reflecting separation over longer periods of evolutionary history. Thus, the commonly used PAM250 matrix represents a level of 250% change, expected in 2500 million years. Although this amount of change seems very large, sequences at this level of divergence still have about 20% similarity (Bioinformatics, D. W. Mount, page 96).
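The idea of raising PAM1 to a power can be illustrated with a toy example. The sketch below uses a made-up 3-letter alphabet and invented probabilities rather than the real 20 x 20 Dayhoff matrix; it only shows how repeated multiplication lowers the probability that a position remains unchanged while each row remains a probability distribution.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result

# Toy "PAM1-like" matrix for a made-up 3-letter alphabet: each row sums to 1
# and each letter has roughly a 99% chance of staying unchanged in one step.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

for n in (1, 10, 100, 250):
    m = mat_pow(TOY_PAM1, n)
    print(f"toy PAM{n}: P(first letter unchanged) = {m[0][0]:.3f}")
```

Real PAM scoring matrices are log-odds scores derived from such probabilities; that conversion step is left out here.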
Introduction to Sequence Analysis Substitution matrices The empirical frequency with which amino acid type i is replaced by type j (or vice versa) gives the matrix entry M_{i,j}. For example, the score for aligning YY with YY is 10 + 10 = 20, a very significant score, whereas the score for aligning YY with TP is -3 - 5 = -8.
Introduction to Sequence Analysis Substitution matrices Recommendations Which PAM matrix should I use? The percentage similarity or difference between two sequences cannot be known until an alignment has been made, so a trial alignment must be done first. Once an initial similarity score has been obtained, a more representative score can be obtained by using another PAM matrix designed specifically for sequences at that level of similarity.
Introduction to Sequence Analysis Substitution matrices BLOSUM (Blocks Substitution Matrix): the PAM matrices introduced by Dayhoff are constructed from the amino acid replacements inferred from alignments of protein sequences that are at least 85% identical. Henikoff & Henikoff (1992) instead considered blocks, or highly conserved regions, in aligned protein sequences. The BLOSUM matrix scores for amino acid pairs are based on the frequency of amino acid substitutions in aligned sequence motifs (blocks) from a related family of proteins, regardless of the overall degree of similarity between the protein sequences. The BLOSUM62 substitution matrix is widely used for scoring protein sequence alignments. The matrix values are based on the observed amino acid substitutions in a large set of approximately 2000 conserved amino acid blocks representing more than 500 families of related proteins.
• BLOSUM62: based on blocks that are 62% identical
• BLOSUM80: based on blocks that are 80% identical
BLOSUM62 example: http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm
Introduction to Sequence Analysis Substitution matrices PAM vs BLOSUM The PAM matrices are based on scoring all amino acid positions in related sequences, whereas the BLOSUM matrices are based on substitutions and conserved positions in blocks, which represent the most similar common regions in related sequences. The PAM model is thus designed to track the evolutionary origins of proteins, whereas the BLOSUM model is designed to find their conserved domains. The choice of matrix depends on the goals of the investigator. Still, there are some equivalences between PAM and BLOSUM matrices:
Introduction to Sequence Analysis GAPs in the alignment We also have to consider insertions and deletions. This means opening gaps in the alignment, so the scores must be recalculated with penalties for:
a) Opening a gap in the alignment
b) Extending the gap in the alignment
The values vary depending on the program used, but as a general rule opening a new gap is penalised much more heavily than extending an existing one. Long gaps are found more frequently than bunches of 1-base gaps.
Example 1, "bunch of gaps":
ATCG_ATCG_ATCG_ATCG
ATCGTATCGTATCGTATCG
Example 2, "long gap":
ATCG___ATCG
ATCGTCGATCG
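The effect of the two penalties can be checked on the two example alignments with a short sketch. It assumes one common convention, in which the first position of each gap costs the open penalty and every further position costs the extension penalty; programs differ on this detail, and the values 11 and 1 are simply the BLAST defaults quoted earlier.

```python
def gap_run_lengths(aligned_seq, gap_char="_"):
    """Return the length of each run of consecutive gap characters."""
    runs, current = [], 0
    for ch in aligned_seq:
        if ch == gap_char:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def affine_gap_cost(aligned_seq, gap_open=11, gap_extend=1, gap_char="_"):
    """Total penalty: gap_open for the first position of each gap,
    gap_extend for each further position (an assumed convention)."""
    return sum(gap_open + gap_extend * (length - 1)
               for length in gap_run_lengths(aligned_seq, gap_char))

bunch_of_gaps = "ATCG_ATCG_ATCG_ATCG"   # three separate 1-base gaps
long_gap = "ATCG___ATCG"                # one 3-base gap

print(affine_gap_cost(bunch_of_gaps))   # 3 * 11 = 33
print(affine_gap_cost(long_gap))        # 11 + 2 * 1 = 13
```

Both alignments contain three gap positions, yet the single long gap is penalised far less, which is the behaviour the general rule is meant to produce.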
Introduction to Sequence Analysis Example Example of scoring a sequence alignment with a gap penalty under BLOSUM62. BLOSUM62 matrix: http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm
Sequence 1    V    D    S    -    C    Y
Sequence 2    V    E    S    L    C    Y
Score         4    2    4  -11    9    7
Total score = (sum of amino acid pair scores) minus (single gap penalty) = (4 + 2 + 4 + 9 + 7) - 11 = 15
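The same calculation can be written as a small sketch. Only the five BLOSUM62 entries needed for this alignment are included, and the function charges a flat penalty of 11 per gapped column, which matches the single-residue gap in the example but is a simplification for longer gaps.

```python
# Only the BLOSUM62 entries needed for this example (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "S"): 4,
    ("C", "C"): 9,
    ("Y", "Y"): 7,
}

def pair_score(a, b):
    """Look up a substitution score regardless of the order of the pair."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def alignment_score(seq1, seq2, gap_penalty=11):
    """Sum the pair scores and subtract gap_penalty for each gapped column."""
    total = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total -= gap_penalty
        else:
            total += pair_score(a, b)
    return total

print(alignment_score("VDS-CY", "VESLCY"))  # 26 - 11 = 15
```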
Introduction to Sequence Analysis Searching against a protein sequence database with NCBI-BLAST2: Results Summary NOTE: by clicking 'Show Alignments' we find that the hits are found in frame 2. This tells us that at least the second frame is a coding frame.
Introduction to Sequence Analysis Showing the alignments NOTE: all the hits are found in frame 2 (see 'Show Alignments'). This tells us that at least the second frame is a coding frame.
Introduction to Sequence Analysis Visual output (results) Question: why does the alignment span only a small region of the query sequence while it spans the full length of the hit?
Introduction to Sequence Analysis Visual output (results) Answer: the part of the mouse fosB mRNA that BLAST can align with the FosB protein sequence is the part that belongs to the CDS, from the first methionine (translation start site) to the stop codon (translation stop site).
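How a coding frame shows up when a nucleotide query is translated can be sketched as follows. The example translates a made-up sequence in the three forward frames using the standard genetic code; BLASTX also considers the three reverse-strand frames, which are omitted here, and the input is assumed to be upper-case DNA (T rather than U).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna, frame):
    """Translate a DNA string in forward reading frame 1, 2 or 3.
    A '*' in the result marks a stop codon; trailing bases are ignored."""
    start = frame - 1
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

# Made-up sequence: one leading base, then ATG GCT TAA, then two more bases.
toy_mrna = "CATGGCTTAAGG"
for frame in (1, 2, 3):
    print(frame, translate(toy_mrna, frame))
```

In this toy sequence only frame 2 yields a peptide that starts with methionine and ends at a stop codon, and the bases outside that stretch have nothing to align to in a protein hit.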
Introduction to Sequence Analysis Functional predictions (results)
Introduction to Sequence Analysis Description of UniProt entry
Introduction to Sequence Analysis Pairwise local/global alignment: differences
Global alignment: we try to align the sequences over their whole length. It is only useful for homologous proteins with a high percentage of identity.
Local alignment: we try to align locally as much of the sequences as we can. This is useful when dealing with domains.
Are these proteins homologues? Globally: no, they are very different and the score would be very low. Locally: yes, there is a homologous domain, the grey one.
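The difference between the two scores can be seen in a minimal sketch of the two dynamic programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). The scoring scheme (match +1, mismatch -1, linear gap penalty -2) and the sequences are made up for illustration and are much simpler than a substitution matrix with affine gaps.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: the best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best

# Made-up sequences that share only the segment GATTACA.
seq_a = "AAAAGATTACATTTT"
seq_b = "CCGATTACAGG"
print("global:", global_score(seq_a, seq_b))
print("local: ", local_score(seq_a, seq_b))
```

The local score rewards the shared segment alone, whereas the global score also pays for the unrelated flanking regions, mirroring the two answers given for the proteins above.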
Introduction to Sequence Analysis Pairwise local/global alignment: Running an EMBOSS-Align alignment We are going to use the EMBOSS-Align tool (http://www.ebi.ac.uk/Tools/psa/).
• Two jobs are run: one with the EMBOSS global alignment program (needle) and one with the local alignment program (water).
• As we are comparing two protein sequences, the molecule type is left as protein.
• The default BLOSUM62 matrix is used, together with the default gap open penalty of 10 and gap extension penalty of 0.5.
Introduction to Sequence Analysis Pairwise local/global alignment: differences Let's align these two sequences (Pfam family: http://pfam.sanger.ac.uk/family?acc=PF00071):
http://bioinfo.cnio.es/people/ograna/public_html/cursos/Sequence_analysis_course_data/Q4RD65_TETNG.txt
http://bioinfo.cnio.es/people/ograna/public_html/cursos/Sequence_analysis_course_data/RACA_DICDI.txt
Introduction to Sequence Analysis Pairwise local/global alignment: needle GLOBAL result
Introduction to Sequence Analysis Pairwise local/global alignment: water LOCAL result The Smith-Waterman algorithm is more suitable for identifying related proteins of limited sequence similarity than FASTA and BLAST in a database search (Bioinformatics, D. W. Mount, page 259).
Introduction to Sequence Analysis Pairwise local/global alignment: Results of EMBOSS-Align alignments Identical amino acids are connected with a "|" symbol. Unrelated pairs of amino acids (mismatches) are connected with a space. A gap is represented with a "-" symbol. Similar pairs (e.g. leucine vs methionine) are connected with a ":" symbol, and less similar ones with a ".". The %id is the percentage of identical matches between the two sequences over the reported aligned region. The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring matrix value is greater than or equal to 0.0. The Overall %id and Overall %similarity are calculated in the same way, but over the length of the longer of the two sequences.
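The two identity figures can be reproduced with a short sketch. The alignment and sequence lengths below are hypothetical, and the function names are illustrative rather than part of EMBOSS.

```python
def count_identities(aln_a, aln_b):
    """Count columns where both aligned sequences carry the same residue."""
    return sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")

def percent_identity(aln_a, aln_b):
    """%id over the reported aligned region (gap columns count towards its length)."""
    return 100.0 * count_identities(aln_a, aln_b) / len(aln_a)

def overall_percent_identity(aln_a, aln_b, full_len_a, full_len_b):
    """Overall %id: identities divided by the length of the longer full sequence."""
    return 100.0 * count_identities(aln_a, aln_b) / max(full_len_a, full_len_b)

# Hypothetical local alignment between a 4-residue and a 9-residue sequence.
aligned_a = "ACD-F"
aligned_b = "ACDEF"
print(round(percent_identity(aligned_a, aligned_b), 1))
print(round(overall_percent_identity(aligned_a, aligned_b, 4, 9), 1))
```

The %similarity figures would be computed the same way, counting columns whose substitution matrix score is greater than or equal to 0 instead of identical columns.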
Introduction to Sequence Analysis ClustalW can build multiple sequence alignments (MSA) ClustalW (http://www.ebi.ac.uk/Tools/msa/clustalw2/) is a general-purpose global multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be viewed as cladograms or phylograms.
Multiple alignments of protein sequences are important tools in studying sequences. The basic information they provide is the identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.
Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for both pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions), while local alignments can avoid them by aligning the regions between gaps.
ClustalW (Higgins et al. 1996) is a fully automatic program for global multiple alignment of DNA and protein sequences. The alignment is progressive and takes sequence redundancy into account. Trees can also be calculated from multiple alignments. The program has some adjustable parameters with reasonable defaults. It is designed to provide an adequate alignment of a large number of closely related sequences and a reliable indication of the domain structure of those sequences. The steps used by ClustalW include:
• Perform pairwise alignments of all the sequences.
• Use the alignment scores to produce a phylogenetic tree.
• Progressive multiple sequence alignment: the construction of the MSA is reduced to a series of pairwise alignments. Initially, a dynamic programming alignment is made between the two most similar sequences, and the resulting alignment is then extended to include other, less similar sequences.
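The order in which a progressive method brings sequences together can be sketched roughly as follows. This is not ClustalW's actual procedure: it uses ungapped percent identity between made-up, equal-length sequences as a stand-in for pairwise alignment scores, and a simple greedy rule as a stand-in for the guide tree.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def progressive_order(seqs):
    """Start from the most alike pair, then repeatedly add the sequence
    most similar to any sequence already included."""
    names = list(seqs)
    first, second = max(combinations(names, 2),
                        key=lambda pair: identity(seqs[pair[0]], seqs[pair[1]]))
    order = [first, second]
    remaining = [n for n in names if n not in order]
    while remaining:
        nxt = max(remaining,
                  key=lambda n: max(identity(seqs[n], seqs[m]) for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Made-up, equal-length toy sequences.
toy = {
    "seqA": "MKTAYIAK",
    "seqB": "MKTAYIAR",
    "seqC": "MKSAYLAR",
    "seqD": "GGSAWLPR",
}
print(progressive_order(toy))
```

Here seqA and seqB would be aligned first, with seqC and then the most divergent seqD added afterwards.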
Introduction to Sequence Analysis Building a MSA: 1) Get protein homologs with BLAST We select all the hits obtained in the previous search results, and then click 'Download FASTA'.
Introduction to Sequence Analysis Building a MSA: 2) Copy all the downloaded sequences We then copy all the downloaded sequences to the ClustalW2 tool
Introduction to Sequence Analysis Building a MSA: 3) ClustalW MSA results
Introduction to Sequence Analysis Analyzing ClustalW results The branch lengths on the phylogram are proportional to the evolutionary distance between species. In the cladogram, however, the branches are normalised and therefore do not represent the distance between species.
Introduction to Sequence Analysis Interpretation of ClustalW results
Consensus symbols: By default, an alignment displays the following symbols denoting the degree of conservation observed in each column:
• "*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.
• ":" means that conserved substitutions (similar residues) have been observed, according to the COLOUR table below.
• "." means that semi-conserved substitutions (less similar residues) have been observed.
Colour: This option only works when you have chosen ALN or GCG as the output format. Residues are coloured according to the following physicochemical criteria:
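As a minimal illustration of the consensus line, the sketch below marks only the fully conserved columns ("*") of a made-up alignment. The ":" and "." symbols depend on ClustalW's residue groupings, which are not reproduced here, so those columns are simply left blank.

```python
def conserved_columns(aligned_seqs):
    """Return a marker line with '*' under columns where all residues are identical."""
    markers = []
    for column in zip(*aligned_seqs):
        residues = set(column)
        fully_conserved = len(residues) == 1 and "-" not in residues
        markers.append("*" if fully_conserved else " ")
    return "".join(markers)

# Made-up aligned sequences of equal length.
msa = [
    "MKTAY-AK",
    "MKTAYIAR",
    "MKSAYLAR",
]
for row in msa:
    print(row)
print(conserved_columns(msa))
```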
Introduction to Sequence Analysis Other examples of MSA programs
• T-Coffee: combines information from global and local alignments to produce a global MSA (http://www.ebi.ac.uk/Tools/t-coffee/index.html)
• Muscle: builds global MSA (http://www.ebi.ac.uk/Tools/muscle/)
• Mafft: generates global MSA (http://www.ebi.ac.uk/Tools/mafft/index.html)
• DiAlign: produces global and local MSA (http://bibiserv.techfak.uni-bielefeld.de/dialign/)
• Hmmer: generates local MSA (http://hmmer.janelia.org/)
• Meme: builds local MSA (http://meme.sdsc.edu/meme4_1/cgi-bin/meme.cgi)
Introduction to Sequence Analysis Searching protein families with InterPro
Introduction to Sequence Analysis Searching protein families with InterPro What is InterPro? http://www.ebi.ac.uk/interpro/user_manual.html
• InterPro is an integrated documentation resource for protein families, domains and sites. InterPro is a consortium of member databases (PROSITE, Pfam, Prints, ProDom, SMART and TIGRFAMs). Each member database devises methods that can be applied computationally to assign a score to a protein according to how well it matches a given signature. For some types of methods the classification is binary (i.e. hit or miss); in other cases a numerical value is produced and a cut-off point is chosen to separate hits from misses. Different member databases create methods/signatures in different ways: some groups build them from manually studied alignments, others use automatic processes with some human input and correction, and ProDom uses an entirely automatic method.
• Signatures describing the same protein family or domain are grouped into unique InterPro entries. Each combined InterPro entry has a unique accession number, an abstract describing the features of the proteins associated with the entry, literature references, and links to the relevant member database(s). All UniProtKB protein sequences that match a particular InterPro entry are listed in the Match Table associated with that entry. There are also links to the InterPro graphical views. The graphical views, which can be sorted by UniProtKB accession number, structure or taxonomy, show the position of the signatures on the protein; mousing over a signature brings up a pop-up box giving the accession, name and position.
• InterPro graphically represents the location of a protein domain, together with information on the origin of that domain and the proteins that contain it. Families are also defined and may contain several InterPro domains, which are often, but not always, in the same order. Through the InterPro Domain Architecture view, the composition and order of the different domains within a family are clearly displayed for easy comparison, as well as for simple navigation between the entries for individual domains.
• InterPro entries are linked to one another through PARENT/CHILD and CONTAINS/FOUND IN relationships. PARENT/CHILD relationships indicate superfamily/family/subfamily relationships, as well as domain hierarchies, where sequences can be subdivided into more specific subsets. CONTAINS/FOUND IN relationships apply to domains, repeats and sites within families, and are used to describe the composition of protein sequences.
Introduction to Sequence Analysis Searching protein families with InterPro Going back to our sequence 2: http://www.uniprot.org/uniprot/p13346
• We move down the page to the section 'Database cross-references' until we find the following link:
• There are 2 InterPro entries in this case: the first is an annotation of type 'Domain', and the second is an annotation of type 'Family'.
• InterPro entries can have associations such as parent/child (different levels defined by InterPro methods) or contains/found in.
Introduction to Sequence Analysis Searching protein families with InterPro Members that contribute to this annotation