
Multiple Sequence Alignment


guy

Presentation Transcript


  1. Multiple Sequence Alignment

  2. Definition • Homology: related by descent • Homologous sequence positions  ATTGCGC ATTGCGC ATTGCGC  AT-ACGC ATTGCGC  ATACGC A

  3. Reasons for aligning sets of sequences • Organise data to reflect sequence homology • Estimate evolutionary distance • Infer phylogenetic trees from homologous sites • Highlight conserved sites/regions • Highlight variable sites/regions • Uncover changes in gene structure • Look for evidence of selection • Summarise information

  4. Alignments help to • Organise • Visualise • Analyse sequence data

  5. The process of aligning sequences is a game involving playing off gaps and mismatches

  6. Ways of aligning multiple sequences • By hand • Automated • Combination

  7. Definition Optimality criteria: some kind of rule or scoring scheme that helps you decide which alignment you consider to be the best

  8. Pairwise vs Multiple Sequences • Pairs of sequences are typically aligned using exhaustive algorithms (dynamic programming) • Complexity of exhaustive methods is O(2^n m^n), where n = number of sequences and m = sequence length • Multiple sequence alignment is usually performed using heuristic methods
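As an illustration of the exhaustive pairwise case, here is a minimal sketch of global alignment by dynamic programming (Needleman-Wunsch style). The scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions chosen for the example, not values from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    best, x, y = needleman_wunsch("ATTGCGC", "ATACGC")
    print(best)
    print(x)
    print(y)
```

The table has (m+1) x (m+1) cells for two sequences of length m; extending the same idea to n sequences needs an n-dimensional table, which is why exhaustive multiple alignment quickly becomes impractical.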

  9. ATTGCGC  ATA-CGC The Correct Alignment  ATTGCGC ATTGCGC ATTGCGC  AT-ACGC ATTGCGC  ATACGC A

  10. The Correct Alignment

  11. Sequence alignment is easy with sufficiently closely related sequences • Below a certain level of identity sequence alignment may become meaningless • Twilight zone for amino acid sequences: ~30% identity • In the twilight zone it is good to make use of additional information if possible (e.g. structure)

  12. Consensus Sequences • Simplest form: a single sequence which represents the most common amino acid/base in each position
      Y D D G A   V   - E A L
      Y D G G -   -   - E A L
      F E G G I   L   V E A L
      F D - G I   L   V Q A V
      Y E G G A   V   V Q A L
      Y D G G A/I V/L V E A L   (consensus)
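The simplest consensus can be sketched in a few lines. This is only an illustration: ignoring gaps when counting, and reporting ties as slash-separated residues in alphabetical order, are assumptions made here rather than rules stated on the slide.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Return the most common residue per column; ties are joined with '/'."""
    result = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != gap)  # gaps are not counted
        if not counts:
            result.append(gap)  # column made only of gaps
            continue
        best = max(counts.values())
        tied = sorted(c for c, n in counts.items() if n == best)
        result.append("/".join(tied))
    return result


if __name__ == "__main__":
    alignment = [
        "YDDGAV-EAL",
        "YDGG---EAL",
        "FEGGILVEAL",
        "FD-GILVQAV",
        "YEGGAVVQAL",
    ]
    print(" ".join(consensus(alignment)))
```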

  13. Multiple Alignment Formats e.g. Clustal, Phylip, MSF, MEGA etc. etc.

  14. Clustal Format
      CLUSTAL X (1.81) multiple sequence alignment

      CAS1_BOVIN   MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
      CAS1_SHEEP   MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
      CAS1_PIG     MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
      CAS1_HUMAN   MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
      CAS1_RABBIT  MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
      CAS1_MOUSE   MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
      CAS1_RAT     MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
                   *:***: **.*.*:* : . :

  15. Phylip Format (Interleaved)
      7 100
      SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
      SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
      SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
      SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
      SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
      SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
      SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL

                 AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
                 AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
                 AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
                 AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
                 AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
                 AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
                 AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL

  16. Phylip Format (Sequential)
      3 100
      Rat       ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
                TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
      Mouse     ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
                TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
      Rabbit    ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
                TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
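A small sketch of writing a sequential Phylip file may make the layout clearer. It assumes the classic strict convention of names padded (or truncated) to 10 characters, with each sequence written on a single line; the function name is made up for this example.

```python
def to_phylip_sequential(named_sequences):
    """Format {name: aligned_sequence} as strict sequential Phylip text."""
    lengths = {len(seq) for seq in named_sequences.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Header line: number of sequences, then number of alignment columns.
    lines = [f"{len(named_sequences)} {lengths.pop()}"]
    for name, seq in named_sequences.items():
        # Names occupy exactly 10 characters (truncated or space-padded).
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_phylip_sequential({
        "Rat": "ATGGTGCACCTGACTGAT",
        "Mouse": "ATGGTGCACCTGACTGAT",
        "Rabbit": "ATGGTGCATCTGTCCAGT",
    }))
```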

  17. Mega Format
      #mega
      TITLE: No title

      #Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
      #Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
      #Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
      #Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
      #Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
      #Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
      #Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

  18. Progressive Multiple Alignment • Heuristic • Perform pairwise alignments • Align sequences to alignments or alignments to existing alignments (profile alignments) • Do the alignments in some sensible order

  19. Progressive versus Simultaneous • speed versus accuracy • simultaneous methods are capable of working out an ‘exact’ solution to the problem of multiple sequence alignment (e.g. NCBI’s MSA – user interface QAlign)

  20. Iterative methods • Several progressive alignment methods can be iterated • e.g. Barton-Sternberg, ClustalX

  21. ClustalX Algorithm • Perform pairwise alignments and calculate distances for all pairs of sequences • Construct guide tree (dendrogram) joining the most similar sequences using Neighbour Joining • Align sequences, starting at the leaves of the guide tree. This involves pairwise comparisons as well as comparison of a single sequence with a group of sequences (profile)
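The first step, turning pairwise alignments into distances and finding the most similar pair, can be sketched as below. This is a simplified illustration, not ClustalX's own code: it assumes the pairs are already aligned to equal length, uses 1 minus fractional identity as the distance, and skips columns containing a gap. The function names are invented for the example.

```python
from itertools import combinations


def identity_distance(a, b, gap="-"):
    """Distance = 1 - fraction of identical residues over gap-free columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant
    same = sum(1 for x, y in pairs if x == y)
    return 1.0 - same / len(pairs)


def closest_pair(named_sequences):
    """Return the pair of names with the smallest distance (joined first)."""
    names = sorted(named_sequences)
    return min(
        combinations(names, 2),
        key=lambda p: identity_distance(named_sequences[p[0]],
                                        named_sequences[p[1]]),
    )


if __name__ == "__main__":
    seqs = {"rat": "ATGGTGCACC", "mouse": "ATGGTGCACC", "rabbit": "ATGGTGCATC"}
    print(closest_pair(seqs))
```

A guide-tree method such as Neighbour Joining would then work from the full matrix of these distances rather than only the single closest pair.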

  22. ClustalX is not optimal • There are known areas in which ClustalX performs badly e.g. • errors introduced early cannot be corrected by subsequent information • alignments of sequences of differing lengths cause strange guide trees and unpredictable effects • edges: ClustalX does not penalise gaps at edges • There are alternatives to ClustalX available

  23. T-Coffee • JMB 2000 • Also a progressive alignment method • Designed to solve some of the problems with Clustal (in particular Clustal's inability to correct errors that appear early in the alignment process) • Can consider global and local pairwise alignments

  24. Using ClustalX • Start with sequences in FASTA format (or an existing alignment in Clustal format) • [Do Alignment] on the alignment menu

  25. ClustalX Parameters • Scoring Matrix • Gap opening penalty • Gap extension penalty • Protein gap parameters • Additional algorithm parameters • Secondary structure penalties

  26. Score Matrices • Pairwise matrices and multiple alignment matrix series • PAM (Dayhoff), BLOSUM (Henikoff), GONNET (default), user defined • Transition (A<->G, C<->T) / transversion (purine<->pyrimidine, e.g. A<->C) ratio – low for distantly related sequences
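To make the transition/transversion distinction concrete, here is a short sketch that counts both kinds of difference between two aligned nucleotide sequences. It is an illustration only; skipping gaps and ambiguous characters is an assumption of this example, and it says nothing about how ClustalX applies the ratio internally.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_counts(seq1, seq2):
    """Count (transitions, transversions) between two aligned DNA sequences."""
    transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue  # identical site, gap, or ambiguous base
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return transitions, transversions


if __name__ == "__main__":
    print(ts_tv_counts("AACGT", "GCTG-"))
```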

  27. Gap Penalties • Linear gap penalties – Affine gap penalties p = o + l·e (o = gap opening penalty, l = gap length, e = gap extension penalty) • Gap opening • Gap extension • Protein specific penalties (on by default) • Increase the probability of gaps associated with certain residues • Increase the chances of gaps in loop regions (> 5 hydrophilic residues)
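The affine formula can be checked with a small sketch. The penalty values used in the demo are arbitrary illustrations, and the helper names are invented here. Note that the slide's form p = o + l·e charges the extension cost for every gap position; some programs instead use o + (l - 1)·e.

```python
import re


def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for one gap of the given length: p = o + l * e."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


def total_gap_penalty(aligned_sequence, gap_open, gap_extend):
    """Sum the affine penalty over every run of '-' in an aligned sequence."""
    runs = re.findall(r"-+", aligned_sequence)
    return sum(affine_gap_penalty(len(r), gap_open, gap_extend) for r in runs)


if __name__ == "__main__":
    # One gap of length 3 is cheaper than three separate gaps of length 1.
    print(total_gap_penalty("AT---GC", gap_open=10, gap_extend=1))
    print(total_gap_penalty("A-T-G-C", gap_open=10, gap_extend=1))
```

The comparison in the demo shows the point of the affine scheme: opening a gap is expensive, extending one is cheap, so the aligner prefers a few long gaps over many short ones.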

  28. Algorithm parameters • Slow-accurate pair-wise alignment • Do alignment from guide tree • Reset gaps before aligning (iteration) • Delay Divergent sequences (%)

  29. Additional displays • Column Scores • Low quality regions • Exceptional residues

  30. Multiple Alignment Tips • Align pairs of sequences using an optimal method • Progressive alignment programs such as ClustalX for multiple alignment • Choose representative sequences to align carefully • Choose sequences of comparable lengths • Progressive alignment programs may be combined • Review alignment by eye and edit • If you have a choice align amino acid sequences rather than nucleotides

  31. Alignment of coding regions • Nucleotide sequences are much harder to align accurately than proteins • Protein coding sequences can be aligned using the protein sequences • e.g. BioEdit: toggle translation to amino acid, call ClustalW to align, edit alignment by hand, toggle back to nucleotide • In-frame nucleotide alignments can be used, e.g. to determine non-synonymous and synonymous distances separately
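The "toggle back to nucleotide" step amounts to threading the original codons through the aligned protein, so that every protein gap becomes three nucleotide gaps. A minimal sketch follows, assuming the nucleotide sequence is in frame and contains exactly one codon per residue (no stop codon); the function name is hypothetical.

```python
def protein_to_codon_alignment(aligned_protein, nucleotides, gap="-"):
    """Map an aligned protein back onto its unaligned coding sequence."""
    if len(nucleotides) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    residues = sum(1 for c in aligned_protein if c != gap)
    if residues != len(codons):
        raise ValueError("number of residues does not match number of codons")
    remaining = iter(codons)
    out = []
    for c in aligned_protein:
        # A protein gap becomes a three-nucleotide gap, keeping the frame.
        out.append(gap * 3 if c == gap else next(remaining))
    return "".join(out)


if __name__ == "__main__":
    print(protein_to_codon_alignment("MV-H", "ATGGTGCAC"))
```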

  32. Multiple Alignments and Phylogenetic Trees • You can make a more accurate multiple sequence alignment if you know the tree already • A phylogenetic tree is only as good as the alignment from which it was produced • The process of constructing a multiple alignment (unlike pair-wise) needs to take account of phylogenetic relationships

  33. Editing a multiple sequence alignment • It is NOT fraud to edit a multiple sequence alignment • Incorporate additional knowledge if possible • Alignment editors help to keep the data organised and help to prevent unwanted mistakes

  34. Alignment Editors • e.g. GDE, BioEdit, Seaview, Jalview etc. • Some alignment editors have begun to function as sequence analysis platforms (e.g. tools on BioEdit, GDE) • Construct sub-sequences (GDE, Seaview) • Annotate sequences (Seaview)

  35. Aligning weakly similar sequences

  36. Sequence contains conserved regions • e.g. DIALIGN (Morgenstern, Dress, Werner) • re-aligns regions between conserved blocks http://bibiserv.techfak.uni-bielefeld.de/ • useful if sequences contain consistent conserved blocks • Block Maker – searches for conserved words that may be inconsistent http://blocks.fhcrc.org/

  37. Profile Alignment Gribskov et al. 1987 • Position specific scores • Allows addition of extra sequence(s) to an alignment • Allows alignment of alignments • Gaps introduced as whole columns in the separate alignments • Optimal alignment in time O(a^2 l^2), where a = alphabet size and l = sequence length • Information about the degree of conservation of sequence positions is included
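The idea of position-specific scores can be sketched by building a simple frequency profile from an alignment. This is a deliberate simplification: it uses per-column residue frequencies with a pseudocount, whereas the profiles of Gribskov et al. weight a substitution matrix by the column composition. The function name, the pseudocount of 1 and the DNA alphabet are assumptions of this example.

```python
from collections import Counter


def build_profile(alignment, alphabet="ACGT", pseudocount=1.0):
    """Return one {symbol: frequency} dict per alignment column."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)  # gaps ignored
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append(
            {s: (counts[s] + pseudocount) / total for s in alphabet}
        )
    return profile


if __name__ == "__main__":
    profile = build_profile(["ATTGCGC", "AT-ACGC", "ATTGCGC"])
    for position, column in enumerate(profile, start=1):
        print(position, {s: round(f, 2) for s, f in column.items()})
```

A strongly conserved column ends up with one dominant frequency, while a variable column is flatter, which is how the profile carries information about the degree of conservation at each position.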

  38. Good reasons to use profile alignments • Adding a new sequence to an existing multiple alignment that you want to keep fixed(align sequence to profile) • Searching a database for new members of your protein family(pfsearch) • Searching a database of profiles to find out which one your sequence belongs to(pfscan) • Combining two multiple sequence alignments(profile to profile)

  39. Profile Alignment Using ClustalX • Profile Alignment Mode • Align sequence to profile • Align profile 1 to profile 2 • Secondary structure parameters

  40. Profile searching using PSI-BLAST • Position Specific Iterative • Perform search – construct profile – perform search • Convergence (hopefully…) • Increased sensitivity for distantly related sequences • Available on-line (NCBI)

  41. Databases of Aligned Sequences • Hovergen http://pbil.univ-lyon1.fr/databases/hovergen.html (vertebrate alignments) • Pfam http://www.sanger.ac.uk/Software/Pfam/ (protein domain alignments and profile HMMs) • BLOCKS http://blocks.fhcrc.org/ • Ribosomal Database Project http://rdp.cme.msu.edu/html/ alignments and trees derived from rRNA sequences • Interpro – combines information from other sources • Many more…

  42. Probabilistic Models of Sequence Alignment • Hidden Markov Models • sequence of states and associated symbol probabilities • Produces a probabilistic model of a sequence alignment • Align a sequence to a Profile Hidden Markov Model • Algorithms exist to find the most probable path through the model (e.g. the Viterbi algorithm)

  43. Markov Chain: A chain of things. The probability of the next thing depends only on the current thing Hidden Markov Model: A sequence of states which form a Markov Chain. The states are not observable. The observable characters have “emission” probabilities which depend on the current state.
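As a toy illustration of these definitions, the sketch below finds the most probable hidden state path for an observed sequence using the Viterbi algorithm. The two-state model (an "AT"-rich state and a "GC"-rich state) and all of its probabilities are invented for this example; a real profile HMM has match, insert and delete states per alignment column.

```python
import math

# Hypothetical toy model: hidden states emit nucleotides with different biases.
STATES = ["AT", "GC"]
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {
    "AT": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path (log space; no zero probabilities)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    pointers = []
    for symbol in observations[1:]:
        column, back = {}, {}
        for s in states:
            # Best predecessor: depends only on the previous state (Markov).
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            column[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][symbol]))
            back[s] = prev
        scores.append(column)
        pointers.append(back)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for back in reversed(pointers):
        state = back[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("AATTGGCC", STATES, START, TRANS, EMIT))
```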

  44. Some more recent developments • The need to align genomes • alignment tools are required that can align very large regions of genomes • poses a computational challenge • programmes such as DIALIGN can be run in parallel on multiprocessor machines

  45. Some more recent developments • MUSCLE • Faster (uses k-mer frequencies to estimate the first pairwise distances, without aligning the pairs) • Progressive (repeats the MSA using the more accurate Kimura distance between aligned amino acid sequences) • Has a third optimisation stage that involves making profile alignments of sub-trees and accepting the new alignment if it improves the SP score
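The SP (sum-of-pairs) score used as the acceptance test can be sketched as follows. The scoring values here (match +1, mismatch -1, residue against gap -2, gap against gap 0) are assumptions for illustration; MUSCLE itself scores residue pairs with a substitution matrix and its own gap penalties.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise score of every pair of rows in every column."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # gap against gap scores nothing
            if x == "-" or y == "-":
                total += gap        # residue against gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    before = ["AT-G", "ATTG", "A-TG"]
    after = ["ATG", "ATG", "ATG"]
    # A refinement step would keep whichever alignment scores higher.
    print(sp_score(before), sp_score(after))
```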

  46. MuSiC - multiple sequence alignment with constraints • web server that allows a user to enter a set of
