
Applied Bioinformatics


Presentation Transcript


  1. Applied Bioinformatics Week 6

  2. Topics • Multiple Sequence Alignment • Profiles

  3. Why do we do multiple alignments? • Multiple nucleotide or amino acid sequence alignments are usually performed for one of the following purposes: • To characterize protein families by identifying shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search has revealed homology to several sequences) • To determine the consensus sequence of several aligned sequences.

  4. Why do we do multiple alignments? • To help predict the secondary and tertiary structures of new sequences • As a preliminary step in molecular evolution analysis, where phylogenetic methods are used to construct phylogenetic trees

  5. An example of Multiple Alignment
     VTISCTGSSSNIGAG-NHVKWYQQLPG
     VTISCTGTSSNIGS--ITVNWYQQLPG
     LRLSCSSSGFIFSS--YAMYWVRQAPG
     LSLTCTVSGTSFDD--YYSTWVRQPPG
     PEVTCVVVDVSHEDPQVKFNWYVDG--
     ATLVCLISDFYPGA--VTVAWKADS--
     AALGCLVKDYFPEP--VTVSWNSG---
     VSLTCLVKGFYPSD--IAVEWWSNG--

  6. Multiple Sequence Alignment • Can we extend pairwise alignment to higher dimensions?

  7. Trace back from the top-left corner: always select the maximum value from the outermost column and row, then jump to the next maximum in the next row or column. • Computational cost: clearly, there are many more paths here that need to be evaluated

  8. 3D Sequence Alignment • Has been done • Seems feasible for short sequences • However higher dimensions are computationally very expensive
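To get a feel for why higher dimensions become expensive, the sketch below counts the cells of the dynamic-programming lattice for N sequences of equal length L, which is (L + 1)^N, and the number of neighbouring cells each cell has to look at, which is 2^N - 1. The function name and the example length of 100 residues are illustrative assumptions, not part of the slides.

```python
def dp_lattice_cost(num_seqs: int, length: int) -> tuple[int, int]:
    """Return (cells, predecessors per cell) for a full N-dimensional alignment."""
    cells = (length + 1) ** num_seqs      # one axis of length L + 1 per sequence
    predecessors = 2 ** num_seqs - 1      # every non-empty subset of axes can advance
    return cells, predecessors


# Assumed example: sequences of 100 residues each.
for n in (2, 3, 5, 10):
    cells, preds = dp_lattice_cost(n, 100)
    print(f"{n} sequences: {cells:.3e} cells, {preds} predecessors per cell")
```

The growth is exponential in the number of sequences, which is why the following slides switch to building the alignment from pairwise steps.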

  9. Multiple Alignment Method • The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods. • The principle is that a multiple alignment is achieved by successive application of pairwise methods.

  10. Multiple Alignment Method • The steps are summarized as follows: • Compare all sequences pairwise. • Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering. • Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. Thus, for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.
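The first two steps above (all-against-all comparison, then clustering into a hierarchy) can be sketched as follows. This is a minimal illustration, not the method of any particular program: the scoring values (match 1, mismatch -1, gap -2), the greedy average-score clustering, and the four toy sequences are all assumptions chosen so that A pairs with C and B pairs with D, as in the slide's example.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score, keeping one matrix row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def guide_order(seqs):
    """Greedy clustering: repeatedly merge the two clusters with the best average score."""
    scores = {frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
              for pair in combinations(seqs, 2)}

    def average_score(pair):
        x, y = pair
        total = sum(scores[frozenset((m, n))] for m in x for n in y)
        return total / (len(x) * len(y))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        x, y = max(combinations(clusters, 2), key=average_score)
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(x + y)
        merges.append((x, y))
    return merges


# Hypothetical toy sequences.
seqs = {"A": "ACGTACGT", "B": "TTGTACCA", "C": "ACGTACGA", "D": "TTGTACCT"}
for left, right in guide_order(seqs):
    print("align", left, "with", right)
```

The returned list is only the order of alignment steps; the alignments themselves would then be built pair by pair in that order.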

  11. Steps in Multiple Alignment

  12. Alternative Ways • Align all sequences against all sequences pairwise • Start with the two sequences that have the highest similarity • Calculate a "consensus sequence" • Align the next closest sequence from the pool with the consensus
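One simple way to calculate a consensus sequence is a per-column majority vote over the aligned rows. The sketch below assumes that gaps are counted like any other character and that ties go to the character seen first in the column; both are illustrative choices, and real programs may use thresholds or ambiguity codes instead.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of equally long aligned rows."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*rows):
        # most_common keeps first-seen order among equally frequent characters
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


# Hypothetical toy alignment.
print(consensus(["ACGT-", "ACGTA", "ACCTA"]))  # ACGTA
```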

  13. Multiple Sequence Alignment [Figure: pairwise alignment scores between sequences a, b, c, d, e, f]

  14. Dendrogram (Distance Tree) • Use the alignment scores to align sequences sequentially, in order of decreasing similarity • Align the best two • Determine which sequence aligns best with the consensus sequence and align that next [Figure: guide tree over sequences a to f, with consensus]

  15. Aligning Profile and Sequence

  16. Aligning Profile and Sequence

  17. Mnuc: 2, Mgap: 1, MMts: 1, MMtv: -1, G: -3

  18. Cij = MAX(UP, LEFT, DIAG) • UP: Sum(CijPn * Type) + Val(up) • LEFT: Sum(CijPn * Type) + Val(left) • DIAG: Sum(CijPn * Type) + Val(diagonal) • Scores: Mnuc: 2, Mgap: 1, MMts: 1, MMtv: -1, G: -3

  19. Cij = MAX(UP, LEFT, DIAG) = 5/3 • UP: -3 + G = -6 • LEFT: -3 + G = -6 • DIAG: 0 + 2/3 * Mnuc + 1/3 * MMts = 5/3 • Scores: Mnuc: 2, Mgap: 1, MMts: 1, MMtv: -1, G: -3

  20. Cij = MAX(UP, LEFT, DIAG) • UP: Sum(CijPn * Type) + Val(up) • LEFT: Sum(CijPn * Type) + Val(left) • DIAG: Sum(CijPn * Type) + Val(diagonal) • Scores: Mnuc: 2, Mgap: 1, MMts: 1, MMtv: -1, G: -3

  21. A C – G T
      A C – G T
      G C C A T
      New: A C – G -
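The worked cell on slide 19 can be reproduced with a short sketch. It assumes the abbreviations mean the following, which the slides do not spell out: Mnuc is a nucleotide match, Mgap a gap aligned with a gap, MMts a transition mismatch (purine with purine, or pyrimidine with pyrimidine), MMtv a transversion mismatch, and G a gap penalty. It also assumes the profile is built from the three aligned sequences on slide 21, and that UP and LEFT simply add G to the neighbouring value, as in the worked example. All function names are hypothetical.

```python
from fractions import Fraction

# Scoring parameters as given on the slides.
M_NUC, M_GAP, MM_TS, MM_TV, G = 2, 1, 1, -1, -3

PURINES, PYRIMIDINES = set("AG"), set("CT")


def pair_score(x, y):
    """Score one profile character x against one sequence character y."""
    if x == "-" and y == "-":
        return M_GAP                      # gap aligned with gap
    if x == "-" or y == "-":
        return G                          # residue against gap (assumption)
    if x == y:
        return M_NUC                      # identical nucleotides
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return MM_TS if same_class else MM_TV  # transition vs. transversion


def column_profile(column):
    """Relative frequency of each character in one alignment column."""
    return {c: Fraction(column.count(c), len(column)) for c in set(column)}


def cell_score(column, nucleotide, val_up, val_left, val_diag):
    """Cij = MAX(UP, LEFT, DIAG) for a single cell."""
    up = val_up + G
    left = val_left + G
    diag = val_diag + sum(p * pair_score(c, nucleotide)
                          for c, p in column_profile(column).items())
    return max(up, left, diag)


profile_rows = ["AC-GT", "AC-GT", "GCCAT"]        # the three aligned sequences
first_column = [row[0] for row in profile_rows]   # ['A', 'A', 'G']
print(cell_score(first_column, "A", val_up=-3, val_left=-3, val_diag=0))  # 5/3
```

Using `Fraction` keeps the result exact, so the printed value matches the 5/3 on the slide instead of a rounded decimal.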

  22. Alternative Dendrogram • Only profile against sequence • Profile vs. profile [Figures: two guide trees over sequences a to f with consensus, one for each strategy]

  23. Dendrogram • The dendrograms shown here are merely guide trees • They show the order in which the sequences in the MSA were aligned • This may be important, since alignments may differ if the order of their construction is different

  24. Different MSAs? • MSA is currently a heuristic approach, since aligning all sequences at once is not feasible at the moment • The trick is to align the sequences one by one using pairwise alignment methods • In each step some information may be lost that could have been recovered if other sequences had been aligned first

  25. Alternative Alignment Strategies • Align the closest relatives, then continue picking the closest relative • Align the closest relatives, then determine the closest relative to the profile and choose that for the next alignment • Align the closest relatives pairwise, then make pairwise alignments of all profiles and leftover single sequences against each other, and continue from the start [Figures: guide trees over sequences a to f with consensus]

  26. End of Theory I • Mind Map • 10 min break

  27. Practical Part I

  28. Choosing sequences for alignment: General considerations • The more sequences to align, the better. • Don’t include highly similar (>80% identical) sequences. • Subgroups should be pre-aligned separately, and one member of each subgroup should be included in the final multiple alignment.

  29. Choosing Sequences • As far as possible, try to align sequences of similar length. • Pileup can align sequences of up to 5000 residues with 2000 gaps (7000 characters in total). • Pileup works well only for closely related sequences.

  30. Choosing Sequences • How many? • 10 to 15 (fewer than 50 would be good) • Sequences should be >30% and <90% identical • Prefer sequences of similar length • Prefer sequences without internal repeats, or extract the repeats before aligning • Make sure that you don’t overrepresent one type of sequence relative to other types in your MSA

  31. Choosing Sequences • While choosing your sequences give them good names • Some of the sequences should be well annotated

  32. Gathering Sequences • Retrieve a protein sequence from NCBI • Translated nucleotides could be tried • Go to: http://www.expasy.ch/tools/blast • Paste that sequence into the box

  33. Gathering Sequences • Scroll through the results and select about 10 full-length sequences • Pick them from different levels of similarity, e.g., with different numbers of identities • Export the collection as FASTA

  34. Identities in Range? • Go to: http://www.biolnk.com • Choose Tools and then MultiIdentity • Paste your FASTA-formatted information • Set the thresholds • Check whether all sequences are within the desired range of identities relative to each other • Add or delete sequences accordingly
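The same kind of range check can be sketched locally. This is not how the MultiIdentity tool works internally (that is not described here); it is a minimal illustration that assumes the rows are already aligned to equal length, that identity is counted over columns where neither row has a gap (one of several common definitions), and that the thresholds are the 30% and 90% suggested earlier. The names and toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Percent identical positions over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def out_of_range(aligned, low=30.0, high=90.0):
    """Return (name1, name2, identity) for every pair not strictly between low and high."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(aligned.items(), 2):
        pid = percent_identity(s1, s2)
        if not (low < pid < high):
            flagged.append((n1, n2, pid))
    return flagged


# Hypothetical aligned toy sequences: s1 and s2 are identical, so that pair is flagged.
aligned = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAC", "s3": "ACGTTTTTAC"}
for name1, name2, pid in out_of_range(aligned):
    print(f"{name1} vs {name2}: {pid:.1f}% identity is outside the desired range")
```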

  35. MSA • http://www.ebi.ac.uk/clustalw • http://www.tcoffee.org • http://www.drive5.com/muscle • Try all the above and compare the resulting MSAs

  36. How good is the MSA? • * Column entirely conserved • : Residues of approximately the same size and hydropathy • . Less similar than ":" • Colors (check the color scheme)
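The "*" marker is easy to reproduce by hand, which helps when checking an alignment that comes without a marker line. The sketch below only handles the fully conserved case; the ":" and "." markers depend on residue similarity groups defined by the alignment program and are deliberately left out. The function name is an assumption, and the rows are the example alignment from slide 5.

```python
def conserved_marks(rows, gap="-"):
    """Return a marker line with '*' under every fully conserved, gap-free column."""
    marks = []
    for column in zip(*rows):
        residues = set(column)
        fully_conserved = len(residues) == 1 and gap not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


# Example alignment from slide 5.
rows = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]
for row in rows:
    print(row)
print(conserved_marks(rows))
```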

  37. Output Formats • Many different formats • FASTA: widely supported • PDF: only for printing, storing, or sharing • PIR: similar to FASTA • MSF: common MSA format • ALN: ClustalW's alignment format

  38. Which Output do I need? • Depends on what you are planning to do with the MSA • Depends on the software you would like to use for downstream processing • We will see more next week

  39. Converting Formats • http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html • Names (>…) should be no longer than 15 characters • Different formats maintain different data • Converting introduces the risk of losing data • Make sure to keep a master copy
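Shortening names before conversion can be done with a few lines of code. The sketch below assumes that only the first word of each FASTA header is kept as the name, that the 15-character limit from the slide applies, and that names colliding after truncation are made unique with a numeric suffix. The function name and the example headers are hypothetical, and the original file should be kept as the master copy.

```python
def shorten_fasta_names(fasta_text, max_len=15):
    """Truncate each FASTA name to max_len characters, keeping the names unique."""
    used = set()
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else "seq"   # drop the free-text description
            short = name[:max_len]
            counter = 1
            while short in used:                  # collision after truncation
                suffix = f"_{counter}"
                short = name[:max_len - len(suffix)] + suffix
                counter += 1
            used.add(short)
            out.append(">" + short)
        else:
            out.append(line)                      # sequence lines pass through unchanged
    return "\n".join(out)


# Hypothetical input with two names that collide after truncation.
example = ">immunoglobulin_heavy_A some description\nACDE\n>immunoglobulin_heavy_B\nACDF"
print(shorten_fasta_names(example))
```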

  40. Editing Alignments • http://www.jalview.org • Start the applet • Choose File – Input Alignment – from Textbox • Copy and paste the ClustalW alignment

  41. Playtime • Be creative • Explore the functions • To save your work, you need to install the program locally • Java applets are not allowed to save to your computer

  42. End Practice II • 15 min break

  43. Theopractical Part • Gene Structure • Profiles • Sequence Logos

  44. Gene Structure (a) Genes of multicellular organisms contain both promoter-proximal elements and enhancers as well as a TATA box or other promoter element. The latter positions RNA polymerase II to initiate transcription at the start site and influences the rate of transcription. Enhancers may be either upstream or downstream and as far away as 50 kb from the transcription start site. In some cases, promoter-proximal elements occur downstream from the start site as well. (b) Most yeast genes contain only one regulatory region, called an upstream activating sequence (UAS), and a TATA box, which is ≈90 base pairs upstream from the start site.

  45. Profiles

  46. Logos

  47. More Complex Logo

  48. Logo • http://blocks.fhcrc.org/blocks/process_blocks.html • Retrieve the FASTA sequence of your alignment • Paste it to the box above and create blocks
