Applied Bioinformatics

Applied Bioinformatics Week 6

Topics • Multiple Sequence Alignment • Profiles

Why do we do multiple alignments? • Multiple nucleotide or amino acid sequence alignments are usually performed for one of the following purposes: • To characterize protein families by identifying shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search has revealed homologies to several sequences) • To determine the consensus sequence of several aligned sequences

Why do we do multiple alignments? • To help predict the secondary and tertiary structures of new sequences • As a preliminary step in molecular evolution analysis, where phylogenetic methods are used to construct phylogenetic trees

An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Multiple Sequence Alignment • Can we take pairwise alignments into higher dimensions?

Trace back from the top-left corner: always select the maximum value from the outermost column and row, then jump to the next maximum in the next row or column. • Computational cost: clearly, many more paths need to be evaluated here than in the pairwise case

3D Sequence Alignment • Has been done • Seems feasible for short sequences • However higher dimensions are computationally very expensive
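To get a feel for why higher dimensions become expensive, the sketch below counts the cells of a full dynamic-programming lattice for N sequences of equal length L, which is (L + 1)^N, and the number of neighbouring cells each interior cell has to look back at, which is 2^N - 1. The function name `dp_lattice_cost` and the example length of 100 are illustrative choices, not taken from the slides.

```python
def dp_lattice_cost(n_seqs: int, length: int) -> tuple[int, int]:
    """Return (cells, predecessors per interior cell) for a full
    n_seqs-dimensional dynamic-programming alignment of sequences
    that all have the same length."""
    cells = (length + 1) ** n_seqs      # one axis of length + 1 per sequence
    predecessors = 2 ** n_seqs - 1      # every non-empty subset of axes can step back
    return cells, predecessors


if __name__ == "__main__":
    for n in (2, 3, 5, 10):
        cells, preds = dp_lattice_cost(n, 100)
        print(f"{n:>2} sequences: {cells:.3e} cells, {preds} predecessors each")
```

Going from two to three sequences of length 100 already multiplies the number of cells by about a hundred, which is why the exact approach is only practical for a few short sequences.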

Multiple Alignment Method • The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods. • The principle is that a multiple alignment is achieved by successive application of pairwise methods.

Multiple Alignment Method • The steps are summarized as follows: • Compare all sequences pairwise. • Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering. • Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. Thus, for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.
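The first two steps (all-against-all comparison, then clustering into a hierarchy) can be sketched as follows. This is a minimal illustration, not the procedure of any particular program: the scoring values (match 1, mismatch -1, gap -2), the average-linkage clustering, and the names `nw_score` and `guide_tree` are all assumptions made for the example.

```python
from itertools import combinations


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def guide_tree(seqs: dict[str, str]):
    """Cluster sequences by average pairwise score; return a nested tuple."""
    # Step 1: compare all sequences pairwise.
    score = {frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
             for pair in combinations(seqs, 2)}
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def average(c1, c2) -> float:
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(score[frozenset(p)] for p in pairs) / len(pairs)

    # Step 2: repeatedly join the two most similar clusters.
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


if __name__ == "__main__":
    toy = {"A": "ACGTACGT", "B": "TTGGCCAA", "C": "ACGTACGA", "D": "TTGGCCTT"}
    print(guide_tree(toy))
```

For the toy input, A pairs with C and B pairs with D before the two pairs are joined, which mirrors the A, B, C, D example in the text. The nested tuple only gives the order of alignment; building the alignment itself is the third step and is not shown here.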

Steps in Multiple Alignment

Alternative Ways • Align all sequences against all sequences pairwise • Start with the two sequences that have the highest similarity • Calculate a "consensus sequence" • Align the next closest sequence from the pool with the consensus
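One simple way to calculate a consensus sequence is a majority vote per column. The sketch below assumes the input rows are already aligned to equal length and use "-" for gaps; the function name `consensus` and the tie-breaking rule (first residue seen in the column wins) are arbitrary choices for illustration.

```python
from collections import Counter


def consensus(aligned: list[str], gap: str = "-") -> str:
    """Majority-rule consensus of equal-length aligned sequences.

    Gaps are ignored when voting; a column made only of gaps stays a gap.
    Ties go to the residue that appears first in the column.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    out = []
    for column in zip(*aligned):
        residues = [c for c in column if c != gap]
        out.append(Counter(residues).most_common(1)[0][0] if residues else gap)
    return "".join(out)


if __name__ == "__main__":
    print(consensus(["ACGT", "ACGA", "AC-T"]))
```

A single consensus letter per column discards how strongly each residue is supported, which is one reason profiles keep per-column frequencies instead.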

Multiple Sequence Alignment [Figure: pairwise alignment scores for sequences a, b, c, d, e, f]

Dendrogram (Distance Tree) • Alignment scores: align sequences sequentially with decreasing similarity • Align the best two • Determine the best alignment with the consensus sequence and align that next [Figure: dendrogram in which sequences a to f are joined step by step to a consensus]

Aligning Profile and Sequence

Mnuc: 2, Mgap: 1, MMts: 1, MMtv: -1, G: -3

UP: Sum(CijPn * Type) + Val(up)
LEFT: Sum(CijPn * Type) + Val(left)
DIAG: Sum(CijPn * Type) + Val(diagonal)
Cij = MAX(UP, LEFT, DIAG)
Mnuc: 2, Mgap: 1, MMts: 1, MMtv: -1, G: -3

UP: -3 + G = -6
LEFT: -3 + G = -6
DIAG: 0 + 2/3 * Mnuc + 1/3 * MMts = 5/3
Cij = MAX(UP, LEFT, DIAG) = 5/3
Mnuc: 2, Mgap: 1, MMts: 1, MMtv: -1, G: -3
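The DIAG term of this worked cell can be reproduced as a frequency-weighted sum over a profile column. The sketch below rests on an interpretation that the slides do not spell out: Mnuc is taken as the score for identical nucleotides, MMts for a transition mismatch, MMtv for a transversion mismatch, and G as the gap penalty. The profile column "AAG" is inferred from the 2/3 and 1/3 weights. Gap characters inside a profile column (and the role of Mgap) are deliberately left out, and the function names are hypothetical.

```python
from fractions import Fraction

# Values from the slide; their interpretation is an assumption.
M_NUC = 2     # identical nucleotides
MM_TS = 1     # transition mismatch (A<->G or C<->T)
MM_TV = -1    # transversion mismatch (purine <-> pyrimidine)
G = -3        # gap penalty

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pair_score(x: str, y: str) -> int:
    """Score one nucleotide against another."""
    if x == y:
        return M_NUC
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return MM_TS if same_class else MM_TV


def column_score(profile_column: str, base: str) -> Fraction:
    """Frequency-weighted score of `base` against a gap-free profile column."""
    if not profile_column or any(c not in "ACGT" for c in profile_column + base):
        raise ValueError("only non-empty columns of A, C, G, T are supported")
    n = len(profile_column)
    return sum((Fraction(pair_score(c, base), n) for c in profile_column), Fraction(0))


if __name__ == "__main__":
    up = -3 + G                            # value above plus gap penalty
    left = -3 + G                          # value to the left plus gap penalty
    diag = 0 + column_score("AAG", "A")    # diagonal value plus column score
    print(up, left, diag, max(up, left, diag))
```

Using `Fraction` keeps the result as an exact 5/3 rather than a rounded float, which makes it easy to compare against the hand calculation.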


     A C - G T
     A C - G T
     G C C A T
New: A C - G -

Alternative Dendrogram [Figure: two guide trees for sequences a to f, one labeled "Only profile against sequence", the other "Profile vs. profile"]

Dendrogram • The dendrograms shown here are mere guide trees • They show the order in which the sequences in the MSA were aligned • This may be important, since alignments may differ if the order of their construction is different

Different MSAs? • MSA is currently a heuristic approach, since aligning all sequences at once is not feasible at the moment • The trick is to align the sequences one by one using pairwise alignment methods • In each step some information may be lost that could have been recovered if other sequences had been aligned

Alternative Alignment Strategies • Align the closest relatives, then continue picking the closest relative • Align the closest relatives, then determine the closest relative to the profile and choose that for the next alignment • Align the closest relatives pairwise, then make pairwise alignments of all profiles and left-over single sequences against each other, and continue from the start [Figure: guide trees for sequences a to f illustrating these strategies]

End of Theory I • Mind Map • 10 min break

Practical Part I

Choosing sequences for alignment: General considerations • The more sequences to align the better. • Don't include highly similar (>80%) sequences. • Sub-groups should be pre-aligned separately, and one member of each subgroup should be included in the final multiple alignment.

Choosing Sequences • As far as possible, try to align sequences of similar length. • Pileup can align sequences of up to 5000 residues, with 2000 gaps (total 7000 characters). • Pileup is a good program only for similar (close) sequences.

Choosing Sequences • How many? • 10 to 15 (fewer than 50 would be good) • Sequences should be >30% and <90% identical • Prefer sequences of similar length • Prefer sequences without internal repeats, or extract the repeats before aligning • Make sure that you don't overrepresent one type of sequence against other types in your MSA

Choosing Sequences • While choosing your sequences give them good names • Some of the sequences should be well annotated

Gathering Sequences • Retrieve a protein sequence from NCBI • Translated nucleotides could be tried • Go to: http://www.expasy.ch/tools/blast • Paste that sequence into the box

Gathering Sequences • Scroll through the results and select about 10 full-length sequences • Cover different levels of similarity, e.g. different numbers of identities • Export the collection as FASTA

Identities in Range? • Go to: http://www.biolnk.com • Choose Tools and then MultiIdentity • Paste your FASTA-formatted information • Set the thresholds • See if all sequences are in the desired range of identities amongst each other • Add/delete sequences accordingly
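The same kind of range check can be sketched locally. This is not how the MultiIdentity tool works internally (that is not described here); it is a minimal illustration that assumes the sequences are already aligned to equal length, and it defines percent identity as identical positions divided by the columns where both sequences have a residue. Other definitions exist. The thresholds default to the 30% and 90% mentioned earlier, and the function names are hypothetical.

```python
from itertools import combinations


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def out_of_range(aligned: dict[str, str], low: float = 30.0, high: float = 90.0):
    """Return (name1, name2, identity) for every pair outside (low, high)."""
    flagged = []
    for n1, n2 in combinations(aligned, 2):
        pid = percent_identity(aligned[n1], aligned[n2])
        if not (low < pid < high):
            flagged.append((n1, n2, round(pid, 1)))
    return flagged


if __name__ == "__main__":
    rows = {"s1": "ACGT", "s2": "ACGT", "s3": "ACGA"}
    print(out_of_range(rows))
```

Pairs that are flagged as too similar are candidates for deletion; pairs that are too dissimilar may need a closer look before they go into the MSA.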

MSA • http://www.ebi.ac.uk/clustalw • http://www.tcoffee.org • http://www.drive5.com/muscle • Try all the above and compare the resulting MSAs

How good is the MSA? • * Column entirely conserved • : Residues of approximately the same size and hydropathy • . Less similar than : • Colors (check the color scheme)
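The "*" marker is the easiest of these to reproduce by hand. The sketch below marks only fully conserved columns; the ":" and "." markers depend on residue similarity groups that are not listed here, so they are left out. It assumes equal-length aligned rows with "-" as the gap character, and the function name `conservation_stars` is hypothetical.

```python
def conservation_stars(aligned: list[str], gap: str = "-") -> str:
    """Return a marker line with '*' under every fully conserved column.

    A column counts as conserved only if it contains no gap and every
    row has the same residue; all other columns get a space.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        conserved = gap not in column and len(set(column)) == 1
        marks.append("*" if conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    rows = ["ACGT", "ACGA", "AC-T"]
    print("\n".join(rows))
    print(conservation_stars(rows))
```

Printing the marker line directly under the alignment gives a quick visual impression of how many columns are fully conserved.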

Output Formats • Many different formats • FASTA: widely supported • PDF: only for printing, storing, sharing • PIR: similar to FASTA • MSF: common MSA format • ALN: subset of MSF

Which Output do I need? • Depends on what you are planning to do with the MSA • Depends on the software you would like to use for downstream processing • We will see more next week

Converting Formats • http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html • Names (>…) no longer than 15 characters • Different formats maintain different data • Converting can therefore lose data • Make sure to keep a master copy
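Before converting, it can help to check the sequence names against the 15-character limit. The sketch below assumes that the name is the first whitespace-delimited token after ">" in each FASTA header; it reports names that are too long and names that would collide if a converter simply cut them to 15 characters. The function name `check_fasta_names` is hypothetical, and real converters may shorten names differently.

```python
def check_fasta_names(fasta_text: str, max_len: int = 15) -> dict[str, list[str]]:
    """Report FASTA names longer than max_len and names that clash once truncated."""
    names = [line[1:].split()[0]
             for line in fasta_text.splitlines()
             if line.startswith(">") and line[1:].strip()]
    too_long = [n for n in names if len(n) > max_len]
    seen: set[str] = set()
    clashes = []
    for n in names:
        short = n[:max_len]
        if short in seen:
            clashes.append(n)   # would become indistinguishable from an earlier name
        seen.add(short)
    return {"too_long": too_long, "clashes": clashes}


if __name__ == "__main__":
    text = (">short_name some description\nACGT\n"
            ">a_very_long_sequence_name_1\nACGT\n"
            ">a_very_long_sequence_name_2\nACGA\n")
    print(check_fasta_names(text))
```

Names that clash after truncation are a typical way for data to get lost in conversion, so renaming them in the master copy first is the safer route.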

Editing Alignments • http://www.jalview.org • Start the applet • Choose File – Input Alignment – from Textbox • Copy and paste the ClustalW alignment

Playtime • Be creative • Explore the functions • To save your work you need to install the program locally • Java applets are not allowed to save to your computer

End Practice II • 15 min break

Theopractical Part • Gene Structure • Profiles • Sequence Logos

Gene Structure (a) Genes of multicellular organisms contain both promoter-proximal elements and enhancers as well as a TATA box or other promoter element. The latter positions RNA polymerase II to initiate transcription at the start site and influences the rate of transcription. Enhancers may be either upstream or downstream and as far away as 50 kb from the transcription start site. In some cases, promoter-proximal elements occur downstream from the start site as well. (b) Most yeast genes contain only one regulatory region, called an upstream activating sequence (UAS), and a TATA box, which is ≈90 base pairs upstream from the start site.

Profiles

Logos

More Complex Logo

Logo • http://blocks.fhcrc.org/blocks/process_blocks.html • Retrieve the FASTA sequence of your alignment • Paste it to the box above and create blocks
