Pairwise sequence alignment Unit 7

Pairwise sequence alignment, Unit 7. BIOL221T: Advanced Bioinformatics for Biotechnology. Irene Gabashvili, PhD

Reminders From Last Lecture:
• Primary (archival) and secondary (curated) databases
• RefSeq accession prefixes: NT_123456 = genomic contig; NM_123456 = mRNA; NP_123456 = protein; XM_123456 = model mRNA; XP_123456 = model protein
• Sequence polymorphisms: see also chapter 7 of Baxevanis & Ouellette

Chapter 11: ALIGNMENT = matching = positioning to maximize similarity

ALIGNMENT

ALIGNMENT
ACTGTGGGAACCTTTGCACCGAAAC
ACTGTGGGAACCTATGCACCGAAAC
Similar alignments are used for translation, for comparing text documents, and for any strings (linear sequences of symbols: characters, words or phrases).
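
The simplest case, an ungapped position-by-position comparison of the two sequences on this slide, can be sketched in Python as follows. The helper name mismatch_positions is illustrative and not part of the lecture material.

```python
def mismatch_positions(a, b):
    """Return the 1-based positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("an ungapped comparison needs equal-length strings")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


seq1 = "ACTGTGGGAACCTTTGCACCGAAAC"
seq2 = "ACTGTGGGAACCTATGCACCGAAAC"

diffs = mismatch_positions(seq1, seq2)
identity = 1 - len(diffs) / len(seq1)
print(diffs)               # [14]
print(f"{identity:.0%}")   # 96%
```

This only works when no insertions or deletions are needed; the algorithms later in the unit handle gaps.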

Text alignment
‘alignment’ in UNIX:
• cmp f1 f2: compare two files
• diff f1 f2: list file differences
‘alignment’ in MS Word:
• Review > Compare > Show changes at the character or word level
• Spell checker

Sequence Alignment Motivation:
• Storing, retrieving and comparing DNA sequences in databases
• Comparing two or more sequences for similarities
• Searching databases for related sequences and subsequences
• Exploring frequently occurring patterns of nucleotides
• Finding informative elements in protein and DNA sequences
• Various experimental applications (reconstruction of DNA, etc.)

The dotplot A simple picture that gives an overview of the similarities between two sequences
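
A text-only dotplot can be sketched in a few lines of Python. This is a minimal illustration that marks every identical residue pair, without the window or threshold filtering that plotting tools usually apply; the function name and the two short sequences are made up for the example.

```python
def dotplot(seq_a, seq_b):
    """Return one text row per residue of seq_a: '*' where it equals the
    residue of seq_b in that column, '.' otherwise."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


seq_a = "GATTACA"   # rows
seq_b = "GATCACA"   # columns
rows = dotplot(seq_a, seq_b)

print("  " + seq_b)
for residue, row in zip(seq_a, rows):
    print(residue + " " + row)
```

Runs of '*' along a diagonal correspond to regions where the two sequences match without gaps; a break in the main diagonal marks a substitution.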

The dotplot

ALIGNMENT software
• BLAST (Basic Local Alignment Search Tool): http://www.ncbi.nlm.nih.gov/BLAST/
• ClustalW: http://www.ebi.ac.uk/clustalw/
• BLAT: http://genome.cse.ucsc.edu/
• SIM: http://www.expasy.ch/tools/sim-prot.html
• Pfam: http://pfam.wustl.edu/
• String: http://string.embl.de/
• ALION: http://motif.stanford.edu/alion/
• FASTA (FAST-All): ftp://ftp.virginia.edu/pub/fasta

ALIGNMENT software
• ClustalW: http://www.ebi.ac.uk/clustalw/
Q: What type of sequences can ClustalW align?
A: Either nucleotide or protein sequences
Q: How many sequences can it align at a time?
A: Many (input limit 10485760 bytes, i.e. 10 MiB)

ClustalW: sequence formats • ALN/ClustalW format • AMPS Block file format • ClustalW • Codata • EMBL • GCG/MSF • GDE • GenBank • Fasta (Pearson) • NBRF/PIR • PDB format • Pfam/Stockholm format • Phylip • Raw • RSF • UniProtKB/Swiss-Prot

ALIGNMENT ALGORITHMS • Smith-Waterman ACTGTCTATAACCTTTGCGGCCAAAC ACTGTCTATACCTAT GCGGCGAAAC ACTGTGGGAACCTATGCGGCGAAAC • Needleman-Wunsch

Needleman-Wunsch Algorithm • General algorithm for sequence comparison • Maximise a similarity score, to give ‘maximum match’ • Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions • Finds the best GLOBAL alignment of any two sequences

Needleman-Wunsch Algorithm • N-W involves an iterative matrix method of calculation • All possible pairs of residues (bases or amino acids) - one from each sequence - are represented in a 2-dimensional array • All possible alignments (comparisons) are represented by pathways through this array

Needleman-Wunsch Algorithm
• Three main steps
1. Assign similarity values
2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway
3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment

Needleman-Wunsch Algorithm (cont.): Similarity values
• A numerical value is assigned to every cell in the array, depending on the similarity/dissimilarity of the two residues
• These may be simple scores or more complicated ones, e.g. related to chemical similarities or to the frequency of observed substitutions
• The example shown has match = +1, mismatch = 0

Needleman-Wunsch Algorithm (cont.): Score pathways through array
• For each cell, we want to know the maximum possible score for an alignment ending at that point
• The algorithm searches the subrow and subcolumn, as shown, for the highest score and adds this to the score for the current cell
• It proceeds row by row through the array
• A gap penalty is charged for the introduction of gaps in the alignment (presumed insertions or deletions in one sequence); here it is 0
$$H_{i,j} = \max\Big\{ H_{i-1,j-1} + s(a_i,b_j),\ \max_{k}\big\{ H_{i-k,j-1} - W_k + s(a_i,b_j) \big\},\ \max_{l}\big\{ H_{i-1,j-l} - W_l + s(a_i,b_j) \big\} \Big\}$$
where $s(a_i,b_j)$ is the similarity value of residues $a_i$ and $b_j$, and $W_k$, $W_l$ are the gap penalties.

Needleman-Wunsch Algorithm (cont.): Construct alignment
• The alignment score is cumulative: it is obtained by adding along a path through the array
• The best alignment has the highest score, i.e. the maximum match
• Maximum match = the largest of the sums of cell values over all pathways
• The maximum match will ALWAYS be somewhere in the outer row or column shown
• The alignment is constructed by working backwards from the maximum match
MP-RCLCQR-JNCBA
 | || | | | | |
-PBRCKC-RNJ-CJA
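
The three steps can be sketched in Python as below. This is a simplified illustration, not the lecture's exact formulation: it uses the common three-neighbour recurrence with a linear per-residue gap penalty instead of searching the whole subrow and subcolumn, and it traces back from the corner cell. With the slide's scoring (match = +1, mismatch = 0, gap = 0) the corner cell holds the maximum match. The function name and parameter names are assumptions made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment with a linear gap penalty (a simplification).

    Returns (score, aligned_a, aligned_b).
    """
    def s(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # H[i][j] = best score for aligning a[:i] with b[:j]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = H[i - 1][0] - gap
    for j in range(1, m + 1):
        H[0][j] = H[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # align the two residues
                H[i - 1][j] - gap,                        # a[i-1] opposite a gap
                H[i][j - 1] - gap,                        # b[j-1] opposite a gap
            )

    # Trace back from the bottom-right corner to rebuild one best pathway.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and H[i][j] == H[i - 1][j] - gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return H[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(score)   # 8, the number of matched residues in the slide's alignment
print(top)
print(bottom)
```

Several alignments tie at the maximum match, so the printed pathway may differ from the one on the slide even though the score is the same.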

Needleman-Wunsch Algorithm: Statistical Significance • Maximum match is a function of sequence relationship and composition • We would like to know the probability of obtaining the result (maximum match) from a pair of random sequences

Needleman-Wunsch Algorithm • Estimate this experimentally • Form pairs of random sequences by randomly drawing one member from each of two sets of randomized sequences (i.e. sequences that have the same composition as the real proteins) • If the value found for the real proteins is significantly different from that found for the random pairs, then the difference is a function of the sequences themselves and not of their composition
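
One way to run such an experiment is a shuffle test, sketched below in Python. It assumes the slide's simple scoring (match = +1, mismatch = 0, gap = 0), generates the random sequences by shuffling each real sequence so that composition is preserved, and summarises the result as a z-score; the function names, the number of trials and the z-score summary are choices made for this illustration, not something the lecture prescribes.

```python
import random
import statistics


def max_match(a, b):
    """Maximum match with match = +1, mismatch = 0, gap = 0 (score only)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(max(prev[j - 1] + (x == y), prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Return a random permutation of seq (same composition, random order)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def shuffle_test(a, b, trials=200, seed=0):
    """Compare the real maximum match with scores of shuffled pairs."""
    rng = random.Random(seed)
    real = max_match(a, b)
    null = [max_match(shuffled(a, rng), shuffled(b, rng)) for _ in range(trials)]
    mean = statistics.mean(null)
    sd = statistics.pstdev(null)
    z = (real - mean) / sd if sd > 0 else None
    return real, mean, sd, z


real, mean, sd, z = shuffle_test("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(real, round(mean, 2), round(sd, 2), z)
```

A real score far above the mean of the shuffled pairs suggests that the similarity reflects residue order rather than composition alone.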

Smith-Waterman Algorithm • Instead of looking at each sequence in its entirety this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximise the similarity measure • For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions

Smith-Waterman Algorithm (cont.)
• Only works effectively when gap penalties are used
• In the example shown: match = +1, mismatch = -1/3, gap = -(1 + k/3), where k = extent of gap
• Start with all cell values = 0
• Looks in the subcolumn and subrow shown, and in the direct diagonal, for the score that is highest once the alignment score or gap penalty is taken into account

Smith-Waterman Algorithm (cont.)
• Four possible ways of forming a path. For every residue in the query sequence:
1. Align with the next residue of the db sequence: score is the previous score plus the similarity score for the two residues
2. Deletion (i.e. match a residue of the query with a gap): score is the previous score minus a gap penalty dependent on the size of the gap
3. Insertion (i.e. match a residue of the db sequence with a gap): score is the previous score minus a gap penalty dependent on the size of the gap
4. Stop: score is zero
• Choose whichever of these is the highest

Smith-Waterman Algorithm (cont.): Construct Alignment
• The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates
• Trace the pathway back from the highest scoring cell; this cell can be anywhere in the array
• Align the highest scoring segment
GCC-UCG
GCCAUUG
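
The four path choices and the traceback can be sketched in Python as follows. The sketch assumes the example scoring of match = +1, mismatch = -1/3 and a penalty of 1 + k/3 for a gap of length k, and it uses exact fractions so that ties are compared reliably. It searches every gap length for every cell, which is fine for short sequences but slow for long ones. Function and parameter names are illustrative.

```python
from fractions import Fraction


def smith_waterman(a, b, match=Fraction(1), mismatch=Fraction(-1, 3),
                   gap=lambda k: 1 + Fraction(k, 3)):
    """Local alignment with a length-dependent gap penalty gap(k).

    Returns (best_score, aligned_a, aligned_b) for one best local alignment.
    """
    n, m = len(a), len(b)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    best, best_pos = Fraction(0), (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [Fraction(0), H[i - 1][j - 1] + s]                 # stop, align
            candidates += [H[i - k][j] - gap(k) for k in range(1, i + 1)]   # gap in b
            candidates += [H[i][j - l] - gap(l) for l in range(1, j + 1)]   # gap in a
            H[i][j] = max(candidates)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
            continue
        k = next((k for k in range(1, i + 1) if H[i][j] == H[i - k][j] - gap(k)), None)
        if k is not None:
            out_a.extend(reversed(a[i - k:i]))
            out_b.extend("-" * k)
            i -= k
            continue
        l = next(l for l in range(1, j + 1) if H[i][j] == H[i][j - l] - gap(l))
        out_a.extend("-" * l)
        out_b.extend(reversed(b[j - l:j]))
        j -= l
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("GCCUCG", "GCCAUUG")
print(score)    # 10/3
print(top)      # GCC-UCG
print(bottom)   # GCCAUUG
```

The score of 10/3 is five matches (+5), one mismatch (-1/3) and one gap of length 1 (-4/3).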

Differences
Needleman-Wunsch
1. Global alignments
2. Requires the alignment score for a pair of residues to be >= 0
3. No gap penalty required
4. Score cannot decrease between two cells of a pathway
Smith-Waterman
1. Local alignments
2. Residue alignment score may be positive or negative
3. Requires a gap penalty to work effectively
4. Score can increase, decrease or stay level between two cells of a pathway

Computer Time and Space Requirements

Substitution matrix • PAM (MDM/Dayhoff): Point Accepted Mutation • BLOSUM: BLOcks SUbstitution Matrix • BLOSUM62 is the default matrix for protein searches in BLAST

What is BLAT & why we need it
There exist many alignment tools:
- Smith-Waterman algorithm: solves the alignment problem for two short sequences
- FASTA, NCBI BLAST, MegaBLAST, WU-BLAST: provide flexible and fast alignment against a large database
- Sim4: does a fine job with cDNA alignment
- SAM, PSI-BLAST: slowly but surely find remote homology

BLAT
The process of assembling and annotating the human genome:
- requires aligning three million ESTs and 13 million mouse whole-genome random reads against the human genome
- this needs to be done in less than two weeks, in order to have time to process an updated genome every month or two
==> we need a very high speed alignment algorithm, so Jim Kent developed BLAT, the BLAST-Like Alignment Tool

BLAT
- BLAT (compared with existing tools) is:
- more accurate
- 500 times faster in mRNA/DNA alignment
- 50 times faster in protein/protein alignment
- BLAT's steps:
1. use nonoverlapping k-mers to create an index
2. use the index to find homologous regions
3. align these regions separately
4. stitch the aligned regions into larger alignments
5. revisit small internal exons possibly missed in the first stage, and adjust large gap boundaries that have canonical splice sites where feasible

BLAT
- BLAT's speed & sensitivity are decided by:
1. k-mer size (finding hits step)
2. mismatch scheme (aligning step)
3. number of required index matches (finding hits step)
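
The indexing and hit-finding idea can be illustrated with a short Python sketch. It is a toy model under simplifying assumptions (exact matches only, a single sequence as the database, made-up sequences and function names) and is not BLAT's actual code: the database is indexed by nonoverlapping k-mers, the query is scanned with overlapping k-mers, and hits are grouped by diagonal so that a required number of index matches could be enforced.

```python
from collections import Counter, defaultdict


def build_index(database, k):
    """Map each nonoverlapping k-mer of the database to its start offsets."""
    index = defaultdict(list)
    for start in range(0, len(database) - k + 1, k):
        index[database[start:start + k]].append(start)
    return index


def find_hits(query, index, k):
    """Look up every overlapping k-mer of the query.

    Returns a list of (query_position, database_position) pairs.
    """
    hits = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            hits.append((q, d))
    return hits


database = "TTGACCGTAAGCTTGGCATCGA"
query = "CGTAAGCTTGG"

index = build_index(database, k=4)
hits = find_hits(query, index, k=4)
diagonals = Counter(d - q for q, d in hits)

print(hits)        # [(3, 8), (7, 12)]
print(diagonals)   # Counter({5: 2})
```

A smaller k yields more hits (higher sensitivity, more work), while requiring two or more hits on the same diagonal before extending reduces spurious matches.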

BLAT's similarities to and differences from BLAST
Similarities:
- both scan for relatively short matches (hits), i.e. build an index, then find hits
- both extend hits into high-scoring pairs (HSPs)

BLAT
Differences:
- BLAST builds an index of the query sequence, but BLAT builds an index of the database
- BLAST scans linearly through the database, but BLAT scans linearly through the query sequence
- BLAST triggers an extension when one or two hits occur in proximity to each other, but BLAT can trigger extensions on any number of perfect or near-perfect hits

BLAT
Differences:
- BLAST returns each area of homology between two sequences separately, but BLAT stitches them together into a larger alignment
- BLAT has special code to handle introns in RNA/DNA alignments, i.e. BLAT unsplices mRNA onto the genome

BLAT's application forms
Server-client:
- building the index is a relatively slow procedure
- a BLAT server keeps the index in memory for clients to query ==> good for interactive applications
Stand-alone:
- suitable for batch runs on one or more CPUs

BLAT's 3 major applications & evaluation
- mRNA/DNA alignment
- mouse/human translated alignment
- client/server version to power interactive searches

BLAT: Evaluating mRNA/DNA alignments (compared with Sim4)
- test set: 713 mRNAs remapped to genes on chromosome 22
- speed: BLAT 26 sec; Sim4 5 hr
- sensitivity (agreement with the annotated bases): BLAT 99.99%; Sim4 99.96%

BLAT: server/client to power interactive searches
- thousands of interactive sequence searches per day
- the index is built just once and kept in memory for queries ==> efficient
- but not as efficient as the stand-alone version, because to save memory the server keeps only the index, not the database

BLAT
- BLAT is a very effective tool for nucleotide alignments between mRNA and DNA of the same species
- it is more accurate and faster than Sim4
- BLAT's strategy for nucleotide alignments becomes less effective below 90% sequence identity, but it can efficiently handle sequence divergence introduced by sequencing error
- twilight zone: 20-35% sequence identity

BLAT: search stage
- BLAT indexes the database rather than the query sequence, so it only scans the short query sequence
- The program “SSAHA” also indexes the database, and it is an extremely effective tool for aligning genomic regions from the same organism against each other
- but “SSAHA” does not implement “unsplicing” and always uses a single perfect match as a seed; BLAT is more flexible in this respect

BIOLOGY: Homology • Orthologs: similar genes in two different species that originated from a common ancestor • Paralogs: genes that arise when a gene in an organism is duplicated to occupy two different positions in the same genome • Homolog: a gene related to a second gene by descent from a common ancestral DNA sequence; can be either an ortholog or a paralog

Sequence Alignment in Matlab. Pairwise sequence alignment: standard algorithms such as Needleman-Wunsch (nwalign) and Smith-Waterman (swalign). Standard scoring matrices such as the PAM and BLOSUM families of matrices (blosum, dayhoff, gonnet, nuc44, pam). Visualize sequence similarities with seqdotplot and sequence alignment results with showalignment.

Sequence Alignment in Matlab. Multiple sequence alignment: functions for multiple sequence alignment (multialign, profalign) and functions that support multiple sequences (multialignread, fastaread, showalignment). There is also a graphical interface (multialignviewer) for viewing the results of a multiple sequence alignment and making manual adjustments.

Sequence Alignment in Matlab. Multiple sequence profiles: multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof). Other useful biological codes: aminolookup, baselookup, geneticcode, revgeneticcode.
