MapViewer

LocusLink Gene database UniGene Homologene Basic Local Alignment Search Tool Genome Resources and Sequence Similarity MapViewer

Topics • Why use sequence similarity? • BLAST algorithm • blastn, blastp, megablast • BLAST statistics • BLAST output • Examples

: NCBI’s tool Why Do We Need Sequence Similarity Searching? • To identify and annotate sequences • To evaluate evolutionary relationships • Other: • model genomic structure (e.g., Spidey) • check primer specificity in silico

Seq 1 Seq 1 Seq 2 Seq 2 Global alignment Local alignment Global vs Local Alignment

Global Seq1: 1 W--HEREISWALTERNOW 16 W HERE Seq2: 1 HEWASHEREBUTNOWISHERE 21 Local Seq1: 1 W--HERE 5 Seq1: 1 W--HERE 5 W HERE W HERE Seq2: 3 WASHERE 9 Seq2: 15 WISHERE 21 Global vs Local Alignment Seq1: WHEREISWALTERNOW (16aa) Seq2: HEWASHEREBUTNOWISHERE (21aa)

Global programming algorithm

Global Dynamic Programming • Full sequence must be aligned • Gaps at ends are penalized as much as internal ones • F(n,m) is the best score for alignment • Traceback can give >1 correct alignment • Used to examine closely related sequences • http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html

Local Alignment – Smith-Waterman

Local alignments - How • Notice the top row and left column are now filled with 0 (if the best alignment has a negative score, it’s better to start a new one) • The alignment can end anywhere in the matrix • Instead of starting at F (n, m), start traceback at highest value of F (i, j); the traceback ends when you hit a 0

Heuristic alignment algorithms • Shortcuts are important • Searching a sequence length of 1000 against a database with 108 residues requires approximately 1011 matrix cells. At ten million matrix cells a second, it would take about 3 hours. • BLAST – the heuristic is based on that true match alignments are very likely to contain somewhere within them a sort stretch of identities. Look for short stretches to serve as seeds to extend.

Seeding • BLAST takes your query and breaks it down into words of fixed length (3 for protein, 11 for nucleotide) • It scans through a database looking for a word from the query set with some minimum score T, when it finds it, it begins a “hit” extension to extend the possible match in both directions, stopping at the maximum scoring extension.

Extension • The seeds are extended to locally optimal pairs, whose scores cannot be improved by extension or trimming. • These locally optimal alignments are called high scoring segment pairs or HSP’s • Sometimes you return only a portion of a sequence – this is the reason you need to look carefully at your BLAST alignments

Alignment example • The quick brown fox jumps over the lazy dog. • The quiet brown cat purrs when she sees him. • Matches = +1; Mismatches = -1; ignore spaces and do not allow gaps. • Assume the seed is the capital T, extend the alignment • You’ll hit a mismatch c/e should you continue and how far? • Generate a variable X to measure how far the score drops off. • Set X = 5 and try the alignment… • Set X = 2 and try again … • A large X value will increase the speed, however, speed is often modulated by word size and other parameters…

Gapped BLAST – a time saver • Extension is costly, now have a two hit (gapped) BLAST where you require two hits within a distance (A) • A gapped extension takes much longer to execute than ungapped, but overall run fewer extensions – time saver • Gapped BLAST requires two non-overlapping hits of at least score (T) within distance A of one another before ungapped extension of second hit • T is adjustable, higher the T then the smaller the search space

Evaluation • Once seeds are extended to generate alignments, these alignments are tested for statistical significance. • Can establish thresholds for reporting

The Flavors of BLAST • Standard BLAST • traditional “contiguous” word hit • position independent scoring • nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx) • Megablast • optimized for large batch searches • can use discontiguous words • PSI-BLAST • constructs PSSMs automatically; uses as query • very sensitive protein search • RPS BLAST • searches a database of PSSMs • tool for conserved domain searches

BLASTN variations • BLASTN seeds are always identical words; T is never used • To make BLASTN faster, increase word size, to make it more sensitive decrease word size • MegaBLAST increases word size to 28 • The minimum word size is 7 • http://monod.uwaterloo.ca/papers/02ph.pdf

BLASTP implementation • To make searches faster, set word size to 3 and T to a large value (999), which removes all potential neighborhood words (two-hit distance is 40 amino acids by default) • Affine gaps • Decreased penalty for gap extension relative to gap introduction

Also, FASTA • Similar to Gapped BLAST – except bigger neighborhood • Generates a lookup table to locate all identically matching words of length ktup protein 1-2, DNA 4-6 • Once identified, looks for diagonals with many mutually supporting word matches • Extensions similar to BLAST

Scoring Matrices • Scoring matrix specifies a score, sij, for aligning sequence Iwith sequence II. • Choice of matrix depends on the divergence level of desired/expected hits. • Examples: PAM, BLOSUM • Both can be modified for different divergence levels (eg, BLOSUM40, BLOSUM62) • Advice: try several matrices when possible.

Dayhoff Family of Matrices • Dayhoff model measures sequence evolution in units of “PAMs” • One PAM unit represents the evolutionary distance in which 1% of the amino acids have changed. • Mutability of an aa is its relative rate of change (amino acids with high mutabilities are more likely to change) • Mutability of alanine was defined to be 100.

Dayhoff Family of Matrices • Problems with the original Dayhoff scheme • It does not consider the genetic code. • Not all amino acid substitutions can occur by a single nucleotide substitution event. • Parameters were estimated from a small sample of closely related proteins. • Evolution at the “average site” of the “average protein” is being modeled.

BLOSUM Scoring Matrices Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies ofsubstitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff)

Basic Local Alignment Search Tool • Widely used similarity search tool • Heuristic approach based on Smith Waterman algorithm • Finds best local alignments • Provides statistical significance • All combinations (DNA/Protein) query and database. • DNA vs DNA • DNA translation vs Protein • Protein vs Protein • Protein vs DNA translation • DNA translation vs DNA translation • www, standalone, and network clients

X dropoff (X1) X dropoff (X2) X dropoff (X3) How BLAST Works • Make lookup table of “words” for query • Scan database for hits • Ungapped extensions of hits (initial HSPs) • Gapped extensions (no traceback) • Gapped extensions (traceback; alignment details)

GTACTGGACATGGACCCTACAGGAA Query: 11-mer WORD SIZE default minimum blastn 11 7 megablast 28 8 Nucleotide Words GTACTGGACAT TACTGGACATG ACTGGACATGG CTGGACATGGA TGGACATGGAC GGACATGGACC GACATGGACCC ACATGGACCCT Make a lookup table of words . . .

GTQITVEDLFYNIATRRKALKN Query: Word size = 3 (default) Word size can only be 2 or 3 Neighborhood Words LTV, MTV, ISV, LSV, etc. Protein Words GTQ TQI QIT ITV TVE VED EDL DLF ... Make a lookup table of words [ -f 11 = blastp default ]

Minimum Requirements for a Hit ATCGCCATGCTTAATTGGGCTT CATGCTTAATT one exact match • Nucleotide BLAST requires one exact match • Protein BLAST requires two neighboring matches within 40 aa GTQITVEDLFYNI SEI YYN neighborhood words [ -A 40 = blastp default ] two matches

example query words Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV… HFL 18 HFV 15 HFS 14 HWL 13 NFL 13 DFL 12 HWV 10 etc … YLS 15 YLT 12 YVS 12 YIT 10 etc … Neighborhood words Neighborhood score threshold T (-f) =11 Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47 YLSHFL Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333 +E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333 Gapped extension with trace back Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50 +E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I + + Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337 Final HSP BLASTP Summary High-scoring pair (HSP)

Scoring Systems - Nucleotides Identity matrix A G C T A +1 –3 –3 -3 G –3 +1 –3 -3 C –3 –3 +1 -3 T –3 –3 –3 +1 [ -r 1 -q -3 ] CAGGTAGCAAGCTTGCATGTCA || |||||||||||| ||||| raw score = 19-9 = 10 CACGTAGCAAGCTTG-GTGTCA

Scoring Systems - Proteins • Position Independent Matrices • PAM Matrices (Percent Accepted Mutation) • Derived from observation; small dataset of alignments • Implicit model of evolution • All calculated from PAM1 • PAM250 widely used • BLOSUM Matrices (BLOck SUbstitution Matrices) • Derived from observation; large dataset of highly conserved blocks • Each matrix derived separately from blocks with a defined percent identity cutoff • BLOSUM62 - default matrix for BLAST • Position Specific Score Matrices (PSSMs) • PSI- and RPS-BLAST

Common amino acids have low weights Rare amino acids have high weights F Negative for less likely substitutions Y Positive for more likely substitutions D D F BLOSUM62 A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 A R N D C Q E G H I L K M F P S T W Y V X

PSSM scores 1 5 7 4 4 Position-Specific Score Matrix Serine/Threonine protein kinases catalytic loop DAF-1

Position-Specific Score Matrix A R N D C Q E G H I L K M F P S T W Y V 435 K -1 0 0 -1 -2 3 0 3 0 -2 -2 1 -1 -1 -1 -1 -1 -1 -1 -2 436 E 0 1 0 2 -1 0 2 -1 0 -1 -1 0 0 0 -1 0 0 -1 -1 -1 437 S 0 0 -1 0 1 1 0 1 1 0 -1 0 0 0 2 0 -1 -1 0 -1 438 N -1 0 -1 -1 1 0 -1 3 3 -1 -1 1 -1 0 0 -1 -1 1 1 -1 439 K -2 1 1 -1 -2 0 -1 -2 -2 -1 -2 5 1 -2 -2 -1 -1 -2 -2 -1 440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1 0 -3 7 -1 -2 -3 -1 -1 441 A 3 -2 1 -2 0 -1 0 1 -2 -2 -2 0 -1 -2 3 1 0 -3 -3 0 442 M -3 -4 -4 -4 -3 -4 -4 -5 -4 7 0 -4 1 0 -4 -4 -2 -4 -1 2 443 A 4 -4 -4 -4 0 -4 -4 -3 -4 4 -1 -4 -2 -3 -4 -1 -2 -4 -3 4 444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5 0 -5 445 R -4 8 -3 -4 0 -1 -2 -3 -2 -5 -4 0 -3 -2 -4 -3 -3 0 -4 -5 446 D -4 -4 -1 8 -6 -2 0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5 447 I -4 -5 -6 -6 -3 -4 -5 -6 -5 3 5 -5 1 1 -5 -5 -3 -4 -3 1 448 K 0 0 1 -3 -5 -1 -1 -3 -3 -5 -5 7 -4 -5 -3 -1 -2 -5 -4 -4 449 S 0 -3 -2 -3 0 -2 -2 -3 -3 -4 -4 -2 -4 -5 2 6 2 -5 -4 -4 450 K 0 3 0 1 -5 0 0 -4 -1 -4 -3 4 -3 -2 2 1 -1 -5 -4 -4 451 N -4 -3 8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5 452 I -3 -5 -5 -6 0 -5 -5 -6 -5 6 2 -5 2 -2 -5 -4 -3 -5 -3 3 453 M -4 -4 -6 -6 -3 -4 -5 -6 -5 0 6 -5 1 0 -5 -4 -3 -4 -3 0 454 V -3 -3 -5 -6 -3 -4 -5 -6 -5 3 3 -4 2 -2 -5 -4 -3 -5 -3 5 455 K -2 1 1 4 -5 0 -1 -2 1 -4 -2 4 -3 -2 -3 0 -1 -5 -2 -3 456 N 1 1 3 0 -4 -1 1 0 -3 -4 -4 3 -2 -5 -2 2 -2 -5 -4 -4 457 D -3 -2 5 5 -1 -1 1 -1 0 -5 -4 0 -2 -5 -1 0 -2 -6 -4 -5 458 L -3 -1 0 -3 0 -3 -2 3 -4 -2 3 0 1 1 -2 -2 -3 5 -1 -3 catalytic loop [ >./blastpgp -i NP_499868.2 -d nr -j 3 -Q NP_499868.pssm ]

E = Kmne-S or E = mn2-S’ K = scale for search space  = scale for scoring system S’ = bitscore = (S - lnK)/ln2 (applies to ungapped alignments) Local Alignment Statistics High scores of local alignments between two random sequences follow the Extreme Value Distribution Expect Value E = number of database hits you expect to find by chance, ≥ S your score Alignments expected number of random hits Score (S) More info:www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Gapped Alignments • Gapping provides more biologically realistic alignments • Gapped BLAST parameters are simulated for each scoring matrix • Affine gap costs = -(a+bk) • a = gap open penalty b = gap extend penalty • A gap of length 1 receives the score -(a+b)

An Alignment BLAST Cannot Make 1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT 121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC Reason: no contiguous exact match of 7 bp.

Score = 290 bits (741), Expect = 7e-77Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)Frame = +3 BLAST 2 Sequences (blastx) output: An Alignment BLAST Can Make Solution: compare protein sequences; BLASTX

Other BLAST Algorithms • Megablast • Discontiguous Megablast • PSI-BLAST

Megablast: NCBI’s Genome Annotator • Long alignments of similar DNA sequences • Greedy algorithm • Concatenation of query sequences • Faster than blastn; less sensitive

Discontiguous Megablast • Uses discontiguous word matches • Better for cross-species comparisons

Discontiguous (Cross-species) MegaBLAST

Discontiguous Word Options

Templates for Discontiguous Words W = 11, t = 16, coding: 1101101101101101 W = 11, t = 16, non-coding: 1110010110110111 W = 12, t = 16, coding: 1111101101101101 W = 12, t = 16, non-coding: 1110110110110111 W = 11, t = 18, coding: 101101100101101101 W = 11, t = 18, non-coding: 111010010110010111 W = 12, t = 18, coding: 101101101101101101 W = 12, t = 18, non-coding: 111010110010110111 W = 11, t = 21, coding: 100101100101100101101 W = 11, t = 21, non-coding: 111010010100010010111 W = 12, t = 21, coding: 100101101101100101101 W = 12, t = 21, non-coding: 111010010110010010111 W = word size; # matches in template t = template length Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5

BLAST Databases: Nucleic Acid • nr (nt) • traditional GenBank divisions • NM_ and XM_ RefSeqs • dbest • EST division • htgs • HTG division • gss • GSS division • chromosome • NC_ RefSeqs • env_nr • environmental sample[filter] • e.g., 16S rRNA

BLAST Databases: Protein nr(non-redundant protein sequences) • GenBank CDS translations • NP_ RefSeqs • Outside databases PIR, Swiss-Prot, PRF • PDB (sequences from structures) env_nr (environmental sample[filter])

MapViewer

MapViewer

Presentation Transcript

Oracle Spatial and Mapviewer Problems From Real World Applications

EDIT MapViewer