450 likes | 571 Vues
This document discusses the newly discovered gene potentially holding keys to human brain evolution and its occurrence in other species. It introduces an index-based search for biological sequences that significantly enhances the efficiency of local alignment through tools like BLAST and BLAT. We explore the challenges posed by traditional alignment algorithms, outline an indexing approach to speed up searches across genomic databases, and detail strategies for optimizing sensitivity and specificity in sequence alignment.
E N D
Index-based search of single sequences Omkar Mate CS 374 Stanford University
Motivation A newly discovered gene … (may hold clues to evolution of human brain capacity ) • Occurrence in other species • Mutation • …
Sequence Alignment New Query Existing Genome Database ………………………… gattaccagattaccagattaccagattacca caggattacaggattacaggattacaggatta cattaggacattaggacattaggacattagga aaccattagaccattagaccattagaccatta ………………………… ………………………… gattaccagattaccagattaccagattacca caggattacaggattacaggattacaggatta cattaggacattaggacattaggacattagga aaccattagaccattagaccattagaccatta ………………………… gacatta Easy??? Think again………..
State of Biological Databases ~300 sequenced genomes
Alignment Problem • Assume we try Smith-Waterman: [running time = O(MN)] The entire genomic database 1011 Our new gene 104 Time and Space complexity: O(1015) Huge number! But we can do better …
Indexing-based Local Alignment BLAST- Basic Local Alignment Search Tool Main idea: • Construct a dictionary of all the words in the query • Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman! query DB hits
BLAST Step 1: Construct dictionary of query words (Query indexed by all words of size, k = 4) Query: AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… Index of query words
BLAST • Step 2 – Generate all the relatives of a word (Relative: a word with alignment score greater than a threshold, T) Query Word: ATGC, T: 50 Candidates Score Relatives! Update the index … (Query: ) AATGCCGATAGCATCG …
BLAST • Step 3: Searching • Search through database linearly, one word at a time • Initiate alignment with all occurrences of that word in query Database: ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC … ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC … Query: Index:
BLAST • Step 4 – Alignment Extension Once we find an alignment, extend to left and right with no gaps until alignment score falls below a certain threshold. ATGCCGATACGATCAGCTACGATCAG… ATGCCGATACGATCAGCTACGATCAG…
Sensitivity-Speed Tradeoff Longer words => Fewer alignments => Faster but Low chance of a match Shorter words => More alignments => High chance of a match but Slower
BLAT - The BLAST-Like Alignment Tool Similarities: • Rapid scans for relatively short matches (hits) • Extend these hits into high-scoring pairs (HSP) Differences: BLAST BLAT • Index of query sequence Index of database • Scan through database Scan through query sequence • Gaps not allowed in ext. Gaps allowed in extension • Returns smaller alignments Returns larger alignments
BLAT Strategies - I • Single perfect matches We do not allow any mismatch. Common intuition : fewer matches for longer words
Sensitivity and SpecificitySingle Perfect Nucleotide K-mer Matches as a Search Criterion
BLAT Strategies - II • Allow one mismatch Intuition: Higher number of matches for same word length => Better sensitivity (Caution: Keep k higher, else no. of matches will be huge)
Sensitivity and SpecificitySingle Near-Perfect Nucleotide K-mer Matches as a Search Criterion (one mismatch allowed)
BLAT Strategies - III Allow multiple perfect matches Two parameter: N: no. of matches, K: word size Practically: Same sensitivity, higher speed
Sensitivity and SpecificityMultiple Perfect Nucleotide K-mer Matches as a Search Criterion (2 and 3 perfect matches)
Seeded AlignmentA dominant paradigm for fast comparisons • Seed: A common pattern of positions used for efficient large scale comparison of genomic DNA …G A T T A C C A G A T T A C C A G A T T A … Seed: {0,1,2,3} => Comparison sequences: {G A T T} {A T T A} {T T A C} … Seed: {0,2,4,6} => Comparison sequences: {G T A C} {A T C A} {T A C G} …
Similarity Detection Sequence 1: Sequence 2: Seed: {0,2,4,5} A T C G A C T A T C G AC T A T C G A C T C T A G T C T C T A G TC T C T A G T C T Offset = 0 => Mismatch Offset = 1 => Match We can have multiple seeds (patterns)!
Seed Design A (hazy) problem definition Collection of ungapped genomic sequence similarities Parameters: length of seeds, resource limits ….. Algorithm A set of seeds that will give “optimum performance” (What are the parameters? How do you define optimum performance?)
Tasks in Seed Designing • Define a measure of goodness for a seed. Easy ! Sensitivity to interesting biosequence similarities. • Show how to evaluate goodness for a seed. Hard ! No efficient algorithm.
Terminology related to a Seed A seed, P = a set of ordered list of w positions i.e. P = {x1, x2, …, xw} w = weight of P = |P| s = span of P = xw – x1 + 1 Ex: P = {0, 1, 4, 5} w = 4 s = 5 – 0 + 1 = 6
Computational Cost Seed weight w f Computational Cost No. of seeds n
Performance Measurement Optimum performance => Maximum sensitivity (i.e. detection probability) to the similarities S (Currently, Markov Models are used to measure these probabilities!)
Markov Models • A kth order Markov model M: Given k bits, predict (k+1)th bit 4th order Markov Model
(Exact) Problem Definition Inputs: • Number of seeds: n • Weight of each seed: w • Markov Model: M • Similarities: S Output: A set of seeds (ordered positions), P = {x11, …, x1w}, {x21,…,x2w},…,{xn1,…,pnw} that maximizes detection probability for S
Computing Detection Probabilities • Challenge: The probability of at least one match varies because the probabilities of matches at different offsets are not independent. Ex: Seed = {0,2,3,5} This similarity has 2 matches, at offsets 0 and 2, which share two of four positions in common
DFA to compute Detection Probability(Deterministic Finite Automaton) • Construct a DFA that accepts a string of 1’s and 0’s defined by the seed P. • Ex: P = {0,2} i.e. for a substring of length 3, we need a match in 1st and 3rd position. • Then the DFA should accept strings given by the regular expression: “(0+1)*1(0+1)1(0+1)*”
Dynamic Programming Algorithm To compute the detection probabilities recursively! Complexity analysis: • Size of DFA <= • Time to construct a DFA = • Time for each step = • No. of Steps = l • Total time complexity = This is faster by a factor of s2/wthan the best previous algorithm for detection probabilities
Remarks about the algorithm • Can be extended to work with a set of seeds • The DFA need not be minimal • Time complexity can be further reduced
Structure in Seed Space • Addressing the problem: When is one seed more sensitive than another? • Factors: • Parameters of the Markov model M • Similarity length : l • Smaller length: irregular behavior => We can generalize only for asymptotic cases
Asymptotic Result Let, El (P) = Event that P detects S at some offset Elc (P) = complementary event Then, A seed P is asymptotically worse than a seed P’, P < P’, if Liml Pr[ Elc (P) / Elc (P’) ] > 1 (P’ has more chances of detecting S, than P does!)
Mandala: Fast, Practical Seed Design • Seed selection: • No efficient algorithm to find optimum w, s given M (except Brute Force) • Applies local search method; global efficiency sacrificed • Training a Markov model: • Training set is adaptively selected to suit the intended application • Samples training set using LSH-ALL-PAIRS algorithm
Experimental Results - I • Avg. detection probabilities given by theoretical models for random seeds (w = 11) Solid line: M0 Dashed line: M5
Experimental Results - II • Detection probabilities for best seeds found by Mandala (k = 5) Solid: noncoding DNA model Dashed: coding DNA model
Directions for further research • Extend the model to evaluate seeds • Extend similarity models to distinguish between different classes of substitution • Construct models of multiple alignment: to compare 3 or more genomes at once
Timing - BLAT vs WU-TBLASTX • Dataset: 1000 Mouse Reads and a RepeatMasked Human Chromosome 22
Sensitivity – BLAT vs WU-TBLASTX • Dataset: 13 million Mouse Shotgun Reads and Human Chromosome 22
Sensitivity and SpecificitySingle Perfect Amino Acid K-mer Matches as a Search Criterion
Sensitivity and SpecificitySingle Near-Perfect Amino Acid K-mer Matches as a Search Criterion (one mismatch allowed)
Sensitivity and SpecificityMultiple Perfect Amino Acid K-mer Matches as a Search Criterion (2 and 3 perfect matches)
Mathematical Formula • To compute the probability that the DFA associated with a seed accepts a string randomly chosen from a Markov Model M S = similarity, l = length of S, k = order of M, delta = bit string of length k, q = a state, Phi(q) = set of states that transition to q on bit b t = no. of bits read, k’ = min {k, t}
Dynamic Programming • Initialize the recurrence: P (q0, 0, 0) = 1 • After l steps, return the sum over all k+1-mer bit strings delta.1 of P (qa, l , delta.1)