Procrastination
Presented by Aaron Darling
Coauthors: Todd J Treangen, Louxin Zhang, Carla Kuiken, Xavier Messeguer, Nicole T Perna
demorar sin razon / remettre (à plus tard) / aufschieben, zögern / Тянуть кота за хвост ("to procrastinate" in Spanish, French, German, and Russian; the Russian idiom is literally "to pull the cat by the tail")
A quick overview
Task: efficiently identify all repeating subsequences in a DNA sequence de novo
Key features of our approach:
• uses spaced seeds for "seed-and-extend" string matching
• identifies degenerate repeats with mismatches and gaps up to w nucleotides long in O(wN log wN) time
• palindromic seed patterns match both DNA strands simultaneously
• uses "procrastination" to efficiently prioritize seed extension
454 genome sequencer does 5 Mbp per hour
Imagine that a new genome sequence has magically appeared
• Task: interpret the genome
• Don't know anything about it
• Don't know genes
• Don't know repeat families
How can we annotate important features like repeat families?
• Use a database of known repeats (RepeatMasker/RepBase)
  • novel repeat elements may not be in the database
  • repetitive gene families are never in the database
• Identify repeats de novo using sequence analysis
De novo repeat detection
• One approach: self-search with a pairwise local-alignment tool such as BLAST
• Number of pairwise alignments grows as O(r^2) in the copy number r of the repeat
• Known problem for microbial genomes: IS elements
• Known problem for mammalian genomes: Alu repeats
There must be a better way…
• Yes! Local multiple alignment!
An example local multiple alignment:
AACAAGCA-A-ACTTTTATCCATGGTCGTGGTACAGAGGGGTC
AACAAGCA-A-ACTTTTGTCCATGGTCGTGGTACAGAGTGGTC
AACATGCAGA-ACTTTTATCCATGGTCGTCGTACAGAGGGGT-
AACAAGCAGACACTTTTATCCATGGTCGTGGTAC---------
AACAAGCA----CTTTTATCCATAGTCGTGGTA----------
------------CTTTTATCCATGGTCGTGGTACAGAGGGGTC
Instead of 25 pairwise alignments, we create a single local multiple alignment of the 5 repeat elements
Uses O(N) space for a genome of length N
Can a local multiple alignment be constructed efficiently?
The Eulerian path approach
Y Zhang and M S Waterman 2005. "An Eulerian path approach to local multiple alignment for DNA sequences." Proc Nat Acad Sci 102(5):1285-90
• An efficient algorithm
• Uses a de Bruijn graph to perform match filtration, i.e. restricts the search space of alignments to those likely to be part of a high-scoring alignment
• Limitation: exact k-mer matching reduces sensitivity in the presence of nucleotide substitution
A new approach
• Spaced seeds generally improve sensitivity (Ma et al. 2002, Buhler et al. 2003, Choi et al. 2004, Li et al. 2006)
• Add efficiency with palindromic spaced seed patterns: search both DNA strands at once
• A list of optimal palindromic seeds is available online
Step 1. Apply the seed pattern at each position
Sequence (positions 1 … i … N): ACAGCTAGCATGGCAA……GTTACCTAG………ACCACCTAG
Seed pattern: 1*1*1
At each position, store the lexicographically lesser (alphabetic order) of the forward seed and the reverse-complement seed:
• position 1: AAC vs GTT, store AAC
• position 2: CGT vs ACG, store ACG
Stored seeds, by position:
• 1 AAC, 2 ACG, 3 ACA, 4 CAC, 5 CAC, 6 TCA, 7 ACT, 8 CTC, 9 CAG, 10 AGC, 11 TCA, 12 GCA
• i+0 GAC, i+1 GTA, i+2 AGA, i+3 ACA, i+4 CAG
• N-8 ACC, N-7 CAC, N-6 AGG, N-5 ACA, N-4 CAG
(On the slide, reverse-complement seeds are highlighted in blue.)
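The seed extraction in Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`reverse_complement`, `apply_pattern`, `canonical_seeds`) are hypothetical, and only the first 16 nucleotides of the slide's sequence are used.

```python
# Minimal sketch of Step 1; all helper names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Return the reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def apply_pattern(window, pattern):
    """Keep only the bases under '1' positions of the seed pattern."""
    return "".join(base for base, p in zip(window, pattern) if p == "1")

def canonical_seeds(sequence, pattern="1*1*1"):
    """Yield (1-based position, seed), storing the lexicographically
    lesser of the forward seed and the reverse-complement seed."""
    assert pattern == pattern[::-1], "pattern must be palindromic"
    span = len(pattern)
    for start in range(len(sequence) - span + 1):
        window = sequence[start:start + span]
        forward = apply_pattern(window, pattern)
        reverse = apply_pattern(reverse_complement(window), pattern)
        yield start + 1, min(forward, reverse)

seeds = list(canonical_seeds("ACAGCTAGCATGGCAA"))
for position, seed in seeds:
    print(position, seed)
```

Because the pattern reads the same in both directions, the seed taken from the reverse-complemented window is exactly the reverse complement of the forward seed (AAC vs GTT at position 1), so one stored value per position covers both strands.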
Step 2. Hash seeds to identify matches
Stored seeds grouped by seed value (positions sharing a seed form a match):
• AAC: 1
• ACA: 3, i+3, N-5
• ACC: N-8
• ACG: 2
• ACT: 7
• AGA: i+2
• AGC: 10
• AGG: N-6
• CAC: 4, N-7, 5
• CAG: 9, i+4, N-4
• CTC: 8
• GCA: 12
• GAC: i+0
• GTA: i+1
• TCA: 6, 11
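Step 2 amounts to grouping positions by their stored seed. The sketch below assumes a plain Python dictionary as the hash table and a hypothetical function name `seed_matches`; the input is the list of seeds shown on the slide for positions 1 to 12 only.

```python
from collections import defaultdict

def seed_matches(positioned_seeds):
    """Group positions by seed; a seed seen at two or more positions is a match."""
    table = defaultdict(list)
    for position, seed in positioned_seeds:
        table[seed].append(position)
    # Seeds that occur only once cannot be part of a repeat.
    return {seed: pos for seed, pos in table.items() if len(pos) > 1}

# Stored seeds for positions 1..12, copied from the slide.
stored = [(1, "AAC"), (2, "ACG"), (3, "ACA"), (4, "CAC"), (5, "CAC"),
          (6, "TCA"), (7, "ACT"), (8, "CTC"), (9, "CAG"), (10, "AGC"),
          (11, "TCA"), (12, "GCA")]
matches = seed_matches(stored)
print(matches)
```

The number of positions in each group is the multiplicity of that seed match, which is what the procrastination queue later orders by.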
A simple example of match extension
[Figure: match components labeled 1, 2, and 3 with w-nucleotide neighborhoods, neighborhood lists, the procrastination queue, and match records (subsumed by; left subset; right subset; left superset; right superset)]
Step 1. Construct a procrastination queue: a priority queue that orders matches by their multiplicity
Step 2. Extend the first match in the procrastination queue
Step 3. Perform rightward extension
Step 3a. Construct a right-side neighborhood list: a list of match components within w nucleotides to the right
Step 3b. Identify chainable matches: matches with identical multiplicity and with each component adjacent (within w nucleotides)
Step 3c. Mark the chained matches as 'subsumed' and continue extension
Step 3d. Construct a right-side neighborhood list
Step 3e. No chainable matches exist, so look for subset matches: matches of lower multiplicity with neighboring components
Step 3f. Procrastinate: do not extend the subset match. Instead create a link between the subset match and the extended match
Step 3g. Attempt extension to the left. No matches exist to the left. Extension has completed.
Step 4. Select the next match from the procrastination queue. The next match is orange, and has been subsumed. No extension is necessary.
Step 5. Select the next match from the procrastination queue. The next match is gray. It has a superset link, so we perform a link extension.
Step 5a. For link extension, immediately extend the match to include all area spanned by the superset. Look for chainable subsets on the other side
Step 6. No chainable or subset matches exist. End extension. No more matches in the procrastination queue. End.
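The queue discipline in Steps 1, 2, 4, and 5 can be sketched with Python's `heapq`. Several things here are assumptions rather than details from the slides: the function name `extension_order` is hypothetical, matches are taken highest multiplicity first, the multiplicities (green 3, orange 3, gray 2) are inferred from the figure's component labels, and the set of subsumed matches is passed in up front, whereas the real algorithm discovers it during extension.

```python
import heapq

def extension_order(matches, subsumed=frozenset()):
    """matches: dict mapping match name -> multiplicity.
    Returns the names that actually get extended, in queue order."""
    # heapq is a min-heap, so negate multiplicity to pop the largest first.
    heap = [(-multiplicity, name) for name, multiplicity in matches.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        if name in subsumed:
            # Already absorbed by an earlier chained extension: skip it.
            continue
        order.append(name)
    return order

order = extension_order({"green": 3, "orange": 3, "gray": 2},
                        subsumed={"orange"})
print(order)
```

In this toy run the orange match is popped but skipped because it was chained into the green match, and the lower-multiplicity gray match is handled last, mirroring Steps 4 and 5.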
Why procrastination is good…
[Figure: match components labeled 1, 2, and 3, with match records]
If the multiplicity 2 gray match were extended first, we would create two left-side neighborhood lists during extension
Then, when extending the green match rightward, we would again create two neighborhood lists, covering the same regions as previously processed during the gray match extension!
Procrastination allows us to do less work and arrive at the same set of chained alignments.
Novel subset matches
[Figure: match components labeled 1, 2, and 3]
An attempt to extend the first match will find no chainable or subset matches
When extending the orange match to the left, we find the fully extended green match neighboring two components
We create a novel subset match (pink). We create subset/superset links between the novel subset and the other matches.
We procrastinate and extend the new match later, adding it to the queue
Later, when extending the gray match, it will extend to include the novel subset
Algorithm time complexity
• O(wN log wN) to process neighborhood lists
  • Key observation: each of the N nucleotides may be included in at most w neighborhood lists
  • Log factor stems from sort by key comparison
• O(wN log wN) to process link extensions
  • Novel subsets created only from pairs of fully extended matches, thus O(N) novel subsets
Total work O(wN log wN)
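The neighborhood-list bound can be written out as a short derivation. The symbol n_i (the size of the i-th neighborhood list) is introduced here for illustration and does not appear on the slides; the argument uses only the two facts stated above: each nucleotide appears in at most w lists, and each list is sorted by key comparison.

```latex
\sum_i n_i \le wN
\quad\Longrightarrow\quad
\sum_i n_i \log n_i
\;\le\; \Bigl(\sum_i n_i\Bigr) \log(wN)
\;\le\; wN \log(wN)
\;=\; O(wN \log wN)
```

The middle step holds because no single list can be larger than the total, so log n_i is at most log(wN) for every list.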
Human time complexity
• About two weeks of programming time
• Was it worth our time??
Comparison to Eulerian path approach
• Evaluate the ability to identify Alu repeat sequences in the human genome
• Use Alus identified by RepeatMasker
• Measure Sensitivity as the number of Alus contained by an extended match out of the total number of Alus
• Measure Specificity as the fraction of extended match components that hit Alus out of the total number of components
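The two measures can be sketched over genomic intervals. This is one possible reading, stated as an assumption: "contained by" is taken to mean the Alu lies entirely inside a match component, "hit" to mean any overlap, and intervals are half-open (start, end) pairs. The function names and the coordinates are made up for illustration.

```python
def contains(outer, inner):
    """True if interval `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def sensitivity(alus, components):
    """Fraction of annotated Alus contained by some extended match component."""
    found = sum(any(contains(c, a) for c in components) for a in alus)
    return found / len(alus)

def specificity(alus, components):
    """Fraction of extended match components that hit some annotated Alu."""
    hits = sum(any(overlaps(c, a) for a in alus) for c in components)
    return hits / len(components)

# Hypothetical coordinates, not data from the talk.
alus = [(100, 400), (1000, 1300), (5000, 5300), (9000, 9300)]
components = [(90, 410), (990, 1310), (5100, 5200), (7000, 7200)]
print(sensitivity(alus, components), specificity(alus, components))
```

With these toy intervals two of four Alus are fully covered and three of four components touch an Alu; the quadratic scan is fine for a sketch but a real evaluation would sort or index the intervals.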
Alu repeat results
[Table columns: w (max gap size), Compute time, Seed weight, Total sequence length, Alu copy number, Number of Alu families, Sensitivity, Specificity]
• For short sequences, Euler and procrastAligner have similar accuracy
• For long sequences, procrastAligner significantly improves sensitivity
Conclusions
• Using a seed-based filtration mechanism significantly improves sensitivity and specificity of local multiple alignment on large data sets
Future work:
• Couple the filtration method with banded dynamic programming to produce full local multiple alignments
• Apply an information content scoring criterion with a better model for multi-residue indels
Acknowledgements
• Todd Treangen (Univ. Politecnica de Catalunya)
• Nicole T. Perna (Univ. of Wisconsin-Madison)
• Bob Mau (Univ. of Wisconsin-Madison)
Coauthors
• Louxin Zhang (Natl. Univ. Singapore)
• Xavier Messeguer (Univ. Politecnica de Catalunya)
• Carla Kuiken (Los Alamos Natl. Lab)
The people who give us money:
• NLM 5T15LM007359-05
• Spanish Ministry MECD Grant TIN2004-03382
• AGAUR Training Grant FI-IQUC-2005
• AFT Grant 146-000-068-112
Extra slides
• The following are extra slides for answering specific questions
Tandem repeats and novel subsets
[Figure: match components labeled 1 to 7]
• During leftward extension of the black match, we discover the grey subset and link it.
• We also notice that some components of the black match are in each other's neighborhood.
• Thus, we label the black match as a tandem repeat and create a novel subset with one component for each repeating group: {1},{2,3,4},{5,6},{7}
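The grouping into repeating groups can be sketched as a single pass over the components in coordinate order. The assumptions here are loud ones: "in each other's neighborhood" is modeled as start positions at most w apart from the previous component, the function name `tandem_groups` is hypothetical, and the coordinates are invented so that they reproduce the grouping shown on the slide.

```python
def tandem_groups(positions, w):
    """positions: dict mapping component label -> start coordinate.
    Components within w nucleotides of the previous one share a group."""
    groups = []
    previous = None
    for label, pos in sorted(positions.items(), key=lambda item: item[1]):
        if previous is not None and pos - previous <= w:
            groups[-1].append(label)   # neighbor of the previous component
        else:
            groups.append([label])     # start a new repeating group
        previous = pos
    return groups

# Hypothetical coordinates chosen to mirror the slide's example.
positions = {1: 100, 2: 500, 3: 520, 4: 540, 5: 900, 6: 915, 7: 1300}
groups = tandem_groups(positions, w=30)
is_tandem = any(len(group) > 1 for group in groups)
print(groups, is_tandem)
```

The novel subset would then carry one component per group, so a seven-component tandem match yields a subset of multiplicity four in this example.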
A de Bruijn graph for 3-mers
Input sequence: ATGT……ATGC……CTGT
Step 1. Enumerate each 3-mer: ATG TGT ATG TGC CTG TGT
Step 2. Construct de Bruijn graph
[Figure: de Bruijn graph with nodes AT, TG, GT, GC, CT]
Step 3. Identify local multiple alignments as high-multiplicity paths in the graph
• Problem: exact k-mer matching reduces sensitivity in the presence of nucleotide substitution
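Steps 1 and 2 can be sketched as follows. This is an illustrative construction, not the Zhang and Waterman implementation: the function names are hypothetical, and the elided input ("……") is treated as three separate fragments so that no k-mer spans an elision.

```python
from collections import Counter

def kmers(fragment, k=3):
    """Enumerate every k-mer of a fragment, left to right."""
    return [fragment[i:i + k] for i in range(len(fragment) - k + 1)]

def de_bruijn_edges(fragments, k=3):
    """Each k-mer becomes an edge from its (k-1)-mer prefix to its
    (k-1)-mer suffix; the count is the edge multiplicity."""
    edges = Counter()
    for fragment in fragments:
        for kmer in kmers(fragment, k):
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

fragments = ["ATGT", "ATGC", "CTGT"]   # the three visible pieces of the input
all_kmers = [km for f in fragments for km in kmers(f)]
edges = de_bruijn_edges(fragments)
print(all_kmers)
for (src, dst), count in sorted(edges.items()):
    print(src, "->", dst, "x", count)
```

The edges AT -> TG and TG -> GT each have multiplicity 2, which is the kind of high-multiplicity path Step 3 looks for; a single substitution (ATGC instead of ATGT) already splits the path, illustrating the sensitivity problem noted above.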