1 / 74

Optimal k -mer superstrings for protein identification and DNA assay design.

Optimal k -mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park. k -mer (Sub-)Problems. Enumerate: For all (distinct) k -mers, do Existence:

mandek
Télécharger la présentation

Optimal k -mer superstrings for protein identification and DNA assay design.

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

  2. k-mer (Sub-)Problems • Enumerate: • For all (distinct) k-mers, do • Existence: • ...with respect to exact (& inexact) count ¸x • Uniqueness: • ...with respect to exact & inexact match • Near-neighbors: • ...with respect to inexact match • Representation: • Represent (distinct) k-mers for other tools • Fast annotation of k-mer counts on original sequences

  3. Applications of k-mer sets • Peptide Identification • Represent all amino-acid 30-mers • ...that occur at least twice in human dbEST • PCR Primer Design: • Test DNA 20-mer primers for uniqueness • What does it mean to be unique? • DNA sequencing error / repeat detection • Eliminate mers that are too rare or too frequent • Pathogen signatures • Near-neighbors imply potential false-positives

  4. k-mer Superstring Problem Given • A set of sequences S = { S1, ..., Sn } • Sequence database • Word size k Find • A new set of sequences T = { T1, ..., Tm } Such that • Total length of T is minimized, and • T is complete and correct w.r.t. k-mers of S

  5. k-mer Superstring Problem • Completeness • All of the k-mers of S are represented • Correctness • No additional k-mers are present • Minimize the total representation length • Correlates with running time

  6. Shortest (common) superstring problem • General strings (arbitrary length) • Single output string • Completeness for input sequences only • Classical NP-hard problem • Garey and Johnson • Approximate within ~ 2.5*OPT • Max-SNP hard • One of the first algorithmic approaches to genome assembly

  7. de Bruijn Sequences de Bruijn sequences represent all words of length k from some alphabet A. A = {0,1}, k = 3: s = 0001110100 A = {0,1}, k = 4: s = 0000111101011001000

  8. 110 011 100 001 000 010 111 101 de Bruijn Graph: A = {0,1}, k = 4 1 1 0 1 0 1 1 0 1 0 1 0 1 0 0 0

  9. de Bruijn Sequences & Graphs de Bruijn graphs (k,A): • Edges represent length k words from A • Each node has • in degree |A| • out degree |A| Eulerian tour constructs de Bruijn sequence.

  10. Sequencing-by-Hybridization-graph ACDEFGI, ACDEFACG, DEFGEFGI

  11. Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

  12. Sequence Databases & CSBH-graphs • Original sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

  13. C3 Enumeration • Complete • All k-mers are present • Correct • No other k-mers are present • Compact • No k-mer is present more than once

  14. Correct, Complete, Compact (C3) Enumeration • Set of paths that use each edge exactly once ACDEFGEFGI, DEFACG

  15. Correct, Complete (C2) Enumeration • Set of paths that use each edge at least once ACDEFGEFGI, DEFACG

  16. Patching the CSBH-graph • Use artificial edges to fix unbalanced nodes

  17. Patching the CSBH-graph • Use matching-style formulations to choose artificial edges • Optimal C2/C3 enumeration in polynomial time. • Chinese Postman Problem • Edmonds and Johnson, ’73 • l-tuple DNA sequencing • Pevzner, ’89 • Shortest (Common) Superstring • MAX-SNP-hard, 2.5 approx algorithm

  18. Related work • Chinese Postman Problem • Undirected graph, weighted edges • Shortest path that uses all the edges • Solvable in polynomial time • Construct minimum weighted matching between nodes of odd-degree • Add matching to graph and find Eulerian path • Minimize weight of extra edges used

  19. C2 Enumeration • Chinese postman problem, except: • Directed graph • Add edges from nodes with surplus in-degree to nodes with surplus out-degree • Fixed cost teleportation option • Can always “start” a new sequence • Find optimal set of additional edges • Transportation problem / min cost flow instance

  20. -2 -1 -4 -1 -2 1 3 2 1 3 C3 Enumeration #in-#out #in-#out Cost: k

  21. EHAC FHAC ACD HAC GHAC D Reusing Edges • ACDEHAC, ACDFHAC, ACDGHACD

  22. $ACD EHAC ACD HAC FHAC GHAC D Reusing Edges • C3: ACDEHACDFHAC, ACDGHACD

  23. D EHAC ACD HAC FHAC GHAC D Reusing Edges • C2: ACDEHACDFHACDGHAC

  24. -4 -2 -1 -2 -1 3 2 1 1 3 C2 Enumeration #in-#out #in-#out 4 10 “Shortcut paths” 7

  25. -4 -2 -1 -1 -2 0 0 3 1 2 3 1 C3 Enumeration #in-#out #in-#out Cost: 0 Cost: 0 Cost: k

  26. Enzymatic Digest and Fractionation Sample Preparation for Peptide Identification

  27. Single Stage MS MS m/z

  28. Tandem Mass Spectrometry(MS/MS) m/z Precursor selection m/z

  29. Tandem Mass Spectrometry(MS/MS) Precursor selection + collision induced dissociation (CID) m/z MS/MS m/z

  30. Peptide Identification • For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well • Peptide sequences from protein sequence databases • Swiss-Prot, IPI, NCBI’s nr, ... • Automated, high-throughput peptide identification in complex mixtures

  31. Novel Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications

  32. Novel Splice Isoform

  33. Novel Splice Isoform

  34. Novel Mutation • HUPO Plasma Proteome Project • Pooled samples from 10 male & 10 female healthy Chinese subjects • Plasma/EDTA sample protocol • Li, et al. Proteomics 2005. (Lab 29) • TTR gene • Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis. • Familial amyloidotic polyneuropathy • late-onset, dominant inheritance

  35. Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy

  36. Novel Mutation

  37. Compressed EST Peptide Sequence Database • For all ESTs mapped to a UniGene gene: • Six-frame translation • Eliminate ORFs < 30 amino-acids • Eliminate amino-acid 30-mers observed once • Compress to C2 FASTA database • Complete, Correct for amino-acid 30-mers • Inclusive gene-centric peptide sequence database: • Size: < 3% of naïve enumeration, 20774 FASTA entries • Running time: ~ 1% of naïve enumeration search • E-values: ~ 2% of naïve enumeration search results

  38. Compressed EST Peptide Sequence Database • For all ESTs mapped to a UniGene gene: • Six-frame translation • Eliminate ORFs < 30 amino-acids • Eliminate amino-acid 30-mers observed once • Compress to C2 FASTA database • Complete, Correct for amino-acid 30-mers • Gene-centric peptide sequence database: • Size: < 3% of naïve enumeration, 20774 FASTA entries • Running time: ~ 1% of naïve enumeration search • E-values: ~ 2% of naïve enumeration search results

  39. Sequence Databases & CSBH-graphs • All k-mers represented by an edge have the same count 1 2 2 1 2

  40. CSBH-graph subgraphs • Quickly determine those that occur twice 2 2 1 2

  41. k-mer (Sub-)Problems • Enumerate: • For all (distinct) k-mers, do • Existence: • ...with exact (& inexact) count ¸x • Uniqueness: • ...exact & inexact match • Near-neighbors: • ...inexact match • Representation: • Represent (distinct) k-mers for other tools • Fast annotation of k-mer counts on original sequences

  42. Large scale instances! • CSBH-graph instances • Partition set of all k-mers, determine non-trivial nodes • Days on condor grid (250 CPUs) to construct • ¸ 100,000,000 nodes and edges (sparse & dense) • Min-cost flow instances • ¸ 500,000 nodes and edges • Algorithms must be linear in problem size • Out-of-core Eulerian path algorithm? • Currently testing out-of-core connected-components

  43. Grid computing • Heterogeneous machines • Varying disk/memory/MHz/cores capabilities • Centralized scheduler • Jobs started asynchronously • Other jobs may preempt current job • Input files may need to be staged • 250 simultaneous requests for a 3Gb file? • How to guarantee integrity of input files? • Problem decomposition may be non-trivial • Jobs sizes need to fit the least capable machine • Sometimes need to “game” the scheduler • Need to ensure the integrity of job output

  44. Uniqueness Oracles • Oracle for uniqueness of 20-mers in the Human genome (size: 3Gb) • Count occurrences in the genome: 0,1,2+ • Construct 20-mer superstring for 20-mers with count 1 • Construct 20-mer superstring for 20-mers with count > 1 • Easy(-ish) for exact sequence match: O(n) • Fast automata, hash tables, suffix trees.

  45. Polymerase Chain Reaction

  46. Polymerase Chain Reaction

  47. Inexact sequence match • Inexact sequence matching O(n*m*k) • Errors/Mismatches (k): 1,2,3 • # distinct 20-mers (m): O(n) • Achieve expected linear time using a hybrid approach (blastn): • Exact search for short chunks of primers • Expensive alignment only where chunks match • Large chunks ) Fast, but miss occurrences • Small chunks ) Slow, find all matches

  48. Inexact sequence match Baeza-Yates Perleberg: • Correct and O(n) for small k • At least 1 chunk is observed with no error. • Small k → Large chunks → Fast and correct • Form of locality sensitive hashing g ≠ = ≠ q

  49. Locality Sensitive Hashing • For each primer: • store a (set of) hash(es) in hash-table • At each position in the genome: • look-up a (set of) hash(es) in hash-table • if any hash is found, do more expensive check • Need to weigh • sensitivity (false negatives) vs • specificity (false positives) • Our application requires speed andno false negatives!

  50. Random Projection • Choose T templates of l random “care” positions g q

More Related