Optimal k -mer superstrings for protein identification and DNA assay design.

Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

k-mer (Sub-)Problems • Enumerate: • For all (distinct) k-mers, do • Existence: • ...with respect to exact (& inexact) count ¸x • Uniqueness: • ...with respect to exact & inexact match • Near-neighbors: • ...with respect to inexact match • Representation: • Represent (distinct) k-mers for other tools • Fast annotation of k-mer counts on original sequences

Applications of k-mer sets • Peptide Identification • Represent all amino-acid 30-mers • ...that occur at least twice in human dbEST • PCR Primer Design: • Test DNA 20-mer primers for uniqueness • What does it mean to be unique? • DNA sequencing error / repeat detection • Eliminate mers that are too rare or too frequent • Pathogen signatures • Near-neighbors imply potential false-positives

k-mer Superstring Problem Given • A set of sequences S = { S1, ..., Sn } • Sequence database • Word size k Find • A new set of sequences T = { T1, ..., Tm } Such that • Total length of T is minimized, and • T is complete and correct w.r.t. k-mers of S

k-mer Superstring Problem • Completeness • All of the k-mers of S are represented • Correctness • No additional k-mers are present • Minimize the total representation length • Correlates with running time

Shortest (common) superstring problem • General strings (arbitrary length) • Single output string • Completeness for input sequences only • Classical NP-hard problem • Garey and Johnson • Approximate within ~ 2.5*OPT • Max-SNP hard • One of the first algorithmic approaches to genome assembly

de Bruijn Sequences de Bruijn sequences represent all words of length k from some alphabet A. A = {0,1}, k = 3: s = 0001110100 A = {0,1}, k = 4: s = 0000111101011001000

110 011 100 001 000 010 111 101 de Bruijn Graph: A = {0,1}, k = 4 1 1 0 1 0 1 1 0 1 0 1 0 1 0 0 0

de Bruijn Sequences & Graphs de Bruijn graphs (k,A): • Edges represent length k words from A • Each node has • in degree |A| • out degree |A| Eulerian tour constructs de Bruijn sequence.

Sequencing-by-Hybridization-graph ACDEFGI, ACDEFACG, DEFGEFGI

Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

Sequence Databases & CSBH-graphs • Original sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

C3 Enumeration • Complete • All k-mers are present • Correct • No other k-mers are present • Compact • No k-mer is present more than once

Correct, Complete, Compact (C3) Enumeration • Set of paths that use each edge exactly once ACDEFGEFGI, DEFACG

Correct, Complete (C2) Enumeration • Set of paths that use each edge at least once ACDEFGEFGI, DEFACG

Patching the CSBH-graph • Use artificial edges to fix unbalanced nodes

Patching the CSBH-graph • Use matching-style formulations to choose artificial edges • Optimal C2/C3 enumeration in polynomial time. • Chinese Postman Problem • Edmonds and Johnson, ’73 • l-tuple DNA sequencing • Pevzner, ’89 • Shortest (Common) Superstring • MAX-SNP-hard, 2.5 approx algorithm

Related work • Chinese Postman Problem • Undirected graph, weighted edges • Shortest path that uses all the edges • Solvable in polynomial time • Construct minimum weighted matching between nodes of odd-degree • Add matching to graph and find Eulerian path • Minimize weight of extra edges used

C2 Enumeration • Chinese postman problem, except: • Directed graph • Add edges from nodes with surplus in-degree to nodes with surplus out-degree • Fixed cost teleportation option • Can always “start” a new sequence • Find optimal set of additional edges • Transportation problem / min cost flow instance

-2 -1 -4 -1 -2 1 3 2 1 3 C3 Enumeration #in-#out #in-#out Cost: k

EHAC FHAC ACD HAC GHAC D Reusing Edges • ACDEHAC, ACDFHAC, ACDGHACD

$ACD EHAC ACD HAC FHAC GHAC D Reusing Edges • C3: ACDEHACDFHAC, ACDGHACD

D EHAC ACD HAC FHAC GHAC D Reusing Edges • C2: ACDEHACDFHACDGHAC

-4 -2 -1 -2 -1 3 2 1 1 3 C2 Enumeration #in-#out #in-#out 4 10 “Shortcut paths” 7

-4 -2 -1 -1 -2 0 0 3 1 2 3 1 C3 Enumeration #in-#out #in-#out Cost: 0 Cost: 0 Cost: k

Enzymatic Digest and Fractionation Sample Preparation for Peptide Identification

Single Stage MS MS m/z

Tandem Mass Spectrometry(MS/MS) m/z Precursor selection m/z

Tandem Mass Spectrometry(MS/MS) Precursor selection + collision induced dissociation (CID) m/z MS/MS m/z

Peptide Identification • For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well • Peptide sequences from protein sequence databases • Swiss-Prot, IPI, NCBI’s nr, ... • Automated, high-throughput peptide identification in complex mixtures

Novel Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications

Novel Splice Isoform

Novel Mutation • HUPO Plasma Proteome Project • Pooled samples from 10 male & 10 female healthy Chinese subjects • Plasma/EDTA sample protocol • Li, et al. Proteomics 2005. (Lab 29) • TTR gene • Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis. • Familial amyloidotic polyneuropathy • late-onset, dominant inheritance

Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy

Novel Mutation

Compressed EST Peptide Sequence Database • For all ESTs mapped to a UniGene gene: • Six-frame translation • Eliminate ORFs < 30 amino-acids • Eliminate amino-acid 30-mers observed once • Compress to C2 FASTA database • Complete, Correct for amino-acid 30-mers • Inclusive gene-centric peptide sequence database: • Size: < 3% of naïve enumeration, 20774 FASTA entries • Running time: ~ 1% of naïve enumeration search • E-values: ~ 2% of naïve enumeration search results

Compressed EST Peptide Sequence Database • For all ESTs mapped to a UniGene gene: • Six-frame translation • Eliminate ORFs < 30 amino-acids • Eliminate amino-acid 30-mers observed once • Compress to C2 FASTA database • Complete, Correct for amino-acid 30-mers • Gene-centric peptide sequence database: • Size: < 3% of naïve enumeration, 20774 FASTA entries • Running time: ~ 1% of naïve enumeration search • E-values: ~ 2% of naïve enumeration search results

Sequence Databases & CSBH-graphs • All k-mers represented by an edge have the same count 1 2 2 1 2

CSBH-graph subgraphs • Quickly determine those that occur twice 2 2 1 2

k-mer (Sub-)Problems • Enumerate: • For all (distinct) k-mers, do • Existence: • ...with exact (& inexact) count ¸x • Uniqueness: • ...exact & inexact match • Near-neighbors: • ...inexact match • Representation: • Represent (distinct) k-mers for other tools • Fast annotation of k-mer counts on original sequences

Large scale instances! • CSBH-graph instances • Partition set of all k-mers, determine non-trivial nodes • Days on condor grid (250 CPUs) to construct • ¸ 100,000,000 nodes and edges (sparse & dense) • Min-cost flow instances • ¸ 500,000 nodes and edges • Algorithms must be linear in problem size • Out-of-core Eulerian path algorithm? • Currently testing out-of-core connected-components

Grid computing • Heterogeneous machines • Varying disk/memory/MHz/cores capabilities • Centralized scheduler • Jobs started asynchronously • Other jobs may preempt current job • Input files may need to be staged • 250 simultaneous requests for a 3Gb file? • How to guarantee integrity of input files? • Problem decomposition may be non-trivial • Jobs sizes need to fit the least capable machine • Sometimes need to “game” the scheduler • Need to ensure the integrity of job output

Uniqueness Oracles • Oracle for uniqueness of 20-mers in the Human genome (size: 3Gb) • Count occurrences in the genome: 0,1,2+ • Construct 20-mer superstring for 20-mers with count 1 • Construct 20-mer superstring for 20-mers with count > 1 • Easy(-ish) for exact sequence match: O(n) • Fast automata, hash tables, suffix trees.

Polymerase Chain Reaction

Inexact sequence match • Inexact sequence matching O(n*m*k) • Errors/Mismatches (k): 1,2,3 • # distinct 20-mers (m): O(n) • Achieve expected linear time using a hybrid approach (blastn): • Exact search for short chunks of primers • Expensive alignment only where chunks match • Large chunks ) Fast, but miss occurrences • Small chunks ) Slow, find all matches

Inexact sequence match Baeza-Yates Perleberg: • Correct and O(n) for small k • At least 1 chunk is observed with no error. • Small k → Large chunks → Fast and correct • Form of locality sensitive hashing g ≠ = ≠ q

Locality Sensitive Hashing • For each primer: • store a (set of) hash(es) in hash-table • At each position in the genome: • look-up a (set of) hash(es) in hash-table • if any hash is found, do more expensive check • Need to weigh • sensitivity (false negatives) vs • specificity (false positives) • Our application requires speed andno false negatives!

Random Projection • Choose T templates of l random “care” positions g q

Optimal k -mer superstrings for protein identification and DNA assay design.

Optimal k -mer superstrings for protein identification and DNA assay design.

Presentation Transcript

Superstrings

Protein Identification

Identification: classic and DNA

DNA, RNA, and Protein

DNA, RNA, and Protein

DNA and Protein Synthesis

Bradford/ BioRad Protein Assay

DNA and Protein Synthesis

Protein Sequencing and Identification

DNA Codes for Protein

DNA Identification

DNA and Protein Synthesis

Protein and DNA interaction

DNA and Protein Synthesis

DNA and Protein Synthesis

DNA, RNA and Protein

DNA AND PROTEIN SYNTHESIS

DNA AND PROTEIN SYNTHESIS

protein identification

Host Cell Protein Assay

DNA Identification

DNA  PROTEIN