Design and Optimization of Universal DNA Arrays

Design and Optimization of Universal DNA Arrays Ion Mandoiu Computer Science & Engineering Department University of Connecticut http://www.engr.uconn.edu/~ion/

DNA Arrays • Exploit Watson-Crick complementarity to simultaneously perform a large number of substring tests • Used in a variety of high-throughput genomic analyses • Transcription (gene expression) analysis • Single Nucleotide Polymorphism (SNP) genotyping • Alternative splicing, ChIP-on-chip, tiling arrays, genomic-based species identification, point-of-service diagnosis, … • Common array formats involve direct hybridization between labeled DNA/RNA sample and DNA probes attached to a glass slide

Universal DNA Arrays • “Programable” arrays • Array consists of application independent oligonucleotides • Analysis carried by a sequence of reactions involving application specific primers • Flexible AND cost effective • Universal array architectures: tag arrays, APEX arrays, SBE/SBH arrays,…

Overview • Tag Array Design - Tag Set Design - Tag Assignment Algorithms • SBE/SBH Assays - Decoding and Multiplexing Algorithms • Conclusions

T T T A G C C C C G G G G A A SNP Genotyping with Tag Arrays Tag Primer G + A G 2. Solution phase hybridization • Mix reporter probes with unlabeled genomic DNA C antitag 4. Solid phase hybridization 3. Single-Base Extension (SBE)

Tag Array Advantages • Cost effective • Same array used in many analyses  can be mass produced • Easy to customize • Only need to synthesize new set of reporter probes • Reliable • Solution phase hybridization better understood than hybridization on solid support

Tag Set Design Problem (H1) Tags hybridize strongly to complementary antitags (H2) No tag hybridizes to a non-complementary antitag (H3) Tags do not cross-hybridize to each other t1 t1 t2 t2 t1 t1 t2 Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H3)

Hybridization Models • Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state • 2-4 rule Tm = 2 #(As and Ts) + 4 #(Cs and Gs) • More accurate models exist, e.g., the near-neighbor model

Hybridization Models (contd.) • Hamming distance model, e.g., [Marathe et al. 01] • Models rigid DNA strands • LCS/edit distance model, e.g., [Torney et al. 03] • Models infinitely elastic DNA strands • c-token model [Ben-Dor et al. 00]: • Duplex formation requires formation of nucleation complex between perfectly complementary substrings • Nucleation complex must have weight  c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (2-4 rule)

c-h Code Problem • c-token:left-minimal DNA string of weight  c, i.e., • w(x)  c • w(x’) < c for every proper suffix x’ of x • A set of tags is a c-h code if (C1) Every tag has weight  h (C2) Every c-token is used at most once c-h Code Problem [Ben-Dor et al.00] Given c and h, find maximum cardinality c-h code

Algorithms for c-h Code Problem • [Ben-Dor et al.00] approximation algorithm based on DeBruijn sequences • Alphabetic tree search algorithm • Enumerate candidate tags in lexicographic order, save tags whose c-tokens are not used by previously selected tags • Easily modified to handle various combinations of constraints • [MT 05, 06] Optimum c-h codes can be computed in practical time for small values of c by using integer programming • Practical runtime using Garg-Koneman approximation and LP-rounding

Token Content of a Tag c=4 CCAGATT CC CCA CAG AGA GAT GATT Tag  sequence of c-tokens End pos: 2 3 4 5 6 7 c-token: CCCCACAGAGAGATGATT

Layered c-token graph for length-l tags l-1 l c/2 (c/2)+1 … c1 t s cN

Integer Program Formulation [MPT05] • Maximum integer flow problem w/ set capacity constraints • O(hN) constraints & variables, where N = #c-tokens

Number of c-tokens

Periodic Tags [MT05] • Key observation: c-token uniqueness constraint in c-h code formulation is too strong • A c-token should not appear in two different tags, but can be repeated in a tag • A tag t is called periodic if it is the prefix of () for some “period”  • Periodic strings make best use of c-tokens

c-token factor graph, c=4 (incomplete) CC AAG AAC AAAA AAAT

Vertex-disjoint Cycle Packing Problem • Given directed graph G, find maximum number of vertex disjoint directed cycles in G • [MT 05] APX-hard even for regular directed graphs with in-degree and out-degree 2 • h-c/2+1 approximation factor for tag set design problem • [Salavatipour and Verstraete 05] • Quasi-NP-hard to approximate within (log1- n) • O(n1/2) approximation algorithm

Cycle Packing Algorithm • Construct c-token factor graph G • T{} • For all cycles C defining periodic tags, in increasing order of cycle length, • Add to T the tag defined by C • Remove C from G • Perform an alphabetic tree search and add to T tags consisting of unused c-tokens • Return T • Gives an increase of over 40% in the number of tags compared to previous methods

h Experimental Results

Antitag-to-Antitag Hybridization • Additional constraint: antitags do not cross-hybridize, including self • Formalization in c-token hybridization model: (C3) No two tags contain complementary substrings of weight  c • Cycle packing and tree search extend easily

h Results w/ Extended Constraints

More Hybridization Constraints… t1 t1 t2 • Enforced during tag assignment by • - Leaving some tags unassigned and distributing primers across multiple arrays [Ben-Dor et al. 03] • - Exploiting availability of multiple primer candidates [MPT05]

p’ t’ p t Assignable Primers • If primer p hybridizes to the complement of tag t’, at most one of the assignments (p,t’), (p,t) and (p’,t’) can be made • Set P of primers is assignable to a set T of tags if the condition above is satisfied for every p,p’ and t,t’

Characterization of Assignable Sets • conflict graph: • G=(T  P,E), where (t,p) ∈ E if t and p hybridize • X = number of primers adjacent to a degree 1 tag • Y = number of degree 0 tags X=1 Y=2 • [Ben-Dor 04] Set P is assignable to T iff X+Y  |P|

Finding Assignable Primer Sets Multiplexing Problem: given primer set P and tag set T, find partition of P into minimum number of assignable sets • Both problems are NP-hard [Ben-Dor 04] Maximum Assignable Primer Set Problem: given primer set P and tag set T, find a maximum size assignable subset of P

Integration with Primer Selection • In practice, several primer candidates with equivalent functionality • In SNP genotyping, can pick primer from either forward and reverse strand • In gene expression/identification applications, many primers have desired length, Tm, etc.

Pooled Array Multiplexing Problem Pooled Multiplexing Problem: Given set of primer pools P and tag set T, find a primer from each pool and a partition of selected primers into minimum number of assignable sets

Pooled Multiplexing Algorithms • Primer-Del = greedy deletion for pools similar to [Ben-Dor et al 04] • Repeatedly delete primer of maximum potential until X+Y  #pools, where • Potential of tag t is 2-deg(t) • Potential of primer p is sum of potentials of conflicting tags • Subtract ½ if primer adjacent to a tag of degree 1

Pooled Multiplexing Algorithms • Primer-Del = greedy deletion for pools similar to [Ben-Dor et al 04] • Primer-Del+ = same but never delete last primer from pool unless no other choice • Min-Pot = select primer with min potential from each pool, then run Primer-Del • Min-Deg = select primer with min degree, then run Primer-Del • Iterative ILP = iteratively find a maximum assignable pool set using integer linear program

Results: GenFlex Tags, c=8

Herpes B Gene Expression Assay GenFlex Tags Periodic Tags

SBE/SBH Assay [MP 06] Primers T T A A T T TTGCA T CCATT A GATAA T hybridization to k-mer array (SBH) single-base extension (SBE)

Some notations • P set of primers, X set of probes • Ep⊆ {A,C,T,G} the set of possible extensions for primer p • The spectrum of primer p, SpecX(p), is the set of probes hybridizing with p • The extended spectrum of primer p with extension set Ep,

Decodable primer sets • Four parallel single-color SBE/SBH experiments  one type of extension in each SBE experiment • P is weakly decodable with respect to extension e if for every primer p • One SBE/SBH experiment with 4 colors (4 extensions) • P is weaklydecodable if for every primer p and every extension e ∈ Ep

Strongly r-decodable primer sets • Hybridization involving labeled nucleotide is less predictable Informative probes should not rely on it • Signal from one SNP may obscure signal from another when read at the same probe due to differences in DNA amplification efficiency Informative probes cannot be shared between SNPs • P is strongly r-decodable if for every primer p where r = redundancy parameter

MPPP • A set of primer pools P ={P1,…,Pn } is strongly r-decodable iff there is a primer pi in each pool Pi such that {p1,…,pn} is strongly r-decodable. Minimum Pool Partitioning Problem (MPPP) Given: • primer pools set P and extensions sets Ep, for every primer p • probe set X • redundancy r Find: partition of P into the min number of strongly r-decodable subsets

MDPSP Maximum r-Decodable Pool Subset Problem (MDPSP) Given: • primer pools set P and extensions sets Ep, for every primer p • probe set X • redundancy r Find: • strongly r-decodable subset of P of maximum size

Min-Greedy Algorithm for Maximum Induced Matching in General Graphs • Pick a vertex u of min degree • Pick a vertex v of min degree from among u’s neighbors • Add edge (u,v) to the matching • Delete all neighbors of u and v from the graph • Repeat the above steps until the graph becomes empty • [Duckworth 05] d-1 approximation factor for d-regular graphs

Min-Greedy Algorithms for MDPSP • Bipartite hybridization graph G: • Primers in left side, probes in right side • Two types of edges: • N+(p)=SpecX(p) • N-(p)=SpecX(p,Ep) \ SpecX(p) • Two algorithm variants: • MinPrimerGreedy: pick primer first • MinProbeGreedy: pick probe first • Delete primer/probe if N+ degree drops below r/1

Experimental results for k-mers (|Ep|=4, primer length=20)

MDPSP Size vs Primer Length k=10

Experimental results for c-tokens (|Ep|=4, primer length=20)

MDPSP Size vs Primer Length c=13

Conclusions and Ongoing Work • Combinatorial algorithms yield significant increases in multiplexing rates of universal arrays • New SBE/SBH architecture particularly promising based on preliminary simulation results • Ongoing work: • Extend methods to more accurate hybridization models, e.g., use NN melting temperature models • More complex (e.g., temperature dependent) DNA tag set non-interaction requirements for DNA self/mediated assembly • Probabilistic decoding in presence of hybridization errors • Application to novel domains, e.g., DNA barcoding

Acknowledgments • Claudia Prajescu and Dragos Trinca • Funding from NSF (Awards 0546457 and 0543365) and UCONN Research Foundation

Design and Optimization of Universal DNA Arrays