Elevator Speech for Oracle Interview

Zhaoming Yin, Jan 16, 2014

Summary
• Work on GPU algorithms for sequence alignment
  - Using the GPU to parallelize the HMM and CRF algorithms for sequence alignment.
• Work on algorithms for the genome rearrangement problem
  - Algorithm engineering methods that accelerate the median algorithm by more than 2 orders of magnitude.
  - A new algorithm to deal with unequal-content data.
  - A new software package to construct trees from unequal-content data.
• Work on parallelizing optimization problems
  - Problems such as the knapsack problem and the exemplar distance problem (both NP-hard).

GPU Algorithms for Sequence Alignment (HMM & CRF)
Wave-front algorithm: the matrix is filled the way a wave front advances. Each block's value is calculated from the values of previously calculated blocks, so all blocks on the same front can be computed at the same time.
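The wave-front order can be sketched on a much simpler recurrence than the HMM or CRF ones. The example below is only an illustration: it uses the longest-common-subsequence recurrence as a stand-in, and the function name `wavefront_lcs` is hypothetical. Cells with the same i + j lie on one anti-diagonal and depend only on earlier anti-diagonals, which is the independence a GPU kernel would exploit; here the inner loop simply runs serially.

```python
def wavefront_lcs(a, b):
    """Fill an LCS score matrix one anti-diagonal (wave front) at a time."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # d = i + j indexes the wave front; row 0 and column 0 stay at 0.
    for d in range(2, n + m + 1):
        # All cells on this front are independent of each other,
        # so this inner loop is the part that could run in parallel.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                score[i][j] = score[i - 1][j - 1] + 1
            else:
                score[i][j] = max(score[i - 1][j], score[i][j - 1])
    return score[n][m]


if __name__ == "__main__":
    print(wavefront_lcs("ACGT", "AGT"))
```

Every cell reads only its left, upper, and upper-left neighbours, all of which sit on the two previous fronts, so the result is the same as a row-by-row fill.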

GPU Algorithms for Sequence Alignment (HMM & CRF)
Streaming algorithm: transferring data between host and device.

Genome Rearrangement (algorithm engineering)
Genome rearrangements observed in Drosophila polytene chromosomes.
Dobzhansky, T., and A. H. Sturtevant, 1938. Inversions in the chromosomes of Drosophila pseudoobscura. Genetics 23: 28-64.

Experimental Results (Time)

Experimental Results (Space)

Genome Rearrangement (New Algorithm)
Traditional Algorithm:
1 2 3 4 5 6 7 8 9 10
1 3 -2 4 6 7 -9 -10 -8
1 2 -3 -7 -6 -9 -8 -10 4
…………
New Algorithm:
1 2 4 5 6 7 8 9 10
1 3 -2 -2 4 6 6 7 -9 -10 -8
1 2 -3 -7 -6 -9 -8 -10 4
…………
(annotated on the slide: deletion, duplication)

Experimental Result
• Close distance estimation
• Close median estimation
http://ai.stanford.edu/~serafim/CS374_2006/presentations/lecture17.ppt

Genome Rearrangement
In the 1980s, Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip. The genes were 99% similar, yet these surprisingly identical gene sequences differed in gene order. This study helped pave the way to analyzing genome rearrangements in molecular evolution.
1 2 3 4 5 6 7 8 9 10
Inversion: 1 2 -6 -5 -4 -3 7 8 9 10
Transposition: 1 2 7 8 3 4 5 6 9 10
Inverted Transposition: 1 2 7 8 -6 -5 -4 -3 9 10
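The three operations on this slide can be written as small list manipulations on a signed gene order. This is a minimal sketch; the function names and the half-open index convention are assumptions made for the example, not part of the talk. Transposition is modelled as swapping two adjacent blocks.

```python
def inversion(genome, i, j):
    """Reverse the block genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Swap the adjacent blocks genome[i:j] and genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Move genome[i:j] behind genome[j:k] and invert the moved block."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


if __name__ == "__main__":
    genome = list(range(1, 11))                  # 1 2 3 ... 10
    print(inversion(genome, 2, 6))               # block 3..6 inverted
    print(transposition(genome, 2, 6, 8))        # block 3..6 moved behind 7 8
    print(inverted_transposition(genome, 2, 6, 8))
```

With the block 3..6 and the neighbouring block 7 8, the three calls reproduce the three gene orders listed on the slide.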

Genome Median Computation
[Figure: graphs on vertices 1-6]

Genome Median Computation
Genomes: 1,2,3 | 1,-3,-2 | -2,-1,3
Candidate medians:
1,2,3 = 2 moves
2,-1,3 = 5 moves
…..
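A candidate median is scored by summing its distances to the three genomes. The sketch below assumes that a "move" is a signed inversion and computes the distance by breadth-first search, which is only practical for very small genomes; the slide does not state the distance model, and the names `inversion_distance` and `median_score` are hypothetical.

```python
from collections import deque


def neighbors(genome):
    """Yield every genome reachable from `genome` by one signed inversion."""
    n = len(genome)
    for i in range(n):
        for j in range(i + 1, n + 1):
            flipped = tuple(-g for g in reversed(genome[i:j]))
            yield genome[:i] + flipped + genome[j:]


def inversion_distance(src, dst):
    """Minimum number of signed inversions from src to dst (brute-force BFS)."""
    src, dst = tuple(src), tuple(dst)
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        current, dist = queue.popleft()
        if current == dst:
            return dist
        for nxt in neighbors(current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("genomes do not contain the same genes")


def median_score(candidate, genomes):
    """Sum of distances from the candidate median to every genome."""
    return sum(inversion_distance(candidate, g) for g in genomes)


if __name__ == "__main__":
    genomes = [(1, 2, 3), (1, -3, -2), (-2, -1, 3)]
    print(median_score((1, 2, 3), genomes))
```

Under this assumed model the candidate 1,2,3 needs one inversion to reach each of the other two genomes and none to reach itself.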

Step 1: Spectral Partition

Step 2: Compute MP Tree for Each Sub-Disk

Step 2-1: How to Compute Median (BNB)
[Figure: graphs on vertices 1-8]

Step 2-2: How to Compute Median (LK)
[Figure: sequence of steps ending in "stop"]

Step 2-2: How to Evaluate Median
Genome 1: 1, 2, 3, 4, 3, 6, 5
Genome 2: 1, 2, 3, 4, 6, 3, 5
Genome 3: 1, 2, 5, 4, 6, 3, 3
Median (med): 1, 2, 3, 3, 4, 6, 5
Score: Dis(m,1) + Dis(m,2) + Dis(m,3)

Step 2-2: How to Evaluate Median
1, 2, 3, 3, 4, 6, 5 / 1, 2, 3, 4, 3, 5
Find a mapping first (NP-hard): dis = 1
1, 2, 3, 3, 4, 6, 5 / -2, -1, 3, 3, 4, 5
Complete the loss (polynomial): dis = 2
1, 2, 3, 4, 6, 5 / -2, -1, 3, 4, 6, 5
Compute DCJ (polynomial): dis = 3
1, 2, 3, 4, 6, 5 / 1, 2, 3, 4, 6, 5

Step 3: Merge Disks
• Decomposition of the disks
• Construct a tree for each disk
• Merge the trees using a specific consensus method: strict, majority, etc.
• Disambiguation
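The strict and majority consensus rules named above can be illustrated on trees that share the same leaf set. This is a simplified sketch: each tree is represented only as a set of clades (frozensets of leaf names), the function names are hypothetical, and merging disk trees with differing leaf sets would need more machinery than shown here.

```python
from collections import Counter


def strict_consensus(trees):
    """Keep only the clades that appear in every tree."""
    return set.intersection(*(set(tree) for tree in trees))


def majority_consensus(trees):
    """Keep the clades that appear in more than half of the trees."""
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade for clade, k in counts.items() if 2 * k > len(trees)}


if __name__ == "__main__":
    ab, ac, abc = frozenset("AB"), frozenset("AC"), frozenset("ABC")
    trees = [{ab, abc}, {ab, abc}, {ac, abc}]
    print(strict_consensus(trees))    # only the clade shared by all trees
    print(majority_consensus(trees))  # also keeps the clade seen in 2 of 3
```

Strict consensus is the more conservative rule, so its result is always a subset of the majority result.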

Step 4: Initialization
• Init by insertion, which is local.
• Init by prospection, which is global.

Step 5: Iterative Refinement

Review
• Step 1: Spectral partition
• Step 2: Subtree construction
• Step 3: Supertree merge
• Step 4: Initialization of the complete tree using the General Adequate Subgraph (GAS) method
• Step 5: Iterative refinement until the complete tree converges

Result--Simulated Data
seed #Theta + #gamma + #phi operations
• We grow our own tree.
• We know the total number of evolutionary events in the model tree.

Result--Accuracy
• % of duplication: 0.1
• % of loss: 0.1
• Theta is the % of inversion.
• There are 8 species, so the tree has 2*8-3 = 13 edges.
• The average accuracy is ~90%.

Result – Real Data
SCRaMbLE Matrix
• We can represent a SCRaMbLEd strain by its vector.
• The sign gives the orientation.
• The color encodes the position in the synthetic chromosome.

Result – Real Data #inversion:#insertion/deletion:#duplication

Parallel Method [Bader 05] Load Balancing Parallel search

Experimental Results (Parallel)

Why Many-core BnB?
• Many distributed-memory MIP BnB frameworks already exist (PICO, PEBBL, ALPS, COIN-OR).
• Load balancing in distributed BnB relies heavily on the ramp-up phase; run-time load balancing is not efficient.
• But today's petaflops machines are mostly hybrid systems (distributed + many-core, or accelerators).
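For reference, the search that such frameworks distribute can be sketched serially on the 0/1 knapsack problem. This is an illustrative single-threaded version, not the many-core implementation from the talk; the function name `knapsack_bnb` is hypothetical, weights are assumed positive, and the bound is the standard fractional relaxation. The explicit stack of open nodes is the work pool that a parallel version would have to balance across cores.

```python
def knapsack_bnb(values, weights, capacity):
    """0/1 knapsack by depth-first branch and bound (weights must be > 0)."""
    # Sort by value density so the fractional bound is easy to compute.
    items = sorted(zip(values, weights),
                   key=lambda vw: vw[0] / vw[1], reverse=True)
    n = len(items)

    def bound(idx, value, room):
        """Upper bound: fill the remaining room greedily, last item fractional."""
        for v, w in items[idx:]:
            if w <= room:
                value += v
                room -= w
            else:
                return value + v * room / w
        return value

    best = 0
    stack = [(0, 0, capacity)]          # open nodes: (next item, value, room)
    while stack:
        idx, value, room = stack.pop()
        best = max(best, value)         # every node is a feasible selection
        if idx == n or bound(idx, value, room) <= best:
            continue                    # leaf reached, or subtree pruned
        v, w = items[idx]
        stack.append((idx + 1, value, room))              # branch: skip item
        if w <= room:
            stack.append((idx + 1, value + v, room - w))  # branch: take item
    return best


if __name__ == "__main__":
    print(knapsack_bnb([60, 100, 120], [10, 20, 30], 50))
```

Pruning depends on how early a good incumbent is found, so the amount of work per subtree is hard to predict in advance, which is why load balancing dominates the parallel design.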

Experimental Results (Intel Phi knapsack)
