Efficient Algorithms for SNP Haplotype Block Selection Problems

Efficient Algorithms for SNP Haplotype Block Selection Problems Yaw-Ling Lin ( 林耀鈴 ) Dept Computer Sci and Info Engineering College of Computing and Informatics Providence University, Taiwan E-mail: yllin@pu.edu.tw http://www.cs.pu.edu.tw/~yawlin

Outline: Introduction; Motivation; Terminology Definition; Diversity Functions; Haplotype Block Selection; Dealing with Missing Data; Experiment; Conclusion

Introduction Mutation in DNA is the principal factor responsible for the phenotypic differences among human beings. SNPs (single nucleotide polymorphisms) are the most common type of mutation.

Introduction (cont.) Recent studies have shown that chromosome recombination takes place only at some narrow hotspots. Haplotype blocks are the segments between these hotspots, where little or no recombination occurs. [Figure: recombination illustrated with alleles A/a and B/b.]

Motivation The SNPs within a haplotype block are highly correlated because the diversity within each block is low. SNPs, haplotype patterns, and disease genes in the same block tend to be associated (linkage).

Terminology Definition
Haplotypes are encoded site by site: major allele ← 0, minor allele ← 1, missing (n) ← 3.

Site (major/minor)  c/t  c/t  c/g  a/g  a/t  t/a  g/c  t/c  g/a  c/t
H1                   c    n    c    a    a    a    g    t    a    c
H2                   t    t    g    a    t    n    g    c    g    n
H3                   c    c    c    g    a    t    n    t    g    t
H4                   t    c    n    n    t    t    c    c    g    c

Encoded matrix:
H1                   0    3    0    0    0    1    0    0    1    0
H2                   1    1    1    0    1    3    0    1    0    3
H3                   0    0    0    1    0    0    3    0    0    1
H4                   1    0    3    3    1    0    1    1    0    0
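The encoding on this slide can be reproduced with a short sketch. The function name `encode` and the choice to treat any symbol other than the major or minor allele as missing are assumptions made for illustration; only the alleles, the haplotypes H1 to H4, and the 0/1/3 codes come from the slide.

```python
MISSING = 3  # code used on the slide for an unknown allele 'n'

def encode(haplotypes, alleles):
    """Map each haplotype string to a row of 0 (major), 1 (minor), 3 (missing)."""
    matrix = []
    for hap in haplotypes:
        row = []
        for base, (major, minor) in zip(hap, alleles):
            if base == major:
                row.append(0)
            elif base == minor:
                row.append(1)
            else:  # assumption: 'n' or any unexpected symbol counts as missing
                row.append(MISSING)
        matrix.append(row)
    return matrix

# Major/minor alleles and haplotypes H1..H4 as listed on the slide.
alleles = [tuple(s.split("/")) for s in
           "c/t c/t c/g a/g a/t t/a g/c t/c g/a c/t".split()]
haplotypes = ["cncaaagtac", "ttgatngcgn", "cccgatntgt", "tcnnttccgc"]
A = encode(haplotypes, alleles)
```

Each row of `A` is one encoded haplotype, so `A` is the m×n matrix with entries in {0, 1, 3} that the later slides operate on.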

Terminology Definition (cont.)

Diversity Functions Each distinct haplotype string s_i in a matrix is associated with a probability p_i. Example: p_i = 2/7, 2/7, 1/7, 1/7, 1/7.

Diversity Functions (cont.) Generalization: replace the square with an arbitrary power q. Information entropy function:
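The formulas themselves did not survive on these slides, so the following is only a sketch under stated assumptions: the basic diversity is taken to be 1 − Σ p_i² (suggested by the reference to "the square"), the generalization to be 1 − Σ p_i^q, and the entropy function to be the Shannon entropy −Σ p_i log₂ p_i. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import log2

def pattern_probabilities(rows):
    """Probability p_i of each distinct haplotype string among the rows."""
    counts = Counter(tuple(r) for r in rows)
    total = len(rows)
    return [Fraction(c, total) for c in counts.values()]

def power_diversity(probs, q=2):
    """Assumed form: 1 - sum(p_i ** q); q = 2 gives the squared version."""
    return 1 - sum(p ** q for p in probs)

def entropy_diversity(probs):
    """Assumed form: Shannon entropy in bits, -sum(p_i * log2(p_i))."""
    return -sum(float(p) * log2(float(p)) for p in probs)

# The example probabilities from the slide: 2/7, 2/7, 1/7, 1/7, 1/7.
probs = [Fraction(2, 7), Fraction(2, 7),
         Fraction(1, 7), Fraction(1, 7), Fraction(1, 7)]
squared = power_diversity(probs)        # 1 - 11/49
cubed = power_diversity(probs, q=3)     # 1 - 19/343
entropy = entropy_diversity(probs)
```

Under these assumptions a block in which all rows are identical has diversity 0 for every one of the three functions, and the value grows as the rows spread over more distinct patterns.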

Results

Results (cont.)

Haplotype Block Selection Computing the diversities of all blocks [i, j]: there are n^2 (i, j) pairs in total, and computing the diversity of one block directly takes O(mn) time, so the total time complexity is O(mn^3).
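For comparison with the faster method developed on the following slides, the direct approach can be sketched as below. This is a baseline illustration, not the suffix-tree algorithm: it assumes complete data, 0-based inclusive column indices, and the 1 − Σ p_i² diversity, and the function names are hypothetical.

```python
from collections import Counter

def block_diversity(A, i, j):
    """Diversity of columns i..j (0-based, inclusive), assuming 1 - sum(p^2)."""
    m = len(A)
    counts = Counter(tuple(row[i:j + 1]) for row in A)
    return 1 - sum((c / m) ** 2 for c in counts.values())

def all_block_diversities(A):
    """Evaluate every block [i, j] independently: O(mn) work for each of O(n^2) blocks."""
    n = len(A[0])
    return {(i, j): block_diversity(A, i, j)
            for i in range(n) for j in range(i, n)}

# A small hypothetical 4x3 haplotype matrix without missing data.
A = [[0, 0, 1],
     [0, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
table = all_block_diversities(A)
```

Every block is recomputed from scratch here, which is exactly the redundancy the suffix-tree and LCA constructions are meant to remove.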

Haplotype Block Selection (cont.) Suffix tree T1: the suffix tree of the 1st row (1-suffix), with n leaves. Time complexity: O(n).

Haplotype Block Selection (cont.) Merge the m suffix trees (1-suffix, ..., i-suffix, ..., m-suffix) into the total suffix tree T*, which has mn leaves.

Lowest Common Ancestor

LCA (confluent) subtree

Confluent subtree – Illustration

Constructing confluent subtree

Haplotype Block Selection (cont.) LCA trees: for an m×n haplotype matrix, n LCA trees (1-LCA tree, ..., i-LCA tree, ..., n-LCA tree) are derived from T*, each with m leaves. The 1-LCA tree is built from the 1st suffix string of each row.

Haplotype Block Selection (cont.) Event-List [Figure: the 8-LCA tree with leaf groups {h1(8), h6(8)}, {h4(8), h5(8)}, {h3(8)}, {h7(8)}, {h2(8)} and its event list 1[4,3], 2[2,2,2,1], 4[2,2,1,1,1].]

Haplotype Block Selection (cont.) Event-List and Depth-List [Figure: a BFS search of the 8-LCA tree turns its event list 1[4,3], 2[2,2,2,1], 4[2,2,1,1,1] into depth-list entries 8[4,3], 8[2,2], 8[2,1], ..., 8[2,2,1,1,1] for depths 1 to n.]

Haplotype Block Selection (cont.) Farthest sites (good partners) [Figure: sites i-1 and i with their farthest sites L[i-1] and L[i].]

Haplotype Block Selection (cont.)

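Haplotype Block Selection (cont.) Dynamic Programming [Figure: two cases for choosing blocks B1, ..., Bk within [i, j]: either the last block Bk is [L[j], j] and B1, ..., Bk-1 lie to its left, or all of B1, ..., Bk lie within [i, j-1].]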

Haplotype Block Selection (cont.) Dynamic Programming [Figure: the DP table; entry f(k,i,j) in row k is computed from f(k,i,j-1) in the same row and f(k-1,i,L[j]-1) in row k-1.]

Haplotype Block Selection (cont.) Dynamic Programming

Haplotype Block Selection (cont.) Dynamic Programming [Figure: the DP table indexed by i and j, starting from i=1.]

Dealing with Missing Data Sometimes we may fail to distinguish two different haplotypes because of the ambiguity caused by missing data. Let A_ij ∈ {0,1,3}, where A_ij = 3 means that the j-th site of observation i is missing. One way to deal with missing data is to assign each A_ij = 3 to either 0 or 1 such that the resulting diversity is minimized.

Dealing with Missing Data (cont.) The minimum-diversity problem is NP-hard by a reduction from the minimum-clique-partition problem. Two rows i, j of A are different if there exists a column k such that {A_ik, A_jk} = {0,1}. Two rows are compatible if they are not different. [Figure: a reduction example over rows 1 to 5 with the pairs (1,3), (1,4), (2,4), (3,5), the corresponding matrix over {0,1,3}, and the resulting strings 0001 and 1110.]
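The two definitions on this slide translate directly into code. The sketch below uses hypothetical function names and rows given as lists over {0, 1, 3}.

```python
def different(row_a, row_b):
    """True if some column holds a 0 in one row and a 1 in the other."""
    return any({a, b} == {0, 1} for a, b in zip(row_a, row_b))

def compatible(row_a, row_b):
    """Two rows are compatible when they are not different."""
    return not different(row_a, row_b)

# Hypothetical rows: 3 marks a missing entry.
a = [0, 3]
b = [3, 3]
c = [1, 3]
```

Compatibility is not transitive: here `a` and `b` are compatible, `b` and `c` are compatible, yet `a` and `c` are different. Grouping rows into sets of pairwise compatible rows is therefore a clique-partition question rather than a simple equivalence-class computation, which fits the reduction named on the slide.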

Dealing with Missing Data (cont.) Our heuristic method: 1. Partition phase: partition the rows into two sets, S and T (missing data).

Dealing with Missing Data (cont.) Our heuristic method: 1. Partition phase (cont.) [Figure: rows s1, ..., s4 of S alongside rows t1, ..., t3 of T (missing data).]

Dealing with Missing Data (cont.) Our heuristic method: 2. Search phase. 3. Assignment phase (consolidate). [Figure: rows t1, ..., t3 of T (missing data) are searched against rows s1, ..., s5 of S, with outcomes labeled count+1 and Miss.]

Experiment Experiment Method Data: Patil et al., "Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21." Chromosome: 21. No. of SNPs: 24,047, from 20 individuals. Diversity thresholds: 0.85 and 0.9. No. of blocks: 100, 200, and 300. Classification by block length: length < 15, 15 ≤ length ≤ 30, and length > 30.

Experiment (cont.) Experiment Results [Figures: results for D = 0.85 with 100 blocks and for D = 0.9 with 100 blocks.]

Conclusion Contributions: We develop a visualization tool to help observe the diversity of haplotype strings. We propose several efficient algorithms to select interesting haplotype blocks using different diversity functions. We show that the minimum-diversity problem is NP-hard and propose a heuristic method for dealing with missing data.

Conclusion (cont.) Future and ongoing work: Explore and elaborate other meaningful diversity functions. Improve our diversity visualization tool. Select tagSNPs within the haplotype blocks. Run further experiments on related biomedical haplotype data.

Thank You! Any Questions?

Problem Definitions (1) Given a haplotype matrix A, find a segmentation S consisting of k blocks such that the coverage of common haplotypes in each block is more than α% and the total length of S is maximized.

Monotonic Diversity A diversity function δ is said to be monotonic if, for any block (interval) I = [i, j] of A, it follows that δ(i', j') ≤ δ(i, j) whenever [i', j'] ⊆ [i, j]; that is, the diversity of any subinterval of I is never larger than the diversity of I. The coverage of common haplotypes does not satisfy the monotonic property in haplotype samples with missing data.
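The definition can be checked by brute force on a small matrix. The sketch below is an illustration only: `is_monotonic` is a hypothetical helper, the 1 − Σ p_i² diversity is an assumed example of δ, and the test matrix has no missing data.

```python
from collections import Counter

def block_diversity(A, i, j):
    """Assumed example of delta: 1 - sum(p^2) over columns i..j (0-based, inclusive)."""
    m = len(A)
    counts = Counter(tuple(row[i:j + 1]) for row in A)
    return 1 - sum((c / m) ** 2 for c in counts.values())

def is_monotonic(A, delta, eps=1e-12):
    """Check delta(i', j') <= delta(i, j) for every [i', j'] inside every [i, j]."""
    n = len(A[0])
    for i in range(n):
        for j in range(i, n):
            outer = delta(A, i, j)
            for i2 in range(i, j + 1):
                for j2 in range(i2, j + 1):
                    if delta(A, i2, j2) > outer + eps:
                        return False
    return True

A = [[0, 0, 1],
     [0, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
ok = is_monotonic(A, block_diversity)

# A deliberately non-monotonic function: shorter blocks score higher.
bad = is_monotonic(A, lambda M, i, j: -(j - i))
```

On complete data, widening a block can only split groups of identical rows further, which is why the assumed diversity passes the check; the slide's remark is that coverage-based measures lose this property once entries are missing.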

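Longest Blocks Partitioning with Constraint on Diversity • Dynamic programming algorithm [Figure: two cases for choosing blocks B1, ..., Bk within [i, j]: either the last block Bk is [L[j], j] and B1, ..., Bk-1 lie to its left, or all of B1, ..., Bk lie within [i, j-1].]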

Longest Blocks Partitioning with Constraint on Diversity (cont.) • Preprocessing of farthest sites (good partners) • Given a haplotype matrix A and a diversity upper limit D, for each column j find the farthest left marker i = L[j] such that δ(i,j) < D. • We use the techniques of suffix tree and LCA to solve the problem in O(mn + n^2) time.
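The meaning of L[j] can be illustrated with a simpler technique than the one named on the slide. The sketch below is a two-pointer scan, not the suffix-tree and LCA method, and it does not achieve the O(mn + n^2) bound because each diversity is recomputed from scratch. It assumes a monotonic δ (here the 1 − Σ p_i² diversity on complete data), 0-based columns, and the hypothetical convention that L[j] = j + 1 when no feasible block ends at column j.

```python
from collections import Counter

def block_diversity(A, i, j):
    """Assumed delta: 1 - sum(p^2) over columns i..j (0-based, inclusive)."""
    m = len(A)
    counts = Counter(tuple(row[i:j + 1]) for row in A)
    return 1 - sum((c / m) ** 2 for c in counts.values())

def farthest_sites(A, D, delta=block_diversity):
    """L[j] = smallest i with delta(i, j) < D, or j + 1 if there is none.

    Relies on monotonicity: L[j] never moves left as j grows, so the left
    pointer i only ever advances.
    """
    n = len(A[0])
    L = []
    i = 0
    for j in range(n):
        while i <= j and delta(A, i, j) >= D:
            i += 1
        L.append(i)
    return L

A = [[0, 0, 1],
     [0, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
L_tight = farthest_sites(A, 0.6)
L_loose = farthest_sites(A, 0.7)
```

With the tighter limit every feasible block has length 1, while the looser limit admits the whole matrix as a single block ending at the last column.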

Longest Blocks Partitioning with Constraint on Diversity (cont.) Time: O(nk) after the preprocessing of the L[j]'s. Space: O(nk). [Figure: the k×n DP table; entry f(k,i,j) is computed from f(k,i,j-1) and f(k-1,i,L[j]-1).]
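The table on this slide suggests a recurrence in which f(k,i,j) is the better of f(k,i,j-1) and f(k-1,i,L[j]-1) plus the length of the block [L[j], j]. The sketch below implements that reading for i fixed at the first column; the exact recurrence is an assumption reconstructed from the slide, "at most k blocks" is an assumed interpretation, and the input uses 0-based L[j] with L[j] = j + 1 meaning that no feasible block ends at column j.

```python
def longest_k_blocks(L, k):
    """Maximum total length of at most k disjoint feasible blocks.

    L[j] is the leftmost column (0-based) such that [L[j], j] is feasible;
    L[j] == j + 1 means no feasible block ends at column j.
    Runs in O(nk) time and O(nk) space, matching the bounds on the slide.
    """
    n = len(L)
    # f[b][j]: best total length using at most b blocks in the first j columns.
    f = [[0] * (n + 1) for _ in range(k + 1)]
    for b in range(1, k + 1):
        for j in range(1, n + 1):
            best = f[b][j - 1]                # column j-1 is left uncovered
            start = L[j - 1]
            if start <= j - 1:                # a feasible block ends at column j-1
                best = max(best, f[b - 1][start] + (j - start))
            f[b][j] = best
    return f[k][n]

# Hypothetical farthest-site array for six columns.
L = [0, 0, 1, 3, 3, 3]
one = longest_k_blocks(L, 1)
two = longest_k_blocks(L, 2)
three = longest_k_blocks(L, 3)
```

Taking the longest block ending at a column is safe only when the diversity is monotonic, because any shortened version of a feasible block is then still feasible and the blocks to its left can be trimmed to make room.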

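Longest Blocks Partitioning with Constraint on Diversity (cont.) Linear space [Figure: the linear-space scheme on the interval [i, j], showing the cut-point x* for k > 1 (regions D, D1, D2, E, E1, E2) and the base case k = 1 (region D1).]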

Longest Blocks Partitioning with Constraint on Diversity (cont.) How to find the cut-point x* [Figure: candidate cut-points x = i, i+1, i+2, ..., j-2, j-1, with the chosen cut-point x* marked.]

Longest Blocks Partitioning with Constraint on Diversity (cont.) Time: O(nk) after the preprocessing of the L[j]'s and R[j]'s. Let T(n, k) denote the time needed for f(k, 1, n). Assume that T(n', k') ≤ c_2 n' k' for all n' < n, k' < k. According to the algorithm, we have:

Experimental Results Algorithm: time O(nk), space O(n). Experiment method: 24,047 SNPs from 20 individuals (chromosome 21), using the same criteria as Patil et al. (coverage = 80%). Results: Patil et al. (2001): 4,563 tagSNPs and a total of 4,135 blocks. Zhang et al. (2002): 3,582 tagSNPs and 2,575 blocks. Our results: 4,588 tagSNPs and 1,707 haplotype blocks; 673 blocks suffice to cover 80% of the chromosome region.

Problem Definitions (2) Given a haplotype matrix A and a specified number t of tagSNPs, find a list of feasible blocks such that the coverage of common haplotypes in each block is more than α%, the total number of tagSNPs required for these blocks is less than t, and the total length is maximized.

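Longest Blocks Partitioning with Constraint on Diversity and TagSNPs Dynamic programming algorithm [Figure: blocks B1, ..., Bn under a tagSNP budget t; the last block Bn spans columns k to i and uses tag(k,i) tagSNPs, leaving t - tag(k,i) for blocks B1, ..., Bn-1 in columns 1 to k-1.]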
