Non-breaking Similarity of Genomes with Gene Repetitions

Binhai Zhu, Computer Science Department, Montana State University. Joint work with Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang and Zhiyu Zhao.

Presentation Transcript


  1. Non-breaking Similarity of Genomes with Gene Repetitions. Binhai Zhu, Computer Science Department, Montana State University. Joint work with Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang and Zhiyu Zhao.

  2. Background • Computing the genomic distance between genomes is important in evolutionary molecular biology; the problem was first studied by Sturtevant and Dobzhansky in 1936. • Much research has been done on computing genomic distances since 1990, assuming that each gene appears exactly once in a genome, e.g., the famous result by Hannenhalli and Pevzner on sorting signed permutations by reversals.

  3. Background (cont.) • On the other hand, gene repetition is very common in genomes, so computing genomic distances with gene repetitions is a more realistic problem. • Since this is a typical optimization problem, it makes sense to study its approximability.

  4. Definitions • Given a set F of n gene families (an alphabet), a genome G’ is a sequence of elements of F such that each element has a (+/-) sign. Example: F={a,b,c,d}, G’=-bd-cab-d-c. • We will focus on unsigned sequences in this work. • A genome G is said to be exemplar if every gene appears exactly once in G.

  5. Definitions (cont.) • Given exemplar genomes G and H over the same set of gene families, if the gene pair ab is a substring of G but not of H, then ab constitutes a breakpoint in G. Example: for G=abcdefg and H=efgdcab, there are 3 breakpoints in G (and symmetrically in H). • The number of breakpoints between G and H is called the breakpoint distance between G and H, denoted b(G,H).
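The breakpoint count in the example above can be checked with a minimal Python sketch. The function names `adjacencies` and `breakpoint_distance` are illustrative and not from the talk. The sketch assumes genomes are plain strings of single-letter genes and that adjacency is order-sensitive (ab and ba are different), which is the reading under which the example yields 3.

```python
def adjacencies(genome):
    """Return the set of ordered adjacent gene pairs in an exemplar genome."""
    return {genome[i:i + 2] for i in range(len(genome) - 1)}


def breakpoint_distance(g, h):
    """Count adjacencies of g that do not occur as substrings of h.

    Assumes g and h are exemplar genomes over the same gene families,
    so every adjacency occurs at most once in each genome.
    """
    return len(adjacencies(g) - adjacencies(h))


if __name__ == "__main__":
    # Example from the slide: bc, cd and de are the breakpoints in G.
    print(breakpoint_distance("abcdefg", "efgdcab"))
```

Because both genomes are exemplar and have the same length, the count is the same in either direction.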

  6. Exemplar Breakpoint Distance Problem • Given two genomes G’ and H’ over n gene families, compute two exemplar genomes G and H, obtained from G’ and H’ by deleting redundant genes, such that the breakpoint distance between G and H is minimized. • We call this the exemplar breakpoint distance problem (between G’ and H’). Denote this distance by eb(G’,H’)=b(G,H).

  7. Approximation Algorithms • Given a minimization (maximization) problem Π, let OPT be the value of an optimal solution of Π. An approximation algorithm A provides a performance guarantee of α for Π if, for every instance of Π, the solution value returned by A is at most α·OPT (at least OPT/α). • Usually we say that A is a factor-α approximation for Π.

  8. Prior Results (1) • We showed that the exemplar breakpoint distance problem does not admit any approximation unless P=NP (that is, deciding whether eb(G’,H’)=0 is NP-complete) [Chen, Fu and Zhu, 2006]. • This result holds for any genomic distance d( ) such that G=H implies d(G,H)=0. • Based on the above result, even under a weaker model of approximation, we showed that the exemplar conserved interval distance problem does not admit any WEAK approximation of a superlinear factor [Chen, Fowler, Fu and Zhu, 2007].

  9. Prior Results (2) • On the other hand, for the exemplar breakpoint distance problem, Sankoff used branch-and-bound [Sankoff, 1999] and Nguyen, Tay and Zhang [2005] used divide-and-conquer on practical datasets to obtain good empirical results. • In a related but slightly different effort, Chauve, et al. [2006] studied exemplar genomic similarity problems whose measures do not satisfy "G=H implies d(G,H)=0", e.g., the exemplar common interval measure problem.

  10. Background for this work • We look at the complement of the breakpoint distance under the gene duplication model. • As the problem is still hard to approximate, we follow Nguyen, et al. by considering genomes satisfying some practical conditions.

  11. Definitions • Given exemplar genomes G and H drawn from the same alphabet, ab is a non-breaking point if ab appears in both G and H. The number of non-breaking points is called the non-breaking similarity of G and H, denoted nbs(G,H). Example: for G = abcdefg and H = fegcdab there are two non-breaking points, so nbs(G,H)=2. Note that when |G|=|H|=n and G=H, nbs(G,H)=n-1. • Given genomes G’ and H’ drawn from the same alphabet, possibly with gene repetitions, the exemplar non-breaking similarity problem is to delete redundant genes to obtain exemplar genomes G and H such that nbs(G,H) is maximized. The corresponding measure is denoted enbs(G’,H’).
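As a small illustration of the definition, the following Python sketch computes nbs for two exemplar genomes given as strings. The helper names are hypothetical, and the sketch assumes single-letter genes and order-sensitive adjacencies, as in the slide's example.

```python
def adjacencies(genome):
    """Return the set of ordered adjacent gene pairs of an exemplar genome."""
    return {genome[i:i + 2] for i in range(len(genome) - 1)}


def nbs(g, h):
    """Non-breaking similarity: number of adjacencies shared by g and h."""
    return len(adjacencies(g) & adjacencies(h))


if __name__ == "__main__":
    # Example from the slide: the shared adjacencies are ab and cd.
    print(nbs("abcdefg", "fegcdab"))
    # Identical genomes of length n share all n-1 adjacencies.
    print(nbs("abcdefg", "abcdefg"))
```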

  12. Example
  G’ = abcadcefg, H’ = cfegcdabf.
  There are 4 possible exemplar genomes for G’: abcdefg, abdcefg, bcadefg, badcefg.
  There are 4 possible exemplar genomes for H’: cfegdab, cegdabf, fegcdab, egcdabf.
  enbs(G’,H’)=nbs(abcdefg,fegcdab)=2.
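The example can be reproduced by brute force. The sketch below enumerates every exemplar genome by keeping exactly one copy of each gene and takes the maximum nbs over all pairs. It is an exponential-time illustration only, not the algorithm from the talk, and the function names are hypothetical.

```python
from itertools import product


def exemplars(genome):
    """Return all exemplar genomes obtained by keeping one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    result = set()
    # Choose one kept position per gene, then read the survivors in order.
    for kept in product(*positions.values()):
        result.add("".join(genome[i] for i in sorted(kept)))
    return result


def nbs(g, h):
    """Number of ordered adjacencies shared by exemplar genomes g and h."""
    adj_g = {g[i:i + 2] for i in range(len(g) - 1)}
    adj_h = {h[i:i + 2] for i in range(len(h) - 1)}
    return len(adj_g & adj_h)


def enbs(g_prime, h_prime):
    """Exemplar non-breaking similarity by exhaustive search."""
    return max(nbs(g, h)
               for g in exemplars(g_prime)
               for h in exemplars(h_prime))


if __name__ == "__main__":
    print(sorted(exemplars("abcadcefg")))
    print(sorted(exemplars("cfegcdabf")))
    print(enbs("abcadcefg", "cfegcdabf"))
```

The number of candidate exemplars is the product of the occurrence counts of the genes, which is why this search is only practical for tiny inputs.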

  13. Inapproximability Result Theorem 1. Given an exemplar genome G and another genome H’ such that the genes are all from the same alphabet of size n and each gene appears in H’ at most two times, the Exemplar Non-breaking Similarity Problem over G and H’ does not admit any approximation of factor n^(1-ε), unless P=NP. Proof idea: a linear reduction from Independent Set (IS).

  14. [Figure: example input graph with N=5 vertices v1,…,v5 and M=5 edges e1,…,e5; N+M is even.]
  G: v1v’1v2v’2v3v’3v4v’4v5v’5x1e1x’1x2e2x’2x3e3x’3x4e4x’4x5e5x’5
  H’: Y_(N+M-1) Y_(N+M-3) … Y_1 Y_(N+M) Y_(N+M-2) … Y_2 = x4x’4x2x’2v5e4e5v’5v3e1v’3v1e1e2v’1x5x’5x3x’3x1x’1v4e3e5v’4v2e2e3e4v’2
  where Y_i = v_i A_i v’_i if i ≤ N, and Y_(N+i) = x_i x’_i if i ≤ M. (In the expansion above, A_i is the list of edges incident to v_i.)
  H: x4x’4x2x’2v5e5v’5v3v’3v1e1e2v’1x5x’5x3x’3x1x’1v4v’4v2e3e4v’2, which corresponds to the optimal independent set {v3,v4}.
  The input graph has an IS of size K iff enbs(G,H’)=K.
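The construction on this slide can be sketched in Python and checked by brute force on the example graph. The edge list below is an assumption inferred from the expansion of H’ (for instance, e1 appears in the blocks of v1 and v3), and the function names are hypothetical. Genes are represented as string tokens, with an ASCII apostrophe standing in for the prime.

```python
from itertools import product


def build_instance(n_vertices, edges):
    """Build (G, H') from a graph; edge k (1-based) becomes gene e{k}."""
    m = len(edges)
    total = n_vertices + m
    assert total % 2 == 0, "the slide assumes N+M is even"
    g = []
    for i in range(1, n_vertices + 1):
        g += [f"v{i}", f"v'{i}"]
    for k in range(1, m + 1):
        g += [f"x{k}", f"e{k}", f"x'{k}"]

    def block(j):
        # Y_j = v_j A_j v'_j for j <= N, and Y_{N+k} = x_k x'_k otherwise.
        if j <= n_vertices:
            incident = [f"e{k}" for k, (u, v) in enumerate(edges, 1)
                        if j in (u, v)]
            return [f"v{j}"] + incident + [f"v'{j}"]
        k = j - n_vertices
        return [f"x{k}", f"x'{k}"]

    # Odd-indexed blocks in decreasing order, then even-indexed blocks.
    order = list(range(total - 1, 0, -2)) + list(range(total, 0, -2))
    h_prime = [tok for j in order for tok in block(j)]
    return g, h_prime


def adjacencies(seq):
    """Ordered adjacent token pairs of a genome given as a list."""
    return set(zip(seq, seq[1:]))


def exemplars(seq):
    """Yield every exemplar genome that keeps one copy of each gene."""
    positions = {}
    for i, gene in enumerate(seq):
        positions.setdefault(gene, []).append(i)
    for kept in product(*positions.values()):
        yield [seq[i] for i in sorted(kept)]


def enbs(g, h_prime):
    """Brute-force enbs when g is already exemplar."""
    adj_g = adjacencies(g)
    return max(len(adj_g & adjacencies(h)) for h in exemplars(h_prime))


if __name__ == "__main__":
    # Assumed edge list, read off the blocks of H' on the slide.
    edges = [(1, 3), (1, 2), (2, 4), (2, 5), (4, 5)]
    g, h_prime = build_instance(5, edges)
    print("".join(h_prime))
    print(enbs(g, h_prime))
```

In this instance the only adjacencies that G and an exemplar of H’ can share are pairs v_i v’_i, which occur exactly when every edge at v_i has been deleted from the block of v_i. Keeping each edge once therefore forces the emptied blocks to form an independent set.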

  16. Positive Results Our motivation comes from Nguyen, Tay and Zhang [2005], who observed that for certain bacterial genome pairs (Baphi-Wigg, Pmult-Hinft, Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated genes are usually pegged, e.g., …xyx…aba…

  17. Positive Results Definition: occ(g,G’) is the number of occurrences of g in G’. span(g,G’) is the maximum distance between two copies of g in G’. totalocc(c,G’) = ∑_{gene g in G’ with span(g,G’) ≥ c} occ(g,G’)

  18. Positive Results Definition: occ(g,G’) is the number of occurrences of g in G’. span(g,G’) is the maximum distance between two copies of g in G’. totalocc(c,G’) = ∑_{gene g in G’ with span(g,G’) ≥ c} occ(g,G’) Example: G’=abcdaebd, span(a,G’)=4, span(b,G’)=5, span(d,G’)=4, totalocc(4,G’)=6.
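The three quantities can be computed directly from their definitions. The Python sketch below follows the slide's example. It assumes that distance means the difference between positions and that a gene with a single copy has span 0, which the slide does not state explicitly.

```python
def occ(gene, genome):
    """Number of occurrences of gene in genome."""
    return genome.count(gene)


def span(gene, genome):
    """Maximum distance between two copies of gene (0 if fewer than two)."""
    positions = [i for i, x in enumerate(genome) if x == gene]
    return positions[-1] - positions[0] if positions else 0


def totalocc(c, genome):
    """Total occurrences of all genes whose span is at least c."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


if __name__ == "__main__":
    genome = "abcdaebd"
    print(span("a", genome), span("b", genome), span("d", genome))
    print(totalocc(4, genome))
```

With c=1, totalocc counts every occurrence of every repeated gene, which is how it is used for G’ in Theorem 2.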

  19. Positive Results Theorem 2. Let G’ and H’ be two genomes with t=totalocc(1,G’) + totalocc(c,H’), for a constant c. Then enbs(G’,H’) can be computed in O(3^⌊t/3⌋ n^(c+2+ε)) time.

  20. Positive Results Theorem 2. Let G’ and H’ be two genomes with t=totalocc(1,G’) + totalocc(c,H’), for a constant c. Then enbs(G’,H’) can be computed in O(3^⌊t/3⌋ n^(c+2+ε)) time. Idea 1: Given an exemplar genome G and another genome H” satisfying span(g,H”)≤c for every g in H”, we can use divide and conquer to compute enbs(G,H”) in O(n^(c+2+ε)) time. Roughly speaking, write H”=H_1 H_2 H_3 with |H_2|=c, then enumerate all solutions on H_2 and recurse. T(n) ≤ 2^(c+1) [2T(n/2+c)] + O(n) ≤ O(n^(c+2+ε))
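One way to see why the recurrence in Idea 1 gives the stated bound is sketched below. This is a standard master-theorem argument supplied here for illustration, not taken from the talk; the symbols δ and n_0 are introduced only for this sketch.

```latex
% Recurrence from Idea 1, with the two factors merged:
T(n) \le 2^{c+1}\bigl[\,2\,T(n/2 + c)\bigr] + O(n) = 2^{c+2}\,T(n/2 + c) + O(n).

% For any fixed \delta \in (0,1) there is an n_0 = n_0(c,\delta) such that
n/2 + c \le n / 2^{1-\delta} \quad \text{for all } n \ge n_0 .

% Hence, for n \ge n_0,
T(n) \le 2^{c+2}\,T\!\left(n / 2^{1-\delta}\right) + O(n),

% and the master theorem with a = 2^{c+2}, b = 2^{1-\delta} gives
T(n) = O\!\left(n^{\log_b a}\right) = O\!\left(n^{(c+2)/(1-\delta)}\right)
     = O\!\left(n^{c+2+\varepsilon}\right),
\qquad \varepsilon = \frac{(c+2)\,\delta}{1-\delta}.
```

Since δ can be chosen arbitrarily small, ε can be made arbitrarily small as well, at the cost of a larger constant.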

  21. Positive Results Theorem 2. Let G’ and H’ be two genomes with t=totalocc(1,G’) + totalocc(c,H’), for a constant c. Then enbs(G’,H’) can be computed in O(3^⌊t/3⌋ n^(c+2+ε)) time. Idea 2: As t is considered a constant, we enumerate all possibilities for deleting duplicated genes in G’ (to obtain G) and for deleting genes with span greater than c in H’ (to obtain H”). By Lemma 6, there are at most 4·3^⌊t/3⌋ such combinations. Therefore, the total running time is 4·3^⌊t/3⌋ · O(n^(c+2+ε)) = O(3^⌊t/3⌋ n^(c+2+ε)).

  22. Positive Results Theorem 3. Let G’ and H’ be two genomes with a total of t genes g satisfying shift(g,G’,H’)>c, for some constant c. Then enbs(G’,H’) can be computed in O(3^⌊t/3⌋ n^(2c+1+ε)) time. Example: G’=abcadef, H’=bcedefad, shift(a,G’,H’)=6.

  23. Conclusion • We introduce non-breaking similarity, which is the complement of the famous breakpoint distance, for genome comparison. • The general exemplar non-breaking similarity problem is hard to approximate. • For some special cases, we can obtain polynomial-time algorithms.
