Genome Halving – work in progress

Genome Halving – work in progress Fulton Wang 8-17-2011 ACGT Group Meeting

Motivation: chromosomes have undergone whole duplication in the past, going from “pre duplicated genome” to “perfectly duplicated genome” • “pre duplicated genome” A: single circular chr with exactly 1 copy of each gene • “perfectly duplicated chr” A+A: genome that has 2 copies of every adjacency in A • A+A could be either 1 or 2 chrs, ie A-A or A A depending on how you label the adjacencies in A+A • Eg: 1234 -> 12341234 or 1234 1234

After the duplication event, rearrangements occur that alter the adjacencies in the perfectly duplicated chr to give us present day genome G • Eg: 12341234 -> 12314234 • We will assume that G is a single circular chr • Given G, want to infer A

Natural solution: want to find preduplicatedchr A such that dist(A+A, G) is minimized • Simple distance measure: breakpoint distance: • N - # shared adjacencies between 2 chrs • Eg: 1234 and 1243 has breakpoint distance of 1 • But A+A and G each have 2 copies of every gene

What is breakpoint distance between 1212 and 1122? Not well defined. • Need to match up the genes in the 2 chr to each other. Need to label every gene copy • 11211222 and 11122122 gives 0 shared adjacencies • 11211222 and 11122221 gives 1 shared adjacency

If G1, G2 are duplicated genomes(exactly 2 copies of every gene), then dist(G1,G2) is the minimum of bp(G1,G2) over all possible labellings of G1, G2 • Halving problem: given a genome G with exactly 2 copies of each gene, we want to find the minimum over all pre duplicated genomes A of dist(A+A, G). Assume A and G are both single circular chr

Restriction is that the pre duplicated chr A be a single circular chr • If you didn’t care about # pieces in A, then there is simple algorithm to solve halving problem – involves just maximum matching where the vertices are the gene ends, edges between each pair of gene ends weighed by how many times that adjacency appears in G

Restrictions on A make the problem harder • Right now, I am considering the simpler case when I restrict all the genes to be positively oriented, ie no backwards facing genes

Graph representation of problem • A chr can be represented as a directed graph, where each vertex is a specific copy of a gene • 11211222 becomes 11->21->12->22 • Genome depicted is 1121311222423241 • Grouped members of same gene family together. View each family as 1 “big” vertex, so that graph has 4 vertices

Picking A corresponds to picking a tour of the “big” vertices. • Suppose A = 1234. Before labeling the adjacencies in A, you don’t know if A+A is A-A or A A • Can you label the adjacencies in A+A such that the number of shared adjacencies between A+A and G is the number of shared adjacencies between A+A and G, if the genes are unlabeled?

Yes. Note that this is only true because we don’t care if A+A is 1 or 2 chr • This means that if you fix A, then bp(A+A) is equal to the number of “small” edges parallel to A • This means we can work with alternate graph, with n vertices, each representing a gene family, where the weight of an edge is the number of adjacencies between the 2 gene families in G

Now, genome halving is reduced to finding max weight directed ham cycle in the complete graph, with edge weights of 0,1,2 • dist(A+A,G) = min_Amin_labellings (A+A, G) • Can simplify problem – if A-A/A A is the minimum possible distance from G, and some adjacency A->B appears twice in G, then A must contain that adjacency, and can merge the 2 vertices

So then, halving is reduced to finding max weight Ham cycle in complete directed graph, where all the edges out of a vertex have weight 0 except 2 of them, which have weight 1 • Alternatively, want to find minimum path cover in a directed 2-regular graph. • If you have a minimum path cover, you can link the paths with edges of weight 0 to get a max weight ham cycle.

Guided halving • Same situation as before, but now we know an ancestor of the current organism, call it B. Then, we wish to minimize bp(A+A,G) + d(A,B) • Note that B only contains 1 copy of each gene, same with A

Then, have similar problem to before – given complete graph where edges are of weight 0,1,2, find maximum weight ham cycle • This is the exact same problem that is needed to solve in the breakpoint median problem. • Likely means guided halving is NP-complete.

Genome Halving – work in progress