
Genome Halving – work in progress

Genome Halving – work in progress. Fulton Wang, 8-17-2011, ACGT Group Meeting. Motivation: chromosomes have undergone whole-genome duplication in the past, going from a “pre-duplicated genome” to a “perfectly duplicated genome”


Presentation Transcript


  1. Genome Halving – work in progress Fulton Wang 8-17-2011 ACGT Group Meeting

  2. Motivation: chromosomes have undergone whole-genome duplication in the past, going from a “pre-duplicated genome” to a “perfectly duplicated genome” • “Pre-duplicated genome” A: a single circular chr with exactly 1 copy of each gene • “Perfectly duplicated genome” A+A: a genome that has 2 copies of every adjacency in A • A+A could be either 1 or 2 chrs, i.e. A-A or A A, depending on how you label the adjacencies in A+A • E.g.: 1234 -> 12341234 or 1234 1234
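A minimal Python sketch of the A+A definition, assuming genes are single characters, all positively oriented, and every chromosome is circular. The helper name `adjacency_counts` is illustrative and not from the talk. It checks that the one-chromosome form A-A and the two-chromosome form A A carry the same adjacencies, each adjacency of A appearing twice.

```python
from collections import Counter

def adjacency_counts(chromosomes):
    """Count directed adjacencies (x, y) over a list of circular chromosomes."""
    counts = Counter()
    for chrom in chromosomes:
        n = len(chrom)
        for i in range(n):
            # The modulo adds the wrap-around adjacency of a circular chr.
            counts[(chrom[i], chrom[(i + 1) % n])] += 1
    return counts

A = "1234"
pre_dup = adjacency_counts([A])
one_chr = adjacency_counts([A + A])   # A-A: 12341234
two_chr = adjacency_counts([A, A])    # A A: 1234 1234

print(one_chr == two_chr)
print(all(one_chr[adj] == 2 * c for adj, c in pre_dup.items()))
```

Because both forms give the same counts, anything that depends only on adjacency counts cannot tell A-A from A A.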

  3. After the duplication event, rearrangements occur that alter the adjacencies in the perfectly duplicated genome to give us the present-day genome G • E.g.: 12341234 -> 12314234 • We will assume that G is a single circular chr • Given G, we want to infer A

  4. Natural solution: want to find the pre-duplicated chr A such that dist(A+A, G) is minimized • Simple distance measure: breakpoint distance: • N - # shared adjacencies between 2 chrs • E.g.: 1234 and 1243 have a breakpoint distance of 1 • But A+A and G each have 2 copies of every gene
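A rough Python sketch of breakpoint distance for two single-copy chromosomes. The conventions here are assumptions, not taken from the slides: chromosomes are circular, genes are positively oriented (so adjacencies are ordered pairs), and N is the number of genes. Counts depend on these choices, so values computed this way need not match a hand count made under a different convention (for example undirected adjacencies or linear chromosomes).

```python
def adjacencies(chrom):
    """Set of directed adjacencies of a circular chromosome."""
    n = len(chrom)
    return {(chrom[i], chrom[(i + 1) % n]) for i in range(n)}

def breakpoint_distance(c1, c2):
    """N minus the number of shared adjacencies, with N = number of genes."""
    if sorted(c1) != sorted(c2):
        raise ValueError("chromosomes must contain the same genes")
    return len(c1) - len(adjacencies(c1) & adjacencies(c2))

# A rotation of a circular chromosome is the same chromosome.
print(breakpoint_distance("1234", "3412"))
# Swapping two neighbouring genes breaks several directed adjacencies.
print(breakpoint_distance("1234", "1324"))
```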

  5. What is the breakpoint distance between 1212 and 1122? Not well defined. • Need to match up the genes in the 2 chrs to each other, i.e. label every gene copy (write x_k for copy k of gene x) • 1_1 2_1 1_2 2_2 and 1_1 1_2 2_1 2_2 give 0 shared adjacencies • 1_1 2_1 1_2 2_2 and 1_1 1_2 2_2 2_1 give 1 shared adjacency
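The two labelled comparisons on this slide can be checked with a short Python sketch. It assumes each two-digit group is a (gene, copy) pair and that adjacencies are directed. Reading each genome as a linear sequence reproduces the counts quoted above. The `circular` flag is my addition: it adds the wrap-around adjacency and can change the counts.

```python
def shared_adjacencies(g1, g2, circular=False):
    """Count directed adjacencies common to two labelled genomes."""
    def adj(genome):
        pairs = list(zip(genome, genome[1:]))
        if circular:
            pairs.append((genome[-1], genome[0]))  # wrap-around adjacency
        return set(pairs)
    return len(adj(g1) & adj(g2))

# (gene, copy) pairs: 1212 with one labelling, 1122 with two labellings.
g_1212 = [(1, 1), (2, 1), (1, 2), (2, 2)]
g_1122_a = [(1, 1), (1, 2), (2, 1), (2, 2)]
g_1122_b = [(1, 1), (1, 2), (2, 2), (2, 1)]

print(shared_adjacencies(g_1212, g_1122_a))
print(shared_adjacencies(g_1212, g_1122_b))
print(shared_adjacencies(g_1212, g_1122_a, circular=True))
```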

  6. If G1, G2 are duplicated genomes (exactly 2 copies of every gene), then dist(G1, G2) is the minimum of bp(G1, G2) over all possible labellings of G1, G2 • Halving problem: given a genome G with exactly 2 copies of each gene, we want to find the minimum over all pre-duplicated genomes A of dist(A+A, G). Assume A and G are both single circular chrs
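A brute-force Python sketch of dist(G1, G2) as a minimum over labellings. The assumptions are mine: both genomes are single circular chromosomes given as strings, adjacencies are directed, and N is the number of gene copies. Only G2's labellings are enumerated, because swapping the two copy labels of a gene in both genomes at once leaves the shared count unchanged. The running time is exponential in the number of genes, so this is for tiny examples only.

```python
from itertools import product

def labelings(genome):
    """Yield every assignment of copy labels 1 and 2 to the two copies of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    genes = sorted(positions)
    for flips in product((False, True), repeat=len(genes)):
        labeled = [None] * len(genome)
        for gene, flip in zip(genes, flips):
            first, second = positions[gene]  # exactly 2 copies assumed
            if flip:
                first, second = second, first
            labeled[first] = (gene, 1)
            labeled[second] = (gene, 2)
        yield labeled

def circular_adjacencies(labeled):
    n = len(labeled)
    return {(labeled[i], labeled[(i + 1) % n]) for i in range(n)}

def dist(g1, g2):
    """Minimum over labellings of N minus shared adjacencies."""
    fixed = circular_adjacencies(next(labelings(g1)))
    best_shared = max(len(fixed & circular_adjacencies(lab))
                      for lab in labelings(g2))
    return len(g1) - best_shared

print(dist("1212", "1122"))
print(dist("1212", "2121"))
```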

  7. The restriction is that the pre-duplicated chr A be a single circular chr • If you didn’t care about the # of pieces in A, then there is a simple algorithm to solve the halving problem: it involves just a maximum matching where the vertices are the gene ends and the edge between each pair of gene ends is weighted by how many times that adjacency appears in G
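A hedged Python sketch of the unrestricted version, for the special case where every gene is positively oriented. In that case matching gene ends amounts to choosing one successor family for each family, which is a permutation. This brute force over permutations is a stand-in for a real matching algorithm and only makes sense for a handful of genes. The function names are illustrative.

```python
from collections import Counter
from itertools import permutations

def family_weights(G):
    """Count how often family x is immediately followed by family y in circular G."""
    n = len(G)
    return Counter((G[i], G[(i + 1) % n]) for i in range(n))

def best_successor_map(G):
    """Max-weight choice of one successor per family.

    The result may split into several cycles, i.e. A may have several pieces.
    """
    w = family_weights(G)
    genes = sorted(set(G))
    best_succ, best_weight = None, -1
    for succ in permutations(genes):
        weight = sum(w[(g, s)] for g, s in zip(genes, succ))
        if weight > best_weight:
            best_succ, best_weight = dict(zip(genes, succ)), weight
    return best_succ, best_weight

succ, weight = best_successor_map("12312434")
print(succ, weight)
```

Nothing here forces the successor map to be a single cycle. That is exactly the restriction that makes the single-chromosome problem harder.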

  8. Restrictions on A make the problem harder • Right now, I am considering the simpler case where all the genes are restricted to be positively oriented, i.e. no backwards-facing genes

  9. Graph representation of problem • A chr can be represented as a directed graph, where each vertex is a specific copy of a gene (write x_k for copy k of gene x) • 1_1 2_1 1_2 2_2 becomes 1_1 -> 2_1 -> 1_2 -> 2_2 • Genome depicted is 1_1 2_1 3_1 1_2 2_2 4_2 3_2 4_1 • Group members of the same gene family together. View each family as 1 “big” vertex, so that the graph has 4 vertices

  10. Picking A corresponds to picking a tour of the “big” vertices. • Suppose A = 1234. Before labeling the adjacencies in A, you don’t know if A+A is A-A or A A • Can you label the adjacencies in A+A such that the number of adjacencies shared by the labeled A+A and G equals the number they share when the genes are unlabeled?

  11. Yes. Note that this is only true because we don’t care if A+A is 1 or 2 chrs • This means that if you fix A, then the number of adjacencies shared by A+A and G is equal to the number of “small” edges parallel to A, and this count determines bp(A+A, G) • This means we can work with an alternate graph, with n vertices, each representing a gene family, where the weight of an edge is the number of adjacencies between the 2 gene families in G
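The alternate graph can be sketched in Python as a table of edge weights. This assumes G is a circular string of gene families, all positively oriented. The helper names are illustrative. The example genome is the gene order 1 2 3 1 2 4 3 4 with copy labels dropped.

```python
from collections import Counter

def family_weights(G):
    """Weight of edge (x, y) = number of times family x is followed by y in circular G."""
    n = len(G)
    return Counter((G[i], G[(i + 1) % n]) for i in range(n))

def parallel_small_edges(A, G):
    """Number of 'small' edges of G that run parallel to the tour A."""
    w = family_weights(G)
    n = len(A)
    return sum(w[(A[i], A[(i + 1) % n])] for i in range(n))

G = "12312434"
w = family_weights(G)
print(w[("1", "2")])                    # both copies of 1 are followed by a 2
print(parallel_small_edges("1234", G))
print(parallel_small_edges("1324", G))
```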

  12. Now, genome halving is reduced to finding a max weight directed Ham cycle in the complete graph, with edge weights of 0, 1, 2 • min_A dist(A+A, G) = min_A min_labellings bp(A+A, G) • Can simplify the problem: if A-A / A A is at the minimum possible distance from G, and some adjacency x->y appears twice in G, then A must contain that adjacency, and the 2 vertices x and y can be merged
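For very small inputs the reduction can be tried by brute force. This Python sketch makes several assumptions of mine: G is a circular string with exactly 2 copies of each gene, all genes are positively oriented, and the distance is taken as (number of gene copies in G) minus the tour weight. It fixes the first family to avoid counting rotations and tries every order of the remaining families.

```python
from collections import Counter
from itertools import permutations

def halve_brute_force(G):
    """Return (tour A, tour weight, distance) for the best Ham cycle found."""
    n = len(G)
    w = Counter((G[i], G[(i + 1) % n]) for i in range(n))
    genes = sorted(set(G))
    first, rest = genes[0], genes[1:]
    best_tour, best_weight = None, -1
    for perm in permutations(rest):
        tour = (first,) + perm
        k = len(tour)
        weight = sum(w[(tour[i], tour[(i + 1) % k])] for i in range(k))
        if weight > best_weight:  # ties keep the first tour found
            best_tour, best_weight = tour, weight
    return "".join(best_tour), best_weight, n - best_weight

# G from the earlier rearrangement example: 12341234 -> 12314234
print(halve_brute_force("12314234"))
```

The loop visits (n-1)! tours, so this only illustrates the objective. It is not a practical algorithm.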

  13. So then, halving is reduced to finding a max weight Ham cycle in a complete directed graph, where all the edges out of a vertex have weight 0 except 2 of them, which have weight 1 • Alternatively, want to find a minimum path cover in a directed 2-regular graph • If you have a minimum path cover, you can link the paths with edges of weight 0 to get a max weight Ham cycle.
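The linking step can be illustrated with a small Python sketch. The edge set and path cover below are hypothetical toy data, not a 2-regular graph from the talk, and finding a minimum path cover is not attempted here. The sketch only shows that joining k vertex-disjoint paths over n vertices gives a Ham cycle containing at least n - k weight-1 edges.

```python
def link_paths(paths):
    """Concatenate vertex-disjoint paths into one circular tour."""
    return [v for path in paths for v in path]

def tour_weight(tour, weight_one_edges):
    """Number of tour adjacencies (including wrap-around) that are weight-1 edges."""
    n = len(tour)
    return sum((tour[i], tour[(i + 1) % n]) in weight_one_edges
               for i in range(n))

# Hypothetical weight-1 edges and a path cover with k = 2 paths on n = 5 vertices.
edges = {(1, 2), (2, 3), (4, 5)}
paths = [[1, 2, 3], [4, 5]]

tour = link_paths(paths)
print(tour, tour_weight(tour, edges))
```

Fewer paths in the cover means more weight-1 edges in the tour, which is why a minimum path cover corresponds to a max weight Ham cycle.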

  14. Guided halving • Same situation as before, but now we know an ancestor of the current organism, call it B. Then, we wish to find A minimizing bp(A+A, G) + d(A, B) • Note that B only contains 1 copy of each gene, same as A
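A brute-force Python sketch of the guided objective. The assumptions are mine: d(A, B) is taken to be the same directed, circular breakpoint distance as bp, so minimizing bp(A+A, G) + d(A, B) becomes maximizing the sum of the adjacency counts from G and from B along the tour. The function names and the choice of B are illustrative.

```python
from collections import Counter
from itertools import permutations

def circular_counts(chrom):
    """Count directed adjacencies of a circular chromosome."""
    n = len(chrom)
    return Counter((chrom[i], chrom[(i + 1) % n]) for i in range(n))

def guided_halving_brute_force(G, B):
    """Return (tour A, combined weight, bp(A+A, G) + d(A, B))."""
    w = circular_counts(G) + circular_counts(B)  # combined edge weights
    genes = sorted(set(G))
    first, rest = genes[0], genes[1:]
    best_tour, best_weight = None, -1
    for perm in permutations(rest):
        tour = (first,) + perm
        k = len(tour)
        weight = sum(w[(tour[i], tour[(i + 1) % k])] for i in range(k))
        if weight > best_weight:  # ties keep the first tour found
            best_tour, best_weight = tour, weight
    cost = len(G) + len(B) - best_weight
    return "".join(best_tour), best_weight, cost

print(guided_halving_brute_force("12314234", "1243"))
```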

  15. Then, we have a similar problem to before: given a complete graph where edges are of weight 0, 1, 2, find a maximum weight Ham cycle • This is exactly the problem that needs to be solved in the breakpoint median problem. • This likely means guided halving is NP-complete.
