EFFICIENT ALGORITHMS FOR MULTICHROMOSOMAL GENOME REARRANGEMENTS

Glenn Tesler, Journal of Computer and System Sciences 65 (2002). Presented by Liora Levy, Seminar in Bioinformatics, Technion, Spring 2005

AGENDA • 1. MOTIVATION • 2. THE AUTHOR • 3. THE PROBLEM • 4. THE ALGORITHM OF G. TESLER • 5. SOFTWARE TOOL: GRIMM

MOTIVATION • Compute the distance between two multichromosomal genomes (defined later). WHY? • Scientists use this distance to build phylogenetic trees of species.

Until the 19th century, people believed that all species, humans in particular, had been created by God as they are today. Then, in 1859, Darwin developed in “The Origin of Species” the idea that all species evolved from a common ancestor. Charles Darwin, British naturalist, 1809-1882: "I have called this principle, by which each slight variation, if useful, is preserved, by the term Natural Selection."

Which animal was believed to be the closest (to have a common ancestor) to the human…? Maybe you need help… Well, if you do believe that, you may be surprised…

The work of Guillaume Bourque (Montreal), Pavel Pevzner and Glenn Tesler (San Diego, CA) makes it possible to reconstruct the genetic profile of the ancestor of mammals… It is a rodent with fur that lived 90 million years ago. They even established that humans and rats share about 90% of their genes (Genome Research, April 2004). So why, or how, are we so different? Even if most of the genes are the same, their order within each chromosome, and in the genome in general, is very important.

THE AUTHOR(S) Glenn Tesler, Assistant Professor, Department of Mathematics, University of California, San Diego. The article builds on an earlier article by S. Hannenhalli and P. Pevzner, “Transforming men into mice (polynomial algorithm for genomic distance problem)”, 1995. Sridhar Hannenhalli, Genetics Department, University of Pennsylvania. Left: Pavel Pevzner, CS Department, University of California, San Diego. Right: Glenn Tesler

THE PROBLEM Distance between two multichromosomal genomes: the minimum number of reversals, translocations, fissions and fusions required to transform one genome into the other. We have already seen algorithms for the unichromosomal problem, so why do we need the multichromosomal case? Simply because mammals have multichromosomal genomes, so the unichromosomal solution has to be adapted to the real biological setting.

Human karyotype: 22 pairs of autosomes + 2 sex chromosomes.

OLD ALGORITHM vs. NEW ALGORITHM Hannenhalli and Pevzner already gave a polynomial algorithm, “genomic_sort”, for computing that distance. Glenn Tesler added some details to fix problems in their construction. He also improved the speed of the algorithm by combining it with the algorithm of Bader, Moret and Yan, which computes the reversal distance between permutations in linear time.

IMPORTANT MAIN IDEA OF THE ALGORITHMS The main idea for computing the rearrangement distance between two multichromosomal genomes Π and Γ is to concatenate their chromosomes into two permutations π and γ. The point of these concatenated genomes is that every rearrangement in the multichromosomal genome Γ can be mimicked by a reversal in the permutation γ. In an optimal concatenate, sorting γ with respect to π corresponds exactly to sorting Γ with respect to Π. Tesler also showed that when no optimal concatenate exists, a near-optimal concatenate exists: sorting it mimics sorting the multichromosomal genomes and uses a single extra reversal, which corresponds to a reordering of the chromosomes.

THE ALGORITHM OF G. TESLER I - Improvements made to the old algorithm • 1. There is a gap in Hannenhalli and Pevzner's reduction of the multichromosomal problem to the unichromosomal problem of "sorting by reversals" (for which efficient algorithms to generate scenarios are known). It is sometimes necessary to reorder and flip certain chromosomes of both multichromosomal genomes to form the permutations used in the unichromosomal problem, but they do not reorder either one. • We will close the gap and prove the following improvement to their algorithm:

Theorem 1. Let d = d(Π,Γ) denote the distance between two multichromosomal genomes Π and Γ. There is a constructive algorithm to produce two permutations π*, γ* whose reversal distance is drev(π*,γ*) = d or d+1, such that optimal reversal scenarios between these permutations directly mimic optimal rearrangement scenarios between the genomes Π and Γ. All of this takes polynomial time. When drev(π*,γ*) = d+1, one reversal step mimics flipping a block of consecutive whole chromosomes, which does not count as an operation in a multichromosomal rearrangement scenario; there are examples where such a step is required. 2. Although the distance is symmetric (d(Π,Γ) = d(Γ,Π)), when the genomes have different numbers of chromosomes their algorithm requires that it be computed as d(Π,Γ) where Π has fewer chromosomes than Γ.

3. We combined this algorithm with the linear-time algorithm of Bader, Moret and Yan for computing reversal distance in unichromosomal genomes. • This reduces computation times: • Time to compute the distance: O(n) • Time to compute a rearrangement scenario: O(n²) • (where n is the total number of "markers" in the reduction: the number of genes plus twice the number of chromosomes in the genome with more chromosomes). 4. We provide a heuristic for selecting good reversals based on breakpoints. The heuristic is not theoretically optimal for producing pairwise rearrangement scenarios, but it is fast in practice and generalizes to phylogenetic trees involving more than two genomes. It is used by MGR, a program for constructing phylogenetic trees.

II - Some definitions and notations Genes, chromosomes, genomes • We represent genes by the numbers 1,…,Ng. The orientation (strand) of each gene is indicated by a ± sign. • A chromosome is a sequence of signed numbers π = <π1,…,πk>, and the flip of a chromosome is −π = <−πk,…,−π1>. • In studies of rearrangements on unichromosomal genomes, several types of chromosomes have been considered, but only the undirected linear type is biologically relevant for multichromosomal genomes: π and −π are regarded as equivalent. • A genome is a set Π = {π(1),…,π(Nc)} of Nc chromosomes. • Chromosome i: π(i) = <π(i)1,…,π(i)ni> • Each gene j = 1,…,Ng occurs exactly once in the genome (as +j or −j).
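The definitions above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the paper: chromosomes are stored as tuples of signed integers, and the helper names (flip, canonical, is_valid_genome, same_genome) are hypothetical.

```python
from typing import List, Tuple

Chromosome = Tuple[int, ...]   # signed gene numbers, e.g. (1, -3, 2)
Genome = List[Chromosome]      # a genome is a collection of chromosomes


def flip(chrom: Chromosome) -> Chromosome:
    """Flip of <p1,...,pk>: reverse the order and negate every sign."""
    return tuple(-g for g in reversed(chrom))


def canonical(chrom: Chromosome) -> Chromosome:
    """Pick one representative of {chrom, flip(chrom)}, since undirected
    linear chromosomes are equivalent to their flips."""
    return min(chrom, flip(chrom))


def is_valid_genome(genome: Genome, num_genes: int) -> bool:
    """Check that each gene 1..num_genes occurs exactly once (as +j or -j)."""
    seen = sorted(abs(g) for chrom in genome for g in chrom)
    return seen == list(range(1, num_genes + 1))


def same_genome(a: Genome, b: Genome) -> bool:
    """Two genomes are equal if they contain the same chromosomes,
    ignoring chromosome order and flips."""
    return sorted(canonical(c) for c in a) == sorted(canonical(c) for c in b)


pi = [(1, -3, 2), (4, 5)]
print(flip(pi[0]))                               # (-2, 3, -1)
print(is_valid_genome(pi, 5))                    # True
print(same_genome(pi, [(-5, -4), (-2, 3, -1)]))  # True
```

Choosing the lexicographically smaller of a chromosome and its flip is just one convenient way to compare undirected chromosomes; any fixed tie-breaking rule would do.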

Caps: Ck = Ng+k for k = 1,2,…,2Nc. • Capping of a chromosome: π(i) = <π(i)0, π(i)1,…,π(i)ni, π(i)ni+1>, where π(i)0 is the lcap and π(i)ni+1 is the rcap. • Capping of a genome: every chromosome is capped. There are (2Nc)! possible cappings. • A concatenate of a capped genome is a signed permutation of 1,2,…,n formed by choosing one of the Nc! orderings and one of the 2^Nc flippings of the chromosomes, and concatenating them together.
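As a rough sketch of capping and concatenation, the Python below caps each chromosome and enumerates the concatenates of one fixed capping. Several choices here are assumptions made for illustration: caps are numbered Ng+1,…,Ng+2Nc and handed out two per chromosome in order, lcaps are written with a positive sign and rcaps with a negative sign, and the function names cap_genome and concatenates are hypothetical.

```python
from itertools import permutations, product


def flip(chrom):
    """Reverse the order of the elements and negate every sign."""
    return tuple(-g for g in reversed(chrom))


def cap_genome(genome, num_genes):
    """Add an lcap and an rcap to every chromosome.

    Assumption: chromosome i (0-based) gets caps Ng+2i+1 and Ng+2i+2,
    with the lcap positive and the rcap negative. This is only one of
    the (2*Nc)! possible cappings.
    """
    capped = []
    for i, chrom in enumerate(genome):
        lcap = num_genes + 2 * i + 1
        rcap = num_genes + 2 * i + 2
        capped.append((lcap,) + tuple(chrom) + (-rcap,))
    return capped


def concatenates(capped):
    """Yield every concatenate of a capped genome: one per choice of
    chromosome ordering (Nc!) and per choice of flips (2**Nc)."""
    for order in permutations(capped):
        for flips in product((False, True), repeat=len(order)):
            perm = ()
            for chrom, flipped in zip(order, flips):
                perm += flip(chrom) if flipped else chrom
            yield perm


genome = [(1, 2), (3,)]              # Ng = 3 genes, Nc = 2 chromosomes
capped = cap_genome(genome, 3)
print(capped)                        # [(4, 1, 2, -5), (6, 3, -7)]
print(len(list(concatenates(capped))))  # 8 = 2! * 2**2
```

Enumerating all concatenates is exponential and is shown only to make the counting concrete; the algorithm in the paper constructs a suitable concatenate directly rather than searching.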

Mimicking multichromosomal rearrangement operations by reversals on a single permutation. The reversal ρ(i, j) on a signed permutation π = <π1,…,πk> (where 1 ≤ i ≤ j ≤ k) is <π1,…,πi−1, −πj,…,−πi, πj+1,…,πk>. In block notation: • Reversal: π = <A,B,C> → <A,−B,C> • Translocation: π = <A,B> and σ = <C,D> → <A,D> and <C,B> • Fusion: π = <A,B> and σ = <C,D> → <A,B,C,D> • Fission: π = <A,B> and σ = <Ø,Ø> → <A,Ø> and <Ø,B>
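To make the mimicking idea concrete, here is a small Python sketch of the four operations and of one translocation reproduced by a single reversal on a concatenate. Caps are left out for brevity, the cut positions are passed as plain indices, and all function names are hypothetical rather than taken from the paper or from GRIMM.

```python
def reversal(perm, i, j):
    """rho(i, j) with 1-based inclusive indices: reverse the segment
    perm[i..j] and negate its signs."""
    segment = tuple(-x for x in reversed(perm[i - 1:j]))
    return perm[:i - 1] + segment + perm[j:]


def flip(chrom):
    """Flipping a whole chromosome is a reversal of all of it."""
    return reversal(chrom, 1, len(chrom)) if chrom else chrom


def translocation(pi, sigma, cut_pi, cut_sigma):
    """<A,B>, <C,D> -> <A,D>, <C,B>, where A = pi[:cut_pi], C = sigma[:cut_sigma]."""
    a, b = pi[:cut_pi], pi[cut_pi:]
    c, d = sigma[:cut_sigma], sigma[cut_sigma:]
    return a + d, c + b


def fusion(pi, sigma):
    """<A,B>, <C,D> -> <A,B,C,D>."""
    return pi + sigma


def fission(pi, cut):
    """<A,B> -> <A>, <B>."""
    return pi[:cut], pi[cut:]


pi, sigma = (1, 2, 3), (4, 5, 6)     # A=(1,), B=(2,3), C=(4,5), D=(6,)
new_pi, new_sigma = translocation(pi, sigma, 1, 2)
print(new_pi, new_sigma)             # (1, 6) (4, 5, 2, 3)

# Same result from one reversal on the concatenate <A, B, -D, -C>:
concat = pi + flip(sigma)            # (1, 2, 3, -6, -5, -4)
after = reversal(concat, 2, 4)       # reverse the block <B, -D>
print(after)                         # (1, 6, -3, -2, -5, -4)
print(after[:2], flip(after[2:]))    # (1, 6) (4, 5, 2, 3)
```

The second chromosome has to be flipped before concatenating so that the blocks B and D become adjacent; this is why the choice of flips in a concatenate matters.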

Number of steps in a scenario: d(Π,Γ) + number of block flips + number of cap exchanges. • Block flips: at most 1 for optimal concatenates • Cap exchanges: not necessary for optimal cappings

Convention for the signs of lcaps and rcaps: Breakpoint graph

Hurdles and relatives Interleaving graph

The distance can be calculated as: [distance formula] where b = number of black edges, c = number of cycles and paths, pΓΓ = number of ΓΓ-paths (the other parameters are from the Bader et al. algorithm)

III - The new algorithm

1. Joining and closing paths, simplified. Several steps of genomic_sort add an edge to the graph to join two paths into a larger path. The result is always a ΓΠ-path with an oriented or interchromosomal edge, and a subsequent iteration of the main loop of Hannenhalli and Pevzner's algorithm closes that path. We simplify this by adding two edges simultaneously, joining these paths into a cycle in a single loop iteration. The first such steps join a ΠΠ-path with a ΓΓ-path. The resulting paths never interact with any other path in the main loop, so we separate this out into its own loop (B5–B7). It is also rephrased to account for the new distinction between p and pΓΓ. The other path-joining steps (steps A8 and A13) join two Γ-paths. Hannenhalli and Pevzner proved that at least one of the two possible Γ-edges connecting them is oriented or interchromosomal, and they test the edges in order to add such an edge first. The other edge is guaranteed to be added in a later iteration. Since the order in which the edges are added does not affect the final output, we remove this test and add them both at once (steps B10 and B13).

2. Adaptation of the BMY algorithm. BMY (Bader, Moret, Yan): an algorithm to compute the connected components of the interleaving graph. They implemented it in the file invdist.c of GRAPPA. We modified it to account for paths (instead of just cycles), deleted tails, and bare edges. The resulting procedure form_components runs in time Θ(n). It identifies the components, then computes and stores certain structural information about them.

3. When Γ has fewer chromosomes than Π. The original construction of G(Π,Γ) assumes that Nc(Π) ≤ Nc(Γ) and pads Π with null chromosomes. That construction breaks down without the assumption: if Γ has fewer chromosomes and we pad it with nulls, then when we delete a gray edge corresponding to a null in Γ, the construction leaves unresolved how to classify the vertices of the edge into Π-caps and Γ-tails. We classify both vertices as Π-caps in this case. Changes were made to make the construction truly symmetric, regardless of which genome has more chromosomes.

4. From optimal cappings to optimal concatenates • The procedure genomic_sort produced a new capping of Γ to prove the distance formula. However, to compute the distance without building a proof certificate (i.e., a capping), it is only necessary to compute the rearrangement distance. • It is possible to extend that procedure to algorithmically produce an optimal rearrangement scenario between two genomes, but Hannenhalli and Pevzner do not actually give the connection between the capping and the scenario; our added step B19 does this. • Proper flipping • Proper bonding • Procedure form_optimal_concatenate runs in O(n·Nc)

5. Optimal scenarios • Mimicking a rearrangement scenario by a reversal scenario • Several algorithms produce optimal scenarios between a pair of permutations: • Hannenhalli and Pevzner: O(n⁵) and O(n⁴) • Berman and Hannenhalli: O(n²α(n)) • Kaplan, Shamir and Tarjan: O(n²) • These are easily adapted to produce a multichromosomal rearrangement scenario, but must obey the following restriction: • A reversal starts at an lcap if and only if it ends at an rcap.

6. Breakpoint heuristic for optimal scenarios and trees. Although the algorithms just named can quickly select good reversals for pairwise genomic rearrangement scenarios, selection of good reversals is NP-hard for even the simplest phylogenetic trees. We have integrated the algorithms in this paper into Guillaume Bourque's program MGR for constructing phylogenetic trees. Let G = {Π1,…,Πm} be a set of genomes, either multichromosomal, or unichromosomal with circular, directed linear, or undirected linear chromosomes. A phylogenetic tree T on G is a tree whose vertices are genomes on a common set of genes, and whose leaves are the genomes in G. A conserved adjacency (x,y) of G is a pair of genes such that every genome in G contains either (x,y) or (−y,−x) consecutively. Let A(Π1,…,Πm) denote the set of all conserved adjacencies.

A conserved strip (x1,…,xk) is a sequence of genes such that every genome contains either it or (−xk,…,−x1) consecutively. It comprises k−1 conserved adjacencies. • Theorem 8 • (a) Between any two genomes (Π,Γ), there is an optimal reversal or rearrangement scenario in which the pairs in A(Π,Γ) are adjacent at every step. • (b) For a set of genomes G = {Π1,…,Πm}, there is an optimal phylogenetic tree in which the pairs in A(Π1,…,Πm) are adjacencies in every node, and an optimal rearrangement scenario of form (a) exists on each edge.
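The set of conserved adjacencies defined above is easy to compute directly. The Python sketch below is an illustration only: it works on uncapped chromosomes stored as tuples of signed integers, and the function names adjacencies and conserved_adjacencies are hypothetical, not MGR's.

```python
def adjacencies(genome):
    """All consecutive pairs of one genome, stored in both readings:
    (x, y) and its flipped form (-y, -x)."""
    adj = set()
    for chrom in genome:
        for x, y in zip(chrom, chrom[1:]):
            adj.add((x, y))
            adj.add((-y, -x))
    return adj


def conserved_adjacencies(genomes):
    """Pairs (x, y) such that every genome contains (x, y) or (-y, -x)
    consecutively. Both readings of each pair are returned."""
    return set.intersection(*(adjacencies(g) for g in genomes))


g1 = [(1, 2, 3), (4, 5)]
g2 = [(-3, -2, 1), (4, 5)]
print(sorted(conserved_adjacencies([g1, g2])))
# [(-5, -4), (-3, -2), (2, 3), (4, 5)]
```

Here the pairs (2, 3) and (4, 5) are conserved in both genomes, so by Theorem 8 there is an optimal scenario between g1 and g2 that never separates them, which is what lets a breakpoint heuristic discard any reversal that would cut such a pair.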

SOFTWARE TOOL GRIMM: Genome Rearrangements In Man and Mouse. This is a web server combining rearrangement algorithms for unichromosomal and multichromosomal genomes, with either signed or unsigned gene data. In each case, it computes the minimum possible number of rearrangement steps and determines a possible scenario taking this number of steps. It is integrated into a related project, MGR, for constructing optimal phylogenetic trees from multiple genomes.

Input: Two genomes

Result : A scenario

Input: Three genomes

Output: 1. A distance matrix

2. A common ancestor

3. A phylogenetic tree

New issues for the future: Recent work by Robert Pruitt and Susan Lolle (Purdue University, Indiana, USA) on the plant Arabidopsis thaliana (in particular the HotHead mutant) suggested that genetic material may also be transmitted by RNA and not only by DNA. This contradicts Mendel's theory (1865) and implies that children can have genes that their parents lack (but that their grandparents had)… Genetic studies may take a new direction after this discovery. Source: Science & Vie, May 2005; originally published in Nature, April 2005. Subjects for a doctorate???

Challenge: Genome rearrangements and cancer. We stressed that genome rearrangements are used to study the evolution of a group of organisms. Because chromosomal mutations accumulate rapidly in cancer cells, it is also possible to study a cancer genome much as if it were a new organism that had recently diverged from the normal human genome. This matters because, although cancer progression is frequently associated with genome rearrangements, the mechanisms behind these rearrangements are still poorly understood. Source: Guillaume Bourque, Genome Institute of Singapore.

THANK YOU . . .
