Jaroslaw Byrka 1,2 , Steven Kelk 2 , Katharina T. Hüber 3

Worst-case optimal approximation algorithms for maximizing triplet consistency within phylogenetic networks Jaroslaw Byrka1,2, Steven Kelk2, Katharina T. Hüber3 (1) Technische Universiteit Eindhoven (TU/e) (2) Centrum voor Wiskunde en Informatica (CWI), Amsterdam (3) University of East Anglia (UEA), England Email: S.M.Kelk@cwi.nl Web: http://homepages.cwi.nl/~kelk

Phylogenetic tree reconstruction Phylogenetic tree reconstruction is essentially the science of efficiently inferring and constructing plausible evolutionary trees when we only have limited input data about the ‘species’ concerned… At the intersection of biology, bioinformatics, computer science and mathematics. [Figure: example evolutionary tree on Orangutan, Gorilla, Chimpanzee and Human; tree borrowed from a presentation by Tandy Warnow]

Dominant methods in phylogenetic reconstruction • Character-based methods • Maximum Parsimony (= Minimum Steiner Tree) • Maximum Likelihood • Bayesian methods (Markov Chain Monte Carlo - MCMC) • Distance-based methods • Neighbour Joining • UPGMA • Triplet-based methods

Triplet-based methods (1) • Triplet-based methods are used for constructing rooted evolutionary trees: there is a root (a hypothetical most-distant ancestor) and edges are directed, explicitly denoting the direction of evolution. • The central idea: build a single, ‘big’ evolutionary tree for a set S of species by combining smaller evolutionary trees on subsets of S such that the big tree respects the structure of the smaller trees. • In triplet-based methods, the small input trees are always defined on size-3 subsets of the species set S (and are called rooted triplets.)

Triplet-based methods (2) • For example, suppose I want to reconstruct a plausible evolution for the species set {w,x,y,z}. • I am given a set of rooted triplets zw|x, yx|w, xy|z, wz|y. (Note zw|x = wz|x.) [Figure: the input triplets, an “algorithm” step, and the resulting “solution” tree on the leaves w, z, x, y]
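A minimal Python sketch of what "consistent with a rooted triplet" means for a tree. The nested-tuple tree encoding and the helper names (leaf_paths, common_prefix_len, is_consistent) are illustrative assumptions, not part of the original slides. The tree ((w,z),(x,y)) is assumed to be the solution tree from the example, based on its leaf order. The triplet xy|z is taken to be consistent when the lowest common ancestor of x and y lies strictly below that of x and z.

```python
def leaf_paths(tree, prefix=()):
    """Map each leaf label to its path of child indices from the root."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def common_prefix_len(p, q):
    """Depth of the lowest common ancestor of two leaves, given their paths."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth


def is_consistent(tree, triplet):
    """True if the rooted triplet xy|z, given as (x, y, z), is consistent with the tree."""
    paths = leaf_paths(tree)
    x, y, z = (paths[s] for s in triplet)
    # lca(x, y) must be a proper descendant of lca(x, z)
    return common_prefix_len(x, y) > common_prefix_len(x, z)


# Assumed encoding of the example: cherries (w, z) and (x, y) under the root.
tree = (("w", "z"), ("x", "y"))
triplets = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(all(is_consistent(tree, t) for t in triplets))  # True
```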

When trees fail • The algorithm of Aho et al. (1981) can be used to construct a tree that is consistent with all the input rooted triplets, if one exists… • But…what if the algorithm fails? Why might the algorithm fail? • Possible reason 1: The underlying evolution is tree-like, but the input triplets contain errors. • Possible reason 2: The triplets are correct, but the underlying evolution is not tree-like. Biological phenomena such as hybridization, horizontal gene transfer, recombination and gene duplication can lead to evolutionary scenarios that are not tree-like. • Responses: • try constructing a phylogenetic tree that maximises the number of input triplets it is consistent with, and/or • try and construct not phylogenetic trees, but phylogenetic networks
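A hedged sketch of the recursive idea behind the algorithm of Aho et al., as it is commonly described: join x and y whenever a triplet xy|z lies inside the current leaf set, split the leaves into connected components, and recurse; if only one component remains, no consistent tree exists. The function name build, the tuple encoding of triplets and trees, and the use of union-find are assumptions made for illustration. Leaf labels are assumed to be mutually comparable so the output order is deterministic.

```python
def build(leaves, triplets):
    """Return a nested-tuple tree consistent with all triplets, or None if none exists.

    Each triplet xy|z is given as the tuple (x, y, z).
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Union-find over the current leaf set.
    parent = {s: s for s in leaves}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Only triplets entirely inside the current leaf set matter here.
    relevant = [t for t in triplets if set(t) <= leaves]
    for x, y, _ in relevant:
        parent[find(x)] = find(y)

    groups = {}
    for s in sorted(leaves):
        groups.setdefault(find(s), set()).add(s)
    if len(groups) == 1:
        return None  # the leaves cannot be split: the algorithm fails

    subtrees = []
    for group in sorted(groups.values(), key=min):
        sub = build(group, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


print(build("wxyz", [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]))
# (('w', 'z'), ('x', 'y'))
print(build("xyz", [("x", "y", "z"), ("x", "z", "y")]))  # None
```

The second call uses the conflicting input {xy|z, xz|y}, for which no tree exists.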

Networks instead of trees • For example, suppose the input is {xy|z, xz|y}. [Figure: the two input triplets and a network on the leaves z, y, x]

Level-k phylogenetic networks A level-k phylogenetic network is a rooted, directed acyclic graph where every biconnected component (in the underlying undirected graph) contains at most k recombination vertices. This network here is a very simple example of a level-1 network. In a level-1 network, the ‘cycles’ are vertex-disjoint, hence the alternative name “galled tree”. [Figure: a level-1 network on the leaves z, y, x, annotated with: root (only one!), split-vertex, recombination-vertex, leaf-vertex (labelled with species)]

The complexity of “LEVEL-k” LEVEL-k Input: Set of rooted triplets T Output: A level-k network N consistent with all the triplets in T, or state that no such network exists. Complexity

What about maximization? • Gasieniec et al. (1999) showed how to find in polynomial time a tree that is consistent with at least 1/3 of the input triplets T. • Is it possible to always find a tree that is consistent with > 1/3 of the input triplets? • No. Let T1(n) be the full triplet set on n species. It contains 3·C(n,3) triplets: three for each of the C(n,3) choices of three species. • For example, T1(4) = {ab|c, ac|b, cb|a, ab|d, ad|b, bd|a, ac|d, dc|a, ad|c, bc|d, bd|c, dc|b}. • For a given three species, a tree is consistent with at most one triplet on those three species. So at most 1/3 of the triplets in T1(n) can be consistent with a tree. • So for trees, and comparing with the upper bound |T|, 1/3 is worst-case optimal.
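The counting argument can be checked on T1(4) with a short Python sketch. The function names and the nested-tuple tree encoding are illustrative assumptions. The sketch enumerates the full triplet set and counts how many of its triplets two different binary tree shapes are consistent with.

```python
from itertools import combinations
from math import comb


def full_triplet_set(species):
    """All rooted triplets on the species; (x, y, z) stands for xy|z."""
    triplets = []
    for a, b, c in combinations(species, 3):
        triplets += [(a, b, c), (a, c, b), (b, c, a)]
    return triplets


def leaf_paths(tree, prefix=()):
    """Map each leaf label to its path of child indices from the root."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def lca_depth(p, q):
    """Depth of the lowest common ancestor of two leaves, given their paths."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth


def count_consistent(tree, triplets):
    """Number of triplets xy|z for which lca(x, y) lies strictly below lca(x, z)."""
    paths = leaf_paths(tree)
    return sum(
        lca_depth(paths[x], paths[y]) > lca_depth(paths[x], paths[z])
        for x, y, z in triplets
    )


T = full_triplet_set("abcd")
caterpillar = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
print(len(T), 3 * comb(4, 3))            # 12 12
print(count_consistent(caterpillar, T))  # 4
print(count_consistent(balanced, T))     # 4
```

Both shapes hit exactly 4 of the 12 triplets, i.e. one per choice of three species.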

Formalising the question Assuming that we restrict the set of phylogenetic networks to some subclass, what is the maximum value of 0 ≤ p ≤ 1 such that for every input set T of rooted triplets, there exists some network N(T) from the subclass such that at least p|T| of the triplets are consistent with N(T)? • So for level-0 networks (trees), p=1/3. • This can be trivially converted to a 3-approximation algorithm for the problem MAX-LEVEL-0, where MAX-LEVEL-k is defined as “Given a set of triplets T, what is the maximum number of triplets from T that some level-k network can be consistent with?” • In general, an algorithm whose output is always consistent with at least a fraction q of the input triplets is a (1/q)-approximation for the MAX variant, because the optimum can never exceed |T|. (Better approximation factors for the MAX variant are probably possible, but none yet known!)
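The conversion from a guaranteed fraction q to a (1/q)-approximation can be written out in one line. The symbols ALG(T) (number of triplets the algorithm's network is consistent with) and OPT(T) (the optimum of the MAX variant) are notation introduced here for illustration; T is assumed non-empty.

```latex
\[
  \mathrm{ALG}(T) \;\ge\; q\,|T| \;\ge\; q\cdot\mathrm{OPT}(T)
  \quad\Longrightarrow\quad
  \frac{\mathrm{OPT}(T)}{\mathrm{ALG}(T)} \;\le\; \frac{1}{q}.
\]
```

The second inequality uses only the trivial upper bound OPT(T) ≤ |T|, which is why a tighter upper bound could in principle give a better factor.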

Determining the p-fraction for level-1 and higher • For level-1, Jansson, Nguyen and Sung (2005) showed how to find in polynomial time a level-1 network consistent with at least 5/12 ≈ 0.416… of the input triplets. So for level-1, p ≥ 5/12 ≈ 0.416… • They also showed, given the full triplet set T1(n) on n leaves, how to build an optimal level-1 network for those triplets i.e. no other level-1 network can be consistent with a higher fraction of T1(n). • By counting they show that such optimal level-1 networks are consistent with a fraction approaching (from above) ≈ 0.488… of the input triplets, showing that, for level-1, p ≤ 0.488… • Obvious questions: what is the true value of p for level-1? What about higher level networks? Are networks achieving the p-fraction always polynomial-time constructable? What is the role of the full triplet set in determining p? How about p as a function of n = the number of species?

Our result: p is defined by the full triplet set! Let N be a network that is consistent with a fraction p’ of the full triplet set T1(n). Then, for any arbitrary triplet input set T on n species, we can convert N in polynomial time into an isomorphic network N’(T) that is consistent with a fraction ≥ p’ of T. (The result also holds for weighted triplet sets.) • All tree shapes (not just caterpillars) can be consistent with 1/3 of input triplets, because every tree is consistent with 1/3 of T1(n). • We get a polynomial-time worst-case optimal algorithm for level-1 networks (for the |T| upper bound.) This means that we can always get at least 0.48… of the input triplets. With a customized derandomization we can do this in time O(|T|n²). • For level-2, we can in polynomial time always get at least 0.61 of the input. • Is this bad news for the biological relevance of triplet methods and/or the level-k hierarchy?

Method: labelling an unlabelled network • Suppose we know a network N that is consistent with a fraction p’ of the full triplet set T1(n). Let T be the input set of triplets, on n species. • Note that if the species on the leaves of N are arbitrarily permuted, the resulting network is still consistent with a fraction p’ of T1 – because all species in T1 are indistinguishable. • Hence, we can view N as an unlabelled network i.e. a network without species on the leaves. Only the shape of N is important. • We argue that we can label the leaves of N with species in such a way that the resulting network N’, which will be isomorphic to N, is consistent with a fraction ≥ p’ of T. • We use a probabilistic argument to argue the existence of such a labelling. • We then use the method of conditional expectation to derandomize this i.e. so that the labelling can be found in polynomial time.

Choosing the labelling u.a.r. is good enough • Suppose we know a network N that is consistent with a fraction p’ of the full triplet set T1(n). Let T be the input set of triplets, on n species. • If we choose a random labelling of the leaves of N (i.e. randomly assign the n species from T to the n leaves of N) to get a network N’, the expected fraction of T that N’ is consistent with, is p’.

Consider an arbitrary triplet xy|z from T. What is the probability that N’, created by randomly labelling N, is consistent with xy|z ? • It is the probability that species x,y,z get mapped to leaves t1, t2, t3 such that t1t2|t3 is consistent with N.

For each “leaf triplet” t1t2|t3 in N, there are 2(n-3)! labellings that map xy|z to that leaf triplet. • A labelling that maps xy|z to a leaf triplet, cannot map xy|z to another leaf triplet. • So the probability that the labelled network N’ is consistent with xy|z, is the probability that xy|z gets mapped to one of the leaf triplets in N. Hand-waving, the probability is thus:
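The formula itself did not survive in this transcript. The following is a reconstruction from the counts stated on the slide, offered as a sketch rather than the authors' exact expression. It assumes that N is consistent with p'·3·C(n,3) leaf triplets (a fraction p' of the full triplet set), that each leaf triplet is hit by 2(n-3)! labellings, and that there are n! labellings in total.

```latex
\[
  \Pr\bigl[\,N' \text{ is consistent with } xy|z\,\bigr]
  \;=\; \frac{p'\cdot 3\binom{n}{3}\cdot 2\,(n-3)!}{n!}
  \;=\; \frac{p'\; n(n-1)(n-2)\,(n-3)!}{n!}
  \;=\; p'.
\]
```

The middle step uses 3·C(n,3)·2 = n(n-1)(n-2).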

So each triplet t in T is consistent with N’ with probability p’, when N’ is made by randomly labelling N. • Summing over all triplets (linearity of expectation), we get that the expected fraction of T consistent with N’ is also p’. • We conclude that there exists some labelling of N that achieves a fraction ≥ p’. • This proves that, for a subclass of networks, the p-fraction is indeed defined by the full triplet set, and that any network obtaining the p-fraction for the full triplet set can be relabelled to obtain the p-fraction for an arbitrary input set T. • But how to find in polynomial time the correct labelling for a given input set T? • Derandomization by the method of conditional expectation.
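The existence argument can be checked exhaustively on a small case. The Python sketch below is an illustration under stated assumptions: the network is a binary tree shape on 5 leaves (so p' = 1/3), the input set T is made up for the example, and all helper names are hypothetical. It averages the consistent fraction over all 5! labellings.

```python
from fractions import Fraction
from itertools import permutations


def leaf_paths(tree, prefix=()):
    """Map each leaf to its path of child indices from the root."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def lca_depth(p, q):
    """Depth of the lowest common ancestor of two leaves, given their paths."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth


def fraction_consistent(paths, labelling, triplets):
    """Fraction of triplets xy|z consistent with the labelled tree."""
    hits = sum(
        lca_depth(paths[labelling[x]], paths[labelling[y]])
        > lca_depth(paths[labelling[x]], paths[labelling[z]])
        for x, y, z in triplets
    )
    return Fraction(hits, len(triplets))


shape = (((0, 1), 2), (3, 4))   # unlabelled binary tree; leaves are positions 0..4
species = "abcde"
# Made-up input set; (x, y, z) stands for xy|z. Note ab|c and ac|b conflict.
T = [("a", "b", "c"), ("a", "c", "b"), ("b", "d", "e"),
     ("c", "e", "a"), ("d", "e", "b"), ("a", "d", "e")]

paths = leaf_paths(shape)
values = [
    fraction_consistent(paths, dict(zip(species, perm)), T)
    for perm in permutations(range(len(species)))
]
average = sum(values) / len(values)
best = max(values)
print(average)          # 1/3
print(best >= average)  # True
```

Since the average is exactly p' = 1/3, at least one labelling must reach that fraction.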

Derandomizing: a sketch • An appropriate labelling can be found in time O(m⁴n³), where m is the number of vertices in the unlabelled network N. • We do this by labelling the leaves of N, one at a time. • General idea: At a given iteration of the algorithm, let F be the set of leaves of N which have already been labelled with species. • We then arbitrarily pick an unlabelled leaf t and add it to F, by labelling it. But how do we choose the species that labels it? • We choose the species that maximises the expected fraction of T that the finished labelled network N’ will be consistent with, assuming the labelling of the leaves in F ∪ {t} is fixed, and that the remaining leaves are labelled uniformly at random. • The main point to observe is how the probabilities can be computed in polynomial time.
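Why the greedy choice never loses ground can be stated as one inequality. The notation is introduced here for illustration: X is the number of triplets of T consistent with the finished network N', λ_F is the labelling fixed so far on F, and U is the set of species not yet used.

```latex
\[
  \mathbb{E}\bigl[X \mid \lambda_F\bigr]
  \;=\; \frac{1}{|U|}\sum_{s\in U}\mathbb{E}\bigl[X \mid \lambda_F,\; t\mapsto s\bigr]
  \;\le\; \max_{s\in U}\,\mathbb{E}\bigl[X \mid \lambda_F,\; t\mapsto s\bigr].
\]
```

Starting from the unconditioned expectation p'|T| and applying this at every step, the fully labelled network is consistent with at least p'|T| triplets.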

• We compute the probability for each triplet independently. • E.g. consider a triplet xy|z. Suppose x and y have already been assigned to leaves. • What is the probability that xy|z will be consistent with the finished network, given that the remaining leaves are labelled u.a.r.? • Simply try all possible ways of mapping z into the remaining leaves, and count the successful mappings. [Figure: a network with x and y already placed; the remaining leaves are marked as good leaves for z or bad leaves for z]
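The labelling procedure can be sketched in Python for the special case where the network is a binary tree shape. This is a simplified illustration, not the authors' implementation: the function names, the nested-tuple encoding, and the example input T are assumptions, and no attempt is made to match the running-time bound quoted in the slides. For each triplet, the not-yet-placed species are tried on every tuple of distinct free leaves and the successful placements are counted.

```python
from fractions import Fraction
from itertools import permutations


def leaf_paths(tree, prefix=()):
    """Map each leaf to its path of child indices from the root."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def lca_depth(p, q):
    """Depth of the lowest common ancestor of two leaves, given their paths."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth


def triplet_probability(triplet, assigned, free_leaves, paths):
    """P(xy|z is consistent | assigned species fixed, the rest placed u.a.r.)."""
    unassigned = [s for s in triplet if s not in assigned]
    good = total = 0
    for leaves in permutations(free_leaves, len(unassigned)):
        place = dict(assigned)
        place.update(zip(unassigned, leaves))
        x, y, z = (paths[place[s]] for s in triplet)
        total += 1
        good += lca_depth(x, y) > lca_depth(x, z)
    return Fraction(good, total)


def derandomized_labelling(shape, species, triplets):
    """Label leaves one at a time, maximising the conditional expectation."""
    paths = leaf_paths(shape)
    assigned = {}                 # species -> leaf
    free_leaves = list(paths)
    remaining = list(species)
    for leaf in list(paths):
        free_leaves.remove(leaf)
        best = None
        for s in remaining:
            assigned[s] = leaf    # tentatively put species s on this leaf
            score = sum(triplet_probability(t, assigned, free_leaves, paths)
                        for t in triplets)
            del assigned[s]
            if best is None or score > best[0]:
                best = (score, s)
        assigned[best[1]] = leaf
        remaining.remove(best[1])
    return assigned


shape = (((0, 1), 2), (3, 4))
T = [("a", "b", "c"), ("a", "c", "b"), ("b", "d", "e"),
     ("c", "e", "a"), ("d", "e", "b"), ("a", "d", "e")]
labelling = derandomized_labelling(shape, "abcde", T)
paths = leaf_paths(shape)
hits = sum(triplet_probability(t, labelling, [], paths) for t in T)
print(labelling)
print(hits, "of", len(T), "triplets consistent")
```

For a binary tree shape the guarantee is p' = 1/3, so the printed count should be at least |T|/3.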

Worst-case optimal algorithm for level-1 • Jansson, Nguyen & Sung (2005) showed how to construct the galled caterpillar on n leaves, denoted C(n). • This level-1 network C(n) has the property that no other network is consistent with a higher fraction of the full triplet set T1(n); it is thus in some sense optimal. • It is easy to construct C(n) in time polynomial in n. Combining this with our generic derandomized labelling algorithm, we obtain a polynomial-time worst-case optimal algorithm for level-1. • For level-1 networks, let us parameterize the p-fraction as a function of n, the number of species. Combining our result with that of J&N&S, we get:

The value p(n) seems to smoothly approach a horizontal asymptote of ≈0.4880… from above. With help from Mathematica and some insights into ‘good’ values of a we have bounded p(n) below by 0.48 for all n.

The galled caterpillar C(17) • Galled caterpillars have a very regular structure, and this allows us to do a faster, customized derandomization, in time O(|T|n²).

Level-2 • Using a combination of our relabelling technique, Java programming, and Mathematica, we were easily (in one afternoon) able to prove a lower bound on p of 0.61 for level-2 networks. • The real value of p for level-2 is probably somewhere around 2/3. But to prove that conclusively we need to know what optimal level-2 networks look like for the full triplet set! A nice challenge for someone...

Conclusions and open problems • We have shown that all tree shapes are worst-case optimal; we have identified p(n) for level-1 networks, and given a lower bound on p for level-2. • More generally: we show how, for any given subclass of networks, the p-fraction can be obtained by studying only the full triplet set and that (generic or customised) polynomial-time algorithms can be constructed around this. • Obtaining (bounds on) p can also be a first step on the road to good approximation algorithms for the MAX variants; it gives a (1/p)-approximation for the MAX variant. • Significance for biology, for the triplet method, for the level-k hierarchy? Our result is probably bad news for the field (not much discriminatory power). • What is the real value of p for level-2, and for higher level networks, and for other subclasses of networks? • Are polynomial-time approximation factors better than (1/p) possible for the MAX variants?
