Incomplete Lineage Sorting: Consistent Phylogeny Estimation

Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci & a couple of unrelated observations • Elchanan Mossel, UC Berkeley • Joint work with: Sebastien Roch, Microsoft Research • At Newton Institute, Dec 07

Lecture Plan • A simple observation about gene trees and population trees. • A comment on “optimal” and “absolute converging” tree reconstruction. • A comment on “generic models”. • A comment on “network reconstruction”. • Disclaimer: last talk, so a bit philosophical (but would be happy to provide hard technical proofs).

Gene Trees and Population Trees • Main goal in phylogenetics: recovering species/population histories. • Data: current genes. • Issue: in recent populations, gene trees may differ from population trees. • Model for the evolution of trees in populations (coalescence): • Fixed-size population N. • Each individual chooses a random parent in the previous generation. • # generations = N × branch length. • Main question: how to reconstruct population trees from gene trees?
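The coalescence model on this slide can be simulated directly. The following is a minimal sketch, assuming a haploid population of constant size in which every individual picks its parent uniformly at random; the function names are illustrative and not from the talk. It follows two lineages backwards until they share a parent, which takes a geometric number of generations with mean equal to the population size.

```python
import random

def pairwise_coalescence_time(pop_size, rng):
    """Generations back until two distinct individuals share an ancestor."""
    generations = 0
    lineage_a, lineage_b = 0, 1  # two distinct individuals sampled today
    while lineage_a != lineage_b:
        # Each lineage independently picks a uniformly random parent.
        lineage_a = rng.randrange(pop_size)
        lineage_b = rng.randrange(pop_size)
        generations += 1
    return generations

def mean_coalescence_time(pop_size, trials, seed=0):
    """Monte Carlo average of the pairwise coalescence time."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_time(pop_size, rng) for _ in range(trials))
    return total / trials
```

Because the two lineages merge with probability 1/N per generation, the waiting time scales with N, which is why short population branches leave many gene lineages unsorted.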

Gene Trees: The Engineering Approach • Two common “engineering” approaches: • Approach 1: Assume all genes come from a single tree. • Kubatko-Degnan: Inconsistent. • Approach 2: Build a tree for each gene on its own, then take the majority tree. • Degnan-Rosenberg: Inconsistent. • Q: What should be done instead?

Gene Trees: A Rigorous Approach • M-Roch: A consistent estimator of the molecular distance between two populations $d(P_1,P_2)$ is: $D(P_1,P_2) = \min \{ d_g(P_1,P_2) : g \in \text{Genes} \}$ • $\Rightarrow$ distances between populations are identifiable. • $\Rightarrow$ the tree is identifiable. • Under standard coalescence assumptions, get a good rate: $P(\text{topology error}) \le (\#\,\text{pops}) \times \exp(-c \cdot \#\,\text{genes})$ • c = shortest branch length. • The estimator can be “plugged into” any distance-based method for reconstructing trees. • In M-Roch, use NJ, but it works similarly for: • Short quartets (ESSW) • Distorted metrics and forests (M) • etc.
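The estimator above amounts to taking, for every pair of populations, the smallest distance seen across genes. The sketch below assumes each gene comes with a table of pairwise distances keyed by population pair; the table layout, the function name, and the numbers are made up for illustration. The intuition under the coalescent is that two gene lineages cannot merge more recently than their populations split, so per-gene distances tend to overshoot and the minimum is the natural summary.

```python
def min_distance_estimate(per_gene_distances):
    """For each population pair, keep the minimum distance over all genes."""
    # Only use pairs that every gene table reports.
    pairs = set(per_gene_distances[0])
    for table in per_gene_distances[1:]:
        pairs &= set(table)
    return {
        pair: min(table[pair] for table in per_gene_distances)
        for pair in pairs
    }

# Illustrative values only, not data from the talk.
gene_tables = [
    {("P1", "P2"): 0.30, ("P1", "P3"): 0.55, ("P2", "P3"): 0.60},
    {("P1", "P2"): 0.22, ("P1", "P3"): 0.70, ("P2", "P3"): 0.52},
    {("P1", "P2"): 0.41, ("P1", "P3"): 0.50, ("P2", "P3"): 0.58},
]
estimate = min_distance_estimate(gene_tables)
```

The resulting table can then be handed to any distance-based tree builder, such as neighbor joining.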

Comments on Absolute Convergence • Algorithmic paradigm: want to reconstruct a tree on n species using sequence length L and running time T. • “Absolute Convergence”: L = poly(n); T = poly(n). • Q: Is this the best we can do?

Resolution of Steel’s conjecture [Figure: ancestral reconstruction vs. phylogenetic reconstruction] • Short branches := all branches < $l_c$. • Long branches := all branches > $l_c$. • $l_c$ depends on the mutation model but not on the tree, tree size, etc. • Short branches: sequence length $L = c \log n$ [Daskalakis-M-Roch ’06]. • Long branches: sequence length $L = n^C$ [M ’04]. • n = # species.

The algorithmic challenge • Conj: For short branches, if data is generated from the model, ML identifies the correct tree using L = O(log n) samples (the best bound known is L = exp(O(n))). • Conclusion: In order to “beat” ML, need algorithms with L = O(log n). • Challenge: The constant in the O is important! • Challenge: Deal with short/long branches (contract edges; output a forest). • Challenge: General mutation models (not just CFN, JC). • Comment: Rigorous methods have running time guarantees. • Comment: For L = poly(n), know how to deal with all challenges: ESSW, M ’07 (forests, long edges), Gornieu et al. (short edges).

On generic parameters • From Rhodes’ talk: “Generic models are easier to identify”. • Typically: generic parameters. • How about generic trees?

Mixtures and Phenomena in High Dims • The Geometry of High Dimensions: “Almost every collection of k vectors is almost orthogonal in high enough dimension n”. • M-Roch (in preparation): For every k, as $n \to \infty$, the probability that a mixture of k trees on n leaves is identifiable goes to 1. • Holds for most reasonable measures on the space of trees and most mutation models. • Basic idea: In generic situations can (almost) cluster samples according to trees. • Gives an efficient algorithm. • Similar results hold for rates across sites.

A Comment on Dynamic Programming • Q (Zhang): Given a tree, is it possible to find the most informative k species? In terms of parsimony? In terms of ML? • Note: If we know the parsimony/ML score for the left and right subtrees, we know it for the root. • Q: Can we use dynamic programming? • A: Yes, but with the right “data structure”. • Information per node: a discrete version of the set of achievable distributions. • Called “Density Evolution” in coding theory / spin-glass theory. • Additive error = 1/poly(n). [Figure: node L with subtrees L1 and L2]
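The note that a score at the root follows from the scores of the two subtrees is the usual bottom-up recursion. As a minimal sketch, here is that recursion for the plain parsimony score of one character on a fixed binary tree (Sankoff-style, unit cost per change). It is not the “most informative k species” procedure from the slide, which needs a richer per-node summary; the nested-tuple tree encoding and the function names are assumptions made for illustration.

```python
def parsimony_costs(tree, leaf_states, alphabet):
    """Map each state s to the minimum number of changes in the subtree
    when the subtree root is assigned state s."""
    if isinstance(tree, str):  # a leaf is identified by its name
        return {s: (0 if s == leaf_states[tree] else float("inf"))
                for s in alphabet}
    left, right = tree
    left_costs = parsimony_costs(left, leaf_states, alphabet)
    right_costs = parsimony_costs(right, leaf_states, alphabet)

    def best(child_costs, parent_state):
        # Cheapest child state, paying 1 if it differs from the parent.
        return min(child_costs[t] + (0 if t == parent_state else 1)
                   for t in alphabet)

    return {s: best(left_costs, s) + best(right_costs, s) for s in alphabet}

def parsimony_score(tree, leaf_states, alphabet="01"):
    """Minimum number of changes needed to explain the leaf states."""
    return min(parsimony_costs(tree, leaf_states, alphabet).values())
```

For the tree `(("a", "b"), ("c", "d"))`, the pattern a=0, b=0, c=1, d=1 needs one change, while a=0, b=1, c=0, d=1 needs two.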

Hardness of Distinguishing Network Models with Hidden Nodes • Basic question: Is it possible to recover a network G from observations at a subset of the nodes? • Easier question: Suppose we observe X1,…,Xr. Is it possible to determine if they come from nodes S in G1 or nodes T in G2? • Problem: It may be that the two distributions are the same. • Assume: The two distributions are different (large total variation distance). • Q: Assuming the two distributions are different, how hard is it to tell if the data is coming from G1 or G2? • Related question: What is a computational model of a biologist? [Figure: networks G1 and G2]

The distinguishing problem for Trees • Q: Assuming the two distributions are different, how hard is it to tell if the data is coming from T1 or T2? • Note: For trees the problem is easy: • Perform a likelihood test. • Easy to do efficiently (peeling, pruning, dynamic programming). • # samples needed: poly(n). [Figure: trees T1 and T2]
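The likelihood test mentioned above can be sketched with the pruning (peeling) recursion. The example below assumes the two-state symmetric CFN model with a uniform root, and encodes an internal node as a tuple of `(child, flip_probability)` pairs with leaves given by name; this encoding and the function names are illustrative choices, not from the talk.

```python
import math

def partial_likelihoods(node, pattern):
    """Return [L0, L1]: probability of the leaf states observed below
    `node`, conditioned on `node` being in state 0 or 1."""
    if isinstance(node, str):  # leaf: indicator of the observed state
        return [1.0 if pattern[node] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, flip in node:
        below = partial_likelihoods(child, pattern)
        for s in (0, 1):
            # The child keeps state s w.p. 1 - flip and switches w.p. flip.
            result[s] *= (1 - flip) * below[s] + flip * below[1 - s]
    return result

def pattern_likelihood(tree, pattern):
    """Likelihood of one site pattern with a uniform root state."""
    l0, l1 = partial_likelihoods(tree, pattern)
    return 0.5 * l0 + 0.5 * l1

def log_likelihood_ratio(tree1, tree2, patterns):
    """Positive values favor tree1, negative values favor tree2.
    Assumes every pattern has positive likelihood under both trees."""
    return sum(math.log(pattern_likelihood(tree1, p))
               - math.log(pattern_likelihood(tree2, p))
               for p in patterns)
```

Each site costs time linear in the number of nodes, so the test runs in time proportional to tree size times sequence length.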

Two Models of a Biologist • The Computationally Limited Biologist: Cannot solve hard computational problems; in particular, cannot sample from general G-distributions. • The Computationally Unlimited Biologist: Can sample from any distribution. • Related to the following problem: Can nature solve computationally hard problems? (From Shapiro at Weizmann.)

Hardness Results • The Computationally Limited Biologist (Bogdanov-M): The distinguishing problem can be solved efficiently iff NP = RP. • The Computationally Unlimited Biologist (Bogdanov-M): The problem is at least zero-knowledge hard. • Zero-Knowledge Problem: Can we decide whether samples from a computationally efficient distribution come from the uniform distribution? • Related to cryptography. [Figure: networks G1 and G2]

Reconstructing Networks • Motivation: abundance of stochastic networks in biology, social networks, neuroscience, etc. • A network defines a distribution as follows: • G = (V,E) = graph on [n] = {1,2,…,n}. • Distribution defined on $A^V$, where A is some finite set. • To each clique C in G, associate a function $\psi_C : A^C \to \mathbb{R}_+$ and: $P[\sigma] = \prod_C \psi_C(\sigma_C)$ (up to a normalizing constant). • Called a Markov Random Field, Factorized Distribution, etc. • Directed models are also common. • Markov Property: If S separates A from B, then A and B are conditionally independent given S.
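For a small graph the factorized distribution can be written out by brute force, which makes the Markov property easy to check. The sketch below assumes nodes are numbered from 0, potentials are Python functions on the tuple of values at a clique, and the product is normalized to sum to one; the helper names and the example potentials are illustrative only.

```python
from itertools import product

def mrf_distribution(n, alphabet, clique_potentials):
    """Enumerate all assignments and weight each by the product of its
    clique potentials, then normalize."""
    weights = {}
    for sigma in product(alphabet, repeat=n):
        w = 1.0
        for clique, psi in clique_potentials.items():
            w *= psi(tuple(sigma[v] for v in clique))
        weights[sigma] = w
    z = sum(weights.values())  # normalizing constant
    return {sigma: w / z for sigma, w in weights.items()}

def marginal(dist, nodes):
    """Marginal law of the listed nodes, keyed by their values in order."""
    out = {}
    for sigma, p in dist.items():
        key = tuple(sigma[v] for v in nodes)
        out[key] = out.get(key, 0.0) + p
    return out

def agree(weight):
    """Pairwise potential that rewards equal values (illustrative)."""
    return lambda values: weight if values[0] == values[1] else 1.0

# Path 0 - 1 - 2: node 1 separates node 0 from node 2.
potentials = {(0, 1): agree(2.0), (1, 2): agree(3.0)}
dist = mrf_distribution(3, (0, 1), potentials)
```

On this path, `dist[(a, b, c)] * P(b)` equals `P(a, b) * P(b, c)` for all values, which is exactly conditional independence of nodes 0 and 2 given node 1.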

Reconstructing Networks • Task 1: Given samples of $\sigma$, find G. • Task 2: Given samples of $\sigma$ restricted to a set S, find G. • Will consider the problem when n is large and the maximum degree d is small. • (Note that the specification of the model is of size max(n,,exp(max |C|)).)

Reconstructing Networks – A Trivial Algorithm • Lower bound (Bresler-M-Sly): In order to recover G of max-degree d, need at least c d log n samples. • Proof follows by “counting # of networks”. • Upper bound (Bresler-M-Sly): If the distribution is “non-degenerate”, c d log n samples suffice. • Trivial Algorithm: For each $v \in V$: • Enumerate over candidates for N(v). • For each $w \in V$, check if v is independent of w given N(v). • Non-Degeneracy: For every v and every $w \in N(v)$ there exist two assignments $\sigma_1$ and $\sigma_2$ to N(v) that differ at w and: $d_{TV}(P(v \mid \sigma_1), P(v \mid \sigma_2)) \ge \epsilon$ • For the soft-core model it suffices to have, for all $\psi = \psi_{u,v}$: $\max_{a,b,c,d} |\psi(c,a) - \psi(d,a) + \psi(c,b) - \psi(d,b)| > \epsilon$ • Running time = $O(n^{d+1} \log n)$
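The enumeration in the trivial algorithm can be sketched as follows. This is a simplified illustration rather than the procedure from Bresler-M-Sly: it takes a joint probability table (an exact law here, standing in for empirical frequencies from samples), tries candidate neighborhoods in order of size, and accepts the first one that makes v conditionally independent of every other node. All function names, the tolerance, and the Ising chain example are assumptions made for the sketch.

```python
import math
from itertools import combinations, product

def marginal(dist, nodes):
    out = {}
    for sigma, p in dist.items():
        key = tuple(sigma[u] for u in nodes)
        out[key] = out.get(key, 0.0) + p
    return out

def conditional_law(dist, v, given):
    """Map each positive-mass assignment of `given` to the law of node v."""
    joint, ctx = marginal(dist, (v,) + given), marginal(dist, given)
    laws = {}
    for key, p in joint.items():
        if ctx[key[1:]] > 0:
            laws.setdefault(key[1:], {})[key[0]] = p / ctx[key[1:]]
    return laws

def separates(dist, v, w, cand, tol):
    """True if, with `cand` fixed, changing w moves the law of v by at
    most tol in total variation."""
    by_context = {}
    for key, law in conditional_law(dist, v, (w,) + cand).items():
        by_context.setdefault(key[1:], []).append(law)
    for group in by_context.values():
        for l1, l2 in combinations(group, 2):
            states = set(l1) | set(l2)
            tv = 0.5 * sum(abs(l1.get(s, 0.0) - l2.get(s, 0.0)) for s in states)
            if tv > tol:
                return False
    return True

def find_neighborhood(dist, n, v, max_degree, tol=1e-9):
    """Smallest candidate set that screens v off from all other nodes."""
    others = [u for u in range(n) if u != v]
    for size in range(max_degree + 1):
        for cand in combinations(others, size):
            if all(separates(dist, v, w, cand, tol)
                   for w in others if w not in cand):
                return set(cand)
    return None

def ising_chain(n, beta):
    """Exact law of an Ising chain 0-1-...-(n-1) (illustrative input)."""
    weights = {s: math.exp(beta * sum(s[i] * s[i + 1] for i in range(n - 1)))
               for s in product((-1, 1), repeat=n)}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}

dist = ising_chain(4, 0.5)
recovered = {v: find_neighborhood(dist, 4, v, max_degree=2) for v in range(4)}
```

With real samples the table would hold empirical frequencies and `tol` would have to be tied to the non-degeneracy parameter; the loop over candidate sets of size at most d is what drives the polynomial-in-n, exponential-in-d running time.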

A Trivial Algorithm – Related Results • Trivial Algorithm: For each $v \in V$: • Enumerate over candidates for N(v). • For each $w \in V$, check if v is independent of w given N(v). • Related work: • The algorithm was suggested before. • Abbeel, D. Koller, A. Ng: without restrictions, learn a model whose KL distance from the generating model is small (no guarantee of obtaining the true model; in order to get O(1) KL distance, need poly samples). • M. J. Wainwright, P. Ravikumar, J. D: Use L1 regularization to get the true model for Ising models; sampling complexity $O(d^5 \log n)$, no running time bounds. • Other related work: assuming a special form of the potentials.

Variants of the Trivial Algorithm • If the graph has exponential decay of correlations, $\mathrm{Corr}(u,v) \le \exp(-c\, d(u,v))$: • Suffices to enumerate over N(v) among the w correlated with v. • Running time: $O(n^2 \log n + n f(d))$. • Missing nodes: Suppose G is triangle-free; then a variant of the algorithm can find one hidden node. • Idea (with M. Biskup’s help): Run the algorithm as if the node is not hidden. • Noise: The algorithm tolerates small amounts of noise (statistical robustness). • Q: What about higher amounts of noise? • (From Bresler-M-Sly) [Figure label: possible w’s]

Higher Noise & Non-Identifiable Example • Bresler-M-Sly: Example of non-identifiability. • Consider G1 = path of length 2, G2 = triangle + noise. • Assume an Ising model with random interactions and random noise. • Then with constant probability, cannot distinguish between the models. • Ising: $P[\sigma] = \prod_{(u,v) \in E} \exp(\beta_{u,v}\, \sigma(u)\, \sigma(v))$ (up to a normalizing constant). • Intuitive reason: the dimension of the distribution is 3 in both cases. [Figure legend: hidden nodes, observed nodes]

Thanks !! • Sebastien Roch • Costis Daskalakis • Andrej Bogdanov

Thanks !! • Fascinating workshop: Principal Organiser: Professor Mike Steel (University of Canterbury, NZ). Organisers: Professor Vincent Moulton (University of East Anglia) and Dr Katharina Huber (University of East Anglia). • Sponsored by: Allan Wilson Centre for Molecular Ecology and Evolution. • As part of a great program: Organisers: Professor V Moulton (East Anglia), Professor M Steel (Canterbury) and Professor D Huson (Tübingen).
