This work presents a method for fitting data to tree structures in hierarchical clustering and phylogenetics, preserving dissimilarity information. We utilize a matrix representation of dissimilarity to develop an objective function aiming to minimize the cost between observed and modeled distance metrics. Applications include evolutionary biology, molecular phylogeny, and historical linguistics, with special focus on ultrametrics. The paper discusses approximation algorithms for fitting trees, offering insights into efficient solutions and future work in this domain.
Fitting Tree Metrics: Hierarchical Clustering and Phylogeny
Nir Ailon, Moses Charikar
Princeton University
Data with dissimilarity information
• Represented by matrix D
• Complete information
• Big number = high dissimilarity
[Figure: complete graph on points u, v, w, x, y with a numeric dissimilarity on every edge; one edge is annotated D(u,v)]
Goal: Fit data to tree structure
• Preserve dissimilarity info
• Tree metric d_T close to D
[Figure: tree T with leaves u, v, w, x, y; d_T(u,v) is marked on the tree]
Objective function
• Minimize: cost(T) = ||D - d_T||_p
• D and d_T are viewed as (n choose 2)-dimensional real vectors
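To make the objective concrete, here is a minimal sketch that evaluates cost(T) for a candidate tree metric. The function name `fit_cost` and the nested-list matrix layout are assumptions made for illustration; they are not taken from the slides.

```python
from itertools import combinations

def fit_cost(D, dT, p=1.0):
    """||D - dT||_p taken over the n-choose-2 unordered pairs.

    Pass p=float('inf') for the max norm.
    """
    n = len(D)
    diffs = [abs(D[u][v] - dT[u][v]) for u, v in combinations(range(n), 2)]
    if p == float("inf"):
        return max(diffs, default=0.0)
    return sum(x ** p for x in diffs) ** (1.0 / p)

# Hypothetical 3-point instance: D is the input, dT a candidate tree metric.
D  = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
dT = [[0, 1, 3], [1, 0, 3], [3, 3, 0]]
print(fit_cost(D, dT, p=1))
```

Only the pair (1, 2) differs between the two matrices, so it is the only pair that contributes to the cost.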
Applications
• Evolutionary biology
• Molecular phylogeny: dissimilarity information from DNA
• Gene expression analysis
• Historical linguistics
• ...
Special case: Ultrametrics (hierarchical clustering)
[Figure: tree T with M=3 levels and leaves u, v, w, x, y; d_T(v,x)=1, d_T(u,w)=3]
• Equivalently: the two largest distances in every triangle are equal
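The triangle characterization above translates directly into a check. The sketch below assumes a symmetric matrix given as nested lists; the name `is_ultrametric` is chosen here for illustration.

```python
from itertools import combinations

def is_ultrametric(d):
    """True iff the two largest distances in every triangle are equal."""
    n = len(d)
    for u, v, w in combinations(range(n), 3):
        _, second, largest = sorted((d[u][v], d[u][w], d[v][w]))
        if second != largest:
            return False
    return True

# Hypothetical examples: the first triangle has sides 1, 3, 3; the second 1, 2, 3.
print(is_ultrametric([[0, 1, 3], [1, 0, 3], [3, 3, 0]]))
print(is_ultrametric([[0, 1, 3], [1, 0, 2], [3, 2, 0]]))
```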
Previous results
• Fitting ultrametrics under ||.||_∞ is in P [FKW95]
• Fitting trees under ||.||_∞ is APX-hard [ABFPT99]
• Fitting ultrametrics under ||.||_1 is APX-hard [W93]; under ||.||_2 it is NP-hard
• An f(n)-approximation algorithm for ultrametrics yields a 3f(n)-approximation algorithm for trees (under any ||.||_p) [ABFPT99]
Previous results
• O(min{n^{1/p}, (k log n)^{1/p}})-approximation for trees under ||.||_p [HKM05]
• Fitting ultrametrics for M=2 under ||.||_1: Correlation Clustering [BBC02, CGW03, ACN05, ...]
• ...
Our results
• (M+1)-approximation for fitting level-M ultrametrics under ||.||_1
• O((log n loglog n)^{1/p})-approximation for general weighted trees under ||.||_p
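Reconstructing T from ultrametric D
• Given ultrametric D ∈ {1..M}^{n×n}
• Pick pivot vertex u
• Recursively solve for neighbor-classes
[Figure: pivot u with its neighbor-classes at distances 1, 2 and 3; the classes are labeled with subproblems M=3 and M=2]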
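The pivot recursion above can be sketched as follows. The nested `(pivot, {distance: subtree})` representation and the helper names `build`, `leaves` and `tree_dist` are assumptions made for this illustration; the slides do not specify a data structure. The pivot is simply the first vertex here, since any pivot works when D is a true ultrametric.

```python
def build(D, vertices):
    """Nested (pivot, {distance: subtree}) description of the ultrametric tree."""
    u, rest = vertices[0], vertices[1:]
    classes = {}
    for v in rest:
        # Neighbor-class of v = its distance to the pivot.
        classes.setdefault(D[u][v], []).append(v)
    return (u, {k: build(D, cls) for k, cls in sorted(classes.items())})

def leaves(tree):
    u, children = tree
    out = [u]
    for sub in children.values():
        out.extend(leaves(sub))
    return out

def tree_dist(tree, v, w):
    """Distance between two distinct vertices, read off the nested description."""
    u, children = tree
    where = {u: 0}  # the pivot sits in its own class, labeled 0
    for k, sub in children.items():
        for x in leaves(sub):
            where[x] = k
    if where[v] != where[w]:
        # Different classes: the larger pivot distance decides.
        return max(where[v], where[w])
    return tree_dist(children[where[v]], v, w)

# Hypothetical 5-point ultrametric with M = 3.
D = [[0, 1, 2, 3, 3],
     [1, 0, 2, 3, 3],
     [2, 2, 0, 3, 3],
     [3, 3, 3, 0, 1],
     [3, 3, 3, 1, 0]]
T = build(D, list(range(5)))
print(T)
```

Reading distances back with `tree_dist` reproduces D on every pair, which is one way to confirm that the pivot distances alone determine an ultrametric.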
Minimizing ||.||_1 for inconsistent D ∈ {1..M}^{n×n}
• Same algorithm!
• Pick pivot vertex u (uniformly at random)
• Freeze distances incident to u
• Fix inter-class distances
• Fix intra-class distances
• Recurse...
• Lemma: no cancellations
• Theorem: (M+1)-approximation
[Figure: worked example around pivot u in which the corrected distances are marked X; total cost contribution: 4]
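One plausible reading of the steps above, as a sketch rather than the authors' exact procedure: after freezing the pivot's distances, a pair in different classes is set to the larger of its two pivot distances, and a pair in the same class is capped at that class's pivot distance. These two rules, along with the names `pivot_fit` and `l1_cost`, are assumptions made for this illustration.

```python
import random
from itertools import combinations

def pivot_fit(D, rng=None):
    """Randomized pivot sketch: returns an ultrametric T built from D."""
    rng = rng or random.Random()
    T = [row[:] for row in D]  # work on a copy; D is left untouched

    def solve(vertices):
        if len(vertices) <= 1:
            return
        u = rng.choice(vertices)  # pivot, uniformly at random
        others = [v for v in vertices if v != u]
        level = {v: T[u][v] for v in others}  # frozen distances to the pivot
        for v, w in combinations(others, 2):
            if level[v] != level[w]:
                new = max(level[v], level[w])   # assumed inter-class rule
            else:
                new = min(T[v][w], level[v])    # assumed intra-class rule
            T[v][w] = T[w][v] = new
        classes = {}
        for v in others:
            classes.setdefault(level[v], []).append(v)
        for cls in classes.values():
            solve(cls)  # recurse on each neighbor-class

    solve(list(range(len(D))))
    return T

def l1_cost(D, T):
    return sum(abs(D[u][v] - T[u][v]) for u, v in combinations(range(len(D)), 2))

# Hypothetical inconsistent input: the triangle has sides 1, 2, 3.
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
T = pivot_fit(D, random.Random(0))
print(T, l1_cost(D, T))
```

On an input that is already an ultrametric, both rules leave every distance unchanged, so the sketch returns D itself at zero cost.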
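Proof idea
• A triangle u, v, w with distances d_1, d_2, d_3 is violating if d_1 > d_2 ≥ d_3
• Optimal solution pays ≥ d_1 - d_2
• Algorithm charging scheme: the triangle is charged according to which of its vertices is chosen as pivot
[Figure: triangle u, v, w with a table "vertex chosen as pivot ⇒ amount charged"; the recoverable entries are d_1 - d_2 and (d_2 - d_3) + (d_1 - d_2)]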
General ultrametrics
• D ∈ R_+^{n×n}
• Fit D to weighted ultrametric
• M possible distances: level 1 = L_1, level 2 = L_1 + L_2, ..., level M = L_1 + ... + L_M
• Ex: d_T(v,w) = L_1 + L_2
[Figure: tree T with edge-weight levels L_1, L_2, ..., L_M and leaves u, v, w, x, y]
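Fitting D to an M-level weighted ultrametric under ||.||_1
• Integer program formulation: x^t_{uv} ∈ {0,1}
• x^t_{uv} = 1 ⇔ u, v separated at level t
• 0 ≤ x^M_{uv} ≤ x^{M-1}_{uv} ≤ ... ≤ x^1_{uv} = 1
• Triangle inequality at each level: x^t_{uv} ≤ x^t_{uw} + x^t_{wv}
• Cost: min Σ_{t=1}^{M} L_t ( Σ_{D(u,v) ≤ t} x^t_{uv} + Σ_{D(u,v) > t} (1 - x^t_{uv}) )
• Linear [0,1] relaxation
[Figure: tree T with levels L_1, L_2, ..., L_M and leaves u, v, w, x, y; for the pair u, y: x^1_{uy} = 1, x^2_{uy} = 0, ..., x^M_{uy} = 0]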
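To illustrate how the 0/1 variables encode a weighted ultrametric, the sketch below derives them from a matrix of integer levels and reads the weighted distance back as the sum of L_t over the levels at which the pair is separated. The encoding "x^t_{uv} = 1 exactly when the pair's level is at least t" is an assumption inferred from the monotone chain and the figure values; the names `indicators` and `weighted_dist` are hypothetical.

```python
def indicators(level, M):
    """x[t][u][v] = 1 iff u != v and the pair's level is at least t (t = 1..M)."""
    n = len(level)
    return {t: [[1 if u != v and level[u][v] >= t else 0 for v in range(n)]
                for u in range(n)]
            for t in range(1, M + 1)}

def weighted_dist(x, L, u, v):
    """Sum of L[t] over the levels t at which u and v are separated."""
    return sum(L[t] * x[t][u][v] for t in x)

# Hypothetical 5-point, 3-level ultrametric and level weights.
level = [[0, 1, 2, 3, 3],
         [1, 0, 2, 3, 3],
         [2, 2, 0, 3, 3],
         [3, 3, 3, 0, 1],
         [3, 3, 3, 1, 0]]
L = {1: 0.5, 2: 1.0, 3: 2.0}
x = indicators(level, M=3)
print(weighted_dist(x, L, 0, 1), weighted_dist(x, L, 0, 2), weighted_dist(x, L, 0, 3))
```

With this encoding, a pair at level 2 gets distance L_1 + L_2, matching the cumulative sums on the previous slide.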
Rounding the LP: an O(log n loglog n)-approximation
• A divisive (top-down) algorithm
• At each level t = M, M-1, ..., 1:
  • Solve a multi-cut-like problem
  • Cluster so as to separate pairs u, v such that x^t_{uv} ≥ 2/3
• Danger: high levels influence low ones!
General ||.||_p cost
• Similar analysis gives the same bound for ||.||_p^p
• Therefore: O((log n loglog n)^{1/p})-approximation
• By [ABFPT99], applies also to fitting trees
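The "therefore" step is a p-th root. Writing T^* for an optimal tree (notation introduced here, not on the slide) and c for the approximation factor obtained for ||.||_p^p, the implication reads as follows. The same T^* is optimal for both objectives because taking the p-th root is monotone.

```latex
\|D - d_T\|_p^p \;\le\; c \cdot \|D - d_{T^*}\|_p^p
\quad\Longrightarrow\quad
\|D - d_T\|_p \;\le\; c^{1/p} \cdot \|D - d_{T^*}\|_p ,
\qquad c = O(\log n \,\log\log n).
```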
Future work
• O(log n)-approximation algorithm? Better?
• Stronger lower bounds
• Derandomize the (M+1)-approximation algorithm
• Aggregation [ACN05]
• Applications
Thank You!!!