Take a Walk and Cluster Genes: A TSP-based Approach to Optimal Rearrangement Clustering Sharlee Climer and Weixiong Zhang This research was supported in part by NDSEG and Olin Fellowships and by NSF grants IIS-0196057 and ITR/EIA-0113618.
Overview • Introduction • Example • Results • Conclusion Washington University in St. Louis
Introduction • Rearrangement clustering • Rearrange rows of a matrix • Minimize the sum of the differences between adjacent rows • $\min \sum_{i=1}^{n-1} d(i, i+1)$ over orderings of the n rows • Rows correspond to objects • Columns correspond to features
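The objective on this slide can be sketched in a few lines. This is only an illustration: the difference measure `row_difference` (squared Euclidean distance) and the toy matrix are assumptions, since the slide does not fix a particular d.

```python
def row_difference(a, b):
    """Assumed dissimilarity d: squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rearrangement_cost(matrix, order):
    """Sum of d(i, i+1) over adjacent rows when rows are placed in `order`."""
    return sum(
        row_difference(matrix[order[i]], matrix[order[i + 1]])
        for i in range(len(order) - 1)
    )


# Hypothetical 4 x 3 matrix: rows are objects, columns are features.
matrix = [
    [0, 0, 1],
    [5, 5, 5],
    [0, 1, 1],
    [5, 4, 5],
]

original = rearrangement_cost(matrix, [0, 1, 2, 3])    # rows as given
rearranged = rearrangement_cost(matrix, [0, 2, 1, 3])  # similar rows adjacent
print(original, rearranged)  # 173 59
```

Placing the two similar pairs of rows next to each other lowers the cost from 173 to 59, which is the effect rearrangement clustering looks for.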
Introduction • Applications • Information retrieval • Manufacturing • Software engineering
Example • Bond Energy Algorithm (BEA) • Introduced in 1972 (McCormick, Schweitzer, White) • Approximate solution • Still widely used
Example • Optimal solution • Lenstra (1974) observed equivalence to the Traveling Salesman Problem (TSP) • Given n cities and the distance between each pair • Find shortest cycle visiting every city • NP-hard problem
Example • Transform into a TSP • Each object corresponds to a city • Distance between two cities equals the difference between the corresponding objects • Dummy city added to the problem • Costs from the dummy city to all other cities equal a constant • Location of the dummy city indicates where to cut the cycle into a path
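A minimal sketch of this transformation, under stated assumptions: the Manhattan distance `manhattan`, the dummy cost of 0, and the brute-force search are illustrative choices that only work for tiny inputs. They stand in for a real TSP solver and are not the authors' implementation.

```python
from itertools import permutations


def manhattan(a, b):
    """Assumed difference between two objects (rows)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_tsp_matrix(rows, dummy_cost=0):
    """One city per row, plus a dummy city (last index) at a constant cost."""
    n = len(rows)
    dist = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            dist[i][j] = manhattan(rows[i], rows[j])
        dist[i][n] = dist[n][i] = dummy_cost
    return dist


def brute_force_tour(dist):
    """Exhaustive search for the shortest cycle; the dummy city is the start."""
    size = len(dist)
    start = size - 1
    best_tour, best_len = None, None
    for perm in permutations(range(size - 1)):
        tour = (start,) + perm
        length = sum(dist[tour[i]][tour[(i + 1) % size]] for i in range(size))
        if best_len is None or length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len


def optimal_row_order(rows):
    """Cut the optimal cycle at the dummy city to obtain a path over the rows."""
    tour, length = brute_force_tour(build_tsp_matrix(rows))
    return list(tour[1:]), length


rows = [[0, 0, 1], [5, 5, 5], [0, 1, 1], [5, 4, 5]]
order, cost = optimal_row_order(rows)
print(order, cost)  # [0, 2, 3, 1] 14
```

Because both dummy edges cost the same constant (0 here), the cycle length equals the cost of the path left after removing the dummy city, so the shortest cycle yields the optimal row ordering.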
Example • TSP solvers were extremely slow even for small problems in the 1970s • Massive research efforts to solve the TSP over the last three decades • Current solvers • Concorde (Applegate, Bixby, Chvatal, Cook, 2001) • Solved a 15,112-city TSP
Example • BEA and TSP offer approximate and optimal solutions, respectively • We have observed a flaw in the objective function when the objects form natural clusters • The objective minimizes the sum of the differences between every pair of adjacent rows • Inter-cluster distances tend to be significantly larger than intra-cluster distances • Summation dominated by inter-cluster distances
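A small worked example may make the flaw concrete. The one-dimensional values below are made up for illustration: two tight clusters that are far apart.

```python
def adjacent_differences(values):
    """Absolute differences between consecutive entries of an ordering."""
    return [abs(b - a) for a, b in zip(values, values[1:])]


# Hypothetical ordering: cluster {0, 1, 2} followed by cluster {100, 101, 102}.
ordered = [0, 1, 2, 100, 101, 102]

diffs = adjacent_differences(ordered)  # [1, 1, 98, 1, 1]
total = sum(diffs)                     # 102
inter_cluster = max(diffs)             # the single jump between clusters: 98
share = inter_cluster / total

print(diffs, total, round(share, 3))   # [1, 1, 98, 1, 1] 102 0.961
```

Here one inter-cluster edge accounts for about 96% of the objective, so the quality of the ordering inside each cluster barely affects the total.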
Example • TSPCluster addresses this flaw • Add k dummy cities • k clusters are specified by the output • TSP solver ignores inter-cluster distances • Minimizes the sum of intra-cluster distances • Use a sufficiently small constant for distances to/from dummy cities • Dummy cities are never adjacent to each other
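The idea of k dummy cities can be sketched as follows. Everything specific here is an assumption made for illustration: the Manhattan distance, a dummy cost of 0, a very large dummy-to-dummy cost (`forbid`) to keep dummy cities apart, and a brute-force search in place of a real TSP solver such as Concorde.

```python
from itertools import permutations


def manhattan(a, b):
    """Assumed difference between two objects (rows)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_matrix(rows, k, dummy_cost=0, forbid=10**9):
    """Real cities 0..n-1 plus k dummy cities n..n+k-1."""
    n = len(rows)
    size = n + k
    dist = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue
            if i < n and j < n:
                dist[i][j] = manhattan(rows[i], rows[j])
            elif i >= n and j >= n:
                dist[i][j] = forbid      # assumed: keeps dummies non-adjacent
            else:
                dist[i][j] = dummy_cost  # constant cost to/from a dummy city
    return dist


def brute_force_tour(dist):
    """Exhaustive shortest cycle, starting at the last (dummy) city."""
    size = len(dist)
    start = size - 1
    best_tour, best_len = None, None
    for perm in permutations(range(size - 1)):
        tour = (start,) + perm
        length = sum(dist[tour[i]][tour[(i + 1) % size]] for i in range(size))
        if best_len is None or length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len


def split_at_dummies(tour, n):
    """Each run of real cities between two dummy cities forms one cluster."""
    clusters, current = [], []
    for city in tour:
        if city >= n:
            if current:
                clusters.append(current)
                current = []
        else:
            current.append(city)
    if current:
        clusters.append(current)
    return clusters


rows = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 1], [11, 11]]
tour, length = brute_force_tour(build_matrix(rows, k=2))
clusters = split_at_dummies(tour, len(rows))
print(sorted(sorted(c) for c in clusters), length)  # [[0, 2, 4], [1, 3, 5]] 4
```

With the dummy edges costing 0, the tour length of 4 is exactly the sum of intra-cluster distances; the large jump between the two groups is absorbed by the dummy cities and never enters the objective.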
Results • Arabidopsis • 499 genes • 25 conditions • Comparison with BEA • Used BEA similarity measure • BEA score: 447,070 • TSPCluster score: 452,109 (k = 1)
Results • [Figure panels: BEA, TSPCluster]
Results • Compared with Cluster (Eisen et al., 1998) and k-ary (Bar-Joseph et al., 2003) • Used Pearson correlation coefficient • Cluster: 398 • k-ary: 427 • TSPCluster: 436 (k = 1)
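For reference, the Pearson correlation coefficient between two expression profiles can be computed as below. The slides do not spell out how the reported scores were aggregated; the helper `adjacent_similarity` assumes the score is the sum of the coefficient over adjacent rows, which is a guess made for illustration, and the toy profiles are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def adjacent_similarity(matrix, order):
    """Assumed score: sum of Pearson coefficients between adjacent rows."""
    return sum(
        pearson(matrix[order[i]], matrix[order[i + 1]])
        for i in range(len(order) - 1)
    )


# Hypothetical expression profiles (rows are genes, columns are conditions).
profiles = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 4.0, 6.0],
]

mixed = adjacent_similarity(profiles, [0, 1, 2])    # anti-correlated neighbours
grouped = adjacent_similarity(profiles, [0, 2, 1])  # correlated rows adjacent
```

Under this assumed scoring, higher is better, and an ordering over 499 genes could score at most 498, which is consistent with the magnitudes reported on the slide.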
Results • [Figure panels: Cluster, k-ary, TSPCluster]
Results • TSPCluster with k equal to 2 to 50 • How many clusters? • Average inter-cluster distances • BEA local peaks: • 6, 13, 19, 26, 29, 35, 40, 47 • Pearson correlation coefficient local peaks: • 3, 9, 12, 21, 26, 40 • Computation time varied • Less than half a minute to ~3 minutes
Results • [Figure panels: k = 26, k = 40]
Conclusion • Most problems have errors in their data • Error introduced by approximation algorithms can’t be expected to “undo” this error • Computers are cheap • Computers and solvers are sophisticated • We don’t always have to resort to approximate solutions, even for NP-hard problems
Conclusion • Rearrangement clustering provides a linear ordering • Linear ordering inherent to many applications • Information retrieval • Manufacturing • Software engineering
Conclusion • Gene data arranged in linear order to examine data • Linear ordering not necessarily essential to gene clustering problems • Current work • Optimally solve subproblems in clustering algorithms
Questions?