A Two-Way Visualization Method for Clustered Data • Advisor: Dr. Hsu • Presenter: Keng-Wei Chang • Authors: Yehuda Koren and David Harel • ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Outline • Motivation • Objective • Introduction • Basic Notions • Computing The x-Coordinates • Computing The y-Coordinates • Result • Related Work • Conclusions • Personal Opinion
Motivation • A number of technological developments have led to an explosion of raw data that has to be analyzed • We are especially interested in two families of tools in this domain • Clustering algorithms and data visualization methods
Objective • in this paper, we integrate the two approaches • hierarchical clustering depicted as a dendrogram • low-dimensional embedding
Introduction • A number of technological developments have led to an explosion of raw data that has to be analyzed • We are especially interested in two families of tools in this domain • Clustering algorithms and data visualization methods • Clustering methods can be broadly classified into two kinds • Hierarchical and partitional
Introduction • Our main interest here is hierarchical clustering • The clustering hierarchy is often visualized as a dendrogram • a dendrogram is a full binary tree • it has a significant disadvantage • it does not provide exploratory visual representations of the data itself • another issue is that of cluster validity
Introduction • we are particularly interested in methods for achieving a low-dimensional embedding of data • principal component analysis (PCA) • multidimensional scaling (MDS) • force-directed placement • these methods solve some limitations of the dendrogram • but they cannot utilize external clustering information
Introduction • for a demonstration of the relative merits of the two approaches • a dendrogram vs. a low-dimensional embedding
Introduction • in this paper, we integrate the two approaches • hierarchical clustering depicted as a dendrogram • low-dimensional embedding
Basic Notions • given data about n elements {1,…,n} • relationships between pairs of elements are described by • distances $d_{ij} \ge 0$ or • similarities $w_{ij} \ge 0$ • a 2-dimensional embedding of the data • is defined by two vectors $x, y \in \mathbb{R}^n$ • the coordinates of element i are $(x_i, y_i)$
Computing The x-Coordinates • The embedding must place each element exactly below its corresponding leaf in the dendrogram • this means that the x-coordinate of each element must equal that of its corresponding leaf in the dendrogram • we therefore face the problem of • computing the x-coordinates of the dendrogram leaves • in a way that preserves the relationships among the data as much as possible
Computing The x-Coordinates • we exhaust all the existing methods, opting for a twofold process • find the best orientation of the dendrogram • this step determines the ordering of the leaves • decide on the exact gaps between consecutive leaves in the ordering
Dendrogram orientation • a dendrogram with n leaves has $2^{n-1}$ different orientations • example:
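The orientation count can be checked on a small case. The sketch below assumes a dendrogram stored as nested tuples (a leaf is an integer, an internal node is a pair of subtrees). This representation and the name `leaf_orders` are illustrative and do not come from the paper. Each of the n-1 internal nodes can be left as is or flipped, which gives the $2^{n-1}$ leaf orders.

```python
from itertools import product

def leaf_orders(tree):
    """Yield every left-to-right leaf order reachable by flipping internal nodes."""
    if not isinstance(tree, tuple):      # a leaf has exactly one order
        yield (tree,)
        return
    left, right = tree
    for a, b in product(list(leaf_orders(left)), list(leaf_orders(right))):
        yield a + b                      # children kept in their given order
        yield b + a                      # children flipped

tree = ((1, 2), (3, 4))                  # n = 4 leaves, 3 internal nodes
orders = list(leaf_orders(tree))
print(len(orders), 2 ** (4 - 1))         # both numbers are 8
```

Orders such as (1, 3, 2, 4) never appear, because they would break up the cluster {1, 2}.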
Dendrogram orientation • one way of formally defining what should be considered a “good” ordering • associate a cost function with the dendrogram • such that finding the best ordering is equivalent to optimizing this function • one such function comes from the classical minimum linear arrangement problem, which minimizes $\sum_{i<j} w_{ij}\,|x_i - x_j|$ over orderings x
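As a concrete reading of the minimum linear arrangement objective, the sketch below sums similarity times positional distance over all pairs. The function name, the 0-based element labels, and the toy similarity matrix are assumptions made for illustration.

```python
def linear_arrangement_cost(order, w):
    """Sum of w[i][j] * |x_i - x_j| over all unordered pairs, where x_i is
    the 1-based position of element i in `order`."""
    pos = {elem: p for p, elem in enumerate(order, start=1)}
    elems = list(order)
    total = 0
    for a in range(len(elems)):
        for b in range(a + 1, len(elems)):
            i, j = elems[a], elems[b]
            total += w[i][j] * abs(pos[i] - pos[j])
    return total

# Toy similarities: elements 0 and 1 are strongly related.
w = [[0, 5, 1],
     [5, 0, 1],
     [1, 1, 0]]
print(linear_arrangement_cost((0, 1, 2), w))   # 0 and 1 adjacent: cost 8
print(linear_arrangement_cost((0, 2, 1), w))   # 0 and 1 separated: cost 12
```

Placing highly similar elements next to each other lowers the cost, which is why minimizing it tends to produce a sensible leaf ordering.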
Dendrogram orientation • in our particular problem • we are also faced with an ordering task • finding a permutation of {1, …, n} • however, here we should not consider all possible permutations, but only those that agree with the dendrogram’s structure • $n!$ permutations vs. $2^{n-1}$ orientations • using dynamic programming, the running time is exponential in the dendrogram’s height, not in its size
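To make the restricted search space concrete, here is a brute-force sketch that scores every orientation of a tiny dendrogram with the linear arrangement cost and keeps the cheapest. This is not the dynamic programming algorithm from the paper. It visits all $2^{n-1}$ orientations and is only usable for very small n. The nested-tuple tree and all function names are assumptions.

```python
from itertools import product

def leaf_orders(tree):
    """Yield every leaf order that agrees with the dendrogram's structure."""
    if not isinstance(tree, tuple):
        yield (tree,)
        return
    left, right = tree
    for a, b in product(list(leaf_orders(left)), list(leaf_orders(right))):
        yield a + b
        yield b + a

def cost(order, w):
    """Linear arrangement cost: sum of w[i][j] * |x_i - x_j| over pairs."""
    pos = {elem: p for p, elem in enumerate(order, start=1)}
    return sum(w[i][j] * abs(pos[i] - pos[j])
               for i in order for j in order if i < j)

def best_orientation(tree, w):
    """Exhaustively pick the cheapest orientation (exponential in n)."""
    return min(leaf_orders(tree), key=lambda order: cost(order, w))

# Toy similarities: 1 and 2 sit in different clusters but are strongly related.
w = [[0, 5, 1, 1],
     [5, 0, 10, 1],
     [1, 10, 0, 5],
     [1, 1, 5, 0]]
tree = ((0, 1), (3, 2))
best = best_orientation(tree, w)
print(best, cost(best, w))
```

The chosen orientation flips the right-hand cluster so that leaves 1 and 2 become neighbours, while both clusters stay contiguous.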
Dendrogram orientation • we introduce an additional form of the cost function, one that is maximized rather than minimized
Dendrogram orientation • given an ordered dendrogram T and a node v • Leaves(v): the set of leaves in the subtree rooted at v • let x be the ordering on the leaves • let S be Leaves(v) • L be the set of leaves to the left of S • R be the set of leaves to the right of S • if |L| = l and |S| = s, we have x(L) = {1,…,l}, x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}
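The L, S, R notation can be illustrated on a small ordered dendrogram. The sketch below assumes a nested-tuple tree with unique leaf labels. The helper names `leaves`, `split_around`, and `positions` are hypothetical.

```python
def leaves(node):
    """Leaves(v): the leaves of the subtree rooted at `node`, left to right."""
    if not isinstance(node, tuple):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def split_around(tree, v):
    """Return (L, S, R): leaves left of, inside, and right of subtree v."""
    order = leaves(tree)
    S = leaves(v)
    start = order.index(S[0])            # leaf labels are assumed unique
    assert order[start:start + len(S)] == S, "v must be a subtree of tree"
    return order[:start], S, order[start + len(S):]

def positions(L, S, R):
    """Positions x(L), x(S), x(R) as 1-based ranges."""
    l, s, n = len(L), len(S), len(L) + len(S) + len(R)
    return (list(range(1, l + 1)),
            list(range(l + 1, l + s + 1)),
            list(range(l + s + 1, n + 1)))

tree = (("a", "b"), (("c", "d"), "e"))
L, S, R = split_around(tree, ("c", "d"))
print(L, S, R)                # ['a', 'b'] ['c', 'd'] ['e']
print(positions(L, S, R))     # ([1, 2], [3, 4], [5])
```

Because a subtree's leaves always occupy a contiguous block, the positions of S depend only on l and s, not on how L and R are arranged internally.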
Dendrogram orientation • a key concept of the algorithm is • local arrangement cost, defined as: • if |L| = l, |S| = s, we have x(L) = {1,…,l}, • x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}
Dendrogram orientation • two additional related terms will be used • another term that will be used in the algorithm
Determining coordinates of the leaves • computing the exact gaps between each two consecutive leaves • example:
Determining coordinates of the leaves • a better approach is to take a weighted average over all influenced leaf pairs
Computing The y-Coordinates • Principal component analysis • Classical multidimensional scaling • Eigen-projection • Stress minimization
Result • Odors dataset • consists of 30 volatile odorous pure chemicals • contains 262 elements, natural clusters: 30 • we use UPGMA agglomerative clustering to construct the dendrogram
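For readers unfamiliar with UPGMA, the following is a minimal pure-Python sketch of average-linkage agglomerative clustering on a toy distance matrix. The matrix is invented for illustration and is not the Odors data. The nested-tuple output format and the function name `upgma` are assumptions.

```python
def upgma(d):
    """Build a dendrogram (nested tuples) from a symmetric distance matrix
    by repeatedly merging the two closest clusters (average linkage)."""
    n = len(d)
    clusters = {i: (i, 1) for i in range(n)}          # id -> (subtree, size)
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)                # closest pair, a < b
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del dist[(a, b)]
        for c in clusters:                            # update remaining distances
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            # size-weighted mean equals the average over all leaf pairs
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Toy data: {0, 1} and {2, 3} are the two tight groups.
d = [[0, 1, 6, 7],
     [1, 0, 5, 6],
     [6, 5, 0, 2],
     [7, 6, 2, 0]]
print(upgma(d))   # ((0, 1), (2, 3))
```

The resulting tree is only the cluster hierarchy. Its left-to-right orientation is arbitrary at this point and would still have to be chosen in a separate ordering step.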
Result • Iris dataset • an example of discriminant analysis • contains 150 elements, natural clusters : 3
Result • Gene expression data: CDC15-synchronized cell cycle • a much larger dataset of gene-expression data • contains 6113 elements
Related Work • TreeView • dendrogram over a color-coded matrix
Discussion • the method successfully integrates two key methods in exploratory data analysis • cluster analysis and low-dimensional embedding • two unique properties • guaranteed separation between any kind of given clusters • the ability to deal with a predefined hierarchical clustering
Personal Opinion • Advantages • successfully integrates two kinds of methods: clustering and low-dimensional embedding • makes analysis more intuitive • Application • clustering and analyzing real data • may solve the problem of lacking clustering information • Limitation • cannot show the real shape of clusters