Learn about the DCCA method for grouping genes with varying expression patterns based on correlation coefficients. Explore the algorithm, results, and conclusions of this bioinformatics technique.
Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles • Anindya Bhattacharya and Rajat K. De • Bioinformatics, 2008
Outline • Introduction • Divisive Correlation Clustering Algorithm • Results • Conclusions
Introduction • Correlation Clustering
Correlation Clustering • Correlation clustering was proposed by Bansal et al. in Machine Learning, 2004. • It is based on the notion of graph partitioning.
Correlation Clustering • How to construct the graph? • Nodes: genes. • Edges: correlation between the genes. • Two types of edges: • Positive edge. • Negative edge.
Correlation Clustering • For example: • A positive correlation coefficient between X and Y gives a positive edge. • A negative correlation coefficient between X and Y gives a negative edge. [Figure: graph construction over genes A to H, followed by graph partitioning into Cluster 1 and Cluster 2.]
Correlation Clustering • How to measure the quality of clusters? • The number of agreements. • The number of disagreements. • The number of agreements: the number of edges (gene pairs) that the clustering treats consistently with their sign. • The number of disagreements: the number of edges that the clustering treats inconsistently with their sign.
Correlation Clustering • For example: • The measure of agreements is the sum of: (1) the number of positive edges within the same cluster and (2) the number of negative edges between different clusters. In the example: 4 + 4 = 8. • The measure of disagreements is the sum of: (1) the number of negative edges within the same cluster and (2) the number of positive edges between different clusters. In the example: 0 + 2 = 2. [Figure: genes A to E partitioned into Cluster 1 and Cluster 2.]
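The counting rule above can be sketched in a few lines of Python. The edge signs and cluster assignment below are hypothetical, chosen only so that the totals match the 4 + 4 = 8 and 0 + 2 = 2 on the slide; the original figure's edges are not recoverable from the transcript.

```python
def count_agreements(signs, cluster_of):
    """Count (agreements, disagreements) for a signed graph and a clustering.

    signs: dict mapping a gene pair (u, v) to +1 (positive edge) or -1 (negative edge).
    cluster_of: dict mapping each gene to its cluster label.
    """
    agreements = disagreements = 0
    for (u, v), sign in signs.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        # Agreement: positive edge inside a cluster, or negative edge across clusters.
        if (sign > 0) == same_cluster:
            agreements += 1
        else:
            disagreements += 1
    return agreements, disagreements


# Hypothetical example (not the slide's actual figure).
cluster_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
positive = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"), ("A", "D"), ("C", "E")]
negative = [("A", "E"), ("B", "D"), ("B", "E"), ("C", "D")]
signs = {pair: +1 for pair in positive}
signs.update({pair: -1 for pair in negative})

print(count_agreements(signs, cluster_of))  # (8, 2)
```

Every edge is either an agreement or a disagreement, so the two counts always sum to the number of edges; this is why minimizing one is equivalent to maximizing the other.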
Correlation Clustering • Goal: minimize disagreements or, equivalently, maximize agreements. • However, Bansal et al. (2004) proved that this problem is NP-complete. • Another problem: only the sign of each correlation coefficient is used, so its magnitude is ignored.
Outline • Introduction • Divisive Correlation Clustering Algorithm • Results • Conclusions
Divisive Correlation Clustering Algorithm • Pearson correlation coefficient • Terms and measurements used in DCCA • Divisive Correlation Clustering Algorithm
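Pearson correlation coefficient • Consider a set of $n$ genes, $X = \{x_1, \dots, x_n\}$, for each of which $m$ expression values are given. • The Pearson correlation coefficient between two genes $x_i$ and $x_j$ is defined as:
$$\mathrm{Corr}(x_i, x_j) = \frac{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)^2 \, \sum_{l=1}^{m} (x_{jl} - \bar{x}_j)^2}}$$
where $x_{il}$ is the $l$th sample value of gene $x_i$ and $\bar{x}_i$ is the mean value of gene $x_i$ from the $m$ samples.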
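Pearson correlation coefficient • $\mathrm{Corr}(x_i, x_j) > 0$: $x_i$ and $x_j$ are positively correlated, with the magnitude of the Pearson correlation coefficient $\mathrm{Corr}(x_i, x_j)$ as the degree of correlation. • $\mathrm{Corr}(x_i, x_j) < 0$: $x_i$ and $x_j$ are negatively correlated with value $|\mathrm{Corr}(x_i, x_j)|$.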
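As a small illustration, the Pearson correlation coefficient between two expression profiles can be computed directly from the standard definition. The function name `pearson` and the sample profiles are illustrative assumptions, and the sketch assumes neither profile is constant (otherwise the denominator is zero).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # Numerator: sum of products of deviations from the means.
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator


# Hypothetical profiles: the second rises with the first, the third falls.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
gene_c = [4.0, 3.0, 2.0, 1.0]
print(pearson(gene_a, gene_b))  # close to +1: positively correlated
print(pearson(gene_a, gene_c))  # close to -1: negatively correlated
```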
Terms and measurements used in DCCA • We define some terms and measurements used in DCCA: • Attraction • Repulsion • Attraction/Repulsion value • Average correlation value
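Terms and measurements used in DCCA • Attraction: There is an attraction between $x_i$ and $x_j$ if their Pearson correlation coefficient satisfies $\mathrm{Corr}(x_i, x_j) > 0$. • Repulsion: There is a repulsion between $x_i$ and $x_j$ if $\mathrm{Corr}(x_i, x_j) < 0$. • Attraction/Repulsion value: The magnitude of $\mathrm{Corr}(x_i, x_j)$ is the strength of the attraction or repulsion.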
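Terms and measurements used in DCCA • The genes will be grouped into $K$ disjoint clusters $C_1, C_2, \dots, C_K$. • Average correlation value: The average correlation value for a gene $x_i$ with respect to cluster $C_p$ is defined as:
$$\mathrm{AVGC}_{ip} = \frac{1}{|C_p|} \sum_{x_j \in C_p,\ j \neq i} \mathrm{Corr}(x_i, x_j)$$
where $\mathrm{Corr}(x_i, x_j)$ is the Pearson correlation coefficient and $|C_p|$ is the number of data points in $C_p$.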
Divisive Correlation Clustering Algorithm • The average correlation value of gene $x_i$ with respect to cluster $C_p$ indicates the average correlation of $x_i$ with the other genes inside $C_p$. • The average correlation value reflects the degree of inclusion of $x_i$ in cluster $C_p$.
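A minimal sketch of the average correlation value, assuming the pairwise correlations are already available as a matrix. The function name, the matrix values, and the choice to divide by the cluster size while skipping the gene itself are assumptions read from the slide text, not a confirmed detail of the published method.

```python
def average_correlation(corr, i, cluster):
    """Average correlation of gene i with the other genes in `cluster`.

    corr: square matrix (list of lists) of pairwise correlation coefficients.
    cluster: non-empty list of gene indices.
    Assumption: the sum skips gene i itself and is divided by the cluster size.
    """
    return sum(corr[i][j] for j in cluster if j != i) / len(cluster)


# Hypothetical correlation matrix for three genes.
corr = [
    [1.0, 0.8, -0.6],
    [0.8, 1.0, -0.4],
    [-0.6, -0.4, 1.0],
]
print(average_correlation(corr, 0, [0, 1]))  # 0.4: gene 0 fits cluster {0, 1}
print(average_correlation(corr, 0, [2]))     # -0.6: gene 0 is repelled by cluster {2}
```

A gene is a better fit for the cluster where this value is higher, which is how the value acts as a degree of inclusion.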
Divisive Correlation Clustering Algorithm • Input: $n$ genes $X_1, \dots, X_n$, each with $m$ samples. • Output: $K$ disjoint clusters $C_1, C_2, \dots, C_K$. [Figure: the $n \times m$ expression matrix passes through DCCA to produce the clusters.]
Divisive Correlation Clustering Algorithm • Step 1: • Step 2: for each iteration, do: • Step 2-i:
Divisive Correlation Clustering Algorithm • Step 2: • Step 2-ii: • Step 2-iii: Which cluster contains the largest repulsion value? [Figure: among clusters C1, C2, ..., Cp, the selected cluster C.]
Divisive Correlation Clustering Algorithm • Step 2-iv: [Figure: cluster C is divided into Cp, containing xi and some of the remaining genes xk, and Cq, containing xj and the other genes xk.]
Divisive Correlation Clustering Algorithm • Step 2-v: For each gene xk, place a copy of xk in the cluster among C1, C2, ..., CK for which xk has the highest average correlation value. • CNEW: the new clusters.
Divisive Correlation Clustering Algorithm • Step 2-vi: Any change between the clusters C1, C2, ..., CK and the new clusters CNEW?
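The steps above can be sketched roughly as follows. The text of Steps 1, 2-i and 2-ii did not survive in the slides, so this is only one plausible reading, not the authors' implementation. It assumes that all genes start in a single cluster, that the cluster to split is the one containing the most negative pairwise correlation, that the split is seeded by that most-repelling pair, and that the average correlation divides by cluster size. The names `dcca_sketch`, `most_repelling_pair`, and `max_iter` are hypothetical, and profiles are assumed to be non-constant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def avg_corr(corr, i, cluster):
    """Average correlation of gene i with the other genes of a cluster."""
    return sum(corr[i][j] for j in cluster if j != i) / len(cluster)


def most_repelling_pair(corr, cluster):
    """Return (value, i, j) for the most negative correlation inside a cluster."""
    best = (0.0, None, None)
    for pos, i in enumerate(cluster):
        for j in cluster[pos + 1:]:
            if corr[i][j] < best[0]:
                best = (corr[i][j], i, j)
    return best


def dcca_sketch(X, max_iter=100):
    """Cluster the rows of X (genes x samples); returns lists of gene indices."""
    n = len(X)
    corr = [[pearson(X[i], X[j]) for j in range(n)] for i in range(n)]
    clusters = [list(range(n))]  # assumed Step 1: one cluster holding every gene
    for _ in range(max_iter):
        # Step 2-iii (as read here): find the cluster with the strongest repulsion.
        value, xi, xj, target = 0.0, None, None, None
        for c in clusters:
            v, i, j = most_repelling_pair(corr, c)
            if v < value:
                value, xi, xj, target = v, i, j, c
        if target is None:
            break  # no repulsion left inside any cluster
        # Step 2-iv (as read here): split around the pair (xi, xj).
        cp, cq = [xi], [xj]
        for k in target:
            if k != xi and k != xj:
                (cp if corr[k][xi] >= corr[k][xj] else cq).append(k)
        clusters = [c for c in clusters if c is not target] + [cp, cq]
        # Steps 2-v and 2-vi: move each gene to its best cluster until stable.
        for _ in range(max_iter):
            new = [[] for _ in clusters]
            for k in range(n):
                scores = [avg_corr(corr, k, c) for c in clusters]
                new[scores.index(max(scores))].append(k)
            new = [c for c in new if c]
            changed = {frozenset(c) for c in new} != {frozenset(c) for c in clusters}
            clusters = new
            if not changed:
                break
    return clusters


# Hypothetical data: genes 0-2 rise across samples, genes 3-4 fall.
X = [
    [1, 2, 3, 4, 5],
    [3, 5, 7, 9, 11],
    [1.1, 1.9, 3.2, 3.9, 5.0],
    [5, 4, 3, 2, 1],
    [-1, -4, -7, -10, -13],
]
print(sorted(sorted(c) for c in dcca_sketch(X)))  # [[0, 1, 2], [3, 4]]
```

Note that the number of clusters is never passed in: splitting simply stops once no cluster contains a negatively correlated pair. The `max_iter` cap is a safeguard added for this sketch.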
Outline • Introduction • Divisive Correlation Clustering Algorithm • Results • Conclusions
Results • Performance comparison • A synthetic dataset ADS • Nine gene expression datasets
Performance comparison • A synthetic dataset ADS: Three groups.
Performance comparison • Experimental results: Clustering correctly.
Performance comparison • Experimental results: Undesired clusters.
Performance comparison • Five yeast datasets: • Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, Yeast Cho et al. • Four mammalian datasets: • GDS958 Wild type, GDS958 Knocked out, GDS1423, GDS2745.
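Performance comparison • The z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in the clusters. • Mutual information between a clustering result $C$ and the $N_A$ attributes $A_1, \dots, A_{N_A}$:
$$MI(C, A_1, \dots, A_{N_A}) = N_A\,H(C) + \sum_{i=1}^{N_A} H(A_i) - \sum_{i=1}^{N_A} H(C, A_i)$$
where $H(C, A_i)$ are the entropies for each cluster-attribute pair, $H(A_i)$ are the entropies for each of the $N_A$ attributes independent of clusters, and $H(C)$ is the entropy for the clustering result independent of attributes.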
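For a single attribute, the mutual information between a clustering and the annotation reduces to H(C) + H(A) - H(C, A). The sketch below computes it from label lists; the function names and the use of base-2 logarithms are assumptions, and the labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(clusters, attribute):
    """MI between cluster labels and one attribute: H(C) + H(A) - H(C, A)."""
    joint = list(zip(clusters, attribute))  # one (cluster, attribute) pair per gene
    return entropy(clusters) + entropy(attribute) - entropy(joint)


# Hypothetical labels for four genes.
clusters = [0, 0, 1, 1]
informative = [1, 1, 0, 0]    # attribute fully determined by the cluster
uninformative = [0, 1, 0, 1]  # attribute independent of the cluster
print(mutual_information(clusters, informative))    # 1.0
print(mutual_information(clusters, uninformative))  # 0.0
```

Summing this quantity over all attributes gives the total mutual information used in the z-score.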
Performance comparison • The z-score is defined as:
$$z = \frac{MI_{real} - MI_{random}}{s_{random}}$$
where $MI_{real}$ is the computed MI for the clustered data, using the attribute database. • The random MI-values are obtained by computing MI for a clustering in which genes are randomly assigned to clusters of uniform size, repeating until a distribution of values is obtained. • $MI_{random}$ is the mean of these MI-values and $s_{random}$ is their standard deviation.
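Given the MI of the real clustering and a list of MI-values from random clusterings, the z-score is a one-line computation. The function name and the numbers below are hypothetical, and the sample standard deviation is an assumption, since the slides do not say which estimator is used.

```python
import statistics


def z_score(mi_real, mi_random_values):
    """z = (MI_real - mean of random MIs) / standard deviation of random MIs.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mean_random = statistics.mean(mi_random_values)
    sd_random = statistics.stdev(mi_random_values)
    return (mi_real - mean_random) / sd_random


# Hypothetical MI values: real clustering vs. three random clusterings.
print(z_score(0.5, [0.1, 0.2, 0.3]))  # approximately 3.0
```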
Performance comparison • A higher value of z indicates that genes are better clustered by function, i.e. a more biologically relevant clustering result. • Gibbons' ClusterJudge tool is used to calculate the z-scores for the five yeast datasets.
Performance comparison • Experimental results:
Outline • Introduction • Divisive Correlation Clustering Algorithm • Results • Conclusions
Conclusions • Pros: • DCCA is able to obtain clustering solutions with high biological significance from gene-expression datasets. • DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.
Conclusions • Cons: • The computational cost of repairing any misplacement that occurs in the clustering step is high. • DCCA will not work if the dataset contains fewer than 3 samples: with only two samples, the correlation value will always be either +1 or -1.
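The last limitation is easy to check numerically: any two points lie exactly on a line, so the Pearson coefficient of two-sample profiles is always +1 or -1 (and undefined if a profile is constant). The sketch below uses randomly generated, purely illustrative profiles.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


# Ten hypothetical pairs of genes, each measured in only two samples.
rng = random.Random(0)
values = [
    pearson([rng.random(), rng.random()], [rng.random(), rng.random()])
    for _ in range(10)
]
# Every coefficient has magnitude 1 (up to rounding), so only the sign carries information.
print(all(math.isclose(abs(r), 1.0) for r in values))
```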