
COSC 6335 Fall 2013 Post Analysis Project2




  1. COSC 6335 Fall 2013 Post Analysis Project2, Christoph F. Eick

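  2. Arko’s Agreement Code

      # x and y are clustering results whose $cluster component holds one
      # label per object; label 0 marks outliers.
      agreement = function(x, y) {
        max <- NROW(x$cluster);
        count <- 0;
        total <- max * (max + 1) / 2;   # number of pairs (i, j) with i <= j
        for (i in 1:max) {
          for (j in i:max) {
            if (j != i) {
              # The pair agrees if both clusterings put i and j together,
              # or both put them apart.
              if ((x$cluster[j] == x$cluster[i] & y$cluster[j] == y$cluster[i]) |
                  (x$cluster[j] != x$cluster[i] & y$cluster[j] != y$cluster[i]))
                count <- count + 1;
            } else {
              # Diagonal case (j == i): as transcribed, only x is tested, so
              # this holds for every non-negative label.
              if ((x$cluster[i] == 0 & x$cluster[j] == 0) |
                  (x$cluster[i] > 0 & x$cluster[j] > 0))
                count <- count + 1;
            }
          }
        }
        returnValue <- count / total;
        return(returnValue);
      }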
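The double loop above can be slow for a few thousand objects. A minimal vectorized sketch of the same pairwise idea follows. The name `agreement_vec` is hypothetical, it takes plain label vectors rather than objects with a `$cluster` component, and it assumes that the diagonal term is meant to compare x against y (an object counts if it is an outlier in both clusterings or in neither). That last point is an interpretation, not something the slide states.

```r
# Hypothetical vectorized variant; cx and cy are label vectors (0 = outlier).
agreement_vec <- function(cx, cy) {
  stopifnot(length(cx) == length(cy))
  n <- length(cx)
  same_x <- outer(cx, cx, "==")        # TRUE where i and j share a cluster in x
  same_y <- outer(cy, cy, "==")        # TRUE where i and j share a cluster in y
  agree <- same_x == same_y            # together in both, or apart in both
  pair_count <- sum(agree[upper.tri(agree)])   # pairs with i < j only
  # ASSUMPTION: diagonal term compares outlier status across x and y.
  self_count <- sum((cx == 0) == (cy == 0))
  (pair_count + self_count) / (n * (n + 1) / 2)
}

identical_case <- agreement_vec(c(1, 1, 2, 2, 0), c(1, 1, 2, 2, 0))
crossed_case   <- agreement_vec(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

The `outer` calls build n-by-n logical matrices, so memory grows quadratically with the number of objects; for datasets the size of Complex8 that should still be manageable.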

  3. K-means for Complex8. In general, the turquoise and pink clusters are bad, whereas the brown and green clusters are okay.

  4. Arko’s Code for the Purity Function (except what is in red)

      # a: cluster labels, b: class labels. With outliers=TRUE, the first row
      # of the table is taken to be cluster 0 (the outliers).
      purity <- function(a, b, outliers = FALSE) {
        require('matrixStats');
        t <- table(a, b);
        rowTotals <- rowSums(t);   # the same can be done with apply(t, 1, sum)
        rowMax <- apply(t, 1, max);
        if (!outliers) {
          purity <- sum(rowMax) / sum(rowTotals);
          return(purity)
        } else {
          if (NROW(rowTotals) > 1) {
            purity <- (sum(rowMax) - rowMax[1]) / (sum(rowTotals) - rowTotals[1]);
          } else {
            purity <- NA;
          }
          pcOutliers = rowTotals[1] / (sum(rowTotals));
          returnVector = vector(mode = 'double', length = 2);
          returnVector[1] = purity;
          returnVector[2] = pcOutliers;
          return(returnVector);
        }
      }
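The purity function above drops the first row of the table when `outliers=TRUE`, which is only the outlier cluster if a cluster labelled 0 actually exists. A minimal sketch of a variant that finds the outlier row by its label instead of its position is given below. The name `purity_by_label` and the toy label vectors are hypothetical and exist only for illustration.

```r
# Hypothetical helper (not from the slides): purity with the outlier row
# located by its label "0" rather than by position.
purity_by_label <- function(clusters, classes, outliers = FALSE) {
  t <- table(clusters, classes)          # rows = clusters, columns = classes
  row_totals <- rowSums(t)
  row_max <- apply(t, 1, max)            # majority-class count per cluster
  if (!outliers) {
    return(sum(row_max) / sum(row_totals))
  }
  is_out <- rownames(t) == "0"           # TRUE only for the outlier cluster
  p <- if (any(!is_out)) {
    sum(row_max[!is_out]) / sum(row_totals[!is_out])
  } else {
    NA_real_
  }
  c(purity = p, outliers = sum(row_totals[is_out]) / sum(row_totals))
}

# Toy DBSCAN-style labels: cluster 0 holds two outliers.
db_clusters <- c(0, 0, 1, 1, 1, 2, 2, 2, 2)
db_classes  <- c("a", "b", "a", "a", "b", "b", "b", "b", "a")
db_result <- purity_by_label(db_clusters, db_classes, outliers = TRUE)

# Toy k-means-style labels: no cluster 0, so nothing should be excluded.
km_clusters <- c(1, 1, 2, 2)
km_classes  <- c("a", "a", "b", "a")
km_result <- purity_by_label(km_clusters, km_classes, outliers = TRUE)
```

On the toy DBSCAN labels this gives a purity of 5/7 with 2/9 of the objects counted as outliers. On the toy k-means labels it gives 3/4 with no outliers, whereas dropping the first row by position would exclude cluster 1 and report 1/2.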

  5. Task 4: Characterizing the 5 Clusters. Remark: Because we use k-means, almost everybody should have different clusters and summaries.

  6. Project2 Observations
     • Assuming purity is used as the evaluation measure, DBSCAN outperformed K-means quite significantly on the Complex8 dataset, as K-means was not able to detect the natural clusters; on the other hand, for the Yeast dataset K-means obtained better results than DBSCAN. In general, DBSCAN seems to either create one very big cluster or produce a clustering with a lot of outliers, and it seemed to be very difficult (or even impossible) to obtain solutions that lie between these extremes.
     • A lot of students failed to observe that K-means fails to identify the natural clusters in the Complex8 dataset.
     • For the purity function, some code ignored the assumption that outliers are in cluster zero and obtained incorrect results; e.g., it included the objects in cluster 0 in purity computations for DBSCAN results, or excluded cluster 1 when computing purity for K-means clusterings.
     • For Task 4 the main goal was to characterize the objects in clusters 1-5; a lot of students did not put enough focus on this task; e.g., they provided a general analysis of the box plots rather than analyzing them with respect to separating the 5 clusters and with respect to differences between the distribution in a particular cluster and the distribution in the dataset.
     • About 35% of the students provided quite sophisticated search procedures to find good DBSCAN parameter settings; unfortunately, I had a very hard time understanding most of the chosen approaches, due to a lack of explanation and of examples that illustrate the approach.
     • There were quite dramatic differences with respect to the amount of work and the quality of the approaches/solutions obtained for Tasks 4 and 6. Overall, some really good work was done by some students for Tasks 4 and/or 6 (score = 9 or higher).
     • Challenges for Task 6 include:
        • Finding an acceptable range of parameter values so that DBSCAN creates at least “okay” results
        • How to search for good solutions in that range
     • Another observation: if we maximize purity, using a large number of clusters might be beneficial for obtaining better results; however, how to embed this knowledge into the search procedure is a challenge.
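The two Task 6 challenges (choosing a parameter range, then searching it) can be illustrated with a plain grid search. The sketch below is only an illustration under stated assumptions: `chain_cluster` is a hypothetical 1-D stand-in for a density-based clusterer, not DBSCAN itself, and `search_params`, the outlier cap `max_outliers`, and the tie-break on fewer clusters are all choices made for this example, not something the slides prescribe.

```r
# Stand-in clusterer for 1-D data (NOT DBSCAN): sorted neighbours closer than
# eps are chained; chains with fewer than min_pts points become outliers (0).
chain_cluster <- function(x, eps, min_pts) {
  ord <- order(x)
  chain_id <- cumsum(c(1, diff(x[ord]) > eps))   # chain index per sorted point
  sizes <- table(chain_id)
  small <- as.integer(names(sizes)[sizes < min_pts])
  chain_id[chain_id %in% small] <- 0
  kept <- unique(chain_id[chain_id > 0])
  labels_sorted <- ifelse(chain_id == 0, 0, match(chain_id, kept))
  labels <- numeric(length(x))
  labels[ord] <- labels_sorted                   # back to the input order
  labels
}

# Purity over non-outlier objects only (cluster 0 is ignored).
purity_no_outliers <- function(clusters, classes) {
  keep <- clusters != 0
  if (!any(keep)) return(NA_real_)
  t <- table(clusters[keep], classes[keep])
  sum(apply(t, 1, max)) / sum(t)
}

# Exhaustive search over an eps x min_pts grid, with a cap on outliers.
search_params <- function(x, classes, eps_grid, min_pts_grid, max_outliers) {
  grid <- expand.grid(eps = eps_grid, min_pts = min_pts_grid)
  grid$purity <- NA_real_
  grid$outliers <- NA_real_
  grid$k <- NA_integer_
  for (r in seq_len(nrow(grid))) {
    cl <- chain_cluster(x, grid$eps[r], grid$min_pts[r])
    grid$purity[r] <- purity_no_outliers(cl, classes)
    grid$outliers[r] <- mean(cl == 0)            # fraction of outliers
    grid$k[r] <- length(unique(cl[cl != 0]))     # number of real clusters
  }
  ok <- grid[!is.na(grid$purity) & grid$outliers <= max_outliers, ]
  ok[order(-ok$purity, ok$k), ]                  # best purity, then fewer clusters
}

toy_x <- c(1.0, 1.2, 1.4, 5.0, 5.1, 5.3, 9.0)
toy_classes <- c("a", "a", "a", "b", "b", "b", "a")
best <- search_params(toy_x, toy_classes, eps_grid = c(0.05, 0.5, 5),
                      min_pts_grid = c(2, 3), max_outliers = 0.2)
```

Preferring fewer clusters among equally pure solutions is one simple way to counter the tendency of purity to reward many small clusters; with a real DBSCAN implementation, only the clustering call inside the loop would need to be swapped.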

  7. Optimal DBSCAN Clustering for Complex 8
     For the complex8 dataset, the best results are as follows:
       Purity = 1
       Outliers = 0.4704038%
       Number of Clusters = 19 (20, if we include cluster 0 as outliers)
       Eps = 12.8
       MinPts = 3
     Remark: 3 students found clusterings with 100% purity (one extra point for that; results still need to be verified).

     Rows: cluster (0 = outliers); columns: class.

            0    1    2    3    4    5    6    7
       0    0    1    1    1    5    1    3    0
       1   60    0    0    0    0    0    0    0
       2    0   57    0    0    0    0    0    0
       3    0    0  518    0    0    0    0    0
       4    0    0    0  482    0    0    0    0
       5    0    0    0    0   57    0    0    0
       6    0    0    0    0    5    0    0    0
       7    0    0    0    0    3    0    0    0
       8    0    0    0    0   12    0    0    0
       9    0    0    0    0   14    0    0    0
      10    0    0    0    0    7    0    0    0
      11    0    0    0    0   10    0    0    0
      12    0    0    0    0    8    0    0    0
      13    0    0    0    0    0   10    0    0
      14    0    0    0    0    0  113    0    0
      15    0    0    0    0    0  245    0    0
      16    0    0    0    0    0    0   66    0
      17    0    0    0    0    0    0  407    0
      18    0    0    0    0    0    0    8    0
      19    0    0    0    0    0    0    0  403
      20    0    0    0    0    0    0    0   54
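The reported purity and outlier percentage can be re-derived from the contingency table on this slide. The sketch below re-enters the counts by hand and assumes that rows are clusters (row 0 being the outliers) and columns are classes; the variable names are made up for this example.

```r
# Counts copied from the slide: one row per cluster 0..20, one column per class 0..7.
counts <- matrix(c(
    0,   1,   1,   1,   5,   1,   3,   0,
   60,   0,   0,   0,   0,   0,   0,   0,
    0,  57,   0,   0,   0,   0,   0,   0,
    0,   0, 518,   0,   0,   0,   0,   0,
    0,   0,   0, 482,   0,   0,   0,   0,
    0,   0,   0,   0,  57,   0,   0,   0,
    0,   0,   0,   0,   5,   0,   0,   0,
    0,   0,   0,   0,   3,   0,   0,   0,
    0,   0,   0,   0,  12,   0,   0,   0,
    0,   0,   0,   0,  14,   0,   0,   0,
    0,   0,   0,   0,   7,   0,   0,   0,
    0,   0,   0,   0,  10,   0,   0,   0,
    0,   0,   0,   0,   8,   0,   0,   0,
    0,   0,   0,   0,   0,  10,   0,   0,
    0,   0,   0,   0,   0, 113,   0,   0,
    0,   0,   0,   0,   0, 245,   0,   0,
    0,   0,   0,   0,   0,   0,  66,   0,
    0,   0,   0,   0,   0,   0, 407,   0,
    0,   0,   0,   0,   0,   0,   8,   0,
    0,   0,   0,   0,   0,   0,   0, 403,
    0,   0,   0,   0,   0,   0,   0,  54
  ), ncol = 8, byrow = TRUE,
  dimnames = list(as.character(0:20), as.character(0:7)))

row_totals <- rowSums(counts)
row_max <- apply(counts, 1, max)          # majority-class count per cluster

n_objects <- sum(counts)                  # all objects, outliers included
# Purity over the non-outlier clusters only (first row = cluster 0 is dropped).
purity_value <- sum(row_max[-1]) / sum(row_totals[-1])
outlier_pct <- unname(100 * row_totals[1] / n_objects)
```

Every non-outlier row has a single non-zero entry, which is why the purity comes out as exactly 1; the 12 objects in row 0 out of 2551 in total give the 0.4704038% outlier figure.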

  8. “Optimal” Complex8 DBSCAN Clustering
