COSC 6335 Fall 2013 Post Analysis Project2

Christoph F. Eick

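Arko’s Agreement Code

    agreement <- function(x, y) {
      n <- NROW(x$cluster)
      count <- 0
      total <- n * (n + 1) / 2          # all pairs (i, j) with i <= j
      for (i in 1:n) {
        for (j in i:n) {
          if (j != i) {
            # pair agrees if both clusterings put i and j together, or both separate them
            if ((x$cluster[j] == x$cluster[i] && y$cluster[j] == y$cluster[i]) ||
                (x$cluster[j] != x$cluster[i] && y$cluster[j] != y$cluster[i]))
              count <- count + 1
          } else {
            # diagonal pair (i, i): counted when object i is an outlier (cluster 0)
            # or a non-outlier in x
            if ((x$cluster[i] == 0 && x$cluster[j] == 0) ||
                (x$cluster[i] > 0 && x$cluster[j] > 0))
              count <- count + 1
          }
        }
      }
      returnValue <- count / total
      return(returnValue)
    }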
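The pairwise part of the agreement computation can also be written without explicit loops. The following is only a sketch, not the code from the slide: `pair_agreement` is a hypothetical name, it takes plain label vectors (for example `km$cluster`) instead of clustering objects, and it counts only pairs with i < j, so its denominator is n(n-1)/2 rather than the n(n+1)/2 used above. Like the pairwise branch above, it treats cluster 0 as an ordinary label.

```r
# Sketch: fraction of object pairs (i < j) on which two clusterings agree.
# cx and cy are plain label vectors of equal length (an assumption of this sketch).
pair_agreement <- function(cx, cy) {
  stopifnot(length(cx) == length(cy), length(cx) > 1)
  same_x <- outer(cx, cx, "==")   # TRUE where i and j share a cluster in x
  same_y <- outer(cy, cy, "==")   # TRUE where i and j share a cluster in y
  agree  <- same_x == same_y      # together in both, or separated in both
  pairs  <- upper.tri(agree)      # select each pair i < j exactly once
  sum(agree[pairs]) / sum(pairs)
}

pair_agreement(c(1, 1, 2, 2), c(1, 1, 1, 2))  # 3 of the 6 pairs agree: 0.5
```

The `outer` calls build n-by-n matrices, so this trades memory for speed; for a dataset the size of Complex8 that is still a few million logical entries per matrix.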

K-means for Complex8

In general, the turquoise and the pink clusters are bad, whereas the brown and green clusters are okay.

Arko’s Code for the Purity Function (except what is in red)

    purity <- function(a, b, outliers = FALSE) {
      require('matrixStats')            # not strictly needed: rowSums is in base R
      t <- table(a, b)                  # rows: clusters, columns: classes
      rowTotals <- rowSums(t)           # the same can be done with apply(t, 1, sum)
      rowMax <- apply(t, 1, max)        # size of the majority class in each cluster
      if (!outliers) {
        purity <- sum(rowMax) / sum(rowTotals)
        return(purity)
      } else {
        # the first row is assumed to be cluster 0 (the outliers) and is excluded
        if (NROW(rowTotals) > 1) {
          purity <- (sum(rowMax) - rowMax[1]) / (sum(rowTotals) - rowTotals[1])
        } else {
          purity <- NA
        }
        pcOutliers <- rowTotals[1] / sum(rowTotals)
        returnVector <- vector(mode = 'double', length = 2)
        returnVector[1] <- purity
        returnVector[2] <- pcOutliers
        return(returnVector)
      }
    }
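To make the two branches of the purity computation concrete, here is a small worked example on made-up labels (the vectors `clusters` and `classes` are invented for illustration and are not project data). It repeats the same arithmetic step by step, with cluster 0 standing for the outliers.

```r
# Toy data: 9 objects, cluster 0 = outliers, two real clusters, two classes.
clusters <- c(0, 0, 1, 1, 1, 2, 2, 2, 2)
classes  <- c("a", "b", "a", "a", "b", "b", "b", "b", "a")

tab        <- table(clusters, classes)  # rows: clusters 0, 1, 2
row_max    <- apply(tab, 1, max)        # majority-class counts: 1, 2, 3
row_totals <- rowSums(tab)              # cluster sizes: 2, 3, 4

# Outliers treated like any other cluster (the outliers = FALSE branch): 6/9
purity_all <- sum(row_max) / sum(row_totals)

# Row 1 (cluster 0) left out of numerator and denominator: (6 - 1) / (9 - 2) = 5/7
purity_without_outliers <- (sum(row_max) - row_max[1]) /
                           (sum(row_totals) - row_totals[1])

# Share of objects that are outliers: 2/9
pct_outliers <- row_totals[1] / sum(row_totals)
```

The second formula is only meaningful when the first row of the table really is cluster 0; on a clustering without outliers the first row is cluster 1 and would be dropped by mistake.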

Task 4: Characterizing the 5 Clusters

Remark: As we use k-means, almost everybody should have different clusters and summaries.
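One possible way to characterize clusters is to compare each cluster's attribute distribution with the distribution in the whole dataset. The sketch below is an illustration only: it uses R's built-in `iris` measurements as stand-in data (not the project dataset), and the standardized-difference summary is one choice among many.

```r
set.seed(42)                  # k-means starts from random centers; fixing the
                              # seed makes one particular run reproducible
X  <- iris[, 1:4]             # stand-in data, an assumption of this sketch
km <- kmeans(X, centers = 5, nstart = 10)

# Mean of every attribute per cluster, and over the whole dataset.
cluster_means <- aggregate(X, by = list(cluster = km$cluster), FUN = mean)
overall_means <- colMeans(X)

# Difference from the dataset mean, in units of the dataset standard deviation.
# Large absolute values mark attributes that set a cluster apart.
z <- sweep(as.matrix(cluster_means[, -1]), 2, overall_means, "-")
z <- sweep(z, 2, apply(X, 2, sd), "/")
rownames(z) <- cluster_means$cluster
print(round(z, 2))

# Visual counterpart: one box per cluster for a single attribute, e.g.
# boxplot(X$Petal.Length ~ km$cluster)
```

Without the `set.seed` call, repeated runs can assign objects to clusters differently, which is why different students end up with different clusters and summaries.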

Project2 Observations

• Assuming purity is used as the evaluation measure, DBSCAN outperformed K-means quite significantly on the Complex8 dataset, as K-means was not able to detect the natural clusters; on the other hand, for the Yeast dataset K-means obtained better results than DBSCAN. In general, DBSCAN seems either to create one very big cluster or to produce a clustering with a lot of outliers, and it seemed to be very difficult (or even impossible) to obtain solutions that lie between these extremes.
• A lot of students failed to observe that K-means fails to identify the natural clusters in the Complex8 dataset.
• For the purity function, some code ignored the assumption that outliers are in cluster 0 and obtained incorrect results, e.g. by including the objects in cluster 0 in purity computations for DBSCAN results, or by excluding cluster 1 when computing purity for K-means clusterings.
• For Task 4 the main goal was to characterize the objects in clusters 1-5; a lot of students did not put enough focus on this task, e.g. they provided a general analysis of box plots rather than analyzing the box plots with respect to separating the 5 clusters and with respect to differences between the distribution in a particular cluster and the distribution in the dataset.
• About 35% of the students provided quite sophisticated search procedures to find good DBSCAN parameter settings; unfortunately, I had a very hard time understanding most of the chosen approaches, due to a lack of explanation and of examples that illustrate the approach.
• There were quite dramatic differences with respect to the amount of work and the quality of the approaches/solutions obtained for Tasks 4 and 6. Overall, some really good work was done by some students for Tasks 4 and/or 6 (score = 9 or higher).
• Challenges for Task 6 include:
  • Finding an acceptable range of parameter values so that DBSCAN creates at least “okay” results
  • How to search for good solutions in that range
• Another observation: if we maximize purity, using a large number of clusters might be beneficial for obtaining better results; however, how to embed this knowledge into the search procedure is a challenge…
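The two Task 6 challenges (choosing a parameter range, then searching it) can be illustrated with a plain grid search. This is a minimal sketch under stated assumptions, not any student's procedure: `score_clustering`, `grid_search`, and the `max_outliers` cap are hypothetical names and choices, and the clustering routine is passed in as `cluster_fun(data, eps, minpts)`, which must return a label vector with 0 for outliers.

```r
# Purity over non-outlier objects plus the outlier fraction (cluster 0 = outliers).
score_clustering <- function(clusters, classes) {
  keep <- clusters != 0
  if (!any(keep)) return(c(purity = NA, outliers = 1))
  tab <- table(clusters[keep], classes[keep])
  c(purity = sum(apply(tab, 1, max)) / sum(tab), outliers = mean(!keep))
}

# Try every (eps, minPts) combination; keep the purest clustering whose
# outlier fraction stays below max_outliers (a cap assumed for this sketch).
grid_search <- function(data, classes, eps_values, minpts_values,
                        cluster_fun, max_outliers = 0.1) {
  best <- NULL
  for (eps in eps_values) {
    for (minpts in minpts_values) {
      score <- score_clustering(cluster_fun(data, eps, minpts), classes)
      if (is.na(score["purity"]) || score["outliers"] > max_outliers) next
      if (is.null(best) || score["purity"] > best$purity) {
        best <- list(eps = eps, minPts = minpts,
                     purity = unname(score["purity"]),
                     outliers = unname(score["outliers"]))
      }
    }
  }
  best
}

# Possible use with the dbscan package (assumes it is installed):
# run_dbscan <- function(d, eps, minpts) dbscan::dbscan(d, eps = eps, minPts = minpts)$cluster
# grid_search(X, y, seq(5, 20, by = 0.5), 2:6, run_dbscan)
```

The outlier cap is one simple guard against a degenerate case: without it, purity alone can be driven up by declaring most objects outliers, and it does nothing to discourage a very large number of tiny clusters.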

Optimal DBSCAN Clustering for Complex8

For the Complex8 dataset, the best results are as follows:

Purity = 1
Outliers = 0.4704038%
Number of Clusters = 19 (20, if we include cluster 0 as outliers)
Eps = 12.8
MinPts = 3

Remark: Three students found clusterings with 100% purity (one extra point for that; results still need to be verified).

Contingency table (rows: DBSCAN clusters, with cluster 0 = outliers; columns: classes):

         0    1    2    3    4    5    6    7
    0    0    1    1    1    5    1    3    0
    1   60    0    0    0    0    0    0    0
    2    0   57    0    0    0    0    0    0
    3    0    0  518    0    0    0    0    0
    4    0    0    0  482    0    0    0    0
    5    0    0    0    0   57    0    0    0
    6    0    0    0    0    5    0    0    0
    7    0    0    0    0    3    0    0    0
    8    0    0    0    0   12    0    0    0
    9    0    0    0    0   14    0    0    0
   10    0    0    0    0    7    0    0    0
   11    0    0    0    0   10    0    0    0
   12    0    0    0    0    8    0    0    0
   13    0    0    0    0    0   10    0    0
   14    0    0    0    0    0  113    0    0
   15    0    0    0    0    0  245    0    0
   16    0    0    0    0    0    0   66    0
   17    0    0    0    0    0    0  407    0
   18    0    0    0    0    0    0    8    0
   19    0    0    0    0    0    0    0  403
   20    0    0    0    0    0    0    0   54
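The reported purity and outlier percentage can be recomputed from the contingency table on this slide. The sketch below types the table in by hand and assumes, as in the purity function, that rows are clusters (row 0 holding the outliers) and columns are classes.

```r
# Contingency table from the slide, one row per line (rows 0..20, columns 0..7).
counts <- matrix(c(
   0,  1,   1,   1,  5,   1,   3,   0,
  60,  0,   0,   0,  0,   0,   0,   0,
   0, 57,   0,   0,  0,   0,   0,   0,
   0,  0, 518,   0,  0,   0,   0,   0,
   0,  0,   0, 482,  0,   0,   0,   0,
   0,  0,   0,   0, 57,   0,   0,   0,
   0,  0,   0,   0,  5,   0,   0,   0,
   0,  0,   0,   0,  3,   0,   0,   0,
   0,  0,   0,   0, 12,   0,   0,   0,
   0,  0,   0,   0, 14,   0,   0,   0,
   0,  0,   0,   0,  7,   0,   0,   0,
   0,  0,   0,   0, 10,   0,   0,   0,
   0,  0,   0,   0,  8,   0,   0,   0,
   0,  0,   0,   0,  0,  10,   0,   0,
   0,  0,   0,   0,  0, 113,   0,   0,
   0,  0,   0,   0,  0, 245,   0,   0,
   0,  0,   0,   0,  0,   0,  66,   0,
   0,  0,   0,   0,  0,   0, 407,   0,
   0,  0,   0,   0,  0,   0,   8,   0,
   0,  0,   0,   0,  0,   0,   0, 403,
   0,  0,   0,   0,  0,   0,   0,  54
), ncol = 8, byrow = TRUE,
   dimnames = list(cluster = as.character(0:20), class = as.character(0:7)))

clustered   <- counts[rownames(counts) != "0", ]         # drop the outlier row
purity      <- sum(apply(clustered, 1, max)) / sum(clustered)
outlier_pct <- 100 * sum(counts["0", ]) / sum(counts)    # 12 of 2551 objects

purity       # 1: every non-outlier row has a single non-zero entry
outlier_pct  # about 0.4704
```

Both values agree with the figures reported on the slide.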

“Optimal” Complex8 DBSCAN Clustering
