Comprehensive Overview of Clustering and Classification Techniques in Data Analysis

Special Techniques

Cluster Analysis
• Classification or categorization of cases into groups
• The classification is mathematical and objective, while the interpretation is somewhat subjective
• Minimize within-group variation and maximize between-group variation
• Data exploration
• Used when the data structure is unknown
• 3 basic clustering methods
  • Hierarchical (n < 200)
  • K Means (n > 200)
  • 2 Step (large samples; categorical or continuous variables)
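The goal of minimizing within-group variation and maximizing between-group variation can be illustrated with sums of squares. The following is a minimal sketch for one-dimensional data; the function name `within_between_ss` and the sample values are made up for illustration and are not from the original slides.

```python
from statistics import mean

def within_between_ss(groups):
    """Return (within-group SS, between-group SS) for 1-D data.

    groups: a list of lists, one list of numeric values per cluster.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Within: squared distance of each value from its own group mean
    within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    # Between: squared distance of each group mean from the grand mean,
    # weighted by group size
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    return within, between

# Two ways of splitting the same six made-up values into two clusters
good = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
poor = [[1.0, 12.0, 3.0], [11.0, 2.0, 13.0]]

print("good split:", within_between_ss(good))
print("poor split:", within_between_ss(poor))
```

Both splits have the same total variation; the better split puts most of it between the groups rather than within them.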

Hierarchical
• Clusters are nested
• Larger clusters at later stages may contain smaller clusters from earlier stages
• Evaluate results with a dendrogram and the agglomeration schedule
• Use K Means with a specified number of clusters to validate
• Several options for distance measure and clustering method
• Interval or count data
• Interval data: squared Euclidean or Euclidean distance measure with between-groups linkage
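As a rough illustration of how nested clusters and an agglomeration schedule arise, the sketch below merges the two closest clusters at each stage, using squared Euclidean distance with between-groups (average) linkage. It is a simplified teaching version, not the procedure of any particular statistics package; the function names and sample points are assumptions.

```python
from itertools import combinations

def sq_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_linkage(points, c1, c2):
    """Mean squared Euclidean distance over all cross-cluster pairs."""
    total = sum(sq_euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns the agglomeration schedule: one row per stage, as
    (members of cluster A, members of cluster B, linkage distance).
    """
    clusters = [[i] for i in range(len(points))]
    schedule = []
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(points, clusters[p[0]], clusters[p[1]]),
        )
        dist = average_linkage(points, clusters[a], clusters[b])
        schedule.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return schedule

# Made-up points: two tight pairs and one outlying case
points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.5, 5.0), (9.0, 1.0)]
for left, right, dist in agglomerate(points):
    print(left, "+", right, "at distance", round(dist, 3))
```

A large jump in the linkage distance between consecutive stages is one common hint about where to stop merging.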

K Means
• Uses Euclidean distance
• Desired number of clusters is specified in advance
• Does not require a case-by-case proximity matrix
• At each iteration, observations are grouped by distance to the cluster means, and the cluster means shift after each iteration
• Similar to ANOVA
• Iterations stop when the cluster means are stable or when the defined iteration limit is reached
• Final decision on the number of clusters is subjective
• Review the raw data carefully alongside the new cluster memberships, checking several example cases
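The iteration described above (assign each observation to the nearest cluster mean, recompute the means, stop when they are stable or a limit is reached) can be sketched in a few lines. This is a bare-bones version for illustration; the function name `kmeans`, the starting means, and the data are assumptions, and real packages add refinements such as smarter initialization.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain k-means: assign each point to the nearest mean, then update means.

    centers: initial cluster means; their count fixes the number of clusters.
    Stops when the means no longer move or max_iter is reached.
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest cluster mean by Euclidean distance
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each mean moves to the centroid of its members
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's mean unchanged
        if new_centers == centers:
            break  # cluster means are stable
        centers = new_centers
    return labels, centers

# Made-up data with two visible groups; the starting means are deliberately poor
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 8.0)]
labels, centers = kmeans(points, centers=[(1.0, 1.0), (1.5, 2.0)])
print(labels)
print(centers)
```

Only distances from each case to the current means are computed, which is why no case-by-case proximity matrix is needed.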

2 Step
• Very large datasets
• Categorical or continuous data
• Pre-clusters are identified first and then used as input to a hierarchical procedure
• Randomize the case order, since results can depend on it

Discriminant Function Analysis
• Logistic regression is more popular now
• Classifies cases into the values of a dichotomous dependent variable
• Purposes
  • To classify cases into groups using a discriminant prediction equation
  • To test theory by observing whether cases are classified as predicted
  • To investigate differences between or among groups
  • To determine the most parsimonious way to distinguish among groups
  • To determine the percent of variance in the dependent variable explained by the independent variables
  • To assess the relative importance of the independent variables in classifying the dependent variable
  • To discard variables that are only weakly related to group distinctions
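To make the idea of a discriminant prediction equation concrete, here is a minimal sketch of Fisher's linear discriminant for two groups and exactly two predictors, with the 2x2 matrix inverse written out by hand. Equal prior probabilities are assumed, and the function names and data are invented for illustration; this is not how any specific package implements the analysis.

```python
def mean_vector(rows):
    """Mean of each of the two predictors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, m):
    """Sums of squares and cross-products around mean m: (sxx, sxy, syy)."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    return sxx, sxy, syy

def fit_discriminant(group0, group1):
    """Return (weights, cutoff) of a two-group linear discriminant function."""
    m0, m1 = mean_vector(group0), mean_vector(group1)
    a0, b0, c0 = scatter(group0, m0)
    a1, b1, c1 = scatter(group1, m1)
    dof = len(group0) + len(group1) - 2
    a, b, c = (a0 + a1) / dof, (b0 + b1) / dof, (c0 + c1) / dof  # pooled covariance
    det = a * c - b * b
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    # weights = inverse(pooled covariance) times (m1 - m0), for the 2x2 case
    w = [(c * d[0] - b * d[1]) / det, (-b * d[0] + a * d[1]) / det]
    # Cutoff halfway between the two group means (assumes equal priors)
    cutoff = sum(wj * (m0[j] + m1[j]) / 2 for j, wj in enumerate(w))
    return w, cutoff

def classify(x, w, cutoff):
    """Predict group 1 if the discriminant score exceeds the cutoff, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > cutoff else 0

# Made-up cases with two predictors each
group0 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
group1 = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 7.0)]
w, cutoff = fit_discriminant(group0, group1)

# Share of cases classified into the group they actually belong to
cases = [(x, 0) for x in group0] + [(x, 1) for x in group1]
hit_rate = sum(classify(x, w, cutoff) == g for x, g in cases) / len(cases)
print("weights:", w, "cutoff:", cutoff, "hit rate:", hit_rate)
```

Comparing predicted with actual group membership, as in the hit rate here, is the basic check on whether cases are classified as predicted.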

Time Series Analysis
• Differs from other methods in that the X axis consists of equally spaced time intervals
• Objectives
  • Identify the distribution pattern of the variable over time
  • Pattern vs noise (error)
  • Trend vs seasonality
  • Trend analysis and autocorrelation
  • Forecast future values of the variable
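The objectives above (separating trend from seasonality, checking autocorrelation, forecasting) can be sketched on a small synthetic series. The series, the cycle length of 4, and the function names are all assumptions chosen for illustration; the forecast is a deliberately naive one.

```python
from statistics import mean

def linear_trend(y):
    """Least-squares slope and intercept of y against t = 0, 1, 2, ..."""
    t = range(len(y))
    t_bar, y_bar = mean(t), mean(y)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sxy / sxx
    return slope, y_bar - slope * t_bar

def autocorrelation(y, lag):
    """Sample autocorrelation of y at the given lag."""
    y_bar = mean(y)
    num = sum((y[i] - y_bar) * (y[i + lag] - y_bar) for i in range(len(y) - lag))
    den = sum((v - y_bar) ** 2 for v in y)
    return num / den

# Made-up series: a trend of 2 per step plus a repeating 4-step pattern
season = [3.0, -1.0, -3.0, 1.0]
series = [2.0 * t + season[t % 4] for t in range(12)]

# Trend: fit a straight line, then look at what is left over
slope, intercept = linear_trend(series)
residuals = [y - (intercept + slope * t) for t, y in enumerate(series)]

# Seasonality: residuals one full cycle apart should be positively correlated
print("lag-2 autocorrelation:", round(autocorrelation(residuals, 2), 3))
print("lag-4 autocorrelation:", round(autocorrelation(residuals, 4), 3))

# Naive forecast: extend the trend and add the average residual for that
# position in the cycle
seasonal = [mean(residuals[k::4]) for k in range(4)]
forecast = [intercept + slope * t + seasonal[t % 4] for t in range(12, 16)]
print("forecast:", [round(v, 2) for v in forecast])
```

Because the observations are equally spaced, the index t can stand in for time, and a lag of 4 always means the same elapsed interval.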
