
NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA



  1. NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

  2. CLASSIFICATION • Agglomerative hierarchical cluster analysis • Two-way indicator species analysis – TWINSPAN • Non-hierarchical k-means clustering • ‘Fuzzy’ clustering • Mixture models and latent class analysis • Detection of indicator species • Interpretation of classifications using external data • Comparing classifications • Software

  3. BOOKS ON NUMERICAL CLASSIFICATION
M.R. Anderberg, 1973, Cluster analysis for applications. Academic
H.T. Clifford & W. Stephenson, 1975, An introduction to numerical classification. Academic
B. Everitt, 1993, Cluster analysis. Halsted Press
A.D. Gordon, 1999, Classification. Chapman & Hall
A.K. Jain & R.C. Dubes, 1988, Algorithms for clustering data. Prentice Hall
L. Kaufman & P.J. Rousseeuw, 1990, Finding groups in data. An introduction to cluster analysis. Wiley
H.C. Romesburg, 1984, Cluster analysis for researchers. Lifetime Learning Publications
P.H.A. Sneath & R.R. Sokal, 1973, Numerical taxonomy. W.H. Freeman
H. Späth, 1980, Cluster analysis algorithms for data reduction and classification of objects

  4. BOOKS ON NUMERICAL CLASSIFICATION IN ECOLOGY
P.G.N. Digby & R.A. Kempton, 1987, Multivariate analysis of ecological communities. Chapman & Hall
P. Greig-Smith, 1983, Quantitative plant ecology. Blackwell
R.H.G. Jongman, C.J.F. ter Braak & O.F.R. van Tongeren (eds), 1995, Data analysis in community and landscape ecology. Cambridge University Press
P. Legendre & L. Legendre, 1998, Numerical ecology. Elsevier (second English edition)
J.A. Ludwig & J.F. Reynolds, 1988, Statistical ecology. J. Wiley
L. Orloci, 1978, Multivariate analysis in vegetation research. Dr. Junk
E.C. Pielou, 1984, The interpretation of ecological data. J. Wiley
J. Podani, 2000, Introduction to the exploration of multivariate biological data. Backhuys
W.T. Williams, 1976, Pattern analysis in agricultural science. CSIRO Melbourne
Most important are Chapters 7 and 8 in Legendre & Legendre (1998).

  5. BASIC AIM
Partition a set of data (objects) into groups or clusters. Partition into g groups so as to optimise some stated mathematical criterion, e.g. minimum sum-of-squares: divide the data into g groups so as to minimise the total within-groups variance or sum-of-squares, i.e. make the within-group variance as small as possible, thereby maximising the between-group variance.
Reduces the data to a few groups, which can be very useful, but is always a compromise: for 50 objects there are roughly 10^80 possible classifications.
Hierarchical classification: agglomerative or divisive.
Major reviews:
A.D. Gordon, 1996, Hierarchical classification. In Clustering and Classification (ed. P. Arabie & L.J. Hubert), pp. 65-121. World Scientific Publishing, River Edge, NJ
A.D. Gordon, 1999, Classification (second edition). Chapman & Hall
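The minimum sum-of-squares criterion can be written out explicitly (standard notation; the symbols below are not from the slide itself):

```latex
% Within-groups sum-of-squares, minimised over all partitions of the
% objects x_1, ..., x_n into g clusters C_1, ..., C_g with centroids \bar{x}_k:
\min_{C_1,\dots,C_g} W, \qquad
W = \sum_{k=1}^{g} \, \sum_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert^{2}
% The total sum-of-squares T = W + B is fixed by the data, so minimising
% the within-group term W simultaneously maximises the between-group term B.
```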

  6. CLASSIFICATION OF CLASSIFICATIONS

  7. MAIN APPROACHES
All are UNSUPERVISED classifications.
Hierarchical cluster analysis: formal, hierarchical, quantitative, agglomerative, polythetic, sharp; not always useful.
Two-way indicator species analysis (TWINSPAN): formal, hierarchical, semi-quantitative, divisive, semi-polythetic, sharp; usually useful.
k-means clustering: formal, non-hierarchical, quantitative, semi-agglomerative, polythetic, sharp; usually useful.
Fuzzy clustering: formal, non-hierarchical, quantitative, semi-agglomerative, polythetic, fuzzy; rarely used but potentially useful.
Mixture models and latent class analysis: formal (too formal!), non-hierarchical, quantitative, polythetic, sharp or fuzzy; rarely used, perhaps not potentially useful with complex data-sets.

  8. Warning!
“The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other ‘statistical’ innovation (with the possible exception of multiple regression techniques).” Cormack, 1970

  9. AGGLOMERATIVE HIERARCHICAL CLUSTER ANALYSIS
• Calculate matrix of proximity or dissimilarity coefficients
• Clustering
• Graphical display
• Check for distortion
• Validation of results

  10. PROXIMITY OR DISTANCE OR DISSIMILARITY MEASURES
A. Binary Data
Jaccard coefficient
Simple matching coefficient
Baroni-Urbani & Buser coefficient - Syst. Zool. (1976) 25, 251-259
Dissimilarity = (1 - S) for any similarity coefficient S
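In the standard 2 × 2 table notation (a = variables present in both objects, b and c = present in only one, d = absent from both; the slide's formula images did not survive extraction, so these are the usual textbook definitions):

```latex
S_{\text{Jaccard}} = \frac{a}{a + b + c}
\qquad
S_{\text{simple matching}} = \frac{a + d}{a + b + c + d}
\qquad
D = 1 - S
```

Note that the simple matching coefficient counts joint absences (d) as agreement, whereas the Jaccard coefficient ignores them, which is usually preferable for species data.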

  11. [Figure: objects i and j plotted against Variable 1 and Variable 2, with d_ij the Euclidean distance between them]
B. Quantitative Data
Euclidean distance - dominated by large values
Manhattan or city-block metric - less dominated by large values, but sensitive to extreme values
Bray & Curtis (percentage similarity) - relates minima to average values and represents the relative influence of abundant and uncommon variables
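The standard forms of these three measures, with x_ik the value of variable k in sample i (the slide's formula images did not survive extraction):

```latex
% Euclidean distance (dominated by large values):
d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^{2}}
% Manhattan or city-block metric:
d_{ij} = \sum_{k=1}^{m} \lvert x_{ik} - x_{jk} \rvert
% Bray & Curtis dissimilarity (the "relates minima to average values" form):
d_{ij} = \frac{\sum_{k} \lvert x_{ik} - x_{jk} \rvert}{\sum_{k} (x_{ik} + x_{jk})}
       = 1 - \frac{2 \sum_{k} \min(x_{ik}, x_{jk})}{\sum_{k} (x_{ik} + x_{jk})}
```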

  12. B. Quantitative Data (cont.)
Similarity ratio or Steinhaus-Marczewski coefficient (the quantitative analogue of Jaccard) - less dominated by extremes
Chord distance - good "signal to noise" for % data
C. Percentage Data (e.g. pollen, diatoms)
Standardised Euclidean distance - gives all variables ‘equal’ weight, increases noise in data
Euclidean distance - dominated by large values, rare variables have almost no influence
Chord distance (= Euclidean distance of square-root transformed data) - good compromise, maximises signal-to-noise ratio

  13. D. Transformations
Normalise samples - ‘equal’ weight
Normalise variables - ‘equal’ weight, rare species inflated
No transformation - quantity dominated
Double transformation - equalises both, a compromise
Noy-Meir et al. (1975) J. Ecology 63, 779-800
E. Mixed data (e.g. quantitative, qualitative, binary)
Gower coefficient (see Lecture 12)
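A minimal NumPy sketch of these transformations (the function names and the order of operations in the double transformation are my own choices, not prescribed by the slide):

```python
import numpy as np

def normalise_samples(X):
    """Scale each sample (row) to unit total, giving samples 'equal' weight."""
    return X / X.sum(axis=1, keepdims=True)

def normalise_variables(X):
    """Scale each variable (column) to its maximum, giving variables 'equal'
    weight; note this inflates the influence of rare species."""
    return X / X.max(axis=0, keepdims=True)

def double_transform(X):
    """Compromise: equalise variables first, then samples (one common order
    for the double transformation discussed by Noy-Meir et al. 1975)."""
    return normalise_samples(normalise_variables(X))

# Example: 3 samples x 4 species abundance matrix (all rows/columns non-zero)
X = np.array([[10.0, 1, 2, 8],
              [ 5.0, 1, 1, 4],
              [ 1.0, 3, 6, 1]])
print(double_transform(X).round(2))
```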

  14. AGGLOMERATIVE HIERARCHICAL CLUSTER ANALYSIS (five stages)
i. Calculate matrix of proximity (similarity or dissimilarity) measures between all ½n(n - 1) pairs of the n samples
ii. Fuse objects into groups using a stated criterion - the ‘clustering’ or sorting strategy
iii. Graphical display of results - dendrograms or trees, graphs, shadings
iv. Check for distortion
v. Validation of results

  15. i. Simple Distance Matrix

  16. ii. Clustering Strategy using Single-Link Criterion
Find the objects with the smallest d_ij: d_12 = 2, so fuse objects 1 and 2.
Calculate distances between this group (1, 2) and the other objects:
d_(12)3 = min { d_13, d_23 } = d_23 = 5
d_(12)4 = min { d_14, d_24 } = d_24 = 9
d_(12)5 = min { d_15, d_25 } = d_25 = 8
Find the objects with the smallest d_ij: d_45 = 3, so fuse objects 4 and 5.
Calculate distances between (1, 2), 3, and (4, 5).
Find the object with the smallest d_ij: d_3(4,5) = 4, so fuse object 3 with group (4, 5).
Finally fuse (1, 2) with (3, 4, 5) at distance 5.
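The same fusion sequence can be reproduced with SciPy. The slide does not give d_14, d_15, d_34 or d_35 individually, so the values 10, 9, 4 and 6 below are assumptions chosen to be consistent with the minima quoted above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Symmetric distance matrix for objects 1-5; entries d14=10, d15=9,
# d34=4 and d35=6 are assumed, the rest are from the worked example.
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  5,  9,  8],
              [ 6,  5,  0,  4,  6],
              [10,  9,  4,  0,  3],
              [ 9,  8,  6,  3,  0]], dtype=float)

# linkage() takes the condensed (upper-triangle) form of the matrix
Z = linkage(squareform(D), method='single')
print(Z[:, 2])  # fusion distances: 2.0 (1,2), 3.0 (4,5), 4.0 (3+(4,5)), 5.0 (all)
```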

  17. When I and J fuse, we need to calculate the distance of each remaining group K to the new group (I, J).

  18. Also:
Unweighted group-average: the distance between K and (I, J) is the average of all distances from objects in I and J to objects in K.
Weighted group-average: the distance between K and (I, J) is the simple average of the group-average distance between K and I and that between K and J.
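In symbols (standard forms, with n_I and n_J the numbers of objects in groups I and J; the slide's own formulas did not survive extraction):

```latex
d_{K(IJ)}^{\text{unweighted}} = \frac{n_I \, d_{KI} + n_J \, d_{KJ}}{n_I + n_J}
\qquad
d_{K(IJ)}^{\text{weighted}} = \tfrac{1}{2}\,(d_{KI} + d_{KJ})
```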

  19. Clustering criteria:
Single-link (nearest neighbour)
Complete-link (furthest neighbour)
Median
Centroid
Unweighted group-average
Weighted group-average
Minimum variance, sum-of-squares (Ward’s method) - Orloci (1967) J. Ecology 55, 193-206
Ward’s method: with Q_I, Q_J, Q_K the within-group sums-of-squares, fuse I with J to give (I, J) if and only if the resulting increase Q_IJ - (Q_I + Q_J) is smaller than that for any competing fusion, such as Q_IK - (Q_I + Q_K) or Q_JK - (Q_J + Q_K); i.e. only fuse I and J if neither would combine better, giving a lower sum-of-squares, with some other group.

  20. GENERALISED SORTING STRATEGY
Wishart (1969) Biometrics 25, 165-170
d_k(ij) = α_i d_ki + α_j d_kj + β d_ij + γ |d_ki - d_kj|
(the distance between group k and the new group (i, j) follows a recurrence formula, where α_i, α_j, β and γ are parameters that define the different methods)
Software: CLUSTER, CLUSTAN-PC, CLUSTAN-GRAPHICS

  21. Single-link example: to calculate the distance d_3(1,2):
d_3(1,2) = α_i d_31 + α_j d_32 + β d_12 + γ |d_31 - d_32|
= ½(6) + ½(5) + 0(2) - ½|6 - 5|
= 3 + 2.5 - 0.5
= 5
(for single link, α_i = α_j = ½, β = 0, γ = -½)
Can also have flexible clustering with user-specified β (usually -0.25).
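A one-function sketch of the recurrence, using the standard parameter values for single and complete link (illustrative Python, not the CLUSTAN implementation):

```python
def lance_williams(d_ki, d_kj, d_ij, a_i, a_j, beta, gamma):
    """Update the distance from group k to the newly fused group (i, j):
    d_k(ij) = a_i*d_ki + a_j*d_kj + beta*d_ij + gamma*|d_ki - d_kj|."""
    return a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)

# Single link (a_i = a_j = 1/2, beta = 0, gamma = -1/2) reproduces the example:
print(lance_williams(6, 5, 2, 0.5, 0.5, 0.0, -0.5))  # 5.0 = min(d31, d32)
# Complete link differs only in the sign of gamma:
print(lance_williams(6, 5, 2, 0.5, 0.5, 0.0, +0.5))  # 6.0 = max(d31, d32)
```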

  22. CLUSTERING STRATEGIES
Single link = nearest neighbour: finds the minimum spanning tree, the shortest tree that connects all points; finds discontinuities if they exist in the data; chaining common; clusters of unequal size.
Complete link = furthest neighbour: compact clusters of ± equal size; makes compact clusters even when none exist.
Average-linkage methods: intermediate between single and complete link; unweighted group-average maximises the cophenetic correlation; clusters often quite compact; make quite compact clusters even when none exist.
Median and centroid: can form reversals in the tree.
Minimum variance, sum-of-squares: compact clusters of ± equal size; makes very compact clusters even when none exist; a very intense clustering method.

  23. iii. Graphical display Dendrogram ‘Tree Diagram’

  24. Parsimonious Trees
Limit the number of different values taken by the heights of internal nodes, or the number of internal nodes.
Global parsimonious tree of the dendrogram.
Group-average dendrogram of 65 regions in Europe; the measure of pairwise similarity is Jaccard’s coefficient, based on the presence or absence of 144 species of fern.

  25. Local parsimonious tree of the dendrogram

  26. Matrix Shading
A similarity matrix based on scores for 15 qualities of 48 applicants for a job. The dendrogram shows a furthest-neighbour cluster analysis, the end points of which correspond to the 48 applicants in sorted order.
Ling (1973) Comm. Assoc. Computing Mach. 16, 355-361

  27. Re-order Data Matrix Schematic way of combining row and column hierarchical analyses

  28. Summarised two-way table of the Malham data set. The representation of the species groups (1-23) delimited by minimum-variance cluster analysis in the eight quadrat clusters (A-H) is shown by the size of the circle. In addition, both the quadrat and species dendrograms derived from minimum-variance clustering are included to show the relationships between groups.

  29. iv. Tests for Distortion
Cophenetic correlations. The similarity matrix S contains the original similarity values between the OTUs (in this example it is a dissimilarity matrix U of taxonomic distances). The UPGMA phenogram derived from it is shown, and from the phenogram the cophenetic distances are obtained to give the matrix C. The cophenetic correlation coefficient r_cs is the correlation between corresponding pairs from C and S, and here is 0.9911.
Software: R, CLUSTER
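SciPy computes the cophenetic correlation directly; a minimal sketch on made-up data (the data and the choice of UPGMA here are illustrative only):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((10, 4))            # 10 OTUs x 4 variables, random for the sketch

d = pdist(X)                       # original dissimilarity matrix S (condensed)
Z = linkage(d, method='average')   # UPGMA phenogram
r, coph = cophenet(Z, d)           # r = correlation between C (cophenetic) and S
print(f"cophenetic correlation r_cs = {r:.4f}")
```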

  30. Which Cluster Method to Use?
Cluster analysis of the Mancetter data: Ward’s method analysis and average-link analysis of the same data.

  31. J. Oksanen (2002)

  32. SINGLE LINK

  33. MINIMUM VARIANCE

  34. CLUSTERING AND SPACE Convex hull encloses all points so that no line between two points can be drawn outside the convex hull. J. Oksanen (2002)

  35. General Behaviour of Different Methods
Single-link: often results in chaining.
Complete-link: intense clustering.
Group-average (weighted): tends to join clusters with small variances.
Group-average (unweighted): intermediate between single and complete link.
Median: can result in reversals.
Centroid: can result in reversals.
Minimum variance: often forms clusters of equal size.
General experience: minimum variance is usually most useful, although it tends to produce clusters of fairly equal size; group-average is next best; single-link is least useful.

  36. SIMULATION STUDIES
Clustering of random data on two variables. Note: diagram (a) is a plot of two randomly generated variables labelled according to the clusters suggested by Ward’s method in diagram (b). Baxter (1994)

  37. v. VALIDATION OF RESULTS - TESTS FOR ASSESSING CLUSTERS
Validation tests for:
(1) The complete absence of any group structure in the data
(2) The validity of an individual cluster
(3) The validity of a partition
(4) The validity of a complete hierarchical classification
Main interest is in (2) and (3) - we generally assume there is some ‘group structure’ and are rarely interested in validating a complete hierarchical classification.
Gordon, A.D. (1995) Statistics in Transition 2, 207-217
Gordon, A.D. (1996) In: From Data to Knowledge (ed. W. Gaul & D. Pfeifer). Springer

  38. Cluster analysis of the joint occurrence of 43 species of fish in the Susquehanna River drainage area of Pennsylvania, constructed with the UPGMA clustering algorithm (Sneath & Sokal, 1973). The three short perpendicular lines on the dissimilarity scale represent the critical values C1, C2, and C3 obtained from the null frequency distributions of node heights. Significant clusters are indicated by solid lines; the non-significant portion of the dendrogram is drawn in dotted lines.
Strauss (1982) Ecology 63, 634-639

  39. Hunter & McCoy 2004 J. Vegetation Science 15, 135-138
Problem of creating ecologically relevant 'random' or 'null' data-sets: within a 'significant' cluster, linkages are often identified as 'significant' even when species are actually randomly distributed among the sites in the group.
Artificial data: 2 groups of 20 sites with no species in common. [Figure: species-by-sites presence/absence matrix]

  40. A randomisation test identifies both groups, and all linkages within them, as 'significant'. The same test finds all linkages non-significant if only one of the groups is used! (Figure legend: symbols in the original figure distinguish significant from non-significant linkages.)

  41. This arises because the randomisation matrices need to be created at each classification step, not just at the beginning. One can test the significance of groups by comparing linkage distances to a null distribution derived from randomisation and clustering of a sub-matrix containing only the sites within the larger group. In other words, this tests the null hypothesis that, within the significant group, sites represent random assemblages of species. Sequential randomisation allows evaluation of all nodes in the classification.
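A rough sketch of the sequential-randomisation idea for one node (my own outline, not Hunter & McCoy's code; the Jaccard/average-link choices are arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def sequential_node_test(sub_matrix, n_rand=999, seed=0):
    """Null distribution for one node of the classification: shuffle each
    species (column) across the sites of this group only, re-cluster the
    randomised presence/absence sub-matrix, and record the top fusion
    distance each time."""
    rng = np.random.default_rng(seed)
    observed = linkage(pdist(sub_matrix, 'jaccard'), 'average')[-1, 2]
    null = np.empty(n_rand)
    for r in range(n_rand):
        shuffled = np.column_stack([rng.permutation(col) for col in sub_matrix.T])
        null[r] = linkage(pdist(shuffled, 'jaccard'), 'average')[-1, 2]
    # An observed fusion far into the tail of the null distribution suggests
    # non-random structure within the group; repeat for each node, descending.
    return observed, null
```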

  42. OTHER APPROACHES TO ASSESSING AND VALIDATING CLUSTERS
If replicate samples are available, bootstrapping can be used to evaluate significance. Within-cluster samples can also serve as ‘replicates’.
BOOTCLUS - McKenna (2003) Environmental Modelling & Software 18, 205-220 (www.glsc.usgs.gov/data/bootclus.htm)
SAMPLERE - Pillar (1999) Ecology 80, 2508-2516. Compares cluster analysis groups and uses bootstrapping (resampling with replacement) to test the null hypothesis that the clusters in the bootstrap samples are random samples of their most similar corresponding clusters in the observed data. The resulting probability indicates whether the groups in the partition are sharp enough to reappear consistently in resampling.
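A minimal sketch in the same spirit (an illustration of the general idea, not the BOOTCLUS or Pillar algorithm; duplicates are dropped from each resample, so this is closer to subsampling than a strict bootstrap):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_rand_score

def partition_stability(X, g, n_boot=999, seed=0):
    """Resample sites with replacement, re-cluster, and score how
    consistently the observed g-group partition reappears; a mean
    adjusted Rand index near 1 indicates 'sharp' groups."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    observed = fcluster(linkage(pdist(X), 'ward'), g, criterion='maxclust')
    scores = []
    for _ in range(n_boot):
        idx = np.unique(rng.integers(0, n, n))   # unique sites drawn this round
        labels = fcluster(linkage(pdist(X[idx]), 'ward'), g, criterion='maxclust')
        scores.append(adjusted_rand_score(observed[idx], labels))
    return float(np.mean(scores))
```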

  43. NUMBER OF CLUSTERS
There are almost as many fusion levels as there are observations (n - 1 fusions for n objects), and a hierarchical classification can be cut at any level. The user generally wants to use groups all at one level, hence ‘cut levels’ or ‘stopping rules’. There are no optimality criteria or guidelines: select what is useful for the purpose at hand. No right or wrong answer, just useful or not useful!
Mathematical criteria - see A.D. Gordon (1999) pp. 60-65
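Cutting a SciPy dendrogram either at a group count or at a height; the data, the method and the threshold 1.5 below are arbitrary illustrations:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(1).random((20, 5))   # 20 samples x 5 variables
Z = linkage(pdist(X), 'ward')

# Same hierarchy, two different cut levels; neither is 'right',
# only more or less useful for the purpose at hand.
by_count  = fcluster(Z, t=4,   criterion='maxclust')   # exactly 4 groups
by_height = fcluster(Z, t=1.5, criterion='distance')   # cut at height 1.5
print(by_count)
print(by_height)
```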

  44. CRITERIA FOR GOOD CLUSTERS
• Divide underlying gradients into equal parts
• Compact clusters
• Groups of equal size
• Discontinuous groups
These criteria are often in conflict and cannot all be satisfied simultaneously.
J. Oksanen (2002)

  45. 2. TWINSPAN - Two-Way Indicator Species Analysis
TWINSPAN - Mark Hill (1979)
Differential variables characterise groups, i.e. variables common on one side of a dichotomy. This involves a qualitative (+/-) concept, so numerical data have to be analysed as PSEUDO-VARIABLES (conjoint coding):
Species A 1-5% → SPECIES A1
Species A 5-10% → SPECIES A2
Species A 10-25% → SPECIES A3 (cut levels)
The basic idea is to construct a hierarchical classification by successive division. Ordinate the samples by correspondence analysis and divide at the middle: the group to the left is negative, the group to the right positive. Now refine the classification using the variables with maximum indicator value (so-called iterative character weighting) and do a second ordination that gives greater weight to the ‘preferentials’, namely the species on one or other side of the dichotomy. Identify the indicators that differ most in frequency of occurrence between the two groups: those associated with the positive side score +1, those with the negative side -1. If a variable is 3 times more frequent on one side than the other, it is a good indicator. The samples are then reordered on the basis of their indicator scores, and refined a second time to take account of the other variables. Repeat on the 2 groups to give 4, 8, 16 groups and so on, until a group falls below the minimum size.

  46. TWINSPAN

  47. Pseudo-species Concept
Each species can be represented by several pseudo-species, depending on the species abundance. A pseudo-species is present if the species value equals or exceeds the relevant user-defined cut level, e.g. cut levels 1, 5, and 20. Thus quantitative data are transformed into nominal (1/0) variables.
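A minimal sketch of the pseudo-species coding with the cut levels quoted on the slide (the function name is my own):

```python
import numpy as np

def pseudo_species(abundances, cut_levels=(1, 5, 20)):
    """Recode a quantitative abundance vector as 1/0 pseudo-species:
    pseudo-species k is 'present' where abundance >= cut_levels[k]."""
    a = np.asarray(abundances)
    return np.stack([(a >= c).astype(int) for c in cut_levels], axis=1)

# One species with abundances 0, 3, 8 and 25 in four samples:
print(pseudo_species([0, 3, 8, 25]))
# [[0 0 0]    abundance 0  -> no pseudo-species present
#  [1 0 0]    abundance 3  -> A1 only
#  [1 1 0]    abundance 8  -> A1 and A2
#  [1 1 1]]   abundance 25 -> A1, A2 and A3
```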

  48. Variables are classified in much the same way, using sample weights based on the sample classification. They are classified on the basis of fidelity - how confined variables are to particular sample groups, measured as the ratio of the mean occurrence of the variable in samples in the group to its mean occurrence in samples not in the group. Variables are ordered by their degree of fidelity within a group, and a structured two-way table is then printed.
Key concepts: INDICATOR SPECIES, DIFFERENTIALS and PREFERENTIALS, FIDELITY.
Gauch & Whittaker (1981) J. Ecology 69, 537-557: “two-way indicator species analysis usually best. There are cases where other techniques may be complementary or superior”. Very robust - considers overall data structure; the “best general purpose method when a data set is complex, noisy, large or unfamiliar. In many cases a single TWINSPAN classification is likely to be all that is necessary”.
Software: TWINSPAN, TWINGRP, TWINDEND, WINTWINS

  49. TWINSPAN TESTS OF ROBUSTNESS & RELIABILITY
van Groenewoud (1992) J. Veg. Sci. 3, 239-246
Belbin & McDonald (1993) J. Veg. Sci. 4, 341-348
Both used artificial data of known properties and structure. The reliability of TWINSPAN depends on:
• How well correspondence analysis extracts axes that have ecological meaning
• How well the CA axes are divided into meaningful segments
• How faithful certain species are to certain segments of the multivariate space
