
NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA








  1. NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 9. Discriminant Analysis

  2. DISCRIMINANT ANALYSIS
  • Discriminant analysis of two groups
  • Assumptions of discriminant analysis – multivariate normality, homogeneity
  • Comparison of properties of two groups
  • Identification of unknowns – Picea pollen
  • Canonical variates analysis (= multiple discriminant analysis) of three or more groups
  • Discriminant analysis in the framework of regression
  • Discriminant analysis and artificial neural networks
  • Niche analysis of species
  • Relation of canonical correspondence analysis (CCA) to canonical variates analysis (CVA)
  • Generalised distance-based canonical variates analysis
  • Discriminant analysis and classification trees
  • Software

  3. IMPORTANCE OF CONSIDERING GROUP STRUCTURE Visual comparison of the method used to reduce dimensions in (a) an unconstrained and (b) a constrained ordination procedure. Data were simulated from a multivariate normal distribution with the two groups having different centroids (6, 9) and (9, 7), but both variables had a standard deviation of 2, and the correlation between the two variables was 0.9. Note the difference in scale between the first canonical axis (CV1) and the first principal component (PC1).
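  The simulated data described in this caption can be reproduced approximately with a few lines of R (an illustrative sketch, not the original simulation code; the object names are mine):

    # Simulate the two groups from a bivariate normal distribution:
    # centroids (6, 9) and (9, 7), both standard deviations 2, correlation 0.9
    library(MASS)
    Sigma <- matrix(c(2^2,         0.9 * 2 * 2,
                      0.9 * 2 * 2, 2^2), nrow = 2)   # common covariance matrix
    set.seed(42)
    group1 <- mvrnorm(100, mu = c(6, 9), Sigma = Sigma)
    group2 <- mvrnorm(100, mu = c(9, 7), Sigma = Sigma)
    plot(rbind(group1, group2), col = rep(1:2, each = 100),
         xlab = "Variable 1", ylab = "Variable 2")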

  4. Klovan & Billings (1967) Bull. Canad. Petrol. Geol. 15, 313-330 DISCRIMINANT ANALYSIS • Taxonomy – species discrimination e.g. Iris setosa, I. virginica 2. Pollen analysis – pollen grain separation 3. Morphometrics – sexual dimorphism 4. Geology – distinguishing rock samples Discriminant function – linear combination of variablesx1 and x2. z = b1x1 + b2x2 where b1andb2are weights attached to each variable that determine the relative contributions of the variable. Geometrically – line that passes through where group ellipsoids cut each other L, then draw a line perpendicular to it, M, that passes through the origin, O. Project ellipses onto the perpendicular to give two univariate distributions S1 and S2 on discriminant function M.

  5. Schematic diagram indicating part of the concept underlying discriminant functions. Plot of two bivariate distributions, showing overlap between groups A and B along both variables X1 and X2. The groups can be distinguished by projecting members of the two groups onto the discriminant function line z = b1x1 + b2x2.

  6. For m variables there are m discriminant function coefficients λ. Sw is the m x m matrix of pooled variances and covariances, and D = (x̄1 – x̄2) is the vector of differences between the two group means. The coefficients are solved from
  Sw λ = D,   i.e.   λ = Sw-1 (x̄1 – x̄2) = Sw-1 D
  where Sw-1 is the inverse of Sw. Can be generalised for three or more variables.
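  A minimal R sketch of this calculation, assuming two small hypothetical data matrices (rows = individuals, columns = variables; the numbers are made up for illustration):

    # Discriminant coefficients lambda = Sw^-1 (mean of A - mean of B)
    groupA <- matrix(c(0.31, 1.05,
                       0.33, 1.20,
                       0.35, 1.25), ncol = 2, byrow = TRUE)
    groupB <- matrix(c(0.32, 1.10,
                       0.35, 1.28,
                       0.35, 1.25), ncol = 2, byrow = TRUE)
    na <- nrow(groupA); nb <- nrow(groupB)

    D <- colMeans(groupA) - colMeans(groupB)        # vector of mean differences

    # Pooled matrix Sw = (SA + SB) / (na + nb - 2), with SA and SB the
    # within-group sums of squares and cross-products
    SA <- cov(groupA) * (na - 1)
    SB <- cov(groupB) * (nb - 1)
    Sw <- (SA + SB) / (na + nb - 2)

    lambda <- solve(Sw, D)                          # discriminant function coefficients
    lambda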

  7. SIMPLE EXAMPLE OF LINEAR DISCRIMINANT ANALYSIS
  Group A (na individuals): mean of variable x1 = 0.330; mean of variable x2 = 1.167; mean vector = [0.330 1.167]
  Group B (nb individuals): mean of variable x1 = 0.340; mean of variable x2 = 1.210; mean vector = [0.340 1.210]
  Vector of mean differences (D) = [-0.010 -0.043]
  Variance-covariance matrix Sij = …, where xik is the value of variable i for individual k.

  8. Covariance matrix for group A:
  SA = |  0.00092  -0.00489 |
       | -0.00489   0.07566 |
  and for group B:
  SB = |  0.00138  -0.00844 |
       | -0.00844   0.10700 |
  Pooled matrix:
  SW = (SA + SB) / (na + nb – 2) = |  0.00003  -0.00017 |
                                   | -0.00017   0.00231 |

  9. To solve [SW] [λ] = [D] we need the inverse of SW:
  SW-1 = | 59112.280  4312.646 |
         |  4312.646   747.132 |
  Now SW-1 · D = λ:
  | 59112.280  4312.646 |   | -0.010 |   | -783.63 |  (x1)
  |  4312.646   747.132 | · | -0.043 | = |  -75.62 |  (x2)
  i.e. the discriminant function coefficients are -783.63 for variable x1 and -75.62 for x2.
  [z = -783.63 x1 – 75.62 x2]
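  The same arithmetic can be checked in R with solve(), using the matrices as printed on the slide (the results differ a little from the printed coefficients because SW is rounded to five decimal places here):

    SW <- matrix(c( 0.00003, -0.00017,
                   -0.00017,  0.00231), nrow = 2, byrow = TRUE)
    D  <- c(-0.010, -0.043)

    solve(SW)               # inverse of SW (compare with SW^-1 above)
    lambda <- solve(SW, D)  # coefficients; close to -783.63 and -75.62 (differences due to rounding of SW)
    lambda

    # Discriminant score for an individual with measurements x1 and x2:
    # z <- lambda[1] * x1 + lambda[2] * x2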

  10. MATRIX INVERSION Division of one matrix by another, in the sense of ordinary algebraic division, cannot be performed. To solve the equation [A] . [X] = [B] for the matrix [X], we first find the inverse of matrix [A], generally represented as [A]-1. The inverse or reciprocal matrix of [A] satisfies the relationship [A] . [A]-1 = [I], where [I] is an identity matrix with zeros in all the elements except the diagonal, where the elements are all 1. To solve for [X] we multiply both sides by [A]-1 to get [A]-1 . [A] . [X] = [A]-1 . [B]. As [A]-1 . [A] = [I] and [I] . [X] = [X], the equation reduces to [X] = [A]-1 . [B].

  11. If matrix A is
  |  4  10 |
  | 10  30 |
  to find its inverse we first place an identity matrix [I] next to it:
  |  4  10 | 1  0 |
  | 10  30 | 0  1 |
  We now want to convert the diagonal elements of A to ones and the off-diagonal elements to zeros. We do this by dividing the matrix rows by constants and subtracting rows of the matrix from other rows. Row one is divided by 4 to produce an element a11 = 1:
  |  1  2.5 | 0.25  0 |
  | 10  30  | 0     1 |
  To reduce a21 to zero we now subtract ten times row one from row two to give
  | 1  2.5 |  0.25  0 |
  | 0  5   | -2.5   1 |
  To make a22 = 1 we now divide row two by 5:
  | 1  2.5 |  0.25  0   |
  | 0  1   | -0.5   0.2 |

  12. To reduce element a12 to zero, we now subtract 2.5 times row two from row one to give
  | 1  0 |  1.5  -0.5 |
  | 0  1 | -0.5   0.2 |
  The inverse of A is thus
  |  1.5  -0.5 |
  | -0.5   0.2 |
  This can be checked by multiplying [A]-1 by [A], which should yield the identity matrix I, i.e.
  |  1.5  -0.5 |   |  4  10 |   | 1  0 |
  | -0.5   0.2 | . | 10  30 | = | 0  1 |
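  A small R sketch of the same Gauss-Jordan procedure (a didactic implementation without pivoting, written for this handout; in practice solve() inverts a matrix directly):

    # Gauss-Jordan inversion: augment A with the identity matrix and row-reduce
    # until the left half becomes I; the right half is then A^-1.
    gauss_jordan_inverse <- function(A) {
      n   <- nrow(A)
      aug <- cbind(A, diag(n))                         # [A | I]
      for (i in seq_len(n)) {
        aug[i, ] <- aug[i, ] / aug[i, i]               # make the diagonal element 1
        for (j in seq_len(n)) {
          if (j != i) {
            aug[j, ] <- aug[j, ] - aug[j, i] * aug[i, ]   # zero the off-diagonal elements
          }
        }
      }
      aug[, (n + 1):(2 * n)]                           # right-hand half = inverse
    }

    A <- matrix(c(4, 10, 10, 30), nrow = 2, byrow = TRUE)
    Ainv <- gauss_jordan_inverse(A)
    Ainv          # 1.5 -0.5 / -0.5 0.2, as above
    Ainv %*% A    # identity matrix, confirming the inversion
    solve(A)      # base R gives the same answer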

  13. R.A. Fisher

  14. We can position the means of group A and of group B on the discriminant function:
  RA = λ1 x̄1 + λ2 x̄2 = -783.63 x 0.330 + -75.62 x 1.167 = -346.64
  RB = -783.63 x 0.340 + -75.62 x 1.210 = -357.81
  We can also position individual samples along the discriminant axis. The squared distance between the means is
  D2 = (x̄1 – x̄2)' Sw-1 (x̄1 – x̄2) = 11.17
  To test the significance of this we use Hotelling's T2 test for differences between means:
  T2 = [na nb / (na + nb)] D2
  with an F ratio of
  F = [(na + nb – m – 1) / ((na + nb – 2) m)] T2
  and m and (na + nb – m – 1) degrees of freedom.
  R CANOCO
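  In R these quantities can be computed as below; the group sizes na and nb are hypothetical, since the slide does not give them:

    na <- 40; nb <- 40; m <- 2                            # group sizes assumed for illustration

    D      <- c(-0.010, -0.043)                           # vector of mean differences
    SW     <- matrix(c( 0.00003, -0.00017,
                       -0.00017,  0.00231), 2, 2, byrow = TRUE)
    lambda <- solve(SW, D)

    D2 <- as.numeric(t(D) %*% lambda)                     # D2 = D' SW^-1 D
    T2 <- (na * nb / (na + nb)) * D2                      # Hotelling's T2
    F  <- ((na + nb - m - 1) / ((na + nb - 2) * m)) * T2  # F ratio
    p  <- pf(F, m, na + nb - m - 1, lower.tail = FALSE)   # significance of the separation
    c(D2 = D2, T2 = T2, F = F, p = p)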

  15. ASSUMPTIONS
  1. Objects in each group are randomly chosen.
  2. Variables are normally distributed within each group.
  3. Variance-covariance matrices of the groups are statistically homogeneous (similar size, shape, orientation).
  4. None of the objects used to calculate the discriminant function is misclassified.
  Also, in identification:
  5. The probability of the unknown object belonging to either group is equal, and it cannot come from any other group.

  16. MULTIVARIATE NORMALITY Mardia (1970) Biometrika 57, 519–530
  SKEWNESS: significance tested with A = n . b1,m / 6, which follows a χ2 distribution with m(m + 1)(m + 2)/6 degrees of freedom.
  KURTOSIS: test statistic B, asymptotically distributed as N(0,1).
  MULTNORM
  Probability plotting – D2 plots.
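  A compact R sketch of Mardia's statistics, written from the 1970 definitions (not the MULTNORM program); b1,m and b2,m denote the multivariate skewness and kurtosis measures:

    # Mardia's multivariate skewness and kurtosis for a data matrix X
    # (rows = individuals, columns = variables)
    mardia <- function(X) {
      X  <- as.matrix(X)
      n  <- nrow(X); m <- ncol(X)
      Xc <- scale(X, center = TRUE, scale = FALSE)      # centre each variable
      S  <- crossprod(Xc) / n                           # maximum-likelihood covariance estimate
      G  <- Xc %*% solve(S) %*% t(Xc)                   # g_ij = (xi - xbar)' S^-1 (xj - xbar)

      b1 <- sum(G^3) / n^2                              # multivariate skewness b1,m
      b2 <- mean(diag(G)^2)                             # multivariate kurtosis b2,m

      A  <- n * b1 / 6                                  # chi-square, m(m+1)(m+2)/6 d.f.
      B  <- (b2 - m * (m + 2)) / sqrt(8 * m * (m + 2) / n)   # asymptotically N(0,1)

      c(b1.m = b1, p.skew = pchisq(A, df = m * (m + 1) * (m + 2) / 6, lower.tail = FALSE),
        b2.m = b2, p.kurt = 2 * pnorm(abs(B), lower.tail = FALSE))
    }

    set.seed(1)
    mardia(cbind(rnorm(50), rnorm(50)))   # simulated bivariate normal sample for illustration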

  17. Multidimensional probability plotting. The top three diagrams show probability plots (on arithmetic paper) of generalized distances between two variables: left, plot of D2 against probability; middle, plot of D2 against probability; right, a similar plot after removal of four outlying values and recalculation of the D2 values. If the distributions are normal, such plots should be approximately S-shaped. The third curve is much closer to being S-shaped than the second, so we surmise that removal of the outlying values has made the distribution normal. Again, however, a judgement of the degree of fit to a curved line is difficult to make visually. Replotting the second and third figures above on probability paper gives the probability plots shown in the bottom two diagrams. It is now quite clear that the full data set does not approximate a straight line; the data set after removal of the four outliers is, on visual inspection alone, remarkably close to a straight line.

  18. STATISTICAL HOMOGENEITY OF COVARIANCE MATRICES
  Primary causes
  Secondary causes
  Approximate χ2 distribution with ½ m(m + 1) d.f. and B distribution with ½ m(m + 1) d.f.
  ORNTDIST
  Campbell (1981) Austr. J. Stat. 23, 21-37

  19. COMPARISON OF PROPERTIES OF TWO MULTIVARIATE GROUPS ORNTDIST
  • Outliers – probability plots of D2; gamma plot (shape parameter m/2)
  • PCA of both groups separately, and test for homogeneity of the group matrices

  20. Chi-square probability plot of generalized distances D2: ordered observations plotted against theoretical quantiles. (In this and subsequent probability figures the theoretical quantiles are plotted along the x-axis and the ordered observations along the y-axis.)
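  A plot of this kind can be drawn in base R; an illustrative sketch on simulated data (object names are mine):

    # Chi-square probability plot: ordered D2 values against chi-square quantiles (m d.f.)
    set.seed(2)
    X  <- cbind(rnorm(60, mean = 6, sd = 2), rnorm(60, mean = 9, sd = 2))
    m  <- ncol(X)

    D2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared distances to the centroid
    q  <- qchisq(ppoints(length(D2)), df = m)                  # theoretical chi-square quantiles

    plot(q, sort(D2), xlab = "Chi-square quantiles", ylab = "Ordered D2",
         main = "Chi-square probability plot of D2")
    abline(0, 1, lty = 2)   # multivariate normal data should lie close to this line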

  21. 1  2 1 = 2 1  2 1  2 Lengths  

  22. TESTS FOR ORIENTATION DIFFERENCES Anderson (1963) Ann. Math. Stat. 34, 122-148 ORNTDIST
  Calculate the test statistic, where n is the sample size of dispersion matrix S1, di is eigenvalue i, and bi is eigenvector i of dispersion matrix S2 (the larger of the two). This is χ2 distributed with (m – 1) d.f.
  If the matrices are heterogeneous, we can test whether this is due to differences in orientation. If there are no differences in orientation, the heterogeneity is due to differences in the size and shape of the ellipsoids.

  23. SQUARED GENERALISED DISTANCE VALUES * Percentages within parentheses (Reyment (1969) Bull. Geol. Inst. Uppsala 1, 97-119)

  24. GENERALISED STATISTICAL DISTANCES BETWEEN TWO GROUPS ORNTDIST
  Mahalanobis D2 = d' S-1 d, where S-1 is the inverse of the pooled variance-covariance matrix and d is the vector of differences between the vectors of means of the two samples.
  Anderson and Bahadur D2, defined in terms of a vector b, where S1 and S2 are the respective group covariance matrices and t is a scalar term between zero and 1 that is improved iteratively.
  Reyment D2, where Sr is the sample covariance matrix of differences obtained from the random pairing of the two groups. As N1 Dr2 / 2 = T2, its significance can be tested.
  Average D2, where Sa = ½ (S1 + S2).

  25. Dempster's directed distances D(1)2 and D(2)2; Dempster's generalised distances D12 and D22; Dempster's delta distance D2, defined in terms of a matrix S.

  26. IDENTIFICATION OF UNKNOWN OBJECTS DISKFN, R
  Assumes that the probability of the unknown object belonging to either group is equal, and presupposes that there are no other possible groups it could come from. Closeness rather than either/or identification. If an unknown, u, has a position on the discriminant function, its closeness to each group can be assessed with m degrees of freedom.
  Birks & Peglar (1980) Can. J. Bot. 58, 2043-2058
  Picea glauca (white spruce) pollen; Picea mariana (black spruce) pollen
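  A minimal R sketch of such a "closeness" identification, assuming the group means and pooled matrix from the worked example earlier and a hypothetical unknown (this is not the DISKFN program):

    meanA <- c(0.330, 1.167)                       # group A mean vector
    meanB <- c(0.340, 1.210)                       # group B mean vector
    SW    <- matrix(c( 0.00003, -0.00017,
                      -0.00017,  0.00231), 2, 2, byrow = TRUE)
    m <- 2

    u <- c(0.335, 1.190)                           # unknown individual (hypothetical measurements)

    d2A <- mahalanobis(u, center = meanA, cov = SW)   # squared generalized distance to group A
    d2B <- mahalanobis(u, center = meanB, cov = SW)   # squared generalized distance to group B

    # Probability of a distance at least this large if u really belongs to each group
    pA <- pchisq(d2A, df = m, lower.tail = FALSE)
    pB <- pchisq(d2B, df = m, lower.tail = FALSE)
    rbind(A = c(D2 = d2A, p = pA), B = c(D2 = d2B, p = pB))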

  27. Quantitative characters of Picea pollen (variables x1 – x7). The means (vertical line), ±1 standard deviation (open box), and range (horizontal line) are shown for the reference populations of the three species.

  28. Results of testing for multivariate skewness and kurtosis in the seven size variables (x1 – x7) and in the size variables x1 – x6 for Picea glauca and P. mariana pollen. NOTE: None of the values for A or B is significant at the 0.05 probability level.
  The homogeneity of the covariance matrices based on all seven size variables (x1 – x7, Fig 3) was tested by means of the FORTRAN IV program ORNTDIST written by Reyment et al. (1969) and modified by H.J.B. Birks. The value of B2 obtained is 52.85, which, for … = 0.98 and 28 degrees of freedom, is significant (p = 0.003). This indicates that the hypothesis of homogeneous covariance matrices cannot be accepted. Thus the assumption of homogeneous matrices implicit in linear discriminant analysis is not justified for these data.
  Delete x7 (a redundant, invariant variable). Kullback's test then suggests that there is no reason to reject the hypothesis that the covariance matrices are homogeneous (B2 = 31.3, which, for … = 0.64 and 21 degrees of freedom, is not significant (p = 0.07)). These results show that when only variables x1 – x6 are considered the assumptions of linear discriminant analysis are justified. All the subsequent numerical analyses discussed here are thus based on variables x1 – x6 only.

  29. Representation of the discriminant function for two populations and two variables. The population means I and II and associated 95% probability contours are shown. The vector c is the discriminant vector. The points yI and yII represent the discriminant means for the two populations. The points (e), (f) and (h) represent three new individuals to be allocated. The points (q) and (r) are the discriminant scores for the individuals (e) and (f). The point (0I) is the discriminant mean yI.
  Alternative representation of the discriminant function. The axes PCI and PCII represent orthonormal linear combinations of the original variables. The 95% probability ellipses become 95% probability circles in the space of the orthonormal variables. The population means I and II for the discriminant function for the orthonormal variables are equal to the discriminant means yI and yII. Pythagorean distance can be used to determine the distances from the new individuals to the population means.

  30. CANONICAL VARIATES ANALYSIS (= MULTIPLE DISCRIMINANT ANALYSIS)

  31. Bivariate plot of three populations. A diagrammatic representation of the positions of three populations, A, B and C, when viewed as the bivariate plot of measurements x and y (transformed as in fig 20 to equalize variations) and taken from the specimens (a's, b's and c's) in each population. The positions of the populations in relation to the transformed measurements are shown.
  A diagrammatic representation of the process of generalized distance (D2) analysis performed upon the data of the left figure; d1, d2, and d3 represent the appropriate distances.
  A diagrammatic representation of the process of canonical analysis when applied to the data of the top left figure. The new axes ′ and ″ represent the appropriate canonical axes. The positions of the populations A, B, and C in relation to the canonical axes are shown.

  32. With g groups there are g – 1 canonical axes (a comparison between two means has 1 degree of freedom; between three means, 2 degrees of freedom). With m variables, if m < g – 1 we only need m axes, i.e. the number of axes is min (m, g – 1). Dimension reduction technique.

  33. An analysis of 'canonical variates' was also made for all six variables measured by GILCHRIST.* The variables are: body length (x1), abdomen length (x2), length of prosoma (x3), width of abdomen (x4), length of furca (x5), and number of setae per furca (x6). (Prosoma = head plus thorax.) The eigenvalues are, in order of magnitude, 33.213, 1.600, 0.746, 0.157, 0.030, -0.734. The first two eigenvalues account for about 99 percent of the total variation. The equations derived from the first two eigenvectors are:
  E1 = –0.13x1 + 0.70x2 + 0.07x3 – 0.36x4 – 0.35x5 – 0.14x6
  E2 = –0.56x1 + 0.48x2 + 0.08x3 – 0.18x4 – 0.20x5 + 0.31x6
  By substituting the means for each sample, the sets of mean canonical variates shown in Table App II.10 (below) were obtained.
  Artemia salina (brine shrimp): 14 groups, six variables, five localities, salinities of 35‰ and 140‰, ♂ and ♀, 2669 individuals. CANVAR R.A. Reyment

  34. Example of the relationship between the shape of the body of the brine shrimp Artemia salina and salinity. Redrawn from Reyment (1996). The salinities (35‰ and 140‰, respectively) are marked in the confidence circles. The first canonical variate reflects geographical variation in morphology; the second canonical variate indicates shape variation. The numbers in brackets after localities identify the samples. Sexual dimorphism: ♂ (green) to the left of ♀ (pink). Salinity changes from 35‰ to 140‰.

  35. SOME REVISION
  PCA: data matrix X (n x m) → Y (m x m), the sum-of-squares and cross-products matrix.
  Canonical variates analysis: X (n x m) is divided into g submatrices, one per group; centring each submatrix by its variable means gives the within-group SSP matrix Wi for that group. Summing these gives the within-groups SSP matrix W; the total SSP matrix T comes from the whole matrix; the between-groups SSP matrix is B = T – W.

  36. PCA solves an eigen-equation based on T; CVA solves the corresponding equation in which BW-1 has replaced T, and that is the obvious difference between the two. The number of CVA eigenvalues = m or g – 1, whichever is smaller. CVA maximises the ratio of B to W: a canonical variate is the linear combination of variables that maximises the ratio of the between-group sum of squares B to the within-group sum of squares W, i.e. λ1 = c1'Bc1 / c1'Wc1 (cf. PCA, where λ1 = u1'Tu1).
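  A minimal R sketch of this eigenanalysis on simulated data (my own illustration, not the CANVAR program; eigen(solve(W) %*% B) has the same eigenvalues as B W-1):

    # Canonical variates analysis as an eigenanalysis of W^-1 B (3 groups, 2 variables)
    set.seed(3)
    g <- rep(1:3, each = 30)
    X <- cbind(rnorm(90, mean = c(6, 9, 7)[g], sd = 2),
               rnorm(90, mean = c(9, 7, 8)[g], sd = 2))

    Xc  <- sweep(X, 2, colMeans(X))              # centre by the overall means
    Tot <- crossprod(Xc)                         # total SSP matrix T

    # Within-groups SSP matrix W: sum of the group-centred SSP matrices
    W <- Reduce(`+`, lapply(split.data.frame(X, g), function(Xi) {
      crossprod(sweep(Xi, 2, colMeans(Xi)))
    }))

    B <- Tot - W                                 # between-groups SSP matrix

    ev <- eigen(solve(W) %*% B)                  # canonical roots and vectors
    ev$values                                    # at most min(m, g - 1) roots are non-zero
    scores <- Xc %*% Re(ev$vectors[, 1:2])       # canonical variate scores for the individuals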

  37. Normalised eigenvectors (sum of squares = 1, i.e. each vector divided by its length) give normalised canonical variates. Adjusted canonical variates – scaled by the within-group degrees of freedom. Standardised vectors – eigenvectors multiplied by the pooled within-group standard deviations. Scores. Dimension reduction technique.

  38. Other relevant statistics CANVAR, R
  1) Multivariate analysis of variance
  2) Homogeneity of dispersion matrices, where ni is the sample size – 1 for group i, W is the pooled matrix, and Wi is the matrix for group i; the statistic has an approximate χ2 distribution.
  Geometrical interpretation – Campbell & Atchley (1981)
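  A short R illustration of the multivariate analysis of variance step, continuing the simulated X and g objects from the previous sketch (an assumption for illustration only):

    fit <- manova(X ~ factor(g))
    summary(fit, test = "Wilks")     # Wilks' lambda test of differences among group means
    summary(fit, test = "Pillai")    # Pillai's trace as an alternative criterion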

  39. Illustration of the rotation and scaling implicit in the calculation of the canonical vectors (7 groups, 2 variables): scatter ellipses and group means; ellipses for the pooled within-group matrix W, with P1 and P2 the principal components of W; each principal component scaled to unit variance; the 7 group means projected onto the principal component axes in Euclidean space; PCA of the group means gives I and II, the canonical roots; the transformations are then reversed from orthonormal to orthogonal axes (as in (a) and (b)) and from orthogonal axes to the original variables (as in (c)).

  40. AIDS TO INTERPRETATION CANVAR
  • Plot group means and individuals; plot axes 1 & 2, 2 & 3, 1 & 3.
  • Goodness of fit: λi / Σλi.
  • Individual scores and 95% group confidence contours: 2 standard deviations of the group scores, or z/√n, where z is the standardised normal deviate at the required probability level and n is the number of individuals in the group.
  • 95% confidence circle: based on the 5% tabulated value of F with 2 and (n – 2) degrees of freedom.
  • Minimum spanning tree of D2.
  • Scale axes to √λ's.
  • Total D2 = D2 on the plotted axes + D2 on the other axes (the latter should be small if the model is a good fit); residual D2 of the group means.
