Canonical Discriminant Analysis for Classification and Prediction

  1. Canonical Discriminant Analysis for Classification and Prediction

  2. Discriminant analysis techniques are used to classify individuals into one of two or more alternative groups (or populations) on the basis of a set of measurements. The populations are known to be distinct, and each individual belongs to one of them. These techniques can also be used to identify which variables contribute most to making the classification. Thus, as in regression analysis, we have two uses: prediction and description.
  As an example, consider an archeologist who wishes to determine which of two possible tribes created a particular statue found in a dig. The archeologist takes measurements on several characteristics of the statue and must decide whether these measurements are more likely to have come from the distribution characterizing the statues of one tribe or from the other tribe's distribution. The distributions are based on data from statues known to have been created by members of one tribe or the other. The problem of classification is therefore to guess who made the newly found statue on the basis of measurements obtained from statues whose identities are certain.
  As another example, consider a loan officer at a bank who wishes to decide whether to approve an applicant's automobile loan. This decision is made by determining whether the applicant's characteristics are more similar to those of persons who in the past repaid loans successfully or to those of persons who defaulted. Information on these two groups, available from past records, would include factors such as age, income, marital status, outstanding debt, and home ownership.

  3. Assumptions
  • Linearity: Similar to other multivariate techniques that employ a variate (i.e., a linear combination representing the weighted sum of two or more predictor variables that best discriminates between a priori defined groups), an implicit assumption is that all relationships among all pairs of predictors within each group are linear. However, violation of this assumption is less serious than of the others, in that it tends to lead to reduced power rather than inflated Type I error (Tabachnick and Fidell, 2001).
  • Multivariate normality: The assumption is that scores on each predictor variable are normally distributed (univariate normality) and that the sampling distribution of any combination of two or more predictors is also normally distributed (multivariate normality). Multivariate normality is difficult to test, and currently there are no specific tests capable of checking the normality of all linear combinations of the predictors. However, since multivariate normality implies univariate normality (although the reverse is not necessarily true), a situation in which all variables exhibit univariate normality helps achieve, although does not guarantee, multivariate normality (Hair et al., 1995).

  4. Assumptions (Contd.)
  • Homogeneity of variance-covariance matrices: When sample sizes are unequal and small, unequal covariance matrices can adversely affect the results of significance testing. Even with decently sized samples, heterogeneity of variance-covariance matrices can affect the classification process, whereby cases are "over-classified" into groups with greater variability (Tabachnick and Fidell, 2001). A test of this assumption can be made via Box's M. As this test is overly sensitive (it increases the probability of Type I error), an alpha level of .001 is recommended.
  • Multicollinearity: As with multiple regression analysis, multicollinearity denotes the situation where the independent/predictor variables are highly correlated. When independent variables are multicollinear, there is "overlap" or sharing of predictive power, so that one variable can be highly explained or predicted by the other variable(s). Thus, that predictor variable adds little to the explanatory power of the entire set.

  5. Application Areas
  1. The major application area for this technique is where we want to be able to distinguish between two or more sets of objects or people, based on knowledge of some of their characteristics.
  2. Examples include the selection process for a job, the admission process of an educational programme in a college, or dividing a group of people into potential buyers and non-buyers.
  3. Discriminant analysis can be, and in fact is, used by credit rating agencies to rate individuals and classify them into good lending risks or bad lending risks.
  4. To summarize, we can use discriminant analysis when we have a nominal dependent variable and want to classify objects into two or more groups based on knowledge of some variables (characteristics) related to them. Typically, these groups would be users versus non-users, potentially successful versus potentially unsuccessful salesmen, high-risk versus low-risk consumers, or similar.

  6. Methods, Data etc.
  1. Discriminant analysis is very similar to the multiple regression technique. The form of the equation in a two-variable discriminant analysis is:
  Y = a + k1 x1 + k2 x2
  2. This is called the discriminant function. As in regression analysis, Y is the dependent variable and x1 and x2 are independent variables; k1 and k2 are the coefficients of the independent variables, and a is a constant. In practice, there may be any number of x variables.
  3. Please note that Y in this case is a categorical variable (unlike in regression analysis, where it is continuous). x1 and x2 are, however, continuous (metric) variables. k1 and k2 are determined by appropriate algorithms in the computer package used, but the underlying objective is that these coefficients should maximize the separation, or difference, between the two groups of the Y variable.
  4. Y will have 2 possible values in a 2-group discriminant analysis, 3 values in a 3-group discriminant analysis, and so on.

  7. 5. k1 and k2 are also called the unstandardised discriminant function coefficients.
  6. As mentioned above, Y is a classification into 2 or more groups and is therefore a 'grouping' variable, in the terminology of discriminant analysis. That is, groups are formed on the basis of existing data, and coded as 1, 2, etc.
  7. The independent (x) variables are continuous scale variables, and are used as predictors of the group to which the objects will belong. Therefore, to be able to use discriminant analysis, we need to have some data on Y and the x variables from experience and/or past records. (A small worked sketch of how such coefficients can be computed follows below.)
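
What follows is a minimal Python sketch of how two-group (Fisher) discriminant coefficients can be computed: the weight vector is the inverse of the pooled within-group covariance matrix times the difference in group mean vectors. The data values and scaling are illustrative assumptions only; SPSS additionally rescales the coefficients so that the within-group variance of the discriminant scores equals 1.

```python
import numpy as np

# Illustrative data only: two groups, two predictors (x1, x2).
X = np.array([[25, 3.0], [30, 4.5], [35, 5.0], [28, 3.5],   # group 1
              [45, 8.0], [50, 9.5], [42, 7.0], [48, 8.5]])  # group 2
y = np.array([1, 1, 1, 1, 2, 2, 2, 2])

X1, X2 = X[y == 1], X[y == 2]
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Pooled within-group covariance matrix S_w.
S_w = ((len(X1) - 1) * np.cov(X1, rowvar=False) +
       (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X) - 2)

# Unstandardised weights (up to scaling): k = S_w^{-1} (m1 - m2).
k = np.linalg.solve(S_w, m1 - m2)
a = -k @ (m1 + m2) / 2        # constant chosen so the cut-off score is 0

scores = X @ k + a            # discriminant scores for the existing cases
print("coefficients:", k, " constant:", a)
```

With this choice of constant, the midpoint of the two group mean scores is 0, which becomes the cut-off used in the decision rule discussed next.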

  8. Building a Model for Prediction/Classification
  Assuming we have data on both the Y and x variables of interest, we estimate the coefficients of the model, which is a linear equation of the form shown earlier, and use the coefficients to calculate the Y value (discriminant score) for any new data points that we want to classify into one of the groups. A decision rule is formulated for this process, to determine the cut-off score, which is usually the midpoint of the mean discriminant scores of the two groups.
  Accuracy of Classification: The classification of the existing data points is then done using the equation, and the accuracy of the model is determined. This output is given by the classification matrix (also called the confusion matrix), which tells us what percentage of the existing data points is correctly classified by this model.
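
Continuing the same idea, the classification step and the confusion (classification) matrix can be sketched as below. This uses scikit-learn's LinearDiscriminantAnalysis purely as a convenient stand-in for the SPSS procedure, with made-up data, and applies the midpoint cut-off described above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

# Made-up two-group data (rows: cases; columns: predictors).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([30, 4], 3, size=(20, 2)),
               rng.normal([45, 8], 3, size=(20, 2))])
y = np.array([1] * 20 + [2] * 20)

lda = LinearDiscriminantAnalysis()
scores = lda.fit(X, y).transform(X).ravel()   # one discriminant score per case

# Cut-off = midpoint of the two group mean discriminant scores.
cutoff = (scores[y == 1].mean() + scores[y == 2].mean()) / 2
upper = 1 if scores[y == 1].mean() > cutoff else 2   # which group lies above the cut-off
lower = 2 if upper == 1 else 1
predicted = np.where(scores > cutoff, upper, lower)

cm = confusion_matrix(y, predicted)           # the classification (confusion) matrix
print(cm)
print("percent correctly classified:", 100 * np.trace(cm) / cm.sum())
```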

  9. This percentage is somewhat analogous to the R2 in regression analysis (percentage of variation in dependent variable explained by the model). Of course, the actual predictive accuracy of the discriminant model may be less than the figure obtained by applying it to the data points on which it was based. Stepwise / Fixed Model: Just as in regression, we have the option of entering one variable at a time (Stepwise) into the discriminant equation, or entering all variables which we plan to use. Depending on the correlations between the independent variables, and the objective of the study, the choice is left to the researcher.

  10. Relative Importance of Independent Variables
  1. Suppose we have two independent variables, x1 and x2. How do we know which one is more important in discriminating between groups?
  2. The coefficients of x1 and x2 provide the answer, but not the raw (unstandardised) coefficients. To overcome the problem of different measurement units, we must obtain standardised discriminant coefficients. These are available from the computer output.
  3. The larger the absolute value of a variable's standardised discriminant coefficient, the greater its discriminating power.
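
As a rough illustration of how the two sets of coefficients are related, a standardised coefficient can be obtained by multiplying the unstandardised coefficient by the pooled within-group standard deviation of its predictor. The unstandardised coefficients below are the ones quoted later for the credit-card example; the within-group standard deviations are assumed values for illustration only.

```python
import numpy as np

# Unstandardised coefficients for age, income and years of marriage (from the example),
# and *assumed* pooled within-group standard deviations (illustrative only).
unstd_coef = np.array([0.24560, 0.00008, 0.08465])
within_sd  = np.array([6.0, 4000.0, 5.0])

std_coef = unstd_coef * within_sd
for name, b in zip(["age", "income", "years of marriage"], std_coef):
    print(f"{name}: standardised coefficient = {b:.3f}")
```

Because the standardised coefficients are unit-free, the predictor with the largest absolute standardised coefficient contributes most to separating the groups.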

  11. Example
  Suppose a Bank wants to start a credit card division. It wants to use discriminant analysis to set up a system that screens applicants and classifies them as either 'low risk' or 'high risk' (risk of default on credit card bill payments), based on information collected from their credit card applications. Suppose the Bank has managed to obtain from its sister bank some data on that bank's credit card holders who turned out to be 'low risk' (no default) and 'high risk' (defaulting on payments) customers. These data on 18 customers are given in fig. 1 (File: Discriminant.sav).

  12. Variables: RISKL (Risk Level: 1-Low Risk, 2-High Risk), AG (age), INC (income), YRSM (years of marriage)

  13. We will perform a discriminant analysis and advise the Bank on how to set up its system to screen potential good customers (low risk) from bad customers (high risk). In particular, we will build a discriminant function (model) and find out:
  • the percentage of customers that it is able to classify correctly;
  • the statistical significance of the discriminant function;
  • which variables (age, income, or years of marriage) are relatively better at discriminating between 'low' and 'high' risk applicants;
  • how to classify a new credit card applicant into one of the two groups, 'low risk' or 'high risk', by building a decision rule and a cut-off score.

  14. SPSS analysis steps: Analyze – Classify – Discriminant; Grouping variable – RISKL; Independents – age, income, years of marriage; Statistics – Descriptives: Means, Univariate ANOVAs, Box's M; Function Coefficients: Unstandardized; Classify – Display: Summary Table; OK.
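
For readers working outside SPSS, the analysis can be approximated in Python. The sketch below assumes the file and variable names quoted on the previous slides (Discriminant.sav with RISKL, AG, INC and YRSM); pandas.read_spss requires the pyreadstat package.

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumes the SPSS data file and variable names from the example slides.
df = pd.read_spss("Discriminant.sav")

X = df[["AG", "INC", "YRSM"]]
y = df["RISKL"]

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

print("Group means:\n", df.groupby("RISKL")[["AG", "INC", "YRSM"]].mean())
print("Proportion correctly classified (training data):", lda.score(X, y))
```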

  15. Validity of discriminant analysis: The validity of the analysis is judged by the Wilks' lambda statistic. Wilks' lambda is the ratio of the within-groups sum of squares to the total sum of squares, i.e., the proportion of the total variance in the discriminant scores not explained by differences among groups. It is therefore a badness-of-fit measure and ranges from 0 to 1. For a good discriminant analysis it should be as close to zero as possible (values of around 0.3 or 0.4 are often considered acceptable). A lambda of 1.00 occurs when the observed group means are equal (all the variance is explained by factors other than differences between those means), while a small lambda occurs when within-groups variability is small compared to the total variability. A small lambda thus indicates that the group means appear to differ, and the associated significance value indicates whether the difference is significant. Here, the lambda of 0.319 has a significant value (Sig. = 0.001); thus, the group means appear to differ.
  H0: The discriminant analysis is not valid (the group means do not differ).
  H1: The discriminant analysis is valid (the group means differ).
  If the observed sig. is < 0.05, we reject H0.

  16. The value of Wilks' lambda is 0.319. This value lies between 0 and 1, and a low value (closer to 0) indicates better discriminating power of the model; thus, 0.319 indicates that the model is good. The probability value of the chi-square test indicates that the discrimination between the two groups is highly significant: the test would be significant even at a confidence level of up to 99.9%.
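
A hand computation of Wilks' lambda and the chi-square test reported alongside it can be sketched as follows. Lambda is the ratio of the determinant of the within-groups SSCP matrix to the determinant of the total SSCP matrix, and the chi-square value uses Bartlett's approximation; the data here are made up for illustration.

```python
import numpy as np
from scipy import stats

# Made-up data: rows are cases, columns are predictors; y gives group membership.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([30, 4, 10], 3, size=(9, 3)),
               rng.normal([45, 8, 18], 3, size=(9, 3))])
y = np.array([1] * 9 + [2] * 9)

n, p = X.shape
groups = np.unique(y)
g = len(groups)

# Total and within-groups sums-of-squares-and-cross-products (SSCP) matrices.
T = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0))
W = sum((X[y == k] - X[y == k].mean(axis=0)).T @ (X[y == k] - X[y == k].mean(axis=0))
        for k in groups)

wilks_lambda = np.linalg.det(W) / np.linalg.det(T)

# Bartlett's chi-square approximation.
chi2 = -(n - 1 - (p + g) / 2) * np.log(wilks_lambda)
df = p * (g - 1)
p_value = stats.chi2.sf(chi2, df)
print(f"Wilks' lambda = {wilks_lambda:.3f}, chi-square = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```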

  17. Significance of predictors: Wilks' lambda: In discriminant analysis, Wilks' lambda is also used to test the significance of each predictor by itself. Since the p values are all < 0.05, age, income and years of marriage are each significant predictors on their own.
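
The per-predictor tests (SPSS's Tests of Equality of Group Means) are one-way ANOVAs of each predictor across the groups; for a single predictor, Wilks' lambda is simply the within-groups sum of squares divided by the total sum of squares. A minimal sketch with made-up data:

```python
import numpy as np
from scipy import stats

# Made-up data for three predictors measured in the two risk groups.
rng = np.random.default_rng(2)
low  = {"age": rng.normal(40, 5, 9), "income": rng.normal(25000, 4000, 9), "yrsm": rng.normal(14, 4, 9)}
high = {"age": rng.normal(28, 5, 9), "income": rng.normal(15000, 4000, 9), "yrsm": rng.normal(4, 3, 9)}

# One-way ANOVA per predictor: a significant F means the group means differ on that predictor.
for var in ["age", "income", "yrsm"]:
    f, p = stats.f_oneway(low[var], high[var])
    print(f"{var}: F = {f:.2f}, p = {p:.4f}")
```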

  18. Box's M Test of Equality of Covariance Matrices
  This assumption is tested with Box's M test, and the results are presented under the heading Box's Test of Equality of Covariance Matrices. The results indicate that the Box's M value of 5.992 (F = 0.793) is associated with an alpha level of .576. As mentioned earlier, Box's M is highly sensitive to factors other than covariance differences alone (e.g., non-normality of the variables and large sample sizes). As such, an alpha level of .001 is recommended. On this basis, the computed level of .576 is not significant (p > .001). Thus, the assumption of equality of covariance matrices has not been violated. Because Box's M is also sensitive to departures from multivariate normality, the nonsignificant result (0.576 > 0.001) is sometimes taken as a further indication that the data do not depart markedly from multivariate normal. This means one can proceed with the analysis.
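
For completeness, a rough sketch of Box's M and its chi-square approximation is given below. The formulas follow the usual textbook presentation (Box, 1949) and the data are made up; in practice one would simply read the statistic from the SPSS output rather than compute it by hand.

```python
import numpy as np
from scipy import stats

# Made-up data: two groups, three predictors, with different spread in group 2.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(9, 3)), rng.normal(0, 1.5, size=(9, 3))])
y = np.array([1] * 9 + [2] * 9)

groups = np.unique(y)
g, p, N = len(groups), X.shape[1], X.shape[0]
ns = np.array([np.sum(y == k) for k in groups])

covs = [np.cov(X[y == k], rowvar=False) for k in groups]            # group covariance matrices
S_pooled = sum((n_k - 1) * S for n_k, S in zip(ns, covs)) / (N - g)

# Box's M statistic.
M = (N - g) * np.log(np.linalg.det(S_pooled)) - sum(
    (n_k - 1) * np.log(np.linalg.det(S)) for n_k, S in zip(ns, covs))

# Chi-square approximation to Box's M.
c = (sum(1 / (n_k - 1) for n_k in ns) - 1 / (N - g)) * \
    (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))
chi2 = M * (1 - c)
df = p * (p + 1) * (g - 1) / 2
print(f"Box's M = {M:.3f}, chi-square = {chi2:.3f}, df = {df:.0f}, p = {stats.chi2.sf(chi2, df):.3f}")
```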

  19. Eigenvalues
  An eigenvalue indicates the proportion of the discriminating ability accounted for by a function: it is the ratio of the between-groups to the within-groups variability of the discriminant scores, so a large eigenvalue is associated with a strong function.
  Canonical correlation
  The canonical correlation is a measure of the association between the groups of the dependent variable and the discriminant function. A high correlation indicates a function that discriminates well. We can square the canonical correlation to obtain the effect size for the discriminant function.
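
These quantities can be computed from the eigenvalues of W⁻¹B, where W and B are the within-groups and between-groups SSCP matrices: each eigenvalue is the between-to-within variance ratio of its function, and the corresponding canonical correlation is sqrt(lambda / (1 + lambda)). A minimal sketch with made-up three-group data:

```python
import numpy as np

# Made-up data: three groups, three predictors.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 1.5, size=(12, 3)) for m in ([0, 0, 0], [2, 1, 0], [0, 2, 2])])
y = np.repeat([1, 2, 3], 12)

grand_mean = X.mean(axis=0)
W = sum((X[y == k] - X[y == k].mean(axis=0)).T @ (X[y == k] - X[y == k].mean(axis=0))
        for k in np.unique(y))
B = sum(np.sum(y == k) * np.outer(X[y == k].mean(axis=0) - grand_mean,
                                  X[y == k].mean(axis=0) - grand_mean)
        for k in np.unique(y))

eigvals = np.sort(np.linalg.eigvals(np.linalg.solve(W, B)).real)[::-1]
eigvals = eigvals[:min(X.shape[1], len(np.unique(y)) - 1)]   # number of discriminant functions

canonical_corr = np.sqrt(eigvals / (1 + eigvals))
print("eigenvalues:", np.round(eigvals, 3))
print("% of variance:", np.round(100 * eigvals / eigvals.sum(), 1))
print("canonical correlations:", np.round(canonical_corr, 3))
```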

  20. Checking for Multicollinearity: In Multiple Regression using SPSS, it is possible to request the display of Tolerance and VIF values for each predictor as a check for multicollinearity. A tolerance value is an indication of the percent of variance in the predictor that cannot be accounted for by the other predictors. Hence, very small values indicate “overlap” or sharing of predictive power (i.e., the predictor is redundant). Values that are less than 0.10 may merit further investigation. The VIF, which stands for variance inflation factor, is computed as “1/tolerance,” and it is suggested that predictor variables whose VIF values are greater than 10 may merit further investigation. For this example, multicollinearity is not a problem.
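
Tolerance and VIF can also be computed by hand by regressing each predictor on the remaining predictors; tolerance is 1 minus the resulting R-squared, and VIF is its reciprocal. The sketch below uses made-up, deliberately correlated data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up predictor matrix (columns: age, income, years of marriage).
rng = np.random.default_rng(5)
age = rng.normal(35, 8, 50)
yrsm = age - 22 + rng.normal(0, 3, 50)        # deliberately correlated with age
income = rng.normal(20000, 5000, 50)
X = np.column_stack([age, income, yrsm])

for i, name in enumerate(["age", "income", "years of marriage"]):
    others = np.delete(X, i, axis=1)                     # all predictors except the i-th
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    tolerance = 1 - r2
    print(f"{name}: tolerance = {tolerance:.3f}, VIF = {1 / tolerance:.2f}")
```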

  21. The Structure Matrix provides the correlations of each independent variable with the standardized discriminating function. Observe that age and years of marriage have notable positive correlations with the function, but income is moderately correlated.

  22. Let us take a credit card applicant to the Bank who is aged 40, has an income of SAR 25,000 per month, and has been married for 15 years. Plugging these values into the discriminant function (model) above, we find his discriminant score y to be
  y = -10.0036 + 40(0.24560) + 25000(0.00008) + 15(0.08465)
  = -10.0036 + 9.824 + 2 + 1.26975
  = 3.09015
  According to our decision rule, any discriminant score to the right of the midpoint of 0 leads to classification in the low-risk group. Therefore, we can give this person a credit card, as he is a low-risk customer. The same process is followed for any new applicant: if his discriminant score is to the right of the midpoint of 0, he should be given a credit card, as he is a 'low risk' customer.
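
The same calculation and decision rule, written as a few lines of Python (the coefficients are those quoted above, and the cut-off of 0 is the midpoint from the decision rule):

```python
# Unstandardised discriminant function from the example above.
def discriminant_score(age, income, yrsm):
    return -10.0036 + 0.24560 * age + 0.00008 * income + 0.08465 * yrsm

score = discriminant_score(age=40, income=25000, yrsm=15)
group = "low risk" if score > 0 else "high risk"   # cut-off = 0 (midpoint of group mean scores)
print(f"score = {score:.5f} -> classified as {group}")
```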

  23. Discriminant Function Analysis with Three or More Groups
  The data used in this example have 244 observations on four variables, as given in the file discrim-3groups.sav. The variables include three continuous, numeric variables (outdoor, social and conservative scores) and one categorical variable (job) with three levels: 1) customer service, 2) mechanic and 3) dispatcher. We are interested in the relationship between the three continuous variables and our categorical variable. In this example, job has three levels and three discriminating variables are used, so two functions are calculated. Each function acts as a projection of the data onto a dimension that best separates or discriminates between the groups. The number of functions equals the smaller of the number of discriminating variables and one less than the number of groups (levels of the grouping variable).
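
A Python sketch of the three-group analysis, assuming the file and variable names given above (discrim-3groups.sav with outdoor, social, conservative and job); pandas.read_spss requires the pyreadstat package.

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumes the file and variable names quoted on the slide above.
df = pd.read_spss("discrim-3groups.sav")

X = df[["outdoor", "social", "conservative"]]
y = df["job"]

lda = LinearDiscriminantAnalysis(n_components=2)   # min(3 variables, 3 groups - 1) = 2 functions
scores = lda.fit(X, y).transform(X)                # each case's scores on the two functions

# Group centroids: mean of the two function scores within each job category.
centroids = pd.DataFrame(scores, columns=["F1", "F2"]).groupby(y.values).mean()
print(centroids)
print("explained proportion per function:", lda.explained_variance_ratio_)
```

The two columns returned by transform are the case scores on the two discriminant functions; averaging them within each job category gives the group centroids discussed below.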

  24. SPSS analysis steps: Analyze – Classify – Discriminant; Grouping variable – Job; Independents – outdoor, social and conservative; Statistics – Descriptives: Means, Univariate ANOVAs, Box's M; Function Coefficients: Unstandardized; Classify – Display: Summary Table; OK.

  25. Validity of discriminant analysis: The validity of the analysis is judged by the Wilks' lambda statistic.
  H0: The discriminant analysis is not valid. H1: The discriminant analysis is valid. If the observed sig. is < 0.05, we reject H0.
  Significance of predictors: Wilks' lambda: In discriminant analysis, Wilks' lambda is also used to test the significance of each predictor by itself. Since the p values are all < 0.05, outdoor, social and conservative are each significant predictors on their own.

  26. Box's M Test of Equality of Covariance Matrices The computed level of .012 is not significant (p > .001). Thus, the assumption of equality of covariance matrices has not been violated. Canonical correlation A high canonical correlation indicates a function that discriminates well. We can square the Canonical Correlation to compute the effect size for the discriminant function. The magnitudes of the eigenvalues are indicative of the functions' discriminating abilities. % of Variance is the proportion of discriminating ability of the three continuous variables found in a given function. 

  27. Checking for Multicollinearity: A tolerance value is an indication of the percent of variance in the predictor that cannot be accounted for by the other predictors. Hence, very small values indicate “overlap” or sharing of predictive power (i.e., the predictor is redundant). Values that are less than 0.10 may merit further investigation. The VIF, which stands for variance inflation factor, is computed as “1/tolerance,” and it is suggested that predictor variables whose VIF values are greater than 10 may merit further investigation. For this example, multicollinearity is not a problem.

  28. The Structure Matrix provides the correlations of each independent variable with the standardized discriminant functions. Observe that social and conservative have significant correlations with the first function, while outdoor is significantly correlated with the second function.

  29. The group centroids for the three job categories, 1) customer service, 2) mechanic and 3) dispatcher, are: A(-1.219, -0.389), B(0.107, 0.715) and C(1.420, -0.506). Draw the graph with the two functions F1 and F2 on the two axes and locate the points A, B and C. For prediction, locate the nearest centroid by calculating the discriminant function values F1 and F2 for the given scores on the independent variables. For example, consider the scores outdoor = 10, social = 20 and conservative = 5. The values are F1 = -1.248 and F2 = -0.808. The point (-1.248, -0.808) is closest to centroid A, corresponding to job 1, customer service (see the graph on the next slide).
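
The nearest-centroid step can be written out directly. The centroid coordinates and the example point below are the values quoted on this slide; the function coefficients themselves are not reproduced here, so the point (-1.248, -0.808) is taken as given.

```python
import numpy as np

# Group centroids on the two discriminant functions (F1, F2), as quoted above.
centroids = {
    "customer service": np.array([-1.219, -0.389]),
    "mechanic":         np.array([ 0.107,  0.715]),
    "dispatcher":       np.array([ 1.420, -0.506]),
}

# Discriminant scores for the new case (outdoor=10, social=20, conservative=5),
# as computed on the slide.
point = np.array([-1.248, -0.808])

distances = {job: np.linalg.norm(point - c) for job, c in centroids.items()}
predicted = min(distances, key=distances.get)
print(distances)
print("predicted group:", predicted)   # expected: customer service
```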
