Basics of discriminant analysis
• Purpose
• Various situations
• Examples
• R commands
• Demonstration
Purpose: 2D example Assume that we have several groups and we know that each observation must belong to one of them. For example, there may be several diseases, and from the symptoms we want to decide which disease we are dealing with. Or we may have several species of plants: when we observe various characteristics of a specimen we want to know to which species it belongs. We want to divide our space into regions so that, when we make a new observation, we can decide which region it falls into. Each region is assigned to one of the classes: if an observation falls into region k then we say that it belongs to class k. In the picture we have three regions; if an observation falls into region 1 then we decide that it is a member of class 1. Discriminant analysis is widely used in many fields; for example, it is an integral part of neural networks. [Figure: 2D plane divided into three regions labelled 1, 2 and 3]
Various situations There can be several situations: • We know the distribution for each class (an unrealistic assumption). Then the problem becomes easy: for a new observation, calculate its probability under each class's distribution; whichever class gives the largest value wins. • We know the form of the distributions but not their parameters. For example, we may know that the distribution for each class is normal but not know the means and variances. Then we need representatives (a training sample) for each class. Once we have them, we can estimate the parameters of the distributions (mean and variance in the normal case). When a new observation arrives we treat these estimates as the true parameters and calculate the probabilities; again the largest probability wins. • We have prior probabilities. E.g. in the case of diseases we may know that one of them has prior probability 0.7 and the other 0.3. In this case we multiply the probability of the observation under each class by the corresponding prior before comparing (a sketch of this rule is given below).
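As an illustration of the known-distributions situation with priors, here is a minimal R sketch. The two normal classes, their means, standard deviations and prior probabilities are all made-up values for illustration; the observation is simply assigned to the class with the largest prior-weighted density.

# Hypothetical known class distributions: two normal classes
means  <- c(5, 8)      # assumed class means
sds    <- c(1, 1)      # assumed class standard deviations
priors <- c(0.7, 0.3)  # assumed prior probabilities

# Assign an observation to the class with the largest prior * density
classify <- function(x) {
  scores <- priors * dnorm(x, mean = means, sd = sds)
  which.max(scores)
}

classify(6.2)  # compares 0.7*dnorm(6.2, 5, 1) with 0.3*dnorm(6.2, 8, 1)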
Various situations: unknown parameters If we know that the probability distributions are normal, there are two cases: • The variances of the distributions are the same. In this case the space is divided by hyperplanes. In the one-dimensional case with two classes there is one point that divides the line into two regions; this point is the midpoint of the two class means. In the two-dimensional case with two classes there is a line that divides the plane into two regions; this line crosses the segment joining the two means at its midpoint. In three dimensions we have planes. • The variances are different. In this case the regions are separated by surfaces defined by quadratic forms. In the one-dimensional case there are two points. In the two-dimensional case the boundary may be an ellipse, hyperbola, parabola or a pair of lines; its form depends on the differences between the variances. In the three-dimensional case it can be an ellipsoid, hyperboloid, etc.
Maximum likelihood discriminant analysis Let us assume that we have g populations (groups), and that population i has probability distribution L_i(x). For an observation, the likelihood under each population is calculated and the population with the largest likelihood is chosen. If two populations have the same likelihood then either of them can be chosen. Now assume that the populations are one-dimensional, their distributions are normal, and there are only two of them. Then x is allocated to population 1 if L_1(x) ≥ L_2(x), i.e.

(1/σ_1) exp(−(x−μ_1)²/(2σ_1²)) ≥ (1/σ_2) exp(−(x−μ_2)²/(2σ_2²)),

or, taking logarithms,

(x−μ_2)²/(2σ_2²) − (x−μ_1)²/(2σ_1²) ≥ log(σ_1/σ_2).

This quadratic inequality divides the real line into two regions: where it is satisfied the observation belongs to class 1, otherwise it belongs to class 2. When the variances are equal the inequality becomes linear: if μ_1 > μ_2, the rule puts x into group 1 whenever x > (μ_1 + μ_2)/2. Multidimensional cases are similar, except that the inequalities are multidimensional; when the variances (covariance matrices) are equal, the space is divided by a hyperplane (a line in the two-dimensional case). If the parameters of the distributions are not known they are estimated from the given observations.
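A minimal R sketch of this rule for two one-dimensional normal populations (the means 5 and 8 and unit variances are the assumed values used in the example on the next slide): the sign of the log-likelihood difference decides the class, and with equal variances the sign changes exactly at the midpoint (5 + 8)/2 = 6.5.

# Log-likelihood difference log L1(x) - log L2(x) for two normal populations
loglik_diff <- function(x, mu1, sd1, mu2, sd2) {
  dnorm(x, mu1, sd1, log = TRUE) - dnorm(x, mu2, sd2, log = TRUE)
}

# Equal variances: the decision boundary is the midpoint of the means
mu1 <- 5; mu2 <- 8; s <- 1
loglik_diff(c(6.4, 6.5, 6.6), mu1, s, mu2, s)
# positive below 6.5 (class 1), zero at 6.5, negative above (class 2)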
Distributions with equal and known variances: 1D example The probability distributions for the classes are known and normal, both with variance 1; one has mean 5 and the other mean 8. Anything below 6.5 belongs to class 1 and anything above 6.5 belongs to class 2; an observation exactly at 6.5 could be assigned to either class. The observations a and b will be assigned to class 1 and the observations c and d to class 2. In general, anything smaller than the midpoint of the two means is assigned to class 1 and anything larger to class 2. [Figure: two normal density curves (class 1 centred at 5, class 2 centred at 8), the discrimination point at 6.5, and new observations a, b, c, d on the axis]
Distributions with known but different variances: 1D example Assume that we have two classes whose probability distributions are normal with known means and variances, and that one distribution is much sharper (has much smaller variance) than the other. In this case the probability of observation b under class 2 is higher than under class 1, while the probability of c under class 1 is higher than under class 2. The probability of observation a, although very small, is higher under class 1 than under class 2. Thus the observations a, c and d are assigned to class 1 and the observation b to class 2: very small and very large observations belong to class 1 and medium observations to class 2. [Figure: a broad class 1 density and a sharp class 2 density, the intervals assigned to each class, and new observations a, b, c, d on the axis]
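For the unequal-variance case the boundary consists of two points, which can be found by solving the quadratic equation log L1(x) = log L2(x). The parameters below (a broad class 1 and a sharp class 2) are made-up values chosen to mimic the picture; this is a sketch, not the slide's exact figure.

# Cut points where log L1(x) = log L2(x) for two normals with unequal variances
mu1 <- 5; s1 <- 3    # class 1: broad distribution (assumed values)
mu2 <- 7; s2 <- 1    # class 2: sharp distribution (assumed values)

# log L1(x) - log L2(x) is the quadratic a2*x^2 + a1*x + a0
a2 <- 1/(2*s2^2) - 1/(2*s1^2)
a1 <- mu1/s1^2 - mu2/s2^2
a0 <- mu2^2/(2*s2^2) - mu1^2/(2*s1^2) + log(s2/s1)

Re(polyroot(c(a0, a1, a2)))  # the two cut points

Between the two roots class 2 has the larger likelihood; outside them class 1 wins, which reproduces the "very small and very large observations go to class 1" behaviour described above.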
Two dimensional example In the two-dimensional case we want to divide the plane into two (or more) regions; when a new observation falls into one of these regions we decide its class. The red dot is in the region corresponding to class 1 and the blue dot is in the region corresponding to class 2. The parameters of the distributions are estimated from the sample points (shown as small black dots), with 50 observations for each class. If the variances of the distributions turn out to be equal we get a linear discrimination line; if the variances were unequal we would get quadratic discrimination (the boundaries would be quadratic curves). [Figure: scatter of training points for class 1 and class 2, the discrimination line, and two new observations shown as red and blue dots]
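A sketch of this set-up in R: two classes of 50 points each are simulated from bivariate normals with a common covariance (the class centres are made-up values), lda estimates the linear discrimination rule, and predict assigns two new observations.

library(MASS)  # for mvrnorm and lda

set.seed(1)
n <- 50
Sigma <- diag(2)                       # common covariance (equal variances)
x1 <- mvrnorm(n, mu = c(0, 0), Sigma)  # assumed class 1 centre
x2 <- mvrnorm(n, mu = c(3, 2), Sigma)  # assumed class 2 centre

data      <- rbind(x1, x2)
groupings <- factor(rep(c(1, 2), each = n))

z <- lda(data, grouping = groupings)            # linear discrimination rule
predict(z, newdata = rbind(c(0.5, 0.5), c(2.5, 2.0)))$class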
Likelihood ratio discriminant analysis The likelihood ratio discriminant rule tentatively adds the given observation to each group in turn and re-estimates the parameters; this is done for every group, and the observation is allocated to the group for which the resulting likelihood is largest. This technique tends to put an observation into the population with the larger sample size. A sketch of the procedure is given below.
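A minimal sketch of this rule for one-dimensional normal groups (the function name and the toy data are made up, and maximum likelihood estimates of mean and variance are assumed): the observation is appended to each group in turn, all parameters are re-estimated, and the total log-likelihood of the augmented data is compared.

# Likelihood ratio rule for 1D normal groups (sketch)
# groups: a list of numeric vectors, one per group; x: the new observation
lr_classify <- function(groups, x) {
  total_loglik <- function(gs) {
    # ML estimates (variance with divisor n) and total log-likelihood
    sum(sapply(gs, function(v) {
      m <- mean(v); s <- sqrt(mean((v - m)^2))
      sum(dnorm(v, m, s, log = TRUE))
    }))
  }
  scores <- sapply(seq_along(groups), function(i) {
    gs <- groups
    gs[[i]] <- c(gs[[i]], x)   # tentatively add x to group i and re-estimate
    total_loglik(gs)
  })
  which.max(scores)
}

lr_classify(list(g1 = c(4.8, 5.1, 5.3), g2 = c(7.9, 8.2)), x = 6.4)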
Fisher’s discriminant function Fisher’s discrimination rule finds the linear combination aᵀx that maximises the ratio of the between-groups sum of squares to the within-groups sum of squares,

aᵀB a / aᵀW a,

where W is the within-groups sum of squares matrix

W = Σ_{i=1..g} Σ_{j=1..n_i} (x_ij − x̄_i)(x_ij − x̄_i)ᵀ,

n is the total number of observations, g is the number of groups, x̄_i is the mean of group i and n_i is the number of observations in group i. There are several ways of calculating the between-groups sum of squares B; one popular choice is the weighted sum of squares

B = Σ_{i=1..g} n_i (x̄_i − x̄)(x̄_i − x̄)ᵀ.

The problem of finding the discrimination rule then reduces to finding the maximum eigenvalue and the corresponding eigenvector a of the matrix W⁻¹B. A new observation x is put into group i if the following inequality holds for all j ≠ i:

|aᵀ(x − x̄_i)| < |aᵀ(x − x̄_j)|.
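The following R sketch computes Fisher's direction directly from the definitions above, using the built-in iris data as an assumed example: it forms W and the weighted B, takes the leading eigenvector of W⁻¹B, and allocates an observation to the group whose projected mean is nearest.

# Fisher's discriminant direction computed from W and B (sketch, iris data)
X   <- as.matrix(iris[, 1:4])
grp <- iris$Species

xbar  <- colMeans(X)
means <- rowsum(X, grp) / as.vector(table(grp))   # group means, one row per group

# Within-groups and weighted between-groups sum of squares matrices
W <- Reduce(`+`, lapply(levels(grp), function(g) {
  Xi <- X[grp == g, , drop = FALSE]
  crossprod(scale(Xi, center = TRUE, scale = FALSE))
}))
B <- Reduce(`+`, lapply(levels(grp), function(g) {
  d <- means[g, ] - xbar
  sum(grp == g) * tcrossprod(d)
}))

# Leading eigenvector of W^{-1} B gives the discriminant direction a
a <- Re(eigen(solve(W) %*% B)$vectors[, 1])

# Allocate a new observation to the group with the nearest projected mean
x_new <- X[1, ]
which.min(abs(drop(a %*% (x_new - t(means)))))

Up to scaling, this direction coincides with the first linear discriminant produced by the lda command described later.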
When parameters of distributions are unknown In general the problem consists of two parts: • Classification. At this stage the space is divided into regions and each region is assigned to one class. In some sense this means finding a function or a set of inequalities that partitions the space; it is usually done via the probability distribution of each class. This stage can be considered as rule generation. • Discrimination. Once the space has been partitioned, or the rules have been generated, new observations are assigned to classes using these rules. Note that if each observation belongs to exactly one class then the rule is deterministic. There are other kinds of rules, for example fuzzy rules, where each observation has a degree of membership in each class: an observation may belong to class 1 with degree 0.7 and to class 2 with degree 0.3.
Probability of misclassification Let us assume we have g groups (classes). The probability of misclassification p_ij is defined as the probability of putting an observation into class i when it is in fact from class j. In particular, the probability of correct allocation for class i is p_ii and the probability of misclassification for this class is 1 − p_ii. Assume that we have two discriminant rules, d and d'. The rule d is said to be as good as d' if

p_ii ≥ p'_ii for i = 1, ..., g,

and d is better than d' if in addition the inequality is strict for at least one i. If there is no better rule than d then d is called an admissible rule. In general it may not be possible to compare two rules; for example it may happen that p_11 > p'_11 but p_22 < p'_22.
Resampling and misclassification Resubstitution: estimate the discriminant rule and then, for each training observation, check whether it is classified correctly; from these counts calculate the probability of misclassification. The problem with this technique is that it gives, as expected, an optimistic estimate. Jackknife: remove one observation at a time from its class, redefine the discriminant rule and predict the removed observation. The probability of classifying an observation from group 1 into group i is then estimated as n_i1/n_1, where n_1 is the number of observations in the first group and n_i1 is the number of cases in which an observation from group 1 was classified as belonging to group i; similar misclassification probabilities are calculated for each class. Bootstrap: resample the sample of observations. There are several techniques that apply the bootstrap; one of them is described here. First calculate the misclassification probabilities using resubstitution and denote them by e_ai. There are two ways of resampling: resample all observations simultaneously, or resample each group separately (i.e. take a sample of n_1 points from group 1, etc.). Then define the discrimination rule on the bootstrap sample and estimate the probabilities of misclassification both for the bootstrap sample and for the original sample; denote them e_pib and p_ib and calculate the differences d_ib = e_pib − p_ib. Repeat this B times and average; the average d̄ is the bootstrap bias correction. The corrected probability of misclassification is then e_ai − d̄. A leave-one-out sketch in R is given below.
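For the jackknife estimate, MASS::lda can perform the leave-one-out loop itself via its CV argument; the sketch below uses the built-in iris data as an assumed example.

library(MASS)

# Leave-one-out (jackknife) misclassification estimate using lda's CV option
X   <- as.matrix(iris[, 1:4])
grp <- iris$Species

cv  <- lda(X, grouping = grp, CV = TRUE)  # cv$class holds leave-one-out predictions
tab <- table(true = grp, predicted = cv$class)
tab
1 - sum(diag(tab)) / sum(tab)             # overall misclassification probability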
R commands for discriminant analysis Commands for discriminant analysis are in the library MASS, which should be loaded first: library(MASS). The necessary commands are: lda – linear discriminant analysis; using the given observations this command calculates the discrimination lines (hyperplanes). qda – quadratic discriminant analysis; this command calculates the necessary equations and does not assume equality of the variances. predict – for new observations it decides to which class they belong. Example of use: z = lda(data,grouping=groupings) predict(z,newobservations) Similarly for quadratic discriminant analysis: z = qda(data,grouping=groupings) predict(z,newobservations)$class Here data is the data matrix given to us for calculating the discrimination rule; it can be considered as a training data set. groupings defines which observation belongs to which class.
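A short end-to-end demonstration of these commands, using the built-in iris data as an assumed training set (the three "new" observations are simply rows taken from the data for illustration):

library(MASS)

# Training data and class labels
data      <- as.matrix(iris[, 1:4])   # measurements
groupings <- iris$Species             # class of each training observation

z <- lda(data, grouping = groupings)          # linear discriminant rule
newobservations <- data[c(1, 51, 101), ]      # pretend these are new points
predict(z, newobservations)$class

zq <- qda(data, grouping = groupings)         # quadratic discriminant rule
predict(zq, newobservations)$class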