Combining Multiple Classifiers



Presentation Transcript


  1. Combining Multiple Classifiers
  • Pattern recognition: best possible classification rates; increase efficiency & accuracy
  • Multiple classifier systems: improve generalization, robustness, and accuracy

  2. Combining Multiple Classifiers
  • Why?
  • Multiple classifiers are available, but none of them is perfect
  • Multiple types of features can be extracted for a given pattern
  • Certain complementary properties exist among different classifiers and different features
  • Issues
  • How many classifiers are needed?
  • What kind of classifiers should be used?
  • Which features should be used in each classifier?
  • How should the results from different classifiers be combined?

  3. Need for Combination
  • There are a number of different classifiers.
  • Sometimes more than one training set is available.
  • Different classifiers may show strong local differences.
  • Some classifiers give different results with different parameters; one can combine them, thereby taking advantage of all the attempts to learn from the data.
  • The training data may not provide sufficient information for choosing a single best classifier from the hypothesis space.
  • The learning algorithms may not be able to solve the difficult search problems.
  • The hypothesis space may not contain the true classification; instead, it may include several equally good estimates.

  4. Combination Methods
  • For different applications we may have different feature sets, different training sets, different classification methods, or different training sessions, all resulting in a set of classifiers whose outputs may be combined, with the hope of improving the overall classification accuracy.
  • Related terms: hybrid methods, decision combination, multiple experts, mixture of experts, classifier ensembles, cooperative agents, union pool, sensor fusion, and more ...
  • Various combination methods differ from each other in their architectures, the characteristics of the combiner, and the selection of the individual classifiers.
  • Architectures: parallel, serial, hierarchical

  5. Approaches
  • Majority Voting Principle: a pattern is assigned to the class which receives the highest vote from multiple classifiers (see the sketch below).
  • Re-Ranking (Re-Ordering) Approaches: each classifier produces a set of ranked candidates, and the candidates in the union of all the individual sets are re-ranked based on their old ranks in each set.
  • Hierarchical Re-Ranking Approach: all the classifiers are ordered based on their individual performance. A classifier is used for re-ordering only if its predecessors are not `confident` in their ranking.
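
  A minimal sketch of the majority voting principle, assuming a list of fitted classifiers with a scikit-learn-style predict() method and integer class labels 0..M-1 (all names here are illustrative):

```python
import numpy as np

def majority_vote(classifiers, X):
    """Assign each sample to the class receiving the most votes."""
    # Stack each classifier's hard decisions: shape (K, n_samples).
    votes = np.stack([clf.predict(X) for clf in classifiers])
    # For each sample, pick the label occurring most often among the K votes.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```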

  6. More Approaches
  • Bayesian Optimization Techniques: the idea is to minimize the probability of error given all the decisions made by the individual classifiers.
  • Linear Combination Methods: a new decision is made based on a linear combination of the confidence measures of the individual classifiers.
  • Dempster-Shafer Theory: provides a method for combining the contributions of the individual classifiers to give the final result.

  7. Classifier Combination
  According to Bayesian decision theory, given the feature vectors $x_t^1, \dots, x_t^K$ produced by $K$ classifiers for a test pattern $x_t$, the pattern should be assigned to the class $w_c$ whose a posteriori probability is maximum:
  Assign $x_t \to w_c$ if $P(w_c \mid x_t^1, \dots, x_t^K) = \max_i P(w_i \mid x_t^1, \dots, x_t^K)$.
  Let us rewrite the a posteriori probability $P(w_c \mid x_t^1, \dots, x_t^K)$ using Bayes' theorem. We have
  $P(w_c \mid x_t^1, \dots, x_t^K) = \dfrac{p(x_t^1, \dots, x_t^K \mid w_c)\, P(w_c)}{p(x_t^1, \dots, x_t^K)}$,
  where the unconditional joint probability density function can be expressed in terms of conditional probabilities as $p(x_t^1, \dots, x_t^K) = \sum_j p(x_t^1, \dots, x_t^K \mid w_j)\, P(w_j)$.

  8. Product Rule
  • Let us assume that the classifiers are statistically independent, which lets us rewrite the joint probability density function as $p(x_t^1, \dots, x_t^K \mid w_i) = \prod_{k=1}^{K} p(x_t^k \mid w_i)$.
  • Product rule: assign $x_t \to w_c$ if $P(w_c) \prod_{k=1}^{K} p(x_t^k \mid w_c) = \max_i P(w_i) \prod_{k=1}^{K} p(x_t^k \mid w_i)$.
  • In terms of the posterior probabilities the rule can be written as: assign $x_t \to w_c$ if $P(w_c)^{-(K-1)} \prod_{k=1}^{K} P(w_c \mid x_t^k) = \max_i P(w_i)^{-(K-1)} \prod_{k=1}^{K} P(w_i \mid x_t^k)$.
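
  A minimal sketch of the product rule over posterior estimates, assuming each classifier exposes a scikit-learn-style predict_proba() and that the class priors are given as a vector; logs are used for numerical stability:

```python
import numpy as np

def product_rule(classifiers, X, priors):
    """Combine K posterior estimates with the product rule."""
    K = len(classifiers)
    # Shape (K, n_samples, n_classes): P(w_i | x^k) for every classifier k.
    posteriors = np.stack([clf.predict_proba(X) for clf in classifiers])
    eps = 1e-12  # guard against log(0)
    # log of P(w_i)^(1-K) * prod_k P(w_i | x^k)
    log_score = np.log(posteriors + eps).sum(axis=0) + (1 - K) * np.log(priors + eps)
    return log_score.argmax(axis=1)
```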

  9. Min Rule
  We can approximate the product rule with the min rule by bounding the product of posterior probabilities from above: $\prod_{k=1}^{K} P(w_i \mid x_t^k) \le \min_k P(w_i \mid x_t^k)$. We obtain:
  Assign $x_t \to w_c$ if $P(w_c)^{-(K-1)} \min_k P(w_c \mid x_t^k) = \max_i P(w_i)^{-(K-1)} \min_k P(w_i \mid x_t^k)$.
  If we further assume that the prior probabilities are equal, this simplifies to:
  Assign $x_t \to w_c$ if $\min_k P(w_c \mid x_t^k) = \max_i \min_k P(w_i \mid x_t^k)$.

  10. Sum Rule
  • Let us assume that the posterior probabilities can be expressed as $P(w_c \mid x_t^k) = P(w_c)(1 + c_k)$, where $c_k$ satisfies $|c_k| \ll 1$.
  • If we expand the product and neglect terms of second and higher order, we can approximate the right-hand side as $\prod_{k=1}^{K} P(w_c \mid x_t^k) \approx P(w_c)^K \left( 1 + \sum_{k=1}^{K} c_k \right)$.
  • Substituting into the product rule, this simplifies to the sum rule:
  Assign $x_t \to w_c$ if $(1 - K)\, P(w_c) + \sum_{k=1}^{K} P(w_c \mid x_t^k) = \max_i \left[ (1 - K)\, P(w_i) + \sum_{k=1}^{K} P(w_i \mid x_t^k) \right]$.

  11. Max Rule
  We can approximate the sum rule by the maximum of the posterior probabilities, since $\sum_{k=1}^{K} P(w_i \mid x_t^k) \le K \max_k P(w_i \mid x_t^k)$:
  Assign $x_t \to w_c$ if $(1 - K)\, P(w_c) + K \max_k P(w_c \mid x_t^k) = \max_i \left[ (1 - K)\, P(w_i) + K \max_k P(w_i \mid x_t^k) \right]$.
  If we further assume that the prior probabilities are equal, this simplifies to:
  Assign $x_t \to w_c$ if $\max_k P(w_c \mid x_t^k) = \max_i \max_k P(w_i \mid x_t^k)$.

  12. Mean Rule
  If we assume equal prior probabilities, the sum rule can be viewed as computing the average posterior probability for each class over all the classifier outputs:
  Assign $x_t \to w_c$ if $\sum_{k=1}^{K} P(w_c \mid x_t^k) = \max_i \sum_{k=1}^{K} P(w_i \mid x_t^k)$, i.e.,
  Assign $x_t \to w_c$ if $\frac{1}{K} \sum_{k=1}^{K} P(w_c \mid x_t^k) = \max_i \frac{1}{K} \sum_{k=1}^{K} P(w_i \mid x_t^k)$.
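
  Under equal priors, the min, max, and sum/mean rules of slides 9-12 reduce to simple reductions over the stack of posterior estimates; a minimal sketch:

```python
import numpy as np

def combine(posteriors, rule="mean"):
    """posteriors: array of shape (K, n_samples, n_classes)."""
    if rule == "min":
        score = posteriors.min(axis=0)   # min_k P(w_i | x^k)
    elif rule == "max":
        score = posteriors.max(axis=0)   # max_k P(w_i | x^k)
    else:                                # "sum" and "mean" rank identically
        score = posteriors.mean(axis=0)  # (1/K) sum_k P(w_i | x^k)
    return score.argmax(axis=1)
```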

  13. Majority Vote Rule
  Let us force the posterior probabilities to produce a binary-valued function: $c_{ik} = 1$ if $P(w_i \mid x_t^k) = \max_j P(w_j \mid x_t^k)$, and $0$ otherwise. This turns the combined decision outputs into class labels rather than posterior probabilities. If we further assume that the prior probabilities are equal, we find:
  Assign $x_t \to w_c$ if $\sum_{k=1}^{K} c_{ck} = \max_i \sum_{k=1}^{K} c_{ik}$.
  Note that for each class $w_i$ the sum on the right-hand side is the count of the votes received from the individual classifiers. The class which receives the largest number of votes is then selected as the majority decision.

  14. Error Sensitivity
  • Suppose each posterior is estimated with an error, $\hat{P}(w_i \mid x_t^k) = P(w_i \mid x_t^k) + e_{ik}$, and assume that $e_{ik} \ll P(w_i \mid x_t^k)$ and $P(w_i \mid x_t^k) > 0$.
  • Product rule error factor: to first order, $\prod_k \hat{P}(w_i \mid x_t^k) \approx \prod_k P(w_i \mid x_t^k) \left( 1 + \sum_k \frac{e_{ik}}{P(w_i \mid x_t^k)} \right)$, so the error is amplified whenever any posterior is small.
  • Sum rule error factor: $\sum_k \hat{P}(w_i \mid x_t^k) = \sum_k P(w_i \mid x_t^k) \left( 1 + \frac{\sum_k e_{ik}}{\sum_k P(w_i \mid x_t^k)} \right)$, so the errors are averaged rather than amplified.
  • The sum decision rule is therefore much more resilient to estimation errors.
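
  The effect can be checked numerically: the sketch below perturbs a set of true posteriors with additive noise $e_{ik}$ and compares product-rule and sum-rule decisions against the noise-free decision. The noise level of 0.05 and the Dirichlet-generated posteriors are illustrative choices, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, M = 5, 10_000, 3                              # classifiers, samples, classes
true_post = rng.dirichlet(np.ones(M), size=n)       # true P(w_i | x), rows sum to 1
truth = true_post.argmax(axis=1)                    # noise-free Bayes decision
# Each of the K classifiers sees the true posterior plus estimation noise e_ik.
noisy = true_post[None] + 0.05 * rng.normal(size=(K, n, M))
noisy = np.clip(noisy, 1e-6, None)

prod_pred = np.log(noisy).sum(axis=0).argmax(axis=1)  # product rule (equal priors)
sum_pred = noisy.sum(axis=0).argmax(axis=1)           # sum rule (equal priors)
print("product rule agreement:", (prod_pred == truth).mean())
print("sum rule agreement:   ", (sum_pred == truth).mean())
```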

  15. Performance Measures
  • Performance: the ratio of correctly classified samples.
  • Reliability: the probability of correct classification for a given class.

  16. Performance Measures
  • Class performance: the ratio of correct classifications to the sample size.
  • Probability performance: based on the distances between the posterior probabilities $p'_t$ of the classification result and the true classification probabilities $p_t$.

  17. Performance Measures
  • Overall classification performance: combines the products of $\text{performance}_i$ and $\text{reliability}_i$, using the sample count of the corresponding class $w_i$ as a weight.
  • Sum of squared errors based on probabilities: the sum of squared differences between the posterior probabilities of the classification result and the true classification probabilities.

  18. Performance Measures
  • Distance of probabilities: the Euclidean distance between the posterior probabilities of the classification result and the true classification probabilities.
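
  A sketch of these measures, assuming $\text{performance}_i$ is read as the per-class recall and $\text{reliability}_i$ as the per-class precision (the slides do not pin the definitions down), and that the probability-based measures compare estimated and true posterior vectors:

```python
import numpy as np

def class_measures(y_true, y_pred, n_classes):
    """Per-class performance (recall) and reliability (precision)."""
    perf, rel = [], []
    for i in range(n_classes):
        is_i, said_i = y_true == i, y_pred == i
        correct = (is_i & said_i).sum()
        perf.append(correct / max(is_i.sum(), 1))    # performance_i
        rel.append(correct / max(said_i.sum(), 1))   # reliability_i
    return np.array(perf), np.array(rel)

def probability_measures(p_hat, p_true):
    """p_hat, p_true: (n_samples, n_classes) posterior matrices."""
    sse = ((p_hat - p_true) ** 2).sum()                   # sum of squared errors
    dist = np.linalg.norm(p_hat - p_true, axis=1).mean()  # mean Euclidean distance
    return sse, dist
```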

  19. Class based combination
  The classifiers' class assignments are used for the combination. The classifiers are forced to produce a binary-valued function $c_{ik}$ from the posterior probabilities: $c_{ik} = 1$ if $P(w_i \mid x_t^k) = \max_j P(w_j \mid x_t^k)$, and $0$ otherwise.
  Assign $x_t \to w_c$ if $\sum_{k=1}^{K} c_{ck} = \max_i \sum_{k=1}^{K} c_{ik}$ (simple vote), or
  Assign $x_t \to w_c$ if $\sum_{k=1}^{K} \omega_k\, c_{ck} = \max_i \sum_{k=1}^{K} \omega_k\, c_{ik}$ (weighted vote with classifier weights $\omega_k$).

  20. Probability based combination
  We use the posterior probabilities of the classifiers to carry out the combination.
  • Assign $x_t \to w_c$ if $\sum_{k=1}^{K} P(w_c \mid x_t^k) = \max_i \sum_{k=1}^{K} P(w_i \mid x_t^k)$, or
  • Assign $x_t \to w_c$ if $\sum_{k=1}^{K} \omega_k\, P(w_c \mid x_t^k) = \max_i \sum_{k=1}^{K} \omega_k\, P(w_i \mid x_t^k)$ with classifier weights $\omega_k$.

  21. Combined class and probability based combination
  The class assignments used in class based combination can be converted to posterior probabilities, allowing the class based and probability based schemes to be merged.

  22. Combined class and probability based combination
  • Assign $x_t \to w_c$ if $\sum_{k=1}^{K} \omega_k\, c_{ck}\, P(w_c \mid x_t^k) = \max_i \sum_{k=1}^{K} \omega_k\, c_{ik}\, P(w_i \mid x_t^k)$.

  23. Combined class and probability based combination
  • Similarly, we can integrate the reliability of each classifier for the assigned class:
  • Assign $x_t \to w_c$ if $\sum_{k=1}^{K} \omega_k\, \text{reliability}_{k,c}\, c_{ck}\, P(w_c \mid x_t^k) = \max_i \sum_{k=1}^{K} \omega_k\, \text{reliability}_{k,i}\, c_{ik}\, P(w_i \mid x_t^k)$.
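
  The three schemes of slides 19-23 can be written as one combiner. Since the decision rules above are reconstructions of the missing slide equations, this sketch follows that reading; posteriors has shape (K, n_samples, n_classes) and weights is the $\omega_k$ vector:

```python
import numpy as np

def weighted_combination(posteriors, weights, mode="combined"):
    """Class-, probability-, or combined-based weighted combination."""
    # Harden each classifier's output: c_ik = 1 where its posterior peaks.
    votes = (posteriors == posteriors.max(axis=2, keepdims=True)).astype(float)
    w = weights[:, None, None]                  # broadcast omega_k over samples
    if mode == "class":                         # weighted vote on hardened outputs
        score = (w * votes).sum(axis=0)
    elif mode == "probability":                 # weighted sum of posteriors
        score = (w * posteriors).sum(axis=0)
    else:                                       # combined: vote-masked posteriors
        score = (w * votes * posteriors).sum(axis=0)
    return score.argmax(axis=1)
```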

  24. Weight assignment for combination
  • Equal weights: $\omega_k = 1/K$.
  • Normalized overall performances are assigned as weights: $\omega_k = \text{performance}_k / \sum_{j=1}^{K} \text{performance}_j$.

  25. Weight assignment for combination
  Another proposal is to assign weights using a linear fit on the posterior probabilities of leave-one-out results.
  • The least-squares fit parameters for the training data set are used as the weights of the classifiers in the combination.
  • The reliability of each classifier for the assigned class can be integrated as well.
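
  A sketch of the least-squares weight fitting, assuming the leave-one-out posteriors are available as an array; the (K, n_samples, n_classes) layout and the target of 1 for the true-class posterior are our reading of the slide:

```python
import numpy as np

def fit_weights(loo_posteriors, y_true):
    """loo_posteriors: (K, n_samples, n_classes) leave-one-out estimates."""
    K, n, _ = loo_posteriors.shape
    # A[t, k] = classifier k's posterior for the true class of sample t;
    # the ideal combined posterior for the true class is 1.
    A = loo_posteriors[:, np.arange(n), y_true].T   # shape (n, K)
    w, *_ = np.linalg.lstsq(A, np.ones(n), rcond=None)
    return w
```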

  26. Result file

  27. Classifier set
  • KMClus: K-means clustering with maximum iterations=10; maximum error=0.5.
  • SOM: Self-organizing map clustering with iterations=1000; learning rate=1.
  • FANN: Fuzzy neural network classifier with fuzzification level=3; fuzzification type=0; number of hidden layer units=25; learning rate=0.001; maximum iterations=1000; minimum error=0.02.
  • ANN: Artificial neural network classifier with number of hidden layer units=25; learning rate=0.001; maximum iterations=1000; minimum error=0.02.

  28. Classifier set
  • KMClas: K-means classifier.
  • Parzen: Parzen classifier with alpha=1.
  • KNN: K-nearest neighbour classifier with k=3.
  • PQD: Piecewise quadratic distance classifier.
  • PLD: Piecewise linear distance classifier.
  • SVC: Support vector machine using a radial basis kernel with p=1.
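
  Rough scikit-learn counterparts for part of this classifier set can be assembled as below. The mapping is approximate: LDA/QDA stand in for the piecewise linear/quadratic distance classifiers, and SOM, FANN, and the Parzen classifier have no direct scikit-learn equivalent:

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

classifiers = {
    "KMClus": KMeans(n_clusters=2, max_iter=10),          # clustering, not a classifier
    "ANN": MLPClassifier(hidden_layer_sizes=(25,),
                         learning_rate_init=0.001, max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "PQD": QuadraticDiscriminantAnalysis(),               # approximates PQD
    "PLD": LinearDiscriminantAnalysis(),                  # approximates PLD
    "SVC": SVC(kernel="rbf", probability=True),
}
```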

  29. Data sets (format: feature dimension, followed by per-class sample sizes in parentheses)
  • BIO: Carriers of a rare genetic disorder. 5(127+67)
  • DIB: Pima Indians Diabetes. 8(500+268)
  • D10: Duin 10-dimensional distribution. 10(100+100)
  • GID: Glass Identification. 9(70+76+17+13+9+29)
  • IMX: IMOX IEEE data file of letters. 8(48+48+48+48)

  30. Data sets
  • SMR: Sonar. 60(97+111)
  • 2SD: Two spirals, two-dimensional. 2(97+97)
  • WQD: Wine quality. 13(59+71+48)
  • 80X: IEEE 80X data set. 8(15+15+15)
  • ZMM: 6 Zernike moments of 8 characters. 6(12+12+12+12+12+12+12+12)

  31. Data sets
  • BEM: Equal means but different variances (20% Bayes error). 2(100+100)
  • BEV: Different means but equal variances (20% Bayes error). 2(100+100)
  • HRD: Highleyman random patterns. 2(100+100)
  • IFD: Classical Fisher's iris flowers. 4(50+50+50)

  32. Time performance of classifiers

  33. Performance of classifiers on data sets

  34. Performance of boosting a classifier using weighted combination

  35. Increasing the learning performance (kernels)
  • linear: linear
  • poly3: third-degree polynomial
  • rbf: radial basis with unit width
  • erbf: radial basis with a unit width and square root of distances
  • sigmoid: sigmoid with scale one and no offset
  • fourier: Fourier with zero degree
  • spline: spline
  • bspline: third-degree B-spline
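
  scikit-learn does not ship the erbf, fourier, spline, or bspline kernels, but SVC accepts a callable returning the Gram matrix, so a kernel such as erbf (read here as the exponential of the negative square root of the distance, per the description above) can be sketched directly:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import euclidean_distances

def erbf_kernel(X, Y):
    # exp(-sqrt(distance)) with unit width, per the slide's description.
    return np.exp(-np.sqrt(euclidean_distances(X, Y)))

svc = SVC(kernel=erbf_kernel)  # then svc.fit(X_train, y_train) as usual
```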

  36. Increasing the learning performance

  37. Performance of boosting a classifier's learning performance

  38. Performance of class based classifier combination

  39. Performance of probability based classifier combination

  40. Performance of combined classifier combination

  41. Sensitivity Analysis
  • Removal of the worst classifiers
  • Removal of the best classifier
  • Best classifier subset
  • Incremental classifier addition

  42. Class based combination - No Clustering

  43. Probability based combination - No Clustering

  44. Combined combination - No Clustering

  45. Class based combination - No K-NN

  46. Probability based combination - No K-NN

  47. Combined combination - No K-NN

  48. Sum of Squared Errors on probabilities of the classifier set
