
AMCS/CS 340: Data Mining


Presentation Transcript


  1. Classification: SVM AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

  2. Classification Techniques: Decision Tree based Methods, Rule-based Methods, Learning from Neighbors, Bayesian Classification, Neural Networks, Support Vector Machines, Ensemble Methods

  3. Support Vector Machines • Find a linear hyperplane (decision boundary) that separates the data

  4. Support Vector Machines: One Possible Solution

  5. Support Vector Machines: Another Possible Solution

  6. Support Vector Machines: Other Possible Solutions

  7. Support Vector Machines • Which one is better, B1 or B2? • How do you define "better"?

  8. SVM: Margins and Support Vectors • Find the hyperplane that maximizes the margin, so B1 is better than B2 • Support vectors are the data points that the margin pushes up against

  9. Support Vector Machines • SVM finds this hyperplane (decision boundary) using support vectors ("essential" training tuples) and margins (defined by the support vectors) • Introduced by Vapnik and colleagues (1992), building on Vapnik and Chervonenkis' statistical learning theory from the 1960s • Used for both classification and prediction • Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization) • Applications: handwritten digit recognition, object recognition, speaker identification, benchmark time-series prediction tests

  10. Margin and hyperplane • Separating hyperplane: w·x + b = 0, where w is a weight vector and b is a scalar (bias) • Sides of the margin: w·x + b = 1 and w·x + b = -1, so the margin width is 2/||w|| • Margin classifier: f(x) = sign(w·x + b)

  11. Support Vector Machines • We want to maximize the margin 2/||w|| • Which is equivalent to minimizing ||w||²/2 • But subject to the constraints y_i(w·x_i + b) ≥ 1 for every training example (x_i, y_i) • This is a constrained optimization problem: a quadratic objective function with linear constraints, i.e. Quadratic Programming (QP), handled via Lagrangian multipliers (a small numerical check follows below)
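A small numerical check of the margin claims on this slide, assuming scikit-learn and NumPy (tools not mentioned on the slides): fit a nearly hard-margin linear SVM on toy separable data, print the margin width 2/||w||, and confirm that the support vectors satisfy y_i(w·x_i + b) = 1.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs (toy data for illustration only)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
# Support vectors lie on the margin hyperplanes, so y_i (w.x_i + b) should be ~1
print("y_i (w.x_i + b) on support vectors:",
      y[clf.support_] * (X[clf.support_] @ w + b))
```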

  12. SVM: not linearly separable • What if the problem is not linearly separable? • Introduce slack variables ξ_i ≥ 0 (soft margin, allowing training errors); ξ_i > 0 when x_i is not on the correct side of its margin • Modify the objective function to ||w||²/2 + C Σ_i ξ_i, where C is a cost parameter that can be chosen by cross-validation (see the sketch below)
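A minimal sketch of choosing C by cross-validation, assuming scikit-learn (the slide does not prescribe a tool); the candidate grid and data set are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Try a few cost values and keep the one with the best cross-validated accuracy
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"],
      "cross-validated accuracy:", round(search.best_score_, 3))
```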

  13. Nonlinear SVM • What if the decision boundary is not linear?

  14. Nonlinear SVM • Transform data into higher dimensional space

  15. SVM optimization (mapping) • Linear case: minimize ||w||²/2 subject to y_i(w·x_i + b) ≥ 1 • Nonlinear separation: map x into a higher dimensional feature space through φ(x), e.g. φ(x) = (x1², √2 x1x2, x2²) for 2-D inputs • Then minimize ||w||²/2 subject to y_i(w·φ(x_i) + b) ≥ 1 (a toy illustration follows below)
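A toy illustration of the mapping idea, assuming scikit-learn and NumPy and using the hand-crafted extra feature x1² + x2² as a stand-in for a generic φ: a class defined by a circle is not linearly separable in 2-D, but becomes separable after the mapping.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # label by distance from origin

# Linear SVM in the original 2-D space cannot fit a circular boundary
linear_acc = LinearSVC(dual=False).fit(X, y).score(X, y)

# Map to 3-D: phi(x) = (x1, x2, x1^2 + x2^2); a plane now separates the classes
phi_X = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
mapped_acc = LinearSVC(dual=False).fit(phi_X, y).score(phi_X, y)

print("linear SVM, original space:", round(linear_acc, 3))
print("linear SVM, mapped space:  ", round(mapped_acc, 3))  # close to 1.0
```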

  16. Lagrange function • Minimize ||w||²/2 subject to y_i(w·x_i + b) ≥ 1 • Lagrange function (generalized Lagrange multipliers α_i ≥ 0 for the inequality constraints): L(w, b, α) = ||w||²/2 - Σ_i α_i [y_i(w·x_i + b) - 1] • Weak duality: minimizing the Lagrange function over (w, b) provides a lower bound on the optimal value of the original problem for any α ≥ 0 • New optimization problem: max over α ≥ 0 of min over (w, b) of L(w, b, α)

  17. Dual form • Find the solution of max_{α ≥ 0} min_{w,b} L(w, b, α) • The optimality conditions ∂L/∂w = 0 and ∂L/∂b = 0 yield w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 • Equivalently, maximize in α: W(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j), subject to α_i ≥ 0 and Σ_i α_i y_i = 0 (this convex quadratic optimization problem can be solved in its dual form)

  18. Dual form with soft margin • Find the solution of max_α min_{w,b,ξ} L(w, b, ξ, α, μ) • The optimality conditions ∂L/∂w = 0, ∂L/∂b = 0 and ∂L/∂ξ_i = 0 yield w = Σ_i α_i y_i x_i, Σ_i α_i y_i = 0 and α_i + μ_i = C • Equivalently, maximize in α: W(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0 (the same convex QP, now with the box constraint 0 ≤ α_i ≤ C)

  19. How to find α? • Quadratic Programming: maximize W(α) = 1ᵀα - ½ αᵀQα, or equivalently minimize ½ αᵀQα - 1ᵀα, subject to 0 ≤ α_i ≤ C and yᵀα = 0, where Q is an N×N matrix with Q_ij = y_i y_j (x_i·x_j) that depends on the training inputs x and labels y; this problem is called quadratic programming • There exist algorithms for finding the constrained quadratic optima: projected conjugate gradients (Burges, 1998), decomposition methods (Osuna et al., 1996), sequential minimal optimization (Platt, 1999)

  20. After the solution of α • Points with α_i > 0 are the support vectors; they lie on one of the margin hyperplanes • All other points have α_i = 0 and do not influence the classifier • Predict the class of a given point X as f(X) = sign(Σ_i α_i y_i (x_i·X) + b), with b obtained from any support vector via y_k(w·x_k + b) = 1 (a worked sketch follows below)
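A worked sketch of slides 19 and 20, assuming NumPy and the cvxopt QP solver (neither is mentioned on the slides): solve the soft-margin dual for α on toy data, identify the support vectors, recover w and b, and predict. The threshold on α and the value of C are illustrative choices.

```python
import numpy as np
from cvxopt import matrix, solvers

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])
n, C = len(y), 1.0

# Dual QP: minimize 1/2 a^T Q a - 1^T a  s.t.  0 <= a_i <= C,  y^T a = 0
P = matrix(np.outer(y, y) * (X @ X.T))            # Q_ij = y_i y_j (x_i . x_j)
q = matrix(-np.ones(n))
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # box constraints as Ga <= h
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
A = matrix(y.reshape(1, -1))                      # equality constraint y^T a = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

sv = alpha > 1e-6                     # support vectors have alpha_i > 0
w = (alpha * y) @ X                   # w = sum_i alpha_i y_i x_i
b0 = np.mean(y[sv] - X[sv] @ w)       # b from points on the margin
predict = lambda Z: np.sign(Z @ w + b0)
print("support vectors:", sv.sum(),
      " training accuracy:", np.mean(predict(X) == y))
```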

  21. SVM: kernel functions • Apply a kernel function to the original data x for nonlinear separation • Training and prediction only require the dot products φ(x_i)·φ(x_j); do we need to explicitly know the definition of φ? • NO, use a kernel function K(x_i, x_j) = φ(x_i)·φ(x_j) instead!

  22. SVM: kernel functions • Learning the classifier: maximize W(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) • Making predictions: f(X) = sign(Σ_i α_i y_i K(x_i, X) + b) • Typical kernel functions: polynomial K(x, z) = (x·z + 1)^d, Gaussian (RBF) K(x, z) = exp(-||x - z||² / (2σ²)), sigmoid K(x, z) = tanh(κ x·z - δ) (computed in the sketch below)
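For concreteness, a minimal sketch computing kernel matrices for the kernels listed above, assuming scikit-learn's pairwise kernel helpers; the gamma, degree and coef0 values are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

X = np.random.RandomState(0).randn(5, 3)

K_lin = linear_kernel(X)                            # x_i . x_j
K_poly = polynomial_kernel(X, degree=3, coef0=1)    # (gamma * x_i.x_j + coef0)^degree
K_rbf = rbf_kernel(X, gamma=0.5)                    # exp(-gamma * ||x_i - x_j||^2)
print(K_lin.shape, K_poly.shape, K_rbf.shape)       # each is a 5 x 5 kernel matrix
```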

  23. Example of kernel functions • Example of a polynomial kernel: K(x, z) = (x·z)^d • If d = 2 and x, z are 2-D, K(x, z) = (x1 z1 + x2 z2)² = x1² z1² + 2 x1 z1 x2 z2 + x2² z2² = φ(x)·φ(z) with φ(x) = (x1², √2 x1 x2, x2²), so the kernel computes a dot product in the mapped space without ever forming φ (verified numerically below)
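A quick NumPy check of the d = 2 example above: the kernel value (x·z)² matches the dot product of the explicitly mapped vectors φ(x) and φ(z).

```python
import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel on 2-D inputs
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z) ** 2)        # kernel value: (1*3 + 2*(-1))^2 = 1
print(phi(x) @ phi(z))     # same value via the explicit mapping
```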

  24. Multi-class SVMs • One-versus-all • Train k binary classifiers, one for each class against all other classes. • Predicted class is the class of the most confident classifier • One-versus-one • Train k(k-1)/2 classifiers, each discriminating between a pair of classes • Several strategies for selecting the final classification based on the output of the binary SVMs • Truly Multi-class SVMs • Generalize the SVM formulation to multiple categories
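A minimal sketch of the first two strategies, assuming scikit-learn's meta-estimators (one possible implementation, not the only one); it simply counts how many binary SVMs each strategy trains for k = 10 classes.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)        # k = 10 classes

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print("one-versus-all classifiers:", len(ovr.estimators_))   # k = 10
print("one-versus-one classifiers:", len(ovo.estimators_))   # k(k-1)/2 = 45
```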

  25. Why is SVM effective on high dimensional data? • The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data • The support vectors are the essential or critical training examples: they lie closest to the decision boundary • If all other training examples were removed and training repeated, the same separating hyperplane would be found • The number of support vectors can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality • Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

  26. What You Should Know • Linear SVMs • The definition of a maximum margin classifier • What QP can do for you (you need not know how it does it) • How maximum margin can be turned into a QP problem • How we deal with noisy (non-separable) data • How we permit non-linear boundaries • How SVM kernel functions let us pretend we are working with ultra-high-dimensional basis function terms

  27. Open issues of SVM • Speeding up the quadratic programming training method (both the time complexity and the storage requirements grow as the training data grow) • The choice of kernel function: there are no general guidelines

  28. SVM Related Links • SVM website: http://www.kernel-machines.org/ • Tutorial: C. J. C. Burges, 1998. A Tutorial on Support Vector Machines for Pattern Recognition. • Representative implementations • LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ an efficient implementation of SVM with multi-class classification, nu-SVM (soft margin), one-class SVM and regression, including various interfaces for Java, Python, etc. • SVM-light: http://www.cs.cornell.edu/People/tj/svm_light/ simpler, but supports only binary classification and only the C language • More: http://www.kernel-machines.org/software

  29. Classification Techniques: Decision Tree based Methods, Rule-based Methods, Learning from Neighbors, Bayesian Classification, Neural Networks, Support Vector Machines, Ensemble Methods

  30. Ensemble Methods • How can we improve the performance of models for classification and regression? • Combine multiple models: make predictions and discriminations by combining multiple models instead of using a single model • Two heads are better than one. (Arabic: "Two are better than one; one hand cannot clap." Chinese: "Three cobblers with their wits combined match Zhuge Liang." Spanish: "Four eyes see more than two.")

  31. Ensemble Methods • Construct a set of classifiers from the training data • Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers • Advantage: often improves predictive performance • Disadvantage: usually produces output that is very hard to analyze • However, there are approaches that aim to produce a single comprehensive structure

  32. Win a challenge • KDD Cup 2009, classification of mobile customers, 10,000 euros • Difficulties: • Large number of training examples: 50,000 • Large number of features: 15,000 • Large number of missing values: 60% • Unbalanced class proportions: fewer than 10% of the examples belong to the positive class • Winners: • IBM Research: an ensemble of a wide variety of classifiers • ID Analytics: boosting decision trees and bagging • David Slate & Peter Frey: an ensemble of decision trees • University of Melbourne: boosting with classification trees • Financial Engineering Group: gradient tree classifier boosting • National Taiwan University: AdaBoost with trees

  33. Main types of Ensemble Methods: combining multiple models together • Bagging: make the classification by voting over a collection of classifiers • Boosting: train multiple models in sequence • Decision trees: different models are responsible for making predictions in different regions of the input space

  34. Bagging • Construct a set of classifiers from the training data • Predict class label of previously unseen records by aggregating predictions made by multiple classifiers

  35. Bagging • Sampling with replacement (bootstrap): each example has probability 1 - (1 - 1/n)^n ≈ 0.632 of being selected into a given bootstrap sample • Build a classifier M_i on each bootstrap sample • Classification of an unknown sample x: each classifier M_i returns its class prediction, and the bagged classifier M* counts the votes and assigns the class with the most votes to x • Accuracy: often significantly better than a single classifier derived from D • For noisy data: not considerably worse, and more robust (see the sketch below)
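A minimal sketch, assuming NumPy and scikit-learn: the first part checks the bootstrap fact quoted above (a given example appears in a bootstrap sample with probability about 0.632 for large n); the second bags decision trees and compares them with a single tree on synthetic noisy data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Empirical check of the bootstrap selection probability 1 - (1 - 1/n)^n
n, rng = 1000, np.random.RandomState(0)
frac = np.mean([len(np.unique(rng.randint(0, n, n))) / n for _ in range(100)])
print("fraction of distinct examples per bootstrap:", round(frac, 3))  # ~0.632

# Bagged trees vs. a single tree on noisy synthetic data
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
single = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
bagged = cross_val_score(BaggingClassifier(DecisionTreeClassifier(),
                                           n_estimators=50, random_state=0),
                         X, y, cv=5)
print("single tree:", round(single.mean(), 3),
      " bagged trees:", round(bagged.mean(), 3))
```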

  36. Suppose there are 25 base classifiers • Each classifier has error rate ε = 0.35 • Assume the classifiers are independent • The ensemble classifier makes a wrong prediction only when the majority vote of the base classifiers is wrong, i.e. when at least 13 of the 25 err: Σ_{i=13..25} C(25, i) ε^i (1 - ε)^(25-i) ≈ 0.06 • This is the reduced error rate (checked numerically below)
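The sum above can be evaluated directly; a short check assuming SciPy is available:

```python
from scipy.stats import binom

# 25 independent base classifiers, each with error rate 0.35;
# the majority vote errs only if at least 13 of them are wrong.
eps, n = 0.35, 25
ensemble_error = sum(binom.pmf(i, n, eps) for i in range(13, n + 1))
print(round(ensemble_error, 3))   # ~0.06, versus 0.35 for a single classifier
```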

  37. Main types of Ensemble Methods: combining multiple models together • Bagging: make the classification by voting over a collection of classifiers • Boosting: train multiple models in sequence • Decision trees: different models are responsible for making predictions in different regions of the input space

  38. Boosting • An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records • Initially, all n records are assigned equal weights w = 1/n • For t = 1, 2, …, M, do: obtain a classifier y_t(x) under the current weights {w_n^(t)}; calculate the error of y_t(x) and re-weight the examples based on the errors, giving {w_n^(t+1)} • Output a weighted sum of all the classifiers, where the weight of each classifier's vote is a function of its accuracy

  39. Boosting example • Records that are wrongly classified will have their weights increased; records that are classified correctly will have their weights decreased • Example 4 is hard to classify • Its weight is increased, so it is more likely to be chosen again in subsequent rounds

  40. Boosting vs Bagging • Committees/Bagging: base classifiers are trained in parallel on samples of the data set • Boosting: base classifiers are trained in sequence using a weighted form of the data set • The weighting coefficient of each data point depends on the performance of the previous classifiers • Misclassified points are given greater weight when used to train the next classifier in the sequence • Boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data

  41. AdaBoost (Freund and Schapire, 1997) • AdaBoost (adaptive boosting): a popular boosting algorithm • Given a set of N class-labeled examples, D = {(X1, y1), …, (XN, yN)} • Initially, all tuple weights are set to the same value (1/N) • Generate k classifiers in k rounds. At round i: • Examples from D are sampled (with replacement) to form a training set Di of the same size • Each example's chance of being selected is based on its weight • A classification model Mi is derived from Di • The error rate of Mi is calculated using Di as a test set • The weights of the training examples are adjusted depending on how they were classified • Correctly classified: decrease weight • Incorrectly classified: increase weight

  42. Example: AdaBoost • Base classifiers: M1, M2, …, MT • Error rate of a classifier M_j: ε_j = Σ_i w_i · δ(M_j(x_i) ≠ y_i), the weighted fraction of misclassified training examples • Importance of a classifier M_j: α_j = ½ ln((1 - ε_j) / ε_j)

  43. Example: AdaBoost • Update the weight w_i of example x_i from round j to round j+1: w_i^(j+1) = (w_i^(j) / Z_j) · exp(-α_j) if M_j classifies x_i correctly, and w_i^(j+1) = (w_i^(j) / Z_j) · exp(α_j) if M_j misclassifies x_i, where Z_j is a normalization factor • If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/N and the resampling procedure is repeated • Classification: M*(x) = argmax_y Σ_j α_j · δ(M_j(x) = y) (a from-scratch sketch follows below)
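A from-scratch sketch of the AdaBoost updates on slides 41-43, assuming NumPy and scikit-learn decision stumps as the weak learners; it reweights examples rather than resampling them, which is a common equivalent variant of the procedure described on the slides. Labels are mapped to ±1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y - 1                                   # map labels {0,1} -> {-1,+1}
N, T = len(y), 20
w = np.full(N, 1.0 / N)                         # initial weights 1/N
stumps, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = np.sum(w * (pred != y))               # weighted error rate
    if eps > 0.5:                               # revert weights, as on slide 43
        w = np.full(N, 1.0 / N)
        continue
    eps = max(eps, 1e-10)                       # numerical guard for a perfect stump
    alpha = 0.5 * np.log((1 - eps) / eps)       # importance of this classifier
    w = w * np.exp(-alpha * y * pred)           # decrease if correct, increase if wrong
    w = w / w.sum()                             # normalization factor Z_j
    stumps.append(stump)
    alphas.append(alpha)

# Final classification: sign of the alpha-weighted vote of the base classifiers
F = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(F == y))
```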

  44. Illustrating AdaBoost • One-dimensional input data: ten points along the x axis with labels + + + - - - - - + + • Base classifiers: decision trees of height two, with one split (decision stumps) • Maximal attainable accuracy: 80%

  45. Illustrating AdaBoost • Data points for training and the initial weights for each data point • (figure: after the first round, the example weights shown are 0.0625 and 0.25, and the classifier weight is 0.6931) • MATLAB illustration: http://www.lri.fr/~xlzhang/KAUST/CS340_slides/adaboost-illustration.m

  46. Illustrating AdaBoost • (figure: example weights and classifier weights over three boosting rounds; the values shown are 0.0625, 0.0625, 0.25 with 0.6931; 0.1667, 0.0385, 0.1538 with 0.7332; and 0.1032, 0.1, 0.0952 with 0.7175)

  47. Main types of Ensemble Methods: combining multiple models together • Bagging: make the classification by voting over a collection of classifiers • Boosting: train multiple models in sequence • Decision trees: different models are responsible for making predictions in different regions of the input space

  48. Random Forests • An ensemble of decision trees • Input set: N tuples, M attributes • Each tree is learned on a reduced training set: • Sample the training data with replacement • Randomly select m << M attributes and keep only those • The best split on these m attributes is used to split the node • m is held constant while the forest grows • Bagging with decision trees is a special case of random forests with m = M

  49. Random Forests

  50. Random Forests Algorithm • Good accuracy without over-fitting, but interpretability decreases • Fast algorithm (can be faster than growing/pruning a single tree); easily parallelized • Handles high dimensional data without much problem • Only one main tuning parameter, mtry (commonly set to √M), and results are usually not sensitive to it (see the sketch below)
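A minimal usage sketch, assuming scikit-learn: each tree is grown on a bootstrap sample and considers only about √M randomly chosen attributes at each split (max_features="sqrt").

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=25, n_informative=8,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200,      # number of trees
                                max_features="sqrt",   # mtry = sqrt(M) attributes per split
                                bootstrap=True,        # sample training data with replacement
                                n_jobs=-1, random_state=0)
print("cross-validated accuracy:",
      cross_val_score(forest, X, y, cv=5).mean().round(3))
```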
