  1. Kernel Matching Reduction Algorithms for Classification Jianwu Li and Xiaocheng Deng Beijing Institute of Technology

  2. Introduction • Kernel-based pattern classification techniques • Support vector machines (SVM) • Kernel linear discriminant analysis (KLDA) • Kernel Perceptrons • ……

  3. Introduction • Support vector machines (SVM) • Structural risk minimization (SRM) • Maximum margin classification • Quadratic optimization problem • Kernel trick

  4. Introduction • Support vector machines (SVM) • Support vectors (SV) • Sparse solutions

  5. Introduction • Kernel matching pursuit (KMP) • KMP sequentially appends functions, drawn from a redundant dictionary, to an initially empty basis in order to approximate a classification function under a certain loss criterion. • KMP can produce much sparser models than SVMs.

  6. Introduction • Kernel Matching Reduction Algorithms (KMRAs) • Inspired by KMP and SVMs, we propose kernel matching reduction algorithms (KMRAs). • Unlike KMP, the KMRAs proposed in this paper perform the reverse procedure.

  7. Introduction • Kernel Matching Reduction Algorithms (KMRAs) • Firstly, all training examples are selected to construct a function dictionary. • Then the function dictionary is reduced iteratively by linear support vector machines (SVMs). • During the reduction process, the parameters of the functions in the dictionary can be adjusted dynamically.

  8. Kernel Matching Reduction Algorithms • Constructing a Kernel-Based Dictionary • For a binary classification problem, assume there exist l training examples, which form the training set S = {(x1, y1), (x2, y2), . . . , (xl, yl)}, where xi ∈ Rd, yi ∈ {−1, +1}, and yi represents the class label of the point xi, i = 1, 2, . . . , l.

  9. Kernel Matching Reduction Algorithms • Constructing a Kernel-Based Dictionary • Given a kernel function K : Rd × Rd → R, similar to KMP, we use kernel functions, centered on the training points, as our dictionary: D = {K(x, xi)|i =1, . . . , l}.

  10. Kernel Matching Reduction Algorithms • Constructing a Kernel-Based Dictionary • Here, the Gaussian kernel function is selected: K(x, xi) = exp(−||x − xi||^2 / (2σi^2)), where σi is the kernel width associated with the center xi.

  11. Kernel Matching Reduction Algorithms • Constructing a Kernel-Based Dictionary • The value of σi should be set so as to keep the influence of the local domain around xi and to prevent xi from having a high activation in regions far from xi.

  12. Kernel Matching Reduction Algorithms • Constructing a Kernel-Based Dictionary • Therefore, we adopt the heuristic in equation (2), which sets σi from the distances to the p nearest neighbors of xi. In this way, the receptive width of each point is determined to cover a certain region in the sample space.
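Equation (2) itself is not reproduced in this transcript. As a rough illustration only, the Python/NumPy sketch below (helper name `kernel_widths` is ours) assumes σi is set to the mean distance from xi to its p nearest neighbors; the paper's exact formula may differ.

```python
import numpy as np

def kernel_widths(X, p=2):
    """Heuristic kernel widths: sigma_i is set from the p nearest neighbors of x_i.

    The exact form of equation (2) is not shown in the transcript; here we take the
    mean distance from x_i to its p nearest neighbors, one common choice for such a
    heuristic.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances, shape (l, l)
    np.fill_diagonal(dists, np.inf)             # exclude the point itself
    nearest = np.sort(dists, axis=1)[:, :p]     # distances to the p nearest neighbors
    return nearest.mean(axis=1)                 # one width sigma_i per training point
```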

  13. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • Using all the kernel functions in the kernel-based dictionary D = {K(x, xi)|i = 1, . . . , l}, we construct a mapping from the original space to a feature space. • Any training example xi in S is mapped to a corresponding point zi, where zi = (K(xi, x1), K(xi, x2), . . . , K(xi, xl)). • The training set S = {(x1, y1), (x2, y2), . . . , (xl, yl)} in the original space is thus mapped to S′ = {(z1, y1), (z2, y2), . . . , (zl, yl)} in the feature space.
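A minimal sketch of this mapping, assuming the Gaussian kernel of slide 10 and the `kernel_widths` helper above; row i of the returned matrix is zi = (K(xi, x1), . . . , K(xi, xl)).

```python
def gaussian_kernel_matrix(X, centers, sigmas):
    """Map each row x of X to z = (K(x, x_1), ..., K(x, x_m)) using Gaussian
    kernels centered on `centers`, with one width per center in `sigmas`."""
    sq_dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
    return np.exp(-sq_dists / (2.0 * sigmas[None, :] ** 2))

# Z = gaussian_kernel_matrix(X_train, X_train, kernel_widths(X_train, p=2))
# gives the mapped training set S'; Z has shape (l, l) while the full dictionary is used.
```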

  14. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • We design a linear decision function g(z) = sign(f(z)) in the feature space, with f(z) = (w • z) + b (3), which corresponds to the nonlinear form f(x) = Σi wi K(x, xi) + b (4) in the original space, where w = (w1, w2, . . . , wl) represents the weights of every dimension of z and b is the bias.
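For concreteness, a sketch of evaluating the nonlinear form (4) once the weights w and bias b have been found in the feature space (the helper name is ours; it reuses `gaussian_kernel_matrix` from above).

```python
def kmra_predict(X_new, centers, sigmas, w, b):
    """Evaluate sign(f(x)) with f(x) = sum_i w_i K(x, x_i) + b, i.e. the linear
    decision function of (3) pulled back to the original input space as in (4)."""
    Z = gaussian_kernel_matrix(X_new, centers, sigmas)   # shape (n, |D|)
    return np.sign(Z @ w + b)
```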

  15. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • We can decide which kernel functions are important for classification, and which are not, according to their weight magnitudes |wi| in (3) or (4), where |wi| denotes the absolute value of wi. The redundant kernel functions, which have the lowest weight magnitudes, can be deleted from the dictionary to reduce the model.

  16. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • Using the usual least-squares error criterion to find this function is not practical, since at the beginning the number of training examples equals, or is close to, the dimension of the feature space S′, and we would confront a non-invertible matrix.

  17. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • In fact, support vector machines (SVMs), based on structural risk minimization, are well suited to supervised classification problems of high dimension. We therefore adopt linear SVMs to find the classification function in (3) or (4) on S′.

  18. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • The optimization objective of linear SVMs is to minimize (1/2)||w||^2 + C Σi ξi (5), • subject to the constraints yi[(w • zi) + b] ≥ 1 − ξi and ξi ≥ 0, i = 1, 2, . . . , l.
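The paper solves (5) with LIBSVM; as a stand-in, the sketch below uses scikit-learn's LinearSVC (liblinear), which optimizes the same soft-margin linear SVM objective. The weight vector w and bias b are read from the fitted model.

```python
from sklearn.svm import LinearSVC

def fit_linear_svm(Z, y, C):
    """Train the soft-margin linear SVM of (5) on the mapped examples Z
    (a scikit-learn stand-in for the solver used in the paper)."""
    clf = LinearSVC(C=C, loss="hinge", dual=True, max_iter=10000)
    clf.fit(Z, y)
    w = clf.coef_.ravel()       # one weight w_i per kernel function in the dictionary
    b = clf.intercept_[0]
    return w, b, clf
```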

  19. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • wi denotes the contribution of zi to the classifier in (3); the higher the value of |wi|, the larger the contribution of zi to the model. • Consequently, we can rank the zi according to the values of |wi| (i = 1, 2, . . . , l) from large to small. We can also rank the xi by |wi|, because xi is the preimage of zi in the original space.

  20. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • The xi with the smallest |wi| can be deleted from the dictionary D, reducing D to D′. • We then continue this procedure on the new dictionary D′; the process is performed iteratively until a given stop criterion is satisfied. • Note that each σ should be recomputed on the new dictionary D′, according to (2), every time D is reduced to D′, so that the receptive widths of the kernel functions in D′ always cover the whole sample space.
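A sketch of one reduction step under the assumptions above: drop the center(s) with the smallest |wi|; the caller then recomputes the widths σ on the reduced dictionary per (2).

```python
def reduce_dictionary(centers, w, n_remove=1):
    """Delete the n_remove dictionary centers with the smallest |w_i|;
    the widths sigma must then be recomputed on the reduced dictionary."""
    order = np.argsort(np.abs(w))        # indices sorted by |w_i|, smallest first
    keep = np.sort(order[n_remove:])     # keep all but the smallest-weight centers
    return centers[keep], keep
```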

  21. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • We can set a minimum tolerated accuracy δ on the training examples as the termination criterion of this procedure. • We expect to obtain the simplest model that still guarantees a satisfactory classification accuracy on all training examples. • This idea accords with the minimum description length principle and Occam's razor. • Therefore, this algorithm can be expected to have good generalization ability.

  22. Kernel Matching Reduction Algorithms • Reducing the Kernel-Based Dictionary by Linear SVMs • Unlike KMP, which gradually appends kernel functions to the final model, this reduction strategy can be expected to avoid local optima, precisely because it iteratively deletes redundant functions from the function dictionary.

  23. Kernel Matching Reduction Algorithms • The Detailed Procedure of KMRAs • Step 1, Set the parameter p in (2), the cross validation fold number v for determining C in (5), and the required classification accuracy δ on the training examples. • Step 2, Input training examples S = {(x1, y1), (x2, y2), . . . , (xl, yl)}. • Step 3, Compute each σ by the equation (2), and construct the kernel-based dictionary D = {K(x, xi)|i = 1, . . . , l}.

  24. Kernel Matching Reduction Algorithms • The Detailed Procedure of KMRAs • Step 4, Transform S to S' by the dictionary D. • Step 5, Determine C by v-fold cross validation. • Step 6, Train the linear SVM with the penalty factor C on S', and obtain the classification model, including wi, i = 1, 2, . . . , l. • Step 7, Rank the xi by their weight magnitudes |wi|, i = 1, 2, . . . , l.

  25. Kernel Matching Reduction Algorithms • The Detailed Procedure of KMRAs • Step 8, If the classification accuracy of this model on the training data is higher than δ, delete from D the K(x, xi) that has the smallest |wi|, then adjust each σ for the new D by (2), and go to Step 4; otherwise go to Step 9. • Step 9, Output the classification model, which satisfies the accuracy δ with the simplest structure.
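Putting Steps 1–9 together, a compact sketch under the assumptions above (helper names are ours, the C grid for cross-validation is illustrative, and the returned model is the last one that still met the accuracy δ).

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

def kmra(X, y, p=2, v=5, delta=0.9, C_grid=(0.1, 1.0, 10.0, 100.0), n_remove=1):
    """Sketch of the KMRA procedure: build the full dictionary, then repeatedly
    train a linear SVM on the mapped data and delete the kernel function with
    the smallest |w_i| while training accuracy stays above delta."""
    centers = X.copy()
    best = None
    while len(centers) > n_remove:
        sigmas = kernel_widths(centers, p)                        # Step 3 / Step 8: (re)compute sigma by (2)
        Z = gaussian_kernel_matrix(X, centers, sigmas)            # Step 4: transform S to S'
        search = GridSearchCV(LinearSVC(dual=True, max_iter=10000),
                              {"C": list(C_grid)}, cv=v)          # Step 5: v-fold CV for C
        search.fit(Z, y)
        w, b, clf = fit_linear_svm(Z, y, search.best_params_["C"])   # Step 6: train linear SVM
        if accuracy_score(y, clf.predict(Z)) <= delta:            # Step 8: stop once training accuracy drops
            break
        best = (centers.copy(), sigmas, w, b)                     # simplest model so far that meets delta
        centers, _ = reduce_dictionary(centers, w, n_remove)      # Steps 7-8: rank by |w_i|, delete smallest
    return best                                                   # Step 9: output the reduced model
```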

  26. Kernel Matching Reduction Algorithms • The Detailed Procedure of KMRAs • The reduction in Step 8 can be generalized to remove more than one basis function per iteration, to improve the training speed.

  27. Comparing with Other Machine Learning Algorithms • Although KMRAs, KMP, SVMs, HSSVMs, and RBFNNs can all generate decision functions of a shape similar to equation (4), KMRAs have essentially distinct characteristics compared with these other algorithms.

  28. Comparing with Other Machine Learning Algorithms • Differences with KMP • Both KMRAs and KMP build kernel-based dictionaries, but they adopt different ways of selecting basis functions for the final solution. KMP appends kernel functions iteratively to the classification model. By contrast, KMRAs reduce the size of the dictionary step by step by deleting redundant kernel functions. • Moreover, unlike KMP, KMRAs utilize linear SVMs to find solutions in the feature space.

  29. Comparing with Other Machine Learning Algorithms • KMRA Versus SVM • The main difference between KMRAs and SVMs lies in how the feature spaces are produced: KMRAs create the feature space through a kernel-based dictionary, whereas SVMs create it through kernel functions. • Kernel functions in SVMs must satisfy Mercer's theorem, while KMRAs place no restrictions on the kernel functions in the dictionary. The comparison between KMRAs and SVMs is similar to that between KMP and SVMs. In fact, we select Gaussian kernel functions in this paper, which can have different kernel widths obtained by equation (2), whereas the Gaussian kernel functions for all support vectors of an SVM share the same kernel width.

  30. Comparing with Other Machine Learning Algorithms • Linking with HSSVMs • Hidden space support vector machines (HSSVMs) also map input patterns into a high-dimensional hidden space by a set of nonlinear functions, and then train linear SVMs in the hidden space. From this viewpoint of constructing feature spaces and applying linear SVMs, KMRAs are similar to HSSVMs. But we adopt an iterative procedure to eliminate redundant kernel functions until a compact solution is obtained. • KMRAs can therefore be considered an improved version of HSSVMs.

  31. Comparing with Other Machine Learning Algorithms • Relation with RBFNNs • Although RBFNNs also build feature spaces, usually with Gaussian kernel functions, they create discriminant functions in the least-squares sense. KMRAs instead use linear SVMs, i.e. the idea of structural risk minimization, to find solutions. • In a broad sense, KMRAs can be regarded as a special kind of RBFNN with a new configuration-design strategy.

  32. Experiments • Description of Data Sets and Parameter Settings • We compare KMRAs with SVMs on four datasets: Wisconsin Breast Cancer, Pima Indians Diabetes, Heart, and Australian; the first two are from the UCI machine learning repository, and the latter two are from the Statlog database. • We directly use the LIBSVM software package to run the standard SVM.

  33. Experiments • Description of Data Sets and Parameter Settings • Throughout the experiments: • 1. All training data and test data are normalized to [−1, 1]. • 2. Two-thirds of the examples are randomly selected as training examples, and the remaining one-third as test examples. • 3. Gaussian kernel functions are chosen for SVMs, in which the kernel width σ and the penalty parameter C are decided by ten-fold cross validation on the training set. • 4. p = 2 is adopted in equation (2). • 5. v = 5 is set in Step 5 of the KMRA algorithm. • 6. For each dataset, the SVM is trained first, and then, according to its classification accuracy, we determine the stopping accuracy δ for KMRAs.
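A sketch of items 1 and 2 of this protocol using scikit-learn (the exact tooling and random seed used in the paper are not stated; fitting the scaler on the training split is one reasonable reading of item 1).

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def prepare_data(X, y, seed=0):
    """Scale features to [-1, 1] and hold out one third of the examples as the test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3,
                                              random_state=seed, stratify=y)
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_tr)
    return scaler.transform(X_tr), scaler.transform(X_te), y_tr, y_te
```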

  34. Experiments • Experimental Results • We first present the results of standard SVMs, including their parameters C and σ in Table 1, and the numbers of support vectors (#SVs) and prediction accuracies in Table 2.

  35. Experiments • Experimental Results • We set the termination accuracy δ = 0.97, 0.8, 0.8, and 0.9 in KMRAs for these four datasets, respectively, according to the classification accuracies of SVMs in Table 2. • We run KMRAs on these datasets and record the classification accuracy on the test set at each iteration as the algorithms run. The results are shown in Fig. 1.

  36. Experiments • Experimental Results • In Fig. 1, the accuracies of SVMs on the test examples are shown as thick straight lines, and the thin curves represent the classification performance of KMRAs. The horizontal axis denotes the iteration count of KMRAs; that is, the number of kernel functions in the dictionary decreases gradually from left to right.

  37. Experiments • Experimental Results • For Diabetes and Australian, the prediction accuracies of KMRAs improve gradually as the kernel functions in the dictionary are reduced. From the behavior at the beginning of the KMRA runs, we can conclude that overfitting occurs there. Before KMRAs terminate, their performance approaches, and even exceeds, that of SVMs. • For Breast and Heart, from beginning to end, the KMRA curves fluctuate up and down around the SVM accuracy lines.

  38. Experiments • Experimental Results • Table 2 further lists the numbers of kernel functions (i.e., #SVs) appearing in the final classification functions, as well as the corresponding prediction accuracies, when KMRAs terminate. • Moreover, we record the best performance observed during the iterative process of KMRAs and also list it in Table 2. • From Table 2, compared with SVMs, KMRAs use far fewer support vectors while obtaining comparable results.

  39.–41. Experiments • Experimental Results (these three slides show the result tables and Fig. 1 referenced above)

  42. Conclusions • We propose KMRAs, which iteratively delete redundant kernel functions from a kernel-based dictionary. We therefore expect KMRAs to avoid local optima and to have good generalization ability. • Experimental results demonstrate that, compared with SVMs, KMRAs achieve comparable accuracies but with typically much sparser representations. This means that KMRAs can classify test examples faster than SVMs. • In addition, analogous to SVMs, KMRAs can be extended to multi-class classification problems, though we only consider the two-class situation in this paper.

  43. Conclusions • We also find that KMRAs gain sparser models at the expense of a longer training time. Consequently, future work should explore how to reduce the training cost. • In conclusion, KMRAs provide a new problem-solving approach for classification.

  44. Thanks!
