
Nonlinear Knowledge in Kernel Approximation


Presentation Transcript


  1. Nonlinear Knowledge in Kernel Approximation Olvi Mangasarian UW Madison & UCSD La Jolla Edward Wild UW Madison

  2. Objectives • Primary objective: Incorporate prior knowledge over completely arbitrary sets into function approximation without transforming (kernelizing) the knowledge • Secondary objective: Achieve transparency of the prior knowledge for practical applications

  3. Outline • Use kernels for function approximation • Incorporate prior knowledge • Previous approaches: require transformation of knowledge • New approach does not require any transformation of knowledge • Knowledge given over completely arbitrary sets • Experimental results • Two synthetic examples and one real world example related to breast cancer prognosis • Approximations with prior knowledge more accurate than approximations without prior knowledge

  4. Function Approximation • Given a set of m points in n-dimensional real space R^n and corresponding function values in R • Points are represented by the rows of a matrix A ∈ R^(m×n) • Exact or approximate function values for each point are given by a corresponding vector y ∈ R^m • Find a function f: R^n → R based on the given data such that • f(Aᵢ′) ≈ yᵢ

  5. Function Approximation • Problem: utilizing only given data may result in a poor approximation • Points may be noisy • Sampling may be costly • Solution: use prior knowledge to improve the approximation

  6. Adding Prior Knowledge • Standard approach: fit function at given data points without knowledge • Constrained approach: satisfy inequalities at given points • 2004 MSW paper: satisfy linear inequalities over polyhedral regions • Proposed new approach: satisfy nonlinear inequalities over arbitrary regions

  7. Kernel Approximation • Approximate f by a nonlinear kernel function K • f(x) ≈ K(x′, A′)α + b • Gaussian kernel: K(x′, A′)ᵢ = exp(−μ‖x − Aᵢ′‖²) • Error in kernel approximation of the given data: −s ≤ K(A, A′)α + be − y ≤ s • e is a vector of ones in R^m
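A minimal numpy sketch of this slide's approximation, f(x) ≈ K(x′, A′)α + b with a Gaussian kernel; the helper names and the kernel parameter mu are my own illustrative choices, not from the talk:

```python
import numpy as np

def gaussian_kernel(X, A, mu=0.1):
    # K(X', A')_ij = exp(-mu * ||X_i - A_j||^2) for rows X_i of X and A_j of A
    sq_dists = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq_dists)

def f_approx(X, A, alpha, b, mu=0.1):
    # Evaluate f(x) = K(x', A') alpha + b at each row of X
    return gaussian_kernel(X, A, mu) @ alpha + b
```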

  8. Kernel Approximation • Trade off between solution complexity (‖α‖₁) and data fitting (‖s‖₁) • Convert to a linear program by bounding α elementwise: −a ≤ α ≤ a • At the solution • e′a = ‖α‖₁ • e′s = ‖s‖₁
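One concrete way to pose this as a linear program, minimizing e′a + C·e′s subject to −s ≤ K(A, A′)α + be − y ≤ s and −a ≤ α ≤ a, here via scipy.optimize.linprog; the variable packing and the trade-off constant C are my choices, a sketch rather than the authors' code:

```python
import numpy as np
from scipy.optimize import linprog

def fit_kernel_lp(K, y, C=10.0):
    # Decision vector z = [alpha (m), b (1), s (m), a (m)]
    m = K.shape[0]
    c = np.zeros(3 * m + 1)
    c[m + 1:2 * m + 1] = C        # C * e's  (data-fitting error)
    c[2 * m + 1:] = 1.0           # e'a      (solution complexity ||alpha||_1)
    e, I, Z = np.ones((m, 1)), np.eye(m), np.zeros((m, m))
    z1 = np.zeros((m, 1))
    A_ub = np.vstack([
        np.hstack([ K,  e, -I,  Z]),   #  K alpha + b e - y <= s
        np.hstack([-K, -e, -I,  Z]),   # -(K alpha + b e - y) <= s
        np.hstack([ I, z1,  Z, -I]),   #  alpha <= a
        np.hstack([-I, z1,  Z, -I]),   # -alpha <= a
    ])
    b_ub = np.concatenate([y, -y, np.zeros(2 * m)])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m], res.x[m]     # alpha, b
```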

  9. Incorporating Nonlinear Prior Knowledge (MSW 2004) • Bx ≤ d ⟹ α′Ax + b ≥ h′x + β • Need to “kernelize” knowledge from input space to feature space of the kernel • Requires the change of variable x = A′t • BA′t ≤ d ⟹ α′AA′t + b ≥ h′A′t + β • K(B, A′)t ≤ d ⟹ α′K(A, A′)t + b ≥ h′A′t + β • Motzkin’s theorem of the alternative gives an equivalent linear system of inequalities which is added to a linear program • Achieves good numerical results, but the kernelized knowledge is not readily interpretable in the original input space

  10. Incorporating Nonlinear Prior Knowledge: New Approach • Start with an arbitrary nonlinear knowledge implication • g(x) ≤ 0 ⟹ K(x′, A′)α + b ≥ h(x), ∀x ∈ Γ ⊆ R^n • g, h are arbitrary functions • g: Γ → R^k, h: Γ → R • Problem: need to add this knowledge to the optimization problem • Logically equivalent system: • g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ

  11. Prior Knowledge as a System of Linear Inequalities • Use a theorem of the alternative for convex functions: Assume that g(x) and K(x′, A′)α + b − h(x) are convex functions of x, that Γ is convex, and that ∃x ∈ Γ: g(x) < 0. Then either: • I. g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has a solution x ∈ Γ, or • II. ∃v ∈ R^k, v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ • But never both. • If we can find v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ, then by the above theorem • g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ, or equivalently: • g(x) ≤ 0 ⟹ K(x′, A′)α + b ≥ h(x), ∀x ∈ Γ

  12. Proof • ¬I ⟹ II • Follows from OLM 1969, Corollary 4.2.2 and the existence of an x ∈ Γ such that g(x) < 0. • I ⟹ ¬II • Suppose not. That is, there exist x ∈ Γ and v ∈ R^k, v ≥ 0, such that: • g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 (I) • v ≥ 0, v′g(x) + K(x′, A′)α + b − h(x) ≥ 0, ∀x ∈ Γ (II) • Then, since v ≥ 0 and g(x) ≤ 0 imply v′g(x) ≤ 0, we have the contradiction: • 0 > v′g(x) + K(x′, A′)α + b − h(x) ≥ 0 • This direction requires no assumptions on g, h, K, or Γ whatsoever

  13. Example • g(x) = 1250 − x³, f(x) = x⁴, h(x) = x² + 5000 • g(x) ≤ 0 ⟹ f(x) ≥ h(x), i.e. x³ ≥ 1250 ⟹ x⁴ ≥ x² + 5000 • Certified by exhibiting v ≥ 0 with v·g(x) + f(x) − h(x) ≥ 0 for all x, so that ¬I and II hold
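A quick numerical check of this example; the particular multiplier v = 10 is my own choice, found by scanning, since the slide only asserts that some v ≥ 0 exists:

```python
import numpy as np

g = lambda x: 1250 - x**3
f = lambda x: x**4
h = lambda x: x**2 + 5000

v = 10.0                          # candidate multiplier (my choice)
x = np.linspace(-50, 50, 200001)
assert np.all(v * g(x) + f(x) - h(x) >= 0)   # certificate II holds on the grid
# Since v >= 0 and g(x) <= 0 give v*g(x) <= 0, the certificate rules out
# f(x) < h(x) wherever g(x) <= 0, i.e. x**3 >= 1250 implies x**4 >= x**2 + 5000.
```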

  14. Incorporating Prior Knowledge • Linear semi-infinite program: infinite number of constraints • Discretize: finite linear program • Slacks allow knowledge to be satisfied inexactly • Add term to objective function to drive slacks to zero
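A hedged sketch of the resulting finite linear program: it extends fit_kernel_lp above with the discretized knowledge constraints K(xᵢ′, A′)α + b − h(xᵢ) + v′g(xᵢ) + zᵢ ≥ 0, slacks z ≥ 0, and a penalty weight mu on e′z to drive the slacks to zero; the `sense` switch (±1 for knowledge of the form f ≥ h or f ≤ h) and all constants are my additions:

```python
import numpy as np
from scipy.optimize import linprog

def fit_with_knowledge(K, y, K_disc, G, h_vals, C=10.0, mu=10.0, sense=1.0):
    # z = [alpha (m), b (1), s (m), a (m), v (k), z (p)];
    # rows of G are g(x_i)' at the p discretization points, K_disc = K(X_disc', A')
    m = K.shape[0]
    p, k = G.shape
    c = np.zeros(3 * m + 1 + k + p)
    c[m + 1:2 * m + 1] = C            # data-fitting error
    c[2 * m + 1:3 * m + 1] = 1.0      # ||alpha||_1
    c[3 * m + 1 + k:] = mu            # knowledge slack penalty, drives z to zero
    e, I = np.ones((m, 1)), np.eye(m)
    Zm, Zk, Zp = np.zeros((m, m)), np.zeros((m, k)), np.zeros((m, p))
    z1 = np.zeros((m, 1))
    A_ub = np.vstack([
        np.hstack([ K,  e, -I, Zm, Zk, Zp]),    #  K alpha + b e - y <= s
        np.hstack([-K, -e, -I, Zm, Zk, Zp]),    # -(K alpha + b e - y) <= s
        np.hstack([ I, z1, Zm, -I, Zk, Zp]),    #  alpha <= a
        np.hstack([-I, z1, Zm, -I, Zk, Zp]),    # -alpha <= a
        # knowledge: sense*(K_disc alpha + b) + G v + z >= sense*h_vals
        np.hstack([-sense * K_disc, -sense * np.ones((p, 1)),
                   np.zeros((p, 2 * m)), -G, -np.eye(p)]),
    ])
    b_ub = np.concatenate([y, -y, np.zeros(2 * m), -sense * h_vals])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m + k + p)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m], res.x[m]        # alpha, b
```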

  15. Numerical Experience • Evaluate on three datasets • Two synthetic datasets • Wisconsin Prognostic Breast Cancer Database • Compare approximation with prior knowledge to one without prior knowledge • Prior knowledge leads to an improved approximation • Prior knowledge used cannot be handled exactly by previous work • No kernelization needed on knowledge set

  16. Two-Dimensional Hyperboloid • f(x₁, x₂) = x₁x₂

  17. Two-Dimensional Hyperboloid • Given exact values only at 11 points along the line x₁ = x₂ • At x₁ ∈ {−5, …, 5}

  18. Two-Dimensional Hyperboloid Approximation without Prior Knowledge

  19. Two-Dimensional Hyperboloid • Add prior knowledge: • x₁x₂ ≤ −1 ⟹ f(x₁, x₂) ≤ x₁x₂ • The nonlinear term x₁x₂ cannot be handled exactly by any previous approach • Discretization used only 11 points along the line x₁ = −x₂, x₁ ∈ {−5, −4, …, 4, 5}
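Putting the hyperboloid experiment together with the sketches above (parameter choices mine); sense=-1 encodes the upper-bound form f ≤ h of the implication:

```python
import numpy as np

# 11 exact data points along the line x1 = x2, where f(x1, x2) = x1*x2 = x1^2
t = np.arange(-5.0, 6.0)
A = np.column_stack([t, t])
y = t * t

# 11 knowledge discretization points along the line x1 = -x2
Xd = np.column_stack([t, -t])
G = (Xd[:, 0] * Xd[:, 1] + 1.0).reshape(-1, 1)  # g(x) = x1*x2 + 1, so g <= 0 <=> x1*x2 <= -1
h_vals = Xd[:, 0] * Xd[:, 1]                    # h(x) = x1*x2

alpha, b = fit_with_knowledge(gaussian_kernel(A, A), y,
                              gaussian_kernel(Xd, A), G, h_vals, sense=-1.0)
```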

  20. Two-Dimensional Hyperboloid Approximation with Prior Knowledge

  21. Two-Dimensional Tower Function

  22. Two-Dimensional Tower Function Data • Given 400 points on the grid [−4, 4] × [−4, 4] • Values are min{g(x), 2}, where g(x) is the exact tower function
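A sketch of how this data could be generated; the talk does not give the tower function's formula, so `tower` below is an explicit placeholder:

```python
import numpy as np

def tower(X):
    # Placeholder: substitute the exact tower function g(x) here;
    # its formula is not given in the talk.
    raise NotImplementedError

# 400 given points on a 20 x 20 grid over [-4, 4] x [-4, 4]
g1, g2 = np.meshgrid(np.linspace(-4, 4, 20), np.linspace(-4, 4, 20))
A = np.column_stack([g1.ravel(), g2.ravel()])
y = np.minimum(tower(A), 2.0)   # given values capped at 2, per the slide
```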

  23. Two-Dimensional Tower Function Approximation without Prior Knowledge

  24. Two-Dimensional Tower Function Prior Knowledge • Add prior knowledge: • (x₁, x₂) ∈ [−4, 4] × [−4, 4] ⟹ f(x) = g(x) • The prior knowledge is the exact function value • Enforced at 2500 points on the grid [−4, 4] × [−4, 4] through the above implication • Principal objective of the prior knowledge is to overcome the poor given data
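One way to express the equality f(x) = g(x) in the one-sided framework is as two opposite implications with a vacuous left-hand side; this encoding is my inference, not spelled out in the talk:

```python
import numpy as np

# 2500 knowledge points on a 50 x 50 grid over [-4, 4] x [-4, 4]
k1, k2 = np.meshgrid(np.linspace(-4, 4, 50), np.linspace(-4, 4, 50))
Xd = np.column_stack([k1.ravel(), k2.ravel()])

G = np.zeros((len(Xd), 1))   # condition g(x) = 0 <= 0 always holds on the grid
h_vals = tower(Xd)           # exact tower values (see placeholder above)
# Appending these constraints once with sense=+1 (f >= g) and once with
# sense=-1 (f <= g) pins the approximation to g on the grid, up to the slacks.
```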

  25. Two-Dimensional Tower Function Approximation with Prior Knowledge

  26. Predicting Lymph Node Metastasis • Number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence • Determined by surgery in addition to the removal of the tumor • Wisconsin Prognostic Breast Cancer (WPBC) data • Lymph node metastasis for 194 patients • 30 cytological features from a fine-needle aspirate • Tumor size, obtained during surgery • Task: predict the number of metastasized lymph nodes given tumor size alone

  27. Predicting Lymph Node Metastasis • Split data into two portions • Past data: 20% used to find prior knowledge • Present data: 80% used to evaluate performance • Simulates acquiring prior knowledge from an expert’s experience

  28. Prior Knowledge for Lymph Node Metastasis • Use kernel approximation without knowledge on the past data • f₁(x) = K(x′, A₁′)α₁ + b₁ • A₁ is the matrix of the past data points • Use density estimation to decide where to enforce knowledge • p(x) is the empirical density of the past data • Knowledge: the predicted number of metastasized lymph nodes is at least the value predicted on the past data, within a tolerance of 0.01 • p(x) ≥ 0.1 ⟹ f(x) ≥ f₁(x) − 0.01
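A sketch of this pipeline under stated assumptions: A_past/y_past and A_present/y_present are the 20%/80% splits (tumor size as the single feature), the empirical density p(x) comes from scipy's gaussian_kde, and the knowledge is discretized at the past points (my choice):

```python
import numpy as np
from scipy.stats import gaussian_kde

def wpbc_fit(A_past, y_past, A_present, y_present):
    # 1. Fit f1 on the past data with the plain data-fitting LP
    alpha1, b1 = fit_kernel_lp(gaussian_kernel(A_past, A_past), y_past)

    # 2. Empirical density of the past data (gaussian_kde expects shape (d, n))
    kde = gaussian_kde(A_past.T)

    # 3. Knowledge p(x) >= 0.1 => f(x) >= f1(x) - 0.01, i.e. g(x) = 0.1 - p(x),
    #    discretized here at the past points themselves
    X_disc = A_past
    G = (0.1 - kde(X_disc.T)).reshape(-1, 1)
    h_vals = f_approx(X_disc, A_past, alpha1, b1) - 0.01

    # 4. Fit the final approximation on the present data with the knowledge
    return fit_with_knowledge(gaussian_kernel(A_present, A_present), y_present,
                              gaussian_kernel(X_disc, A_present), G, h_vals)
```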

  29. Predicting Lymph Node Metastasis: Results • Table shows the root-mean-squared error (RMSE) of the past-data (20%) approximation f₁(x) on the present data (80%) • Leave-one-out (LOO) RMSE reported for approximations with and without knowledge • Improvement due to knowledge: 14.8%

  30. Conclusion • Added general nonlinear prior knowledge to kernel approximation • Implemented as linear inequalities in a linear programming problem • Knowledge incorporated transparently • Demonstrated effectiveness • Two synthetic examples • Real world problem from breast cancer prognosis • Future work • More general prior knowledge with inequalities replaced by more general functions • Apply to classification problems

  31. Questions • Websites linking to papers and talks: • http://www.cs.wisc.edu/~olvi/ • http://www.cs.wisc.edu/~wildt/
