
Statistical Learning



  1. Statistical Learning Dong Liu Dept. EEIS, USTC

  2. Chapter 1. Linear Regression
  • From one to two
  • Regularization
  • Basis functions
  • Bias-variance decomposition
  • Different regularization forms
  • Bayesian approach

  3. A motivating example 1/2
  • What is the height of Mount Qomolangma?
  • A piece of knowledge – one variable
  • How do we achieve this "knowledge" from data?
  • We have a series of measurements x_1, x_2, ..., x_N
  • For example, we can use the (arithmetic) mean: x_bar = (1/N) * sum_n x_n

    def HeightOfQomolangma():
        # the "knowledge" hard-coded as a constant
        return 8848.0

    def SLHeightOfQomolangma(data):
        # the "knowledge" obtained from data: the arithmetic mean
        return sum(data) / len(data)

  4. A motivating example 2/2
  • Or in this way:

    hQomo = 0

    # Learning/Training
    def LearnHeightOfQomolangma(data):
        global hQomo
        hQomo = sum(data) / len(data)

    # Using/Testing
    def UseHeightOfQomolangma():
        global hQomo
        return hQomo

  5. Why arithmetic mean?
  • Least squares: minimize E(h) = sum_n (x_n - h)^2
  • Solving the problem: setting dE/dh = -2 * sum_n (x_n - h) = 0 gives h = (1/N) * sum_n x_n, the arithmetic mean
  • Relative (local) minimum vs. absolute (global) minimum: here E(h) is convex, so the stationary point is the global minimum
  • In statistical learning, we often formulate such optimization problems and try to solve them
  • How to formulate? How to solve? (see the numerical check below)
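  A minimal sketch, not from the slides: on made-up measurement data it checks numerically that the least-squares objective is minimized at the arithmetic mean.

    import numpy as np

    # Hypothetical measurement data (not actual Qomolangma measurements)
    data = np.array([8847.9, 8848.3, 8848.1, 8847.8, 8848.4])

    def sse(h, data):
        # least-squares objective: sum of squared deviations from h
        return np.sum((data - h) ** 2)

    # Evaluate the objective on a dense grid of candidate heights
    candidates = np.linspace(data.min(), data.max(), 10001)
    best = candidates[np.argmin([sse(h, data) for h in candidates])]

    print("grid minimizer :", best)
    print("arithmetic mean:", data.mean())   # the two agree up to the grid spacing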

  6. From the statistical perspective
  • The height of Qomolangma is a random variable X, which obeys a specific probability distribution
  • For example, the Gaussian (normal) distribution: p(x) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
  • The measurements are observations of the random variable, and are used to estimate the distribution
  • Assumption: the observations are independent and identically distributed (i.i.d.)

  7. Maximum likelihood estimation
  • Likelihood function: p(x_n | mu, sigma^2), viewed as a function of the parameter mu
  • Overall likelihood function (recall i.i.d.): L(mu) = prod_n p(x_n | mu, sigma^2)
  • We need to find a parameter that maximizes the overall likelihood: mu_ML = argmax_mu L(mu) = argmax_mu log L(mu)
  • And it reduces to least squares! Maximizing log L(mu) is the same as minimizing sum_n (x_n - mu)^2

  8. More is implied
  • We can also estimate other parameters, e.g. the variance: sigma^2_ML = (1/N) * sum_n (x_n - mu_ML)^2
  • We can use other estimators, like the unbiased one: s^2 = (1/(N-1)) * sum_n (x_n - mu_ML)^2
  • We can give range (interval) estimation rather than point estimation
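  A minimal sketch, again on made-up data, contrasting the ML variance estimate with the unbiased one; numpy's ddof argument switches between the two.

    import numpy as np

    data = np.array([8847.9, 8848.3, 8848.1, 8847.8, 8848.4])  # hypothetical measurements

    mu_ml = data.mean()
    var_ml = np.mean((data - mu_ml) ** 2)                          # ML estimate, divides by N
    var_unbiased = np.sum((data - mu_ml) ** 2) / (len(data) - 1)   # unbiased, divides by N-1

    print(var_ml, np.var(data, ddof=0))        # same value
    print(var_unbiased, np.var(data, ddof=1))  # same value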

  9. Correlated variables
  • The height of Mount Qomolangma is correlated with the season
  • So what is the correlation between the two variables?
  • Why not an affine function y = a*x + b, where x encodes the season (Spring, Summer, Fall, Winter)?

    def UseSeasonalHeight(x, a, b):
        return a * x + b

  10. Least squares
  • We formulate the optimization problem as min_{a,b} sum_n (y_n - a*x_n - b)^2
  • And (fortunately) we have a closed-form solution: a = sum_n (x_n - x_bar)(y_n - y_bar) / sum_n (x_n - x_bar)^2, b = y_bar - a*x_bar
  • [Figure: fitted line over the seasonal height data]
  • Seemingly not good; how to improve?
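  A minimal sketch of the closed-form affine fit; the seasonal data values here are made up for illustration and are not the data behind the slide's figure.

    import numpy as np

    # Hypothetical data: x is the season index (0=Spring, 1=Summer, 2=Fall, 3=Winter),
    # y is a measured height for that season
    x = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
    y = np.array([8847.8, 8849.5, 8848.0, 8846.9, 8848.1, 8849.2, 8847.7, 8847.0])

    def fit_affine(x, y):
        # closed-form least-squares solution for y ~ a*x + b
        xbar, ybar = x.mean(), y.mean()
        a = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
        b = ybar - a * xbar
        return a, b

    a, b = fit_affine(x, y)
    print("y = %f * x + %f" % (a, b))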

  11. Variable (re)mapping
  • Previously we use the season index directly as x
  • Now we use a remapped version of the season variable
  • Result: the fitting error drops
    Season:          3.6146646254138832
    Remapped season: 0.9404394822254982

    def ErrorOfHeight(datax, datay, a, b):
        # sum of squared fitting errors; datax and datay are numpy arrays
        fity = UseSeasonalHeight(datax, a, b)
        error = datay - fity
        return sum(error ** 2)

  12. From the statistical perspective
  • We have two random variables
  • Height: a dependent, continuous variable Y
  • Season: an independent, discrete variable X
  • The season's probability distribution: p(x)
  • The height's probability distribution, conditioned on the season: p(y | x) = N(y | a*x + b, sigma^2)
  • The overall likelihood function: prod_n p(x_n) * p(y_n | x_n)

  13. History review
  • Carl Friedrich Gauss (German, 1777-1855)
  • Adrien-Marie Legendre (French, 1752-1833)

  14. Notes
  • Correlation is not causation, but it inspires efforts to interpret
  • Remapped/latent variables are important

  15. Chapter 1. Linear Regression
  • From one to two
  • Regularization
  • Basis functions
  • Bias-variance decomposition
  • Different regularization forms
  • Bayesian approach

  16. As we are not confident about our data
  • Height is correlated with the season, but also correlated with other variables
  • Can we constrain the level of correlation between height and season?
  • So we want to constrain the slope parameter a
  • We have two choices:
  • Given a range of possible values of the slope parameter, find the least squares
  • Minimize the least squares and (e.g. the square of) the slope parameter simultaneously

  17. Two optimization problems
  • Constrained form: min_{a,b} sum_n (y_n - a*x_n - b)^2  subject to  a^2 <= t
  • Unconstrained form: min_{a,b} sum_n (y_n - a*x_n - b)^2 + lambda * a^2
  • Solutions for increasing regularization weight (see the sketch below):
    Reg 0:  y = 1.184060 * x + 8846.369904, error: 0.940439
    Reg 1:  y = 1.014908 * x + 8846.708207, error: 1.112113
    Reg 2:  y = 0.888045 * x + 8846.961934, error: 1.466188
    Reg 3:  y = 0.789373 * x + 8847.159277, error: 1.875104
    Reg 4:  y = 0.710436 * x + 8847.317152, error: 2.286357
    Reg 5:  y = 0.645851 * x + 8847.446322, error: 2.678452
    Reg 6:  y = 0.592030 * x + 8847.553964, error: 3.043435
    Reg 7:  y = 0.546489 * x + 8847.645045, error: 3.379417
    Reg 8:  y = 0.507454 * x + 8847.723115, error: 3.687209
    Reg 9:  y = 0.473624 * x + 8847.790776, error: 3.968753
    Reg 10: y = 0.444022 * x + 8847.849978, error: 4.226370
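  A minimal sketch of the unconstrained (penalized) form, using the closed-form solution a = S_xy / (S_xx + lambda), b = y_bar - a*x_bar. The data and the lambda values are hypothetical, so the numbers will not match the printout above.

    import numpy as np

    def ridge_slope_fit(x, y, lam):
        # minimize sum (y - a*x - b)^2 + lam * a^2 (only the slope is penalized)
        xbar, ybar = x.mean(), y.mean()
        sxx = np.sum((x - xbar) ** 2)
        sxy = np.sum((x - xbar) * (y - ybar))
        a = sxy / (sxx + lam)
        b = ybar - a * xbar
        return a, b

    # hypothetical seasonal data, as before
    x = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
    y = np.array([8847.8, 8849.5, 8848.0, 8846.9, 8848.1, 8849.2, 8847.7, 8847.0])

    for k in range(11):
        a, b = ridge_slope_fit(x, y, lam=float(k))
        err = np.sum((y - a * x - b) ** 2)
        print("Reg %d: y = %f * x + %f, error: %f" % (k, a, b, err))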

  18. How to solve a constrained optimization problem?
  • Consider a general optimization problem: minimize f(x) subject to g_i(x) <= 0 (i = 1, ..., m) and h_j(x) = 0 (j = 1, ..., p)
  • The basic idea is to construct an augmented objective function (the Lagrangian) L(x, lambda, nu) = f(x) + sum_i lambda_i * g_i(x) + sum_j nu_j * h_j(x), where lambda_i and nu_j are Lagrange multipliers
  • Then consider the dual function g(lambda, nu) = inf_x L(x, lambda, nu)
  • When lambda >= 0, the dual function is a lower bound of the optimal value of the original problem
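  A minimal sketch, assuming scipy is available, that solves the constrained form of the slope problem with a general-purpose solver; the data and the bound t are illustrative choices, not values from the slides.

    import numpy as np
    from scipy.optimize import minimize

    # hypothetical seasonal data, as before
    x = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
    y = np.array([8847.8, 8849.5, 8848.0, 8846.9, 8848.1, 8849.2, 8847.7, 8847.0])

    def sse(params):
        a, b = params
        return np.sum((y - a * x - b) ** 2)

    t = 0.25  # illustrative bound on a^2
    cons = [{"type": "ineq", "fun": lambda p: t - p[0] ** 2}]  # t - a^2 >= 0, i.e. a^2 <= t

    res = minimize(sse, x0=np.array([0.0, y.mean()]), method="SLSQP", constraints=cons)
    a, b = res.x
    print("constrained solution: a = %f, b = %f, error = %f" % (a, b, res.fun))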

  19. Duality
  • The dual problem: maximize g(lambda, nu) subject to lambda >= 0
  • Weak duality (always true): the dual optimal value d* <= the primal optimal value p*
  • Strong duality (under some condition): d* = p*
  • Considering differentiable functions, strong duality implies the KKT conditions at the optimum:
    • Stationarity: grad f(x) + sum_i lambda_i * grad g_i(x) + sum_j nu_j * grad h_j(x) = 0
    • Primal feasibility: g_i(x) <= 0, h_j(x) = 0
    • Dual feasibility: lambda_i >= 0
    • Complementary slackness: lambda_i * g_i(x) = 0
  • For convex optimization, the KKT conditions imply strong duality

  20. Understanding the KKT condition: from the equivalence of constrained and unconstrained forms
  • Equality as constraint
  • Inequality as constraint

  21. Understanding the KKT condition: from the geometrical perspective
  • Example:
  • No constraint
  • With equality as constraint
  • With inequality as constraint

  22. More about convex optimization 1/2
  • Convex set: for any x, y in the set and any theta in [0, 1], the point theta*x + (1-theta)*y is also in the set
  • Convex function: a function f defined on a convex set such that f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
  • Concave function: a function whose negative is convex
  • An affine function is both convex and concave
  • Convex optimization is to minimize a convex function (or maximize a concave function) over a convex set

  23. More about convex optimization 2/2
  • For a convex optimization problem, any local minimum is also a global minimum
  • Proof: by contradiction (reductio ad absurdum)
  • For a convex optimization problem, if the objective function is strictly convex, then there is only one global minimum
  • Proof: by the definition of a strictly convex function
  • The dual problem is always a convex optimization problem

  24. What & why is regularization?
  • What: a process of introducing additional information in order to solve an ill-posed problem
  • Why: we want to introduce additional information, or we have difficulty in solving the ill-posed problem
  • [Figure: fitting results with and without regularization]

  25. From the statistical perspective
  • The Bayes formula: p(theta | data) = p(data | theta) * p(theta) / p(data)
  • Maximum a posteriori (MAP) estimation (Bayesian estimation): maximize p(data | theta) * p(theta)
  • We need to specify a prior, e.g. a zero-mean Gaussian on the slope: a ~ N(0, sigma_0^2)
  • Finally it reduces to the regularized least squares, with lambda = sigma^2 / sigma_0^2
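  A minimal sketch, with made-up data and assumed variances, checking numerically that the MAP estimate of the slope under a zero-mean Gaussian prior coincides with the penalized least-squares solution when lambda = sigma^2 / sigma_0^2.

    import numpy as np
    from scipy.optimize import minimize

    x = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)   # hypothetical data
    y = np.array([8847.8, 8849.5, 8848.0, 8846.9, 8848.1, 8849.2, 8847.7, 8847.0])

    sigma2, sigma0_2 = 1.0, 0.5        # assumed noise variance and prior variance
    lam = sigma2 / sigma0_2            # the equivalent regularization weight

    def neg_log_posterior(params):
        a, b = params
        # Gaussian likelihood on the residuals + zero-mean Gaussian prior on the slope a
        return np.sum((y - a * x - b) ** 2) / (2 * sigma2) + a ** 2 / (2 * sigma0_2)

    map_a, map_b = minimize(neg_log_posterior, x0=np.array([0.0, y.mean()])).x

    # closed-form penalized least squares with lambda = sigma^2 / sigma_0^2
    xbar, ybar = x.mean(), y.mean()
    ridge_a = np.sum((x - xbar) * (y - ybar)) / (np.sum((x - xbar) ** 2) + lam)
    ridge_b = ybar - ridge_a * xbar

    print(map_a, ridge_a)   # numerically equal
    print(map_b, ridge_b)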

  26. Bayesian interpretation of regularization
  • The prior is the "additional information"
  • Many statisticians question this point
  • How much regularization to apply depends on:
  • How confident we are about the data
  • How confident we are about the prior

  27. Chapter 1. Linear Regression
  • From one to two
  • Regularization
  • Basis functions
  • Bias-variance decomposition
  • Different regularization forms
  • Bayesian approach

  28. Polynomial curve fitting
  • y(x, w) = w_0 + w_1*x + w_2*x^2 + ... + w_M*x^M = sum_{j=0}^{M} w_j * x^j
  • Basis functions: phi_j(x) = x^j
  • Weights: w = (w_0, ..., w_M)
  • Another form, with weights and bias: y(x, w) = sum_{j=1}^{M} w_j * phi_j(x) + b

  29. Basis functions
  • Polynomial: phi_j(x) = x^j (global)
  • Gaussian: phi_j(x) = exp(-(x - mu_j)^2 / (2*s^2)) (local)
  • Sigmoid: phi_j(x) = sigma((x - mu_j) / s), where sigma(a) = 1 / (1 + exp(-a)) (local)
  • Other choices: Fourier basis (sinusoidal), wavelet, spline
  • Global vs. local: a polynomial basis function responds to the whole input range, while Gaussian and sigmoidal ones respond only near their centers

  30. Variable remapping
  • Using basis functions remaps the variable(s) in a non-linear manner
  • It can change the dimensionality
  • The goal is to enable a simpler (linear) model

  31. Maximum likelihood
  • Assume observations are from a deterministic function with additive Gaussian noise: t = y(x, w) + epsilon, with epsilon ~ N(0, beta^{-1})
  • Then p(t | x, w, beta) = N(t | y(x, w), beta^{-1})
  • Given observed inputs X = {x_1, ..., x_N} and targets t = (t_1, ..., t_N)
  • The likelihood function is p(t | X, w, beta) = prod_n N(t_n | w^T phi(x_n), beta^{-1})

  32. Maximum likelihood and least squares
  • Maximizing the (log-)likelihood with respect to w is equivalent to minimizing E_D(w) = (1/2) * sum_n (t_n - w^T phi(x_n))^2, which is known as the sum of squared errors (SSE)

  33. Maximum likelihood solution
  • The solution is w_ML = (Phi^T Phi)^{-1} Phi^T t
  • The design matrix Phi has entries Phi_{nj} = phi_j(x_n)
  • (Phi^T Phi)^{-1} Phi^T is the (Moore-Penrose) pseudo-inverse of Phi
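  A minimal sketch of the ML solution via the pseudo-inverse, on synthetic data with a polynomial basis of an assumed degree; np.linalg.lstsq gives the same result more stably.

    import numpy as np

    # synthetic data: a noisy sine curve (illustrative only)
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 25)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    M = 3  # assumed polynomial degree
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix, Phi[n, j] = x_n**j

    w_ml = np.linalg.pinv(Phi) @ t               # pseudo-inverse solution
    w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    print(np.allclose(w_ml, w_lstsq))            # the two solutions agree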

  34. Geometrical interpretation
  • Let t = (t_1, ..., t_N)^T and y = Phi * w_ML
  • And let the columns of Phi be phi_1, ..., phi_M (each a vector in R^N)
  • They span a subspace S of R^N
  • Then, y is the orthogonal projection of t onto the subspace S, so as to minimize the Euclidean distance ||t - y||

  35. Regularized least squares
  • Construct the "joint" error function: data term + regularization term
  • Use SSE as the data term and a quadratic regularization term (ridge regression): E(w) = (1/2) * sum_n (t_n - w^T phi(x_n))^2 + (lambda/2) * w^T w
  • The solution is w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
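  A minimal sketch of the closed-form ridge solution, reusing the synthetic polynomial setup from the previous sketch; the lambda values are arbitrary choices.

    import numpy as np

    def ridge_fit(Phi, t, lam):
        # closed-form ridge regression: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 25)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
    Phi = np.vander(x, 4, increasing=True)

    print(ridge_fit(Phi, t, lam=0.0))    # equals the ML solution
    print(ridge_fit(Phi, t, lam=1.0))    # weights shrink toward zero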

  36. Equivalent kernel
  • For a new input x, the predicted output is y(x) = phi(x)^T w = sum_n k(x, x_n) * t_n
  • The equivalent kernel: k(x, x') = phi(x)^T (lambda*I + Phi^T Phi)^{-1} phi(x')
  • Predictions can be calculated directly from the equivalent kernel, without calculating the parameters
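  A minimal sketch, with Gaussian basis functions of assumed centers and width, checking that predictions made through the equivalent kernel match those made through the fitted weights.

    import numpy as np

    def gaussian_design(x, centers, s=0.1):
        # design matrix of Gaussian basis functions (assumed width s)
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 25)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    centers = np.linspace(0.0, 1.0, 9)           # assumed basis centers
    lam = 0.1                                    # assumed regularization weight
    Phi = gaussian_design(x, centers)
    A_inv = np.linalg.inv(lam * np.eye(len(centers)) + Phi.T @ Phi)
    w = A_inv @ Phi.T @ t

    x_new = np.array([0.3, 0.7])
    Phi_new = gaussian_design(x_new, centers)
    K = Phi_new @ A_inv @ Phi.T                  # equivalent kernel k(x_new, x_n)
    print(np.allclose(K @ t, Phi_new @ w))       # both give the same predictions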

  37. Equivalent kernel for Gaussian basis functions
  • [Figure: the equivalent kernel plotted for Gaussian basis functions]

  38. Equivalent kernel for other basis functions
  • [Figures: equivalent kernels for polynomial and sigmoidal basis functions]
  • The equivalent kernel is "local": nearby points receive more weight

  39. Properties of the equivalent kernel
  • Sums to one, i.e. sum_n k(x, x_n) = 1, if lambda is 0
  • May have negative values
  • Can be seen as an inner product: k(x, z) = psi(x)^T psi(z) for a suitable feature mapping psi

  40. Chapter 1. Linear Regression
  • From one to two
  • Regularization
  • Basis functions
  • Bias-variance decomposition
  • Different regularization forms
  • Bayesian approach

  41. Example (reproduced from PRML)
  • Generate 100 data sets, each having 25 points
  • A sine function plus Gaussian noise
  • Perform ridge regression on each data set, with 24 Gaussian basis functions and different values of the regularization weight (see the sketch below)
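  A minimal sketch of this experiment; the noise level, basis width, and lambda value are assumed, so the results will only qualitatively resemble the PRML figures.

    import numpy as np

    rng = np.random.default_rng(0)
    N_SETS, N_POINTS, N_BASIS = 100, 25, 24

    def gaussian_design(x, centers, s=0.1):
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

    centers = np.linspace(0.0, 1.0, N_BASIS)

    def run(lam):
        # fit ridge regression on each data set; return all fitted curves on a test grid
        x_test = np.linspace(0.0, 1.0, 100)
        Phi_test = gaussian_design(x_test, centers)
        preds = []
        for _ in range(N_SETS):
            x = rng.uniform(0.0, 1.0, N_POINTS)
            t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N_POINTS)
            Phi = gaussian_design(x, centers)
            w = np.linalg.solve(lam * np.eye(N_BASIS) + Phi.T @ Phi, Phi.T @ t)
            preds.append(Phi_test @ w)
        return x_test, np.array(preds)           # preds[i] is the curve fitted on data set i

    x_test, preds = run(lam=1.0)
    print(preds.mean(axis=0)[:5])                # start of the average curve over the 100 fits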

  42. Simulation results 1/3
  • High regularization: the variance is small but the bias is large
  • [Figures: fitted curves (20 fits shown) and the average curve over the 100 fits]

  43. Simulation results 2/3
  • Moderate regularization
  • [Figures: fitted curves (20 fits shown) and the average curve over the 100 fits]

  44. Simulation results 3/3
  • Low regularization: the variance is large but the bias is small
  • [Figures: fitted curves (20 fits shown) and the average curve over the 100 fits]

  45. Bias-variance decomposition
  • The second term of the expected loss is intrinsic "noise", so consider the first term, E[(y(x; D) - h(x))^2], where h(x) = E[t | x]
  • Suppose we have a dataset D and we can calculate the parameters, and hence the prediction y(x; D), based on the dataset
  • Then we take the expectation with respect to the dataset: E_D[(y(x; D) - h(x))^2] = (E_D[y(x; D)] - h(x))^2 + E_D[(y(x; D) - E_D[y(x; D)])^2]
  • Finally we have: expected loss = (bias)^2 + variance + noise
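  Continuing the sketch above (it assumes x_test and preds from the previous code have been computed), an empirical (bias)^2 and variance over the ensemble of fitted curves; the true function sin(2*pi*x) plays the role of h(x).

    import numpy as np

    # x_test and preds come from the previous sketch: preds[i] is the curve fitted on data set i
    h = np.sin(2 * np.pi * x_test)               # the noise-free target function h(x)
    y_bar = preds.mean(axis=0)                   # average prediction over data sets

    bias2 = np.mean((y_bar - h) ** 2)            # (bias)^2, averaged over test inputs
    variance = np.mean((preds - y_bar) ** 2)     # variance, averaged over data sets and inputs

    print("bias^2 =", bias2, " variance =", variance)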

  46. Bias-variance trade-off
  • An over-regularized model will have a high bias, while an under-regularized model will have a high variance
  • How can we achieve the trade-off?
  • For example, by cross validation (will be discussed later)

  47. Chapter 1. Linear Regression
  • From one to two
  • Regularization
  • Basis functions
  • Bias-variance decomposition
  • Different regularization forms
  • Bayesian approach

  48. Other forms?
  • Least squares: min_w sum_n (t_n - w^T phi(x_n))^2
  • Ridge regression: add the penalty lambda * ||w||_2^2
  • L_p norm regularized regression: add the penalty lambda * ||w||_p^p
  • L_p norm: ||w||_p = (sum_j |w_j|^p)^(1/p)
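  A minimal sketch, assuming scikit-learn is installed, contrasting the L2 penalty (ridge) with the L1 penalty (lasso, p = 1), which tends to drive some weights exactly to zero; the data and alpha values are illustrative.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 25)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
    Phi = np.vander(x, 10, increasing=True)      # polynomial basis, degrees 0..9

    ridge = Ridge(alpha=0.01).fit(Phi, t)
    lasso = Lasso(alpha=0.01, max_iter=10000).fit(Phi, t)

    print("ridge weights:", np.round(ridge.coef_, 3))   # small but mostly nonzero
    print("lasso weights:", np.round(lasso.coef_, 3))   # several exactly zero (sparse)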

  49. Different norms
  • [Figure: unit balls ||w||_p = 1 for several values of p]
  • What about p = 0 and p = ∞?

  50. Best subset selection
  • Define the L_0 "norm" as ||w||_0 = the number of nonzero entries of w
  • Best subset selection regression: min_w sum_n (t_n - w^T phi(x_n))^2 subject to ||w||_0 <= k
  • Also known as "sparse" regression
  • Unfortunately, this is NP-hard
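  A minimal brute-force sketch, on illustrative data with a small number of basis functions, that enumerates all subsets of at most k columns; the number of subsets grows exponentially, which is why this approach does not scale.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 25)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
    Phi = np.vander(x, 8, increasing=True)       # 8 candidate basis functions

    def best_subset(Phi, t, k):
        # exhaustively search all subsets of at most k columns (exponential in general)
        best_err, best_cols = np.inf, ()
        for size in range(1, k + 1):
            for cols in combinations(range(Phi.shape[1]), size):
                w, *_ = np.linalg.lstsq(Phi[:, cols], t, rcond=None)
                err = np.sum((t - Phi[:, cols] @ w) ** 2)
                if err < best_err:
                    best_err, best_cols = err, cols
        return best_cols, best_err

    print(best_subset(Phi, t, k=3))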
