
Understanding Supervised Learning in Machine Learning

Supervised learning involves learning an unknown mapping through a set of input-output pairs. This type of learning can be used to classify data or predict numerical values. Factors like sample size, hypothesis set, and learning algorithm influence the learnability of the process. Probably Approximately Correct (PAC) learning and Vapnik-Chervonenkis (VC) dimension help understand the complexities involved. Noise and model complexity are also important considerations in supervised learning.




Presentation Transcript


  1. CH. 2: Supervised Learning 2.1 Introduction Supervised learning learns an unknown mapping f through a set of (input, output) pairs; f can be a numerical function or a classifier. Given a training sample X = {(i^t, o^t)}_{t=1}^N, where (i^t, o^t) is a training example, figure out f.

  2. Example: Learn the class c of "family cars", i.e., learn a discriminant that separates c from other classes of cars. A car is represented as x = (x_1, x_2)^T, where x_1: price, x_2: engine power. Let r = 1 if x is in class c, and r = 0 otherwise. Given a sample S = {(x^t, r^t)}_{t=1}^N.

  3. Since a car x is represented as a 2D vector x = (x_1, x_2)^T, the family car class c forms a rectangle in the (x_1, x_2) space. A hypothesis set H deduced from class c is then defined as H = { h | h: (p_1 ≤ x_1 ≤ e_1) AND (p_2 ≤ x_2 ≤ e_2) }.
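The axis-aligned rectangle hypothesis above can be sketched in a few lines of code; this is an illustrative sketch, and the function names and the example bounds are assumptions, not from the slides.

```python
# Sketch of the hypothesis h: (p1 <= x1 <= e1) AND (p2 <= x2 <= e2)
# for the "family car" class; names and numbers are illustrative.

def make_rectangle(p1, e1, p2, e2):
    """Return a hypothesis h(x) -> 1 if x = (x1, x2) lies in the rectangle."""
    def h(x):
        x1, x2 = x
        return 1 if (p1 <= x1 <= e1 and p2 <= x2 <= e2) else 0
    return h

h = make_rectangle(10000, 20000, 100, 200)  # price range, engine-power range
print(h((15000, 150)))  # inside the rectangle  -> 1
print(h((25000, 150)))  # outside the rectangle -> 0
```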

  4. Problem: Given S and H, the learning algorithm A attempts to find a hypothesis h ∈ H that minimizes the empirical error E(h | S) = Σ_{t=1}^N 1(h(x^t) ≠ r^t).
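The empirical error is just a count of misclassified training examples; a minimal sketch (helper names are assumptions):

```python
# Sketch of the empirical error E(h|S) = sum over t of 1(h(x^t) != r^t).

def empirical_error(h, sample):
    """Count training examples misclassified by hypothesis h."""
    return sum(1 for x, r in sample if h(x) != r)

# Toy 1D interval hypothesis and a three-example sample
h = lambda x: 1 if 10 <= x[0] <= 20 else 0
S = [((15, 0), 1), ((25, 0), 0), ((12, 0), 0)]
print(empirical_error(h, S))  # only the third example disagrees -> 1
```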

  5. 2.2 Learnability In the supervised learning process, three items influence the learnability of the process: the sample S, the hypothesis set H, and the learning algorithm A. Probably approximately correct (PAC) learning and the Vapnik-Chervonenkis (VC) dimension address the complexities of the sample and of the hypothesis set, respectively.

  6. 2.2.1 Probably Approximately Correct (PAC) Learning -- Answers the sample complexity, i.e., the number of training examples required to achieve a satisfactory (probably approximately correct, PAC) answer. Hypotheses s, g, and version space V.

  7. Hypotheses s and g can be determined from the training set X by first looking for the most specific hypothesis s from the outside moving inward, then looking for the most general hypothesis g from s heading outward. Version space: V = { h | h is between s and g }. In general, choose the hypothesis h ∈ V with the largest margin between s and g.
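For axis-aligned rectangles, the most specific hypothesis s is simply the tightest rectangle around the positive examples. The sketch below computes only s; the function names are assumptions, and computing g (the largest rectangle excluding all negatives) is left out.

```python
# Illustrative sketch: s = tightest axis-aligned rectangle around positives.

def most_specific_rectangle(sample):
    """Return (p1, e1, p2, e2), the tightest rectangle enclosing positives."""
    positives = [x for x, r in sample if r == 1]
    x1s = [x[0] for x in positives]
    x2s = [x[1] for x in positives]
    return (min(x1s), max(x1s), min(x2s), max(x2s))

S = [((12, 150), 1), ((18, 180), 1), ((30, 300), 0)]
print(most_specific_rectangle(S))  # (12, 18, 150, 180)
```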

  8. PAC learnable -- using the tightest rectangle as the hypothesis, i.e., h = s, how many training examples N should we have such that, with probability at least 1 − δ, h has error at most ε? Mathematically, P(c Δ h ≤ ε) ≥ 1 − δ, where c Δ h is the region of difference between c and h.

  9. In order that the probability of a positive car falling in c Δ h (i.e., an error) is at most ε: the probability that an instance falls in any one of the 4 strips around h is upper bounded by ε/4; the probability that an instance misses (i.e., correctly avoids) a strip is 1 − ε/4; the probability that N instances miss a strip is (1 − ε/4)^N; the probability that N instances miss any of the 4 strips is at most 4(1 − ε/4)^N. This probability should be upper bounded by δ: 4(1 − ε/4)^N ≤ δ.

  10. From 4(1 − ε/4)^N ≤ δ: (1 − ε/4)^N ≤ δ/4, so N log(1 − ε/4) ≤ log(δ/4) = −log(4/δ). ----- (A) From log(1 − x) ≤ −x (since 1 − x ≤ e^{−x}): N log(1 − ε/4) ≤ −N ε/4. ----- (B) By (A) and (B), it suffices that N ε/4 ≥ log(4/δ), i.e., N ≥ (4/ε) log(4/δ), which depends on the given ε and δ.
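The bound N ≥ (4/ε) log(4/δ) derived above is easy to evaluate numerically; a minimal sketch (the function name is an assumption, natural log as in the derivation):

```python
# Sketch of the PAC sample-complexity bound N >= (4/eps) * ln(4/delta).
import math

def pac_sample_size(eps, delta):
    """Smallest integer N satisfying N >= (4/eps) * ln(4/delta)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

# e.g. error at most 0.1 with probability at least 0.95
print(pac_sample_size(0.1, 0.05))  # -> 176
```

Note how the bound grows as ε or δ shrinks: a tighter error or confidence requirement demands more training examples.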

  11. 2.2.2 Vapnik-Chervonenkis (VC) Dimension -- A measure of hypothesis complexity. N points can be labeled in 2^N ways as +/− (or 1/0), e.g., N = 3, 2^N = 8. If for every labeling there exists h ∈ H that separates the + examples from the − examples, H shatters the N points; e.g., the line and the rectangle hypothesis classes can both shatter 3 points. VC(H): the maximum number of points that can be shattered by H.

  12. Examples: (i) The VC dimension of the 'line' hypothesis class is 3 in 2D space, i.e., VC(line) = 3. [Figures: point configurations that can and cannot be shattered by the 'line' hypothesis class.]

  13. (ii) An axis-aligned (AA) rectangle shatters 4 points, i.e., VC(AA rectangle) = 4; 5 points cannot be shattered by a rectangle. (iii) VC(triangle) = 7. [Figure: only the rectangles covering 2 of the points (not 1, 3, or 4) are shown.]

  14. Assignment: Show that for any finite H, VC(H) ≤ log_2 |H|. Extension -- each point can be labeled as one of K labels, so N points can be labeled in K^N ways; if for every labeling there exists h ∈ H that separates the examples of different labels, H shatters the N points. VC(H): the maximum number of points that can be shattered by H. Assignment (monotonicity property): For any two hypothesis sets with H' ⊆ H, VC(H') ≤ VC(H), e.g., H' = "line" ⊆ H = "triangle".
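For a finite H, shattering can be checked by brute force: N points are shattered only if H realizes all 2^N binary labelings, which is why VC(H) ≤ log_2 |H|. A sketch under illustrative assumptions (1D threshold hypotheses, not a class from the slides):

```python
# Brute-force shattering check for a finite hypothesis class H.

def shatters(H, points):
    """True iff for every +/- labeling of points some h in H realizes it."""
    labelings = {tuple(h(p) for p in points) for h in H}
    return len(labelings) == 2 ** len(points)

# H: three threshold classifiers on the line, h_t(x) = 1 if x >= t
H = [lambda x, t=t: 1 if x >= t else 0 for t in (0.5, 1.5, 2.5)]
print(shatters(H, [1.0]))       # True: one point gets both labels
print(shatters(H, [1.0, 2.0]))  # False: the labeling (1, 0) is unrealizable
```

With |H| = 3 hypotheses, at most 3 labelings are realizable, so 2 points (needing 4 labelings) can never be shattered, consistent with VC(H) ≤ log_2 3 < 2.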

  15. 2.6 Model Selection -- determine one model from a given set of models. Example: Learn a Boolean function from examples.

  16. Each training example removes half the hypotheses, e.g., input (x_1, x_2) = (0, 1), output 0, removes h_5, h_6, h_7, h_8, h_13, h_14, h_15, h_16 because their outputs are 1. The example illustrates that learning starts with all possible models (hypotheses); as more examples are seen, the inconsistent models are removed. 2.3 Noise and Model Complexity Noise is due to error, imprecision, uncertainty, etc. Complicated hypotheses are generally necessary to cope with noise.
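The elimination view of learning above can be sketched directly: enumerate all 16 Boolean functions of two inputs and drop those inconsistent with the observed example. (The enumeration order here need not match the slide's h_1 ... h_16 indexing.)

```python
# Sketch: each Boolean function of 2 inputs is its table of 4 output bits.
from itertools import product

inputs = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)
hypotheses = [dict(zip(inputs, outs)) for outs in product((0, 1), repeat=4)]

# Observe the example from the slide: input (0, 1), output 0
consistent = [h for h in hypotheses if h[(0, 1)] == 0]
print(len(hypotheses), len(consistent))  # 16 8 -- one example halves H
```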

  17. 1) Classification 2) Regression However, simpler hypotheses make more sense because they are simple to use, easy to check, train, and explain, and generalize well.

  18. 2.4 Learning Multiple Classes Multiple classes C_i, i = 1, ..., K. Training set: X = {(x^t, r^t)}_{t=1}^N, where r_i^t = 1 if x^t ∈ C_i and r_i^t = 0 if x^t ∈ C_j, j ≠ i. Treat a K-class classification problem as K 2-class problems, i.e., train hypotheses h_i(x), i = 1, ..., K, that minimize the total error E(h_1, ..., h_K | X) = Σ_{t=1}^N Σ_{i=1}^K 1(h_i(x^t) ≠ r_i^t). Problem: argmin over h_1, ..., h_K of E(h_1, ..., h_K | X).
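The reduction of K classes to K two-class problems amounts to building, for each class i, the binary targets r_i^t that hypothesis h_i is trained against; a minimal sketch (the function name is an assumption):

```python
# Sketch of one-vs-all target construction: r_i^t = 1 iff example t is in C_i.

def one_vs_all_targets(labels, K):
    """For each class i, the list of binary targets over the N examples."""
    return [[1 if y == i else 0 for y in labels] for i in range(K)]

labels = [0, 2, 1, 0]  # class index of each of 4 training examples
print(one_vs_all_targets(labels, 3))
# -> [[1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
```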

  19. 2.5 Regression Find f such that r^t = f(x^t), r^t ∈ R. Training set: X = {(x^t, r^t)}_{t=1}^N. Let g be the estimate of f. Expected total error: E(g | X) = (1/N) Σ_{t=1}^N (r^t − g(x^t))^2. For the linear model g(x) = w_1 x + w_0: E(w_1, w_0 | X) = (1/N) Σ_{t=1}^N (r^t − w_1 x^t − w_0)^2. Problem: argmin over w_1, w_0 of E(w_1, w_0 | X).

  20. Let ∂E(w_1, w_0 | X)/∂w_1 = (1/N) ∂/∂w_1 Σ_{t=1}^N (r^t − w_1 x^t − w_0)^2 = 0, i.e., Σ_{t=1}^N (r^t − w_1 x^t − w_0) x^t = 0. With x̄ = (Σ_{t=1}^N x^t)/N, this gives Σ_{t=1}^N r^t x^t − w_1 Σ_{t=1}^N (x^t)^2 − w_0 N x̄ = 0. ----- (1) Similarly, let ∂E(w_1, w_0 | X)/∂w_0 = 0:

  21. Σ_{t=1}^N (r^t − w_1 x^t − w_0) = 0. With r̄ = (Σ_{t=1}^N r^t)/N, this gives N r̄ − w_1 N x̄ − N w_0 = 0. ----- (2) Solving (1) and (2) for w_1 and w_0: w_1 = (Σ_t x^t r^t − N x̄ r̄) / (Σ_t (x^t)^2 − N x̄^2), w_0 = r̄ − w_1 x̄. For the quadratic model g(x) = w_2 x^2 + w_1 x + w_0, solve similarly for w_2, w_1, and w_0.
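The closed-form solution above translates directly into code; a sketch on exact line data (function name and data are illustrative):

```python
# Sketch of the closed-form least-squares line fit derived above:
# w1 = (sum x^t r^t - N*xbar*rbar) / (sum (x^t)^2 - N*xbar^2), w0 = rbar - w1*xbar

def fit_line(xs, rs):
    N = len(xs)
    xbar = sum(xs) / N
    rbar = sum(rs) / N
    w1 = (sum(x * r for x, r in zip(xs, rs)) - N * xbar * rbar) / \
         (sum(x * x for x in xs) - N * xbar * xbar)
    w0 = rbar - w1 * xbar
    return w1, w0

xs = [0.0, 1.0, 2.0, 3.0]
rs = [1.0, 3.0, 5.0, 7.0]  # exactly r = 2x + 1
print(fit_line(xs, rs))    # -> (2.0, 1.0)
```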

  22. Example: [figure]

  23. 2.7 Dimensions of a Supervised Learning Algorithm Given a sample X = {(x^t, r^t)}_{t=1}^N: 1. Model g(x | θ): defines the hypothesis set H. 2. Error function E(θ | X) = Σ_t L(r^t, g(x^t | θ)), where the loss function L(·) expresses the difference between r^t (label) and g(x^t | θ) (return). 3. Optimization procedure: θ* = argmin over θ of E(θ | X).
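The three dimensions can be tied together in a toy sketch: a one-parameter model g(x|θ), a squared-loss error function, and a crude grid search as the optimization procedure. All names and data here are illustrative assumptions.

```python
# Toy sketch of the three dimensions: model, error function, optimization.

def g(x, theta):                  # 1. model: g(x|theta) = theta * x
    return theta * x

def error(theta, X):              # 2. error: sum of squared losses L(r, g)
    return sum((r - g(x, theta)) ** 2 for x, r in X)

X = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]      # data roughly on r = 2x
thetas = [t / 10 for t in range(0, 51)]       # crude grid over theta
theta_star = min(thetas, key=lambda t: error(t, X))  # 3. optimization
print(theta_star)  # -> 2.0
```

A real optimization procedure would use gradient descent or a closed form, as in the regression slides; the grid search only illustrates the argmin structure.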

  24. 2.8 Notes Ill-posed problem: the training examples are not sufficient to lead to a unique solution. Inductive bias: additional information and prior knowledge that make learning possible. Model selection: how to choose the right bias. Generalization: how well a model trained on the training set predicts the right output for new instances.

  25. Underfitting: the hypothesis set H is less complex than the function underlying the data, e.g., fitting a line to data sampled from a 3rd-order polynomial. Overfitting: the hypothesis set H is more complex than the function, e.g., fitting a 3rd-order polynomial to data sampled from a line.

  26. Triple trade-off: trade-off between 3 factors: 1. the size of the training set, N; 2. the complexity (or capacity) of H, O(H); 3. the generalization error, E, on new data. As N increases, E decreases; as O(H) increases, E first decreases, then increases. Cross-validation: data are divided into i) a training set, ii) a validation set, and iii) a test set. Training set: induces a hypothesis. Validation set: tests the generalization ability of the induced hypothesis. Test set: provides the expected error of the induced hypothesis.
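The three-way split described above can be sketched in a few lines; the 60/20/20 proportions and function name are illustrative choices, not from the slides.

```python
# Sketch of a train/validation/test split (60/20/20, an assumed choice).
import random

def three_way_split(data, seed=0):
    data = data[:]                          # copy, leave the input intact
    random.Random(seed).shuffle(data)       # deterministic shuffle
    n = len(data)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (data[:n_train],                 # induce a hypothesis
            data[n_train:n_train + n_val],  # check generalization ability
            data[n_train + n_val:])         # estimate the expected error

train, val, test = three_way_split(list(range(10)))
print(len(train), len(val), len(test))  # -> 6 2 2
```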

  27. Summary • Supervised Learning: Given S and H, A attempts to find an h ∈ H that minimizes E(h | S) = Σ_{t=1}^N 1(h(x^t) ≠ r^t).

  28. • Learnability depends on the complexities of the sample S, the hypothesis set H, and the learning algorithm A. • PAC learning model -- answers the sample complexity: P(c Δ h ≤ ε) ≥ 1 − δ holds for given ε > 0 and δ > 0 with sampling size N ≥ (4/ε) log(4/δ). • Vapnik-Chervonenkis (VC) dimension -- answers the hypothesis complexity: VC(H) is the maximum number of points that can be shattered by H.

  29. H shatters N points: N points can be labeled in K^N ways if each point can be labeled as one of K different labels; for every labeling there exists h ∈ H that can separate the examples of different labels. Noise and Model Complexity • Complicated hypotheses are generally necessary to cope with noise. However, simpler hypotheses are easy to use, check, train, explain, and generalize.

  30. • Dimensions of a Supervised Learning Algorithm: Given a sample X = {(x^t, r^t)}_{t=1}^N: 1. Model g(x | θ): defines the hypothesis set H. 2. Error function E(θ | X) = Σ_t L(r^t, g(x^t | θ)), where the loss function L(·) expresses the difference between r^t (label) and g(x^t | θ) (return). 3. Optimization procedure: θ* = argmin over θ of E(θ | X).

  31. • Notions: ill-posed problem -> inductive bias -> model selection -> generalization -> underfitting -> overfitting. • Triple trade-off between 3 factors: 1. the size of the training set, N; 2. the complexity (or capacity) of H, O(H); 3. the generalization error, E, on new data. As N increases, E decreases; as O(H) increases, E first decreases, then increases.

  32. • Cross-validation: data are divided into i) a training set, ii) a validation set, and iii) a test set. Training set: for inducing a hypothesis. Validation set: for testing the generalization ability of the induced hypothesis. Test set: for providing the expected error of the induced hypothesis.
