
Bayesian Learning, Part 1 of (probably) 4


Presentation Transcript


  1. Bayesian Learning, Part 1 of (probably) 4 Reading: Bishop Ch. 1.2, 1.5, 2.3

  2. Administrivia • Office hours tomorrow moved: • noon-2:00 • Thesis defense announcement: • Sergey Plis, Improving the information derived from human brain mapping experiments. • Application of ML/statistical techniques to analysis of MEG neuroimaging data • Feb 21, 9:00-11:00 AM • FEC 141; everybody welcome

  3. Yesterday, today, and... • Last time: • Finish up SVMs • This time: • HW3 • Intro to statistical/generative modeling • Statistical decision theory • The Bayesian viewpoint • Discussion of R1

  4. Homework (proj) 3 • Data sets: • MNIST Database of handwritten digits: • http://yann.lecun.com/exdb/mnist/ • One other (forthcoming) • Algorithms: • Decision tree: http://www.cs.waikato.ac.nz/ml/weka/ • Linear LSE classifier (roll your own) • SVM (ditto, and compare to Weka’s) • Gaussian kernel; poly degree 4, 10, 20; sigmoid • Question: which algorithm is better on these data sets? Why? Prove it.
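
For the "roll your own" linear LSE classifier, a minimal sketch of the underlying idea in Python/NumPy. Treat it as an illustration only (the function names and the +/-1 label encoding are my choices, not the assignment's required interface):

```python
# A minimal least-squares (LSE) linear classifier sketch.
# Classes are encoded as +/-1 targets; weights come from the
# least-squares solution of the normal equations.
import numpy as np

def fit_lse(X, y):
    """Fit weights w so that sign(Xb @ w) predicts the +/-1 label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)      # least-squares solution
    return w

def predict_lse(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Xb @ w)

# Toy usage on two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w = fit_lse(X, y)
print("training accuracy:", (predict_lse(X, w) == y).mean())
```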

  5. HW 3 additional details • Due: Tues Mar 6, 2007, beginning of class • 2 weeks from today -- many office hours between now and then • Feel free to talk to each other, but write your own code • Must code LSE, SVM yourself; can use a pre-packaged DT • Use a QP library/solver for SVM (e.g., Matlab’s quadprog() function) • Hint: QPs are sloooow for large data; you’ll probably want to sub-sample the data set • Q: what effect does this have? • Extra credit: roll your own DT
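
On the QP hint: the soft-margin SVM dual is a quadratic program in standard form. A sketch using Python's cvxopt in place of Matlab's quadprog() (the solver choice and helper name are my assumptions; the slides only specify "a QP library/solver"):

```python
# Sketch: soft-margin SVM dual as a QP, solved with cvxopt instead of
# Matlab's quadprog() (same standard form, different solver).
#   min (1/2) a' Q a - 1' a   s.t.  0 <= a_i <= C,  sum_i a_i y_i = 0
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_qp(X, y, C=1.0):
    """Return dual variables alpha for a linear-kernel SVM; y in {-1,+1}."""
    n = X.shape[0]
    K = X @ X.T                                     # linear-kernel Gram matrix
    P = matrix(np.outer(y, y) * K)                  # Q_ij = y_i y_j K(x_i, x_j)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))  # box constraints 0 <= a <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))      # equality: y' a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])
```

The sub-sampling hint follows directly: dense QP solvers scale roughly cubically in the number of training points, so running this on the full MNIST training set is impractical.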

  6. ML trivia of the day... • Which data mining techniques have you used in a successfully deployed application? http://www.kdnuggets.com/

  7. Assumptions • “Assume makes an a** out of U and ME”... • Bull**** • Assumptions are unavoidable • It is not possible to have an assumption-free learning algorithm • Must always have some assumption about how the data works • Makes learning faster, more accurate, more robust

  8. Example assumptions • Decision tree: • Axis orthogonality • Impurity-based splitting • Greedy search ok • Accuracy (0/1 loss) objective function

  9. Example assumptions • Linear discriminant (hyperplane classifier) via MSE: • Data is linearly separable • Squared-error cost

  10. Example assumptions • Support vector machines • Data is (close to) linearly separable... • ... in some high-dimensional projection of input space • Interesting nonlinearities can be captured by kernel functions • Max margin objective function

  11. Specifying assumptions • Bayesian learning assumes: • Data were generated by some stochastic process • Can write down (some) mathematical form for that process • CDF/PDF/PMF • Mathematical form needs to be parameterized • Have some “prior beliefs” about those params

  12. Specifying assumptions • Makes strong assumptions about the form (distribution) of the data • Essentially, an attempt to make assumptions explicit and to divorce them from the learning algorithm • In practice, not a single learning algorithm, but a recipe for generating problem-specific algorithms • Will work well to the extent that these assumptions are right

  13. Example • F={height, weight} • Ω={male, female} • Q1: Any guesses about individual distributions of height/weight by class? • What probability function (PDF)? • Q2: What about the joint distribution? • Q3: What about the means of each? • Reasonable guess for the upper/lower bounds on the means?

  14. Some actual data* * Actual synthesized data, anyway...
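
Since the slide's data are synthesized, here is one plausible way such a dataset could be generated: one 2-D Gaussian per class. The means and covariances below are illustrative guesses, not the lecture's actual parameters:

```python
# Synthesize height/weight samples with one 2-D Gaussian per class.
# All numbers below are made-up illustrations (height in cm, weight in kg).
import numpy as np

rng = np.random.default_rng(42)
params = {
    "male":   ([178.0, 84.0], [[49.0, 30.0], [30.0, 144.0]]),
    "female": ([164.0, 67.0], [[42.0, 25.0], [25.0, 121.0]]),
}
data = {cls: rng.multivariate_normal(mean, cov, size=200)
        for cls, (mean, cov) in params.items()}

for cls, pts in data.items():
    print(cls, "sample mean:", pts.mean(axis=0).round(1))
```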

  15. General idea • Find probability distribution that describes classes of data • Find decision surface in terms of those probability distributions

  16. H/W data as PDFs

  17. Or, if you prefer...

  18. General idea • Find probability distribution that describes classes of data • Find decision surface in terms of those probability distributions • What would be a good rule?

  19. 5 minutes of math • Bayesian decision rule: Bayes optimality • Want to pick the class that minimizes expected cost • Simplest case: cost==misclassification • Expected cost == expected misclassification rate

  20. 5 minutes of math • Expectation is only defined w.r.t. a probability distribution: E[f] = ∫ f(x) p(x) dx • Posterior probability of class i given data x: P(ωᵢ | x) = p(x | ωᵢ) P(ωᵢ) / p(x) • Interpreted as: the chance that the real class is ωᵢ, given that the observed data is x
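
As a concrete (assumed, 1-D) instance of that posterior: class-conditional Gaussians over height plus class priors, combined by Bayes' rule. The distribution parameters here are placeholders, not course data:

```python
# Posterior P(class | x) from class-conditional PDFs and priors via
# Bayes' rule. Parameters are placeholder assumptions.
from scipy.stats import norm

likelihoods = {"male": norm(178, 7), "female": norm(164, 6.5)}  # p(x | class)
priors = {"male": 0.5, "female": 0.5}                           # P(class)

def posterior(x):
    joint = {c: likelihoods[c].pdf(x) * priors[c] for c in priors}
    evidence = sum(joint.values())                # p(x), the normalizer
    return {c: v / evidence for c, v in joint.items()}

print(posterior(170.0))  # chance each class generated a 170 cm observation
```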

  21. 5 minutes of math • Expected cost is then: • cost of getting it wrong × prob of getting it wrong • summed over all possible outcomes (true classes) • More formally: R(ωᵢ | x) = Σⱼ λᵢⱼ P(ωⱼ | x), where λᵢⱼ is the cost of classifying a class-j thing as class i

  22. 5 minutes of math • Expected cost is then: • cost of getting it wrong × prob of getting it wrong • summed over all possible outcomes (true classes) • More formally: R(ωᵢ | x) = Σⱼ λᵢⱼ P(ωⱼ | x) • Want to pick the class ωᵢ that minimizes this

  23. 5 minutes of math • For 0/1 cost, this reduces to: R(ωᵢ | x) = Σⱼ≠ᵢ P(ωⱼ | x) = 1 − P(ωᵢ | x)

  24. 5 minutes of math • For 0/1 cost, this reduces to: R(ωᵢ | x) = 1 − P(ωᵢ | x) • To minimize, pick the class ωᵢ that minimizes 1 − P(ωᵢ | x), i.e., the one that maximizes the posterior P(ωᵢ | x)
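
The reduction is one substitution once the 0/1 cost is written with a Kronecker delta; a short derivation (standard decision theory, Bishop Ch. 1.5):

```latex
% 0/1 cost: \lambda_{ij} = 1 - \delta_{ij} (cost 0 if i = j, else 1)
\begin{aligned}
R(\omega_i \mid x)
  &= \sum_j \lambda_{ij}\, P(\omega_j \mid x)
   = \sum_j (1 - \delta_{ij})\, P(\omega_j \mid x) \\
  &= \sum_{j \neq i} P(\omega_j \mid x)
   = 1 - P(\omega_i \mid x),
\end{aligned}
% so minimizing expected cost = maximizing the posterior P(\omega_i \mid x).
```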

  25. 5 minutes of math • In pictures: [figure: class-conditional densities for the two classes]

  26. 5 minutes of math • In pictures: [figure: the same densities with the Bayes decision thresholds marked]

  27. 5 minutes of math • These thresholds are called the Bayes decision thresholds • The corresponding cost (error rate) is called the Bayes optimal cost • A real-world example:
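
To make slide 27 concrete, a sketch that locates the Bayes decision threshold and the Bayes optimal error rate for two 1-D Gaussian classes. The parameters, and the restriction to the single crossing between the means, are my assumptions:

```python
# Find the Bayes decision threshold for two 1-D Gaussian classes with
# equal priors, then estimate the Bayes optimal (minimum) error rate.
# Distribution parameters are illustrative assumptions.
from scipy.optimize import brentq
from scipy.stats import norm

p0, p1 = norm(164, 6.5), norm(178, 7)    # class-conditional densities
prior0 = prior1 = 0.5

# Threshold: where the prior-weighted densities cross (one crossing
# between the means; unequal variances can add a second, far-out one)
t = brentq(lambda x: prior0 * p0.pdf(x) - prior1 * p1.pdf(x), 164, 178)

# Bayes optimal error: probability mass on the wrong side of t
bayes_err = prior0 * p0.sf(t) + prior1 * p1.cdf(t)
print(f"threshold ~ {t:.1f}, Bayes error ~ {bayes_err:.3f}")
```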
