
Overview of Today’s Lecture




Presentation Transcript


  1. Overview of Today’s Lecture
• Last Time: course introduction
  • Reading assignment posted to class webpage
  • Don’t get discouraged
• Today: introduction to “Supervised Machine Learning”
  • Our first ML algorithm: K-nearest neighbor
• HW 0 out online
  • Create a dataset of “fixed-length feature vectors”
  • Due next Tuesday, Sept 19 (4 PM)
  • Instructions for handing in HW0 coming soon

  2. Supervised Learning: Overview
[Diagram: the Real World is mapped into a Digital Representation (feature space); humans select the features (HW 0), and the machine constructs the classifier (HW 1-2), producing classification rules such as “If feature 2 = X then APPLY BRAKE = TRUE”.]

  3. Supervised Learning: Task Definition
• Given
  • A collection of positive examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of negative examples (i.e., non-members)
• Produce
  • A description that covers (includes) all (most) of the positive examples and none (few) of the negative examples, and that, hopefully, properly categorizes most future examples! (The key point!)
• Note: one can easily extend this definition to handle more than two classes

  4. Example
[Figure: a set of positive example figures, a set of negative example figures, and a query symbol: “How does this symbol classify?”]
• Concept: solid red circle in a regular polygon
• What about?
  • Figure with red solid circles not in a larger red circle
  • Figures on the left side of the page, etc.

  5. HW0 – Your “Personal Concept”
• Step 1: Choose a Boolean (true/false) concept
  • Subjective judgment (can’t articulate)
    • Books I like/dislike
    • Movies I like/dislike
    • WWW pages I like/dislike
  • “Time will tell” concepts
    • Stocks to buy
    • Medical treatment (at time t, predict outcome at time t + ∆t)
  • Sensory interpretation
    • Face recognition (see text)
    • Handwritten digit recognition
    • Sound recognition
  • Hard-to-program functions

  6. HW0 – Your “Personal Concept”
• Step 2: Choose a feature space
  • We will use fixed-length feature vectors
    • Choose N features (this defines a space)
    • Each feature has Vi possible values
    • Each example is represented by a vector of N feature values (i.e., is a point in the feature space), e.g., <red, 50, round> for the features color, weight, shape
  • Feature types
    • Boolean
    • Nominal
    • Ordered
    • Hierarchical (we will not use hierarchical features)
• Step 3: Collect examples (“I/O” pairs)
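To make “fixed-length feature vectors” concrete, here is a minimal Python sketch of the slide’s <red, 50, round> example; the feature names and values are illustrative, not part of HW0:

```python
# Each example is a fixed-length tuple of N = 3 feature values
# (color, weight, shape) paired with a Boolean label.
FEATURES = ("color", "weight", "shape")  # nominal, linear, nominal

dataset = [
    (("red",   50, "round"),  True),
    (("blue",  12, "square"), False),
    (("green", 73, "round"),  True),
]

# Every example must have the same fixed length N.
assert all(len(x) == len(FEATURES) for x, _ in dataset)
print(len(FEATURES))  # N = 3
```

Because every example has the same length N, each one is a point in the same N-dimensional feature space, which is what later slides rely on.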

  7. Standard Feature Types for representing training examples – a source of “domain knowledge”
• Nominal (Boolean is a special case)
  • No relationship among possible values, e.g., color ∈ {red, blue, green} (vs. color = 1000 Hertz)
• Linear (or Ordered)
  • Possible values of the feature are totally ordered, e.g., size ∈ {small, medium, large} ← discrete; weight ∈ [0…500] ← continuous
• Hierarchical
  • Possible values are partially ordered in an ISA hierarchy, e.g., for shape: closed → {polygon, continuous}; polygon → {square, triangle}; continuous → {circle, ellipse}

  8. Example Hierarchy (KDD* Journal, Vol 5, No. 1-2, 2001, page 17)
[Figure: a product hierarchy — Product → 99 product classes (e.g., Pet Foods, Tea) → 2302 product subclasses (e.g., Dried Cat Food, Canned Cat Food) → ~30k products (e.g., Friskies Liver, 250g)]
• This is the structure of one feature!
• “the need to be able to incorporate hierarchical (knowledge about data types) is shown in every paper.” – from the editors’ introduction to the special issue (on applications) of the KDD journal, Vol 15, 2001
* Officially, “Data Mining and Knowledge Discovery”, Kluwer Publishers

  9. Some Famous Examples
• Car steering (Pomerleau): digitized camera image → learned function → steering angle
• Medical diagnosis (Quinlan): medical record (e.g., age = 13, sex = M, wgt = 18) → learned function → ill vs. healthy
• DNA categorization
• TV-pilot rating
• Chemical-plant control
• Backgammon playing
• WWW page scoring
• Credit application scoring

  10. HW0: Creating your dataset
• Choose a dataset
  • based on interest/familiarity
  • meets basic requirements
    • > 1000 examples
    • category (function) learned should be binary valued
    • ~500 “true” and ~500 “false” examples
→ Internet Movie Database (IMDb)

  11. Example Database: IMDb
[Entity-relationship diagram:]
• Movie: title, genre, year, opening weekend, BO receipts, list of actors/actresses, release season
• Studio (Made a Movie): name, country, movies
• Actor (Acted in a Movie): name, year of birth, movies
• Director/Producer (Directed/Produced a Movie): name, year of birth, gender, Oscars, movies

  12. HW0: Creating your dataset
• Choose a Boolean target function (category). Some examples:
  • Opening weekend box office receipts > $2 million
  • Movie is a drama? (vs. action, sci-fi, …)
  • Movies I like/dislike (e.g., TiVo)

  13. HW0: Creating your dataset
• Create your feature space:
  • Movie: average age of actors, number of producers, percent female actors
  • Studio: number of movies made, average movie gross, percent of movies released in the US
  • Director/Producer: years of experience, most prevalent genre, number of award-winning movies, average movie gross
  • Actor: gender, has previous Oscar award or nominations, most prevalent genre

  14. HW0: Creating your dataset
David Jensen’s group at UMass used Naïve Bayes (NB) to predict the following, based on attributes they selected and a novel way of sampling from the data:
• Opening weekend box office receipts > $2 million
  • 25 attributes
  • Accuracy = 83.3%
  • Default accuracy = 56%
• Movie is a drama?
  • 12 attributes
  • Accuracy = 71.9%
  • Default accuracy = 51%
• http://kdl.cs.umass.edu/proximity/about.html

  15. Back to Supervised Learning
One way learning systems differ is in how they represent the concepts they induce from training examples:
• Neural net – Backpropagation
• Decision tree – C4.5, CART
• Rules – AQ, FOIL, e.g., Φ ← X ∧ Y; Φ ← Z
• SVMs – e.g., If 5x1 + 9x2 – 3x3 > 12 Then +

  16. Feature Space
If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space.
[Figure: a 3-D feature space with axes Size, Color, and Weight; an example point at Size = Big, Color = Gray, Weight = 2500 is marked “?”]
A “concept” is then a (possibly disjoint) volume in this space.

  17. Supervised Learning = Learning from Labeled Examples
• Most common & successful form of ML
[Venn diagram: + and – labeled points scattered in a feature space]
• Examples – points in a multi-dimensional “feature space”
• Concepts – a “function” that labels points in feature space (as +, –, and possibly ?)

  18. Brief Review
• Conjunctive concept (“and”)
  • Color(?obj1, red) ∧ Size(?obj1, large)
• Disjunctive concept (“or”)
  • Color(?obj2, blue) ∨ Size(?obj2, small)

  19. Empirical Learning and Venn Diagrams
[Venn diagram: + and – labeled points in a feature space, with two regions A and B enclosing the + points]
• Examples = labeled points in feature space
• Concept = a label for a set of points; here, Concept = A or B (a disjunctive concept)

  20. Aspects of an ML System
• “Language” for representing examples (HW 0)
• “Language” for representing “concepts”
• Technique for producing a concept “consistent” with the training examples (other HWs)
• Technique for classifying new instances
Each of these limits the expressiveness/efficiency of the supervised learning algorithm.

  21. Nearest-Neighbor Algorithms (aka exemplar models, instance-based learning (IBL), case-based learning)
• Learning ≈ memorize the training examples
• Problem solving = find the most similar example in memory; output its category
[Figure: a Venn diagram of + and – points with a query point “?”; see “Voronoi Diagrams” (pg 233)]

  22. Sample Experimental Results
The simple algorithm works quite well!

  23. Simple Example – 1-NN (1-NN ≡ one nearest neighbor)
• Training set
  • Ex 1: a=0, b=0, c=1 → +
  • Ex 2: a=0, b=1, c=0 → –
  • Ex 3: a=1, b=1, c=1 → –
• Test example: a=0, b=1, c=0 → ?
• “Hamming distance” to the test example
  • Ex 1 = 2
  • Ex 2 = 0
  • Ex 3 = 2
• Ex 2 is nearest, so output –
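The slide’s 1-NN computation can be sketched directly in Python; the function and variable names below are my own:

```python
def hamming(x, y):
    """Number of positions where the two vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def one_nn(train, test):
    """Return the label of the training example nearest to `test`."""
    nearest = min(train, key=lambda ex: hamming(ex[0], test))
    return nearest[1]

train = [((0, 0, 1), "+"),   # Ex 1
         ((0, 1, 0), "-"),   # Ex 2
         ((1, 1, 1), "-")]   # Ex 3

print(one_nn(train, (0, 1, 0)))  # Ex 2 is nearest, so the output is "-"
```

Note that `min` breaks distance ties by picking the earliest training example; a real implementation would want an explicit tie-breaking policy.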

  24. K-NN Algorithm
• Collect the K nearest neighbors; select the majority classification (or somehow combine their classes)
• What should K be?
  • It probably is problem dependent
  • Can use tuning sets (covered later) to select a good setting for K
[Figure: tuning-set error rate plotted against K = 1, 2, 3, 4, 5; we shouldn’t really “connect the dots” (why?)]
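A minimal sketch of the majority-vote version of K-NN, assuming numeric feature vectors and Euclidean distance (the slide leaves the distance metric and tie-breaking open):

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (feature_vector, label) pairs.
    Returns the majority label among the k examples closest to `query`."""
    k_nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "-"), ((1.0, 0.0), "-"),
         ((0.0, 1.0), "+"), ((1.0, 1.0), "+")]
print(knn_classify(train, (0.9, 0.8), k=3))  # 2 of the 3 nearest are "+"
```

Using an odd K avoids voting ties in the two-class case; choosing K itself is the tuning-set question the slide raises.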

  25. Some Common Jargon
• Classification: learning a discrete-valued function
• Regression: learning a real-valued function
• IBL is easily extended to regression tasks (and to multi-category classification)
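One natural way to extend IBL to regression, as the slide suggests, is to average the real-valued outputs of the K nearest neighbors instead of taking a majority vote; this sketch assumes Euclidean distance:

```python
import math

def knn_regress(train, query, k):
    """train: list of (feature_vector, real_value) pairs.
    Returns the mean value of the k examples closest to `query`."""
    k_nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return sum(y for _, y in k_nearest) / k

train = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
print(knn_regress(train, (1.1,), k=2))  # neighbors at 1.0 and 2.0: (3+5)/2 = 4.0
```

A common refinement is to weight each neighbor’s value by inverse distance, so closer examples count for more.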

  26. Variations on a Theme (from Aha, Kibler, and Albert in the ML Journal)
• IB1 – keep all examples
• IB2 – keep the next instance only if it is incorrectly classified using the previously kept instances
  • Uses less storage
  • Order dependent
  • Sensitive to noisy data
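The IB2 storage rule from the slide can be sketched as follows; this is only the core idea (1-NN with Hamming distance assumed), not Aha et al.’s full algorithm:

```python
def hamming(x, y):
    """Number of positions where the two vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def ib2(stream):
    """stream: iterable of (feature_vector, label) pairs, in arrival order.
    Returns the subset of examples kept under the IB2 rule."""
    kept = []
    for x, label in stream:
        if kept:
            _, predicted = min(kept, key=lambda ex: hamming(ex[0], x))
            if predicted == label:
                continue  # correctly classified by stored examples: discard
        kept.append((x, label))  # misclassified (or first example): store it
    return kept

data = [((0, 0), "-"), ((0, 1), "-"), ((1, 1), "+"), ((1, 0), "+")]
print(len(ib2(data)))  # ((0, 1), "-") is classified correctly, so 3 are kept
```

The slide’s caveats are visible here: reordering `data` can change which examples survive, and a single noisy example is always stored once it is misclassified.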

  27. Variations on a Theme (cont.)
• IB3 – extend IB2 to more intelligently decide which examples to keep (see article)
  • Better handling of noisy data
• Another idea – cluster the examples into groups and keep representative “examples” from each (median/centroid)

  28. Next time
• Finish K-NN
• Begin linear separators
• Naïve Bayes
