
Learning to Predict

Presentation Transcript


  1. Learning to Predict. Presenter: Russell Greiner

  2. Vision Statement • Helping the world understand data … and make informed decisions. • Single decision: determine • class label of an instance • set of labels for a set of pixels, … • value of a property of an instance, …

  3. Motivation for Training a Predictor • Need to know the “label” of an instance, to determine the appropriate action • PredictorMed( patient#2 ) =? “treatX is Ok” • Unfortunately, Predictor(.) is not known a priori • But many ⟨patient, treatX⟩ examples are available

  4. Motivation for Training a Predictor • Machine learning provides algorithms for mapping a set of ⟨patient, treatX⟩ examples to a Predictor(.) function

     Temp.  Press.  Sore Throat  …  Colour  | treatX
     35     95      Y            …  Pale    | No
     22     110     N            …  Clear   | Ok
     :      :       :               :       | :
     10     87      N            …  Pale    | No

     Query: Temp 32, Press. 90, Sore Throat N, …, Colour Pale  →  Predictor: treatX = Ok

  5. Motivation for Training a Predictor • Need to learn the predictor (not program it in) when it is … • … not known • … not expressible • … changing • … user dependent • (Same training table as Slide 4; here the Predictor answers treatX = No for the query Temp 32, Press. 90, Sore Throat N, …, Colour Pale.) A concrete sketch of this train-then-query loop appears below.
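
To ground the train-then-query loop of Slides 3–5, here is a minimal sketch using the toy patient table above; the numeric encoding of the categorical columns and the decision-tree model are illustrative choices, not from the slides:

```python
# A minimal sketch of the slides' setup: learn Predictor(.) from
# <patient, treatX> examples, then query it on the new patient.
# The tiny dataset and the decision-tree model are illustrative.
from sklearn.tree import DecisionTreeClassifier

# Columns: Temp, Press., Sore Throat (1 = Y), Colour (0 = Pale, 1 = Clear)
X = [[35, 95, 1, 0],
     [22, 110, 0, 1],
     [10, 87, 0, 0]]
y = ["No", "Ok", "No"]  # treatX label for each training patient

predictor = DecisionTreeClassifier().fit(X, y)
print(predictor.predict([[32, 90, 0, 0]]))  # treatX prediction for the new patient
```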

  6. Personnel • PI synergy: • Greiner, Schuurmans, Holte, Sutton, Szepesvari, Goebel • 5 Postdocs • 16 Grad students (5 MSc, 11 PhD) • 5 Supporting technical staff + personnel for Bioinformatics thrust

  7. Partners/Collaborators • 4 UofA CS profs • 1 UofAlberta Math/Stat • Non-UofA collaborators: Google, Yahoo!, Electronic Arts, UofMontreal, UofWaterloo, UofNebraska, NICTA, NRC-IIT,… + Bioinformatics thrust collaborators

  8. Additional Resources • Grants • $225K CFI • $100K MITACS • $100K Google • Hardware • 68 processor, 2TB, Opteron Cluster • 54 processor, dual core, 1.5TB, Opteron Cluster + funds/data for Bioinformatics thrust

  9. Highlights • IJCAI 2005 – Distinguished Paper Prize • UM 2003 – Best Student Paper Prize • WebIC technology is the foundation for a start-up company • Significant advances in extending SVMs to use unsupervised/semi-supervised data, and for structured data • + Highlights from the Bioinformatics thrust

  10. Learning to Predict: Challenges • Simplifying assumptions re: training data • IID / unstructured • Lots of instances • Low dimensions • Complete features • Completely labeled • Balanced data • is sufficient [Figure: the Learner → Predictor pipeline with the patient training table of Slide 4; the Predictor answers treatX = No for the query.]

  11. Segmenting Brain Tumors. Learning to Predict: Challenges • Simplifying assumptions re: training data • IID / unstructured ? • Lots of instances • Low dimensions • Complete features • Completely labeled • Balanced data • is sufficient ⇒ Extensions to Conditional Random Fields, …

  12. Learning to Predict: Challenges • Simplifying assumptions re: training data • IID / unstructured • Lots of instances ? • Low dimensions ? • Complete features • Completely labeled • Balanced data • is sufficient ⇒ N ≈ 10's, m ≈ 1000's

  13. Learning to Predict: Challenges • Simplifying assumptions re: training data • IID / unstructured • Lots of instances ? • Low dimensions ? • Complete features • Completely labeled • Balanced data • is sufficient ⇒ N ≈ 20,000, m ≈ 100 (microarray, SNP chips, …) ⇒ Dimensionality Reduction … • L2 Model: Component Discovery • BiCluster Coding (a generic stand-in is sketched below)
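
The slides name L2 component discovery and BiCluster coding but do not spell them out; as a generic, hedged stand-in for this dimensionality-reduction step, the sketch below compresses a microarray-scale matrix (m ≈ 100 samples by N ≈ 20,000 features) with truncated SVD:

```python
# A generic stand-in for the dimensionality-reduction step: the
# slides' own methods (L2 component discovery, BiCluster coding) are
# not detailed here, so this compresses an m x N matrix with
# truncated SVD instead. Shapes match the slide: m = 100, N = 20,000.
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.randn(100, 20000)             # 100 samples, 20,000 features
Z = TruncatedSVD(n_components=10).fit_transform(X)
print(Z.shape)                              # (100, 10): learn on Z instead of X
```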

  14. Learning to Predict: Challenges • Simplifying assumptions re: training data • IID / unstructured • Lots of instances • Low dimensions • Complete features ? • Completely labeled • Balanced data • is sufficient ⇒ Budgeted Learning

  15. Learning to Predict: Challenges • Simplifying assumptions re: training data • IID / unstructured • Lots of instances • Low dimensions • Complete features • Completely labeled ? • Balanced data • is sufficient ⇒ Semi-Supervised Learning, Active Learning

  16. Learning to Predict: Challenges • Simplifying assumptions re: training data • IID / unstructured • Lots of instances • Low dimensions • Complete features • Completely labeled • Balanced data ? • is sufficient ⇒ Cost Curves (analysis)

  17. Learning to Predict: Challenges • Simplifying assumptions re: training data • IID / unstructured • Lots of instances • Low dimensions • Complete features • Completely labeled • Balanced data • is sufficient ? ⇒ Robust SVM • Mixture Using Variance • Large-Margin Bayes Net • Coordinated Classifiers • …

  18. Projects and Status • Structured Prediction: Random Fields; Parsing; Unsupervised M3N • Dimensionality Reduction (L2 Model: Component Discovery) • Budgeted Learning • Semi-Supervised Learning: large-margin (SVM); probabilistic (CRF); graph-based transduction • Active Learning • Cost Curves • Robust SVM • Coordinated Classifiers • Mixture Using Variance • Large-Margin Bayes Net • (Assumptions addressed: IID / unstructured; lots of instances; low dimensions; complete features; completely labeled; balanced data; beyond simple learners.) Poster #26

  19. Technical Details: Budgeted Learning

  20. Typical Supervised Learning [Figure: training examples (Person 1, Person 2, …) with feature values and a Response label feed the Learner, which outputs a Predictor.]

  21. Active Learning [Figure: the same Learner → Predictor diagram, now with unlabeled examples.] • User is able to PURCHASE labels, at some cost • … for which instances??

  22. Budgeted Learning [Figure: the same diagram, now with missing feature values.] • User is able to PURCHASE values of features, at some cost • … but which features for which instances??

  23. Budgeted Learning • User is able to PURCHASE values of features, at some cost • … but which features for which instances?? • Significantly different from ACTIVE learning: • correlations between feature values

  24. [Figure: error vs. number of features purchased; 10 tests at $1/test, budget = $40, Beta(10,1) prior.]

  25. Budgeted Learning … so far • Defined framework • Ability to purchase individual feature values • Fixed LEARNING / CLASSIFICATION budget • Theoretical results • NP-hard in general • Standard algorithms are not even approximation algorithms! • Empirical results show … • Avoid Round Robin • Try clever algorithms • Biased Robin (sketched below) • Randomized Single-Feature Lookahead [Lizotte, Madani, Greiner: UAI'03], [Madani, Lizotte, Greiner: UAI'04], [Kapoor, Greiner: ECML'05]
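
A minimal sketch of the Biased Robin purchase policy named above, under assumptions not in the slides: each purchase returns a binary "success" signal (e.g., the bought feature value agreed with the instance's label), and a Beta(1,1) posterior per feature tracks the outcomes. The `purchase` callback and the success criterion are illustrative:

```python
# Biased Robin: keep purchasing the same feature while it "succeeds";
# on a failure, advance to the next feature. The success signal and
# the purchase() interface are illustrative stand-ins.
import random

def biased_robin(n_features, budget, purchase):
    """purchase(i) buys one value of feature i; returns True on success."""
    posteriors = [[1, 1] for _ in range(n_features)]  # Beta(a, b) per feature
    i = 0
    for _ in range(budget):
        if purchase(i):
            posteriors[i][0] += 1             # success: stay on this feature
        else:
            posteriors[i][1] += 1
            i = (i + 1) % n_features          # failure: move to the next feature
    return posteriors

# Toy usage: hidden feature quality; better features succeed more often.
quality = [0.9, 0.5, 0.6]
post = biased_robin(3, budget=40, purchase=lambda i: random.random() < quality[i])
print(post)  # the best feature should end up with the most purchases
```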

  26. Future Work #1 [Figure: the budgeted-learning Learner → Classifier diagram from Slide 22.]

  27. Future Work #2 • Sample complexity of Budgeted Learning • How many (Ij, Xi) “probes” are required to PAC-learn? • Develop policies with guarantees on learning performance • More complex cost models … bundling tests, … • Allow the learner to perform more powerful probes • e.g., purchase X3 in instances where X7 = 0 & Y = 1 • More complex classifiers?

  28. Future Work #3: Learning a Generative Model [Figure: the same diagram, with the Learner fitting a generative model.] • Goal: find θ* = argmaxθ P(D | θ)

  29. [Figure: MTrain and MTest matrices with labels; the Learner finds biclusters, and bicluster-membership features feed the Classifier.] Projects and Status • Structured Prediction (ongoing) • Dimensionality Reduction (ongoing; RoBiC: Poster #8) • Budgeted Learning (ongoing) • Semi-Supervised Learning (ongoing) • Active Learning (ongoing) • Cost Curves (complete; Poster #26)

  30. Technical Details: Using Variance Estimates to Combine Bayesian Classifiers

  31. Motivation [Figure: four classifiers C1, C2, C3, C4, each covering a different region of + and o instances; a query point * lies where some classifiers know more than others.] • Suppose many different classifiers … • For each instance, want each classifier to … • “know what it knows” … • … and shout LOUDEST when it knows best … • “Loudness” ∝ 1 / Variance!

  32. Mixture Using Variance • Given a belief-net classifier • fixed (correct) structure • parameters θ estimated from a (random) data sample • Response to query “P(+c | -e, +w)” is … • asymptotically normal, with … • (asymptotic) variance • Variance is easy to compute … • for simple structures (Naïve Bayes, TAN) … and • for complete queries (a variance-weighted mixing sketch follows below)
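
A hedged sketch of the “loudness ∝ 1/variance” idea from Slides 31–32: each base classifier reports a probability estimate and an (assumed available) asymptotic variance for the query, and the mixture weights the votes by inverse variance. This is a stand-in illustration, not the exact MUV algorithm from the slides:

```python
# Inverse-variance weighting of base-classifier estimates: the
# classifier with the smallest variance "shouts loudest".
import numpy as np

def mixture_using_variance(probs, variances):
    """probs[k]: classifier k's estimate of P(class = + | query);
    variances[k]: its (asymptotic) variance for that estimate."""
    probs = np.asarray(probs, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)   # "loudness" = 1 / variance
    return float(np.sum(w * probs) / np.sum(w))

# Toy usage: the confident classifier (small variance) dominates.
print(mixture_using_variance([0.9, 0.4, 0.5], [0.01, 0.2, 0.3]))
```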

  33. Experiment #4b: MUV(kNB, AdaBoost, js) vs AdaBoost(NB) • MUV significantly outperforms AdaBoost • even when using base classifiers that AdaBoost generated! • MUV(kNB, AdaBoost, js) better than AdaBoost(NB) with p < 0.023

  34. MUV Results • Sound statistical foundation • Very effective classifier … • …across many real datasets • MUV(NB) better than AdaBoost(NB)! C. Lee, S. Wang and R. Greiner; ICML’06

  35. Mixture Using Variance … next steps? • Other structures (beyond NB, TAN) • Beyond just tabular CP-tables for discrete variables • Noisy-or • Gaussians • Learn different base classifiers from different subsets of features • Scaling up to many, MANY features • overfitting characteristics?

  36. Confidence in Classifier • Confidence of a prediction? • Fit each (μj, σj²) to a Beta(aj, bj) • Compute the area CDFBeta(aj, bj)(0.5) (a moment-matching sketch follows below)
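
A sketch of the confidence computation above, assuming a method-of-moments fit: match a Beta(a, b) to the prediction's mean μ and variance σ², then evaluate its CDF at the 0.5 decision boundary. Function and variable names are illustrative:

```python
# Moment-match a Beta(a, b) to (mu, var), then read off the mass on
# the "wrong side" of the 0.5 threshold. Assumes var < mu * (1 - mu),
# which any valid Beta satisfies.
from scipy.stats import beta

def prediction_confidence(mu, var):
    common = mu * (1 - mu) / var - 1        # solve E[X] = mu, Var[X] = var
    a, b = mu * common, (1 - mu) * common
    return beta.cdf(0.5, a, b)              # area below the 0.5 boundary

# Toy usage: same mean, smaller variance => less mass below 0.5.
print(prediction_confidence(0.8, 0.02))
print(prediction_confidence(0.8, 0.10))
```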

  37. Semi-Supervised Learning [Figure: labeled training data plus unlabeled training data feed the Learner, which outputs a Classifier.]

  38. Approaches • Ignore the unlabeled data • great if you have LOTS of labeled data • Use the unlabeled data, as is … • “Semi-Supervised Learning” … based on • large margin (SVM) • graph • probabilistic model • Pay to get labels for SOME unlabeled data • “Active Learning”

  39. Semi-Supervised Multi-class SVM • Approach: find a labeling that would yield an optimal SVM classifier on the resulting training data (a simpler stand-in is sketched below) • Hard, but • semi-definite relaxations can approximate this objective surprisingly well • training procedures are computationally intensive, but produce high-quality generalization results • L. Xu, J. Neufeld, B. Larson, D. Schuurmans. Maximum margin clustering. NIPS-04. • L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class SVMs. AAAI-05.
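
The semi-definite relaxation in Xu et al. is too involved to sketch here; as a plainly simpler stand-in for “find a labeling that yields an optimal SVM”, the sketch below alternates between fitting an SVM and relabeling the unlabeled points with its own predictions (self-training). All names are illustrative:

```python
# Self-training as a cheap surrogate for the labeling-search
# objective: guess labels, refit, repeat.
import numpy as np
from sklearn.svm import SVC

def semi_supervised_svm(X_l, y_l, X_u, iters=10):
    clf = SVC(kernel="linear").fit(X_l, y_l)     # start from labeled data only
    for _ in range(iters):
        y_u = clf.predict(X_u)                   # guess labels for unlabeled data
        X = np.vstack([X_l, X_u])
        y = np.concatenate([y_l, y_u])
        clf = SVC(kernel="linear").fit(X, y)     # refit on the guessed labeling
    return clf
```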

  40. Probabilistic Approach to Semi-Supervised Learning • Probabilistic model: P(y|x) • Context: non-IID data • Language modelling • Segmenting brain tumors from MR images • Use unlabeled data as a regularizer (sketched below) • Future: other applications … • C.-H. Lee, S. Wang, F. Jiao, D. Schuurmans and R. Greiner. Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields. NIPS'06. • F. Jiao, S. Wang, C.-H. Lee, R. Greiner and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. COLING/ACL'06.
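
A hedged sketch of “unlabeled data as regularizer”: the cited semi-supervised CRF work penalizes the model's conditional entropy on unlabeled data, and the stand-in below applies the same idea to plain logistic regression rather than a CRF. The weight `lam` and all names are illustrative:

```python
# Supervised negative log-likelihood on labeled data, plus an
# entropy penalty that pushes the model toward confident
# predictions on the unlabeled data.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def semi_supervised_loss(w, X_l, y_l, X_u, lam=0.1):
    p_l = np.clip(sigmoid(X_l @ w), 1e-9, 1 - 1e-9)
    nll = -np.mean(y_l * np.log(p_l) + (1 - y_l) * np.log(1 - p_l))
    p_u = np.clip(sigmoid(X_u @ w), 1e-9, 1 - 1e-9)
    entropy = -np.mean(p_u * np.log(p_u) + (1 - p_u) * np.log(1 - p_u))
    return nll + lam * entropy   # minimize: fit labels, be decisive elsewhere
```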

  41. Active Learning • Pay for the label of the query xi that … maximizes conditional mutual information about the unlabeled data • How to determine yi? • Take the EXPECTATION wrt Yi? • Use an OPTIMISTIC guess wrt Yi? (both scoring options are sketched below)
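
A minimal sketch contrasting the two options just listed for scoring a candidate query xi: average the benefit over possible labels (expectation) versus take the best case (optimistic guess). The `benefit` callback, standing in for the conditional mutual-information gain, and `p_y`, the model's label posterior, are assumed interfaces, not from the slides:

```python
# Two ways to score a candidate query when its label is unknown.
def expected_score(x, labels, benefit, p_y):
    return sum(p_y(x, y) * benefit(x, y) for y in labels)

def optimistic_score(x, labels, benefit):
    return max(benefit(x, y) for y in labels)   # best-case label

def choose_query(pool, labels, benefit, p_y=None, optimistic=True):
    if optimistic:
        return max(pool, key=lambda x: optimistic_score(x, labels, benefit))
    return max(pool, key=lambda x: expected_score(x, labels, benefit, p_y))
```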

  42. Optimistic Active Learning using Mutual Information • Need optimism • Need “on-line adjustment” • Better than just MostUncertain, … [Figure: learning curves on the breast and pima datasets.] • Y. Guo and R. Greiner. Optimistic active learning using mutual information. IJCAI'07.

  43. Future Work on Active Learning • Understand WHY “optimism” works … • + other applications of optimism • Extend the framework to deal with • non-IID data • different qualities of labelers • …
