Exploiting Common SubRelations: Learning One Belief Net for Many Classification Tasks
R. Greiner, Wei Zhou, University of Alberta
Situation • CHALLENGE: Need to learn k classifiers • Cancer, from medical symptoms • Meningitis, from medical symptoms • Hepatitis, from medical symptoms • … • Option 1: Learn k different classifier systems {S_Cancer, S_Menin, …, S_k} • Then use S_i to deal with the ith “query class” • but… need to re-learn the inter-relations among factors and symptoms that are common to all k classifiers
Common Interrelationships • [figure: belief-net fragments for the Cancer and Menin classification tasks, sharing the same inter-related symptom sub-network]
Use Common Structure! • CHALLENGE: Need to learn k classifiers • Cancer, from medical symptoms • Meningitis, from symptoms • Hepatitis, from symptoms • … • Option 2: Learn 1 “structure” S of relationships, then use S to address all k classification tasks • Actual Approach: Learn 1 Bayesian Belief Net, inter-relating the info for all k types of queries
Outline • Motivation • Handle multiple class variables • Framework • Formal model • Belief Nets as (multi)classifiers • Results • Theoretical Analysis • Algorithms (Likelihood vs Conditional Likelihood) • Empirical Comparison • 1 Structure vs k Structures; LL vs LCL • Contributions
MC-Learner • [diagram: Training Data → MC-Learner → MC (multiclassifier)]
Multi-Classifier I/O • Given a “query”: class variable Q and evidence E=e • e.g. “Cancer=?, given Gender=F, Age=35, Smoke=t” • Return a value Q=q • e.g. “Cancer = Yes”
MultiClassifier • Like standard classifiers, can deal with • different evidence variables E • different evidence values e • Unlike standard classifiers, can also deal with • different class variables Q • Able to “answer queries” • classify new unlabeled tuples • Given “Q=?, given E=e”, return “q”: • MC(Cancer; Gender=M, Age=25, Height=6’) = No • MC(Meningitis; Gender=F, BloodTest=t) = Severe
MC-Learner’s I/O • Input: set of “queries” (labeled, partially-specified tuples) • like the input to standard (partial-data) learners • Output: MultiClassifier
Error Measure • Query distribution: Prob([Q, E=e] asked) • …can be uncorrelated with the “tuple distribution” • “Labeled query”: ⟨[Q, E=e], q⟩ • MultiClassifier MC returns MC(Q, E=e) = q’ • Classification error of MC (“0/1” error): CE(MC) = Σ_{⟨[Q,E=e],q⟩} Prob([Q, E=e] asked) · [| MC(Q, E=e) ≠ q |] • where [| a ≠ b |] = 1 if a ≠ b, 0 otherwise
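As a concrete reading of this definition, here is a minimal Python sketch of estimating CE(MC) from a finite sample of labeled queries; the names `mc` and `labeled_queries` are illustrative, not from the paper.

```python
# Minimal sketch: estimate the 0/1 classification error CE(MC) from a
# sample of labeled queries drawn from the query distribution.
# (Illustrative interface only; `mc` and `labeled_queries` are assumed.)

def classification_error(mc, labeled_queries):
    """mc(query_var, evidence_dict) -> predicted value.
    labeled_queries: list of (query_var, evidence_dict, true_value)."""
    wrong = sum(1 for q, e, v in labeled_queries if mc(q, e) != v)
    return wrong / len(labeled_queries)
```

Averaging over a sample drawn from Prob([Q, E=e] asked) approximates the weighted sum in the definition.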
Learner’s Task • Given • a space of “MultiClassifiers” {MC_i} • a sample of labeled queries drawn from the “query distribution” • Find MC* = argmin_{MC_i} { CE(MC_i) }, i.e. the multiclassifier with minimal error over the query distribution
Outline • Motivation • Handle multiple class variables • Framework • Formal model • Belief Nets as (multi)classifiers • Results • Theoretical Analysis • Algorithms (Likelihood vs Conditional Likelihood) • Empirical Comparison • 1 Structure vs k Structures; LL vs LCL • Contributions
Simple Belief Net • [figure: net with H as parent of B and of J; candidate B → J arc] • P(J | H, B=0) = P(J | H, B=1) for all j, h ⇒ P(J | H, B) = P(J | H) • J is INDEPENDENT of B, once we know H • Don’t need a B → J arc!
Example of a Belief Net • Node ~ variable, link ~ “causal dependency”, “CPTable” ~ P(child | parents)
H:  P(H=1) = 0.05, P(H=0) = 0.95
B:  P(B=1 | H=1) = 0.95, P(B=1 | H=0) = 0.03
J:  P(J=1 | h, b):  h=1, b=1: 0.8 | h=1, b=0: 0.8 | h=0, b=1: 0.3 | h=0, b=0: 0.3
Include Only Causal Links • Sufficient Belief Net • Requires only: P(H=1), P(J=1 | H=h), P(B=1 | H=h) known (only 5 parameters, not 7) • Hence: P(H=1, J=0, B=1) = P(H=1) · P(J=0 | H=1) · P(B=1 | J=0, H=1) = P(H=1) · P(J=0 | H=1) · P(B=1 | H=1) • and the posterior P(H=1 | J=0, B=1) follows by normalization (worked numerically below)
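A quick numeric check of this factorization, using the CPTable numbers from the example slide above (a sketch; the helper names are ours):

```python
# Toy net H -> B, H -> J with the CPTables from the example slide.
p_h1 = 0.05                          # P(H=1)
p_j1_given_h = {1: 0.8, 0: 0.3}      # P(J=1 | H=h); B is irrelevant to J
p_b1_given_h = {1: 0.95, 0: 0.03}    # P(B=1 | H=h)

def joint(h, j, b):
    """P(H=h, J=j, B=b) = P(h) * P(j|h) * P(b|h) for binary h, j, b."""
    ph = p_h1 if h else 1 - p_h1
    pj = p_j1_given_h[h] if j else 1 - p_j1_given_h[h]
    pb = p_b1_given_h[h] if b else 1 - p_b1_given_h[h]
    return ph * pj * pb

# Posterior P(H=1 | J=0, B=1) by enumeration over the hidden H:
num = joint(1, 0, 1)                 # 0.05 * 0.2 * 0.95  = 0.0095
den = num + joint(0, 0, 1)           # + 0.95 * 0.7 * 0.03 = 0.02945
print(num / den)                     # ~0.323
```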
BeliefNet as (Multi)Classifier • For query [Q, E=e], the BN returns a distribution: P_BN(Q=q1 | E=e), P_BN(Q=q2 | E=e), … P_BN(Q=qm | E=e) • (Multi)Classifier: MC_BN(Q, E=e) = argmax_{qi} { P_BN(Q=qi | E=e) }
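In code, the classifier is just an argmax over the returned posterior; `posterior` below stands in for any BN inference routine (an assumed interface, not the paper's):

```python
# Sketch: a belief net used as a (multi)classifier.
# `posterior(bn, q, e)` is an assumed inference routine that returns a
# dict {value: P_BN(Q=value | E=e)}; any BN inference engine would do.

def mc_bn(bn, q, e, posterior):
    dist = posterior(bn, q, e)        # P_BN(Q=q_i | E=e) for each q_i
    return max(dist, key=dist.get)    # argmax over the candidate values
```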
Learning Belief Nets • Belief Net = ⟨G, Θ⟩ • G = directed acyclic graph (“structure”: what’s related to what) • Θ = “parameters”: strength of the connections • Learning a Belief Net ⟨G, Θ⟩ from “data”: 1. learning the structure G; 2. finding the parameters Θ that are best, for G • Our focus: #2 (parameters); “best” ≡ minimal CE-error
Learning BN Multi-Classifier • Given: structure G + labeled queries • Goal: find CPtables Θ to minimize CE error: Θ* = argmin_Θ { Σ_{⟨[Q,E=e],q⟩} Prob([Q, E=e] asked) · [| MC_{⟨G,Θ⟩}(Q, E=e) ≠ q |] }
Issues • Q1: How many labeled queries are required? • Q2: How hard is learning, given distributional info? • Q3: What is the best algorithm for learning… • … a Belief Net? • … a Belief Net Classifier? • … a Belief Net Multiclassifier?
Q1, Q2: Theoretical Results • PAC(ε, δ)-learn CPtables: given a BN structure, find CPtables whose CE-error is, with probability 1−δ, within ε of optimal • Sample Complexity: for a BN structure with N variables and K CPtable entries, if γ > 0, need a sample of … labeled queries • Computational Complexity: NP-hard to find the CPtable with minimal CE error (over γ, for any γ ∈ O(1/N)) from labeled queries… even from a known structure!
Use Conditional Likelihood • Goal: minimize “classification error”, based on training sample ⟨[Qi, Ei=ei], qi*⟩ • Sample typically includes • high-probability queries [Q, E=e] • only the most likely answers to these queries: q* = argmax_q { P(Q=q | E=e) } • ⇒ Maximize Conditional Likelihood: LCL_D(Θ) = Σ_{⟨q*,e⟩∈D} log P_Θ(Q=q* | E=e)
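A minimal sketch of this LCL objective over a training sample, assuming a `cond_prob` routine for P_Θ(Q=q* | E=e) (a hypothetical helper, not the paper's API):

```python
import math

# Sketch: conditional log-likelihood LCL_D of a sample of labeled queries.
# `cond_prob(theta, q_var, q_val, e)` stands in for BN inference
# P_theta(Q=q_val | E=e); its signature is an assumption.

def lcl(theta, sample, cond_prob):
    """sample: list of (query_var, evidence_dict, observed_answer)."""
    return sum(math.log(cond_prob(theta, q, qstar, e))
               for q, e, qstar in sample)
```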
Gradient Descent Alg: ILQ • [figure: net with query node Q, feature nodes F1, F2 feeding node C, whose CPtable holds θ_{c|f} = P(C=c | F=f1,f2), and evidence E] • How to change CPtable entry θ_{c|f}, given datum “[Q=q, E=e]”: descend along the derivative of the objective w.r.t. θ_{c|f}, summed over queries “[Q=q, E=e]”; conjugate gradient, …
Better Algorithm: ILQ • Constrained optimization: θ_{c|f} ≥ 0, θ_{c=0|f} + θ_{c=1|f} = 1 • New parameterization β_{c|f}: for each “row” r_j, set β_{c0|r_j} = 0 for one c0 (a sketch of this reparameterization follows)
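The slide does not spell out the β → θ map; a standard choice for this kind of constrained-to-unconstrained reparameterization is a softmax, sketched here under that assumption:

```python
import math

# Sketch of the reparameterization idea: keep unconstrained betas per
# CPTable row and map them to valid probabilities. A softmax map is one
# standard way to enforce theta >= 0 and a row sum of 1; the exact map
# ILQ uses is not shown on the slide, so treat this as an assumption.

def row_thetas(betas):
    """betas: dict {value: beta} for one CPT row (one parent config f).
    Fixing beta[c0] = 0 for one value c0 removes the redundant degree
    of freedom, as the slide suggests."""
    z = sum(math.exp(b) for b in betas.values())
    return {c: math.exp(b) / z for c, b in betas.items()}

betas = {"c0": 0.0, "c1": 0.7}        # c0 pinned at 0
print(row_thetas(betas))              # ~{'c0': 0.33, 'c1': 0.67}
```

Gradient steps can then move the remaining betas freely; the thetas recovered through `row_thetas` always stay on the probability simplex.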
Q3: How to Learn BN MultiClassifier? • Approach 1: Minimize error ⇒ Maximize Conditional Likelihood • (In)Complete data: ILQ • Approach 2: Fit to data ⇒ Maximize Likelihood • Complete data: Observed Frequency Estimate • Incomplete data: EM / APN
Empirical Studies • Two different objectives ⇒ 2 learning algorithms • Maximize Conditional Likelihood: ILQ • Maximize Likelihood: APN • Two different approaches to multiple classes • 1 copy of the structure • k copies of the structure • k naïve-bayes • Several “datasets” • Alarm • Insurance • … • Error measures: “0/1” CE; MSE(Θ) = Σ_i [P_true(qi | ei) − P_Θ(qi | ei)]² (sketched below)
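The MSE metric as a sketch, with `p_true` and `p_theta` as assumed callables for the gold-standard and learned conditionals:

```python
# Sketch: the MSE metric used alongside 0/1 error in the experiments.
# `p_true(q, e)` and `p_theta(q, e)` stand in for the gold-standard and
# learned conditional probabilities (assumed interfaces).

def mse(queries, p_true, p_theta):
    """queries: list of (q_i, e_i) pairs."""
    return sum((p_true(q, e) - p_theta(q, e)) ** 2 for q, e in queries)
```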
1- vs k- Structures • [figure: left, a single shared structure answering both Menin and Cancer queries; right, k separate copies of the structure, one per class variable]
Empirical Study I: Alarm • Alarm Belief Net: 37 vars, 46 links, 505 parameters
Query Distribution • [HC’91] says, typically: • 8 vars Q ⊆ N appear as query • 16 vars E ⊆ N appear as evidence • Select Q ∈ Q uniformly • Use the same set of 7 evidence variables E ⊆ E • Assign value e for E, based on P_alarm(E=e) • Find “value” v based on P_alarm(Q=v | E=e) • Each run uses m such queries, m = 5, 10, … 100, … (a generator sketch follows)
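A sketch of this generator, assuming helpers for sampling from and running inference on the gold-standard Alarm net (neither helper is from the paper); whether the label v is sampled from or maximizes the posterior is not fully specified above, so this version samples it:

```python
import random

# Sketch of the labeled-query generator described above.
# `sample_joint()` returns one full tuple ~ P_alarm; `cond_dist(q, e)`
# returns {value: P_alarm(Q=value | E=e)}. Both are assumed helpers.

def make_labeled_query(query_vars, evidence_vars, sample_joint, cond_dist):
    q = random.choice(query_vars)              # Q drawn uniformly from Q
    full = sample_joint()                      # one tuple from P_alarm
    e = {v: full[v] for v in evidence_vars}    # evidence values E = e
    dist = cond_dist(q, e)                     # P_alarm(Q = v | E = e)
    vals, weights = zip(*dist.items())
    v = random.choices(vals, weights=weights)[0]  # label from posterior
    return q, e, v
```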
Results (Alarm; ILQ; Small Sample) • [plots: CE and MSE]
Results (Alarm; ILQ; Large Sample) • [plots: CE and MSE]
Comments on Alarm Results • For small sample size: “ILQ, 1 structure” better than “ILQ, k structures” • For large sample size: “ILQ, 1 structure” ≈ “ILQ, k structures” • ILQ-k has more parameters to fit, but… lots of data • APN ok, but much slower (did not converge within bounds)
Empirical Study II: Insurance • Insurance Belief Net (simplified version) • 27 vars (3 query, 8 evidence) • 560 parameters • Distribution: • select 1 query variable randomly from the 3 • use all 8 evidence variables • …
Results (Insurance; ILQ) • [plots: CE and MSE]
Summary of Results • Learning for a given structure, to minimize CE_D(Θ) or MSE_D(Θ) • Correct structure • small number of samples: ILQ-1 (APN-1) win (over ILQ-k, APN-k) • large number of samples: ILQ-k ≈ ILQ-1 win (over APN-1, APN-k) • Incorrect structure (naïve-bayes): ILQ wins
Future Work • Best algorithm for learning an optimal BN? • Actually optimize CE-error (not LCL) • Learning STRUCTURE as well as CPtables • Special cases where ILQ is efficient (complete data?) • Other “learning environments” • Other prior knowledge: Query Forms • Explicitly-Labeled Queries • Better understanding of sample complexity without the “γ” restriction
Related Work • Like (ML) classification, but… • probabilities, not discrete labels • different class variables, different evidence sets… • … see Caruana • “Learning to Reason” [KR’95]: “do well on tasks that will be encountered”… but a different performance system • Sample complexity [FY, Hoeffgen]: … a different learning model • Computational complexity [Kilian/Naor95]: NP-hard to find ANY distribution with minimal L1-error w.r.t. unconditional queries; here: conditional queries, L2 error, for a BN
Take Home Msgs • To maximize performance: use Conditional Likelihood (ILQ), not Likelihood (APN/EM, OFE) • especially if the structure is wrong, the sample is small, … (… controversial…) • To deal with MultiClassifiers: use 1 structure, not k • if the sample is small, 1 structure gives better performance; if the sample is large, performance is the same… but 1 structure is smaller (… yes, of course…) • Relation to Attrib vs Relation: not “1 example for many classes of queries” but “1 example for 1 class of queries, BUT IN ONE COMMON STRUCTURE” ⇒ Exploiting Common Relations
Contributions • Appropriate model for learning ⇒ learn a MultiClassifier that works well in practice • Extends standard learning environments: labeled queries, with different class variables • Sample Complexity: need “few” labeled queries • Computational Complexity ⇒ Effective Algorithm: NP-hard ⇒ gradient descent • Empirical evidence: works well! • http://www.cs.ualberta.ca/~greiner/BN-results.html
Questions? • LCL vs LL: does the difference matter? • ILQ vs APN • Query Forms • See also http://www.cs.ualberta.ca/~greiner/BN-results.html
Learning Model • Most belief net learners try to maximize LIKELIHOOD: LL_D(Θ) = Σ_{x∈D} log P_Θ(x) … as the goal is “fit to data” D • Our goal is different: we want to minimize error over the distribution of queries • If “What is p(jaun | btest−)?” is never asked, we don’t care if BN(jaun | btest−) ≠ p(jaun | btest−)
Different Optimization • LL_D(Θ) = Σ_{⟨q*,e⟩∈D} log P_Θ(Q=q* | E=e) + Σ_{⟨q*,e⟩∈D} log P_Θ(E=e) = LCL_D(Θ) + Σ_{⟨q*,e⟩∈D} log P_Θ(E=e) • As Σ_{⟨q*,e⟩∈D} log P_Θ(E=e) is non-trivial: Θ_LL = argmax_Θ { LL_D(Θ) } ≠ Θ_LCL = argmax_Θ { LCL_D(Θ) } • Discriminant analysis: maximize overall likelihood vs minimize predictive error • To find Θ_LCL: NP-hard, so… ILQ
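The decomposition is just log P(q*, e) = log P(q* | e) + log P(e) per labeled query; a two-line numeric check, with illustrative numbers taken from the toy H/J/B net earlier:

```python
import math

# Check of log P(q*, e) = log P(q* | e) + log P(e), per labeled query,
# which summed over D gives LL_D = LCL_D + sum log P(E=e).
p_qe = 0.0095    # e.g. P(H=1, J=0, B=1) in the toy net
p_e = 0.02945    # e.g. P(J=0, B=1)
assert abs(math.log(p_qe) - (math.log(p_qe / p_e) + math.log(p_e))) < 1e-12
```

The Σ log P_Θ(E=e) term is what LL spends parameters on and LCL ignores, which is why the two optima differ.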
Why Alternative Model? • A belief net is… • a representation for a distribution • a system for answering queries • Suppose the BN must answer “What is p(hep | jaun, btest−)?” but not “What is p(jaun | btest−)?” • So… the BN is good if BN(hep | jaun, btest−) = p(hep | jaun, btest−), even if BN(jaun | btest−) ≠ p(jaun | btest−)
Query Distr vs Tuple Distr • Distribution over tuples: p(q) • p(hep, jaun, btest−, …) = 0.07 • p(flu, cough, ~headache, …) = 0.43 • Distribution over queries: sq(q) = Prob(q asked) • ask “What is p(hep | jaun, btest−)?” 30% • ask “What is p(flu | cough, ~headache)?” 22% • Can be uncorrelated: • e.g. Prob[Asking Cancer] = sq(“cancer”) = 100%, even if Pr[Cancer] = p(cancer) = 0
Query Distr ≠ Tuple Distr • Suppose a GP asks all ADULT FEMALE patients: “Pregnant?” • Data ⇒ P(Preg | Adult, Gender=F) = 2/3 • Is this really the TUPLE distribution? • P(Gender=F) = 1? • NO: it only reflects the questions asked! • Provides info re: P(Preg | Adult=+, Gender=F) • but NOT about P(Adult), …
Query Distr ≠ Tuple Distr • Query probability is independent of tuple probability: Prob([Q, E=e] asked) vs P(Q=q, E=e) • Could always ask about a 0-prob situation: • always ask “[Pregnant=t, Gender=Male]” ⇒ sq(Pregnant=t, Gender=Male) = 1, but P(Pregnant=t, Gender=Male) = 0 • Possible that sq(Q, E=ei) ≠ P(E=ei): e.g. P(Gender=Female) = P(Gender=Male) but sq(Pregnant, Gender=Female) ≠ sq(Pregnant, Gender=Male) • Note: the VALUE of a query, the q* of ⟨[Q, E=e], q*⟩, IS based on P(Q=q | E=e)
Does it matter? • If all queries involve the same query variable, it is ok to pretend sq(·) ~ p(·), as no-one ever asks about the EVIDENCE DISTRIBUTION • e.g., as no one asks “What is P(Gender)?”, it doesn’t matter… • But this is problematic in a MultiClassifier… if there are other queries, e.g. sq(Gender; ·)
ILQ (cond likelihood) vs APN (likelihood) • Wrong structure: • ILQ better than APN/EM • Experiments: • artificial data • using naïve Bayes (UCI) • Correct structure: • ILQ often better than OFE, APN/EM • Experiments • Discriminant analysis: maximize overall likelihood vs minimize predictive error