Active Learning for Probabilistic Models Lee Wee Sun Department of Computer Science National University of Singapore leews@comp.nus.edu.sg LARC-IMS Workshop
Probabilistic Models in Networked Environments
• Probabilistic graphical models are powerful tools in networked environments
• Example task: given some labeled nodes, what are the labels of the remaining nodes?
• May also need to learn the parameters of the model (later)
[Figure: Labeling university web pages with a CRF; nodes are labeled Faculty, Project, or Student, and "?" marks an unlabeled node]
Active Learning
• Given a budget of k queries, which nodes should be queried to maximize performance on the remaining nodes?
• What are reasonable performance measures with provable guarantees for greedy methods?
[Figure: Labeling university web pages with a CRF]
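Entropy
• First consider a non-adaptive policy
• Chain rule of entropy: $H(Y_G) = H(Y_1) + H(Y_2 \mid Y_1)$, where $Y_G$ is the labeling of the whole graph, $Y_1$ the selected variables, and $Y_2$ the remaining variables
• $H(Y_G)$ is constant, so maximizing the entropy $H(Y_1)$ of the selected variables minimizes the target conditional entropy $H(Y_2 \mid Y_1)$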
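The chain rule can be checked numerically on a tiny model. The sketch below uses a made-up joint distribution over two binary labels; the names `joint`, `marginal`, and `conditional_entropy` are illustrative and not from the talk.

```python
import math

# Hypothetical joint distribution over two binary labels (y1, y2).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(dist):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginal distribution of the variables at positions idx."""
    out = {}
    for y, p in joint.items():
        key = tuple(y[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def conditional_entropy(joint, rest, given):
    """H(Y_rest | Y_given) = sum over a of p(a) * H(Y_rest | Y_given = a)."""
    total = 0.0
    for a, pa in marginal(joint, given).items():
        cond = {}
        for y, p in joint.items():
            if tuple(y[i] for i in given) == a:
                key = tuple(y[i] for i in rest)
                cond[key] = cond.get(key, 0.0) + p / pa
        total += pa * entropy(cond)
    return total

h_joint = entropy(joint)                            # constant
h_selected = entropy(marginal(joint, [0]))          # to maximize
h_remaining = conditional_entropy(joint, [1], [0])  # target to minimize
```

Since `h_joint` does not depend on which variable is queried, any increase in `h_selected` is matched by an equal decrease in `h_remaining`.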
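Greedy method
• Given the already selected set S, add the variable $Y_i$ that maximizes the conditional entropy $H(Y_i \mid Y_S)$
• Near-optimality: the entropy of the greedily selected set is at least $(1 - 1/e)$ times that of the best set of the same size, because entropy is submodular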
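A minimal sketch of the non-adaptive greedy rule, assuming the joint distribution is small enough to enumerate. The three-node model is invented for illustration: node 1 is a noisy copy of node 0 and node 2 is independent.

```python
import math
from itertools import product

def entropy_of(joint, idx):
    """Entropy (nats) of the marginal over the variables at positions idx."""
    marg = {}
    for y, p in joint.items():
        key = tuple(y[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marg.values() if p > 0)

def greedy_select(joint, n_vars, k):
    """Pick k variables, each time adding the one with the largest H(Y_i | Y_S)."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(n_vars) if i not in selected]
        # H(Y_i | Y_S) = H(Y_{S + i}) - H(Y_S); H(Y_S) is the same for every i,
        # so it is enough to compare the joint entropies H(Y_{S + i}).
        best = max(candidates, key=lambda i: entropy_of(joint, selected + [i]))
        selected.append(best)
    return selected

# Hypothetical model: y0 is uniform, y1 is a noisy copy of y0, y2 is independent.
joint = {}
for y0, y1, y2 in product([0, 1], repeat=3):
    flip = 0.1 if y0 == 0 else 0.2
    p1 = flip if y1 != y0 else 1.0 - flip
    p2 = 0.3 if y2 == 1 else 0.7
    joint[(y0, y1, y2)] = 0.5 * p1 * p2

chosen = greedy_select(joint, n_vars=3, k=2)
```

In this toy model the second pick is node 2 rather than node 1: once node 0 is selected, node 1 is largely predictable and adds little entropy.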
Submodularity
• Diminishing return property: for $A \subseteq B$ and $x \notin B$, $f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B)$
Adaptive Policy
• What about adaptive policies?
[Figure: a non-adaptive policy and an adaptive policy, each making k queries]
• Let ρ be a path down the policy tree, and let the policy entropy be $H(\pi) = -\sum_\rho p(\rho) \ln p(\rho)$. Then we can show $H(Y_G) = H(\pi) + \sum_\rho p(\rho)\, H(Y_G \mid \rho)$, where $Y_G$ is the graph labeling
• This corresponds to the chain rule in the non-adaptive case: maximizing the policy entropy minimizes the conditional entropy
• Recap: the greedy algorithm is near-optimal in the non-adaptive case
• For the adaptive case, consider the greedy algorithm that selects the variable with the largest entropy conditioned on the observations
• Unfortunately, for the adaptive case we can show that, for every α > 0, there is a probabilistic model such that $H(\pi_{\text{greedy}}) / H(\pi^*) < \alpha$, where $\pi^*$ is the optimal policy
Tsallis Entropy and Gibbs Error
• In statistical mechanics, Tsallis entropy is a generalization of Shannon entropy: $S_q(p) = \frac{1}{q-1}\left(1 - \sum_y p(y)^q\right)$
• Shannon entropy is the special case q = 1 (taken as the limit $q \to 1$)
• We call the case q = 2 the Gibbs error: $1 - \sum_y p(y)^2$
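The two special cases are easy to see in code. This is a small sketch using the standard Tsallis formula; the function names are illustrative.

```python
import math

def tsallis_entropy(probs, q):
    """Tsallis entropy S_q(p) = (1 - sum_y p(y)^q) / (q - 1).

    At q = 1 the formula is a 0/0 limit, which equals Shannon entropy in nats.
    """
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

def gibbs_error(probs):
    """Gibbs error is the q = 2 case: 1 - sum_y p(y)^2."""
    return tsallis_entropy(probs, 2.0)

probs = [0.2, 0.3, 0.5]
shannon = tsallis_entropy(probs, 1.0)
near_shannon = tsallis_entropy(probs, 1.0001)  # approaches Shannon as q -> 1
```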
Properties of Gibbs Error
• Gibbs error is the expected error of the Gibbs classifier
• Gibbs classifier: draw a labeling from the distribution and use it as the prediction
• Its error is at most twice the Bayes (best possible) error
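The factor-of-two bound can be sanity-checked on random distributions. This sketch takes the Bayes error to be one minus the largest probability, which is the error of always predicting the most likely labeling.

```python
import random

def gibbs_error(probs):
    """Expected error when the prediction is itself drawn from probs."""
    return 1.0 - sum(p * p for p in probs)

def bayes_error(probs):
    """Error of always predicting the most probable outcome."""
    return 1.0 - max(probs)

random.seed(0)
checks = []
for _ in range(1000):
    weights = [random.random() for _ in range(random.randint(2, 6))]
    total = sum(weights)
    probs = [w / total for w in weights]
    b, g = bayes_error(probs), gibbs_error(probs)
    # Bayes error <= Gibbs error <= 2 * Bayes error (small tolerance for rounding).
    checks.append(b - 1e-12 <= g <= 2 * b + 1e-12)
```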
Lower bound to entropy
• Gibbs error is a lower bound on entropy, so maximizing the policy Gibbs error maximizes a lower bound on the policy entropy
Policy Gibbs error
• $V(\pi) = 1 - \sum_\rho p(\rho)^2$, the Gibbs error of the distribution over the paths ρ of policy π
• Maximizing the policy Gibbs error minimizes the expected weighted posterior Gibbs error, where the posterior Gibbs error of each path is weighted by its version space
• Progress is made on either the version space or the posterior Gibbs error
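One way to see why the two quantities trade off is the following decomposition, written as a sketch for a deterministic policy. The notation is an assumption, not taken from the slides: $p(\rho)$ is the probability of path ρ (its version space), $y \sim \rho$ means labeling y is consistent with ρ, and $G(\rho)$ is the posterior Gibbs error after observing ρ. Each labeling follows exactly one path, so $p(y) = p(\rho)\, p(y \mid \rho)$.

```latex
% Assumed notation: G(\rho) = 1 - \sum_{y \sim \rho} p(y \mid \rho)^2
\begin{align*}
\sum_{y} p(y)^2
  &= \sum_{\rho} p(\rho)^2 \sum_{y \sim \rho} p(y \mid \rho)^2
   = \sum_{\rho} p(\rho)^2 \bigl(1 - G(\rho)\bigr), \\
\underbrace{1 - \sum_{y} p(y)^2}_{\text{Gibbs error of } Y_G \text{ (constant)}}
  &= \underbrace{1 - \sum_{\rho} p(\rho)^2}_{\text{policy Gibbs error}}
   + \sum_{\rho} p(\rho)\,
     \underbrace{p(\rho)}_{\text{version space}}\,
     \underbrace{G(\rho)}_{\text{posterior Gibbs error}} .
\end{align*}
```

The left-hand side does not depend on the policy, so raising the first term on the right must lower the second.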
Gibbs Error and Adaptive Policies
• Greedy algorithm: select the node i with the largest conditional Gibbs error
• Near-optimality holds for the policy Gibbs error (in contrast to the policy entropy)
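The adaptive greedy rule can be sketched by brute force when the labelings can be enumerated. This is an illustration, not the implementation used in the talk; the three-node prior and all function names are hypothetical.

```python
from itertools import product

def label_marginal(posterior, i):
    """Label distribution of node i under a dict {labeling: probability}."""
    marg = {}
    for y, p in posterior.items():
        marg[y[i]] = marg.get(y[i], 0.0) + p
    return marg

def condition(posterior, i, label):
    """Posterior over labelings after observing that node i has the given label."""
    kept = {y: p for y, p in posterior.items() if y[i] == label}
    z = sum(kept.values())
    return {y: p / z for y, p in kept.items()}

def greedy_max_gibbs(prior, true_labeling, k):
    """Query k nodes, each time picking the largest conditional Gibbs error."""
    posterior, queried = dict(prior), []
    for _ in range(k):
        candidates = [i for i in range(len(true_labeling)) if i not in queried]
        def gibbs(i):
            return 1.0 - sum(p * p for p in label_marginal(posterior, i).values())
        best = max(candidates, key=gibbs)
        queried.append(best)
        posterior = condition(posterior, best, true_labeling[best])
    return queried, posterior

# Hypothetical prior: y1 copies y0 with label-dependent noise, y2 is independent.
prior = {}
for y0, y1, y2 in product([0, 1], repeat=3):
    flip = 0.1 if y0 == 0 else 0.4
    p1 = flip if y1 != y0 else 1.0 - flip
    p2 = 0.3 if y2 == 1 else 0.7
    prior[(y0, y1, y2)] = 0.5 * p1 * p2

path_a, _ = greedy_max_gibbs(prior, true_labeling=(1, 1, 0), k=2)
path_b, _ = greedy_max_gibbs(prior, true_labeling=(0, 0, 0), k=2)
```

The two runs start with the same query but diverge on the second, because the answer to the first query changes which node is most uncertain; a non-adaptive policy cannot do this.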
Proof idea:
• Show that the policy Gibbs error equals the expected version space reduction
• The version space is the total probability of the remaining labelings of the unlabeled nodes (labelings that are consistent with the labeled nodes)
• The version space reduction function is adaptive submodular, which gives the required result for the policy Gibbs error (using the result of Golovin and Krause)
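Adaptive Submodularity
• Diminishing return property
• Consider the change in version space when $x_i$ is concatenated to path ρ and y is received
• Adaptive submodular because the expected change is no larger on a longer path ρ' that extends ρ
[Figure: a path ρ and its extension ρ' in the policy tree]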
Worst Case Version Space
• Maximizing the policy Gibbs error maximizes the expected version space reduction
• Related greedy algorithm: select the least confident variable, i.e. the variable with the smallest maximum label probability
• This approximately maximizes the worst-case version space reduction
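The least-confidence rule and the maximum Gibbs error rule can disagree once a node has more than two labels. The sketch below uses two invented, independent nodes to show this; the names are illustrative.

```python
def label_marginal(posterior, i):
    """Label distribution of node i under a dict {labeling: probability}."""
    marg = {}
    for y, p in posterior.items():
        marg[y[i]] = marg.get(y[i], 0.0) + p
    return marg

def least_confident(posterior, candidates):
    """Node whose most likely label has the smallest probability."""
    return min(candidates, key=lambda i: max(label_marginal(posterior, i).values()))

def max_gibbs(posterior, candidates):
    """Node with the largest Gibbs error 1 - sum of squared label probabilities."""
    return max(
        candidates,
        key=lambda i: 1.0 - sum(p * p for p in label_marginal(posterior, i).values()),
    )

# Hypothetical independent nodes: node 0 has two labels, node 1 has three.
node_a = {0: 0.5, 1: 0.5}
node_b = {0: 0.55, 1: 0.225, 2: 0.225}
posterior = {(a, b): pa * pb for a, pa in node_a.items() for b, pb in node_b.items()}

worst_case_pick = least_confident(posterior, [0, 1])
expected_case_pick = max_gibbs(posterior, [0, 1])
```

Here least confidence picks node 0 (maximum label probability 0.5 versus 0.55), while maximum Gibbs error picks node 1 (0.59625 versus 0.5).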
• Consider the worst-case version space reduction of a policy
• The greedy strategy that selects the least confident variable achieves a near-optimal worst-case version space reduction, because the version space reduction function is pointwise submodular
Pointwise Submodularity
• Let V(S, y) be the version space remaining if y is the true labeling of all nodes and the subset S has been labeled
• 1 − V(S, y) is pointwise submodular, as it is submodular in S for every labeling y
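Pointwise submodularity can be checked by brute force on a small model. The sketch below assumes V(S, y) is the probability that a random labeling agrees with y on S, and uses a randomly generated joint over three binary nodes; the helper names are hypothetical.

```python
import random
from itertools import combinations, product

def version_space(joint, S, y):
    """V(S, y): total probability of labelings that agree with y on the nodes in S."""
    return sum(p for z, p in joint.items() if all(z[i] == y[i] for i in S))

def is_pointwise_submodular(joint, n):
    """Check diminishing returns of f(S) = 1 - V(S, y) for every labeling y."""
    nodes = set(range(n))
    subsets = [set(c) for r in range(n + 1) for c in combinations(sorted(nodes), r)]
    for y in joint:
        def f(S):
            return 1.0 - version_space(joint, S, y)
        for A in subsets:
            for B in subsets:
                if not A <= B:
                    continue
                for x in nodes - B:
                    # The gain from adding x to the smaller set must be at least
                    # the gain from adding it to the larger set.
                    if f(A | {x}) - f(A) < f(B | {x}) - f(B) - 1e-12:
                        return False
    return True

# Hypothetical random joint distribution over three binary nodes.
random.seed(1)
weights = {y: random.random() for y in product([0, 1], repeat=3)}
total = sum(weights.values())
joint = {y: w / total for y, w in weights.items()}

holds = is_pointwise_submodular(joint, n=3)
```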
Summary So Far …
Learning Parameters
• Take a Bayesian approach
• Put a prior over the parameters
• Integrate out the parameters when computing the probability of a labeling
• Also works in the commonly encountered pool-based active learning scenario (independent instances, with no dependencies other than through the parameter)
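Integrating out the parameters can be illustrated with a finite parameter set, where the integral becomes a sum. Everything in this sketch is an assumption made for illustration: a noisy threshold classifier on a one-dimensional pool, two candidate thresholds, and a fixed noise rate.

```python
from itertools import product

# Hypothetical prior over a threshold parameter theta.
thetas = {0.25: 0.5, 0.75: 0.5}
pool = [0.1, 0.5, 0.9]  # hypothetical unlabeled pool of 1-D instances
NOISE = 0.1             # assumed probability that a label is flipped

def label_prob(y, x, theta):
    """p(y | x, theta) for a noisy threshold classifier."""
    predicted = 1 if x > theta else 0
    return 1.0 - NOISE if y == predicted else NOISE

def labeling_prob(labeling):
    """p(labeling) = sum over theta of p(theta) * prod_i p(y_i | x_i, theta)."""
    total = 0.0
    for theta, prior in thetas.items():
        likelihood = 1.0
        for x, y in zip(pool, labeling):
            likelihood *= label_prob(y, x, theta)
        total += prior * likelihood
    return total

# Distribution over all labelings of the pool with theta marginalized out.
labeling_dist = {y: labeling_prob(y) for y in product([0, 1], repeat=len(pool))}
```

The instances are independent given theta, but in `labeling_dist` they are coupled through the shared parameter, which is what lets one label carry information about the others.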
Experiments
• Named entity recognition with a Bayesian CRF on the CoNLL 2003 dataset
• The greedy algorithms perform similarly to one another and better than passive learning (random selection)
Weakness of Gibbs Error
• A labeling is considered incorrect if even one component does not agree
[Figure: example labelings of web-page graphs with Faculty, Project, and Student nodes]
Generalized Gibbs Error
• Generalize the Gibbs error to use a loss function L
• Examples: Hamming loss, 1 − F-score, etc.
• Reduces to the Gibbs error when L(y, y') = 1 − δ(y, y'), where δ(y, y') = 1 when y = y' and δ(y, y') = 0 otherwise
[Figure: labelings y1, y2, y3, y4]
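One reading that is consistent with the reduction to the Gibbs error is to take the expected loss between two labelings drawn independently from the distribution. The sketch below adopts that reading as an assumption; the distribution and function names are illustrative.

```python
def generalized_gibbs_error(dist, loss):
    """Expected loss between two labelings drawn independently from dist (assumed form)."""
    return sum(p * q * loss(y, z) for y, p in dist.items() for z, q in dist.items())

def zero_one_loss(y, z):
    """L(y, z) = 1 - delta(y, z): any disagreement counts as a full error."""
    return 0.0 if y == z else 1.0

def hamming_loss(y, z):
    """Fraction of components on which the two labelings disagree."""
    return sum(a != b for a, b in zip(y, z)) / len(y)

# Hypothetical distribution over labelings of two binary nodes.
dist = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

plain_gibbs = 1.0 - sum(p * p for p in dist.values())
with_zero_one = generalized_gibbs_error(dist, zero_one_loss)
with_hamming = generalized_gibbs_error(dist, hamming_loss)
```

With the zero-one loss the value matches the plain Gibbs error, and with the Hamming loss it is smaller because partially correct labelings are no longer counted as fully wrong.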
• Generalized policy Gibbs error (to maximize) = generalized Gibbs error − remaining weighted generalized Gibbs error (over labelings that agree with y on ρ)
• The generalized policy Gibbs error is the average of a function that we call the generalized version space reduction function
• Unfortunately, this function is not adaptive submodular for arbitrary L
• However, the generalized version space reduction function is pointwise submodular
• It has a good approximation in the worst case
• Hedging against the worst-case labeling may be too conservative
• Can instead hedge against the total generalized version space among the surviving labelings
[Figure: two diagrams over labelings y1, y2, y3, y4 contrasting the two choices]
• Call this the total generalized version space reduction function
• The total generalized version space reduction function is pointwise submodular
• It has a good approximation in the worst case
Summary
Experiments
• Text classification
• 20 Newsgroups dataset
• Classify 7 pairs of newsgroups
• AUC for classification error
• Max Gibbs error vs. total generalized version space with Hamming loss
Acknowledgements
• Joint work with
• Nguyen Viet Cuong (NUS)
• Ye Nan (NUS)
• Adam Chai (DSO)
• Chieu Hai Leong (DSO)