
Active Learning for Probabilistic Models


Presentation Transcript


  1. Active Learning for Probabilistic Models. Lee Wee Sun, Department of Computer Science, National University of Singapore (leews@comp.nus.edu.sg). LARC-IMS Workshop.

  2. Probabilistic Models in Networked Environments • Probabilistic graphical models are powerful tools in networked environments • Example task: given some labeled nodes, what are the labels of the remaining nodes? • May also need to learn the parameters of the model (covered later) (Figure: labeling university web pages with a CRF; some nodes are labeled Faculty, Student, or Project, the rest are unknown)

  3. Active Learning • Given a budget of k queries, which nodes should we query to maximize performance on the remaining nodes? • What are reasonable performance measures with provable guarantees for greedy methods? (Figure: labeling university web pages with a CRF)

  4. Entropy • First consider non-adaptive policies • Chain rule of entropy: H(Y1, Y2) = H(Y1) + H(Y2 | Y1), where Y1 is the set of selected variables and Y2 the remaining ones • The left-hand side is a constant of the model, so maximizing the entropy H(Y1) of the selected variables minimizes the conditional entropy H(Y2 | Y1) of the remaining target variables
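
  As a quick illustration of the chain rule (not from the slides), the sketch below computes H(Y1, Y2), H(Y1), and H(Y2 | Y1) on a small made-up joint distribution; since the joint entropy is fixed, a larger H(Y1) necessarily means a smaller H(Y2 | Y1).

    import math

    # Hypothetical joint distribution p(y1, y2) over two binary variables.
    p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

    def entropy(dist):
        """Shannon entropy (in nats) of a distribution given as {outcome: prob}."""
        return -sum(q * math.log(q) for q in dist.values() if q > 0)

    # Marginal p(y1).
    p_y1 = {}
    for (y1, _), q in p.items():
        p_y1[y1] = p_y1.get(y1, 0.0) + q

    h_joint = entropy(p)                  # H(Y1, Y2): a constant of the model
    h_selected = entropy(p_y1)            # H(Y1): what the selection controls
    h_remaining = h_joint - h_selected    # H(Y2 | Y1): the quantity to minimize
    print(h_joint, h_selected, h_remaining)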

  5. Greedy method • Given the already selected set S, add the variable Yi that maximizes the conditional entropy H(Yi | S) • Near optimality: the greedy set achieves at least a (1 - 1/e) fraction of the optimal entropy, because of the submodularity of entropy (see the sketch below).
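
  A minimal sketch of this greedy rule, assuming the joint distribution is available as an explicit table (a real graphical model would use probabilistic inference instead); the toy weights below are made up for illustration.

    import math
    from itertools import product

    # Toy joint distribution over 3 binary variables, with Y1 and Y2 correlated.
    weights = {lab: (3.0 if lab[0] == lab[1] else 1.0) for lab in product([0, 1], repeat=3)}
    Z = sum(weights.values())
    joint = {lab: w / Z for lab, w in weights.items()}

    def entropy_of(subset):
        """Shannon entropy of the variables indexed by `subset`."""
        marg = {}
        for lab, q in joint.items():
            key = tuple(lab[i] for i in subset)
            marg[key] = marg.get(key, 0.0) + q
        return -sum(q * math.log(q) for q in marg.values() if q > 0)

    def greedy_select(n, k):
        """Repeatedly add the Y_i with the largest gain H(S + {i}) - H(S) = H(Y_i | S)."""
        S = []
        for _ in range(k):
            gain = lambda i: entropy_of(S + [i]) - entropy_of(S)
            S.append(max((i for i in range(n) if i not in S), key=gain))
        return S

    print(greedy_select(n=3, k=2))   # [0, 2] for this toy table: Y2 adds little once Y1 is known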

  6. Submodularity • Diminishing return property: for a set function f, adding an element x to a smaller set S gives at least as much gain as adding it to a larger set T containing S, i.e. f(S ∪ {x}) - f(S) ≥ f(T ∪ {x}) - f(T)
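
  A small numeric check of this diminishing return property for entropy, on a made-up chain-correlated toy table: the gain from adding Y3 to the smaller set {Y1} is at least the gain from adding it to the larger set {Y1, Y2}.

    import math
    from itertools import product

    # Toy joint over (Y1, Y2, Y3) with Y1~Y2 and Y2~Y3 correlations (illustrative).
    weights = {lab: 2.0 ** ((lab[0] == lab[1]) + (lab[1] == lab[2]))
               for lab in product([0, 1], repeat=3)}
    Z = sum(weights.values())
    joint = {lab: w / Z for lab, w in weights.items()}

    def H(subset):
        marg = {}
        for lab, q in joint.items():
            key = tuple(lab[i] for i in subset)
            marg[key] = marg.get(key, 0.0) + q
        return -sum(q * math.log(q) for q in marg.values() if q > 0)

    gain_small = H([0, 2]) - H([0])        # adding Y3 to S = {Y1}
    gain_large = H([0, 1, 2]) - H([0, 1])  # adding Y3 to T = {Y1, Y2}, a superset of S
    assert gain_small >= gain_large - 1e-12
    print(gain_small, gain_large)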

  7. Adaptive Policy • What about adaptive policies, where each query can depend on the answers to the previous ones? (Figure: a non-adaptive selection of k nodes vs. an adaptive policy tree of depth k)

  8. Let ρ be a path down the policy tree, and let the policy entropy be H(π) = -Σρ p(ρ) log p(ρ). Then we can show H(YG) = H(π) + Σρ p(ρ) H(YG | ρ), where YG is the graph labeling • This corresponds to the chain rule in the non-adaptive case: maximizing the policy entropy minimizes the expected conditional entropy

  9. Recap: the greedy algorithm is near-optimal in the non-adaptive case • For the adaptive case, consider the greedy algorithm that selects the variable with the largest entropy conditioned on the observations so far • Unfortunately, for the adaptive case we can show that, for every α > 0, there is a probabilistic model on which the greedy policy achieves less than an α fraction of the optimal policy entropy, so no constant-factor guarantee is possible

  10. Tsallis Entropy and Gibbs Error • In statistical mechanics, Tsallis entropy Sq(p) = (1 - Σy p(y)^q) / (q - 1) is a generalization of Shannon entropy • Shannon entropy is the special case q = 1 (as a limit) • We call the case q = 2 the Gibbs error: 1 - Σy p(y)^2
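
  In code, assuming the standard form Sq(p) = (1 - Σy p(y)^q) / (q - 1) with natural logarithms for the Shannon case:

    import math

    def tsallis_entropy(probs, q):
        """Tsallis entropy S_q(p) = (1 - sum_y p(y)^q) / (q - 1)."""
        return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

    def shannon_entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    def gibbs_error(probs):
        """The q = 2 case: 1 - sum_y p(y)^2."""
        return 1.0 - sum(p * p for p in probs)

    p = [0.7, 0.2, 0.1]                                   # illustrative label distribution
    print(gibbs_error(p), tsallis_entropy(p, 2.0))        # identical by definition
    print(shannon_entropy(p), tsallis_entropy(p, 1.001))  # q -> 1 approaches Shannon entropy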

  11. Properties of Gibbs Error • Gibbs error is the expected error of the Gibbs classifier • Gibbs classifier: draw a labeling from the distribution and use it as the prediction • The Gibbs error is at most twice the Bayes (best possible) error
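
  A tiny check of both properties on a hypothetical posterior over three labels: when the true label is also distributed according to p, the Gibbs classifier's expected error equals the Gibbs error, and it lies between the Bayes error and twice the Bayes error.

    p = [0.5, 0.3, 0.2]                    # hypothetical posterior over three labels

    # Expected error of the Gibbs classifier: prediction and true label both drawn from p.
    gibbs = 1.0 - sum(q * q for q in p)

    # Bayes error: always predict the most probable label.
    bayes = 1.0 - max(p)

    assert bayes <= gibbs <= 2 * bayes     # "at most twice the best possible error"
    print(gibbs, bayes)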

  12. Lower bound to entropy • The Gibbs error 1 - Σy p(y)^2 is a lower bound on the Shannon entropy H(p), since -ln p(y) ≥ 1 - p(y) • Maximizing the policy Gibbs error therefore maximizes a lower bound on the policy entropy

  13. Policy Gibbs error

  14. Maximizing the policy Gibbs error minimizes the expected weighted posterior Gibbs error • So each query makes progress on either the version space or the posterior Gibbs error (Figure: version space and posterior Gibbs error)

  15. Gibbs Error and Adaptive Policies • Greedy algorithm: select the node i with the largest Gibbs error conditioned on the observations so far (see the sketch below) • Near-optimality holds for policy Gibbs error (in contrast to policy entropy)
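
  A minimal sketch of this adaptive greedy rule, assuming the joint distribution is given as an explicit table and the annotator is simulated by a fixed true labeling (both are stand-ins for a real CRF and a human labeler):

    from itertools import product

    # Toy joint distribution over 3 binary variables (illustrative only).
    weights = {lab: (3.0 if lab[0] == lab[1] else 1.0) for lab in product([0, 1], repeat=3)}
    Z = sum(weights.values())
    joint = {lab: w / Z for lab, w in weights.items()}

    def conditional_marginal(observed, i):
        """p(Y_i | observed labels), computed from the explicit joint table."""
        marg = {}
        for lab, q in joint.items():
            if all(lab[j] == v for j, v in observed.items()):
                marg[lab[i]] = marg.get(lab[i], 0.0) + q
        total = sum(marg.values())
        return {y: q / total for y, q in marg.items()}

    def greedy_gibbs_queries(n, k, oracle):
        observed = {}
        for _ in range(k):
            def cond_gibbs(i):
                m = conditional_marginal(observed, i)
                return 1.0 - sum(q * q for q in m.values())
            i = max((j for j in range(n) if j not in observed), key=cond_gibbs)
            observed[i] = oracle(i)          # query the annotator for node i
        return observed

    truth = (1, 0, 1)                        # simulated true labeling
    print(greedy_gibbs_queries(n=3, k=2, oracle=lambda i: truth[i]))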

  16. Proof idea: • Show that the policy Gibbs error equals the expected version space reduction • The version space is the total probability of the remaining labelings on the unlabeled nodes (those consistent with the labels observed so far); a small sketch follows below • The version space reduction function is adaptive submodular, which gives the required guarantee for policy Gibbs error (using the result of Golovin and Krause) (Figure: version space)
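
  For concreteness, a version space helper over the same kind of explicit toy table (an assumption; the talk's CRF setting would compute this with probabilistic inference):

    from itertools import product

    # Toy joint distribution (illustrative).
    weights = {lab: (3.0 if lab[0] == lab[1] else 1.0) for lab in product([0, 1], repeat=3)}
    Z = sum(weights.values())
    joint = {lab: w / Z for lab, w in weights.items()}

    def version_space(observed):
        """Total probability of labelings consistent with the observed labels."""
        return sum(q for lab, q in joint.items()
                   if all(lab[j] == v for j, v in observed.items()))

    # Observing Y_0 = 1 removes the probability mass of every labeling with Y_0 = 0.
    print(version_space({}), version_space({0: 1}))    # 1.0 and 0.5 for this table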

  17. Adaptive Submodularity • Diminishing return property for adaptive policies • Consider the change in version space when query xi is appended to path ρ and answer y is received • The version space reduction is adaptive submodular: the expected reduction from asking xi after a longer path ρ' extending ρ is no larger than the reduction after ρ (Figure: paths ρ and ρ' in the policy tree, with query x3)

  18. Worst Case Version Space • Maximizing policy Gibbs error maximizes the expected version space reduction • A related greedy algorithm: select the least confident variable, i.e. the variable with the smallest maximum label probability (sketch below) • This approximately maximizes the worst-case version space reduction
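
  The least-confidence rule itself is a one-liner; the per-node marginals below are made-up numbers standing in for the model's conditional marginals given the labels observed so far.

    # Hypothetical conditional marginals for three unlabeled pages.
    marginals = {
        "page_1": {"Faculty": 0.90, "Student": 0.10},
        "page_2": {"Faculty": 0.40, "Student": 0.35, "Project": 0.25},
        "page_3": {"Project": 0.60, "Student": 0.40},
    }

    # Least confident node: smallest maximum label probability.
    query = min(marginals, key=lambda node: max(marginals[node].values()))
    print(query)   # "page_2": its best label only has probability 0.40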

  19. The greedy strategy that selects the least confident variable achieves a provable worst-case approximation guarantee, because the version space reduction function is pointwise submodular

  20. Pointwise Submodularity • Let V(S, y) be the version space remaining if y is the true labeling of all nodes and the subset S has been labeled • 1 - V(S, y) is pointwise submodular, as it is submodular (as a function of S) for every fixed labeling y

  21. Summary So Far …

  22. Learning Parameters • Take a Bayesian approach: put a prior over the parameters and integrate them away when computing the probability of a labeling (a sketch follows below) • Also works in the commonly encountered pool-based active learning scenario (independent instances, with no dependencies other than through the parameters)
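
  One common way to realize "integrate away the parameters" is Monte Carlo averaging over posterior samples; this sketch uses hypothetical posterior_samples and predict_proba stand-ins rather than the talk's actual Bayesian CRF machinery.

    def marginal_label_probs(posterior_samples, predict_proba, x):
        """Approximate p(y | x) = integral of p(y | x, theta) dP(theta | data)
        by averaging predictions over samples theta_m drawn from the posterior."""
        avg = {}
        for theta in posterior_samples:
            for y, q in predict_proba(theta, x).items():
                avg[y] = avg.get(y, 0.0) + q / len(posterior_samples)
        return avg

    # Toy usage: two "posterior samples" that disagree about an instance.
    table = {"theta_a": {"pos": 0.8, "neg": 0.2}, "theta_b": {"pos": 0.4, "neg": 0.6}}
    print(marginal_label_probs(["theta_a", "theta_b"], lambda th, x: table[th], x=None))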

  23. Experiments • Named entity recognition with a Bayesian CRF on the CoNLL 2003 dataset • The greedy algorithms perform similarly to one another and better than passive learning (random selection)

  24. Weakness of Gibbs Error • A labeling is considered incorrect if even one component disagrees (Figure: example graph labelings over Faculty, Project, and Student nodes)

  25. Generalized Gibbs Error • Generalize the Gibbs error to use a loss function L (a hedged sketch follows below) • Examples: Hamming loss, 1 - F-score, etc. • Reduces to the Gibbs error when L(y, y') = 1 - δ(y, y'), where δ(y, y') = 1 when y = y' and δ(y, y') = 0 otherwise (Figure: labelings y1, y2, y3, y4)
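
  One reading consistent with the reduction stated on this slide is the expected loss between two independent draws from the labeling distribution; with the 0-1 loss this is exactly 1 - Σy p(y)^2. A sketch with a made-up labeling distribution:

    from itertools import product

    # Toy labeling distribution (illustrative).
    weights = {lab: (3.0 if lab[0] == lab[1] else 1.0) for lab in product([0, 1], repeat=3)}
    Z = sum(weights.values())
    p = {lab: w / Z for lab, w in weights.items()}

    def generalized_gibbs_error(p, loss):
        """Expected loss between two independent labelings drawn from p."""
        return sum(p[y] * p[y2] * loss(y, y2) for y in p for y2 in p)

    hamming = lambda y, y2: sum(a != b for a, b in zip(y, y2)) / len(y)
    zero_one = lambda y, y2: 0.0 if y == y2 else 1.0

    print(generalized_gibbs_error(p, hamming))
    # With the 0-1 loss this equals the plain Gibbs error 1 - sum_y p(y)^2.
    print(generalized_gibbs_error(p, zero_one), 1.0 - sum(q * q for q in p.values()))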

  26. Generalized policy Gibbs error (the quantity to maximize) (Equation labels on the slide: generalized Gibbs error; remaining weighted generalized Gibbs error over labelings that agree with y on the path ρ)

  27. The generalized policy Gibbs error is the average (over labelings) of a quantity we call the generalized version space reduction function • Unfortunately, this function is not adaptive submodular for an arbitrary loss L (Figure: labelings y1, y2, y3, y4)

  28. However, the generalized version space reduction function is pointwise submodular • So it has a good approximation guarantee in the worst case (Figure: labelings y1, y2, y3, y4)

  29. Hedging against the worst-case labeling may be too conservative • Instead, we can hedge against the total generalized version space among the surviving labelings (Figure: summing over the surviving labelings y1 to y4 instead of taking the single worst case)

  30. Call this the total generalized version space reduction function • The total generalized version space reduction function is pointwise submodular • So it has a good approximation guarantee in the worst case

  31. Summary

  32. Experiments • Text classification on the 20 Newsgroups dataset: classify 7 pairs of newsgroups • Report the AUC of the classification error curve • Compare maximum Gibbs error selection vs. total generalized version space reduction with Hamming loss

  33. Acknowledgements • Joint work with • Nguyen Viet Cuong (NUS) • Ye Nan (NUS) • Adam Chai (DSO) • Chieu Hai Leong (DSO)
