
Bootstrapping without the Boot

Bootstrapping without the Boot. http://cs.jhu.edu/~jason/papers/#emnlp05-strapping. Jason Eisner, Damianos Karakos. IPAM Document Space Workshop, January 2006. Basically a talk about clustering: get a clustering algorithm (unsupervised) from (~any) classifier learning algorithm (supervised).


Presentation Transcript


  1. Bootstrapping without the Boot • http://cs.jhu.edu/~jason/papers/#emnlp05-strapping • Jason Eisner, Damianos Karakos • IPAM Document Space Workshop, January 2006

  2. Basically a talk about clustering • Get a clustering algorithm (unsupervised) from (~any) classifier learning algorithm (supervised). • Lets you do domain-specific clustering. • Cute idea, works well • Builds on some tricks often used in natural language processing • But essentially procedural • Like the tricks that it builds on • Not clear what is being optimized

  3. First, a version for the “space” people • Want to learn a red/blue classifier • Nearest neighbor, kernel SVM, fit a surface … In most of the talk, we’ll be classifying linguistic objects (words in their contexts) using linguistic features

  4. First, a version for the “space” people • Harder if you have less training data • But maybe you can use the unlabeled data (don’t know colors, but do know positions)

  5. First, a version for the “space” people • How unlabeled can you go? • (Could you use single-link clustering? No, these would be joined early.)

  6. “Bootstrapping” • Start with a few labeled points • How unlabeled can you go? • Let’s try “bootstrapping” (not the statistical bootstrap)

  7. “Bootstrapping” • Start with a few labeled points • From those, learn your favorite classifier (let’s use 1-NN to make this example work)

  8. “Bootstrapping” • Start with a few labeled points • From those, learn your favorite classifier (let’s use 1-NN to make this example work) • Use that to label the rest of the points … • Oops! • Doesn’t work even with soft labeling (as in EM) • Sparse good data immediately get swamped by bad guesses

  9. “Bootstrapping” • Start with a few labeled points • From those, learn your favorite classifier (let’s use 1-NN to make this example work) • Use that to label just a few more points, where the classifier is most confident (define that!)

  10. “Bootstrapping” • Start with a few labeled points • From those, learn your favorite classifier (let’s use 1-NN to make this example work) • Use that to label just a few more points, where the classifier is most confident (define that!) • Goto step 2! • When will this work?

  11. “Bootstrapping” • Start with a few labeled points • From those, learn your favorite classifier (let’s use 1-NN to make this example work) • Use that to label just a few more points, where the classifier is most confident (define that!) • Goto step 2! • When will this work? • Depends on where you start … • but real datasets may allow many good starting points

  12. “Bootstrapping” • Start with a few labeled points • From those, learn your favorite classifier (let’s use 1-NN to make this example work) • Use that to label just a few more points, where the classifier is most confident (define that!) • Goto step 2! • Here’s a really bad starting point • incorrect! red & blue actually in same class • but even if we pick at random, ½ chance of different classes
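
The loop on the slides above (start from a few labels, learn 1-NN, label only the most confident points, go to step 2) can be sketched roughly as follows. This is a minimal illustration, not the speakers' code: the function name `bootstrap_1nn`, the `batch` parameter, and the choice of "distance to the nearest labeled point" as the confidence measure are all assumptions made for the example.

```python
import math

def bootstrap_1nn(points, seed_labels, batch=1):
    """Self-training with a 1-nearest-neighbor classifier.

    points: list of (x, y) tuples.
    seed_labels: dict {point index: label} for the few labeled points.
    batch: number of new points labeled per round (hypothetical knob).
    Returns a dict {point index: label} covering every point.
    """
    labels = dict(seed_labels)
    while len(labels) < len(points):
        candidates = []
        for i, p in enumerate(points):
            if i in labels:
                continue
            # 1-NN over the points labeled so far.
            j = min(labels, key=lambda k: math.dist(p, points[k]))
            # Assumed confidence: a closer labeled neighbor means more confident.
            candidates.append((math.dist(p, points[j]), i, labels[j]))
        candidates.sort()
        # Label only the most confident few, then retrain (i.e., loop again).
        for _, i, lab in candidates[:batch]:
            labels[i] = lab
    return labels

# Toy data (made up): two well-separated rows of points, one seed per row.
red = [(float(x), 0.0) for x in range(5)]
blue = [(float(x), 10.0) for x in range(5)]
points = red + blue
result = bootstrap_1nn(points, {0: "red", 9: "blue"})
```

With one seed in each row, the labels spread along each row one neighbor at a time. Placing both seeds in the same row would reproduce the "really bad starting point" case.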

  13. Executive Summary (if you’re not an executive, you may stay for the rest of the talk) • What: • We like minimally supervised learning (bootstrapping). • Let’s convert it to unsupervised learning (“strapping”). • How: • If the supervision is so minimal, let’s just guess it! • Lots of guesses → lots of classifiers. • Try to predict which one looks plausible (!?!). • We can learn to make such predictions. • Results (on word-sense disambiguation, WSD): • Performance actually goes up! • (Unsupervised WSD for translational senses, English Hansards, 14M words.)

  14. WSD by bootstrapping • we know “plant” has 2 senses • we hand-pick 2 words that indicate the desired senses, e.g., (leaves, machinery) or (life, manufacturing) • use the word pair to “seed” some bootstrapping procedure • result: a classifier that attempts to classify all tokens of “plant”

  15. Yarowsky’s bootstrapping algorithm • The minimally supervised scenario from which we’ll eliminate supervision today: We chose to work on word-sense disambiguation and bootstrap decision-list classifiers using the method of Yarowsky (1995). • Possible future work: other tasks? other classifiers? other bootstrappers?

  16. Yarowsky’s bootstrapping algorithm (life, manufacturing) • [Table taken from Yarowsky (1995): target word: plant; life (1%), 98%, manufacturing (1%)]

  17. Yarowsky’s bootstrapping algorithm (life, manufacturing) • Learn a classifier that distinguishes A from B. It will notice features like “animal” → A, “automate” → B. • [Figure taken from Yarowsky (1995)]

  18. Yarowsky’s bootstrapping algorithm (life, manufacturing) • That confidently classifies some of the remaining examples. • Now learn a new classifier and repeat … & repeat … & repeat … • [Figure taken from Yarowsky (1995)]

  19. Yarowsky’s bootstrapping algorithm (life, manufacturing) • Should be a good classifier, unless we accidentally learned some bad cues along the way that polluted the original sense distinction. • [Figure taken from Yarowsky (1995)]

  20. Yarowsky’s bootstrapping algorithm (life, manufacturing) • [Table taken from Yarowsky (1995)]

  21. Other applications • Yarowsky (1995) has had a lot of influence … • ~ 10 bootstrapping papers at EMNLP’05 • Examples (see paper): • Is a webpage at CMU a course home page? • Is a webpage relevant to the user? (relevance feedback) • Is an English noun subjective (e.g., “atrocious”)? • Is a noun phrase a person, organization, or location? • Is this noun masculine or feminine? (in Spanish, Romanian,…) • Is this substring a noun phrase? • Any example of EM … including grammar induction

  22. WSD by bootstrapping • we know “plant” has 2 senses • we hand-pick 2 words that indicate the desired senses • use the word pair to “seed” some bootstrapping procedure • result: a classifier that attempts to classify all tokens of “plant” • [Plot: fertility f(s) (actual task performance of classifier; today, we’ll judge accuracy against a gold standard) vs. seed s; marked seeds: (leaves, machinery), (life, manufacturing); baseline shown]

  23. How do we automatically choose among seeds? • Want to maximize fertility but we can’t measure it! • Unsupervised learning can’t see any gold standard • “Did I find the sense distinction they wanted? Who the heck knows?” • [Plot: fertility f(s) (actual task performance of classifier; today, we’ll judge accuracy against a gold standard) vs. seed s; marked seeds: (leaves, machinery), (life, manufacturing); baseline shown]

  24. How do we choose among seeds? • Want to maximize fertility but we can’t measure it! • Traditional answer: Intuition helps you pick a seed. Your choice tells the bootstrapper about the two senses you want. “As long as you give it a good hint, it will do okay.” • [Plot: fertility f(s) (actual task performance of classifier; today, we’ll judge accuracy against a gold standard) vs. seed s; marked seed: (life, manufacturing)]

  25. Why not pick a seed by hand? • Your intuition might not be trustworthy (even a sensible seed could go awry) • You don’t speak the language / sublanguage • You want to bootstrap lots of classifiers • All words of a language • Multiple languages • On ad hoc corpora, i.e., results of a search query • You’re not sure that # of senses = 2 • (life, manufacturing) vs. (life, manufacturing, sow) • which works better?

  26. How do we choose among seeds? • Want to maximize fertility f(s) but we can’t measure it! • Our answer: Bad classifiers smell funny. Stick with the ones that smell like real classifiers. • [Plot: predicted fertility h(s) vs. seed s]

  27. “Strapping” • Somehow pick a bunch of candidate seeds • For each candidate seed s: • grow a classifier Cs • compute h(s) (i.e., guess whether s was fertile) • Return Cs where s maximizes h(s) (single classifier that we guess to be best; future work: return a combination of classifiers?) • This name is supposed to remind you of bagging and boosting, which also train many classifiers. (But those methods are supervised, & have theorems …)
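
As a rough sketch, the strapping procedure on this slide is an argmax over candidate seeds. Everything named below (`strap`, `grow`, `h`, and the toy seeds and scores) is a placeholder for illustration; the real `grow` would be a bootstrapper and the real `h` a fertility predictor.

```python
def strap(candidate_seeds, grow, h):
    """Grow one classifier per seed and keep the one with the highest h(s).

    grow(s)  -> classifier C_s bootstrapped from seed s (supplied by caller).
    h(s, c)  -> predicted fertility of seed s given its classifier c.
    Returns (best seed, its classifier, its predicted fertility).
    """
    best = None
    for s in candidate_seeds:
        c_s = grow(s)
        score = h(s, c_s)
        if best is None or score > best[2]:
            best = (s, c_s, score)
    return best

# Toy usage with made-up seeds and made-up scores, only to show the call shape.
toy_scores = {"s1": 0.2, "s2": 0.7, "s3": 0.5}
best_seed, best_clf, best_score = strap(
    candidate_seeds=["s1", "s2", "s3"],
    grow=lambda s: "classifier grown from " + s,
    h=lambda s, c: toy_scores[s],
)
```

Returning only the single top-scoring classifier matches the slide; combining several classifiers is listed there as future work.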

  28. Data for this talk • Unsupervised learning from 14M English words (transcribed formal speech). • Focus on 6 ambiguous word types (ambiguous words from Gale, Church, & Yarowsky (1992)): • drug, duty, land, language, position, sentence • each has from 300 to 3000 tokens • To learn an English → French MT model, we would first hope to discover the 2 translational senses of each word. • e.g., drug1 = medicament, drug2 = drogue; sentence1 = peine, sentence2 = phrase

  29. Data for this talk • Unsupervised learning from 14M English words (transcribed formal speech). • Focus on 6 ambiguous word types (ambiguous words from Gale, Church, & Yarowsky (1992)): • drug, duty, land, language, position, sentence • try to learn these distinctions monolingually (assume insufficient bilingual data to learn when to use each translation) • e.g., drug1 = medicament, drug2 = drogue; sentence1 = peine, sentence2 = phrase

  30. Data for this talk • Unsupervised learning from 14M English words (transcribed formal speech): Canadian parliamentary proceedings (Hansards). • Focus on 6 ambiguous word types (ambiguous words from Gale, Church, & Yarowsky (1992)): • drug, duty, land, language, position, sentence • but evaluate bilingually: for this corpus, happen to have a French translation → gold standard for the senses we want. • e.g., drug1 = medicament, drug2 = drogue; sentence1 = peine, sentence2 = phrase

  31. Strapping word-sense classifiers • Quickly pick a bunch of candidate seeds • For each candidate seed s: • grow a classifier Cs • compute h(s) (i.e., guess whether s was fertile) • Return Cs where s maximizes h(s) • Picking seeds: automatically generate 200 seeds (x,y) • Get x, y to select distinct senses of target t: • x and y each have high mutual information (MI) with t • but x and y never co-occur • Also, for safety: • x and y are not too rare • x isn’t far more frequent than y
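
One way to read the seed-generation criteria on this slide is sketched below. It is an illustration under stated assumptions, not the authors' procedure: it scores words by pointwise mutual information with the target over whole contexts, and the function name `candidate_seeds` and the thresholds `min_count`, `max_ratio`, and `top_k` are invented for the example.

```python
import math
from itertools import combinations

def candidate_seeds(contexts, target, min_count=2, max_ratio=3.0, top_k=20):
    """Propose seed pairs (x, y) for an ambiguous target word.

    contexts: list of token lists (one list per context window or sentence).
    Assumed reading of the criteria: x and y score high on pointwise MI with
    the target, never share a context, are not too rare (min_count), and have
    counts within a factor of max_ratio of each other.
    """
    n = len(contexts)
    sets = [set(c) for c in contexts]
    count, with_target = {}, {}
    for s in sets:
        for w in s:
            count[w] = count.get(w, 0) + 1
            if target in s:
                with_target[w] = with_target.get(w, 0) + 1
    p_t = count[target] / n

    def pmi(w):
        return math.log((with_target[w] / n) / ((count[w] / n) * p_t))

    words = [w for w in count
             if w != target and count[w] >= min_count and with_target.get(w, 0) > 0]
    words.sort(key=lambda w: (-pmi(w), w))  # highest PMI first, ties by name
    seeds = []
    for x, y in combinations(words[:top_k], 2):
        if any(x in s and y in s for s in sets):
            continue  # x and y must never co-occur
        hi, lo = max(count[x], count[y]), min(count[x], count[y])
        if hi / lo <= max_ratio:
            seeds.append((x, y))
    return seeds

# Tiny made-up corpus, only to exercise the function.
toy_contexts = [
    ["plant", "leaves", "green"],
    ["plant", "leaves", "water"],
    ["plant", "machinery", "factory"],
    ["plant", "machinery", "workers"],
    ["the", "water", "is", "green"],
]
toy_seeds = candidate_seeds(toy_contexts, "plant", top_k=2)
```

On the toy corpus the only surviving pair is ("leaves", "machinery"): both words occur only with "plant" and never in the same context.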

  32. Strapping word-sense classifiers • Quickly pick a bunch of candidate seeds • For each candidate seed s: • grow a classifier Cs • compute h(s) (i.e., guess whether s was fertile) • Return Cs where s maximizes h(s) • Growing a classifier: replicate Yarowsky (1995) (with fewer kinds of features, and some small algorithmic differences)

  33. Strapping word-sense classifiers • Quickly pick a bunch of candidate seeds • For each candidate seed s: • grow a classifier Cs • compute h(s) (i.e., guess whether s was fertile) • Return Cs where s maximizes h(s) • h(s) is the interesting part.

  34. Strapping word-sense classifiers • Quickly pick a bunch of candidate seeds • For each candidate seed s: • grow a classifier Cs • compute h(s) (i.e., guess whether s was fertile) • Return Cs where s maximizes h(s) • For comparison, hand-picked 2 seeds: • Casually selected (< 2 min.): one author picked a reasonable (x,y) from the 200 candidates. • Carefully constructed (< 10 min.): other author studied gold standard, then separately picked high-MI x and y that retrieved appropriate initial examples.

  35. Strapping word-sense classifiers • Quickly pick a bunch of candidate seeds • For each candidate seed s: • grow a classifier Cs • compute h(s) (i.e., guess whether s was fertile) • Return Cs where s maximizes h(s) • h(s) is the interesting part. How can you possibly tell, without supervision, whether a classifier is any good?

  36. Unsupervised WSD as clustering • Easy to tell which clustering is “best” • A good unsupervised clustering has high • p(data | label) – minimum-variance clustering • p(data) – EM clustering • MI(data, label) – information bottleneck clustering • [Figure: example clusterings of + / – points, labeled “bad”, “skewed”, and “good”]

  37. Clue #1: Confidence of the classifier (oversimplified slide) • “Yes! These tokens are sense A! And these are B!” (though this could be overconfidence: may have found the wrong senses) • “Um, maybe I found some senses, but I’m not sure.” (though maybe the senses are truly hard to distinguish) • Final decision list for Cs • Does it confidently classify the training tokens, on average? • Opens the “black box” classifier to assess confidence (but so does bootstrapping itself) • possible variants, e.g., is the label overdetermined by many features?

  38. Clue #1: Confidence of the classifier • Q: For an SVM kernel classifier, what is confidence? • A: We are more confident in a large-margin classifier. • This leads to semi-supervised SVMs: • A labeling smells good if there exists a large-margin classifier for it • De Bie & Cristianini 2003, Xu et al 2004 optimize over all labelings, not restricting to bootstrapped ones as we do

  39. Clue #2: Agreement with other classifiers • “I like my neighbors.” vs. “I seem to be odd tree out around here …” • Intuition: for WSD, any reasonable seed s should find a true sense distinction. • So it should agree with some other reasonable seeds r that find the same distinction. • Cs: + + - - + + - + + + • Cr: + + - + + - - + + - • prob of agreeing this well by chance?

  40. Clue #2: Agreement with other classifiers

  41. Clue #2: Agreement with other classifiers • “I like my neighbors.” vs. “I seem to be odd tree out around here …” • Intuition: for WSD, any reasonable seed s should find a true sense distinction. • So it should agree with some other reasonable seeds r that find the same distinction. • Cs: + + - - + + - + + + • Cr: + + - + + - - + + - • prob of agreeing this well by chance?

  42. Clue #2: Agreement with other classifiers • Remember, ½ of starting pairs are bad (on same spiral) • But they all lead to different partitions: poor agreement! • The other ½ all lead to the same correct 2-spiral partition • (if spirals are dense and well-separated)
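
The “prob of agreeing this well by chance?” question under Clue #2 can be illustrated on the Cs and Cr label strings from the slides. The sketch below assumes a very simple null model (each token agrees or disagrees with probability ½, and the two sense names are interchangeable); the agreement statistic actually used in the paper may differ, and the function names are hypothetical.

```python
from math import comb

def agreement(labels_s, labels_r):
    """Number of tokens on which two classifiers agree, up to swapping the
    two sense names (the labels + and - are arbitrary)."""
    same = sum(a == b for a, b in zip(labels_s, labels_r))
    return max(same, len(labels_s) - same)

def prob_by_chance(k, n):
    """P(agreement >= k) if each of n tokens agreed with probability 1/2
    (assumed fair-coin null model, with sense names interchangeable)."""
    favorable = sum(comb(n, x) for x in range(n + 1) if max(x, n - x) >= k)
    return favorable / 2 ** n

# The two label sequences shown on the slide.
c_s = "++--++-+++"
c_r = "++-++--++-"
k = agreement(c_s, c_r)               # 7 of 10 tokens agree
p = prob_by_chance(k, len(c_s))       # 352 / 1024 = 0.34375
```

Under this null model, agreeing on 7 of 10 tokens is unremarkable. A seed pair that agreed on nearly all of several hundred tokens would get a vanishingly small chance probability, which is the signal the clue looks for.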

  43. Clue #3: Robustness of the seed • Robust seed grows the same in any soil. • Can’t trust an unreliable seed: it never finds the same sense distinction twice. • Cs was trained on the original dataset. • Construct 10 new datasets by resampling the data (“bagging”). • Use seed s to bootstrap a classifier on each new dataset. • How well, on average, do these agree with the original Cs? (again use prob of agreeing this well by chance) • possible variant: robustness under changes to feature space (not changes to data)
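
The resampling procedure for Clue #3 might be sketched as below. The bootstrapper `grow` and the agreement measure `agree` are supplied by the caller; the toy versions here (a threshold at the mean, and the fraction of matching labels) are stand-ins invented for the example and have nothing to do with WSD.

```python
import random

def robustness(seed, data, grow, agree, n_resamples=10, rng=None):
    """Average agreement between the original classifier and classifiers
    regrown from the same seed on resampled copies of the data.

    grow(seed, data)      -> classifier (any callable), supplied by caller.
    agree(c1, c2, data)   -> agreement score of two classifiers on data.
    """
    rng = rng or random.Random(0)
    original = grow(seed, data)
    scores = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original dataset.
        resampled = [rng.choice(data) for _ in data]
        scores.append(agree(original, grow(seed, resampled), data))
    return sum(scores) / len(scores)

# Toy stand-ins (assumptions): the "classifier" thresholds at the data mean.
def toy_grow(seed, data):
    threshold = sum(data) / len(data)
    return lambda x: x > threshold

def toy_agree(c1, c2, data):
    return sum(c1(x) == c2(x) for x in data) / len(data)

toy_data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
score = robustness(seed=None, data=toy_data, grow=toy_grow, agree=toy_agree)
```

The slide's default of 10 resamples is kept as `n_resamples=10`; swapping `toy_agree` for a chance-corrected agreement measure would follow the slide more closely.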

  44. How well did we predict actual fertility f(s)? • Measure true fertility f(s) for all 200 seeds. • Spearman rank correlation with f(s) (avg correlation over 6 words): • 0.748 Confidence of classifier • 0.785 Agreement with other classifiers • 0.764 Robustness of the seed • 0.794 Average rank of all 3 clues
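
For readers unfamiliar with the evaluation on this slide, here is a small self-contained sketch of Spearman rank correlation and of combining clues by average rank. The helper names are made up, and the numbers in the usage lines are toy values, not results from the talk.

```python
def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def average_rank(*clues):
    """Combine several clue scores per seed by averaging their ranks."""
    per_clue = [ranks(c) for c in clues]
    return [sum(col) / len(col) for col in zip(*per_clue)]

# Toy values (made up): one clue that orders four seeds exactly like f(s).
toy_fertility = [0.5, 0.9, 0.7, 0.6]
toy_clue = [1.0, 4.0, 3.0, 2.0]
rho = spearman(toy_clue, toy_fertility)
combined = average_rank([1, 2, 3], [3, 2, 1])
```

Averaging ranks rather than raw scores sidesteps the fact that the three clues live on different scales.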

  45. Smarter combination of clues? • Really want a “meta-classifier”! • Output: Distinguishes good from bad seeds. • Input: Multiple fertility clues for each seed (amount of confidence, agreement, robustness, etc.) • train: some other corpus; plant, tank; 200 seeds per word; learns “how good seeds behave” for the WSD task; we need gold standard answers so we know which seeds really were fertile • test: English Hansards; drug, duty, land, language, position, sentence; 200 seeds per word; guesses which seeds probably grew into a good sense distinction (callout: “with various’s”)

  46. train: some labeled corpus; plant, tank; 200 seeds per word; learns “what good classifiers look like” for the WSD task • test: English Hansards; drug, duty, land, language, position, sentence; 200 seeds per word; no information provided about the desired sense distinctions • Yes, the test is still unsupervised WSD • Unsupervised WSD research has always relied on supervised WSD instances to learn about the space (e.g., what kinds of features & classifiers work).

  47. How well did we predict actual fertility f(s)? • Spearman rank correlation with f(s): • 0.748 Confidence of classifier • 0.785 Agreement with other classifiers • 0.764 Robustness of the seed • 0.794 Average rank of all 3 clues • 0.851 Weighted average of clues • Includes 4 versions of the “agreement” feature • good weights are learned from supervised instances (plant, tank) • just simple linear regression … might do better with SVM & polynomial kernel …
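
The “weighted average of clues” learned by “simple linear regression” can be pictured with an ordinary least-squares fit, as in the sketch below. This is an illustration under assumptions, not the authors' setup: the function names are invented, the solver is a plain normal-equations solve, and the training rows are made-up numbers rather than clue values from plant and tank.

```python
def fit_weights(clue_rows, fertilities):
    """Least-squares fit of fertility ~ w0 + w1*clue1 + w2*clue2 + ...

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan
    elimination; fine for a handful of clues, not for large problems.
    """
    X = [[1.0] + list(row) for row in clue_rows]
    d = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(d)]
         + [sum(x[i] * y for x, y in zip(X, fertilities))]
         for i in range(d)]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(d):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][d] / A[i][i] for i in range(d)]

def predict(weights, row):
    """h(s): predicted fertility for one seed's clue values."""
    return weights[0] + sum(w * c for w, c in zip(weights[1:], row))

# Made-up training data generated from fertility = 1 + 2*c1 + 3*c2.
train_clues = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
train_fertility = [1.0, 3.0, 4.0, 6.0, 8.0]
w = fit_weights(train_clues, train_fertility)
h_new = predict(w, (3, 2))
```

In the talk's setting, the weights would be fit on seeds for words with gold-standard fertility and then applied unchanged to seeds for the test words.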

  48. How good are the strapped classifiers??? • Words: drug, duty, sentence, land, language, position • Our top pick is the very best seed out of 200 seeds! Wow! (i.e., it agreed best with an unknown gold standard) • Our top pick is the 7th best seed of 200. (The very best seed is our 2nd or 3rd pick.) • Statistically significant wins: 12 of 12 times, 5 of 12 times, 6 of 6 times • strapped classifier (top pick): accuracy 76-90% • classifiers bootstrapped from hand-picked seeds: accuracy 57-88% • chance baseline: 50-87% • Good seeds are hard to find! Maybe because we used only 3% as much data as Yarowsky (1995), & fewer kinds of features.

  49. Hard word, low baseline: drug • [Plot: actual fertility and our score for each seed; rank-correlation = 89%; marked: top pick; most confident, robust, and agreeable seeds; hand-picked seeds; baseline]

  50. Hard word, high baseline: land • most perform below baseline • [Plot: actual fertility and our score for each seed; rank-correlation = 75%; marked: top pick; most confident, robust, and agreeable seeds; hand-picked seeds; lowest possible (50%)]
