
Learning Ensembles of First-Order Clauses That Optimize Precision-Recall Curves

Mark Goadrich, Computer Sciences Department, University of Wisconsin - Madison. Ph.D. Defense, August 13th, 2007.





Presentation Transcript


  1. Learning Ensembles of First-Order Clauses That Optimize Precision-Recall Curves Mark Goadrich, Computer Sciences Department, University of Wisconsin - Madison. Ph.D. Defense, August 13th, 2007

  2. Structured Database Biomedical Information Extraction *image courtesy of SEER Cancer Training Site

  3. Biomedical Information Extraction http://www.geneontology.org

  4. Biomedical Information Extraction NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism. ykuD was transcribed by SigK RNA polymerase from T4 of sporulation. Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.

  5. Outline • Biomedical Information Extraction • Inductive Logic Programming • Gleaner • Extensions to Gleaner • GleanerSRL • Negative Salt • F-Measure Search • Clause Weighting (time permitting)

  6. Inductive Logic Programming • Machine Learning • Classify data into categories • Divide data into train and test sets • Generate hypotheses on train set and then measure performance on test set • In ILP, data are Objects … • person, block, molecule, word, phrase, … • and Relations between them • grandfather, has_bond, is_member, …

  7. Seeing Text as Relational Objects • Objects: Word, Phrase, Sentence • Word predicates: verb(…), alphanumeric(…), internal_caps(…) • Phrase predicates: noun_phrase(…), phrase_child(…, …), phrase_parent(…, …) • Sentence predicates: long_sentence(…)

  8. Protein Localization Clause prot_loc(Protein,Location,Sentence) :- phrase_contains_some_alphanumeric(Protein,E), phrase_contains_some_internal_cap_word(Protein,E), phrase_next(Protein,_), different_phrases(Protein,Location), one_POS_in_phrase(Location,noun), phrase_contains_some_arg2_10x_word(Location,_), phrase_previous(Location,_), avg_length_sentence(Sentence).

  9. ILP Background • Seed Example • A positive example that our clause must cover • Bottom Clause • All predicates which are true about the seed example • Clauses are grown from the seed prot_loc(P,L,S), e.g. prot_loc(P,L,S) :- alphanumeric(P) and then prot_loc(P,L,S) :- alphanumeric(P), leading_cap(L)

  10. Clause Evaluation • Prediction vs Actual: Positive or Negative, True or False (TP, FP, TN, FN) • Focus on positive examples • Recall = TP / (TP + FN) • Precision = TP / (TP + FP) • F1 Score = 2PR / (P + R)
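The three metrics on this slide can be sketched directly from confusion-matrix counts; the function name and the example counts below are illustrative, not from the talk:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the slide's three metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of predicted positives, how many are real
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 15 TP, 35 FP, 85 FN
p, r, f = precision_recall_f1(15, 35, 85)
```

With these counts, precision is 0.30, recall 0.15, and F1 0.20, showing how F1 sits between the two but closer to the lower value.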

  11. Protein Localization Clause prot_loc(Protein,Location,Sentence) :- phrase_contains_some_alphanumeric(Protein,E), phrase_contains_some_internal_cap_word(Protein,E), phrase_next(Protein,_), different_phrases(Protein,Location), one_POS_in_phrase(Location,noun), phrase_contains_some_arg2_10x_word(Location,_), phrase_previous(Location,_), avg_length_sentence(Sentence). • Recall 0.15 • Precision 0.51 • F1 Score 0.23

  12. Aleph (Srinivasan ‘03) • Aleph learns theories of clauses • Pick positive seed example • Use heuristic search to find best clause • Pick new seed from uncovered positives and repeat until threshold of positives covered • Sequential learning is time-consuming • Can we reduce time with ensembles? • And also increase quality?

  13. Outline • Biomedical Information Extraction • Inductive Logic Programming • Gleaner • Extensions to Gleaner • GleanerSRL • Negative Salt • F-Measure Search • Clause Weighting

  14. Gleaner (Goadrich et al. ‘04, ‘06) • Definition of Gleaner • One who gathers grain left behind by reapers • Key Ideas of Gleaner • Use Aleph as underlying ILP clause engine • Search clause space with Rapid Random Restart • Keep wide range of clauses usually discarded • Create separate theories for diverse recall

  15. Gleaner - Learning • Create B Bins • Generate Clauses • Record Best per Bin [figure: precision-recall space divided into recall bins]

  16. Gleaner - Learning [figure: the best clause per recall bin is recorded for each seed, Seed 1, Seed 2, Seed 3, …, Seed K]

  17. Gleaner - Ensemble • Clauses from bin 5 vote on every example • Each example is scored by how many clauses match it (e.g. ex1: prot_loc(…) 12 Pos, ex2: prot_loc(…) 47 Neg, ex3: prot_loc(…) 55 Pos, …, ex598: prot_loc(…) 5 Pos, ex599: prot_loc(…) 14 Neg, ex600: prot_loc(…) 2, ex601: prot_loc(…) 18, …)

  18. Gleaner - Ensemble • Sort examples by score and sweep a threshold to trace a precision-recall curve • Example scores: pos3: prot_loc(…) 55, neg28: prot_loc(…) 52, pos2: prot_loc(…) 47, …, neg4: prot_loc(…) 18, neg475: prot_loc(…) 17, pos9: prot_loc(…) 17, neg15: prot_loc(…) 16, … • [figure: resulting PR curve with points such as (recall 0.05, precision 1.00), (0.05, 0.50), (0.10, 0.66), (0.12, 0.85), (0.13, 0.90)]
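The threshold sweep described above can be sketched in a few lines; the function name and the tiny score list in the test are made up for illustration:

```python
def pr_curve(scored):
    """scored: list of (score, is_positive) pairs.
    Sort by descending score and sweep a threshold, emitting one
    (recall, precision) point after each example is admitted."""
    total_pos = sum(1 for _, pos in scored if pos)
    points, tp, fp = [], 0, 0
    for _, pos in sorted(scored, key=lambda x: -x[0]):
        if pos:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points
```

Each point corresponds to classifying every example at or above the current score as positive, exactly the "ordered by their score" usage described on the next slides.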

  19. Gleaner - Overlap • For each bin, take the topmost curve [figure: overlapping precision-recall curves]

  20. How to Use Gleaner (Version 1) • Generate Tuneset Curve • User Selects Recall Bin • Return Testset Classifications Ordered By Their Score [figure: PR curve point at Recall = 0.50, Precision = 0.70]

  21. Gleaner Algorithm • Divide space into B bins • For K positive seed examples • Perform RRR search with precision x recall heuristic • Save best clause found in each bin b • For each bin b • Combine clauses in b to form theory_b • Find L-of-K threshold for theory_b which performs best in bin b on tuneset • Evaluate thresholded theories on testset
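The per-bin bookkeeping at the heart of this algorithm is simple: during each seed's RRR search, keep the best clause seen so far in every recall bin. A minimal sketch, with a hypothetical `update_bins` helper and the precision x recall heuristic from the slide:

```python
def update_bins(bins, clause, precision, recall, B=20):
    """During RRR search, keep for each of B recall bins the clause
    with the best precision x recall heuristic seen so far.
    bins: dict mapping bin index -> (heuristic score, clause)."""
    b = min(int(recall * B), B - 1)  # which recall bin this clause falls in
    score = precision * recall
    if b not in bins or bins[b][0] < score:
        bins[b] = (score, clause)
    return bins
```

Repeating this for K seeds yields K saved clauses per bin, which the algorithm then combines into theory_b with an L-of-K threshold.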

  22. Aleph Ensembles (Dutra et al ‘02) • Compare to ensembles of theories • Ensemble Algorithm • Use K different initial seeds • Learn K theories containing C rules • Rank examples by the number of theories that cover them

  23. YPD Protein Localization • Hand-labeled dataset (Ray & Craven ’01) • 7,245 sentences from 871 abstracts • Examples are phrase-phrase combinations • 1,810 positive & 279,154 negative • 1.6 GB of background knowledge • Structural, Statistical, Lexical and Ontological • In total, 200+ distinct background predicates • Performed five-fold cross-validation

  24. Evaluation Metrics • Area Under Precision-Recall Curve (AUC-PR) • All curves standardized to cover full recall range • Averaged AUC-PR over 5 folds • Number of clauses considered • Rough estimate of time [figure: precision-recall axes, both from 0 to 1.0]
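A step-wise (average-precision style) area computation is one simple way to turn PR points into a single AUC-PR number. Note this is only a sketch: interpolation between points in PR space is nonlinear, so a step-wise sum is an approximation, and the function name below is hypothetical:

```python
def auc_pr(points):
    """points: (recall, precision) pairs sorted by increasing recall.
    Step-wise area: each point contributes its precision times the
    recall gained since the previous point."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area
```

For a curve standardized to cover the full recall range (as on this slide), `prev_recall` starting at 0.0 means no recall region is left out of the sum.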

  25. PR Curves - 100,000 Clauses

  26. Protein Localization Results

  27. Other Relational Datasets • Genetic Disorder (Ray & Craven ’01) • 233 positive & 103,959 negative • Protein Interaction (Bunescu et al ‘04) • 799 positive & 76,678 negative • Advisor (Richardson and Domingos ‘04) • Students, Professors, Courses, Papers, etc. • 113 positive & 2,711 negative

  28. Genetic Disorder Results

  29. Protein Interaction Results

  30. Advisor Results

  31. Gleaner Summary • Gleaner makes use of clauses that are not the highest scoring ones for improved speed and quality • Issues with Gleaner • Output is PR curve, not probability • Redundant clauses across seeds • L-of-K clause combination

  32. Outline • Biomedical Information Extraction • Inductive Logic Programming • Gleaner • Extensions to Gleaner • GleanerSRL • Negative Salt • F-Measure Search • Clause Weighting

  33. Estimating Probabilities - SRL • Given highly skewed relational datasets • Produce accurate probability estimates • Gleaner only produces PR curves

  34. GleanerSRL Algorithm (Goadrich ‘07) • Divide space into B bins • For K positive seed examples • Perform RRR search with precision x recall heuristic • Save best clause found in each bin b • Create propositional feature-vectors • Learn scores with SVM or other propositional learning algorithms • Calibrate scores into probabilities • Evaluate probabilities with Cross Entropy (the last three steps replace Gleaner’s per-bin theory combination, L-of-K thresholding and testset evaluation)

  35. GleanerSRL Algorithm

  36. Learning with Gleaner • Generate Clauses • Create B Bins • Record Best per Bin • Repeat for K seeds [figure: precision-recall space]

  37. Creating Feature Vectors • Clauses from bin 5, K per bin, become features for each example • Boolean encoding: one 0/1 feature per clause, set if the clause matches the example • Binned encoding: one count per bin (e.g. ex1: prot_loc(…) matched 12 clauses) [figure: feature vectors for Pos/Neg labeled examples]
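The two encodings described on this slide can be sketched as one function; the data layout (`clauses_per_bin` as a list of clause-id lists, `example_matches` as a dict from bin index to the matching clause ids) is an assumption for illustration:

```python
def feature_vectors(example_matches, clauses_per_bin):
    """Build both encodings from the slide for one example.
    Boolean: one 0/1 entry per clause.  Binned: one count per bin."""
    boolean, binned = [], []
    for b, clauses in enumerate(clauses_per_bin):
        matched = example_matches.get(b, set())
        boolean.extend(1 if c in matched else 0 for c in clauses)
        binned.append(len(matched))  # how many of this bin's clauses fired
    return boolean, binned
```

Either vector can then be handed to an SVM or another propositional learner, as the next slide describes.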

  38. Learning Scores via SVM

  39. Calibrating Probabilities • Use Isotonic Regression (Zadrozny & Elkan ‘03) to transform SVM scores into probabilities • Example with 9 examples sorted by score: Score: -2, -0.4, 0.2, 0.4, 0.5, 0.9, 1.3, 1.7, 15; Class: 0, 0, 1, 0, 1, 1, 0, 1, 1; calibrated Probability levels: 0.00, 0.50, 0.66, 1.00
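Isotonic regression as used here reduces to the pool-adjacent-violators (PAV) procedure: sort examples by score, start from their 0/1 labels, and merge adjacent groups until the fitted values are nondecreasing. A minimal sketch (the function name is mine, not from the talk):

```python
def isotonic_fit(labels):
    """Pool Adjacent Violators: given 0/1 class labels sorted by
    classifier score, return nondecreasing probability estimates."""
    blocks = []  # each block is [mean, count]
    for y in labels:
        blocks.append([float(y), 1])
        # merge while the sequence of block means decreases
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            n = n1 + n2
            blocks.append([(m1 * n1 + m2 * n2) / n, n])
    probs = []
    for mean, count in blocks:
        probs.extend([mean] * count)
    return probs
```

On the slide's class sequence 0, 0, 1, 0, 1, 1, 0, 1, 1 this yields the four probability levels shown: 0.00, 0.50, 0.66 (two thirds), and 1.00.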

  40. GleanerSRL Results for Advisor (Davis et al. 05) (Davis et al. 07)

  41. Outline • Biomedical Information Extraction • Inductive Logic Programming • Gleaner • Extensions to Gleaner • GleanerSRL • Negative Salt • F-Measure Search • Clause Weighting

  42. Diversity of Gleaner Clauses

  43. Negative Salt • Seed Example • A positive example that our clause must cover • Salt Example • A negative example that our clause should avoid [figure: search grows from the seed prot_loc(P,L,S) away from the salt]

  44. Gleaner Algorithm with Negative Salt • Divide space into B bins • For K positive seed examples • Select Negative Salt example • Perform RRR search with salt-avoiding heuristic • Save best clause found in each bin b • For each bin b • Combine clauses in b to form theory_b • Find L-of-K threshold for theory_b which performs best in bin b on tuneset • Evaluate thresholded theories on testset

  45. Diversity of Negative Salt

  46. Effect of Salt on Theorym Choice

  47. Negative Salt AUC-PR

  48. Outline • Biomedical Information Extraction • Inductive Logic Programming • Gleaner • Extensions to Gleaner • GleanerSRL • Negative Salt • F-Measure Search • Clause Weighting

  49. Gleaner Algorithm with F-Measure Search • Divide space into B bins • For K positive seed examples • Perform RRR search with F-Measure heuristic (in place of the precision x recall heuristic) • Save best clause found in each bin b • For each bin b • Combine clauses in b to form theory_b • Find L-of-K threshold for theory_b which performs best in bin b on tuneset • Evaluate thresholded theories on testset

  50. RRR Search Heuristic • Heuristic function directs RRR search • Can provide direction through the weighted F Measure • Low values for the weight parameter encourage Precision • High values for the weight parameter encourage Recall
