Term Informativeness for Named Entity Detection

Jason D. M. Rennie (MIT) and Tommi Jaakkola (MIT)

Presentation Transcript


  1. Term Informativeness for Named Entity Detection
     Jason D. M. Rennie (MIT), Tommi Jaakkola (MIT)

  2. Information Extraction
     "President Bush signed the Central America Free Trade Agreement into law Tuesday…"
     • Who
     • What
     • When

  3. Named Entity Detection
     "President Bush signed the Central America Free Trade Agreement into law Tuesday, hailing the seven-nation pact as an open-door policy that will benefit U.S. exporters and seed prosperity and democracy in Central America and the Dominican Republic."

  4. Informal Communication
     • Other Sources of Information
       • E-mail
       • Web Bulletin Boards
       • Mailing Lists
     • More specialized, up-to-date information
     • But harder to extract

  5. IE for Informal Comm.
     SUBJECT: Two New Ipswich Seafood Joints to Open Soon.
     ALL HOUNDS ON DECK! #1 Across from the new HS, at the old White Cap Seafood is a renovated new joint and the sign says "Salt Box". I suspect they are opening soon; they look ready. Lets hope its great as there is too much 'just average' around here. #2: In the…

  6. NED for Informal Comm.
     Subject: finale harvard square
     has anyone been to the recently opened finale in harvard square?

  7. Restaurant Bulletin Board
     • Gathered from a restaurant bulletin board
     • 6 sets of ~100 posts
     • 132 threads
     • Applied Ratnaparkhi's POS tagger
     • Hand-labeled each token as In/Out of a restaurant name

  8. Detecting Named Entities
     [Diagram relating the labels: Named Entity, Informative, Bursty]

  9. Quantifying Informativeness
     [Figure: the words "the", "Brazil", and "clandestine" across Document 1, Document 2, and Document 3]

  10. A Little History…

  11. Main Idea
      • Informative words are:
        • Rare (IDF)
        • Modal (Mixture Score)
      • Rarity and modality are independent qualities
      • We quantify informativeness using a product of IDF and Mixture Score
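As a rough illustration of the product on this slide, the sketch below combines a standard IDF with a mixture score that is assumed to have been computed elsewhere. The function names, the natural-log IDF, and the example numbers are assumptions for illustration, not values from the talk.

```python
import math

def idf(doc_freq, num_docs):
    """Inverse document frequency: log of (documents / documents containing the term)."""
    return math.log(num_docs / doc_freq)

def informativeness(doc_freq, num_docs, mixture_score):
    """Product of IDF and a precomputed mixture score (hypothetical helper)."""
    return idf(doc_freq, num_docs) * mixture_score

# Hypothetical numbers: a word found in every document vs. a rare, bursty word
common = informativeness(doc_freq=1000, num_docs=1000, mixture_score=0.5)
rare = informativeness(doc_freq=10, num_docs=1000, mixture_score=40.0)
```

A word that appears in every document gets an IDF of zero, so the product is zero regardless of its mixture score.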

  12. Binomial Distribution

  13. Term Frequency Distributions
      [Figure: per-document counts for “the” and “Brazil”; extracted values: 7 0 4 0 8 0 5 5 6 0]

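  14. Mixture Models
      [Figure: two-component mixture; mixing weights 90% and 10%; component parameters 0.1% (component 1) and 5% (component 2); other extracted labels: 5, 0]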

  15. Modality
      • Modal words fit a mixture much better than a single binomial
      • We separately fit the binomial and mixture models to each term frequency distribution
      • We quantify modality by comparing the fitness of the two models
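One way to write down the two models being compared, as a sketch under assumed notation: document $i$ has length $n_i$ and contains the term $h_i$ times, $\theta$ is the single binomial rate, $\lambda$ is the mixing weight, and $\theta_1$, $\theta_2$ are the two component rates. These symbols are assumptions; the transcript does not preserve the slide's own notation.

```latex
% Single binomial: one occurrence rate shared by all documents
P_{\text{bin}}(h_i \mid n_i) = \binom{n_i}{h_i}\, \theta^{h_i} (1-\theta)^{n_i-h_i}

% Two-component mixture: each document draws its rate from one of two components
P_{\text{mix}}(h_i \mid n_i) = \binom{n_i}{h_i} \left[ \lambda\, \theta_1^{h_i} (1-\theta_1)^{n_i-h_i}
  + (1-\lambda)\, \theta_2^{h_i} (1-\theta_2)^{n_i-h_i} \right]
```

Setting $\theta_1 = \theta_2$ recovers the single binomial, so the mixture can only fit as well or better; the interesting quantity is how much better.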

  16. Learning Mixture Parameters
      • Use gradient descent to learn the three mixture parameters: the mixing weight and the two component parameters (subscripts 1 and 2)
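A minimal sketch of fitting such a mixture by gradient descent, not the authors' implementation. It assumes a two-component binomial mixture with parameters named `lam`, `theta1`, and `theta2`, optimizes unconstrained logits so each parameter stays in (0, 1), and halves the step whenever a step fails to improve the fit. All names, the initialization, and the example counts are hypothetical.

```python
import math

def _sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp so the result stays strictly inside (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def _logit(p):
    p = max(1e-6, min(1.0 - 1e-6, p))
    return math.log(p / (1.0 - p))

def _components(h, n, lam, t1, t2):
    """Log of each weighted component likelihood (binomial coefficient omitted)."""
    a = math.log(lam) + h * math.log(t1) + (n - h) * math.log(1.0 - t1)
    b = math.log(1.0 - lam) + h * math.log(t2) + (n - h) * math.log(1.0 - t2)
    return a, b

def neg_log_likelihood(logits, counts, lengths):
    lam, t1, t2 = (_sigmoid(x) for x in logits)
    total = 0.0
    for h, n in zip(counts, lengths):
        a, b = _components(h, n, lam, t1, t2)
        m = max(a, b)  # log-sum-exp for numerical stability
        total -= m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

def gradient(logits, counts, lengths):
    """Gradient of the negative log-likelihood with respect to the logits."""
    lam, t1, t2 = (_sigmoid(x) for x in logits)
    g = [0.0, 0.0, 0.0]
    for h, n in zip(counts, lengths):
        a, b = _components(h, n, lam, t1, t2)
        r = 1.0 / (1.0 + math.exp(min(b - a, 700.0)))  # responsibility of component 1
        g[0] -= r - lam
        g[1] -= r * (h - n * t1)
        g[2] -= (1.0 - r) * (h - n * t2)
    return g

def fit_mixture(counts, lengths, steps=300, step_size=0.5):
    rate = sum(counts) / sum(lengths)
    # Start with one component below and one above the overall rate (an arbitrary choice)
    logits = [0.0, _logit(rate / 4.0), _logit(rate * 4.0)]
    best = neg_log_likelihood(logits, counts, lengths)
    for _ in range(steps):
        g = gradient(logits, counts, lengths)
        step = step_size
        while step > 1e-12:
            trial = [x - step * gx for x, gx in zip(logits, g)]
            value = neg_log_likelihood(trial, counts, lengths)
            if value < best:  # accept only steps that improve the fit
                logits, best = trial, value
                break
            step /= 2.0
        else:
            break  # no improving step found; stop early
    lam, t1, t2 = (_sigmoid(x) for x in logits)
    return lam, t1, t2, -best

# Hypothetical bursty term: absent from most documents, frequent in a few
counts = [0] * 18 + [5, 8]
lengths = [200] * 20
lam, theta1, theta2, log_lik = fit_mixture(counts, lengths)
```

Working in logit space avoids having to project the parameters back into (0, 1) after each step; other parameterizations or optimizers (for example EM) would also be reasonable here.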

  17. Comparing Fitness
      • Use log-odds to compare fitness of the two models
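The comparison could be sketched as a log-likelihood ratio between the mixture and the single binomial. This assumes "log-odds" means the difference of the two log-likelihoods; the exact formula on the slide is not in the transcript. The mixture parameters are passed in as if already fitted, and all names and numbers below are hypothetical.

```python
import math

def _xlog(x, p):
    """x * log(p), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(p)

def binomial_log_likelihood(counts, lengths):
    """Log-likelihood at the maximum-likelihood rate (binomial coefficients omitted)."""
    theta = sum(counts) / sum(lengths)
    return sum(_xlog(h, theta) + _xlog(n - h, 1.0 - theta)
               for h, n in zip(counts, lengths))

def mixture_log_likelihood(counts, lengths, lam, theta1, theta2):
    """Log-likelihood of a two-component mixture; parameters must lie in (0, 1)."""
    total = 0.0
    for h, n in zip(counts, lengths):
        a = math.log(lam) + h * math.log(theta1) + (n - h) * math.log(1.0 - theta1)
        b = math.log(1.0 - lam) + h * math.log(theta2) + (n - h) * math.log(1.0 - theta2)
        m = max(a, b)  # log-sum-exp for numerical stability
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

def mixture_score(counts, lengths, lam, theta1, theta2):
    """How much better the mixture explains the counts than a single binomial."""
    return (mixture_log_likelihood(counts, lengths, lam, theta1, theta2)
            - binomial_log_likelihood(counts, lengths))

lengths = [200] * 20
# Bursty term: zero in most documents, frequent in two
bursty_score = mixture_score([0] * 18 + [5, 8], lengths, 0.9, 1e-4, 0.0325)
# Evenly spread term: both components collapse to the single binomial rate
flat_score = mixture_score([3] * 20, lengths, 0.5, 0.015, 0.015)
```

The binomial coefficients are the same in both models, so they cancel in the difference and can be left out.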

  18. Top Mixture Score Words

  19. Independence
      Is rareness (IDF) independent of modality (Mixture Score)?

  20. Correlation Coefficient

  21. Top Words Overlap Plot
      • Two sorted lists
        • Sorted by IDF
        • Sorted by Mixture Score
      • Look at % overlap among top N in both lists
      • Plot % overlap as we vary N
      • Independent scores would produce a line along the diagonal
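The overlap computation described on this slide can be sketched as follows. The function name and the toy scores are hypothetical; the input is assumed to be two dictionaries mapping each word to its score.

```python
def overlap_curve(scores_a, scores_b, ns):
    """Percent of words shared by the top-N of two rankings, for each N in ns."""
    ranked_a = sorted(scores_a, key=scores_a.get, reverse=True)
    ranked_b = sorted(scores_b, key=scores_b.get, reverse=True)
    curve = []
    for n in ns:
        shared = set(ranked_a[:n]) & set(ranked_b[:n])
        curve.append(100.0 * len(shared) / n)
    return curve

# Toy example: two scores that rank four words in opposite order
idf_scores = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
mix_scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
curve = overlap_curve(idf_scores, mix_scores, ns=[1, 2, 3, 4])
```

When N reaches the vocabulary size the overlap is always 100%, which is why unrelated rankings trace out roughly the diagonal from 0% to 100%.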

  22. Overlap Plot
      [Figure: percent overlap vs. number of top words, with curves for IDF/RIDF and IDF/Mixture]

  23. Top IDF*Mixture Words

  24. Intro to NED Experiments
      • Task: Identify restaurant names
      • Use standard NED features (capitalization, punctuation, POS) as “Baseline”
      • Add informativeness score as an additional feature
      • Use F1 Breakeven as performance metric
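A small sketch of a breakeven metric, assuming "F1 Breakeven" refers to the point where precision equals recall (and therefore F1 equals both). If the number of tokens predicted positive is set to the number of truly positive tokens, precision and recall coincide. The function name and example data are hypothetical, and ties in the scores are broken arbitrarily.

```python
def breakeven(scores, labels):
    """Precision (= recall = F1) when as many tokens are predicted positive as are truly positive."""
    num_pos = sum(labels)
    if num_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank tokens from most to least confident
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:num_pos])
    return true_pos / num_pos

# Hypothetical classifier scores for five tokens; label 1 = inside a restaurant name
token_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
token_labels = [1, 0, 1, 1, 0]
result = breakeven(token_scores, token_labels)
```

Here three tokens are truly positive, and two of the three highest-scoring tokens are among them, so the breakeven value is 2/3.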

  25. Better NED Experiments

  26. Summary
      • Traditional syntax-based features are not enough for IE in e-mail & bulletin boards
      • We used term occurrence statistics to construct an informativeness score (IDF*Mixture)
      • We found IDF*Mixture to be useful for identifying topic-centric words and named entities

  27. Discussion
      • Phrases
      • Foreign languages, Speech
      • Co-reference resolution, context tracking
      • Collaborative filtering
