
Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping

Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Ellen Riloff and Rosie Jones, AAAI 99 Presented by: Sunandan Chakraborty. Information Extraction. Extracting domain-specific information from NL text Example Domains Locations Companies Terrorism.


Presentation Transcript


  1. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping Ellen Riloff and Rosie Jones, AAAI 99 Presented by: Sunandan Chakraborty

  2. Information Extraction • Extracting domain-specific information from NL text • Example Domains • Locations • Companies • Terrorism

  3. Required Lexical Resources • Semantic lexicons • Dictionary of words tagged with semantic categories • e.g. names of locations (countries, cities) • Extraction patterns • e.g. outlets in <x>, from <y> • Derived from noun phrases • e.g. "Outlets in New York"

  4. Mutual Bootstrapping • No annotated corpus • Learning extraction patterns and semantic lexicon • Input • Unannotated corpus • Seed words

  5. Mutual Bootstrapping • Starting from seed words • Identifying NPs related to the seed words [for extraction patterns] • Using extraction patterns to identify new terms • New terms should be in the same lexical category • Using new terms to search for more patterns

  6. Algorithm • Input: • Candidate extraction patterns from AutoSlog • Seed words • Data Structures • EPdata – to store candidate extraction patterns • Initial value: extraction patterns from AutoSlog and their extractions • SemLex – to store semantic lexicon entries as they are identified • Initial value: seed words • Cat_EPlist – to store the selected extraction patterns • Initial value: null

  7. Algorithm (contd...) • 1. For every extraction pattern Pi in EPdata, compute score(Pi) = Ri * log2(Fi) • Fi = no. of SemLex entries among the NPs extracted by Pi • Ri = Fi/Ni, where Ni = total no. of NPs extracted by Pi • 2. Insert the Pi with the maximum score(Pi) into Cat_EPlist • 3. Insert Pi's extractions into SemLex • 4. Repeat from step 1.
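
The scoring and selection loop on slides 6 and 7 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `score` and `mutual_bootstrap`, the dictionary representation of EPdata (pattern string mapped to its set of extracted NPs), the `max_patterns` stopping rule, and the toy data are all assumptions made for the example.

```python
import math


def score(extractions, sem_lex):
    """score(Pi) = Ri * log2(Fi), with Ri = Fi / Ni.

    extractions: set of NPs extracted by the pattern (Ni = its size).
    sem_lex: current semantic lexicon (Fi = size of the overlap).
    Patterns with no lexicon hits get -inf so they are never selected.
    """
    n_i = len(extractions)
    f_i = len(extractions & sem_lex)
    if n_i == 0 or f_i == 0:
        return float("-inf")
    return (f_i / n_i) * math.log2(f_i)


def mutual_bootstrap(ep_data, seeds, max_patterns=10):
    """Greedy loop from the slides; max_patterns is an assumed stopping rule."""
    sem_lex = set(seeds)       # SemLex, initialised with the seed words
    cat_ep_list = []           # Cat_EPlist, initially empty
    remaining = dict(ep_data)  # candidate patterns not yet selected
    while remaining and len(cat_ep_list) < max_patterns:
        # Step 1: score every remaining pattern against the current lexicon.
        best = max(remaining, key=lambda p: score(remaining[p], sem_lex))
        if score(remaining[best], sem_lex) == float("-inf"):
            break  # no pattern extracts any known lexicon entry
        # Step 2: add the best pattern to Cat_EPlist.
        cat_ep_list.append(best)
        # Step 3: add all of its extractions to SemLex.
        sem_lex |= remaining.pop(best)
        # Step 4: loop back to step 1.
    return cat_ep_list, sem_lex


# Toy data (hypothetical): pattern -> set of extracted noun phrases.
toy_ep_data = {
    "offices in <x>": {"france", "japan", "chicago"},
    "click <x>": {"here"},
}
toy_patterns, toy_lexicon = mutual_bootstrap(toy_ep_data, {"france", "japan"})
print(toy_patterns, sorted(toy_lexicon))
```

Because step 3 adds every extraction of the chosen pattern, a single bad pattern can pull unrelated NPs into the lexicon and those NPs then influence later scores.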

  8. Results (Locations)

  9. Multi-level Bootstrapping • Problem with mutual bootstrapping • Inserting an incorrect word into SemLex can drastically reduce accuracy • Solution • Second level of bootstrapping

  10. Meta-bootstrapping • Outer level of bootstrapping • Retains the best 5 NPs • The corresponding lexicon entries are added to a permanent list • Reliability score: • $rel(NP_i) = \sum_{k=1}^{N_i} (1 + score(p_k))$, where $N_i$ here is the no. of patterns that extracted $NP_i$ • The reliable lexicon entries are used for the next iteration of mutual bootstrapping
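
The reliability ranking on slide 10 can be sketched as below, following the formula as it appears on the slide. The function names `reliability` and `best_nps`, the argument layout, and the toy values are assumptions for illustration only; pattern scores are taken as given, for example from the mutual bootstrapping step.

```python
def reliability(noun_phrase, pattern_extractions, pattern_scores):
    """rel(NP_i): sum of (1 + score(p_k)) over patterns that extracted NP_i.

    pattern_extractions: pattern -> set of NPs it extracted.
    pattern_scores: pattern -> its score.
    """
    return sum(
        1 + pattern_scores[pattern]
        for pattern, extracted in pattern_extractions.items()
        if noun_phrase in extracted
    )


def best_nps(candidates, pattern_extractions, pattern_scores, keep=5):
    """Rank candidate NPs by reliability and retain the top `keep` (5 on the slide)."""
    ranked = sorted(
        candidates,
        key=lambda np_: reliability(np_, pattern_extractions, pattern_scores),
        reverse=True,
    )
    return ranked[:keep]


# Toy data (hypothetical): two patterns with made-up scores.
toy_extractions = {
    "offices in <x>": {"chicago", "here"},
    "located in <x>": {"chicago"},
}
toy_scores = {"offices in <x>": 0.5, "located in <x>": 1.0}
print(best_nps(["here", "chicago"], toy_extractions, toy_scores, keep=1))
```

An NP extracted by several patterns accumulates one term per pattern, so agreement between patterns counts for more than any single pattern's score.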

  11. Results

  12. Evaluation • Corpus • 4160 corporate web pages • 1500 terrorism texts • AutoSlog candidate extraction patterns • 19,690 for the web pages • 14,064 for the terrorism texts • Seed words • Web company: Co., Company, Corp… • Web location: different country names • Terrorism location: Bolivia, city, Colombia, district

  13. Evaluation (contd…) • 50 iterations of meta-bootstrapping • Mutual bootstrapping ran until it produced 10 unique patterns

  14. Evaluation (contd…) • Other systems' accuracy (weapon): • 17% (Riloff & Shepherd, 1997) • 36% (Roark & Charniak, 1998)

  15. Evaluation (contd…) • Tested on 233 new web pages
