This paper presents a novel method for extracting domain-specific information from natural language text using multi-level bootstrapping strategies. Focused on fields such as locations, companies, and terrorism, the approach leverages unannotated corpora and seed words to develop extraction patterns and semantic lexicons without the need for pre-annotated resources. The adaptive algorithm iteratively refines extraction patterns to improve accuracy, culminating in a reliable system that has been evaluated on diverse datasets, achieving notable performance improvements compared to existing methods.
Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping Ellen Riloff and Rosie Jones, AAAI 99 Presented by: Sunandan Chakraborty
Information Extraction • Extracting domain-specific information from NL text • Example Domains • Locations • Companies • Terrorism
Required Lexical Resources • Semantic lexicons • Dictionary of words tagged with semantic categories • e.g. names of locations (countries, cities) • Extraction patterns • e.g. outlets in <x>, from <y> • Applied to noun phrases • e.g. "outlets in New York" yields New York
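To make the idea of an extraction pattern concrete, here is a minimal sketch of applying a pattern such as "outlets in <x>" to raw text. It is only an approximation: the slides use AutoSlog, which works on syntactic analyses, whereas this sketch swaps in a regular expression that treats a run of capitalized words as a stand-in for a noun phrase. The function name `apply_pattern` and the sample sentence are hypothetical.

```python
import re


def apply_pattern(pattern, text):
    """Return the fillers that a pattern like 'outlets in <x>' extracts from text.

    Assumption: a run of capitalized words approximates a noun phrase.
    A real system (e.g. AutoSlog) would use a parser instead.
    """
    # Everything before the slot marker is the literal trigger, e.g. "outlets in".
    prefix = pattern.split("<")[0].strip()
    regex = re.compile(re.escape(prefix) + r"\s+((?:[A-Z]\w*\s?)+)")
    return [match.group(1).strip() for match in regex.finditer(text)]


text = "The chain has outlets in New York and offices in Hong Kong."
locations = apply_pattern("outlets in <x>", text)
print(locations)
```

The extracted fillers are the candidates that would be considered for the semantic lexicon.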
Mutual Bootstrapping • No annotated corpus • Learning extraction patterns and semantic lexicon • Input • Unannotated corpus • Seed words
Mutual Bootstrapping • Starting from seed words • Identifying NPs related to the seed words [for extraction patterns] • Using extraction patterns to identify new terms • New terms should be in the same lexical category • Using new terms to search for more patterns
Algorithm • Input: • Candidate extraction patterns from AutoSlog • Seed words • Data Structures • EPdata – to store candidate extraction patterns • Initial value: extraction patterns from AutoSlog and their extractions • SemLex – to store semantic lexicon entries as they are identified • Initial value: seed words • Cat_EPlist – to store the extraction patterns selected for the category • Initial value: null
Algorithm (contd...) • 1. For all extraction patterns Pi in EPdata • score(Pi) = Ri * log2(Fi) • Fi = no. of lexicon entries among the NPs extracted by Pi • Ri = Fi/Ni, Ni: no. of NPs extracted by Pi • 2. Insert Pi into Cat_EPlist, where score(Pi) is max • 3. Insert Pi's extractions into SemLex • 4. Repeat from step 1.
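The loop above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: `ep_data` is assumed to be a dictionary from each candidate pattern to the set of NPs it extracts, the stopping rule (`max_patterns`, or no pattern with a positive score) is an assumption, and the toy patterns and words are made up.

```python
import math


def mutual_bootstrapping(ep_data, seed_words, max_patterns=10):
    """Greedy pattern selection; ep_data maps pattern -> set of extracted NPs."""
    sem_lex = set(seed_words)   # SemLex, initialised with the seed words
    cat_ep_list = []            # Cat_EPlist, initially empty
    while len(cat_ep_list) < max_patterns:
        best, best_score = None, 0.0
        for pattern, extractions in ep_data.items():
            if pattern in cat_ep_list:
                continue
            f = len(extractions & sem_lex)  # Fi: extractions already in SemLex
            n = len(extractions)            # Ni: all NPs extracted by the pattern
            if f == 0 or n == 0:
                continue
            score = (f / n) * math.log2(f)  # score(Pi) = Ri * log2(Fi)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:                    # assumed stop: nothing scores above 0
            break
        cat_ep_list.append(best)
        sem_lex |= ep_data[best]            # add all of the pattern's extractions
    return cat_ep_list, sem_lex


ep_data = {
    "offices in <x>": {"bolivia", "colombia", "peru"},
    "traveled to <x>": {"colombia", "peru", "chile", "work"},
    "said <x>": {"bolivia", "he", "she", "they"},
}
patterns, lexicon = mutual_bootstrapping(ep_data, {"bolivia", "colombia"})
print(patterns)
print(sorted(lexicon))
```

In this toy run "traveled to <x>" only becomes selectable after "peru" has entered the lexicon, and it also brings in the unrelated word "work", since every extraction of a selected pattern is added.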
Multi-level Bootstrapping • Problem with mutual bootstrapping • Insertion of incorrect word in SemLex can drastically reduce accuracy • Solution • Second level of bootstrapping
Meta-bootstrapping • Outer level of bootstrapping • Retains the best 5 NPs • Corresponding lexicon entries are added to a permanent list • Reliability score: • rel(NP_i) = Σ_{k=1}^{N_i} (1 + score(p_k)), where N_i here is the no. of patterns that extracted NP_i • Using reliable lexicon entries for the next iteration of Mutual-BS
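A minimal sketch of the outer-level filtering step, using the reliability score exactly as written on the slide. The input format (a dictionary from each candidate NP to the scores of the patterns that extracted it), the function names, the tie-breaking by alphabetical order, and the sample numbers are all assumptions made for illustration.

```python
def np_reliability(pattern_scores_by_np):
    """rel(NP_i) = sum over the patterns p_k that extracted NP_i of (1 + score(p_k))."""
    return {
        np: sum(1 + score for score in scores)
        for np, scores in pattern_scores_by_np.items()
    }


def keep_most_reliable(pattern_scores_by_np, permanent_lexicon, top_n=5):
    """Add the top_n most reliable new NPs to the permanent lexicon."""
    rel = np_reliability(pattern_scores_by_np)
    candidates = [np for np in rel if np not in permanent_lexicon]
    # Highest reliability first; ties broken alphabetically (an assumption).
    ranked = sorted(candidates, key=lambda np: (-rel[np], np))
    return set(permanent_lexicon) | set(ranked[:top_n])


# Hypothetical scores of the patterns that extracted each NP in one inner run.
extracted = {
    "peru": [0.67, 0.5],
    "chile": [0.5, 0.4],
    "ecuador": [0.67],
    "panama": [0.5],
    "brazil": [0.4],
    "work": [0.1],
}
permanent = keep_most_reliable(extracted, {"bolivia", "colombia"})
print(sorted(permanent))
```

NPs extracted by several patterns score higher, so an entry backed by a single weak pattern (here "work") is the one left out of the permanent list.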
Evaluation • Corpus • 4160 corporate web pages • 1500 terrorism texts • AutoSlog candidate extraction patterns • 19,690 for the web pages • 14,064 for the terrorism texts • Seed words • Web company: Co., Company, Corp… • Web location: different country names • Terrorism location: Bolivia, city, Colombia, district
Evaluation (contd…) • 50 iterations of meta-bootstrapping • Mutual bootstrapping ran until it produced 10 unique patterns
Evaluation (contd…) • Other systems’ accuracy (weapon): • 17% (Riloff & Shepherd, 1997) • 36% (Roark & Charniak, 1998)
Evaluation (contd…) • Tested on 233 new web pages