Augmenting Wikipedia with Named Entity Tags. Wisam Dakka Columbia University. Silviu Cucerzan Microsoft Research. IJCNLP 2008. outline. 1 Introduction 2 Related Work 3 Classifying Wikipedia Pages 4 Features Used. Independent Views 5 Challenges 6 Experiments and Findings
Augmenting Wikipedia with Named Entity Tags. Wisam Dakka, Columbia University. Silviu Cucerzan, Microsoft Research. IJCNLP 2008
outline 1 Introduction 2 Related Work 3 Classifying Wikipedia Pages 4 Features Used. Independent Views 5 Challenges 6 Experiments and Findings 7 Conclusions and Future Work
objective: assign to each document in a collection one or several labels from a given set • algorithm: SVM • features: traditional (bag-of-words) and Wikipedia-specific feature sets 2 Related Work
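As a minimal illustration of the traditional bag-of-words representation named on this slide, the sketch below turns one page's text into sparse term counts. The tokenizer, the function name `bag_of_words`, and the sample sentence are assumptions made for illustration; the slides do not specify any preprocessing.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a sparse term-count vector for one document."""
    # Assumed tokenization: lowercase runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Hypothetical page text, not taken from the slides.
page = "Leonhard Euler was a Swiss mathematician. Euler worked in Berlin."
features = bag_of_words(page)
```

Counts like these could then be fed to a classifier such as an SVM, with one dimension per vocabulary term.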
Types of Wikipedia pages • Disambiguation Page (DIS) • Common Page (COMM) • Named Entity Page 3 Classifying Wikipedia Pages
Entity Classes • Animated Entities (PER) • Organization Entities (ORG) • Location Entities (LOC) • Miscellaneous Entities (MISC) 3 Classifying Wikipedia Pages
Animated Entities (PER) • Human entities: real persons, characters in fictional works, mythological deities • Non-human entities: particular animals, aliens 3 Classifying Wikipedia Pages
Examples of Animated Entities (PER) • Leonardo da Vinci • Leonhard Euler • Harry Potter • Sonny (I, Robot) • Zeus • Apollo • Garfield • Alien 3 Classifying Wikipedia Pages
Organization Entities (ORG) • businesses (typical examples): “Microsoft”, “Ford” • governmental bodies: “United States Congress” • non-governmental organizations: “Republican Party”, “American Bar Association” 3 Classifying Wikipedia Pages
Organization Entities (ORG) • science and health units: “Massachusetts General Hospital” • sports organizations and teams: “Angolan Football Federation”, “San Francisco 49ers” • religious organizations: “Church of Christ” • entertainment organizations: “San Francisco Mime Troupe”, the rock band “The Police” 3 Classifying Wikipedia Pages
Location Entities (LOC) • Geo-political entities: “Hawaii”, “European Union”, “Australia”, and “Washington, D.C.” • Locations: “the Solar system”, “Mars”, “Hudson River”, and “Mount Rainier” • Facilities: airports, highways, streets, etc. 3 Classifying Wikipedia Pages
Miscellaneous Entities (MISC) • Events: “Olympic Games” • Art works: books, movies, TV programs • Artifacts: the camera “Nikon D4”, the software “Photoshop” • Processes: “Ettinghausen effect” • Formulas or algorithms 3 Classifying Wikipedia Pages
4 Features Used. Independent Views 4.1 Page-Based Features 4.2 Context Features
4.1 Page-Based Features • Bag of Words (BOW) • Structured Data (STRUCT) • First Paragraph (FPAR) • Abstract (ABS) • Surface Forms and Disambiguations (SFD) 4 Features Used. Independent Views
4.2 Context Features • Unigram Context (UCON) • Bigram Context (BCON) 4 Features Used. Independent Views
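A rough sketch of what unigram and bigram context features could look like, assuming that a "context" is the text immediately surrounding a mention of the entity in a tokenized sentence. The function name, the feature-string prefixes, and the example sentence are hypothetical; the slides only name the two feature groups (UCON, BCON) and do not define how they are extracted.

```python
def context_features(tokens, start, end):
    """Collect unigram and bigram contexts around the mention tokens[start:end]."""
    feats = []
    # Unigram context: the single word on each side of the mention.
    if start >= 1:
        feats.append("UCON_L=" + tokens[start - 1])
    if end < len(tokens):
        feats.append("UCON_R=" + tokens[end])
    # Bigram context: the two words on each side of the mention.
    if start >= 2:
        feats.append("BCON_L=" + " ".join(tokens[start - 2:start]))
    if end + 2 <= len(tokens):
        feats.append("BCON_R=" + " ".join(tokens[end:end + 2]))
    return feats

# Hypothetical sentence; the mention "Leonhard Euler" spans tokens 2..3.
sentence = "the mathematician Leonhard Euler was born in Basel".split()
feats = context_features(sentence, 2, 4)
```

Because these features come from text around mentions rather than from the entity's own page, they form a view that is separate from the page-based features of 4.1.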
references to entities that do not exist in Wikipedia • abstract and structure features are available for only 68% and 79% of the pages, respectively • only several hundred labeled examples were available • the feature space is very large and contains a lot of noise 5 Challenges
6 Experiments and Findings 6.1 Training Data 6.2 Classification 6.3 Results on Bag-of-words 6.4 Results on Other Feature Groups 6.5 Results for Co-training
Human Judged Data (HJD) • Human Judged Data Extended (HJDE) 6.1 Training Data
algorithms • SVMs • Naïve Bayes 6.2 Classification
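To make the second algorithm concrete, here is a small multinomial Naïve Bayes sketch over token lists, with add-one smoothing. The class name, the smoothing choice, and the toy training data are assumptions for illustration only; the slides do not describe the actual implementation or its settings.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token lists."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for w in tokens:
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Toy training data (hypothetical, not from the slides).
train_docs = [
    "born physicist professor".split(),
    "born actor singer".split(),
    "company founded headquarters".split(),
    "company products founded".split(),
]
train_labels = ["PER", "PER", "ORG", "ORG"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("founded company".split())
```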
results are reported for two tasks • binary classification: identify all the Wikipedia pages of type PER • 5-fold (five-class) classification: PER, COMM, ORG, LOC, and MISC 6.2 Classification
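The relation between the two tasks can be sketched as a label mapping: the binary task collapses the five classes into PER versus everything else. The name `OTHER` for the negative class, the helper `to_binary`, and the sample labels are assumptions; the class abbreviations follow the earlier page-type and entity-class slides.

```python
CLASSES = ["PER", "COMM", "ORG", "LOC", "MISC"]

def to_binary(label):
    """Collapse a five-class label into the PER vs. non-PER task."""
    if label not in CLASSES:
        raise ValueError(f"unknown label: {label}")
    return "PER" if label == "PER" else "OTHER"

# Hypothetical gold labels for six pages.
gold = ["PER", "ORG", "COMM", "PER", "LOC", "MISC"]
binary_gold = [to_binary(label) for label in gold]
```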
6 Experiments and Findings 6.1 Training Data 6.2 Classification 6.3 Results on Bag-of-words 6.4 Results on Other Feature Groups 6.5 Results for Co-training
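Co-training (6.5) relies on two independent views of each example, here assumed to be the page-based and the context features of section 4. The sketch below is a deliberately simplified variant, not the authors' procedure: each view trains a tiny Naïve Bayes scorer on the labeled set, labels the single unlabeled example it is most confident about (confidence is taken as the log-score margin), and adds it to the shared labeled set. All names, the confidence measure, and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a tiny multinomial Naive Bayes model; returns a scoring function."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for c in word_counts.values() for w in c}

    def score(tokens):
        out = {}
        for label, n in class_counts.items():
            denom = sum(word_counts[label].values()) + len(vocab)
            s = math.log(n / len(labels))
            for w in tokens:
                if w in vocab:  # unseen words are ignored
                    s += math.log((word_counts[label][w] + 1) / denom)
            out[label] = s
        return out
    return score

def confident_label(score, tokens):
    """Return (margin, label): the best label and its lead over the runner-up."""
    ranked = sorted(score(tokens).items(), key=lambda kv: kv[1], reverse=True)
    margin = ranked[0][1] - ranked[1][1] if len(ranked) > 1 else 0.0
    return margin, ranked[0][0]

def co_train(labeled, unlabeled, rounds=2):
    """labeled: (page_view, context_view, label) triples; unlabeled: (page_view, context_view) pairs."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):  # 0 = page-based view, 1 = context view
            if not pool:
                return labeled
            score = train_nb([ex[view] for ex in labeled], [ex[2] for ex in labeled])
            # This view's classifier picks the pool example it is most sure about.
            best = max(range(len(pool)),
                       key=lambda i: confident_label(score, pool[i][view])[0])
            _, label = confident_label(score, pool[best][view])
            page, ctx = pool.pop(best)
            labeled.append((page, ctx, label))
    return labeled

# Toy data (hypothetical): two seed examples and two unlabeled pages.
seed = [
    (["born", "physicist"], ["said", "mr"], "PER"),
    (["company", "founded"], ["shares", "of"], "ORG"),
]
unlabeled = [
    (["born", "actor"], ["visited", "by"]),
    (["based", "firm"], ["shares", "in"]),
]
grown = co_train(seed, unlabeled, rounds=1)
```

In this toy run the page view labels the first unlabeled example and the context view labels the second, which shows how one view can supply training data that the other view could not have labeled confidently on its own.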