This study explores augmenting Wikipedia by labeling pages with entity classes, such as People, Organizations, Locations, and Miscellanea, using SVM algorithms and specific feature sets. It delves into page classification techniques, challenges faced, and experimental results.
Augmenting Wikipedia with Named Entity Tags • Wisam Dakka (Columbia University) • Silviu Cucerzan (Microsoft Research) • IJCNLP 2008
outline 1 Introduction 2 Related Work 3 Classifying Wikipedia Pages 4 Features Used. Independent Views 5 Challenges 6 Experiments and Findings 7 Conclusions and Future Work
the objective • assigning to each document in a collection one or several labels from a given set • algorithm • SVM • features • traditional • bag-of-words • Wikipedia-specific feature sets 2 Related Work
Types of Wikipedia pages • Disambiguation Page (DIS) • Common Page (COMM) • Named Entity Page 3 Classifying Wikipedia Pages
Entity Classes • Animated Entities (PER) • Organization Entities (ORG) • Location Entities (LOC) • Miscellaneous Entities (MISC)
Animated Entities (PER) • Human entities • real person • in fictional works • mythological deities • Non-human entities • particular animal • alien • Examples: Leonardo da Vinci, Leonhard Euler, Harry Potter, Sonny (I, Robot), Zeus, Apollo, Garfield, Alien
Organization Entities (ORG) • businesses (typical examples) • “Microsoft”, “Ford” • governmental bodies • “United States Congress” • non-governmental organizations • “Republican Party”, “American Bar Association” • science and health units • “Massachusetts General Hospital” • sports organizations and teams • “Angolan Football Federation”, “San Francisco 49ers” • religious organizations • “Church of Christ” • entertainment organizations • “San Francisco Mime Troupe”, the rock band “The Police”
Location Entities (LOC) • Geo-Political entities • “Hawaii”, “European Union”, “Australia”, and “Washington, D.C.” • Locations • “the Solar system”, “Mars”, “Hudson River”, and “Mount Rainier” • Facilities • airports, highways, streets, etc.
Miscellaneous Entities (MISC) • Events • “Olympic Games” • Art works • books, movies, TV programs • Artifacts • the camera “Nikon D4”, the software “Photoshop” • Processes • “Ettinghausen effect” • Formulas or Algorithms
4 Features Used. Independent Views 4.1 Page-Based Features 4.2 Context Features
4.1 Page-Based Features • Bag of Words (BOW) • Structured Data (STRUCT) • First Paragraph (FPAR) • Abstract (ABS) • Surface Forms and Disambiguations (SFD)
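The BOW and FPAR feature groups can be illustrated with a minimal sketch. The slides only name the groups, so the tokenizer, the blank-line paragraph convention, and the sample page text below are assumptions made for illustration, not the authors' preprocessing.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Map each token to its count (a sparse term-frequency vector)."""
    return Counter(tokenize(text))


def first_paragraph(text):
    """Return the text before the first blank line.

    Assumption: paragraphs are separated by a blank line in the page text.
    """
    return text.strip().split("\n\n", 1)[0]


# Hypothetical page text, used only to exercise the functions.
page = (
    "Leonhard Euler was a Swiss mathematician and physicist.\n\n"
    "Euler worked in many areas of mathematics."
)

bow = bag_of_words(page)                     # BOW: counts over the whole page
fpar = bag_of_words(first_paragraph(page))   # FPAR: counts over paragraph one
```

Keeping the two groups as separate count vectors, rather than merging them, is what lets each group be evaluated or used as its own view later.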
4.2 Context Features • Unigram Context (UCON) • Bigram Context (BCON)
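One plausible reading of UCON and BCON is word and word-pair counts taken from the text surrounding a mention of the entity. The sketch below follows that reading; the window size, the left/right side tags, and the sample sentence are assumptions, since the slides do not specify how contexts are extracted.

```python
from collections import Counter


def context_features(tokens, start, end, window=3):
    """Collect unigram and bigram counts around the mention tokens[start:end].

    `window` (tokens kept on each side) is a hypothetical parameter.
    """
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    unigrams = Counter()
    bigrams = Counter()
    for side, words in (("L", left), ("R", right)):
        for w in words:
            unigrams[(side, w)] += 1
        # Adjacent word pairs that lie entirely on one side of the mention.
        for a, b in zip(words, words[1:]):
            bigrams[(side, a, b)] += 1
    return unigrams, bigrams


# Hypothetical sentence; the mention "Leonhard Euler" spans tokens 2..3.
tokens = "the mathematician Leonhard Euler was born in Basel".split()
ucon, bcon = context_features(tokens, 2, 4)
```

In practice the counts from every mention of the same entity would be summed into one vector per page.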
pages refer to entities that do not exist in Wikipedia • abstract and structure features are available for only 68% and 79% of the pages, respectively • only several hundred labeled examples were available • the feature space is very large and contains many noisy features 5 Challenges
6 Experiments and Findings 6.1 Training Data 6.2 Classification 6.3 Results on Bag-of-words 6.4 Results on Other Feature Groups 6.5 Results for Co-training
Human Judged Data (HJD) • Human Judged Data Extended (HJDE) 6.1 Training Data
algorithms • SVMs • Naïve Bayes 6.2 Classification
reported results • binary classification • identify all the Wikipedia pages of type PER • 5-way (five-class) classification • PER, COMM, ORG, LOC, and MISC
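The two setups differ only in the label set, which a small sketch can make concrete. It uses a multinomial Naive Bayes classifier with add-one smoothing because that fits in a few lines of standard-library Python; it is not the SVM setup from the slides, and the training examples, the `OTHER` label, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Train multinomial Naive Bayes from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest smoothed log-posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                count = word_counts[label][w]
                score += math.log((count + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def to_binary(examples, positive="PER"):
    """Collapse every non-positive label into a single OTHER class."""
    return [(t, lab if lab == positive else "OTHER") for t, lab in examples]


# Toy training data, invented for illustration only.
train = [
    ("born mathematician physicist".split(), "PER"),
    ("born painter inventor".split(), "PER"),
    ("company founded software".split(), "ORG"),
    ("river city state".split(), "LOC"),
    ("film directed released".split(), "MISC"),
    ("color concept term".split(), "COMM"),
]

five_way = train_nb(train)           # PER, COMM, ORG, LOC, MISC
binary = train_nb(to_binary(train))  # PER vs. everything else
```

The same training and prediction code serves both experiments; only `to_binary` changes which question is being asked.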