Automating Keyphrase Extraction with Multi-Objective Genetic Algorithms (MOGA). Jia-Long Wu and Alice M. Agogino, Berkeley Expert System Laboratory, U.C. Berkeley
Outline • Role of Keyphrases • Phrase Extraction Algorithms • Phrase Extraction with Multi-Objective Genetic Algorithm • Experiment and Results • Results Evaluation • Conclusion • Future Research
Role of Keyphrases • Concept representations • Document indexing • Enhance document retrieval / Browsing • Query formulation assistance • Document surrogates
Vision of Unified Language System [Diagram; labels: Design Research Repository, Corporate Design Repository, Design Education Materials, Unified Subject Headings, Context Mapping Mechanism, Semantic Network, Unified Language System for Engineering Design]
Keyphrase Extraction Algorithms • Heuristic, Syntactic, Machine Learning • Requires prior training • Heuristic cut-off thresholds in number of phrases • Focuses on single document • Redundancy when aggregated for the whole document collection
Keyphrase Extraction with MOGA • Phrase extraction as an optimization problem • Candidate phrase generation • Optimize phrase selection with MOGA • Model & Genetic Operators [Figure, Phenotype & Genotype: each candidate phrase maps to one chromosome bit, e.g. 3d scanning = 1, abstraction = 0, active control system = 1] [Figure, Crossover: parent and offspring bit strings]
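The encoding and crossover on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say which crossover variant was used, so single-point crossover is assumed, and the names `decode` and `single_point_crossover` and the last two candidate phrases are hypothetical.

```python
# Candidate phrases: the first three appear on the slide,
# the last two are hypothetical placeholders.
candidates = ["3d scanning", "abstraction", "active control system",
              "phrase 4", "phrase 5"]


def decode(chromosome, candidates):
    """Genotype -> phenotype: keep the phrases whose bit is 1."""
    return [p for bit, p in zip(chromosome, candidates) if bit == 1]


def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length bit lists at index `point`.

    Single-point crossover is an assumption; the slide only says "Crossover".
    """
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


parent_a = [1, 0, 1, 1, 0]
parent_b = [0, 1, 1, 0, 1]
child_a, child_b = single_point_crossover(parent_a, parent_b, point=2)

print(decode(parent_a, candidates))
print(child_a, child_b)
```

Each chromosome has one bit per candidate phrase, so a population member is simply one possible subset of the candidate list.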
Keyphrase Extraction with MOGA • Optimize phrase selection with MOGA (cont.) • Model & Genetic Operators (cont.) • Evaluation fitness functions • Minimize clustering measure / dispersion (Bookstein ’98) • Minimize number of phrases • Non-Dominated Sorting Genetic Algorithm (NSGA-II) [Figure, Mutation: bit strings 1 0 0 1 0 and 1 1 0 1 0, differing in one bit]
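A rough sketch of the mutation operator and of the Pareto comparison behind non-dominated sorting, with both objectives minimized. It makes several assumptions: per-bit flip mutation is assumed (the slide only shows one flipped bit), the dispersion values are made up rather than computed with the Bookstein ’98 measure, and only the first non-dominated front is extracted, whereas full NSGA-II also ranks the remaining fronts and uses crowding distance. All function names are hypothetical.

```python
import random


def bit_flip_mutation(chromosome, rate, rng=random):
    """Flip each bit independently with probability `rate` (assumed operator)."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b` (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Return the points that no other point dominates (the first front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# (dispersion, number of phrases) pairs; values are illustrative only.
scores = [(0.90, 2), (0.60, 4), (0.70, 4), (0.40, 7), (0.65, 8)]
front = non_dominated(scores)
print(front)

mutated = bit_flip_mutation([1, 0, 0, 1, 0], rate=0.2, rng=random.Random(42))
print(mutated)
```

Here (0.70, 4) and (0.65, 8) drop out because (0.60, 4) is at least as good on both objectives and strictly better on one.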
Experiment and Results • Data set: 34 papers from the Design Theory and Methodology Conference ’01 • Candidate phrases: ~5000 noun phrases extracted • Genetic algorithm parameters: population size 100; converges at 5000 generations • Run time: 5 hours on a Xeon 1.8GHz CPU
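For a sense of scale, the figures on this slide can be turned into a quick back-of-the-envelope calculation. It assumes exactly 5000 candidates (the slide says "~5000") and one fitness evaluation per individual per generation, which the slide does not state.

```python
n_candidates = 5000   # approximate count from the slide
population = 100
generations = 5000
runtime_hours = 5

# Every subset of the candidates is a possible solution: 2**n_candidates of them.
# Count the decimal digits of that number.
digits = len(str(2 ** n_candidates))

# Assumption: one fitness evaluation per individual per generation.
evaluations = population * generations
ms_per_evaluation = runtime_hours * 3600 * 1000 / evaluations

print(digits)              # 1506
print(evaluations)         # 500000
print(ms_per_evaluation)   # 36.0
```

Under these assumptions the run samples about half a million of roughly 10^1505 possible phrase subsets, at about 36 ms per evaluation.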
Experiment and Results Pareto plot of Dispersion versus Number of Phrases
Experiment and Results Histogram of the number of optimal solutions in which a keyphrase appears
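One way such a histogram could be tallied from a set of Pareto-optimal chromosomes is sketched below. The function name `phrase_frequencies` and the three example solutions are hypothetical; the phrases are the ones shown on the encoding slide.

```python
from collections import Counter


def phrase_frequencies(pareto_solutions, candidates):
    """Count, for each phrase, how many Pareto-optimal solutions select it."""
    counts = Counter()
    for chromosome in pareto_solutions:
        counts.update(p for bit, p in zip(chromosome, candidates) if bit)
    return counts


candidates = ["3d scanning", "abstraction", "active control system"]
# Hypothetical Pareto-optimal chromosomes, one bit per candidate phrase.
pareto_solutions = [[1, 0, 1], [1, 0, 0], [1, 1, 1]]

freq = phrase_frequencies(pareto_solutions, candidates)
print(freq)

# Histogram: how many phrases appear in exactly k optimal solutions.
histogram = Counter(freq.values())
print(sorted(histogram.items()))
```

Phrases that appear in many optimal solutions are natural candidates for the most stable part of the selection.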
Evaluation • Six domain experts participated in the evaluation. • Core phrases vs. non-core phrases. • Fewer than 10% of phrases were deemed irrelevant. • Significant deviation between evaluators.
Conclusion • Keyphrase extraction can be successfully implemented as a multi-objective global optimization problem. • Reasonably good keyphrases can be extracted without prior training or domain knowledge. • Trade-off information between objectives such as number of phrases vs. average quality of phrases can be gained from Pareto solutions. • Preferences can be made based on the user needs and trade-off information.
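As an illustration of choosing among Pareto solutions by user preference, the sketch below picks the lowest-dispersion solution that fits a phrase budget. The selection rule, the function name `pick_within_budget`, and the front values are assumptions for illustration; dispersion stands in here for phrase quality.

```python
def pick_within_budget(front, max_phrases):
    """Return the lowest-dispersion (dispersion, n_phrases) pair that uses
    at most `max_phrases` phrases, or None if no solution fits the budget."""
    feasible = [s for s in front if s[1] <= max_phrases]
    return min(feasible, key=lambda s: s[0]) if feasible else None


# Hypothetical Pareto front of (dispersion, number of phrases) pairs.
front = [(0.90, 2), (0.60, 4), (0.40, 7)]

print(pick_within_budget(front, max_phrases=5))   # (0.6, 4)
print(pick_within_budget(front, max_phrases=1))   # None
```

A user who wants a short list accepts higher dispersion; relaxing the budget moves the choice along the front.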
Future Research • Test on a larger text collection. • Implement extracted keyphrases in an IR system as a browsing and query expansion tool and compare to a full-text search IR system. • Evaluate with more raters and a 1-5 scale. • Build domain thesauri with extracted keyphrases and semantic discovery algorithms (e.g. Latent Semantic Analysis).
Thank you! Comments? Questions? jialong@me.berkeley.edu aagogino@me.berkeley.edu