Extracting Keyphrases from Books using Language Modeling Approaches

Rohini U
AOL India R&D, Bangalore, India
Rohini.uppuluri@corp.aol.com

Vamshi Ambati
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA
vamshi@cs.cmu.edu

Agenda
• Keyphrase Extraction
• Value addition to Digital Libraries
• Methods of Keyphrase Extraction
• Related Work
• Our Solution

What are Keyphrases?
• Keyphrases
  • (Give example)
• Where used?
  • Cataloguing in libraries for IR purposes
  • Quick summarization of documents

Why are keyphrases important to ULIB?
• Vast growth in digital content
  • More than a million books!
• Short metadata description, useful to the user while reading
• Further processing of books, such as summarization, IR, etc.

How do we extract KPs?
• Manual entry
  • Reliable, high-quality outcome
  • But time-consuming and expensive
• Automatic
  • Fast extraction, but less reliable
  • No expense at all

Automatic techniques for KPE
• Rule-based methods
  • Heuristics (paragraph beginning, headline, etc.)
  • Krulwich & Burkey, etc.
• Using linguistic tools
• Statistical techniques
  • Term counts and weighting-based methods
  • Learn a model from training data
    • Turney [5], KEA [6], KPSpotter [3], etc.

Requirements for a KPE for ULIB
• Automatic identification of keyphrases from chapters of books
• Language independent
• Easily adaptable to different domains
• No training data to learn from
  • Most books in ULIB do not have keywords as part of the metadata

Solution Outline
• Language modeling based
• Given n-grams
  • Measure informativeness and phraseness
  • Score n-grams based on the above measures
  • Pick the top K phrases as keyphrases

Extracting Keyphrases from Books
Text → Cleaning & Initialization → Candidate Keyphrases Extraction → Scoring → Pruning → Extracted Keyphrases

Extracting Keyphrases from Books: Cleaning & Initialization
Input text: Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited
After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
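The cleaning step on this slide can be sketched as lowercasing plus stopword removal. This is only an illustration: the `STOPWORDS` set and the `clean` function are hypothetical names, and the stopword list is just large enough to reproduce the slide's example. The deck does not say which list or tokenizer was actually used.

```python
import re

# Hypothetical stopword list (an assumption, not from the slides):
# just enough entries to reproduce the example sentence.
STOPWORDS = {"a", "an", "the", "is", "are", "also", "used",
             "to", "via", "of", "or", "and", "in"}

def clean(text):
    """Lowercase the text, keep runs of letters, and drop stopwords."""
    # [^\W\d_]+ matches runs of Unicode letters, so it is not tied to ASCII.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

sentence = ("Topics are also used to construct user profiles via explicit "
            "specification of interests or automatic analysis of Web pages visited")
cleaned = clean(sentence)
print(" ".join(cleaned))
```

For another language, only the stopword set would need to change, which is in keeping with the language-independence requirement stated earlier in the deck.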

Extracting Keyphrases from Books: Candidate Keyphrases Extraction
Input text: Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited
After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
Candidate keyphrases (trigrams):
• topics construct user
• construct user profiles
• user profiles explicit
• profiles explicit specification
• explicit specification interests
• specification interests automatic
• interests automatic analysis
• automatic analysis web
• analysis web pages
• web pages visited
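Candidate extraction here amounts to sliding a fixed-size window over the cleaned tokens. A minimal sketch, assuming trigrams as in the example; the function name `ngrams` is illustrative only:

```python
def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ("topics construct user profiles explicit specification "
          "interests automatic analysis web pages visited").split()
candidates = ngrams(tokens, n=3)
for phrase in candidates:
    print(phrase)
```

With 12 cleaned tokens the window produces 12 - 3 + 1 = 10 trigram candidates.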

Extracting Keyphrases from Books: Scoring
Input text: Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited
After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
Scored candidates:
profiles explicit specification : 0.0281
explicit specification interests : 0.0281
specification interests automatic : 0.0272
user profiles explicit : 0.0260
construct user profiles : 0.0260
interests automatic analysis : 0.0255
topics construct user : 0.0243
automatic analysis web : 0.0227
web pages visited : 0.0226
analysis web pages : 0.0217

Scoring
• Phraseness
  • Measures the degree to which a given n-gram can be considered a phrase
  • Based on co-occurrence of words
  • Example...
• Informativeness
  • Measures how informative a given n-gram is
  • Uninformative n-grams: "there is a", "a lot of", etc.
  • Compares co-occurrence in a general corpus vs. the given text (book)
• Total Score
  • Phraseness-Score + Informativeness-Score
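The total score is simply the sum of the two measures per n-gram. The sketch below shows that combination step only; the function name `total_scores` and all the numbers are made up for illustration and do not come from the slides.

```python
def total_scores(phraseness, informativeness):
    """Add the two scores for every n-gram that has both."""
    shared = phraseness.keys() & informativeness.keys()
    return {g: phraseness[g] + informativeness[g] for g in shared}

# Illustrative values only (assumed, not measured on any corpus).
phraseness = {"user profiles": 0.020, "there is": 0.015}
informativeness = {"user profiles": 0.012, "there is": -0.004}

totals = total_scores(phraseness, informativeness)
for phrase, score in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phrase} : {score:.4f}")
```

A phrase-like but uninformative n-gram such as "there is" can still score well on phraseness; the informativeness term is what pulls its total down.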

Scoring - Phraseness
• Computed by measuring the distance between the unigram model and the N-gram model
• Pointwise KL-divergence (Tomokiyo and Hurst, 2003): $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
• Phraseness measure: $\delta_w(LM_{fg}^{N} \,\|\, LM_{fg}^{1})$
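To make the phraseness measure concrete, here is a small sketch that applies pointwise KL-divergence to a toy text. It assumes unsmoothed maximum-likelihood estimates and takes the unigram model's probability of an n-gram to be the product of its word probabilities. Both are assumptions of this sketch, and the function names are illustrative.

```python
import math
from collections import Counter

def pointwise_kl(p, q):
    """delta_w(p || q) = p(w) * log(p(w) / q(w))."""
    return p * math.log(p / q)

def phraseness(tokens, n=2):
    """Score each n-gram: foreground n-gram model vs. foreground unigram model."""
    unigrams = Counter(tokens)
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total_uni = sum(unigrams.values())
    total_n = sum(grams.values())
    scores = {}
    for gram, count in grams.items():
        p_ngram = count / total_n                                     # N-gram model
        p_unigram = math.prod(unigrams[w] / total_uni for w in gram)  # unigram model
        scores[" ".join(gram)] = pointwise_kl(p_ngram, p_unigram)
    return scores

# Toy text (made up); bigrams keep the example small.
toy = "web pages link to web pages and users visit web pages".split()
scores = phraseness(toy, n=2)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In this toy text "web pages" occurs far more often as a unit than its word frequencies alone would predict, so it receives the highest phraseness score.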

Scoring - Informativeness
• Computed by measuring the distance between the n-gram model from the given data and the n-gram model from general data
• Pointwise KL-divergence (Tomokiyo and Hurst, 2003): $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
• Informativeness measure: $\delta_w(LM_{fg}^{1} \,\|\, LM_{bg}^{1})$
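A sketch of the informativeness measure as written in the formula, that is, with unigram models for both the foreground (the book) and the background (a general corpus). The add-one smoothing over a shared vocabulary is an assumption of this sketch so that unseen words do not produce zero probabilities; the slides do not state which smoothing was used. The two toy texts are made up.

```python
import math
from collections import Counter

def unigram_model(tokens, vocab):
    """Add-one smoothed unigram probabilities over a shared vocabulary (assumed)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def informativeness(phrase, fg, bg):
    """Pointwise KL between foreground and background unigram models."""
    words = phrase.split()
    p = math.prod(fg[w] for w in words)  # probability under the book's model
    q = math.prod(bg[w] for w in words)  # probability under the general model
    return p * math.log(p / q)

foreground = "user profiles capture interests and user profiles guide search".split()
background = ("there is a lot of text and there is a lot of search "
              "in a lot of text").split()
vocab = set(foreground) | set(background)
fg = unigram_model(foreground, vocab)
bg = unigram_model(background, vocab)

informative = informativeness("user profiles", fg, bg)
generic = informativeness("a lot", fg, bg)
print(informative > 0, generic < 0)
```

A phrase that is common in the book but rare in general text scores positive, while a generic phrase such as "a lot" scores negative.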


Extracting Keyphrases from Books: Pruning
Input text: Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited
After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
Extracted keyphrases (ranked):
• profiles explicit specification
• explicit specification interests
• specification interests automatic
• user profiles explicit
• construct user profiles
• interests automatic analysis
• topics construct user
• automatic analysis web
• web pages visited
• analysis web pages
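The slides do not spell out the pruning rule. Assuming it means ranking candidates by total score and keeping the top K, it could be sketched as follows. The scores are the ones shown on the scoring slide; `top_k` and the choice of K = 5 are illustrative assumptions.

```python
# Scores as shown on the scoring slide of this deck.
scores = {
    "profiles explicit specification": 0.0281,
    "explicit specification interests": 0.0281,
    "specification interests automatic": 0.0272,
    "user profiles explicit": 0.0260,
    "construct user profiles": 0.0260,
    "interests automatic analysis": 0.0255,
    "topics construct user": 0.0243,
    "automatic analysis web": 0.0227,
    "web pages visited": 0.0226,
    "analysis web pages": 0.0217,
}

def top_k(scores, k):
    """Rank phrases by score (highest first) and keep the first k."""
    # sorted() is stable, so tied phrases keep their original order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in ranked[:k]]

keyphrases = top_k(scores, k=5)  # k=5 is an arbitrary choice for this sketch
for phrase in keyphrases:
    print(phrase)
```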

Conclusions and Future Work
• Discussed the benefits of keyphrases in the ULIB context
• Demonstrated the building of a KPE that works for books
• Robust evaluation
  • Building a test set from books in ULIB for generic, robust evaluation of KPE tools
• Are chapters really independent in a book?
  • Revisit the assumption

Thank you

References
[1] Fred J. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 29(4):433-447, 1993.
[2] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148-155. ACM Press, 1998.
[3] Min Song, Il-Yeol Song, and Xiaohua Hu. KPSpotter: a flexible information gain-based keyphrase extraction system. In WIDM '03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pages 50-53, New York, NY, USA, 2003. ACM Press.
[4] Takashi Tomokiyo and Matthew Hurst. A language modeling approach to keyphrase extraction. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 33-40, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[5] P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-336, 2000.
[6] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In E. A. Fox and N. Rowe, editors, Proceedings of Digital Libraries 99: The Fourth ACM Conference on Digital Libraries, pages 254-255. ACM Press, 1999.
[7] Mikio Yamamoto and Kenneth W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1-30, 2001.
