
Extracting Keyphrases from Books using Language Modeling Approaches

Rohini U, AOL India R&D, Bangalore, India (Rohini.uppuluri@corp.aol.com). Vamshi Ambati, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA (vamshi@cs.cmu.edu).


Presentation Transcript


  1. Extracting Keyphrases from Books using Language Modeling Approaches • Rohini U, AOL India R&D, Bangalore, India (Rohini.uppuluri@corp.aol.com) • Vamshi Ambati, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA (vamshi@cs.cmu.edu)

  2. Agenda • Keyphrase Extraction • Value addition to Digital Libraries • Methods of Keyphrase Extraction • Related Work • Our Solution

  3. What are Keyphrases? • Keyphrases • (Give example) • Where used? • Cataloguing in Libraries for IR purposes • Quick Summarization of documents

  4. Why important to ULIB? • Vast growth in digital content • More than a million books! • Short metadata description, useful to the user while reading • For further processing of books, such as summarization, IR, etc.

  5. How do we extract KPs? • Manual entry • Reliable, high quality outcome • But, time-consuming, expensive • Automatic • Fast extraction but less reliable • No expense at all

  6. Automatic techniques for KPE • Rule-based methods • Heuristics (paragraph beginning, headline, etc.) • Krulwich & Burkey, etc. • Using linguistic tools • Statistical techniques • Term counts and weighting-based methods • Learn model from training data • Turney [5], KEA [6], KPSpotter [3], etc.

  7. Requirements for a KPE for ULIB • Automatic Identification of Keyphrases from chapters of books • Language independent • Easily adaptable for different domains • No training data to learn from • Most books in ULIB do not have keywords as part of the metadata

  8. Solution Outline • Language Modeling based • Given n-grams • Measure Informativeness, Phraseness • Score n-grams based on the above measures • Pick top K phrases as Keyphrases

  9. Extracting Keyphrases from Books • Pipeline: Text → Cleaning & Initialization → Candidate Keyphrases Extraction → Scoring → Pruning → Extracted Keyphrases

  10. Extracting Keyphrases from Books • Pipeline: Cleaning & Initialization → Candidate Keyphrases Extraction → Scoring → Pruning → Extracted Keyphrases • Input text: "Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited" • After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
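
The cleaning step on this slide can be sketched as lowercasing, tokenizing, and dropping stop words. This is only a minimal sketch: the slides do not say which stop-word list or tokenizer was used, so `STOPWORDS` and `clean` below are hypothetical and chosen to reproduce the slide's example.

```python
import re

# Hypothetical stop-word list (an assumption); the slides do not name the one used.
STOPWORDS = {"a", "an", "the", "are", "also", "used", "to", "via", "of", "or", "and", "in", "is"}

def clean(text, stopwords=STOPWORDS):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]

sentence = ("Topics are also used to construct user profiles via explicit "
            "specification of interests or automatic analysis of Web pages visited")
tokens = clean(sentence)
print(" ".join(tokens))
```

With this particular stop-word list, the output matches the cleaned token sequence shown on the slide; a different list would give a different sequence.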

  11. Extracting Keyphrases from Books • Pipeline: Cleaning & Initialization → Candidate Keyphrases Extraction → Scoring → Pruning → Extracted Keyphrases • Input text: "Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited" • After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited • Candidate keyphrases: {topics construct user, construct user profiles, user profiles explicit, profiles explicit specification, explicit specification interests, specification interests automatic, interests automatic analysis, automatic analysis web, analysis web pages, web pages visited}
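
Candidate extraction as shown on this slide amounts to sliding a fixed-size window over the cleaned tokens. The sketch below assumes trigrams, because the slide's example lists only three-word candidates; the helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n=3):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Cleaned token sequence from the slide's example.
tokens = ("topics construct user profiles explicit specification "
          "interests automatic analysis web pages visited").split()

candidates = ngrams(tokens, n=3)
for c in candidates:
    print(c)
```

Twelve tokens yield ten trigram candidates, from "topics construct user" through "web pages visited".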

  12. Extracting Keyphrases from Books • Pipeline: Cleaning & Initialization → Candidate Keyphrases Extraction → Scoring → Pruning → Extracted Keyphrases • Input text: "Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited" • After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited • Scored candidates: profiles explicit specification: 0.0281; explicit specification interests: 0.0281; specification interests automatic: 0.0272; user profiles explicit: 0.0260; construct user profiles: 0.0260; interests automatic analysis: 0.0255; topics construct user: 0.0243; automatic analysis web: 0.0227; web pages visited: 0.0226; analysis web pages: 0.0217

  13. Scoring • Phraseness • Measures the degree to which a given n-gram can be considered a phrase • Based on co-occurrence of words • Example.. • Informativeness • Measures how informative a given n-gram is • Uninformative examples: "there is a", "a lot of", etc. • Compares occurrence in a general corpus vs. the given text (book) • Total Score • Phraseness-Score + Informativeness-Score

  14. Scoring - Phraseness • Computed by measuring the distance between the unigram model and the n-gram model • Pointwise KL-divergence (Tomokiyo and Hurst, 2003): $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$ • Phraseness measure: $\delta_w(LM_{fg}^{N} \,\|\, LM_{fg}^{1})$
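
The phraseness measure on this slide can be sketched as follows, under stated assumptions: both language models are unsmoothed maximum-likelihood estimates from raw counts, and the unigram model scores a phrase as the product of its word probabilities. The slides do not specify the estimation details, and the toy text and function names are hypothetical.

```python
import math
from collections import Counter

def pointwise_kl(p, q):
    """Pointwise KL-divergence term: p * log(p / q)."""
    return p * math.log(p / q)

def phraseness(phrase, ngram_counts, unigram_counts):
    """Compare the n-gram model's probability of a phrase with the unigram model's.

    Assumes the phrase and all of its words were seen (no smoothing).
    """
    n_total = sum(ngram_counts.values())
    u_total = sum(unigram_counts.values())
    p = ngram_counts[phrase] / n_total  # probability under the n-gram model
    q = math.prod(unigram_counts[w] / u_total for w in phrase.split())  # unigram model
    return pointwise_kl(p, q)

# Toy foreground text (hypothetical), scored with bigrams for brevity.
tokens = "web pages visited web pages ranked".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1))

for phrase in ("web pages", "visited web"):
    print(phrase, round(phraseness(phrase, bigram_counts, unigram_counts), 4))
```

In this toy text, "web pages" occurs together more often than its individual word frequencies would predict, so it scores higher than the incidental bigram "visited web".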

  15. Scoring - Informativeness • Computed by measuring the distance between the n-gram model from the given data and the n-gram model from general data • Pointwise KL-divergence (Tomokiyo and Hurst, 2003): $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$ • Informativeness measure: $\delta_w(LM_{fg}^{1} \,\|\, LM_{bg}^{1})$
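
A rough sketch of the informativeness measure on this slide, comparing a foreground (book) unigram model with a background (general corpus) unigram model. Two assumptions are not from the slides: add-one smoothing, so that words missing from one corpus still get a nonzero probability, and scoring a phrase as the product of its unigram probabilities. The toy corpora and function names are hypothetical.

```python
import math
from collections import Counter

def unigram_prob(word, counts, vocab_size):
    """Add-one smoothed unigram probability (smoothing is an assumption)."""
    return (counts[word] + 1) / (sum(counts.values()) + vocab_size)

def informativeness(phrase, fg_counts, bg_counts):
    """Pointwise KL term between foreground and background unigram models."""
    vocab_size = len(set(fg_counts) | set(bg_counts))
    words = phrase.split()
    p = math.prod(unigram_prob(w, fg_counts, vocab_size) for w in words)  # foreground
    q = math.prod(unigram_prob(w, bg_counts, vocab_size) for w in words)  # background
    return p * math.log(p / q)

# Toy corpora (hypothetical): fg stands in for the book, bg for general text.
fg_counts = Counter("user profiles user profiles there is a".split())
bg_counts = Counter("there is a there is a lot of user".split())

for phrase in ("user profiles", "there is a"):
    print(phrase, round(informativeness(phrase, fg_counts, bg_counts), 4))
```

Here "user profiles" is more frequent in the foreground than in the background and gets a positive score, while "there is a" is more typical of the background and gets a negative one.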

  16. Extracting Keyphrases from Books • Pipeline: Cleaning & Initialization → Candidate Keyphrases Extraction → Scoring → Pruning → Extracted Keyphrases • Input text: "Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited" • After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited • Scored candidates: profiles explicit specification: 0.0281; explicit specification interests: 0.0281; specification interests automatic: 0.0272; user profiles explicit: 0.0260; construct user profiles: 0.0260; interests automatic analysis: 0.0255; topics construct user: 0.0243; automatic analysis web: 0.0227; web pages visited: 0.0226; analysis web pages: 0.0217

  17. Extracting Keyphrases from Books • Pipeline: Cleaning & Initialization → Candidate Keyphrases Extraction → Scoring → Pruning → Extracted Keyphrases • Input text: "Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited" • After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited • Extracted keyphrases (in ranked order): profiles explicit specification, explicit specification interests, specification interests automatic, user profiles explicit, construct user profiles, interests automatic analysis, topics construct user, automatic analysis web, web pages visited, analysis web pages
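
The final ranking and "pick top K" step can be sketched as a sort over the total scores. The scores below are copied from the scoring slide (with the garbled spellings repaired); the function name `top_k` and the choice of K are hypothetical, and the slides do not describe any further pruning rules.

```python
# Total scores (phraseness + informativeness) as listed on the scoring slide.
scores = {
    "profiles explicit specification": 0.0281,
    "explicit specification interests": 0.0281,
    "specification interests automatic": 0.0272,
    "user profiles explicit": 0.0260,
    "construct user profiles": 0.0260,
    "interests automatic analysis": 0.0255,
    "topics construct user": 0.0243,
    "automatic analysis web": 0.0227,
    "web pages visited": 0.0226,
    "analysis web pages": 0.0217,
}

def top_k(scores, k):
    """Return the k highest-scoring phrases; ties keep their original order."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

for phrase in top_k(scores, k=3):
    print(phrase, scores[phrase])
```

Python's sort is stable, so phrases with equal scores stay in the order they were listed.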

  18. Conclusions and Future Work • Discussed benefits of keyphrases in the ULIB context • Demonstrated the building of a KPE that works for books • Robust evaluation • Building a test set from books in ULIB for generic, robust evaluation of KPE tools • Are chapters really independent in a book? • Revisit the assumption

  19. Thank you

  20. References • Fred J. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 29(4):433-447, 1993. • S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148-155. ACM Press, 1998. • Min Song, Il-Yeol Song, and Xiaohua Hu. KPSpotter: a flexible information gain-based keyphrase extraction system. In WIDM '03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pages 50-53, New York, NY, USA, 2003. ACM Press. • Takashi Tomokiyo and Matthew Hurst. A language modeling approach to keyphrase extraction. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 33-40, Morristown, NJ, USA, 2003. Association for Computational Linguistics. • P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-336, 2000. • I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In E. A. Fox and N. Rowe, editors, Proceedings of Digital Libraries 99: The Fourth ACM Conference on Digital Libraries, pages 254-255. ACM Press, 1999. • Mikio Yamamoto and Kenneth W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1-30, 2001.
