
Introduction to Text Mining



Presentation Transcript


  1. Introduction to Text Mining. ChengXiang (“Cheng”) Zhai, Department of Computer Science, Graduate School of Library & Information Science, Statistics, and Institute for Genomic Biology, University of Illinois, Urbana-Champaign

  2. Outline • Overview of Text Mining • IR-Style Text Mining Techniques • NLP-Style Text Mining Techniques • ML-Style Text Mining Techniques

  3. Two Definitions of “Mining” • Goal-oriented (effectiveness driven, NLP, AI) • Any process that generates useful results that are non-obvious is called “mining”. • Keywords: “useful” + “non-obvious” • Data isn’t necessarily massive • Method-oriented (efficiency driven, DB, IR) • Any process that involves extracting information from massive data is called “mining” • Keywords: “massive” + “pattern” • Patterns aren’t necessarily useful

  4. What is Text Mining? • Data Mining View: Explore patterns in textual data • Find latent topics • Find topical trends • Find outliers and other hidden patterns • Natural Language Processing View: Make inferences based on partial understanding of natural language text • Information extraction • Question answering

  5. Applications of Text Mining • Direct applications • Discovery-driven (Bioinformatics, Business Intelligence, etc): We have specific questions; how can we exploit data mining to answer the questions? • Data-driven (WWW, literature, email, customer reviews, etc): We have a lot of data; what can we do with it? • Indirect applications • Assist information access (e.g., discover latent topics to better summarize search results) • Assist information organization (e.g., discover hidden structures)

  6. Text Mining Methods • Data Mining Style: View text as high dimensional data • Frequent pattern finding • Association analysis • Outlier detection • Information Retrieval Style: Fine granularity topical analysis • Topic extraction • Exploit term weighting and text similarity measures • Question answering • Natural Language Processing Style: Information Extraction • Entity extraction • Relation extraction • Sentiment analysis • Machine Learning Style: Unsupervised or semi-supervised learning • Generative models • Dimension reduction • Classification & prediction

  7. IR-Style Techniques for Text Mining

  8. Some “Basic” IR Techniques • Stemming • Stop words • Weighting of terms (e.g., TF-IDF) • Vector/Unigram representation of text • Text similarity (e.g., cosine, KL-div) • Relevance/pseudo feedback (e.g., Rocchio)
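
The term weighting, vector representation, and cosine similarity listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the stop-word list, the raw-TF times log(N/df) weighting variant, and the sample documents are all assumptions, and stemming is left out.

```python
import math
from collections import Counter

# Tiny stop-word list for illustration only (an assumption, not from the slides).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (raw TF * log(N/df))."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical sample documents.
docs = [
    "Text mining finds patterns in text.",
    "Mining text for latent topics.",
    "The stock market fell today.",
]
vecs = tfidf_vectors(docs)
```

With these sample documents, `cosine(vecs[0], vecs[1])` is positive because the two share "text" and "mining", while `cosine(vecs[0], vecs[2])` is zero because they share no terms after stop-word removal.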

  9. Generality of Basic Techniques [Figure: pipeline from raw text through stemming & stop words to tokenized text, then term weighting into a document-term weight matrix (documents d1 … dm, terms t1 … tn, weights w11 … wmn); term similarity, doc similarity, vector centroid, and sentence selection built on it support clustering, categorization, meta-data/annotation, and summarization]

  10. Sample Applications • Information Filtering • Text Categorization • Document/Term Clustering • Text Summarization

  11. Information Filtering • Stable & long-term interest, dynamic info source • System must make a delivery decision immediately as a document “arrives” • Two methods: content-based vs. collaborative [Figure: a filtering system driven by “my interest”]
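
A content-based filter of the kind described here can be sketched as a stable interest profile plus an immediate per-document decision. The class name `ContentFilter`, the raw term-count vectors, and the threshold of 0.2 are hypothetical choices made for illustration; collaborative filtering is not covered.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (no stemming or stop-word removal)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

class ContentFilter:
    """Hypothetical content-based filter: deliver if similar enough to the profile."""

    def __init__(self, interest_text, threshold=0.2):
        self.profile = vectorize(interest_text)  # stable, long-term interest
        self.threshold = threshold               # assumed value, tune per task

    def decide(self, document):
        """Make the delivery decision as soon as a document arrives."""
        return cosine(self.profile, vectorize(document)) >= self.threshold

f = ContentFilter("text mining information retrieval")
stream = ["new results in text mining", "local football scores tonight"]
delivered = [d for d in stream if f.decide(d)]
```

Here only the first document in the hypothetical stream is delivered, since the second shares no terms with the profile.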

  12. Examples of Information Filtering • News filtering • Email filtering • Recommender systems • Literature alert • And many others

  13. Sample Applications • Information Filtering • Text Categorization • Document/Term Clustering • Text Summarization

  14. Text Categorization • Pre-given categories and labeled document examples (categories may form a hierarchy) • Classify new documents • A standard supervised learning problem [Figure: a categorization system assigning documents to categories such as Sports, Business, Education, Science]
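
Since categorization is framed as standard supervised learning, one common baseline is a multinomial Naive Bayes classifier. The slide does not name a specific classifier, so the choice of Naive Bayes, the add-one smoothing, and the four training sentences below are all assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, labeled_docs):
        self.class_counts = Counter()            # documents per category
        self.term_counts = defaultdict(Counter)  # term counts per category
        self.vocab = set()
        for text, label in labeled_docs:
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Terms never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labeled examples for two of the categories shown on the slide.
train = [
    ("the team won the match", "Sports"),
    ("the striker scored a goal", "Sports"),
    ("the company reported quarterly profit", "Business"),
    ("shares rose after the merger", "Business"),
]
clf = NaiveBayes().fit(train)
```

With this toy training set, `clf.predict("the team scored")` yields "Sports" and `clf.predict("profit from the merger")` yields "Business".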

  15. Examples of Text Categorization • News article classification • Meta-data annotation • Automatic Email sorting • Web page classification

  16. Sample Applications • Information Filtering • Text Categorization • Document/Term Clustering • Text Summarization

  17. The Clustering Problem • Discover “natural structure” • Group similar objects together • Objects can be documents, terms, or passages • Example
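
Grouping similar objects together can be illustrated with a single-pass clustering sketch: each document joins the most similar existing cluster, or starts a new one if nothing is similar enough. The slides do not prescribe an algorithm, so single-pass clustering, the cosine threshold of 0.3, and the sample texts are assumptions.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass_cluster(texts, threshold=0.3):
    """Assign each text to its most similar cluster, or open a new cluster."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(texts):
        vec = Counter(text.lower().split())
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            # Keep a summed vector; cosine is scale-invariant, so no averaging needed.
            best["centroid"].update(vec)
            best["members"].append(i)
    return [c["members"] for c in clusters]

# Hypothetical documents on two topics.
texts = [
    "text mining topics",
    "football match results",
    "mining text for topics",
    "football league results",
]
groups = single_pass_cluster(texts)
```

On these four texts the result is two groups, `[[0, 2], [1, 3]]`. The outcome of single-pass clustering depends on document order and on the threshold, which is one reason other clustering methods are often preferred in practice.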

  18. Similarity-induced Structure

  19. Examples of Doc/Term Clustering • Clustering of retrieval results • Clustering of documents in the whole collection • Term clustering to define “concept” or “theme” • Automatic construction of hyperlinks • In general, very useful for text mining

  20. Sample Applications • Information Filtering • Text Categorization • Document/Term Clustering • Text Summarization

  21. “Retrieval-based” Summarization • Observation: term vector → summary? • Basic approach • Rank “sentences”, and select top N as a summary • Methods for ranking sentences • Based on term weights • Based on position of sentences • Based on the similarity of sentence and document vector
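
The third ranking method on this slide, similarity between each sentence vector and the document vector, can be sketched as follows. The regex-based sentence splitter, the raw term counts, and the sample text are simplifying assumptions; term-weight and position-based ranking are not shown.

```python
import math
import re
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def summarize(text, n=1):
    """Return the n sentences most similar to the whole-document term vector."""
    # Naive sentence split on end punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    doc_vector = Counter()
    for v in vectors:
        doc_vector.update(v)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(vectors[i], doc_vector),
                    reverse=True)
    top = sorted(ranked[:n])  # restore original document order
    return [sentences[i] for i in top]

# Hypothetical three-sentence document.
text = ("Text mining explores patterns in text. "
        "Text mining also supports summarization. "
        "Lunch was late.")
summary = summarize(text, n=1)
```

For the sample text, the first sentence is selected because it overlaps most with the document's dominant terms, and the off-topic third sentence ranks last.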

  22. Examples of Summarization • News summary • Summarize retrieval results • Single doc summary • Multi-doc summary • Summarize a cluster of documents (automatic label creation for clusters)

  23. NLP-Style Text Mining Techniques Most of the following slides are from William Cohen’s IE tutorial

  24. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering
      Source text: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
      Segmented and classified mentions: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
      Extracted records:
      NAME | TITLE | ORGANIZATION
      Bill Gates | CEO | Microsoft
      Bill Veghte | VP | Microsoft
      Richard Stallman | founder | Free Soft..

  25. Landscape of IE Tasks: Complexity. E.g. word patterns: • Closed set: U.S. states (“He was born in Alabama…”, “The big Wyoming sky…”) • Regular set: U.S. phone numbers (“Phone: (413) 545-1323”, “The CALD main office can be reached at 412-268-1299”) • Complex pattern: U.S. postal addresses (“University of Arkansas P.O. Box 140 Hope, AR 71802”, “Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210”) • Ambiguous patterns, needing context and many sources of evidence: person names (“…was among the six houses sold by Hope Feldman that year.”, “Pawel Opalinski, Software Engineer at WhizBang Labs.”)
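
The two simplest cases on this slide, a regular set (U.S. phone numbers) and a closed set (U.S. states), can be handled with a regular expression and an enumerated lexicon. The pattern below is an assumption written only to cover the two phone formats shown on the slide, and the state list is deliberately truncated to four names.

```python
import re

# Matches "(413) 545-1323" and "412-268-1299", the two formats on the slide.
# Other real-world formats (dots, country codes, extensions) are not handled.
PHONE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}-)\d{3}-\d{4}\b")

# A closed set can simply be enumerated; only four states are listed for brevity.
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}
STATE = re.compile(r"\b(?:" + "|".join(sorted(US_STATES)) + r")\b")

def extract(text):
    """Return all phone numbers and state names found in the text."""
    return {"phones": PHONE.findall(text), "states": STATE.findall(text)}

sample = ("Phone: (413) 545-1323. The CALD main office can be reached at "
          "412-268-1299. He was born in Alabama. The big Wyoming sky.")
result = extract(sample)
```

The complex and ambiguous cases do not yield to this approach: "Hope" is a town in one slide example and a first name in another, which is why those tasks need context and multiple sources of evidence.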

  26. Landscape of IE Techniques (each illustrated on the sentence “Abraham Lincoln was born in Kentucky.”) • Lexicons: is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)? • Classify pre-segmented candidates: a classifier decides which class each candidate belongs to • Sliding window: a classifier decides which class each window belongs to; try alternate window sizes • Boundary models: classifiers mark BEGIN and END boundaries • Finite state machines: find the most likely state sequence • Context free grammars: find the most likely parse (tags and constituents such as NNP, V, P, NP, PP, VP, S) • Any of these models can be used to capture words, formatting or both.
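
The sliding-window idea can be sketched by enumerating every token span up to a maximum size and handing each one to a classifier. In this sketch the "classifier" is just a lexicon lookup, which combines two of the techniques on the slide; the function names, the lexicon contents, and the labels are hypothetical, and a trained classifier would normally take the lookup's place.

```python
def sliding_windows(tokens, max_size=3):
    """Yield every contiguous token span of length 1..max_size as a string."""
    for size in range(1, max_size + 1):  # try alternate window sizes
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])

def extract(sentence, classify, max_size=3):
    """Apply a span classifier to every window; keep spans that get a label."""
    tokens = sentence.rstrip(".").split()
    results = []
    for span in sliding_windows(tokens, max_size):
        label = classify(span)
        if label is not None:
            results.append((span, label))
    return results

# Hypothetical lexicon standing in for a trained classifier.
LEXICON = {"Kentucky": "LOCATION", "Abraham Lincoln": "PERSON"}

def lexicon_classifier(span):
    """Return the label for a span if it is a lexicon member, else None."""
    return LEXICON.get(span)

found = extract("Abraham Lincoln was born in Kentucky.", lexicon_classifier)
```

Windows are visited from smallest to largest, so the one-token match "Kentucky" is reported before the two-token match "Abraham Lincoln".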

  27. Statistical Learning Style Techniques for Text Mining

  28. Many Techniques are Available • Supervised learning • Classification • Regression • Unsupervised learning • Topic models • Dimension reduction • Most relevant methods • Generative models • Matrix decomposition

  29. Topics for Discussion • Social Science research questions: • Mining bias: selection bias, framing bias • Text Mining techniques • Sentiment analysis • Topic discovery and evolution graph • Joint text-image analysis
