1 / 29

Introduction to Text Mining

Introduction to Text Mining. By Soumyajit Manna 11/10/08. Outline. Text Mining Definition Text Mining Application Text Characteristics Text Mining Process Future of text mining. Text Mining Definition.

thorntonj
Télécharger la présentation

Introduction to Text Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to Text Mining By Soumyajit Manna 11/10/08

  2. Outline • Text Mining Definition • Text Mining Application • Text Characteristics • Text Mining Process • Future of text mining

  3. Text Mining Definition • “The non trivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data”. • An exploration and analysis of textual (natural-language) data by automatic and semi automatic means to discover new knowledge. • What is “previously unknown”information ? • Strict definition • Information that not even the writer knows. • e.g., Discovering a new method for a hair growth that is described as a side effect for a different procedure • Lenient definition • Rediscover the information that the author encoded in the text • e.g., Automatically extracting a product’s name from a web-page.

  4. Definition Cont… • Then the question arises Is Text mining is similar to that of Data mining ? or Can we implement the Data Mining technique for Text Mining?

  5. Answer • Structured Data : The data that will be used are clearly described over a range of all possibilities or can be described by a spreadsheet. Types: 1. Order Numerical: Values where greater than and less than comparisons have meaning. 2. Categorical : The values that can be measured as true or false. Typical data mining application uses structured data. • Unstructured Data: The above criteria does not fulfill (Text Mining).

  6. Answer Contd... • The classical data mining technique is implemented by transforming text into numerical data and then putting it into the spreadsheet.

  7. Text Mining Applications • Marketing: Discover distinct groups of potential buyers according to a user text based profile • e.g. Amazon • Industry: Identifying groups of competitors web pages • e.g., competing products and their prices • Job seeking: Identify parameters in searching for jobs • e.g., www.flipdog.com

  8. Text Mining Methods • Document Classification (Web Mining) • Indexing and retrieval of textual documents and extraction of partial knowledge using the web • Information Extraction • Extraction of partial knowledge in the text • Information Retrieval • Indexing and retrieval of textual documents • Clustering • Generating collections of similar text documents

  9. Document Classification • Purest embodiment of spreadsheet model with labeled answers • Documents organized into folders, one folder for each topic. • The application is almost always binary classification because a document can appear in multiple folder. • The problem is considered by the form of indexing like the index of book. Household Household vs. ~Household New Document Finance Finance vs. ~Finance School vs. ~School School

  10. Information Retrieval • Given: • A source of textual documents • A user query (text based) • Find: • A set (ranked) of documents that are relevant to the query Document Collection Document Collection Document Collection Document Collection Document Collection Test Document IR System Match Documents Query E.g. Spam / Text

  11. Intelligent Information Retrieval • Meaning of words • Synonyms “buy” / “purchase” • Ambiguity “bat” (baseball vs. mammal) • Order of words in the query • hot dog stand in the amusement park • hot amusement stand in the dog park • User dependency for the data • direct feedback • indirect feedback • Authorityof the source • IBM is more likely to be an authorized source then my second far cousin

  12. Information Extraction • Given: • A source of textual documents • A well defined limited query (text based) • Find: • Sentences with relevant information • Extract the relevant information and ignore non-relevant information (important!) • Link related information and output in a predetermined format

  13. Information Extraction Model • Query 1 • (E.g. revenue) • Query 2 • (E.g. profit) Document Source Sorted Data Extraction System Combine Query Result

  14. Information Extraction Example. • ..on revenues of twenty five million dollars, the company reported a profited a profit of 4.5 million for the fiscal year Input Documents

  15. Clustering • Given: • A source of textual documents • Similarity measure • e.g., how many words are common in these documents • Find: • Several clusters of documents that are relevant to each other

  16. Clustering Model Group1 Group2 Group3 Group4 Group5 Document Document Document Document Organizer

  17. Text Characteristics • Large textual data base • High dimensionality • Several input modes • Dependency • Ambiguity • Noisy data • Not well structured text

  18. Text Characteristics Cont.. • Large textual data base • Efficiency consideration • over 2,000,000,000 web pages • almost all publications are also in electronic form • High dimensionality (Sparse input) • Consider each word/phrase as a dimension • Several input modes • e.g., Web mining: information about user is generated by semantics, browse pattern and outside knowledgebase.

  19. Text Characteristics Cont.. • Dependency • relevant information is a complex conjunction of words/phrases • e.g., Document categorization. Pronoun disambiguation. • Ambiguity • Word ambiguity • Pronouns (he, she …) • “buy”, “purchase” • Semantic ambiguity • The king saw the rabbit with his glasses. (8 meanings)

  20. Text Characteristics Cont.. • Noisy data • Example: Spelling mistakes • Not well structured text • Chat rooms • “r u available ?” • “Hey whazzzzzz up” • Speech

  21. Text Mining Process

  22. Text Mining Process Cont.. • Text preprocessing • Syntactic/Semantic text analysis • Features Generation • Bag of words • Features Selection • Simple counting • Statistics • Text/Data Mining • Classification- Supervised learning • Clustering- Unsupervised learning • Analyzing results

  23. Text preprocessing • Part Of Speech (pos) tagging • Find the corresponding pos for each word e.g., John (noun) gave (verb) the (det) ball (noun) • ~98% accurate. • Word sense disambiguation • Context based or proximity based • Very accurate • Parsing • Generates a parse tree (graph) for each sentence • Each sentence is a stand alone graph

  24. Features Generation • Text document is represented by the words it contains (and their occurrences) • e.g., “Lord of the rings”  {“the”, “Lord”, “rings”, “of”} • Highly efficient • Makes learning far simpler and easier • Order of words is not that important for certain applications • Stemming: identifies a word by its root • e.g., flying, flew  fly • Reduce dimensionality • Stop words: The most common words are unlikely to help text mining • e.g., “the”, “a”, “an”, “you” …

  25. Features Generation with XML • Current keyword-oriented search engines cannot handle rich queries like • Find all books authored by “Scooby-Doo”. • XML: Extensible Markup Language • XML documents have a nested structure in which each element is associated with a tag. • Tags describe the semantics of elements. <book><title> The making of a bad movie </title> <author><name> Scooby-Doo </name> <affiliation> Cartoons </affiliation></author> </book>

  26. Feature Selection • Reduce dimensionality • Learners have difficulty addressing tasks with high dimensionality • Irrelevant features • Not all features help! • e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”

  27. Challenges of Text Mining • Access to raw text in gated collections (ie, collections which require payment to permit access to resources) . • Tools that are too difficult for non-programmers to use. • Questions relating to the validity of text mining as a technique for drawing legitimate conclusions.

  28. Future Of Text Mining • Develop focused, easy-to-use tools that bridge the gap between computer programmers and humanities researchers • Different tools and data, but common dimensions • Example: • “Find sales trends by product and correlate with occurrences of company name in business news articles” • Dimensions: Time, Company names (or stock symbols), Product names, Regions

  29. Thanks Questions ??

More Related