Text Mining with Machine Learning Techniques

Ping-Tsun Chang, Intelligent Systems Laboratory, Computer Science and Information Engineering, National Taiwan University. Text Analysis. Summarization. Classification. Feature Selection. Language Identification. Clustering. Text Mining.


Presentation Transcript


  1. Text Mining with Machine Learning Techniques. Ping-Tsun Chang, Intelligent Systems Laboratory, Computer Science and Information Engineering, National Taiwan University

  2. Text Analysis • Summarization • Classification • Feature Selection • Language Identification • Clustering

  3. Text Mining • Text mining is about looking for patterns in natural language text • Natural Language Processing • It may be defined as the process of analyzing text to extract information from it for particular purposes. • Information Extraction • Information Retrieval

  4. Text Mining and Knowledge Management • A recent study indicated that 80% of a company's information is contained in text documents • emails, memos, customer correspondence, and reports • The ability to distil this untapped source of information gives a company a substantial competitive advantage in the era of the knowledge-based economy.

  5. Text Mining Applications • Customer profile analysis • mining incoming emails for customers' complaints and feedback. • Patent analysis • analyzing patent databases for major technology players, trends, and opportunities. • Information dissemination • organizing and summarizing trade news and reports for personalized information services. • Company resource planning • mining a company's reports and correspondence for activities, status, and problems reported.

  6. Text Categorization: Problem Definition • Text categorization is the problem of automatically assigning predefined categories to free-text documents • Document classification • Web page classification • News classification

  7. Information Retrieval • Full text is hard to process, but it is a complete representation of the document • Logical view of documents • Models • Boolean Model • Vector Model • Probabilistic Model • Think of text as patterns?

  8. Evaluation [Venn diagram: the Retrieved and Relevant document sets, with regions labeled a, b, c, and d.]
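The Retrieved and Relevant sets on this slide are the usual basis for precision and recall. The following is a minimal sketch, assuming those two standard measures are what the slide evaluates; the function name and the document ids are hypothetical and do not come from the presentation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents both retrieved and relevant
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved.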

  9. Pattern Recognition • Sensing • Segmentation • Feature Extraction • Classification • Post-Processing • Decision

  10. Pattern Classification [Figure: classes C1 and C2 in a feature space with axes f1 and f2.]

  11. Machine Learning • Using computers to help us perform induction from large amounts of complex pattern data • Bayesian Learning • Instance-Based Learning • K-Nearest Neighbors • Neural Networks • Support Vector Machine

  12. Feature Selection (I) • Information Gain
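The slide's formula did not survive in the transcript. As a hedged illustration, the sketch below computes information gain in its common form for text categorization: the entropy of the class labels minus the expected entropy after splitting documents on whether a term is present. The toy documents and all function names are assumptions.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, term):
    """docs is a list of (set_of_terms, label) pairs."""
    labels = sorted({label for _, label in docs})

    def class_counts(subset):
        return [sum(1 for _, lab in subset if lab == label) for label in labels]

    with_term = [d for d in docs if term in d[0]]
    without_term = [d for d in docs if term not in d[0]]
    gain = entropy(class_counts(docs))
    # Subtract the weighted entropy of each non-empty partition.
    for subset in (with_term, without_term):
        if subset:
            gain -= len(subset) / len(docs) * entropy(class_counts(subset))
    return gain


# Hypothetical toy corpus.
docs = [
    ({"ball", "goal"}, "sports"),
    ({"ball", "team"}, "sports"),
    ({"vote", "law"}, "politics"),
    ({"vote", "team"}, "politics"),
]
ig_ball = information_gain(docs, "ball")
ig_team = information_gain(docs, "team")
print(ig_ball, ig_team)
```

A term that splits the classes perfectly ("ball" here) gets the highest gain, while a term spread evenly across classes ("team") gets none, so ranking terms by gain gives a simple feature selection rule.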

  13. Feature Selection (II) • Mutual Information • Chi-Square
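The formulas for these two measures are also missing from the transcript. The sketch below uses a widely used formulation based on a 2x2 term/category contingency table; the variable names a, b, c, d are assumptions and are not claimed to match the presentation's notation.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a term t and a category k.

    a: documents in k that contain t      b: documents outside k that contain t
    c: documents in k that lack t         d: documents outside k that lack t
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def pointwise_mutual_information(a, b, c, d):
    """log2 of P(t, k) / (P(t) * P(k)), estimated from the same counts."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term and category never co-occur
    return math.log2(a * n / ((a + c) * (a + b)))


# Term perfectly associated with the category vs. an independent term.
print(chi_square(2, 0, 0, 2), pointwise_mutual_information(2, 0, 0, 2))
print(chi_square(1, 1, 1, 1), pointwise_mutual_information(1, 1, 1, 1))
```

Both scores are zero when the term and the category are independent and grow as the association strengthens.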

  14. Weighting Scheme: TF‧IDF
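The slide names the weighting scheme without preserving its formula. Many TF‧IDF variants exist; the sketch below assumes one common choice, raw term frequency multiplied by log(N / document frequency). The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} dictionary."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


# Hypothetical toy corpus, already tokenized.
docs = [
    ["text", "mining", "text"],
    ["text", "learning"],
    ["machine", "learning"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Under this variant a term that occurs in every document gets weight zero, and rare terms are weighted up relative to frequent ones.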

  15. Similarity Evaluation • Cosine-like scheme [Figure: document vectors di and dj.]
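A minimal sketch of cosine similarity between two document vectors di and dj, assuming sparse {term: weight} dictionaries as the representation (that representation is an assumption, not something the slide specifies):

```python
import math


def cosine_similarity(di, dj):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * dj.get(t, 0.0) for t, w in di.items())
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_i * norm_j)


# Hypothetical weights.
di = {"text": 1.0, "mining": 1.0}
dj = {"text": 1.0}
print(cosine_similarity(di, dj))
```

Because the score depends only on the angle between the vectors, a long and a short document on the same topic can still score close to 1.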

  16. Machine Learning Approaches: Bayesian Classifier
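The slide body is not preserved, so the exact model is unknown. As an illustration only, here is a small multinomial naive Bayes text classifier with add-one (Laplace) smoothing, a common Bayesian baseline for text categorization. The class name, method names, and training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(count / n_docs)
            for term in doc:
                likelihood = (self.term_counts[label][term] + 1) / (total + len(self.vocab))
                score += math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data.
train_docs = [["goal", "team", "ball"], ["ball", "match"],
              ["vote", "law"], ["election", "vote"]]
train_labels = ["sports", "sports", "politics", "politics"]
model = NaiveBayesSketch().fit(train_docs, train_labels)
print(model.predict(["ball", "team"]), model.predict(["vote"]))
```

Working in log space avoids numeric underflow when documents contain many terms.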

  17. Machine Learning Approaches: kNN Classifier [Figure: a query document d whose class is unknown (?).]
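A minimal sketch of kNN classification for an unlabeled document d: rank the labeled documents by cosine similarity to d and take a majority vote among the top k. The similarity measure, the value of k, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_classify(query, training, k=3):
    """training is a list of (vector, label) pairs; returns the majority label."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled documents.
training = [
    ({"ball": 1.0, "goal": 1.0}, "sports"),
    ({"ball": 1.0, "team": 1.0}, "sports"),
    ({"vote": 1.0, "law": 1.0}, "politics"),
    ({"vote": 1.0, "party": 1.0}, "politics"),
]
print(knn_classify({"ball": 1.0, "team": 1.0}, training))
print(knn_classify({"vote": 1.0}, training))
```

kNN has no training phase beyond storing the labeled vectors, so all the cost is paid at query time.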

  18. Machine Learning Approaches: Support Vector Machine • Basic hypotheses: consistent hypotheses of the version space • Project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel operator K

  19. Comparison: SVM and Traditional Learners • Traditional learner • SVM accesses the hypothesis space! [Figure: distributions P(h), P(h|D1), and P(h|D1^D2) over hypotheses.]

  20. SVM Learning in Feature Spaces [Diagram: input space X mapped to feature space F, with an example.]

  21. Support Vector Machine (cont’d) • Nonlinear • Example: XOR problem • Natural language is nonlinear! [Figure: the XOR points plotted on axes f1 and f2.]

  22. Support Vector Machine (cont’d) • Consistent hypotheses • Maximum margin • Support vectors

  23. Statistical Learning Theory [Diagram: a Generator draws x from P(x); a Supervisor returns y according to P(y|x); the Learner receives x and outputs y* = F(x).]

  24. Support Vector Machine: Linear Discriminant Functions • Linear discriminant space • Hyperplane [Figure: axes y1 and y2, with the regions g(y)>1 and g(y)<1 on either side of the hyperplane.]

  25. Optimal Hyperplane: Learning of Support Vector Machine • Maximize margin • Minimize ||a||
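Why maximizing the margin amounts to minimizing ||a|| can be sketched as follows, assuming a linear discriminant g(y) = a^T y and the usual canonical scaling. The labels z_k in {-1, +1} and the sample index k are assumptions introduced here; they do not appear in the transcript.

```latex
% Canonical scaling: every training sample lies on or outside the margin.
z_k \, g(\mathbf{y}_k) = z_k \, \mathbf{a}^{\top} \mathbf{y}_k \ge 1,
\qquad k = 1, \dots, n

% The distance from the hyperplane g(y) = 0 to the closest sample is then
\text{margin} = \frac{1}{\lVert \mathbf{a} \rVert}

% so maximizing the margin is equivalent to
\min_{\mathbf{a}} \; \tfrac{1}{2} \lVert \mathbf{a} \rVert^{2}
\quad \text{subject to} \quad
z_k \, \mathbf{a}^{\top} \mathbf{y}_k \ge 1, \quad k = 1, \dots, n
```

The samples for which the constraint holds with equality are the support vectors.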

  26. Version Space • Hypothesis Space H • Version Space V [Figure: V drawn as a region inside H.]

  27. Support Vector Machine Active Learning • Why Support Vector Machine? • Text categorization involves large amounts of data • Traditional learning causes overfitting • Language is complex and nonlinear • Why Active Learning? • Labeling instances is time-consuming and costly • It reduces the need for labeled training instances

  28. Active Learning: History • Support Vector Machine [Vapnik, 82] • Text Classification [Rocchio, 71] [Dumais, 98] • The Nature of Statistical Learning Theory [Vapnik, 95] • Text Classification, Support Vector Machine [Joachims, 98] [Dumais, 98] • Pool-Based Active Learning [Lewis, Gale ‘94] [McCallum, Nigam ‘98] • Automated Text Categorization Using Support Vector Machine [Kwok, 98]

  29. Active Learning • Pool-based active learning has a pool U of unlabeled instances • An active learner l has three components (f, q, X) • f: classifier x -> {-1, 1} • q: querying function q(X); given the labeled training set X, it decides which instance in U to query next. • X: labeled training data.
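The three components (f, q, X) can be sketched as a loop. To keep the example short it uses a one-dimensional threshold classifier in place of an SVM, and a query rule that picks the pool instance closest to the current decision boundary. The oracle, the data, and every function name below are hypothetical.

```python
def oracle(x):
    """Hypothetical supervisor that labels an instance on request."""
    return 1 if x > 0.35 else -1


def train(labeled):
    """f: place the threshold midway between the closest opposite labels."""
    max_negative = max(x for x, y in labeled.items() if y == -1)
    min_positive = min(x for x, y in labeled.items() if y == 1)
    return (max_negative + min_positive) / 2


def query(pool, threshold):
    """q: choose the unlabeled instance closest to the current boundary."""
    return min(pool, key=lambda x: abs(x - threshold))


def active_learn(instances, labeled, oracle, budget):
    """X grows by one labeled instance per round, up to `budget` queries."""
    pool = [x for x in instances if x not in labeled]  # U: unlabeled pool
    for _ in range(budget):
        if not pool:
            break
        x = query(pool, train(labeled))
        pool.remove(x)
        labeled[x] = oracle(x)
    return train(labeled), labeled


instances = [i / 100 for i in range(101)]
seed = {0.0: oracle(0.0), 1.0: oracle(1.0)}  # one example of each class
threshold, labeled = active_learn(instances, seed, oracle, budget=10)
print(threshold, len(labeled))
```

In this toy setting the set of thresholds consistent with the labels is an interval, and each query near its midpoint roughly halves it, so a handful of labels is enough to classify all 101 instances correctly.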

  30. Active Learning (cont’d) • Main difference: the querying component q. • How to choose the next unlabeled instance to query? • Resulting version space

  31. Active Learner • Active learner l* always queries instances whose corresponding hyperplanes in parameter space W halve the area of the current version space

  32. Experiments: Bayesian Classifier

  33. Comparison of Learning Methods [Figure: precision (y-axis, 0 to 1) versus training data size (x-axis, 10 to 60) for SVM, kNN, NB, and NNet.]

  34. Conclusions • Text mining extracts knowledge from text. • Support Vector Machine is nearly the best statistics-based machine learning method • Natural language understanding is still an open problem
