Text Mining with Machine Learning Techniques. Ping-Tsun Chang, Intelligent Systems Laboratory, Computer Science and Information Engineering, National Taiwan University. Text Analysis. Summarization. Classification. Feature Selection. Language Identification. Clustering. Text Mining.
Text Mining with Machine Learning Techniques
Ping-Tsun Chang
Intelligent Systems Laboratory
Computer Science and Information Engineering
National Taiwan University
Text Analysis
• Summarization
• Classification
• Feature Selection
• Language Identification
• Clustering
Text Mining
• Text mining is about looking for patterns in natural language text
• Natural Language Processing
• It may be defined as the process of analyzing text to extract information from it for particular purposes
• Information Extraction
• Information Retrieval
Text Mining and Knowledge Management
• A recent study indicated that 80% of a company's information is contained in text documents
• Emails, memos, customer correspondence, and reports
• The ability to distil this untapped source of information provides substantial competitive advantages for a company to succeed in the era of a knowledge-based economy
Text Mining Applications
• Customer profile analysis
• Mining incoming emails for customers' complaints and feedback
• Patent analysis
• Analyzing patent databases for major technology players, trends, and opportunities
• Information dissemination
• Organizing and summarizing trade news and reports for personalized information services
• Company resource planning
• Mining a company's reports and correspondence for activities, status, and problems reported
Text Categorization: Problem Definition
• Text categorization is the problem of automatically assigning predefined categories to free-text documents
• Document classification
• Web page classification
• News classification
Information Retrieval
• Full text is hard to process, but it is a complete representation of the document
• Logical view of documents
• Models
• Boolean Model
• Vector Model
• Probabilistic Model
• Think of text as patterns?
Evaluation
[Figure: overlapping sets of Retrieved and Relevant documents, with regions labeled a, b, c, d]
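The Retrieved and Relevant sets on this slide are the usual basis for precision and recall. The sketch below assumes those two standard measures are what the slide evaluates. The slide's own a, b, c, d region labels are not mapped here because the figure did not survive, and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={3, 4, 5, 6, 7, 8})
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision asks how much of what was retrieved is relevant, and recall asks how much of what is relevant was retrieved.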
Pattern Recognition
[Figure: processing pipeline with the stages Sensing, Segmentation, Feature Extraction, Classification, Post-Processing, Decision]
Pattern Classification
[Figure: classes C1 and C2 plotted in a feature space with axes f1 and f2]
Machine Learning
• Using computers to help us perform induction from large amounts of complex pattern data
• Bayesian Learning
• Instance-Based Learning
• K-Nearest Neighbors
• Neural Networks
• Support Vector Machine
Feature Selection (I)
• Information Gain
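The slide's formula did not survive extraction. As a minimal sketch, the code below uses the standard entropy-based definition of information gain for a term: the class entropy minus the expected class entropy after splitting the documents on whether they contain the term. The function names and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of `term`; each document is a set of tokens."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


# Hypothetical toy corpus.
docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"market", "bank"}]
labels = ["sport", "sport", "finance", "finance"]
for term in ("goal", "team"):
    print(term, round(information_gain(docs, labels, term), 4))
```

Terms that split the classes cleanly, such as "goal" in this toy corpus, score highest and would be kept as features.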
Feature Selection (II)
• Mutual Information
• Chi-Square
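Both measures can be computed from a 2×2 table of document counts for one term and one category. The sketch below assumes the standard text-categorization forms (the χ² statistic, and pointwise mutual information between term and category); the slide's own formulas are not recoverable, and the argument names `n11`, `n10`, `n01`, `n00` are chosen here for illustration.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/category 2x2 table.

    n11: docs with the term, in the category
    n10: docs with the term, not in the category
    n01: docs without the term, in the category
    n00: docs without the term, not in the category
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def pointwise_mutual_information(n11, n10, n01, n00):
    """log2 of P(term, category) / (P(term) * P(category))."""
    n = n11 + n10 + n01 + n00
    if n11 == 0:
        return float("-inf")
    return math.log2(n11 * n / ((n11 + n10) * (n11 + n01)))


# A term that appears only in the category vs. a term independent of it.
print(chi_square(2, 0, 0, 2), pointwise_mutual_information(2, 0, 0, 2))
print(chi_square(1, 1, 1, 1), pointwise_mutual_information(1, 1, 1, 1))
```

Both scores are zero when term and category are independent and grow as the term becomes more indicative of the category.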
Weighting Scheme: TF‧IDF
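TF‧IDF has several variants, and the slide does not show which one is used. The sketch below assumes one common form, raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name and corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (docs are token lists)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights


# Hypothetical toy corpus.
corpus = [["text", "mining", "text"], ["text", "learning"], ["machine", "learning"]]
for vector in tf_idf(corpus):
    print({term: round(w, 3) for term, w in vector.items()})
```

A term that occurs in every document gets weight 0 under this variant, which matches the intuition that it cannot discriminate between documents.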
Similarity Evaluation
• Cosine-like scheme
[Figure: document vectors di and dj]
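Assuming the cosine-like scheme is the usual cosine of the angle between two document vectors di and dj, a minimal sketch over sparse term-weight dictionaries could look like this (the function name and example vectors are illustrative):

```python
import math


def cosine_similarity(di, dj):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * dj.get(term, 0.0) for term, w in di.items())
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_i * norm_j)


# Hypothetical weighted documents.
di = {"text": 1.0, "mining": 1.0}
dj = {"text": 1.0}
print(round(cosine_similarity(di, dj), 4))
```

Because the result is normalized by vector length, long and short documents on the same topic can still score as similar.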
Machine Learning Approaches: Bayesian Classifier
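The slide does not show which Bayesian model is meant. As one plausible reading, the sketch below implements a multinomial naive Bayes text classifier with add-one (Laplace) smoothing, a common choice for text categorization. All function names and the training data are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect class counts, per-class term counts, and the vocabulary."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
        vocab.update(doc)
    return class_counts, term_counts, vocab


def classify(model, doc):
    """Return the class with the highest log posterior for a token list."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(count / n_docs)  # log prior
        for term in doc:
            if term in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][term] + 1)
                                  / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
model = train_naive_bayes(
    [["goal", "match"], ["team", "goal"], ["stock", "market"], ["bank", "market"]],
    ["sport", "sport", "finance", "finance"],
)
print(classify(model, ["goal", "team"]), classify(model, ["market"]))
```

Working in log space avoids numerical underflow when documents contain many terms.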
Machine Learning Approaches: kNN Classifier
[Figure: a query document d whose class is unknown, marked "?"]
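A minimal sketch of the kNN idea for the query document d: rank the labeled documents by similarity to d and take a majority vote among the k nearest. Cosine similarity and k = 3 are assumptions made here, as are the function names and the data.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(train, query, k=3):
    """train is a list of (vector, label) pairs; returns the majority label."""
    ranked = sorted(train, key=lambda pair: cosine(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled documents.
train = [
    ({"goal": 1, "match": 1}, "sport"),
    ({"goal": 1, "team": 1}, "sport"),
    ({"stock": 1, "market": 1}, "finance"),
    ({"bank": 1, "market": 1}, "finance"),
]
print(knn_classify(train, {"goal": 1}))
```

kNN has no training phase; all the work happens at query time, which is why it is listed under instance-based learning.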
Machine Learning Approaches: Support Vector Machine
• Basic hypotheses: consistent hypotheses of the version space
• Project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel operator K
Compare: SVM and Traditional Learners
• Traditional learner
• SVM accesses the hypothesis space!
[Figure: three plots over the hypothesis axis showing P(h), P(h|D1), and P(h|D1 ∧ D2)]
SVM Learning in Feature Spaces
[Figure: example mapping from input space X to feature space F]
Support Vector Machine (cont’d)
• Nonlinear
• Example: XOR Problem
• Natural Language is Nonlinear!
[Figure: XOR points plotted on axes f1 and f2]
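The XOR problem shows why a mapping to a higher-dimensional feature space helps: no straight line separates the two classes in the original plane, but adding the product of the two inputs as a third feature makes them linearly separable. The sketch below uses an explicit feature map and hand-picked weights. Both are assumptions for illustration, since an SVM would use a kernel and learn the weights.

```python
def feature_map(x1, x2):
    """Map a 2-D input to a 3-D feature space by adding the product x1*x2."""
    return (x1, x2, x1 * x2)


def linear_classifier(features):
    """A linear decision rule in the 3-D feature space.

    Weights are hand-picked for illustration, not learned:
    g = x1 + x2 - 2*x1*x2 - 0.5
    """
    weights = (1.0, 1.0, -2.0)
    bias = -0.5
    g = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if g > 0 else 0


xor_data = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
for (x1, x2), label in xor_data.items():
    predicted = linear_classifier(feature_map(x1, x2))
    print((x1, x2), "->", predicted)
    assert predicted == label
```

A kernel function gives the same effect without building the feature vectors explicitly, because it computes inner products in the feature space directly from the original inputs.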
Support Vector Machine (cont’d)
• Consistent hypotheses
• Maximum margin
• Support vectors
Statistical Learning Theory
[Figure: learning model in which a Generator draws x from P(X), a Supervisor returns y according to P(y|x), and a Learner outputs y* = F(x)]
Support Vector Machine: Linear Discriminant Functions
• Linear discriminant space
• Hyperplane
[Figure: hyperplane in the (y1, y2) plane, with regions labeled g(y)>1 and g(y)<1]
Learning of Support Vector Machine
• Optimal hyperplane
• Maximize margin
• Minimize ||a||
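The link between maximizing the margin and minimizing ||a|| can be written out as follows. This is a sketch under assumptions: it takes the linear discriminant to be g(y) = aᵀy, introduces class labels z_k ∈ {−1, +1} (a symbol not on the slides), and assumes linearly separable training data y_1, …, y_n.

```latex
% Distance from a sample y_k to the hyperplane g(y) = 0:
\[
  \frac{\lvert g(y_k) \rvert}{\lVert a \rVert}
  = \frac{\lvert a^{\top} y_k \rvert}{\lVert a \rVert}
\]
% Scale a so that every sample satisfies z_k a^T y_k >= 1.
% The closest samples then lie at distance 1 / ||a||, so the
% margin is largest when ||a|| is smallest:
\[
\begin{aligned}
  \min_{a} \quad & \tfrac{1}{2}\, \lVert a \rVert^{2} \\
  \text{subject to} \quad & z_k\, a^{\top} y_k \ge 1, \qquad k = 1, \dots, n
\end{aligned}
\]
```

The samples for which the constraint holds with equality are the support vectors; they alone determine the optimal hyperplane.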
Version Space
• Hypothesis Space H
• Version Space V
[Figure: version space V drawn as a region inside hypothesis space H]
Support Vector Machine Active Learning
• Why Support Vector Machine?
• Text categorization involves large amounts of data
• Traditional learning causes over-fitting
• Language is complex and nonlinear
• Why Active Learning?
• Labeling instances is time-consuming and costly
• Reduce the need for labeled training instances
Active Learning: History
• Support Vector Machine [Vapnik, 82]
• Text Classification [Rocchio, 71] [Dumais, 98]
• The Nature of Statistical Learning Theory [Vapnik, 95]
• Text Classification with Support Vector Machine [Joachims, 98] [Dumais, 98]
• Pool-Based Active Learning [Lewis, Gale ‘94] [McCallum, Nigam ‘98]
• Automated Text Categorization Using Support Vector Machine [Kwok, 98]
Active Learning
• Pool-based active learning has a pool U of unlabeled instances
• Active learner l has three components (f, q, X)
• f: classifier x -> {-1, 1}
• q: querying function q(X); given a labeled training set X, it decides which instance in U to query next
• X: labeled training data
Active Learning (cont’d)
• Main difference: querying component q
• How to choose the next unlabeled instance to query?
• Resulting version space
Active Learner
• Active learner l* always queries instances whose corresponding hyperplanes in parameter space W halve the area of the current version space
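The halving idea is easiest to see in one dimension. The sketch below is an analogy under stated assumptions, not the SVM procedure itself: the classifier f is a single threshold on a line, the version space is the interval of thresholds consistent with the labels seen so far, and the querying function q picks the pool instance nearest the middle of that interval. The oracle, its hidden threshold 0.37, and all names are hypothetical.

```python
def oracle(x, true_threshold=0.37):
    """Hypothetical labeler: +1 at or above the hidden threshold, else -1."""
    return 1 if x >= true_threshold else -1


def active_learn(pool, query_label, n_queries):
    """Pool-based active learning for a 1-D threshold classifier.

    (low, high) is the current version space: every threshold in that
    interval is consistent with the labels gathered so far. It is assumed
    that the true threshold lies within the range of the pool.
    """
    low, high = min(pool), max(pool)
    labeled = {}  # X: the labeled training data gathered so far
    for _ in range(n_queries):
        candidates = [x for x in pool if low < x < high and x not in labeled]
        if not candidates:
            break
        mid = (low + high) / 2
        # q: query the instance that splits the version space most evenly.
        x = min(candidates, key=lambda c: abs(c - mid))
        y = query_label(x)
        labeled[x] = y
        if y == 1:
            high = x  # the threshold is at or below x
        else:
            low = x   # the threshold is above x
    return low, high, labeled


pool = [i / 100 for i in range(101)]  # the pool U of unlabeled instances
low, high, labeled = active_learn(pool, oracle, n_queries=7)
print(f"threshold lies in ({low}, {high}] after {len(labeled)} queries")
```

Each query roughly halves the interval, so the learner needs only a handful of labels where labeling the whole pool would need 101. In the SVM setting, the same role is played by choosing the instance whose hyperplane in parameter space W most evenly splits the version space.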
Experiments: Bayesian Classifier
Comparison of Learning Methods
[Figure: precision (0 to 1) versus training data size (0 to 60) for SVM, kNN, NB, and NNet]
Conclusions
• Text mining extracts knowledge from text
• Support Vector Machine is among the best statistics-based machine learning methods
• Natural language understanding is still an open problem