Text Mining: the technology to convert text into knowledge. Stan Matwin, School of Information Technology and Engineering, University of Ottawa, Canada, stan@site.uottawa.ca
Plan • What? • Why? • How? • Who? codata 2002
What? • Text Mining (TM) = Data Mining from textual data • Finding nuggets in otherwise uninteresting mountains of ore • DM = finding interesting knowledge (relationships, facts) in large amounts of data
What? (cont'd) • Working with large corpora • …and little knowledge • Discovering new knowledge • …e.g. in Grimm's fairy tales • vs. uncovering existing knowledge • …e.g. finding MySQL developers with 1 year of experience in a file of 5000 CVs • Has to treat data as natural language (NL)
What? (cont'd) • The uncovering aspect of TM • TM = Information Extraction from text • Text -> database mapping • TM and XML
Examples • Extracting information from CVs: skills, systems, technologies, etc. • Personal news filtering agent • Research in functional genomics about protein interaction
Why? • Moore’s law, and… • Storage law
How? A combination of • Machine learning • Linguistic analysis • Stemming • Tagging • Parsing • Semantic analysis
Some TM-related tasks • Text segmentation • Topic identification and tracking • Text summarization • Language identification • Author identification
Two case studies • CADERIGE • Spam detection (with AmikaNow)
Caderige • « Catégorisation Automatique de Documents pour l'Extraction de Réseaux d'Interactions Géniques » (Automatic Categorization of Documents for the Extraction of Gene Interaction Networks) • Knowledge extraction from natural language texts
Caderige • Objective: to extract information of interest to geneticists from on-line abstract and/or paper databases (e.g. Medline) • Ensure acceptable recall and precision
"The araR gene is monocistronic, and the promoter region contains -10 and -35 regions (as determined by primer extension analysis) similar to those recognized by RNA polymerase containing the major vegetative cell sigma factor sigmaA. An insertion-deletion mutation in the araR gene leads to constitutive expression of the L-arabinose metabolic operon. We demonstrate that the araR gene codes for a negative regulator of the ara operon and that the expression of araR is repressed by its own product." • The fragment (set in italics on the original slide) can be selected by means of keywords
"It has been proposed that Pho-P plays a key role in the activation of tuA and in the repression of tagA and tagD." • Query: "What are the proteins involved in the regulation of tagA?" • This question cannot be answered with keywords alone; semantic knowledge that repression is a type of regulation is required
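The gap between keyword matching and semantic matching can be sketched in a few lines of Python. This is only an illustration: the tiny is-a thesaurus `IS_A`, the stem table `STEMS` and both helper functions are hypothetical and are not CADERIGE's actual resources.

```python
# Hypothetical mini-thesaurus: each concept maps to its more general concept.
IS_A = {
    "repression": "regulation",
    "activation": "regulation",
}

# Hypothetical surface stems used to spot a concept in text.
STEMS = {
    "regulation": ["regulat"],
    "repression": ["repress"],
    "activation": ["activat"],
}


def narrower_concepts(concept):
    """Return the concept plus everything that is (transitively) a kind of it."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in IS_A.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def matches(sentence, concept, expand):
    """True if the sentence mentions the concept, optionally via narrower concepts."""
    concepts = narrower_concepts(concept) if expand else {concept}
    text = sentence.lower()
    return any(stem in text for c in concepts for stem in STEMS[c])


sentence = ("It has been proposed that Pho-P plays a key role in the activation "
            "of tuA and in the repression of tagA and tagD.")

keyword_only = matches(sentence, "regulation", expand=False)
with_thesaurus = matches(sentence, "regulation", expand=True)
print(keyword_only, with_thesaurus)
```

The sketch only decides whether the sentence is relevant to a query about regulation. Working out which protein regulates which gene still needs the parsing and conceptual interpretation discussed on the following slides.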
"After determination of the nucleotide sequence and deduction of the purR reading frame, the PurR product was found to be highly similar to the purR-encoded repressor from Bacillus subtilis." • This does not answer "What are the proteins involved in the regulation of purR?" • In fact, parsing is needed to see that PurR and the purR-encoded repressor are objects of the verb "to be similar"
"RNA isolated from a sigma B deletion mutant revealed that the transcription of gspA is sigmaB dependent." • Conceptual interpretation is needed to see that this is an answer to "What are the proteins involved in the regulation of gspA?" • "gspA is sigmaB dependent" is interpreted as "protein sigmaB regulates gspA"
CADERIGE Architecture • [Diagram not recoverable. Surviving labels: MedLine abstracts; fragment selection by index; fragment selectors; linguistic normalization; conceptual normalization; extraction using grammars; forms matching; query; text labeling; acquisition of linguistic resources; resources: extraction grammars, thesaurus, linguistic resources]
3 steps • 1. Focusing: learned filters • 2. Linguistic analysis: lexical, syntactic, semantic • Syntax-semantics mapping • 3. Extraction
Caderige: example
Current stage • Step 1 done • XML for step 3 designed • Tools for step 2 chosen
Email filters • Spam elimination • Automatic filing • Compliance enforcement • …
Email… • The trick: cast it as a text classification problem • Build a training set • Train your favourite classifier • Deploy it
State of the art • Current accuracy 80%
Difficulties • Multi-class problem where • classes overlap • and are hierarchical • Recall vs precision
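The recall vs precision trade-off is easier to discuss with the two measures written out. The following is a minimal sketch for a single class; the function name and the message ids are invented for illustration.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for one class from two sets of document ids."""
    true_positives = len(predicted & actual)
    # Precision: what fraction of the flagged documents really belong to the class.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the documents in the class were flagged.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: ids of messages flagged as spam vs. messages that are spam.
flagged = {1, 2, 3, 4}
spam = {2, 3, 4, 5, 6, 7}
p, r = precision_recall(flagged, spam)
print(f"precision={p:.2f} recall={r:.2f}")
```

A filter that flags fewer messages tends to raise precision (fewer legitimate messages lost) at the cost of recall (more spam let through), which is why the two have to be balanced for each application.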
TM: who – academically? • David Lewis • Yiming Yang – CMU • Ray Mooney – UT Austin • Nick Cercone – Waterloo • Guy Lapalme – U. de Montréal • TAMALE – University of Ottawa
Who – industrially? • Google • ClearForest • AmikaNow
Conclusion • Text mining – a necessity (so “!” instead of “?”) • Still in its infancy • Methods must exploit linguistic knowledge
Classification • Prevalent practice: examples are represented as vectors of attribute values • Theoretical wisdom, confirmed empirically: the more examples, the better the predictive accuracy
ML/DM at U of O • Learning from imbalanced classes: applications in remote sensing • A relational, rather than propositional, representation: learning the maintainability concept • Learning in the presence of background knowledge: Bayesian belief networks and how to get them; application to distributed databases
Why text classification? • Automatic file saving • Internet filters • Recommenders • Information extraction • …
Text classification: standard approach (bag of words) • Remove stop words and markings • Remaining words are all attributes • A document becomes a vector of <word, frequency> pairs • Train a boolean classifier for each class • Evaluate the results on an unseen sample
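The first three steps of the bag-of-words approach can be sketched as follows. The stop-word list, the tag-stripping rule and the two sample documents are assumptions chosen for illustration; a real system would use a much longer stop-word list and a proper markup parser.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "that", "by"}


def bag_of_words(text):
    """Strip tag-like markings, lowercase, drop stop words, count remaining words."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove markings such as HTML tags
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def to_vector(bag, vocabulary):
    """Map a bag of words onto a fixed, ordered vocabulary (missing words count 0)."""
    return [bag[word] for word in vocabulary]


docs = [
    "<b>The araR gene</b> is a regulator of the ara operon",
    "Expression of the gene is repressed by the gene product",
]
bags = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))           # every remaining word is an attribute
vectors = [to_vector(b, vocabulary) for b in bags]

print(vocabulary)
for v in vectors:
    print(v)
```

Each document ends up as one row of word frequencies over a shared vocabulary, which is the input format expected by the per-class boolean classifiers mentioned on the slide.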
Text classification: tools • RIPPER: a rule-based learner; works well with large sets of binary features • Naïve Bayes: efficient (no search); simple to program; gives a “degree of belief”
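To show why Naïve Bayes is described as simple to program, here is a minimal sketch of a multinomial variant with add-one smoothing. The class name, its methods and the toy training messages are all invented for illustration; this is not the classifier used in the talk.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        # Training is pure counting: no search over hypotheses is involved.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for words, label in zip(documents, labels):
            self.word_counts[label].update(words)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_proba(self, words):
        """Return a dict class -> posterior probability (the "degree of belief")."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for w in words:
                if w in self.vocabulary:          # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long documents.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


# Toy training set (invented for illustration).
train = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = NaiveBayes().fit([d for d, _ in train], [l for _, l in train])
belief = model.predict_proba("buy cheap pills".split())
print(max(belief, key=belief.get), belief)
```

The returned probabilities can be thresholded, which gives a direct handle on the recall vs precision trade-off: a spam filter might only discard a message when its belief in "spam" is very high.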
“Prior art” • Yang: best results using k-NN: 82.3% microaveraged accuracy • Joachims's results using Support Vector Machines + unlabelled data • SVM is insensitive to high dimensionality and sparseness of examples
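"Microaveraged" means that per-category counts are pooled before the measure is computed, so frequent categories dominate the result. The sketch below contrasts this with macro-averaging, using precision as the per-category measure; the category names and counts are invented, and this is not the evaluation code behind the figure quoted above.

```python
def micro_macro_precision(counts):
    """counts maps category -> (true_positives, false_positives)."""
    tp_total = sum(tp for tp, _ in counts.values())
    fp_total = sum(fp for _, fp in counts.values())
    # Micro-average: pool the decisions of all categories, then compute once.
    micro = tp_total / (tp_total + fp_total)
    # Macro-average: compute per category, then take the unweighted mean.
    macro = sum(tp / (tp + fp) for tp, fp in counts.values()) / len(counts)
    return micro, macro


# Hypothetical counts: one large, easy category and one small, hard category.
counts = {"large_category": (90, 10), "small_category": (5, 5)}
micro, macro = micro_macro_precision(counts)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

With these counts the micro-average stays close to the large category's score while the macro-average is pulled down by the small one, so a reported micro-averaged figure says little about performance on rare categories.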
SVM in text classification • SVM • Training with 17 examples in the 10 most frequent categories gives test performance of 60% on 3000+ test cases available during training • Transductive SVM • Maximum separation margin for the test set
Combining classifiers • Comparable to best known results (Yang)