Text Mining Tools


Presentation Transcript


  1. Text Mining Tools 22C:196 Text Retrieval & Text Mining Seminar

  2. Tools
     • WordNet
     • MxTerminator
     • LingPipe
     • Stanford TP Tools
     • Stanford-NER
     • SVMLight
     • Rainbow Toolkit
     • Manjal

  3. WordNet
     • http://wordnet.princeton.edu/
     • English lexical database
     • Developed at Princeton Univ. by George A. Miller et al.
     • Organized as synsets
       • Cognitive synonym sets
       • Synsets for nouns, verbs, adjectives and adverbs

  4. WordNet
     • Synsets interlinked via lexical and conceptual-semantic relations
     • Network of meaningfully related concepts and words
     • Available online and can also be freely downloaded
     • Perl and Java packages available to interface with WordNet

  5. WordNet
     • WordNet 2.0 on sulu and geordi
     • Command-line interface
     • Examples
       • /usr/local/WordNet-2.0/bin/wn <w> -over
         • Provides an overview of the various senses of <w>
       • /usr/local/WordNet-2.0/bin/wn <w> -synsn
         • Lists synonyms for the noun senses of <w>
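
The two lookups above can be scripted. The following is a minimal Python sketch, assuming the `wn` binary sits at the path given on the slide; the helper names `build_wn_command` and `run_wn` are hypothetical and not part of WordNet.

```python
import os
import subprocess

# Path taken from the slide; adjust it for your own installation.
WN_BINARY = "/usr/local/WordNet-2.0/bin/wn"


def build_wn_command(word, option="-over", binary=WN_BINARY):
    """Return the argument list for one lookup, e.g. wn bank -over."""
    return [binary, word, option]


def run_wn(word, option="-over", binary=WN_BINARY):
    """Run wn and return its printed text, or None if wn is not installed."""
    if not os.path.isfile(binary):
        return None
    # The exit status is deliberately not checked; only the printed text is used.
    result = subprocess.run(build_wn_command(word, option, binary),
                            capture_output=True, text=True)
    return result.stdout


# Show the command that would be run for a noun-synonym lookup.
print(build_wn_command("bank", "-synsn"))
```

On a machine without WordNet, `run_wn` simply returns `None`, so the sketch can be tried anywhere before pointing it at a real installation.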

  6. MxTerminator
     • http://www.id.cbs.dk/~dh/corpus/tools/MXTERMINATOR.html
     • Java sentence boundary detection tool
     • Algorithm described in
       • J.C. Reynar and A. Ratnaparkhi. A Maximum Entropy Approach to Identifying Sentence Boundaries. 1997.

  7. MxTerminator
     • Installed on sulu and geordi
     • Command-line interface
     • Requires two parameters
       • Trained model directory
       • Text file to parse (read from standard input)
     • Syntax
       • /usr/local/mxterminator/mxterminator <modeldir> < <textfile>
     • Comes with a pre-trained model
       • /usr/local/mxterminator/eos.project
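
Because the text file is passed by input redirection rather than as an argument, a script has to connect it to the tool's standard input. The following is a minimal Python sketch of that step, assuming the paths from the slide and assuming the tool prints one detected sentence per output line; `split_sentences` is a hypothetical helper name.

```python
import subprocess

# Paths taken from the slide; adjust them for your own installation.
MXTERMINATOR = "/usr/local/mxterminator/mxterminator"
EOS_MODEL = "/usr/local/mxterminator/eos.project"


def split_sentences(text_path, command=(MXTERMINATOR, EOS_MODEL)):
    """Feed text_path to the command on stdin and return its output lines."""
    with open(text_path, "rb") as text_file:
        # stdin=text_file is the equivalent of "< textfile" in the shell.
        result = subprocess.run(list(command), stdin=text_file,
                                capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace").splitlines()
```

The `command` parameter is there so the same function can be pointed at `mxpost` with `tagger.project`, which reads its word file the same way.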

  8. MxTerminator
     • New models can be trained
       • trainmxterminator <projectdir> <traindata>
       • <projectdir> is the newly created model directory
       • <traindata> is training data with one sentence per line
     • Package also includes mxpost
       • Part-of-speech tagger
       • /usr/local/mxterminator/mxpost <modeldir> < <wordfile>
       • Pre-built model: /usr/local/mxterminator/tagger.project
       • <wordfile> contains words, one sentence per line

  9. LingPipe
     • http://www.alias-i.com/lingpipe/
     • Suite of Java libraries for different kinds of analyses
       • Sentence detection
       • Part-of-speech tagging
       • Named-entity extraction
       • Phrase extraction
       • Entity co-reference
       • Spell checker
       • Clustering
       • Chinese language support

  10. LingPipe
      • Also contains tools for database text mining
        • Works directly off a database such as MySQL
      • Package contains demos, tutorials, pre-trained models and javadoc
      • Widely used in the text mining community
        • Especially for general and biomedical named-entity recognition
      • Website has links to blogs and a developer discussion forum

  11. Stanford TP Tools
      • http://nlp.stanford.edu/software/index.shtml
      • Variety of text processing tools
      • Made available by the Stanford NLP group
      • All tools are implemented in Java
      • Freely downloadable

  12. Stanford TP Tools
      • Parser
      • POS tagger
      • Named entity recognizer
      • Chinese word segmenter
      • Classifier
      • Tregex and Tsurgeon
        • Matching patterns in trees

  13. Stanford-NER
      • Based on CRFs (conditional random fields)
      • Contains demo programs
      • 4 pre-built models
        • 3-class basic model trained on US and UK newswire data from CoNLL, MUC and ACE
          • Labels PERSON, ORGANIZATION and LOCATION
        • 4-class model trained on CoNLL training data
          • Additionally labels MISC
        • 2 more accurate distsim versions of the above models

  14. Stanford-NER
      • Example: advanced distsim model
        • java -mx600m -cp ./stanford-ner.jar:. stanfordNER ner-eng-ie.crf-3-all2006-distsim.ser.gz "text"
      • Example: default basic model
        • java -mx300m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -textFile sample.txt
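
The tagger's output usually needs post-processing before the entities can be used. The following is a minimal Python sketch, assuming the classifier prints slash-tagged text (word/LABEL, with O for tokens outside any entity); if your version prints another format, the parser needs adjusting. The function names and the sample sentence are made up for illustration.

```python
def parse_slash_tags(line):
    """Split 'word/LABEL' tokens into (word, label) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last slash, so words containing '/' survive.
        word, _, label = token.rpartition("/")
        pairs.append((word, label))
    return pairs


def group_entities(pairs, outside="O"):
    """Merge runs of adjacent tokens that share a label into entity spans."""
    entities = []
    current_words, current_label = [], None
    for word, label in pairs:
        if label != current_label and current_words:
            entities.append((" ".join(current_words), current_label))
            current_words = []
        current_label = label
        if label != outside:
            current_words.append(word)
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


# Illustrative line in the assumed output format (not real tool output).
sample = "Jim/PERSON works/O at/O Acme/ORGANIZATION Corp./ORGANIZATION in/O London/LOCATION"
print(group_entities(parse_slash_tags(sample)))
```

Grouping by adjacent identical labels is a simplification: two distinct entities of the same class that stand next to each other would be merged into one span.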

  15. SVMLight
      • http://svmlight.joachims.org/
      • C support-vector-machine implementation by Thorsten Joachims
      • Does classification, regression and ranking
      • Many other functions
        • Estimates error rate, precision and recall directly
      • Freely downloadable
      • Instructions on website

  16. SVMLight
      • Contains 2 main executable files
        • svm_learn (learns a model from the training set)
        • svm_classify (classifies the test set)
      • Input file contains weighted term vectors
        • Strategy: index doc files using Lucene or SMART and obtain term vectors
        • Example (one document per line: class label, then feature:weight pairs)
          -1 1:0.43 3:0.12 9284:0.2
          +1 1:0.20 3:0.14 9284:0.97
      • Use different kernel functions
        • Support for linear and non-linear kernels
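
Once term weights have been pulled out of an index, they have to be written in this input format. The following is a minimal Python sketch, assuming each document is already available as a dictionary from integer term id to weight; how those weights are obtained from Lucene or SMART is not shown, and the function names are hypothetical.

```python
def to_svmlight_line(label, weights):
    """Format one document as '<label> <id>:<weight> ...' with ascending ids."""
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1 for classification")
    target = "+1" if label == 1 else "-1"
    # Feature ids are written in increasing order; zero weights are left out.
    features = " ".join(f"{fid}:{w:g}"
                        for fid, w in sorted(weights.items()) if w != 0)
    return f"{target} {features}".rstrip()


def write_svmlight(path, examples):
    """Write (label, weights) pairs to path, one document per line."""
    with open(path, "w") as out:
        for label, weights in examples:
            out.write(to_svmlight_line(label, weights) + "\n")


# The two documents from the slide, given as unordered dictionaries.
print(to_svmlight_line(-1, {3: 0.12, 1: 0.43, 9284: 0.2}))
print(to_svmlight_line(+1, {9284: 0.97, 1: 0.20, 3: 0.14}))
```

Leaving out zero weights keeps the file sparse, which matters when the vocabulary has thousands of terms and each document uses only a few of them.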

  17. SVMLight
      • Syntax
        • svm_learn [options] example_file model_file
        • svm_classify [options] example_file model_file output_file
      • Example data included in distribution
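
After svm_classify has run, the output file can be compared against the labels in the example file. The following is a minimal Python sketch, assuming the output file holds one decision value per line whose sign gives the predicted class; the function names are hypothetical.

```python
def read_labels(example_path):
    """Read the class label (first field) of every example line."""
    labels = []
    with open(example_path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            labels.append(1 if float(line.split()[0]) > 0 else -1)
    return labels


def read_predictions(output_path):
    """Turn each decision value into a predicted class by its sign."""
    with open(output_path) as handle:
        return [1 if float(line) > 0 else -1 for line in handle if line.strip()]


def evaluate(labels, predictions):
    """Return accuracy, precision and recall for the positive class."""
    if len(labels) != len(predictions):
        raise ValueError("label and prediction counts differ")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == -1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == -1)
    correct = sum(1 for y, p in pairs if y == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

A typical call would be `evaluate(read_labels("test.dat"), read_predictions("predictions"))`, where both file names are placeholders for the example_file and output_file passed to svm_classify.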

  18. Rainbow Toolkit
      • http://www.cs.cmu.edu/~mccallum/bow/rainbow/
      • Part of the Bow toolkit
        • http://www.cs.cmu.edu/~mccallum/bow/
      • Text classification tool
      • Supports 4 classification methods
        • Naïve Bayes (default)
        • TFIDF/Rocchio
        • K-nearest neighbor
        • Probabilistic indexing

  19. Rainbow Toolkit
      • Building a model
        • rainbow -d ./model --index <modeldir> --use-stemming --skip-html
        • <modeldir> contains individual folders (with text files) for each class
        • Model is stored in ./model
      • Test model
        • rainbow -d ~/model --test-set=0.4 --test=3
        • Train-test split is 0.6/0.4; 3 iterations
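
The indexing step expects the training documents to be laid out as one folder per class. The following is a minimal Python sketch that creates such a layout from labeled texts; the function name `write_corpus` and the class and file names in the usage note are hypothetical, and running rainbow itself is left out.

```python
from pathlib import Path


def write_corpus(root, documents):
    """Write (class_name, file_name, text) triples as root/class_name/file_name.

    Returns the sorted list of class folder names found under root.
    """
    root = Path(root)
    for class_name, file_name, text in documents:
        class_dir = root / class_name
        # One folder per class; created on first use.
        class_dir.mkdir(parents=True, exist_ok=True)
        (class_dir / file_name).write_text(text, encoding="utf-8")
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

For example, `write_corpus("corpus", [("sports", "a.txt", "..."), ("politics", "b.txt", "...")])` produces `corpus/sports/a.txt` and `corpus/politics/b.txt`, and `corpus` can then play the role of `<modeldir>` in the build command.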

  20. Rainbow Toolkit
      • Test model
        • rainbow -d ~/model --test-set=0.5 --test=1
          • Specify test set
          • Half chosen randomly
        • rainbow -d ~/model --test-files <testdir>
          • Classify previously unseen files in <testdir>

  21. Rainbow Toolkit
      • Formatted output
        • rainbow-stats
      • Example
        • rainbow -d ./model --test-set=0.4 --test=2 | rainbow-stats
        • Confusion matrix, percent accuracy, std. error
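
To make the three reported quantities concrete, the following is a minimal Python sketch that computes them from (actual, predicted) class pairs, one list per trial. It is not the rainbow-stats implementation; in particular, the standard error here is the sample standard deviation of the per-trial accuracies divided by the square root of the number of trials, which is an assumption about what the tool reports.

```python
import math
from collections import Counter


def confusion_matrix(pairs, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(pairs)
    return [[counts[(actual, predicted)] for predicted in classes]
            for actual in classes]


def accuracy(pairs):
    """Fraction of pairs whose predicted class equals the actual class."""
    return sum(1 for actual, predicted in pairs if actual == predicted) / len(pairs)


def mean_and_stderr(values):
    """Mean of values and the standard error of that mean."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Two made-up trials over classes "a" and "b" (illustrative data only).
trial_1 = [("a", "a"), ("a", "b"), ("b", "b"), ("b", "b")]
trial_2 = [("a", "a"), ("b", "a"), ("b", "a"), ("b", "b")]
print(confusion_matrix(trial_1, ["a", "b"]))
print(mean_and_stderr([accuracy(trial_1), accuracy(trial_2)]))
```

With `--test=2` there are only two trials, so the standard error is a rough indication at best; more trials give a steadier estimate.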

  22. Manjal
      • Online demo