
Mallet

MALLET: MAchine Learning for LanguagE Toolkit. Outline: About MALLET, Representing Data, Command Line Processing, Simple Evaluation, Conclusion.


Presentation Transcript


  1. MALLET: MAchine Learning for LanguagE Toolkit

  2. Outline • About MALLET • Representing Data • Command Line Processing • Simple Evaluation • Conclusion

  3. Outline • About MALLET • Representing Data • Command Line Processing • Simple Evaluation • Conclusion

  4. About MALLET • "MALLET: A Machine Learning for Language Toolkit." • Written by Andrew McCallum • http://mallet.cs.umass.edu, 2002 • Implemented in Java; current version 2.0.6 • Motivation: • Text classification and information extraction • Commercial machine learning • Analysis and indexing of academic publications

  5. About MALLET • Main idea • Text focus: data is treated as discrete rather than continuous, even when values could be continuous • How to use it • Command-line scripts: • bin/mallet [command] --[option] [value] … • Text User Interface ("tui") classes • Direct Java API • http://mallet.cs.umass.edu/api

  6. Outline • About MALLET • Representing Data • Command Line Processing • Simple Evaluation • Conclusion

  7. Representations • Transform text documents into vectors x1, x2, … • The elements of a vector are called feature values • Example: "The feature at row 345 is the number of times 'dog' appears in the document" • Retain the meaning of vector indices
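
The idea on this slide can be sketched in a few lines of Python. This is only an illustration of the representation, not MALLET's own code: the function names `build_alphabet` and `to_vector` and the two sample documents are assumptions made for the example. The point is that the token-to-index mapping is kept, so every vector position still has a meaning.

```python
from collections import Counter

def build_alphabet(documents):
    """Assign each distinct token a fixed integer index, in first-seen order."""
    alphabet = {}
    for doc in documents:
        for token in doc.lower().split():
            alphabet.setdefault(token, len(alphabet))
    return alphabet

def to_vector(doc, alphabet):
    """Turn one document into a vector of token counts indexed by the alphabet."""
    vector = [0] * len(alphabet)
    for token, count in Counter(doc.lower().split()).items():
        if token in alphabet:  # tokens outside the alphabet are ignored
            vector[alphabet[token]] = count
    return vector

# Hypothetical sample documents, used only for illustration.
docs = ["the dog chased the cat", "a dog barked"]
alphabet = build_alphabet(docs)
vectors = [to_vector(doc, alphabet) for doc in docs]
print(alphabet["dog"], vectors)
```

Because the alphabet is retained, `alphabet["dog"]` tells you which row of every vector holds the count for "dog".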

  8. Documents to Vectors

  9. Documents to Vectors

  10. Documents to Vectors

  11. Documents to Vectors

  12. Documents to Vectors

  13. Instances

  14. Instances

  15. Instances

  16. Outline • About MALLET • Representing Data • Command Line Processing • Developing with MALLET • Conclusion

  17. Command Line • Importing Data • Classification • Sequence Tagging • Topic Modeling

  18. Importing Data • One instance per file • Files in the folders: sample-data/web/en or sample-data/web/de • Command line: bin/mallet import-dir --input sample-data/web/* --output web.mallet • One file, one instance per line • File format: [URL] [language] [text of the page...] • Command line: bin/mallet import-file --input /data/web/data.txt --output web.mallet
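
To make the one-instance-per-line format concrete, here is a minimal Python sketch that splits such a line into its three parts. It assumes the fields are separated by whitespace in the order shown on the slide (URL, language, text); the function name `parse_instance_line` and the sample line are hypothetical, and this is not how MALLET itself parses the file.

```python
def parse_instance_line(line):
    """Split '[URL] [language] [text...]' into name, label, and data."""
    parts = line.strip().split(None, 2)  # split on whitespace, at most 3 fields
    if len(parts) < 3:
        raise ValueError("expected: name label text, got: %r" % line)
    name, label, data = parts
    return {"name": name, "label": label, "data": data}

# Hypothetical input line, used only for illustration.
instance = parse_instance_line("http://example.org/page1 en Hello world, this is the page text\n")
print(instance["label"], "|", instance["data"])
```

The first field serves as the instance name, the second as its label, and everything after that as the data to be converted into features.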

  19. Classification • Training a classifier: bin/mallet train-classifier --input training.mallet --output-classifier my.classifier • Choosing an algorithm • MaxEnt, NaiveBayes, C45, DecisionTree and many others: bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt • Evaluation • Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances: bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
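
What a random 90%/10% split does can be sketched as follows. This is an illustrative Python version under stated assumptions, not MALLET's implementation: the function name `split_instances`, the fixed seed, and the rounding rule are choices made for the example.

```python
import random

def split_instances(instances, training_portion=0.9, seed=0):
    """Shuffle the instances and cut them into a training and a testing list."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]

# 100 placeholder instances, used only for illustration.
train, test = split_instances(range(100), training_portion=0.9)
print(len(train), len(test))
```

Every instance lands in exactly one of the two lists, so the classifier is never evaluated on data it was trained on.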

  20. Sequence Tagging • Sequence algorithms • hidden Markov models (HMMs) • linear chain conditional random fields (CRFs). • SimpleTagger • a command line interface to the MALLET Conditional Random Field (CRF) class

  21. SimpleTagger • Input file, one token per line: feature1 feature2 ... featuren label • Example lines: Bill CAPITALIZED noun | slept non-noun | here LOWERCASE STOPWORD non-noun • Train a CRF • An input file "sample" • A trained CRF in the file "nouncrf": java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
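
The input layout described on this slide (features first, label last, one token per line) can be read with a short Python sketch. It is an illustration only: the function name `read_sequences` is hypothetical, and treating a blank line as a sequence boundary is an assumption of this example.

```python
def read_sequences(lines):
    """Parse tagger-style lines into sequences of (features, label) pairs."""
    sequences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:  # assumed convention: a blank line ends a sequence
            if current:
                sequences.append(current)
                current = []
            continue
        *features, label = fields  # the last field is the label
        current.append((features, label))
    if current:
        sequences.append(current)
    return sequences

# The three example lines from the slide.
sample = ["Bill CAPITALIZED noun", "slept non-noun", "here LOWERCASE STOPWORD non-noun"]
sequences = read_sequences(sample)
print(sequences[0][0])
```

Each token can carry a different number of features; only the position of the label (last on the line) is fixed.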

  22. SimpleTagger • A file "stest" that needs to be labeled: CAPITAL Al | slept | here • Label the input: java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest • Output: Number of predicates: 5 | noun CAPITAL Al | non-noun slept | non-noun here

  23. Topic Modeling • Building topic models: bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz • --input [FILE] • --num-topics [NUMBER] The number of topics to use. The best number depends on what you are looking for in the model. • --num-iterations [NUMBER] The number of sampling iterations should be a trade-off between the time taken to complete sampling and the quality of the topic model. • --output-state [FILENAME] This option outputs a compressed text file containing the words in the corpus with their topic assignments.

  24. Demo

  25. Outline • About MALLET • Representing Data • Command Line Processing • Simple Evaluation • Conclusion

  26. Methodology • Focus on the sequence tagging module in MALLET • CRF-based implementation • Some scripts written for importing data and evaluating results • Small corpora collected from the web • Each divided into two parts: 80% for training, 20% for testing • Evaluate both POS tagging and Named Entity Recognition (NER) • Training performance • Accuracy (POS tagging) and Precision, Recall and FB1 (NER) • All scripts, corpora and results can be found here • http://mallet-eval.googlecode.com
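
The measures named on this slide can be computed as in the Python sketch below. It is not taken from the evaluation scripts mentioned above: the function names, the sample tags, and the representation of an entity as a (start, end, type) tuple are assumptions made for illustration. FB1 is taken here to mean the F-measure with beta = 1, the harmonic mean of precision and recall.

```python
def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def precision_recall_f1(gold_entities, predicted_entities):
    """Entity-level scores; an entity counts only if it matches exactly."""
    gold, pred = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted outputs, used only for illustration.
pos_accuracy = accuracy(["NN", "VBD", "RB"], ["NN", "VBD", "NN"])
scores = precision_recall_f1({(0, 3, "Ni"), (5, 6, "Nh")}, {(0, 3, "Ni"), (7, 8, "Ns")})
print(pos_accuracy, scores)
```

Accuracy is computed per token, which suits POS tagging, while precision, recall and F1 are computed per entity, which suits NER where most tokens are not part of any entity.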

  27. A Survey of Named Entity Corpora • Well-known named entity corpora • Language-Independent Named Entity Recognition at CoNLL-2003 • A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1) • Free and public, but needs the RCV1 raw texts as input • Message Understanding Conference (MUC) 6/7 • Not free • Automatic Content Extraction (ACE) Training Corpus • Not free • Other special-purpose corpora • Enron Email Dataset • Email messages in this corpus are tagged with person names, dates and times • A variety of biomedical corpora • Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names

  28. Small Corpora • Two small corpora collected from the web • Penn Treebank Sample • English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995 • Raw, tagged, parsed and combined data from the Wall Street Journal • 148120 tokens, 36 standard Treebank POS tags • http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/ • HIT CIR LTP Corpora Sample • Integrated Chinese NER corpora • 10% of the whole corpora (open to the public) • 23751 tokens, 7 kinds of named entities • http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

  29. Environment • Hardware • CPU: Q8300 Quad Core 2.50 GHz • Memory: 3GB • Software • Fedora 13 x86_64 • Java 1.6.0_18 • MALLET 2.0.6

  30. Data Format and Labels • Data Format • One token per row, one feature per column: Bill noun | slept non-noun | Here non-noun • Labels • Standard Treebank POS tags • CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun, singular or mass | NNS Noun, plural … (36 tags in all) • HIT Named Entity • O not an NE | S- a single-token NE | B- beginning of an NE | I- middle of an NE | E- end of an NE • Nm numeral | Ni organization name | Ns place name | Nh person name | Nt time | Nr date | Nz proper noun • Example: 美国 B-Ni 洛杉矶 I-Ni 警察局 E-Ni (United States, Los Angeles, Police Department: one organization name spanning three tokens)
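
The position prefixes in the HIT named entity labels (B-, I-, E-, S- and O) can be turned back into whole entities with a small decoder. The Python sketch below is an illustration under stated assumptions, not part of MALLET or the LTP tools: the function name `decode_entities` is hypothetical, and inconsistent tag sequences are simply dropped.

```python
def decode_entities(tokens, tags):
    """Group tokens into (type, tokens) entities from B-/I-/E-/S-/O tags."""
    entities = []
    start, kind = None, None  # start index and type of the entity being built
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":  # single-token entity
            entities.append((label, tokens[i:i + 1]))
            start = None
        elif prefix == "B":  # a new entity begins here
            start, kind = i, label
        elif prefix == "I" and start is not None and label == kind:
            pass  # still inside the current entity
        elif prefix == "E" and start is not None and label == kind:
            entities.append((label, tokens[start:i + 1]))
            start = None
        else:  # "O" or an inconsistent tag: discard any partial entity
            start = None
    return entities

# The example from the slide: one organization name spanning three tokens.
entities = decode_entities(["美国", "洛杉矶", "警察局"], ["B-Ni", "I-Ni", "E-Ni"])
print(entities)
```

Decoding to whole entities like this is what makes entity-level precision and recall possible, since a prediction is only counted as correct when the full span and its type both match.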

  31. Evaluation Tasks Stages

  32. DEMO

  33. Q&A
