This guide explores the process of classifying parts of speech (POS) for unknown words using machine learning classifiers. It provides a step-by-step roadmap for feature extraction, including templates for generating features based on character patterns. The document covers supervised, unsupervised, and semi-supervised learning, discussing various classifiers such as Naïve Bayes, MaxEnt, and neural networks. Key concepts include model evaluation, the choice of representation, and inductive bias. By understanding these elements, one can effectively predict POS distributions for new inputs.
Roadmap
• Open questions?
• Quick review of classification
• Feature templates
Classification Problem Steps
• Input processing:
  • Split data into training/dev/test
  • Convert data into a feature representation, aka an Attribute Value Matrix (small example below)
• Training
• Testing
• Evaluation
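For concreteness, a tiny hypothetical Attribute Value Matrix for the two example words used later (the feature names and labels are invented for illustration):

    instance      label  suffix3=ate  suffix2=ly  length
    unfrobulate   V      1            0           11
    turduckenly   Adv    0            1           11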
Feature templates
• Problem: predict the POS tag distribution of an unknown word
  • Input: “unfrobulate”
  • Input: “turduckenly”
• Features might include:
  • Last three characters are “ate”
  • Last two characters are “ly”
• Feature templates generate features given an input
  • Template: Last three characters == XXX
  • Plug in XXX to get a binary-valued feature
  • Templates generate many features (sketch below)
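A minimal sketch of template expansion in Java (Mallet's implementation language; the class and method names here are invented for illustration and are not Mallet API):

    import java.util.ArrayList;
    import java.util.List;

    public class SuffixTemplates {
        // Expand the template "last N characters == XXX" for N = 1..3
        // into binary string-valued features for one input word.
        static List<String> expand(String word) {
            List<String> features = new ArrayList<>();
            for (int n = 1; n <= 3 && n <= word.length(); n++) {
                features.add("suffix" + n + "=" + word.substring(word.length() - n));
            }
            return features;
        }

        public static void main(String[] args) {
            // [suffix1=e, suffix2=te, suffix3=ate]
            System.out.println(expand("unfrobulate"));
            // [suffix1=y, suffix2=ly, suffix3=nly]
            System.out.println(expand("turduckenly"));
        }
    }

Each instantiated suffix becomes its own binary feature, which is why one template generates many features.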
Classifiers
• Wide variety
• Differ on several dimensions:
  • Supervision
  • Learning function
  • Input features
Supervision in Classifiers
• Supervised:
  • True label/class of each training instance is provided to the learner at training time
  • Naïve Bayes, MaxEnt, decision trees, neural nets, etc.
• Unsupervised:
  • No true labels are provided for examples during training
  • Clustering: k-means; min-cut algorithms
• Semi-supervised (bootstrapping):
  • True labels are provided for only a subset of examples
  • Co-training, semi-supervised SVM/CRF, etc.
Inductive Bias
• What form of function is learned?
• Function that separates members of different classes:
  • Linear separator
  • Higher-order functions
  • Voronoi diagrams, etc.
• Graphically, a decision boundary
[figure: decision boundary between + and − examples]
Machine Learning Functions
• Problem: Can the representation effectively model the class to be learned?
• Motivates selection of the learning algorithm
  • For this function, a linear discriminant is GREAT!
  • Rectangular boundaries (e.g. ID trees) are TERRIBLE!
  • Pick the right representation!
[figure: + and − examples cleanly separable by a linear boundary]
Machine Learning Features
• Inputs:
  • E.g. words, acoustic measurements, parts of speech, syntactic structures, semantic classes, …
• Vectors of features:
  • E.g. word: letters
  • ‘cat’: L1 = c; L2 = a; L3 = t
  • Parts of syntax trees?
Machine Learning Features
• Questions:
  • Which features and values should be used?
  • How should they relate to each other?
• Issue 1: What values should they take?
  • Binary features: don’t do anything!
  • Real-valued features *may* need to be normalized
    • Can force the values to have 0 mean and unit variance
    • Compute the mean and variance of each real-valued feature on the training set
    • Replace the original value x with (x − mean) / stddev (sketch below)
  • Can also bin them or binarize them; often this works better
• Issue 2: Which ones are important?
  • Feature selection is sometimes important
  • Current approach
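A minimal sketch of that zero-mean / unit-variance normalization in Java (illustrative, not Mallet code):

    public class ZScore {
        // Compute mean and standard deviation of one real-valued feature
        // over the training set, then map each value x to (x - mean) / std.
        public static double[] normalize(double[] train) {
            double mean = 0.0;
            for (double x : train) mean += x;
            mean /= train.length;

            double var = 0.0;
            for (double x : train) var += (x - mean) * (x - mean);
            double std = Math.sqrt(var / train.length);

            double[] out = new double[train.length];
            for (int i = 0; i < train.length; i++) {
                // Test-time values must be scaled with the training mean/std,
                // not with statistics recomputed on the test set.
                out[i] = (train[i] - mean) / std;
            }
            return out;
        }
    }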
Machine Learning Toolkits
• Many learners, many tools/implementations
• Some broad tool sets:
  • Weka
    • Java, lots of classifiers, pedagogically oriented
  • Mallet
    • Java, classifiers, sequence learners
    • More heavy-duty
Mallet
• Machine learning toolkit
• Developed at UMass Amherst by Andrew McCallum
• Java implementation, open source
• Large collection of machine learning algorithms
  • Targeted to language processing
  • Naïve Bayes, MaxEnt, decision trees, Winnow, boosting
  • Also clustering, topic models, sequence learners
• Widely used, but
  • Research software: some bugs/gaps; odd documentation
Installation
• Installed on patas
  • /NLP_TOOLS/tool_sets/mallet/latest/
• Directories:
  • bin/: script files
  • src/: Java source code
  • class/: Java classes
  • lib/: jar files
  • sample-data/: Wikipedia docs for language id, etc.
Environment
• Should be set up on patas
• $PATH should include
  • /NLP_TOOLS/tool_sets/mallet/latest/bin
• $CLASSPATH should include
  • /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet-deps.jar:/NLP_TOOLS/tool_sets/mallet/latest/lib/mallet.jar
• Check:
  • which text2vectors
  • /NLP_TOOLS/tool_sets/mallet/latest/bin/text2vectors
Mallet Commands
• Mallet command types:
  • Data preparation
  • Data/model inspection
  • Training
  • Classification
• Command-line scripts
  • Shell scripts
    • Set up the Java environment
    • Invoke Java programs
  • --help lists the command-line parameters for the scripts (example below)
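For instance, any of the wrapper scripts prints its options when asked (output omitted here):

    bin/mallet import-file --help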
Mallet Data
• Mallet data instances:
  • instance_id label f1 v1 f2 v2 …
• Stored in internal binary format: “vectors”
  • Binary format used by learners, decoders
  • Need to convert text files to binary format
Data Preparation
• Built-in data importers
• One class per directory, one instance per file:
  • bin/mallet import-dir --input IF --output OF
  • Label is the directory name
  • (Also text2vectors)
• One instance per line:
  • bin/mallet import-file --input IF --output OF
  • Line: instance label text …
  • (Also csv2vectors)
• Both create a binary representation of text feature counts (example below)
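As a hypothetical one-instance-per-line example (file name and contents invented), a file langid.txt:

    doc1 en this is an english sentence
    doc2 de das ist ein deutscher satz

could be imported with:

    bin/mallet import-file --input langid.txt --output langid.vectors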
Data Preparation
• bin/mallet import-svmlight --input IF --output OF
  • Allows import of user-constructed feature-value pairs
  • Format:
    • label f1:v1 f2:v2 … fn:vn
    • Features can be strings or indexes
  • (Also bin/svmlight2vectors)
• If building test data separately from the original:
  • bin/mallet import-svmlight --input IF --output OF --use-pipe-from previously_built.vectors
  • Ensures a consistent feature representation (example below)
• Note: can’t mix svmlight models with others
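A hypothetical svmlight-format training file, train.svmlight (labels and feature names invented for illustration):

    V suffix3=ate:1 length:11
    Adv suffix2=ly:1 length:11

imported with:

    bin/mallet import-svmlight --input train.svmlight --output train.vectors

and a separately built test set, reusing the training pipe so the feature mapping stays consistent:

    bin/mallet import-svmlight --input test.svmlight --output test.vectors --use-pipe-from train.vectors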
Accessing Binary Formats
• vectors2info --input IF
  • --print-labels TRUE
    • Prints the list of category labels in the data set
  • --print-matrix sic
    • Prints all features and values, by string and number
    • Returns the original text feature-value list (possibly out of order)
• vectors2vectors --input IF --training-file TNF --testing-file TTF --training-portion pct
  • Creates random training/test splits in some ratio (example below)
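For example (file names hypothetical), a 90/10 random split of an imported data set:

    vectors2vectors --input all.vectors --training-file train.vectors --testing-file test.vectors --training-portion 0.9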
Building & Accessing Models
• bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF
  • Builds a classifier model
  • Can also store the model, produce scores, a confusion matrix, etc.
• --trainer: MaxEnt, DecisionTree, NaiveBayes, etc.
• --report: train:accuracy, test:f1:en
• Can also use pre-split training & testing files
  • e.g. the output of vectors2vectors
  • --training-file, --testing-file
• Sample output (end-to-end example below):

    Confusion Matrix, row=true, column=predicted  accuracy=1.0
        label   0   1   |total
     0  de      1   .   |1
     1  en      .   1   |1
    Summary. train accuracy mean = 1.0 stddev = 0 stderr = 0
    Summary. test accuracy mean = 1.0 stddev = 0 stderr = 0
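Putting the pieces together, a hypothetical end-to-end run on the svmlight data imported earlier (file names invented; the options are the ones listed above):

    bin/mallet import-svmlight --input train.svmlight --output train.vectors
    bin/mallet train-classifier --input train.vectors --trainer MaxEnt --training-portion 0.9 --output-classifier langid.classifier --report train:accuracy test:f1:en

This trains a MaxEnt model on a 90/10 split, saves it to langid.classifier, and prints accuracy, per-label F1, and a confusion matrix like the sample above.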