This guide explores the process of classifying parts of speech (POS) for unknown words using machine learning classifiers. It provides a step-by-step roadmap for feature extraction, including templates for generating features based on character patterns. The document covers supervised, unsupervised, and semi-supervised learning, discussing various classifiers such as Naïve Bayes, MaxEnt, and neural networks. Key concepts include model evaluation, the choice of representation, and inductive bias. By understanding these elements, one can effectively predict POS distributions for new inputs.
Roadmap
• Open questions?
• Quick review of classification
• Feature templates
Classification Problem Steps
• Input processing:
  • Split data into training/dev/test
  • Convert data into a feature representation, aka an Attribute Value Matrix (small example below)
• Training
• Testing
• Evaluation
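For concreteness, a tiny hypothetical Attribute Value Matrix for the two example words used later (the feature names and labels are invented for illustration):

    instance      label  suffix3=ate  suffix2=ly  length
    unfrobulate   V      1            0           11
    turduckenly   Adv    0            1           11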
Feature templates
• Problem: predict the POS tag distribution of an unknown word
  • Input: “unfrobulate”
  • Input: “turduckenly”
• Features might include:
  • Last three characters are “ate”
  • Last two characters are “ly”
• Feature templates generate features given an input
  • Template: Last three characters == XXX
  • Plug in XXX to get a binary-valued feature
  • Templates generate many features (sketch below)
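A minimal sketch of template expansion in Java (Mallet's implementation language; the class and method names here are invented for illustration and are not Mallet API):

    import java.util.ArrayList;
    import java.util.List;

    public class SuffixTemplates {
        // Expand the template "last N characters == XXX" for N = 1..3
        // into binary string-valued features for one input word.
        static List<String> expand(String word) {
            List<String> features = new ArrayList<>();
            for (int n = 1; n <= 3 && n <= word.length(); n++) {
                features.add("suffix" + n + "=" + word.substring(word.length() - n));
            }
            return features;
        }

        public static void main(String[] args) {
            // [suffix1=e, suffix2=te, suffix3=ate]
            System.out.println(expand("unfrobulate"));
            // [suffix1=y, suffix2=ly, suffix3=nly]
            System.out.println(expand("turduckenly"));
        }
    }

Each instantiated suffix becomes its own binary feature, which is why one template generates many features.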
Classifiers
• Wide variety
• Differ on several dimensions:
  • Supervision
  • Learning function
  • Input features
Supervision in Classifiers
• Supervised:
  • True label/class of each training instance is provided to the learner at training time
  • Naïve Bayes, MaxEnt, decision trees, neural nets, etc.
• Unsupervised:
  • No true labels are provided for examples during training
  • Clustering: k-means; min-cut algorithms
• Semi-supervised (bootstrapping):
  • True labels are provided for only a subset of examples
  • Co-training, semi-supervised SVM/CRF, etc.
Inductive Bias
• What form of function is learned?
• Function that separates members of different classes:
  • Linear separator
  • Higher-order functions
  • Voronoi diagrams, etc.
• Graphically, a decision boundary
[figure: decision boundary between + and − examples]
Machine Learning Functions
• Problem: Can the representation effectively model the class to be learned?
• Motivates selection of the learning algorithm
  • For this function, a linear discriminant is GREAT!
  • Rectangular boundaries (e.g. ID trees) are TERRIBLE!
  • Pick the right representation!
[figure: + and − examples cleanly separable by a linear boundary]
Machine Learning Features
• Inputs:
  • E.g. words, acoustic measurements, parts of speech, syntactic structures, semantic classes, …
• Vectors of features:
  • E.g. word: letters
  • ‘cat’: L1 = c; L2 = a; L3 = t
  • Parts of syntax trees?
Machine Learning Features
• Questions:
  • Which features and values should be used?
  • How should they relate to each other?
• Issue 1: What values should they take?
  • Binary features: don’t do anything!
  • Real-valued features *may* need to be normalized
    • Can force the values to have 0 mean and unit variance
    • Compute the mean and variance of each real-valued feature on the training set
    • Replace the original value x with (x − mean) / stddev (sketch below)
  • Can also bin them or binarize them; often this works better
• Issue 2: Which ones are important?
  • Feature selection is sometimes important
  • Current approach
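A minimal sketch of that zero-mean / unit-variance normalization in Java (illustrative, not Mallet code):

    public class ZScore {
        // Compute mean and standard deviation of one real-valued feature
        // over the training set, then map each value x to (x - mean) / std.
        public static double[] normalize(double[] train) {
            double mean = 0.0;
            for (double x : train) mean += x;
            mean /= train.length;

            double var = 0.0;
            for (double x : train) var += (x - mean) * (x - mean);
            double std = Math.sqrt(var / train.length);

            double[] out = new double[train.length];
            for (int i = 0; i < train.length; i++) {
                // Test-time values must be scaled with the training mean/std,
                // not with statistics recomputed on the test set.
                out[i] = (train[i] - mean) / std;
            }
            return out;
        }
    }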
Machine Learning Toolkits
• Many learners, many tools/implementations
• Some broad tool sets:
  • Weka
    • Java, lots of classifiers, pedagogically oriented
  • Mallet
    • Java, classifiers, sequence learners
    • More heavy-duty
Mallet
• Machine learning toolkit
• Developed at UMass Amherst by Andrew McCallum
• Java implementation, open source
• Large collection of machine learning algorithms
  • Targeted to language processing
  • Naïve Bayes, MaxEnt, decision trees, Winnow, boosting
  • Also clustering, topic models, sequence learners
• Widely used, but
  • Research software: some bugs/gaps; odd documentation
Installation
• Installed on patas
  • /NLP_TOOLS/tool_sets/mallet/latest/
• Directories:
  • bin/: script files
  • src/: Java source code
  • class/: Java classes
  • lib/: jar files
  • sample-data/: Wikipedia docs for language id, etc.
Environment
• Should be set up on patas
• $PATH should include
  • /NLP_TOOLS/tool_sets/mallet/latest/bin
• $CLASSPATH should include
  • /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet-deps.jar:/NLP_TOOLS/tool_sets/mallet/latest/lib/mallet.jar
• Check:
  • which text2vectors
  • /NLP_TOOLS/tool_sets/mallet/latest/bin/text2vectors
Mallet Commands
• Mallet command types:
  • Data preparation
  • Data/model inspection
  • Training
  • Classification
• Command-line scripts
  • Shell scripts
    • Set up the Java environment
    • Invoke Java programs
  • --help lists the command-line parameters for the scripts (example below)
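For instance, any of the wrapper scripts prints its options when asked (output omitted here):

    bin/mallet import-file --help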
Mallet Data
• Mallet data instances:
  • instance_id label f1 v1 f2 v2 …
• Stored in internal binary format: “vectors”
  • Binary format used by learners, decoders
  • Need to convert text files to binary format
Data Preparation
• Built-in data importers
• One class per directory, one instance per file:
  • bin/mallet import-dir --input IF --output OF
  • Label is the directory name
  • (Also text2vectors)
• One instance per line:
  • bin/mallet import-file --input IF --output OF
  • Line: instance label text …
  • (Also csv2vectors)
• Both create a binary representation of text feature counts (example below)
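As a hypothetical one-instance-per-line example (file name and contents invented), a file langid.txt:

    doc1 en this is an english sentence
    doc2 de das ist ein deutscher satz

could be imported with:

    bin/mallet import-file --input langid.txt --output langid.vectors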
Data Preparation
• bin/mallet import-svmlight --input IF --output OF
  • Allows import of user-constructed feature-value pairs
  • Format:
    • label f1:v1 f2:v2 … fn:vn
    • Features can be strings or indexes
  • (Also bin/svmlight2vectors)
• If building test data separately from the original:
  • bin/mallet import-svmlight --input IF --output OF --use-pipe-from previously_built.vectors
  • Ensures a consistent feature representation (example below)
• Note: can’t mix svmlight models with others
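A hypothetical svmlight-format training file, train.svmlight (labels and feature names invented for illustration):

    V suffix3=ate:1 length:11
    Adv suffix2=ly:1 length:11

imported with:

    bin/mallet import-svmlight --input train.svmlight --output train.vectors

and a separately built test set, reusing the training pipe so the feature mapping stays consistent:

    bin/mallet import-svmlight --input test.svmlight --output test.vectors --use-pipe-from train.vectors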
Accessing Binary Formats
• vectors2info --input IF
  • --print-labels TRUE
    • Prints the list of category labels in the data set
  • --print-matrix sic
    • Prints all features and values, by string and number
    • Returns the original text feature-value list (possibly out of order)
• vectors2vectors --input IF --training-file TNF --testing-file TTF --training-portion pct
  • Creates random training/test splits in some ratio (example below)
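For example (file names hypothetical), a 90/10 random split of an imported data set:

    vectors2vectors --input all.vectors --training-file train.vectors --testing-file test.vectors --training-portion 0.9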
Building & Accessing Models
• bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF
  • Builds a classifier model
  • Can also store the model, produce scores, a confusion matrix, etc.
• --trainer: MaxEnt, DecisionTree, NaiveBayes, etc.
• --report: train:accuracy, test:f1:en
• Can also use pre-split training & testing files
  • e.g. the output of vectors2vectors
  • --training-file, --testing-file
• Sample output (end-to-end example below):

    Confusion Matrix, row=true, column=predicted  accuracy=1.0
        label   0   1   |total
     0  de      1   .   |1
     1  en      .   1   |1
    Summary. train accuracy mean = 1.0 stddev = 0 stderr = 0
    Summary. test accuracy mean = 1.0 stddev = 0 stderr = 0
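Putting the pieces together, a hypothetical end-to-end run on the svmlight data imported earlier (file names invented; the options are the ones listed above):

    bin/mallet import-svmlight --input train.svmlight --output train.vectors
    bin/mallet train-classifier --input train.vectors --trainer MaxEnt --training-portion 0.9 --output-classifier langid.classifier --report train:accuracy test:f1:en

This trains a MaxEnt model on a 90/10 split, saves it to langid.classifier, and prints accuracy, per-label F1, and a confusion matrix like the sample above.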