This document explores empirical learning methods used in natural language processing (NLP), emphasizing the necessity of diverse knowledge acquisition for effective language modeling. It discusses the challenges presented by the high dimensionality and sparsity of data, as well as the complexities of semantic ambiguity. Key topics include supervised and unsupervised learning tasks, feature engineering, and the importance of both quantitative and qualitative approaches in NLP. The ongoing goal is to deepen the understanding of language behavior through improved empirical models.
Empirical Learning Methods in Natural Language Processing Ido Dagan Bar Ilan University, Israel
Introduction • Motivations for learning in NLP • NLP requires huge amounts of diverse types of knowledge – learning makes knowledge acquisition more feasible, automatically or semi-automatically • Much of language behavior is preferential in nature, so there is a need to acquire both quantitative and qualitative knowledge
Introduction (cont.) • Apparently, empirical modeling obtains (so far) mainly “first-degree” approximation of linguistic behavior • Often, more complex models improve results only to a modest extent • Often, several simple models obtain comparable results • Ongoing goal – deeper modeling of language behavior within empirical models
Linguistic Background (?) • Morphology • Syntax – tagging, parsing • Semantics • Interpretation – usually out of scope • “Shallow” semantics: ambiguity, semantic classes and similarity, semantic variability
Information Units of Interest - Examples • Explicit units: • Documents • Lexical units: words, terms (surface/base form) • Implicit (hidden) units: • Word senses, name types • Document categories • Lexical syntactic units: part of speech tags • Syntactic relationships between words – parsing • Semantic relationships
Data and Representations • Frequencies of units • Co-occurrence frequencies • Between all relevant types of units (term-doc, term-term, term-category, sense-term, etc.) • Different representations and modeling • Sequences • Feature sets/vectors (sparse)
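The unit and co-occurrence frequencies above can be collected with a simple sparse counting pass; a minimal sketch for term-term co-occurrence in a fixed window (the toy corpus and window size are illustrative assumptions, not from the slides):

```python
# Sketch: sparse unit frequencies and term-term co-occurrence counts.
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count unit frequencies and term-term co-occurrences within +/-window."""
    unit_freq = Counter()   # frequencies of units
    pair_freq = Counter()   # sparse co-occurrence frequencies
    for tokens in sentences:
        unit_freq.update(tokens)
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    pair_freq[(w, tokens[j])] += 1
    return unit_freq, pair_freq

corpus = [["judge", "reads", "the", "sentence"],
          ["the", "sentence", "has", "two", "clauses"]]
units, pairs = cooccurrence_counts(corpus)
```

The same dictionary-of-counts scheme extends to term-doc, term-category, or sense-term pairs; the sparse feature vectors are just these Counters.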
Tasks and Applications • Supervised/classification: identify hidden units (concepts) of explicit units • Syntactic analysis, word sense disambiguation, name classification, relations, categorization, … • Unsupervised: identify relationships and properties of explicit units (terms, docs) • Association, topicality, similarity, clustering • Combinations
Using Unsupervised Methods within Supervised Tasks • Extraction and scoring of features • Clustering explicit units to discover hidden concepts and to reduce labeling effort • Generalization of learned weights or triggering-rules from known features to similar ones (similarity or class based) • Similarity/distance to training as the basis for classification method (nearest neighbor)
Characteristics of Learning in NLP • Very high dimensionality • Sparseness of data and relevant features • Addressing the basic problems of language: • Ambiguity – of concepts and features • One way to say many things • Variability • Many ways to say the same thing
Supervised Classification • Hidden concept is defined by a set of labeled training examples (category, sense) • Classification is based on entailment of the hidden concept by related elements/features • Example: two senses of “sentence”: • {word, paragraph, description} → Sense 1 • {judge, court, lawyer} → Sense 2 • Single or multiple concepts per example • Word sense vs. document categories
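The two-sense “sentence” example can be sketched as a trivial feature-entailment classifier: each sense is scored by how many of its indicative features appear in the context. The sense names, context, and overlap scoring are illustrative assumptions, not the slides’ actual model:

```python
# Minimal sketch: assign the sense whose indicative features best match the
# observed context (feature sets taken from the slide's "sentence" example).
SENSE_FEATURES = {
    "sense1_linguistic": {"word", "paragraph", "description"},
    "sense2_judicial":   {"judge", "court", "lawyer"},
}

def classify(context_words):
    """Pick the sense with the largest feature overlap with the context."""
    scores = {sense: len(feats & set(context_words))
              for sense, feats in SENSE_FEATURES.items()}
    return max(scores, key=scores.get)

label = classify(["the", "judge", "read", "the", "sentence", "in", "court"])
```

Real models (Naive Bayes, decision lists, later in the course) replace the raw overlap count with weights learned from training statistics.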
Supervised Tasks and Features • Typical Classification Tasks: • Lexical: Word sense disambiguation, target word selection in translation, name-type classification, accent restoration, text categorization (notice task similarity) • Syntactic: POS tagging, PP-attachment, parsing • Complex: anaphora resolution, information extraction • Features (“feature engineering”): • Adjacent context: words, POS • In various relationships – distance, syntactic • possibly generalized to classes • Other: morphological, orthographic, syntactic
Learning to Classify • Two possibilities for acquiring the “entailment” relationships: • Manually: by an expert • time consuming, difficult – “expert system” approach • Automatically: concept is defined by a set of training examples • training quantity/quality • Training: learn entailment of concept by features of training examples (a model) • Classification: apply model to new examples
Supervised Learning Scheme “Labeled” Examples Training Algorithm Classification Model New Examples Classification Algorithm Classifications
Avoiding/Reducing Manual Labeling • Basic supervised setting – examples are annotated manually with labels (sense, text category, part of speech) • Settings in which labeled data can be obtained without manual annotation: • Anaphora, target word selection – e.g., “The system displays the file on the monitor and prints it.” • Bootstrapping approaches – sometimes referred to as unsupervised learning, though they actually address a supervised task of identifying an externally imposed class (“unsupervised” training)
Learning Approaches • Model-based: define entailment relations and their strengths by a training algorithm • Statistical/Probabilistic: model is composed of probabilities (scores) computed from training statistics • Iterative feedback/search (neural network): start from some model, classify training examples, and correct the model according to errors • Memory-based: no explicit training algorithm or model – classify by matching new examples against the raw training data (compare to unsupervised tasks)
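The memory-based approach can be sketched in a few lines: no training phase, just store the labeled examples and return the label of the most similar one. Feature-set overlap is used as the similarity measure here, an illustrative choice (memory-based learners typically use weighted overlap metrics):

```python
# Sketch of memory-based (nearest-neighbor) classification.
def nearest_neighbor(train, features):
    """train: list of (feature_set, label); label by the closest example."""
    best_example = max(train, key=lambda ex: len(ex[0] & features))
    return best_example[1]

memory = [({"judge", "court"}, "judicial"),
          ({"word", "paragraph"}, "linguistic")]
result = nearest_neighbor(memory, {"court", "lawyer", "verdict"})
```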
Evaluation • Evaluation mostly based on (subjective) human judgment of relevancy/correctness • In some cases the task is objective (e.g. OCR), or a mathematical criterion (likelihood) applies • Basic measure for classification – accuracy • In many tasks (extraction, multiple classes per instance, …) most instances are “negative”; therefore recall/precision measures are used, following information retrieval (IR) tradition • Cross-validation – different training/test splits
Evaluation: Recall/Precision • Recall: #correct extracted/total correct • Precision: #correct extracted/total extracted • Recall/precision curve - by varying the number of extracted items, assuming the items are sorted by decreasing score
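The definitions above, sketched in code: the extracted items are sorted by decreasing score, and varying the cutoff k traces the recall/precision curve. The gold set and ranking are illustrative toy data:

```python
# Recall = #correct extracted / total correct
# Precision = #correct extracted / total extracted
def recall_precision(extracted, gold):
    correct = sum(1 for x in extracted if x in gold)
    return correct / len(gold), correct / len(extracted)

gold = {"a", "b", "c", "d"}              # 4 correct items in total
ranked = ["a", "x", "b", "y", "c"]       # system output, decreasing score
# One recall/precision point per cutoff k = 1..5:
curve = [recall_precision(ranked[:k], gold) for k in range(1, len(ranked) + 1)]
```

As k grows, recall can only increase while precision typically drops, which is the familiar trade-off shape of the curve.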
Micro/Macro averaging • Often results are evaluated for multiple tasks • Many categories, many ambiguous words • Macro-averaging: compute results separately for each category and average • Micro-averaging (common): refer to all classification instances, from all categories, as one pile and compute results • Gives more weight to common categories
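The difference between the two averages can be seen on two toy categories, one common and one rare (per-category counts are illustrative assumptions):

```python
# category -> (#correct extracted, #total extracted)
counts = {"common": (90, 100), "rare": (1, 10)}

# Macro: compute precision per category, then average the per-category values.
macro = sum(c / e for c, e in counts.values()) / len(counts)

# Micro: pool all classification instances into one pile, then compute once.
micro = sum(c for c, _ in counts.values()) / sum(e for _, e in counts.values())
```

Here macro-averaged precision is 0.5 while micro-averaged precision is 91/110 ≈ 0.83 – the common category dominates the micro average, as the slide notes.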
Course Organization • Material organized mostly by types of learning approaches, while demonstrating applications as we go along • Emphasis on demonstrating how computational linguistics tasks can be modeled (with simplifications) as statistical/learning problems • Some sections covering the lecturer’s personal work perspective
Course Outline • Sequential modeling • POS tagging • Parsing • Supervised (instance-based) classification • Simple statistical models • Naïve Bayes classification • Perceptron/Winnow (one layer NN) • Improving supervised classification • Unsupervised learning - clustering
Course Outline (1) • Supervised classification • Basic/earlier models: PP-attachment, decision list, target word selection • Confidence interval • Naive Bayes classification • Simple smoothing -- add-constant • Winnow • Boosting
Course Outline (2) • Part-of-speech tagging • Hidden Markov Models and the Viterbi algorithm • Smoothing -- Good-Turing, back-off • Unsupervised parameter estimation with Expectation Maximization (EM) algorithm • Transformation-based learning • Shallow parsing • Transformation based • Memory based • Statistical parsing and PCFG (2 hours) • Full parsing - Probabilistic Context Free Grammar (PCFG)
Course Outline (3) • Reducing training data • Selective sampling for training • Bootstrapping • Unsupervised learning • Word association • Information theory measures • Distributional word similarity, similarity-based smoothing • Clustering
Misc. • Major literature sources: • Foundations of Statistical Natural Language Processing, by Manning & Schütze, MIT Press • Articles • Additional slide credits: • Prof. Shlomo Argamon, Chicago • Some slides from the book web-site