
Learning Representations of Language for Domain Adaptation


Presentation Transcript


  1. Learning Representations of Language for Domain Adaptation. Alexander Yates and Fei (Irene) Huang, Temple University, Computer and Information Sciences

  2. Outline: Representations in NLP (machine learning / data mining perspective; linguistics perspective); Domain Adaptation; Learning Representations; Experiments

  3. A sequence-labeling task. Identify phrases that name birds and cats. Examples: "Thrushes [BIRD] build cup-shaped nests, sometimes lining them with mud." "Sylvester [CAT] was #33 on TV Guide's list of top 50 best cartoon characters, together with Tweety Bird [BIRD]."
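A minimal sketch of the labeling described on this slide, written one label per token. Encoding the phrase-level BIRD/CAT tags token by token is an illustrative choice, not necessarily the authors' exact scheme.

```python
# One label per token for the slide's first example; "X" marks tokens that
# are not part of a bird or cat name. This per-token encoding is an
# illustrative assumption, not necessarily the authors' exact scheme.
sentence = "Thrushes build cup-shaped nests , sometimes lining them with mud ."
labels   = "BIRD     X     X          X     X X         X      X    X    X   X"
for token, label in zip(sentence.split(), labels.split()):
    print(f"{token}\t{label}")
```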

  4. Machine Learning. Quick formal background: Let X be the set of all possible data points (e.g., all English sentences). Let Z be the space of all possible predictions (e.g., all sequences of labels). A target is a function f: X → Z that we're trying to learn. A learning machine is an algorithm L. Input: a set of examples S = {xi} drawn from a distribution D, and a label zi = f(xi) for each example. Output: a hypothesis h: X → Z that minimizes E_{x~D}[ 1[h(x) ≠ f(x)] ].
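In symbols, the learning objective just described (minimize the expected zero-one loss under D) is:

```latex
h \;=\; \arg\min_{h'} \; \mathbb{E}_{x \sim D}\!\left[\, \mathbf{1}\!\left[\, h'(x) \neq f(x) \,\right] \,\right]
```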

  5. Representations for Learning. Most NLP systems first transform the raw data into a more convenient representation. • A representation is a function R: X → Y, for some suitable feature space Y. • A feature is a dimension in the feature space Y. • Alternatively, we may use the word feature to refer to a value for one component of R(x), for some representation R and instance x. A learning machine L takes as input a set of examples (R(xi), f(xi)) and returns h: Y → Z.

  6. A traditional NLP representation. For each token (example: "Thrushes build cup-shaped nests, …"), the representation includes word-based features and orthographic features; a sketch follows below. Feature sets are carefully engineered for specific tasks, but usually include at least the word-based features.
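As an illustration of such a feature set, here is a minimal Python sketch. The specific orthographic features chosen (capitalization, digits, hyphens, suffix) are common choices and an assumption, since the slide's exact feature list is not in the transcript.

```python
def token_features(tokens, i):
    """Word-based and orthographic features for the token at position i.
    The orthographic features here are common choices, not necessarily the
    exact ones on the slide."""
    w = tokens[i]
    return {
        # word-based features: the token itself and its neighbors
        "w": w.lower(),
        "w-1": tokens[i - 1].lower() if i > 0 else "<s>",
        "w+1": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        # orthographic features
        "is_capitalized": w[0].isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "has_hyphen": "-" in w,
        "suffix3": w[-3:].lower(),
    }

tokens = "Thrushes build cup-shaped nests".split()
print(token_features(tokens, 0))
```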

  7. Sparsity • Common indicators for birds: • feather, beak, nest, egg, wing • Uncommon indicators: • aviary, archaeopteryx, insectivorous, warm-blooded “The jackjaw stood irreverently on the scarecrow’s shoulder.” • Sparsity analysis of Collins parser: (Bikel, 2004) • bilexical statistics are available in < 1.5% of parse decisions

  8. Sparsity in biomedical POS tagging • Most part-of-speech taggers are trained on newswire text (Penn Treebank) • In a standard biomedical data set, fully 23% of words never appear in the Penn Treebank

  9. Polysemy • The word “thrush” is not necessarily an indicator of a bird: • Thrush is the term for an overgrowth of yeast in a baby's mouth. • Thrush products have been a staple of hot rodders for over 40 years as these performance mufflers bring together the power and sound favored by true enthusiasts. • “Leopard”, “jaguar”, “puma”, “tiger”, “lion”, etc. all have various meanings as cats, operating systems, sports teams, and so on. • Word meanings depend on their contexts, and word-based features do not capture this.

  10. Embeddings • Kernel trick: • implicitly embed data points in a higher-dimensional space • Dimensionality reduction: • embed data points in a lower-dimensional space • Common technique in text mining, combined with vector space models • PCA, LSA, SVD (Deerwester et al., 1990) • Self-organizing maps (Honkela, 1997) • Independent component analysis (Sahlgren, 2005) • Random indexing (Väyrynen et al., 2007) • But existing embedding techniques ignore linguistic structure.
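To make the dimensionality-reduction idea concrete, here is a minimal LSA/SVD-style sketch using numpy. The term-document counts are invented purely for illustration and are not from the talk.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
X = np.array([
    [3, 0, 1, 0],   # "thrush"
    [2, 0, 2, 0],   # "nest"
    [0, 4, 0, 1],   # "signaling"
    [0, 3, 0, 2],   # "pathway"
], dtype=float)

# LSA/SVD-style embedding: keep the top k singular dimensions.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_embeddings = U[:, :k] * s[:k]   # each row is a k-dimensional term vector
print(term_embeddings)
```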

  11. A representation from linguistics. [Figure: hand-built lexical entries for "thrushes" and "build".] Many modern linguistic theories (GPSG, HPSG, LFG, etc.) treat language as a small set of constraints over a large number of lexical features. But lexical entries are painstakingly crafted by hand.

  12. Outline: Representations in NLP; Domain Adaptation; Learning Representations; Experiments

  13. Domains. Definition: A domain is a subset of language that is related through genre, topic, or style. Examples: newswire text, science fiction novels, biomedical research literature.

  14. Domain Dependence Biomedical Domain … factor for Wnt signaling (noun), … … for the Wnt signaling (noun) pathway via … ... in a novel signaling (noun) pathway from an extracellular guidance cue … … in the Wnt signaling (noun) pathway, and mutation … Newswire Domain … isn’t signaling (verb) a recession … … acquiring the company, signaling (verb) to others that … … in that list, signaling (verb) that all the company’s coal and … Dow officials were signaling (verb) that the company … … the S&P was signaling (verb) that the Dow could fall …

  15. Domain adaptation: a hard test for NLP. Formally, a domain is a probability distribution D over the instance set X, e.g., sentences in the newswire domain ~ D_News(X), and sentences in the biomedical domain ~ D_Bio(X). In domain adaptation, a learning machine is given training examples from a source domain. The hypothesis is then tested on data points drawn from a separate target domain.

  16. Learning theory for domain adaptation. A recently proved theorem: the error rate of h on target domain T after being trained on source domain S depends on • the error rate of h on the source domain S, and • the distance between S and T. The claim depends on a particular notion of "distance" between the probability distributions S and T [Ben-David et al., 2009].

  17. Formal version
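The formula on this slide did not survive the transcript. As a hedged reconstruction, the bound usually stated in Ben-David et al.'s work (the exact form on the slide may differ) is:

```latex
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) \;+\; \lambda
```

where ε_S(h) and ε_T(h) are the error rates of h on the source and target domains, d_{HΔH} is a divergence between the two domain distributions, and λ is the error of the best single hypothesis on both domains combined.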

  18. Outline: Representations in NLP (machine learning / data mining perspective; linguistics perspective); Domain Adaptation; Learning Representations; Experiments

  19. Objectives for (lexical) representations • Usefulness: We want features that help in learning the target function. • Non-Sparsity: We want features that appear commonly in reasonable amounts of training data. • Context-dependence: We want features that somehow depend on, or take into account, the context of a word. • Minimal domain distance: We want features that appear approximately as often in one domain as any other. • Automation: We don’t want to have to manually construct the features.

  20. Representation learning. [Figure: the sentence "Thrushes build cup-shaped nests" is mapped by a representation R to features, and then by a hypothesis h to the labels X BIRD X X.] We learn the hypothesis h. Why not learn the representation R, too?

  21-23. 1) Ngram Models for Representations. [Figures, built up over three slides: n-gram contexts for the words "finches" and "thrushes"; the visual content is not recoverable from the transcript.]

  24. 1) Ngram Models for Representations (training). [Figure: the training sentence "True finches are predominantly seed-eating songbirds." with per-token labels (X, BIRD, X, X, X, BIRD) and, for each token, word-based, orthographic, and ngram features.]

  25. 1) Ngram Models for Representations (testing). [Figure: the test sentence "Thrushes build cup-shaped nests, sometimes …" with per-token word-based, orthographic, and ngram features, and the predicted label BIRD on "Thrushes".] Ngram features: Advantages: automated, useful. Disadvantages: sparse, not context-dependent.

  26. Pause: let's generalize the procedure (a sketch follows below). • Train a language model on lots of (unlabeled) text, preferably from multiple domains. • Use the language model to annotate the (labeled) training and test texts with latent information. • Use the annotations as features in a CRF. • Train and test the CRF as usual.
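A minimal sketch of this pipeline. All names (annotate_with_latent_states, language_model, crf) are placeholders for illustration, not the authors' code.

```python
# A minimal sketch of the procedure above. All function and variable names
# are placeholders for illustration, not the authors' actual code.

def annotate_with_latent_states(language_model, sentences):
    """Step 2: use a trained language model (e.g., an HMM) to decode a
    latent state for each token and attach it as an extra feature."""
    annotated = []
    for tokens in sentences:
        states = language_model.decode(tokens)            # e.g., Viterbi decoding
        annotated.append([
            {"word": w.lower(), "latent_state": f"s{z}"}  # feature dict per token
            for w, z in zip(tokens, states)
        ])
    return annotated

# Step 1: train the language model on large unlabeled, multi-domain text.
# language_model = train_language_model(unlabeled_corpus)   # placeholder
#
# Steps 3-4: feed the annotated token features to any CRF toolkit
# (e.g., sklearn-crfsuite or CRF++) and train / test as usual.
# crf.fit(annotate_with_latent_states(language_model, train_sents), train_labels)
```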

  27. Pause: how to improve the procedure? The main idea we've explored is to cluster words into sets of related words and use the clusters as features. We can control the number of clusters to make the features less sparse.

  28. 2) Distributional Clustering • Construct a Naïve Bayes model for generating trigrams • The parent node is a latent state with K possible values • Trigrams are generated according to Pleft(word | parent), Pmid(word | parent), and Pright(word | parent). (Example sentence: "True finches are predominantly seed-eating songbirds.")
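Written out, the generative model just described, for a trigram (w_{i-1}, w_i, w_{i+1}) with a latent parent state c taking one of K values, is:

```latex
P(w_{i-1}, w_i, w_{i+1}) \;=\; \sum_{c=1}^{K} P(c)\, P_{\text{left}}(w_{i-1} \mid c)\, P_{\text{mid}}(w_i \mid c)\, P_{\text{right}}(w_{i+1} \mid c)
```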

  29. 2) Distributional Clustering – NB. 2) Train the prior P(parent) and the conditional distributions on a large corpus using EM, treating all trigrams as independent. 3) For each token in the training and test sets, determine the best value of the latent state and use it as a new feature. (Example sentences: "True finches are predominantly seed-eating songbirds." "Thrushes build cup-shaped nests, …")

  30. 2) Distributional Clustering – NB Advantages over ngram features 1) Sparsity: only K features, so each should be common 2) Context-dependence: The new feature depends not just on the token at position i, but also on tokens at i-1 and i+1 Potential problems: 1) Features are only sensitive to immediate neighbors 2) The model requires 3 observation distributions, each of which will be sparsely observed. 3) Did we throw out too much of the information in the ngram model by reducing the dimensionality too far?

  31. 3) Distributional clustering - HMMs. Hidden Markov Model: one latent node yi per token xi; a conditional observation distribution Pobs(xi | yi); a conditional transition distribution Ptrans(yi | yi-1); a prior distribution Pprior(y0). Joint probability: P(x, y) = Pprior(y0) Pobs(x0 | y0) ∏i≥1 Ptrans(yi | yi-1) Pobs(xi | yi). (Example sentence: "True finches are predominantly seed-eating songbirds.")

  32. 3) Distributional clustering - HMMs. 1) Train the prior and conditional distributions on a large corpus using EM. 2) Use the Viterbi algorithm to find the best setting of all latent states for a given sentence. 3) Use the latent state value yi as a new feature for xi. (Example sentences: "True finches are predominantly seed-eating songbirds." "Thrushes build cup-shaped nests.") A sketch of the decoding step follows below.
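A minimal Viterbi-decoding sketch of step 2, with made-up toy parameters standing in for the EM-trained HMM.

```python
import numpy as np

def viterbi(obs_ids, prior, trans, obs):
    """Most likely latent state sequence for one sentence under an HMM.

    prior: (K,) initial state probabilities
    trans: (K, K) trans[i, j] = P(state j | previous state i)
    obs:   (K, V) obs[j, w] = P(word w | state j)
    obs_ids: list of word ids (ints in [0, V))
    In the approach described here these parameters would come from EM
    training on a large unlabeled corpus; the values below are toys.
    """
    K, T = len(prior), len(obs_ids)
    delta = np.zeros((T, K))            # best log-prob of a path ending in state k at time t
    back = np.zeros((T, K), dtype=int)  # back-pointers
    delta[0] = np.log(prior) + np.log(obs[:, obs_ids[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans)   # (prev state, current state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(obs[:, obs_ids[t]])
    # Follow back-pointers from the best final state.
    states = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]                 # one latent state id per token -> new features

# Toy example: 2 states, 3 word types, 4-token sentence.
prior = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
obs = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])
print(viterbi([0, 1, 2, 2], prior, trans, obs))
```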

  33. 3) Distributional Clustering – HMMs Advantages over NB features 1) Sparsity: same number of features, but the HMM model itself is less sparse -- it includes only one observation distribution 2) Context-dependence: The new feature depends (indirectly) on the whole observation sequence Potential problem: Did we throw out too much of the information in the ngram model by reducing the dimensionality too far?

  34. 4) Multi-dimensional clustering. Independent HMM (I-HMM) model: L layers of HMM models, each trained independently. Each layer's parameters are initialized randomly for EM. (Example sentence: "True finches are predominantly seed-eating songbirds.")

  35. 4) Multi-dimensional clustering. As before, we decode each layer using the Viterbi algorithm to generate features. Each layer represents a random projection from the full feature space to K boolean dimensions. (Example sentence: "True finches are predominantly seed-eating songbirds.")
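A minimal sketch of turning L independently trained layers into per-token features, reusing the viterbi function sketched above; the per-layer parameters are assumed to come from separate EM runs with random initializations.

```python
def ihmm_features(obs_ids, layers):
    """layers: list of (prior, trans, obs) tuples, one per independently
    trained HMM layer. Returns one feature dict per token, with the decoded
    state of every layer attached as a separate feature."""
    per_layer_states = [viterbi(obs_ids, p, t, o) for (p, t, o) in layers]
    features = []
    for i in range(len(obs_ids)):
        features.append({f"layer{l}_state": states[i]
                         for l, states in enumerate(per_layer_states)})
    return features
```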

  36. 4) Multi-dimensional clustering. Advantages over HMM features: 1) Usefulness: closer to the lexical representation from linguistics. 2) Usefulness: can represent K^L points (instead of just K). Potential problem: each layer is trained independently, so are they really providing additional (rather than overlapping) information?

  37. Outline: Representations in NLP (machine learning / data mining perspective; linguistics perspective); Domain Adaptation; Learning Representations; Experiments

  38. Experiments • Part-of-speech tagging (and chunking) • Train on newswire text • Test on biomedical text (Huang and Yates, ACL 2009; Huang and Yates, DANLP 2010) • Semantic role labeling • Train on newswire text • Test on fiction text (Huang and Yates, ACL 2010)

  39. Part-of-Speech (POS) tagging. Except for the Web ngram features, all features were derived from the Penn Treebank plus 70,000 sentences of unlabeled biomedical text.

  40. Sparsity Graphical models perform better on sparse words than not-sparse words, relative to Ngram models. Sparse: The word appears 5 times or fewer in all of our unlabeled text. Not Sparse: The word appears 50 times or more in all of our unlabeled text.

  41. Polysemy Graphical models perform better on polysemous words than not-polysemous words, relative to Ngram models (except for NB). Polysemous: The word is associated with multiple, unrelated POS tags. Not Polysemous: The word has only 1 POS tag in all of our labeled text.

  42. Accuracy vs. domain distance • Distance is measured as the Jensen-Shannon divergence between frequencies of features in S and T. • For I-HMMs, we weighted the distance for each layer by the proportion of CRF parameter weights placed on that layer.
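As a concrete reference, a minimal numpy sketch of the Jensen-Shannon distance measure described on this slide; the feature counts are made up for illustration.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two feature-frequency distributions
    (e.g., normalized counts of each feature in the source and target domains)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

source_freqs = [10, 5, 0, 1]   # feature counts in the source domain (toy)
target_freqs = [2, 6, 7, 1]    # feature counts in the target domain (toy)
print(js_divergence(source_freqs, target_freqs))
```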

  43. Biomedical NP Chunking The I-HMM representation can reduce error by over 57% relative to a standard representation, when training on news text and testing on biomedical journal text.

  44. Chinese POS Tagging HMMs can beat a state-of-the-art system on many different domains.

  45. Semantic Role Labeling (SRL) (aka shallow semantic parsing). Input: • training sentences, labeled with syntax and semantic roles; • a new sentence, and its syntax. Output: the predicate, arguments, and their roles. Example output (for "Thrushes build cup-shaped nests"): Thrushes = Builder, build = Predicate, cup-shaped nests = Thing Built.

  46. Parsing. [Figure: syntactic parse of "Chris broke the window with a hammer": "Chris" (proper noun, NP, subject), "broke" (verb), "the window" (det. + noun, NP, direct object), "with a hammer" (prep. + det. + noun, PP), under an S/VP structure.]

  47. Semantic Role Labeling. [Figure: the same parse of "Chris broke the window with a hammer", now labeled with semantic roles: "Chris" = Breaker, "the window" = Thing broken, "with a hammer" = Means.]

  48. Semantic Role Labeling. [Figure: parse of "The window broke": "The window" (det. + noun, NP, subject) is labeled Thing broken; "broke" is the verb.]

  49. Simple, open-domain SRL. Baseline features for "Chris broke the window with a hammer":
      Token:                 Chris        broke   the    window   with    a      hammer
      Dist. from predicate:  -1           0       +1     +2       +3      +4     +5
      POS tag:               Proper Noun  Verb    Det.   Noun     Prep.   Det.   Noun
      Chunk tag:             B-NP         B-VP    B-NP   I-NP     B-PP    B-NP   I-NP
      SRL labels: Chris = Breaker, broke = Pred, the window = Thing Broken, with a hammer = Means

  50. Simple, open-domain SRL (Baseline + HMM). The same per-token features as slide 49 (distance from predicate, POS tag, chunk tag), plus an HMM label (the decoded latent state) for each token of "Chris broke the window with a hammer". SRL labels: Chris = Breaker, broke = Pred, the window = Thing Broken, with a hammer = Means. A sketch follows below.
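A minimal sketch of assembling the Baseline + HMM token features for the SRL classifier; the feature names and tag values below are illustrative placeholders, not the exact ones used in the experiments.

```python
def srl_token_features(tokens, pos_tags, chunk_tags, hmm_states, pred_index, i):
    """Baseline + HMM features for the token at position i, in the spirit of
    slides 49-50. Feature names are illustrative placeholders."""
    return {
        "word": tokens[i].lower(),
        "dist_from_predicate": i - pred_index,   # e.g., -1, 0, +1, ...
        "pos": pos_tags[i],
        "chunk": chunk_tags[i],
        "hmm_label": hmm_states[i],              # decoded latent state (the +HMM feature)
    }

tokens = ["Chris", "broke", "the", "window", "with", "a", "hammer"]
pos = ["NNP", "VBD", "DT", "NN", "IN", "DT", "NN"]        # Penn Treebank-style tags (illustrative)
chunks = ["B-NP", "B-VP", "B-NP", "I-NP", "B-PP", "B-NP", "I-NP"]
hmm = [17, 3, 8, 42, 5, 8, 42]                            # made-up latent state ids
print(srl_token_features(tokens, pos, chunks, hmm, pred_index=1, i=3))
```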
