
Learning for Semantic Parsing of Natural Language


Presentation Transcript


  1. Learning for Semantic Parsing of Natural Language • Raymond J. Mooney • Ruifang Ge, Rohit Kate, Yuk Wah Wong • John Zelle, Cynthia Thompson • December 19, 2005

  2. Syntactic Natural Language Learning • Most computational research in natural-language learning has addressed “low-level” syntactic processing. • Morphology (e.g. past-tense generation) • Part-of-speech tagging • Shallow syntactic parsing (chunking) • Syntactic parsing

  3. Semantic Natural Language Learning • Learning for semantic analysis has been restricted to relatively “shallow” meaning representations. • Word sense disambiguation (e.g. SENSEVAL) • Semantic role assignment (determining agent, patient, instrument, etc., e.g. FrameNet, PropBank) • Information extraction

  4. Semantic Parsing • A semantic parser maps a natural-language sentence to a complete, detailed semantic representation: a logical form or meaning representation (MR). • For many applications, the desired output is immediately executable by another program. • Two application domains: • CLang: RoboCup Coach Language • GeoQuery: A Database Query Application

  5. CLang: RoboCup Coach Language • In the RoboCup Coach competition, teams compete to coach simulated soccer players. • The coaching instructions are given in a formal language called CLang. • Example: the coach advice "If the ball is in our penalty area, then all our players except player 4 should stay in our half." is semantically parsed to the CLang expression ((bpos (penalty-area our)) (do (player-except our {4}) (pos (half our))))

  6. GeoQuery: A Database Query Application • A query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996] • Example: the user question "How many cities are there in the US?" is semantically parsed to the query answer(A, count(B, (city(B), loc(B, C), const(C, countryid(USA))), A))

  7. Learning Semantic Parsers • Manually programming robust semantic parsers is difficult due to the complexity of the task. • Semantic parsers can be learned automatically from sentences paired with their logical forms. [Diagram: NL/LF training examples → semantic-parser learner → semantic parser; at test time: natural language → semantic parser → logical form]

  8. Engineering Motivation • Most computational language-learning research strives for broad coverage while sacrificing depth. • “Scaling up by dumbing down” • Realistic semantic parsing currently entails domain dependence. • Domain-dependent natural-language interfaces have a large potential market. • Learning makes developing specific applications more tractable. • Training corpora can be easily developed by tagging existing corpora of formal statements with natural-language glosses.

  9. Cognitive Science Motivation • Most natural-language learning methods require supervised training data that is not available to a child. • General lack of negative feedback on grammar. • No POS-tagged or treebank data. • Assuming a child can infer the likely meaning of an utterance from context, NL/LF pairs are more cognitively plausible training data.

  10. Our Semantic-Parser Learners • CHILL+WOLFIE (Zelle & Mooney, 1996; Thompson & Mooney, 1999, 2003) • Separates parser-learning and semantic-lexicon learning. • Learns a deterministic parser using ILP techniques. • COCKTAIL (Tang & Mooney, 2001) • Improved ILP algorithm for CHILL. • SILT (Kate, Wong & Mooney, 2005) • Learns symbolic transformation rules for mapping directly from NL to LF. • SCISSOR (Ge & Mooney, 2005) • Integrates semantic interpretation into Collins' statistical syntactic parser. • WASP (Wong & Mooney, in preparation) • Uses syntax-based statistical machine translation methods. • KRISP (Kate & Mooney, in preparation) • Uses a series of SVM classifiers employing a string kernel to iteratively build semantic representations.

  11. SCISSOR: Semantic Composition that Integrates Syntax and Semantics to get Optimal Representations • Based on a fairly standard approach to compositional semantics [Jurafsky and Martin, 2000] • A statistical parser is used to generate a semantically augmented parse tree (SAPT) • Augments Collins' head-driven model 2 to incorporate semantic labels • Translates the SAPT into a complete formal meaning representation (MR) [Figure: SAPT for "our player 2 has the ball": S-bowner → NP-player (PRP$-team "our", NN-player "player", CD-unum "2") and VP-bowner (VB-bowner "has", NP-null (DT-null "the", NN-null "ball")); MR: bowner(player(our,2))]

  12. Overview of SCISSOR [Diagram: Training: SAPT training examples → learner → integrated semantic parser; Testing: NL sentence → integrated semantic parser → SAPT → ComposeMR → MR]

  13. SCISSOR SAPT Parser Implementation • Semantic labels are added to Bikel's (2004) open-source version of the Collins statistical parser. • The head-driven derivation of production rules is augmented to also generate semantic labels. • Parameter estimation during training employs an augmented smoothing technique to account for the additional data sparsity created by semantic labels. • Test sentences are parsed to find the most probable SAPT using a standard beam-search-constrained version of the CKY chart-parsing algorithm, as in the generic sketch below.
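To make the parsing step concrete, here is a generic probabilistic CKY parser with beam pruning, sketched in Python; the toy grammar, probabilities, and beam width are invented for illustration, and SCISSOR itself relies on Bikel's implementation of the Collins model with the semantic-label augmentations described above.

```python
# Generic probabilistic CKY with beam pruning (toy grammar; illustrative only).
from collections import defaultdict

LEX = {"the": [("DT", 1.0)], "player": [("NN", 0.5)],
       "ball": [("NN", 0.5)], "has": [("VB", 1.0)]}
BIN = {("DT", "NN"): [("NP", 0.8)], ("VB", "NP"): [("VP", 0.9)],
       ("NP", "VP"): [("S", 1.0)]}
BEAM = 5                          # keep only the best few labels per span

def cky(words):
    n = len(words)
    chart = defaultdict(dict)     # (i, j) -> {label: best inside probability}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = dict(LEX[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j, cell = i + span, {}
            for k in range(i + 1, j):            # try every split point
                for b, pb in chart[(i, k)].items():
                    for c, pc in chart[(k, j)].items():
                        for a, pr in BIN.get((b, c), []):
                            p = pr * pb * pc
                            if p > cell.get(a, 0.0):
                                cell[a] = p
            # Beam search: prune each cell to its most probable entries.
            chart[(i, j)] = dict(sorted(cell.items(),
                                        key=lambda kv: -kv[1])[:BEAM])
    return chart[(0, n)]

print(cky("the player has the ball".split()))  # -> {'S': 0.144} (up to rounding)
```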

  14. ComposeMR [Figure: the SAPT for "our player 2 has the ball" with only its semantic labels shown: bowner → player (team "our", player "player", unum "2") and bowner (bowner "has", null "the ball")]

  15. ComposeMR [Figure: the same tree with each semantic label replaced by an MR template with open argument slots: bowner(_) → player(_,_) (team "our", player(_,_) "player", unum "2") and bowner(_) (bowner(_) "has", null "the ball")]

  16. ComposeMR [Figure: argument slots are filled bottom-up by type: player(team,unum) yields player(our,2), and bowner(player) yields the complete MR bowner(player(our,2))]
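The composition shown on the last three slides can be sketched as a short recursion; the following Python is a toy reconstruction under assumed data structures (a SAPT node as a (semantic label, children) pair and a hand-written slot table), not SCISSOR's actual code.

```python
# Toy ComposeMR sketch: fill each node's argument slots bottom-up with the
# MRs of children whose semantic type matches the slot type.
SLOTS = {"bowner": ["player"],          # bowner(_)
         "player": ["team", "unum"],    # player(_,_)
         "team": [], "unum": []}        # atomic types, denoted by the word

def compose_mr(node):
    """Return (semantic type, MR string) for a SAPT subtree, or None."""
    label, children = node
    if label == "null":                  # semantically vacuous subtree
        return None
    if not SLOTS[label]:                 # atomic type: the word denotes itself
        return (label, children[0])
    sub = [m for m in (compose_mr(c) for c in children if isinstance(c, tuple)) if m]
    fillers = {typ: mr for typ, mr in sub if typ != label}
    args = [fillers.get(slot, "_") for slot in SLOTS[label]]
    return (label, f"{label}({','.join(args)})")

# SAPT for "our player 2 has the ball", with syntactic labels omitted:
sapt = ("bowner",
        [("player", [("team", ["our"]), ("player", ["player"]), ("unum", ["2"])]),
         ("bowner", [("bowner", ["has"]), ("null", ["the", "ball"])])])
print(compose_mr(sapt)[1])               # -> bowner(player(our,2))
```

In this simplified version a child of the same type as its parent (here VB-bowner under VP-bowner) just propagates the unfilled template upward, while siblings of other types fill the slots, mirroring the player(our,2) and bowner(player(our,2)) steps above.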

  17. WASP: A Machine Translation Approach to Semantic Parsing • Based on a semantic grammar of the natural language. • Uses machine translation techniques: • Synchronous context-free grammars (SCFG) (Wu, 1997; Melamed, 2004; Chiang, 2005) • Word alignments (Brown et al., 1993; Och & Ney, 2003) • Hence the name: Word Alignment-based Semantic Parsing

  18. Synchronous Context-Free Grammars (SCFG) • Developed by Aho & Ullman (1972) as a theory of compilers that combines syntax analysis and code generation in a single phase • Generates a pair of strings in a single derivation

  19. Compiling, Machine Translation, and Semantic Parsing • SCFG: formal language to formal language (compiling) • Alignment models: natural language to natural language (machine translation) • WASP: natural language to formal language (semantic parsing)

  20. Context-Free Semantic Grammar • Productions: QUERY → What is CITY; CITY → the capital CITY; CITY → of STATE; STATE → Ohio • Derivation: QUERY ⇒ What is CITY ⇒ What is the capital CITY ⇒ What is the capital of STATE ⇒ What is the capital of Ohio

  21. Productions of Synchronous Context-Free Grammars • Each production pairs an NL pattern with an MR template, e.g. QUERY → What is CITY / answer(CITY) • Referred to as transformation rules in Kate, Wong & Mooney (2005)

  22. Synchronous Context-Free Grammars • Rules: QUERY → What is CITY / answer(CITY); CITY → the capital CITY / capital(CITY); CITY → of STATE / loc_2(STATE); STATE → Ohio / stateid('ohio') • A single derivation simultaneously yields the NL sentence "What is the capital of Ohio" and the MR answer(capital(loc_2(stateid('ohio'))))
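The paired rewriting can be demonstrated in a few lines of Python; this is a toy sketch using only the four rules above, where the choices list selects a rule for each non-terminal in leftmost order, not WASP's actual parser.

```python
# Toy synchronous derivation: rewrite the NL pattern and MR template in
# lockstep, so one derivation yields a (sentence, MR) pair.
RULES = {
    "QUERY": [("What is CITY", "answer(CITY)")],
    "CITY":  [("the capital CITY", "capital(CITY)"),
              ("of STATE", "loc_2(STATE)")],
    "STATE": [("Ohio", "stateid('ohio')")],
}

def derive(symbol, choices):
    pattern, template = RULES[symbol][choices.pop(0)]
    nl, mr = pattern, template
    for tok in pattern.split():
        if tok in RULES:                 # non-terminal: expand recursively
            sub_nl, sub_mr = derive(tok, choices)
            nl = nl.replace(tok, sub_nl, 1)
            mr = mr.replace(tok, sub_mr, 1)
    return nl, mr

print(derive("QUERY", [0, 0, 1, 0]))
# ('What is the capital of Ohio', "answer(capital(loc_2(stateid('ohio'))))")
```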

  23. Parsing Model of WASP • N (non-terminals) = {QUERY, CITY, STATE, …} • S (start symbol) = QUERY • Tm (MRL terminals) = {answer, capital, loc_2, (, ), …} • Tn (NL words) = {What, is, the, capital, of, Ohio, …} • L (lexicon) = {QUERY → What is CITY / answer(CITY); CITY → the capital CITY / capital(CITY); CITY → of STATE / loc_2(STATE); STATE → Ohio / stateid('ohio')} • λ (parameters of probabilistic model) = ?

  24. CITY capital CITY / capital(CITY) CITY of STATE / loc_2(STATE) Probabilistic Parsing Model d1 CITY CITY capital capital ( CITY ) CITY of loc_2 ( STATE ) STATE Ohio stateid ( 'ohio' ) STATE Ohio / stateid('ohio')

  25. CITY capital CITY / capital(CITY) CITY of RIVER / loc_2(RIVER) Probabilistic Parsing Model d2 CITY CITY capital capital ( CITY ) CITY of loc_2 ( RIVER ) RIVER Ohio riverid ( 'ohio' ) RIVER Ohio / riverid('ohio')

  26. CITY capital CITY / capital(CITY) CITY capital CITY / capital(CITY) CITY of STATE / loc_2(STATE) CITY of RIVER / loc_2(RIVER) + + Probabilistic Parsing Model d1 d2 CITY CITY capital ( CITY ) capital ( CITY ) loc_2 ( STATE ) loc_2 ( RIVER ) stateid ( 'ohio' ) riverid ( 'ohio' ) 0.5 0.5 λ λ 0.3 0.05 0.5 0.5 STATE Ohio / stateid('ohio') RIVER Ohio / riverid('ohio') Pr(d1|capital of Ohio) =exp( ) / Z 1.3 Pr(d2|capital of Ohio) = exp( ) / Z 1.05 normalization constant

  27. Parsing Model of WASP • N (non-terminals) = {QUERY, CITY, STATE, …} • S (start symbol) = QUERY • Tm (MRL terminals) = {answer, capital, loc_2, (, ), …} • Tn (NL words) = {What, is, the, capital, of, Ohio, …} • L (lexicon) = {QUERY → What is CITY / answer(CITY); CITY → the capital CITY / capital(CITY); CITY → of STATE / loc_2(STATE); STATE → Ohio / stateid('ohio')} • λ (parameters of probabilistic model)

  28. Overview of WASP • Training: from an unambiguous CFG of the MRL and a training set {(e, f)}, lexical acquisition produces the lexicon L, and parameter estimation produces the parsing model parameterized by λ • Testing: semantic parsing maps an input sentence e' to an output MR f'

  29. Lexical Acquisition • Transformation rules are extracted from word alignments between an NL sentence, e, and its correct MR, f, for each training example (e, f)

  30. Word Alignments • A mapping from French words to their meanings expressed in English, e.g.: Le programme a été mis en application ↔ And the program has been implemented

  31. Lexical Acquisition • Train a statistical word alignment model (IBM Model 5) on the training set • Obtain the most probable n-to-1 word alignments for each training example • Extract transformation rules from these word alignments • The lexicon L consists of all extracted transformation rules

  32. Word Alignment for Semantic Parsing • How to introduce syntactic tokens such as parentheses? • Example: The goalie should always stay in our half ↔ ((true) (do our {1} (pos (half our))))

  33. Use of MRL Grammar • Instead of aligning words to the raw MR string, words are aligned (n-to-1) to the productions used in the top-down, left-most derivation of the unambiguous MRL CFG: RULE → (CONDITION DIRECTIVE); CONDITION → (true); DIRECTIVE → (do TEAM {UNUM} ACTION); TEAM → our; UNUM → 1; ACTION → (pos REGION); REGION → (half TEAM); TEAM → our • NL: The goalie should always stay in our half

  34. Extracting Transformation Rules • The word "our" is aligned to the production TEAM → our, so extract the rule TEAM → our / our

  35. REGION TEAMhalf / (half TEAM) Extracting Transformation Rules RULE (CONDITION DIRECTIVE) The CONDITION  (true) goalie should DIRECTIVE (do TEAM {UNUM} ACTION) always TEAM  our stay UNUM  1 in ACTION  (pos REGION) REGION TEAM REGION  (half TEAM) REGION  (half our) half TEAM  our

  36. ACTION stay in REGION/ (pos REGION) Extracting Transformation Rules RULE (CONDITION DIRECTIVE) The CONDITION  (true) goalie should DIRECTIVE (do TEAM {UNUM} ACTION) always TEAM  our ACTION stay UNUM  1 in ACTION  (pos REGION) ACTION  (pos (half our)) REGION REGION  (half our)

  37. Probabilistic Parsing Model • Based on a maximum-entropy model: Prλ(d | e) = exp(Σi λi fi(d)) / Zλ(e), where Zλ(e) is a normalization constant • The features fi(d) are the number of times each transformation rule is used in a derivation d • The output translation is the yield of the most probable derivation

  38. Parameter Estimation • Maximum conditional log-likelihood criterion (written out below) • Since correct derivations are not included in the training data, the parameters λ* are learned in an unsupervised manner • An EM algorithm combined with improved iterative scaling is used, where the hidden variables are the correct derivations (Riezler et al., 2000)
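Written out, the criterion takes the standard hidden-derivation form (a reconstruction consistent with the model defined on slide 37):

$$\lambda^* = \arg\max_{\lambda} \sum_{(e,f)} \log \Pr\nolimits_{\lambda}(f \mid e), \qquad \Pr\nolimits_{\lambda}(f \mid e) = \sum_{d \,:\, d \text{ yields } (e,f)} \Pr\nolimits_{\lambda}(d \mid e).$$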

  39. KRISP: Kernel-based Robust Interpretation by Semantic Parsing • Learns a semantic parser from NL sentences paired with their respective MRs, given the MRL grammar • Productions of the MRL are treated like semantic concepts • An SVM classifier is trained for each production with a string subsequence kernel • These classifiers are used to compositionally build the MRs of sentences

  40. Kernel Functions • A kernel K is a similarity function over a domain X which maps any two objects x, y in X to their similarity score K(x, y) • If, for any x1, x2, …, xn in X, the n-by-n matrix (K(xi, xj))ij is symmetric and positive semidefinite, then the kernel function computes the dot product of implicit feature vectors in some high-dimensional feature space • Machine learning algorithms that use the data only to compute similarities can be kernelized (e.g. support vector machines, nearest neighbor, etc.)

  41. String Subsequence Kernel • Define the kernel between two strings as the number of common subsequences between them [Lodhi et al., 2002] • All possible subsequences become the implicit feature vectors, and the kernel computes their dot products • s = "left side of our penalty area", t = "our left penalty area", K(s,t) = ?

  42. String Subsequence Kernel • u = left: K(s,t) = 1 + ?

  43. String Subsequence Kernel • u = our: K(s,t) = 2 + ?

  44. String Subsequence Kernel • u = penalty: K(s,t) = 3 + ?

  45. String Subsequence Kernel • u = area: K(s,t) = 4 + ?

  46. String Subsequence Kernel • u = left penalty: K(s,t) = 5 + ?

  47. String Subsequence Kernel • Counting the remaining common subsequences (left area, our penalty, our area, penalty area, left penalty area, our penalty area) gives K(s,t) = 11
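The count of 11 built up on the preceding slides can be verified with a short dynamic program; this sketch counts distinct common subsequences and assumes no word repeats within a string (true of the example), whereas the kernel of Lodhi et al. [2002] additionally applies gap-decay weighting via a related O(n|s||t|) recurrence.

```python
# Count distinct common (non-empty) subsequences of two word sequences.
def num_common_subsequences(s, t):
    m, n = len(s), len(t)
    # f[i][j] = number of common subsequences of s[:i] and t[:j], incl. empty.
    f = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:   # extend every shorter common subsequence
                f[i][j] = f[i - 1][j] + f[i][j - 1]
            else:                      # inclusion-exclusion avoids double counting
                f[i][j] = f[i - 1][j] + f[i][j - 1] - f[i - 1][j - 1]
    return f[m][n] - 1                 # drop the empty subsequence

s = "left side of our penalty area".split()
t = "our left penalty area".split()
print(num_common_subsequences(s, t))   # -> 11
```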

  48. Normalized String Subsequence Kernel • Normalize the kernel (to the range [0,1]) to remove any bias due to different string lengths • Lodhi et al. [2002] give an O(n|s||t|) algorithm for computing the string subsequence kernel • Used for text categorization [Lodhi et al., 2002] and information extraction [Bunescu & Mooney, 2005b]

  49. Support Vector Machines • SVMs are classifiers that learn linear separators that maximize the margin between the data and the classification boundary. • Kernels allow SVMs to learn non-linear separators by implicitly mapping the data to a higher-dimensional feature space.
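As a small end-to-end illustration of slides 48–49, the normalized subsequence kernel can be plugged into an off-the-shelf SVM through scikit-learn's precomputed-kernel interface; the phrases, labels, and task below are invented for illustration (KRISP actually trains one classifier per MRL production), and num_common_subsequences is reused from the kernel sketch above.

```python
import math
from sklearn.svm import SVC

def K(s, t):
    """Normalized subsequence kernel in [0, 1] (cf. slide 48)."""
    return num_common_subsequences(s, t) / math.sqrt(
        num_common_subsequences(s, s) * num_common_subsequences(t, t))

# Invented toy task: does the phrase describe a region of the field?
train = ["left side of our penalty area".split(), "our penalty area".split(),
         "our goalie".split(), "player 2 has the ball".split()]
labels = [1, 1, 0, 0]

gram = [[K(x, y) for y in train] for x in train]      # n_train x n_train
clf = SVC(kernel="precomputed").fit(gram, labels)

test = ["our left penalty area".split()]
print(clf.predict([[K(x, y) for y in train] for x in test]))  # e.g. [1]
```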

  50. Overview of KRISP • Training: NL sentences paired with their MRs, together with the MRL grammar, are used to collect positive and negative examples from the best (correct and incorrect) semantic derivations; these train the string-kernel-based SVM classifiers that constitute the semantic parser, and the process iterates • Testing: the semantic parser maps novel NL sentences to their best MRs
