
LING / C SC 439/539 Statistical Natural Language Processing


  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 28 • 4/29/2013

  2. Recommended reading • Peter Grünwald. 1996. A minimum description length approach to grammar inference. In Symbolic, Connectionist and Statistical Approaches to Learning for Natural Language Processing (editors S. Wermter, E. Riloff, G. Scheler), pp. 203-216. Lecture Notes in Artificial Intelligence no. 1040. Springer Verlag, Berlin, Germany. • Colin Phillips. 2004. Three Benchmarks for Distributional Approaches to Natural Language Syntax. In R. Zanuttini, H. Campos, E. Herburger, & P. Portner (eds.), Negation, Tense, and Clausal Architecture: Cross-linguistic Investigations. Georgetown University Press. • Janet Dean Fodor. 1998. Unambiguous triggers. Linguistic Inquiry 29(1), 1-36. • Charles Yang. 2004. Universal Grammar, statistics or both? Trends in Cognitive Sciences, 451-456.

  3. Other recommended reading: books • Noam Chomsky. 1965. Aspects of the Theory of Syntax • William O’Grady. 1997. Syntactic Development • Mark Baker. 2001. The Atoms of Language • Charles Yang. 2002. Knowledge and Learning in Natural Language • Charles Yang. 2006. The Infinite Gift • Maria Teresa Guasti. 2002. Language Acquisition: The Growth of Grammar

  4. Credits • Some slides are borrowed or adapted from: • Paul Hagstrom http://www.bu.edu/linguistics/UG/hagstrom/ • Charles Yang http://www.ling.upenn.edu/~ycharles/

  5. Outline • Grammar induction • Grammar induction algorithms • Children’s acquisition of syntax, and poverty of the stimulus • Principles and parameters: principles • Principles and parameters: parameters • Grammatical development in children • Learning syntax through parameter setting

  6. Grammar induction • Given a finite sample of strings from some language L, what is the grammar G that most likely produced that sample? • Difficulties: • Generalize from a sample of strings to the underlying language • Sparse data • Poverty of the stimulus (later section)

  7. CFG induction example • Sentence 1: The boy meets the girl • Sentence 2: The girl meets the boy • Underlying CFG:
  S → NP VP
  NP → DT N
  DT → the
  N → boy | girl
  VP → V NP
  V → meets

  8. A few possibilities for an induced CFG
  • G1:
  S → NP VP
  NP → DT N
  DT → the
  N → boy | girl
  VP → V NP
  V → meets
  • G2:
  S → C1 C2
  C1 → C3 C4
  C3 → the
  C4 → boy | girl
  C2 → C5 C1
  C5 → meets
  • G3:
  S → C1 C2 C3 C4 C5
  C1 → the
  C2 → boy | girl
  C3 → meets
  C4 → the
  C5 → boy | girl
  • G4:
  S → C1 C2
  C1 → the
  C2 → boy C3 | girl C3
  C3 → meets C4
  C4 → the C5
  C5 → boy | girl
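As a quick sanity check on G1 (and, with relabeling, G2), here is a minimal sketch that builds the grammar and parses both example sentences; it assumes the NLTK library is available and is not part of the original slides.

```python
# Minimal sketch: verify that G1 generates both example sentences (assumes NLTK).
import nltk

g1 = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT N
DT -> 'the'
N -> 'boy' | 'girl'
VP -> V NP
V -> 'meets'
""")

parser = nltk.ChartParser(g1)
for sent in (["the", "boy", "meets", "the", "girl"],
             ["the", "girl", "meets", "the", "boy"]):
    n_parses = len(list(parser.parse(sent)))
    print(" ".join(sent), "->", n_parses, "parse(s)")
```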

  9. Issues in grammar induction • Format of possible grammars: • Type of grammar: regular, context-free, dependency, etc. • Probabilistic / non-probabilistic • Nativism vs. emergence: • Are the labels that linguists use the result of mapping induced structures onto internal representations, or are they emergent properties of the grammar? • Example of the difference between these perspectives: • the algorithm specifically looks for an NP in the input, or • the algorithm discovers C35, which has the properties of what linguists call an NP

  10. Issues in grammar induction • Treat as an A.I. search problem 1. Hypothesis space of possible grammars 2. Algorithm to search space of grammars 3. Evaluation criteria to decide between grammars • Input: positive evidence only, or also include negative evidence? • Positive: strings in the language • Negative: strings not in the language

  11. Outline • Grammar induction • Grammar induction algorithms • Children’s acquisition of syntax, and poverty of the stimulus • Principles and parameters: principles • Principles and parameters: parameters • Grammatical development in children • Learning syntax through parameter setting

  12. Example paper: Grünwald 1996 • Peter Grünwald. A minimum description length approach to grammar inference. In Symbolic, Connectionist and Statistical Approaches to Learning for Natural Language Processing (editors S. Wermter, E. Riloff, G. Scheler), pages 203-216; Lecture Notes in Artificial Intelligence no. 1040. Springer Verlag, Berlin, Germany, 1996. • http://homepages.cwi.nl/~pdg/ • Algorithm to induce context-free grammars from text • Doesn't work so well… • There are other grammar induction papers, but I have never read one that I have found to be satisfactory.

  13. Input and output • Input: text corpus • Output: “Context-free grammar” • (It’s actually a regular grammar; no center-recursion) • Has multiple “start symbol” nonterminals • Words in a sentence may be generated by multiple “start symbols”, because the induced grammar might not generate the entire sentence • Example: [The quick] [brown fox] [jumped over the lazy dog]

  14. Description of algorithm

  15. Space of possible grammars • Initial grammar • For every word wi, create a rule Ci → wi • Bottom-up merging process • For classes ci and cj in the grammar: • Union them into a new class ck → ci | cj, or • Concatenate them into a new class ck → ci cj • The space of possible unions / concatenations describes the range of possible grammars
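As a rough sketch (not Grünwald's actual code), the two merge operations can be written down directly, with a grammar represented as a dictionary from class names to lists of right-hand sides; the representation is an illustrative choice.

```python
# Sketch of the initial grammar and the two bottom-up merge operations.
# The dict-of-right-hand-sides representation is an illustrative choice.

def initial_grammar(corpus_words):
    """One rule Ci -> wi per distinct word, in order of first occurrence."""
    return {f"C{i+1}": [(w,)] for i, w in enumerate(dict.fromkeys(corpus_words))}

def union(grammar, ci, cj, ck):
    """Add a new class ck -> ci | cj."""
    new = dict(grammar)
    new[ck] = [(ci,), (cj,)]
    return new

def concatenate(grammar, ci, cj, ck):
    """Add a new class ck -> ci cj."""
    new = dict(grammar)
    new[ck] = [(ci, cj)]
    return new
```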

  16. Example • Training corpus: • The dog is big • The dog is fat • Initial grammar:
  C1 → the
  C2 → dog
  C3 → is
  C4 → big
  C5 → fat
  • First concatenate C1 and C2:
  C1 → the
  C2 → dog
  C6 → C1 C2
  C3 → is
  C4 → big
  C5 → fat
  • Then union C4 and C5:
  C1 → the
  C2 → dog
  C6 → C1 C2
  C3 → is
  C4 → big
  C5 → fat
  C7 → C4 | C5
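Running the sketch above on this toy corpus reproduces the same sequence of grammars (again, purely illustrative):

```python
# Toy corpus from the slide: "the dog is big", "the dog is fat"
words = "the dog is big the dog is fat".split()
g = initial_grammar(words)            # C1->the, C2->dog, C3->is, C4->big, C5->fat
g = concatenate(g, "C1", "C2", "C6")  # C6 -> C1 C2   ("the dog")
g = union(g, "C4", "C5", "C7")        # C7 -> C4 | C5 ("big" or "fat")
```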

  17. Compare grammars through MDL • [Diagram: starting from the initial grammar, each union or concatenation step produces a set of alternative grammars; the alternative with the minimal description length is kept as the current best grammar.]
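In code, that comparison can be sketched as a greedy search loop (my own framing; candidate_merges and description_length are assumed helpers, with description_length sketched on the next slides):

```python
# Greedy MDL search over grammars (illustrative sketch, not the paper's code).

def induce(corpus, grammar, candidate_merges, description_length):
    best, best_dl = grammar, description_length(grammar, corpus)
    improved = True
    while improved:
        improved = False
        for candidate in candidate_merges(best):      # all unions/concatenations
            dl = description_length(candidate, corpus)
            if dl < best_dl:                          # keep only merges that
                best, best_dl = candidate, dl         # shorten the description
                improved = True
    return best
```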

  18. Need to figure out how to encode a grammar and a corpus • MDL applied to grammar induction: Minimize: length of the description of the grammar + length of the description of the corpus according to that grammar • The following slides contain one possibility • (the MDL formulas are ad hoc) • Grünwald himself does something more complicated • Grünwald's master's thesis does something else again

  19. # of bits to encode a grammar • Suppose this is your grammar:
  C1 → C2 C3
  C2 → w1
  C2 → w5
  C3 → C4
  C3 → w4 w6
  C4 → w3
  • Choosing a particular rule involves: • Choose the LHS nonterminal • Choose the RHS given the LHS • # of bits to encode a rule = −log2 p(LHS, RHS) = −log2 p(LHS) − log2 p(RHS | LHS) • # of bits to encode the grammar = Σrules (# of bits for each rule)
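To make this concrete, here is a small sketch that computes the grammar's description length under a deliberately simple probability model (uniform over nonterminals, and uniform over each nonterminal's alternative right-hand sides); the probability model is my assumption, not the paper's.

```python
import math

def grammar_bits(grammar):
    """Bits to encode a grammar stored as {lhs: [rhs, ...]}, assuming
    p(LHS) = 1/#nonterminals and p(RHS|LHS) = 1/#alternatives for that LHS.
    The uniform probability model is an illustrative assumption."""
    n_lhs = len(grammar)
    bits = 0.0
    for lhs, alternatives in grammar.items():
        for _rhs in alternatives:
            bits += -math.log2(1.0 / n_lhs)              # choose the LHS
            bits += -math.log2(1.0 / len(alternatives))  # choose the RHS given LHS
    return bits
```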

  20. # of bits to encode a corpus according to a grammar • Corpus = w1 w3 • Encode using this grammar:
  C1 → C2 C3
  C2 → w1
  C2 → w5
  C3 → C4
  C3 → w4 w6
  C4 → w3
  • w1 w3 is generated through the rules C1 → C2 C3, C2 → w1, C3 → C4, C4 → w3. p(w1 w3) = p(C1) · p(C2 C3 | C1) · p(w1 | C2) · p(C4 | C3) · p(w3 | C4) • # of bits to encode w1 w3 = −log2 p(C1) − log2 p(C2 C3 | C1) − log2 p(w1 | C2) − log2 p(C4 | C3) − log2 p(w3 | C4)
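A worked version of that computation, using the same uniform probability assumptions as the sketch above (the resulting number is illustrative only):

```python
import math

# Grammar from the slide, as {lhs: [alternative right-hand sides]}.
grammar = {
    "C1": [("C2", "C3")],
    "C2": [("w1",), ("w5",)],
    "C3": [("C4",), ("w4", "w6")],
    "C4": [("w3",)],
}

# Derivation of "w1 w3": C1 => C2 C3 => w1 C3 => w1 C4 => w1 w3
derivation = [("C1", ("C2", "C3")), ("C2", ("w1",)),
              ("C3", ("C4",)), ("C4", ("w3",))]

bits = -math.log2(1.0 / len(grammar))            # p(C1): uniform choice of start class
for lhs, rhs in derivation:
    bits += -math.log2(1.0 / len(grammar[lhs]))  # p(RHS | LHS): uniform choice
print(bits)  # 2 + 0 + 1 + 1 + 0 = 4.0 bits under these assumptions
```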

  21. Grünwald's experiments • Brown corpus • Keep only sentences whose words are all among the 10,000 most frequent words in the corpus

  22. Experiment 1: union only (no concatenation)

  23. Description Length over merging iterations

  24. Experiment 2: concatenation • Doesn’t work • Takes too long • No results reported • “We do not arrive at very good rules” • “it should be noted here that in experiments with toy grammars much better grammar rules were formed.”

  25. Outline • Grammar induction • Grammar induction algorithms • Children’s acquisition of syntax, and poverty of the stimulus • Principles and parameters: principles • Principles and parameters: parameters • Grammatical development in children • Learning syntax through parameter setting

  26. Children as grammar inducers • Let’s now consider grammar induction as a cognitive science problem. • Look at what kids say. • Look at whether the input is sufficient to learn an adult grammar for a language.

  27. 1. Kids say the darndest things • If children were acquiring grammars by string pattern generalization, you would not expect them to speak (generate) sentences not in the language. • Whether through Grünwald's procedure or some other one; the details do not matter • But children say things that are not in the adult grammar. • At the level of individual words • At the level of syntactic constructions

  28. Limited influence of parental feedback • Parents often correct what their children say… but it doesn't work (see the dialogue on the next slide)

  29. From Braine (1971) • Want other one spoon, daddy. • You mean, you want the other spoon. • Yes, I want other one spoon, please Daddy. • Can you say 'the other spoon'? • Other…one…spoon • Say 'other' • Other • 'Spoon' • Spoon • 'Other spoon' • Other…spoon. Now give me other one spoon?

  30. Children also do not receive negative evidence for general grammatical principles • Negative evidence (from parents) doesn't concern core grammatical principles such as phrase structure, headedness, movement • For example, no parent says: • "You can't question a subject in a complement clause embedded with that" • "You can't use a proper name if it's c-commanded by something coindexed with it."

  31. 2. Grammatical knowledge and Poverty of the Stimulus • Adults have intuitions about the grammar of their language. • Example • John ate peas and carrots. • What did John eat ___ ? • Now suppose the speaker knows that John ate peas, and asks what else John ate with the peas: *What did John eat peas and ___ ? • How do we know that the last sentence is ungrammatical? We never see examples like this in the input. • Another pair of examples (the second is ungrammatical): • Which book did she review __ without reading __ ? • *She reviewed the book without reading __

  32. Poverty of the Stimulus • The argument from the poverty of the stimulus: • Adults' knowledge of the structure of their language cannot be accounted for through simple learning mechanisms like string pattern generalization • Therefore humans must be born with some knowledge of grammar (nativist, rationalist) • Language acquisition involves both external data (the empirical aspect) and innate knowledge • Innate knowledge explains the fast rate of language acquisition, especially given the limited quantity of observed data • Phillips (2004): grammar induction algorithms should be judged according to whether they model human intuitions • Very high standards! • But otherwise they will not be able to convince linguists

  33. Chomsky’s degrees of adequacy, in accounting for a language • Observational adequacy • Theory accounts for the observed forms of a language • Not interesting: could be a list of sentences • Descriptive adequacy • Theory accounts for the observed forms of a language • Theory explains intuitions of native speakers about their language • Utilizes abstract grammatical structures • Distinguishes possible from impossible structures • Explanatory adequacy (highest goal) • Theory accounts for the observed forms of a language • Theory explains intuitions of native speakers about their language • Explains how that knowledge of language can be acquired by a learner

  34. Outline • Grammar induction • Grammar induction algorithms • Children’s acquisition of syntax, and poverty of the stimulus • Principles and parameters: principles • Principles and parameters: parameters • Grammatical development in children • Learning syntax through parameter setting

  35. Universal Grammar (UG) • The set of principles / parameters / rules / constraints governing language structure • Common to all human languages • Determines the set of possible human languages • Explains linguistic variation • Innate, unconscious knowledge • Modeled by a linguistic theory such as Principles & Parameters or Minimalism

  36. UG and language acquisition • With UG, knowledge of a language = a specific setting of UG parameters + a lexicon: phonemes, morphemes, words, and their argument structure, semantics, etc. • During the critical period of language acquisition, the data encountered (the primary linguistic data) is used to: • Set the parameters of UG • Acquire the lexicon

  37. Principles and Parameters • A specific theory of UG • (same as Government and Binding Theory) • Principles: aspects of linguistic structure that are invariant across languages • Parameters: aspects of linguistic structure that differ across languages • All languages share the same principles, but may differ in their parameter settings (and their vocabulary)

  38. Principles for phrase structure • Motivation: there are many redundant phrase structure rules: VP → V NP, PP → P NP, AP → A NP, etc. • X-bar theory and the principle of Endocentricity • Every phrase has a head • XP is the maximal projection of the head X • Rules out structures such as: • NP → ADJ P • PP → V NP • Benefit: • X-bar theory captures commonalities between rules • There is no explicit CFG in UG

  39. Principles for movement • Explain relationship between a declarative sentence and its question variant: • The student was sleeping • Was the student __ sleeping? • John can solve this problem • Which problem can John solve __? • Theory of Movement: • Questioned constituent is displaced • Other structures in the sentence may be modified also • Constraints on movement

  40. Universal constraints on movement • Coordinate structure constraint: • John ate [bagels and what]NP for lunch? • *What did John eat bagels and ___ for lunch? • Conjoined NP forms an “island” that cannot be extracted from • Relative clause island • John saw the dog that ate pizza. • John saw [the dog that ate what]NP • *What did John see the dog that ate ___ ? • A relative clause also forms an “island” for movement

  41. Outline • Grammar induction • Grammar induction algorithms • Children’s acquisition of syntax, and poverty of the stimulus • Principles and parameters: principles • Principles and parameters: parameters • Grammatical development in children • Learning syntax through parameter setting

  42. Parameters and linguistic variation • Languages are superficially different • All languages share a core grammatical structure, as determined by the principles / rules / constraints of UG • Primary differences between languages are in parameter settings • (Each language also has its own vocabulary, but this is a superficial difference)

  43. Every combination of parameter settings determines a unique language type
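To see the combinatorics this implies: with n independent binary parameters there are 2^n possible settings, so, for example, 30 binary parameters would already allow 2^30 ≈ 1.07 billion distinct language types. (The figure of 30 is purely illustrative; how many parameters there actually are is an empirical question.)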

  44. Japanese vs. English • Head-first: English-type language: Kazu ate sushi to Tokyo • Head-last: Japanese-type language: Kazu sushi ate Tokyo to • In terms of phrase structure rules: • English: VP → V NP, PP → P NP • Japanese: VP → NP V, PP → NP P

  45. Head direction parameter • A head-first language applies the head-first rule to all of its phrases: NPs, VPs, etc. • A head-last language applies the head-last rule to all of its phrases: NPs, VPs, etc. • [Tree diagrams: English (head-first) vs. Japanese (head-last)]
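A toy sketch of the idea that one binary setting determines head–complement order in every phrase (my own illustration, reusing the slide's Kazu/sushi example; it is not drawn from the readings):

```python
# Toy illustration of the head direction parameter: a single binary setting
# decides whether heads precede or follow their complements in every phrase.

def linearize(head, complement, head_first):
    """Order a head and its complement according to the parameter."""
    return [head, complement] if head_first else [complement, head]

def sentence(head_first):
    vp = linearize("ate", "sushi", head_first)   # VP = V + object NP
    pp = linearize("to", "Tokyo", head_first)    # PP = P + NP
    return " ".join(["Kazu"] + vp + pp)

print(sentence(head_first=True))   # Kazu ate sushi to Tokyo   (English-type)
print(sentence(head_first=False))  # Kazu sushi ate Tokyo to   (Japanese-type)
```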

  46. Wh- movement parameter • Parameter for presence/absence of wh- movement in a language • Wh- movement occurs in English • Wh- movement does not occur in Korean • Korean is wh- in situ:
  Ne-nun [Mary-ka enu tayhak-ey kat-tako] sayngkakha-ni
  you Mary which college went that think
  "Which college do you think that Mary went to?"

  47. Verb movement parameter • French: V raises to aux • English: aux lowers to V

  48. Null Subject Parameter • Italian allows null subjects but English doesn't: • I ate shepherd's pie. • Ø Ho mangiato il risotto alla milanese. ('(I) ate the risotto alla milanese.') • Italian allows pro-drop (omitting the pronoun): • Mary speaks English very well because she was born in the US. • Vito parla l'italiano molto bene ma Ø è nato negli stati uniti. ('Vito speaks Italian very well but (he) was born in the United States.') • Italian speakers can figure out who the subject is because of the inflection on the verb.
