

  1. Optimality Theory and Pragmatics. Lecture series in three parts, held at the V. Mathesius Center, Prague, March 2004. Part 3: Stochastic OT, Learning and Language Change. Manfred Krifka, Humboldt University Berlin / Center for General Linguistics (ZAS), Berlin. Copy of presentation at: http://amor.rz.hu-berlin.de/~h2816i3x

  2. Things to Do:
  • The Classic Theory of Implicatures: Grice
  • Optimality Theory
  • An OT Account of Scalar Implicatures
  • The Theory of I/R Implicatures: Horn, Levinson
  • Strong Bidirectional Optimality Theory: accounts of blocking, pronoun interpretation, freezing of word order
  • The Theory of M Implicatures: Levinson
  • Weak Bidirectional OT Account of Partial Blocking
  • Weak Bi-OT Account of Measure Term Interpretation
  • An OT account of Scale Alignment
  • Functional motivation for linguistic scales
  • Stochastic OT
  • Bidirectional OT account of Learning
  • Bidirectional OT account of Language Change: Differential Case Marking

  3. Differential Case Marking: Objects
  • In many languages, case marking of subject and object depends on a variety of factors.
  • Hebrew: Only definite object NPs are case marked.
    Ha-seret her’a ’et-ha-milxama. (‘the-movie showed ACC-the-war’)
    Ha-seret her’a (*’et-)milxama. (‘the-movie showed (*ACC) war’)
  • Spanish: Only animate object NPs are case marked.
    Busco a una señora. (‘I-look-for ACC a woman.’)
    Busco (*a) una casa. (‘I-look-for (*ACC) a house.’)
  • Bossong (1985): differential object marking in more than 300 languages.
  • Aissen (2002) identifies two scales that determine DOM:
    Animacy: Human > Animate > Inanimate
    Definiteness: Pers.Pronoun > Name > Def.NP > Indef.Spec.NP > Nonspec.NP
  • Generalization: Object marking is more likely at the high end of the scales.

  4. A closer look: DOM in medieval Spanish. From Judith Aissen, Differential Object Marking: Iconicity vs. Economy, draft, Stanford 2000.

  5. Differential Case Marking: Subjects
  • Differential subject marking (“Split Ergativity”). Example: Dyirbal, Australia.
  • 1st and 2nd person pronouns: no marking of the subject NP.
    ŋana banaga-nyu. (‘we returned.’)  ŋana ŋurra-na bura-n. (‘we you-ACC saw.’)
    ŋurra banaga-nyu. (‘you returned.’)  ŋurra ŋana-na bura-n. (‘you us-ACC saw.’)
  • Other pronouns and NPs: ergative marking of the subject of a transitive sentence.
    ŋuma banaga-nyu. (‘father returned.’)  ŋuma-ŋgu yabu bura-n. (‘father-ERG mother saw.’)
  • Hundreds of languages; the distribution of subject marking is governed by similar scales (Silverstein 1976):
    Animacy: Human > Animate > Inanimate
    Definiteness: Pers.Pronoun > Name > Def.NP > Indef.Spec.NP > Nonspec.NP
  • Generalization: Subject marking is more likely at the low end of the scales.

  6. Differential Case Marking: Scale Alignment
  • Aissen (2002): Case marking patterns result from the alignment of two scales, here illustrated with the definiteness scale.
    [Diagram: the aligned scales run from ‘case marking unlikely’ to ‘case marking likely’.]
  • Alignment of the two scales produces the following markedness scales:
    Subj/pronoun > Subj/name > Subj/def > Subj/spec > Subj/nonspec
    Obj/nonspec > Obj/spec > Obj/def > Obj/name > Obj/pronoun

  7. Scale Alignment and OT constraints
  • Prince & Smolensky (1993): a hierarchy A > B > C is translated into ranked constraints *C >> *B >> *A.
  • Markedness constraints: *Ø: avoid non-marking; *STRUC: avoid marking (structure).
  • Convention for constraint conjunction: C1 & C2 >> C1 >> C2, i.e., the conjunction of C1 and C2 outranks C1, which outranks C2.
  • Expression of marking tendencies, Hebrew:
    Relevant part of the hierarchy: Obj/indef > Obj/def
    Constraint ranking: *Obj/def >> *Obj/indef
    Marking constraint ranking: *Ø & *Obj/def >> *STRUC >> *Ø & *Obj/indef,
    abbreviated m(ark):Obj/def >> *STRUC >> m(ark):Obj/indef.
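
The translation from scales to ranked constraints is fully mechanical, so it can be written down in a few lines. A minimal sketch in Python (the function name and output format are mine, not from Aissen or Prince & Smolensky):

```python
def harmonic_alignment(relational, prominence):
    """Align a two-point relational scale (high > low) with a prominence
    scale (listed from most to least prominent). Returns, per relational
    element, the markedness constraints ranked from highest to lowest,
    following the reversal convention A > B > C  =>  *C >> *B >> *A."""
    high, low = relational
    # The high relational element is most harmonic with high prominence,
    # so its constraint hierarchy runs from low to high prominence:
    high_constraints = [f"*{high}/{p}" for p in reversed(prominence)]
    # The low relational element is the mirror image:
    low_constraints = [f"*{low}/{p}" for p in prominence]
    return high_constraints, low_constraints

definiteness = ["pron", "name", "def", "spec", "nonspec"]
subj, obj = harmonic_alignment(("Subj", "Obj"), definiteness)
print(" >> ".join(subj))  # *Subj/nonspec >> *Subj/spec >> ... >> *Subj/pron
print(" >> ".join(obj))   # *Obj/pron >> *Obj/name >> ... >> *Obj/nonspec
```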

  8. OT Constraints, Case Marking in a Dyirbal-like Language
  • Basic hierarchies, universal: S(ubj) > O(bj); 1(st) > 3(rd)
  • Aligned hierarchies: S/1 > S/3; O/3 > O/1
  • Generated constraint orders: *S/3 >> *S/1; *O/1 >> *O/3
  • Marking constraints: {m:S/3, m:O/1} >> *STRUC >> {m:S/1, m:O/3}

  9. Where do the hierarchies come from?
  • Aissen simply assumes hierarchies like S > O, 1 > 3, def > indef as given.
  • Bresnan, Dingare & Manning (2001), Zeevat & Jäger (2002): The hierarchies can be explained by typical patterns of language use.
  • Example: subjects and objects in 3151 simple transitive clauses of Swedish everyday conversation (SAMTAL corpus, Ö. Dahl).

  10. Biases in the SAMTAL Corpus
  • Probabilities that subjects and objects have certain properties: SAMTAL corpus of spoken Swedish (collected by Ö. Dahl, analyzed by Zeevat & Jäger).
  • Resulting statistical biases, expressed as conditional probabilities, e.g., p(Subj | +def): the probability that a +def NP is a subject: 63%.

                     +def   –def   +pron  –pron  +anim  –anim
      p(Subj | ...)   63%     4%    66%     9%    90%     7%
      p(Obj | ...)    37%    96%    33%    91%    10%    93%

  • This holds for a fairly large and representative corpus of spoken Swedish; the findings can be reproduced in their tendencies for other languages and communities, but collecting further data is absolutely necessary.
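
For concreteness, this is how such a conditional probability is obtained from raw co-occurrence counts. The counts below are invented so as to reproduce the 63%/37% figures; the slide reports only the resulting percentages:

```python
# Hypothetical co-occurrence counts for definite NPs in subject vs. object
# position; only the resulting percentages come from the SAMTAL corpus.
counts = {("Subj", "+def"): 1890, ("Obj", "+def"): 1110}

def p_function_given_feature(counts, feature):
    """p(Subj | feature) and p(Obj | feature) from co-occurrence counts."""
    subj = counts[("Subj", feature)]
    obj = counts[("Obj", feature)]
    total = subj + obj
    return subj / total, obj / total

p_subj, p_obj = p_function_given_feature(counts, "+def")
print(f"p(Subj | +def) = {p_subj:.0%}")  # 63%
print(f"p(Obj  | +def) = {p_obj:.0%}")   # 37%
```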

  11. Statistical Bias and Bidirectional OT
  • Zeevat & Jäger (2002), Jäger (2003)
  • Economical encoding:
    Case marking is disfavored for frequent combinations, e.g., indefinite objects: p(Obj | –def) = 96%;
    but case marking is favored for infrequent combinations, e.g., indefinite subjects: p(Subj | –def) = 4%; definite objects: p(Obj | +def) = 37%.
  • A case for weak bidirectional optimization?
    Preference for simple forms: –case >> +case
    Preference for meanings that correspond to the bias: Obj/–def >> Obj/+def
    Optimal pairs: ⟨–case, Obj/–def⟩ and ⟨+case, Obj/+def⟩, i.e., the case-marking pattern of Hebrew.
  • Problem: There is no choice to interpret a given NP as +def or –def; this is marked by the article!

  12. Statistical Bias and Bidirectional OT
  • Zeevat & Jäger assume the following constraints:
    *STRUC: Avoid structure, i.e., avoid overt marking.
    FAITH: Faithful interpretation of case morphemes, e.g., ACC: Obj, ERG: Subj.
    BIAS: An NP of a certain category is interpreted as having the grammatical function that is most probable for this category, e.g., an inanimate NP as Obj.
  • Ranking: FAITH >> BIAS >> *STRUC
  • Hearer optimality and speaker optimality (asymmetric Bi-OT):
    Hearer optimality: For a given form, choose the meaning that shows the least severe constraint violation! In the case at hand: interpret an NP according to its case marking; if there is no case marking, follow the statistical bias.
    If two competing forms are both hearer-optimal for a given meaning, the speaker can choose the preferred one (here: the one without case marking).
    Hearers have to be served first, as the speaker wants to be understood.
  • Definition:
    A pair ⟨F, M⟩ ∈ GEN is hearer-optimal iff there is no alternative ⟨F, M′⟩ ∈ GEN such that ⟨F, M′⟩ > ⟨F, M⟩.
    A pair ⟨F, M⟩ ∈ GEN is optimal iff it is hearer-optimal and there is no alternative form ⟨F′, M⟩ ∈ GEN such that ⟨F′, M⟩ is hearer-optimal and ⟨F′, M⟩ > ⟨F, M⟩.
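
The two definitions translate almost directly into code. A sketch, assuming violation profiles are tuples ordered by constraint ranking so that lexicographic tuple comparison plays the role of OT evaluation (all names are illustrative, not from Zeevat & Jäger's implementation):

```python
def better(cost, p, q):
    """p > q: p's violation profile is less severe than q's. Profiles are
    tuples of violation counts ordered by constraint ranking, so the
    built-in lexicographic comparison mirrors OT evaluation."""
    return cost(p) < cost(q)

def hearer_optimal(GEN, cost, pair):
    """<F, M> is hearer-optimal iff no <F, M'> in GEN is better."""
    f, m = pair
    return not any(better(cost, (f, m2), pair)
                   for (f2, m2) in GEN if f2 == f and m2 != m)

def optimal(GEN, cost, pair):
    """<F, M> is optimal iff it is hearer-optimal and no hearer-optimal
    <F', M> in GEN is better."""
    f, m = pair
    return (hearer_optimal(GEN, cost, pair) and
            not any(hearer_optimal(GEN, cost, (f2, m2)) and
                    better(cost, (f2, m2), pair)
                    for (f2, m2) in GEN if m2 == m and f2 != f))

# Toy check, animate NP, constraints ordered (FAITH, BIAS, *STRUC):
VIOL = {("anim-ACC", "Obj"):  (0, 1, 1),   # marked, faithful, against bias
        ("anim-ACC", "Subj"): (1, 0, 1),   # ACC read as subject: unfaithful
        ("anim-0",   "Obj"):  (0, 1, 0),   # unmarked, against bias
        ("anim-0",   "Subj"): (0, 0, 0)}   # unmarked, follows bias
GEN = list(VIOL)
print([p for p in GEN if optimal(GEN, VIOL.get, p)])
# -> [('anim-ACC', 'Obj'), ('anim-0', 'Subj')]: only animate objects marked
```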

  13. Example: Animacy in a language with ERG and ACC

  14. From Pragmatics to Grammar?
  • One caveat: The OT tableaus typically abstract away from important factors, e.g., word order, plausibility, selectional restrictions.
    ‘The lightning killed the man’: even though ‘the man’ is animate and in object position, it would not need case marking, as only animates can be killed.
  • A second caveat: Case marking is typically part of the core grammar, and not derived by pragmatic rules. But: pragmatic tendencies are one source of core grammar (the functionalist view of grammar).

  15. Motivation for Stochastic Optimality Theory
  • Judith Aissen (2000) and Joan Bresnan (2002): Differential case marking is not just a universal tendency in the core grammars of languages; the same approach can also describe optional case marking within a language.
  • Example: Case marking by postpositions in colloquial Japanese (data: Fry 2001, Ellipsis and wa-marking in Japanese):
    Subj/anim: 60%  Subj/inanim: 70%
    Obj/anim: 54%   Obj/inanim: 47%
  • Obligatory case marking patterns can be seen as extreme cases of statistical marking patterns, e.g., Spanish: Obj/anim: 100%, Obj/inanim: 0%.
  • Stochastic Optimality Theory (StOT), Boersma (1998), Functional Phonology, originally developed for phonological phenomena, can be used to model this intuition: core grammar phenomena are not essentially different from usage-based statistical tendencies in phenomena that core grammar leaves, to a certain degree, optional.

  16. Stochastic Optimality Theory (StOT)
  • Main differences between standard OT and Stochastic OT:
  • Constraint ranking on a continuous scale: every constraint is assigned a real number which determines the ranking of the constraints and is a measure of the distance between them.
  • Stochastic evaluation: for each evaluation, the placement of a constraint is modified by adding a noise value with normal distribution. The ordering of the constraints after adding this noise value determines the actual evaluation of the set of candidates.
    [Diagrams: if the distributions of C1 and C2 overlap, mostly C1 >> C2, but sometimes C2 >> C1; if they do not overlap, C1 >> C2 (almost) all the time.]
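
A minimal sketch of one stochastic evaluation (the constraint names and ranking values are invented; sigma = 2.0 follows Boersma's customary evaluation noise):

```python
import random

rankings = {"FAITH": 110.0, "BIAS": 96.0, "*STRUC": 92.0}  # invented values

def evaluation_order(rankings, sigma=2.0):
    """One stochastic evaluation: add normally distributed noise to each
    ranking value and order the constraints by the noisy selection points."""
    selection = {c: v + random.gauss(0.0, sigma) for c, v in rankings.items()}
    return sorted(selection, key=selection.get, reverse=True)

print(evaluation_order(rankings))
# Usually ['FAITH', 'BIAS', '*STRUC']; BIAS and *STRUC swap in roughly 8% of
# evaluations since their distributions overlap, while FAITH >> BIAS is
# near-categorical (see the ordering probabilities on the next slide).
```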

  17. Stochastic OT: Ordering Probabilities
  • Difference between mean values > 10: C1 dominates C2 categorically, p(C2 > C1) < 10^-10.
  • Difference between mean values ≈ 5: preference for C1 >> C2, but C2 >> C1 also leads to grammatical results, p(C2 > C1) ≈ 10%.
  • Difference between mean values = 0: no ranking preference, p(C2 > C1) = p(C1 > C2) = 50%; random outcomes.
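
These percentages follow from the noise model: the difference between two independent N(0, sigma) noise values is normal with standard deviation sigma·√2, so p(C2 > C1) = 0.5 · erfc(d / (2·sigma)) for a mean difference d. The slide's figures are rough order-of-magnitude guides; the exact values depend on the assumed sigma, here Boersma's 2.0:

```python
from math import erfc

def p_reversal(d, sigma=2.0):
    """Probability that C2's selection point exceeds C1's when C1's mean
    ranking value is d above C2's and both get N(0, sigma) noise."""
    return 0.5 * erfc(d / (2.0 * sigma))

for d in (0.0, 5.0, 10.0):
    print(f"difference {d:4.1f}: p(C2 > C1) = {p_reversal(d):.2e}")
# difference  0.0: 5.00e-01 (random outcomes)
# difference  5.0: 3.86e-02 (rare but grammatical reversals)
# difference 10.0: 2.03e-04 (practically categorical)
```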

  18. Stochastic OT and Gradual Learning
  • Boersma (1998), Boersma & Hayes (2001, Linguistic Inquiry): Gradual Learning Algorithm (GLA) for learning constraint rankings (not for learning the set of possible candidates, GEN).
  • In phonology, GEN contains pairs of phonological forms and phonetic interpretations: ⟨/·/, [·]⟩.
  • In semantics/pragmatics, GEN contains pairs of syntactic/morphological forms and semantic/pragmatic interpretations: ⟨F, M⟩.

  19. Boersma’s Gradual Learning Algorithm (GLA)
  • 0. Initial state: All constraint values are set to 0.
  • 1. Learning datum: an input-output pair ⟨i, o⟩.
  • 2. Generation:
    a. For each constraint, a normally distributed noise value is added to the current ranking: this is the selection point of the constraint.
    b. Constraints are ranked by the order of their selection points.
    c. The grammar generates an output o′ for the input i; alternative pair: ⟨i, o′⟩.
  • 3. Comparison: If o′ = o, nothing happens. Otherwise, the algorithm compares the constraint violations of the learning datum ⟨i, o⟩ with those of the generated datum ⟨i, o′⟩.
  • 4. Adjustment:
    a. All constraints that favor the learning datum ⟨i, o⟩ over the self-generated ⟨i, o′⟩ are increased by a small predefined numerical amount (the “plasticity”).
    b. All constraints that favor the self-generated ⟨i, o′⟩ over the learning datum ⟨i, o⟩ are decreased by the plasticity value.
  • Final state: Steps 1 to 4 are repeated until the constraint values stabilize.
  • Plasticity may change over the lifetime from high to low.
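
The algorithm fits in a few lines. A runnable toy sketch (the update logic follows the slide; candidates, violation profiles, and parameter values are invented), applied to the speaker-mode scenario of slide 26 below, where the learner only ever hears unmarked animate subjects:

```python
import random

def noisy_order(rankings, sigma=2.0):
    """Stochastic evaluation (slide 16): noisy selection points."""
    sel = {c: v + random.gauss(0.0, sigma) for c, v in rankings.items()}
    return sorted(sel, key=sel.get, reverse=True)

def evaluate(rankings, candidates, violations):
    """The candidate with the least severe violation profile under the
    noisy constraint order wins (lexicographic tuple comparison)."""
    order = noisy_order(rankings)
    return min(candidates, key=lambda x: tuple(violations[x][c] for c in order))

def gla_step(rankings, candidates, violations, datum, plasticity=0.1):
    """Steps 2-4 of the GLA: generate, compare, adjust."""
    produced = evaluate(rankings, candidates, violations)
    if produced == datum:
        return                                    # step 3: nothing happens
    for c in rankings:                            # step 4: adjustment
        if violations[datum][c] < violations[produced][c]:
            rankings[c] += plasticity             # constraint favors the datum
        elif violations[datum][c] > violations[produced][c]:
            rankings[c] -= plasticity             # favors self-generated output

# Toy run: the learner only hears unmarked animate subjects (cf. slide 26).
viol = {"Subj.anim-0":   {"m:S/+a": 1, "*STRUC": 0},
        "Subj.anim-ERG": {"m:S/+a": 0, "*STRUC": 1}}
ranks = {"m:S/+a": 0.0, "*STRUC": 0.0}            # step 0: all values at 0
for _ in range(2000):
    gla_step(ranks, list(viol), viol, "Subj.anim-0")
print(ranks)  # *STRUC ends up well above m:S/+a: animate subjects unmarked
```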

  20. Bidirectional Gradual Learning Algorithm (BiGLA)
  • Jäger (2003): ‘The bidirectional gradual learning algorithm’
  • Speaker-based learning: Input: meaning, Output: form; ⟨i, o⟩ = ⟨M, F⟩. The speaker compares different forms (observed form vs. hypothesized form).
  • Hearer-based learning: Input: form, Output: likely meaning; ⟨i, o⟩ = ⟨F, M⟩. The hearer compares different meanings (observed likely meaning vs. hypothesized meaning).
  • The hearer also uses speaker-based reasoning: On hearing F with likely meaning M, the hearer checks: Would I, as a speaker, have used a different F′ to express M? If yes: adjust the rankings to increase the likelihood of using F to express M.
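
Reusing gla_step from the sketch above, one observed form-meaning pair then triggers two updates: candidates are form-meaning pairs, with the meaning varied in hearer mode and the form varied in speaker mode. The violation profiles are again invented for illustration:

```python
FORMS = ["anim-0", "anim-ERG"]                    # unmarked vs. ERG-marked
MEANINGS = ["Subj", "Obj"]
viol = {("anim-0", "Subj"):   {"FAITH": 0, "m:S/+a": 1, "m:O/+a": 0, "*STRUC": 0},
        ("anim-0", "Obj"):    {"FAITH": 0, "m:S/+a": 0, "m:O/+a": 1, "*STRUC": 0},
        ("anim-ERG", "Subj"): {"FAITH": 0, "m:S/+a": 0, "m:O/+a": 0, "*STRUC": 1},
        ("anim-ERG", "Obj"):  {"FAITH": 1, "m:S/+a": 0, "m:O/+a": 0, "*STRUC": 1}}

def bigla_observe(ranks, form, meaning):
    """One BiGLA observation: a hearer-mode and a speaker-mode GLA update."""
    datum = (form, meaning)
    # Hearer mode: given the form, compare the candidate interpretations.
    gla_step(ranks, [(form, m) for m in MEANINGS], viol, datum)
    # Speaker mode: given the likely meaning, compare the candidate forms
    # ("would I have used a different form to express this meaning?").
    gla_step(ranks, [(f, meaning) for f in FORMS], viol, datum)

ranks = {"FAITH": 0.0, "m:S/+a": 0.0, "m:O/+a": 0.0, "*STRUC": 0.0}
for _ in range(2000):
    bigla_observe(ranks, "anim-0", "Subj")   # hearing unmarked animate subjects
print(ranks)  # m:S/+a is demoted, m:O/+a promoted (cf. slides 26-27)
```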

  21. Modelling Pragmatics
  • The Bidirectional Gradual Learning Algorithm (BiGLA) can be tested experimentally.
  • Implementation: evolOT, downloadable with accompanying files at: http://uni-potsdam.de/~jaeger/nasslli03
  • Example: Differential Object Marking triggered by definiteness (e.g., Hebrew).

  22. Development of Differential Object Marking
    [Plot: constraint ranking values (y-axis: ranking differences between constraints) across generations (x-axis), with curves for ‘mark definite objects!’, *STRUC, and ‘mark indefinite objects!’.]
  • Starting state: the constraints start out equally ranked.
  • After 1000 generations, the ranking of the constraints is firmly established, including the previously observed m:Obj/+def >> *STRUC >> m:Obj/–def.

  23. Development of Split Ergativity (Animacy)
    [Plot: constraint rankings across generations, with curves for ‘mark animate objects!’, ‘mark inanimate subjects!’, *STRUC, ‘don’t mark inanimate objects!’, and ‘don’t mark animate subjects!’.]
  • Start out with a high value of FAITH: every NP is case marked.
  • With a lower value of FAITH: fewer NPs are case marked.

  24. Development of Split Ergativity: Initial State doesn’t matter
    [Plot: starting from a different initial ranking, the same final pattern emerges, with the curves labelled ‘mark animate objects!’ and ‘mark inanimate subjects!’ above *STRUC, and ‘don’t mark inanimate objects!’ and ‘don’t mark animate subjects!’ below it.]

  25. Development of Split Ergativity: Initial State doesn’t matter
    [Plot: yet another initial ranking, converging on the same final pattern as on the previous slide.]

  26. Learning under the Microscope: Speaker Mode
  • Assume the current constraint ranking includes the relative ranking m:S/+a >> *STRUC, where m:S/+a: ‘mark animate subjects’ and *STRUC: ‘avoid marking’.
  • Incoming datum: Subj.anim-Ø (non-marked animate subject).
  • In speaker mode, the algorithm produces one of the forms:
    a. Subj.anim-Ø (= learning datum; nothing happens)
    b. Subj.anim-ERG (satisfying FAITH)
  • Comparison with the learning datum, case b: *STRUC favors the datum and is promoted; m:S/+a disfavors the datum and is demoted. Ultimately, *STRUC will rank higher than m:S/+a, suppressing the marking of animate subjects.
  • In general: If a form is produced that differs from the datum and is
    – a non-marked NP: promotion of *STRUC and/or demotion of a marking constraint (see example);
    – a case-marked NP: demotion of *STRUC, and promotion of FAITH if the case marking is different.

  27. Learning under the Microscope: Hearer Mode
  • Assume the current constraint ranking includes the relative ranking m:S/+a >> m:O/+a, where m:S/+a: ‘mark animate subjects’ and m:O/+a: ‘mark animate objects’.
  • Incoming datum: Subj.anim-Ø (non-marked animate NP interpreted as subject).
  • In hearer mode, the algorithm produces one of the interpretations (as subject or as object):
    a. Subj.anim-Ø (= learning datum; nothing happens)
    b. Obj.anim-Ø
  • Comparison with the learning datum, case b: m:O/+a favors the datum (the datum does not violate it) and is promoted; m:S/+a disfavors the datum and is demoted. Ultimately, m:O/+a will rank higher than m:S/+a, so that unmarked animate NPs are interpreted as subjects.
  • In general: If a meaning is produced that differs from the datum and the NP is
    – a case-marked NP: promotion of FAITH;
    – a non-marked NP: promotion and/or demotion of marking constraints (see example).

  28. Things Accomplished:
  • The Classic Theory of Implicatures: Grice
  • Optimality Theory
  • An OT Account of Scalar Implicatures
  • The Theory of I/R Implicatures: Horn, Levinson
  • Strong Bidirectional Optimality Theory: accounts of blocking, pronoun interpretation, freezing of word order
  • The Theory of M Implicatures: Levinson
  • Weak Bidirectional OT Account of Partial Blocking
  • Weak Bi-OT Account of Measure Term Interpretation
  • An OT account of Scale Alignment
  • Functional motivation for linguistic scales
  • Stochastic OT
  • Bidirectional OT account of Learning
  • Bidirectional OT account of Language Change: Differential Case Marking
