STRUCTURE AND FREQUENCY OF LEXICAL SEMANTIC CLASSES

Paola Merlo (University of Geneva) and Suzanne Stevenson (University of Toronto)

What is the role of quantitative approaches? • Can quantitative investigations be the subject matter of linguistic research, or are they only methodological tools? • Investigating the relationship between richly structured representations and distributional properties of language • - provides richer data • - supports falsifiable and predictive reasoning within weaker theories

Case Study: Verb Classes (the number after each use is its relative frequency)
• Manner of Motion
- TRANS (.23): The rider raced the horse past the barn
- INTR (.77): The horse raced past the barn
• Change of State
- TRANS (.40): The cook melted the butter
- INTR (.60): The butter melted
• Creation/Transformation
- TRANS (.62): The contractors built the house
- INTR (.38): The contractors built all summer

Quantitative Investigations • Observation: Different lexical semantic classes have different patterns of frequency distributions • Q1: Are these distributional properties related to other underlying properties? • Q2: Are the differences in distribution strong enough to support generalisation to new verbs and other verb classes? • Q3: Does the relation between underlying properties and frequency hold typologically?

Frequency and thematic roles • Different lexical semantic classes show different frequency distributions in the use of the transitive construction • The difference in the frequency of the transitive use is related to different thematic assignments

English Verb Classes
• Manner of Motion
- The rider raced the horse past the barn: subject = (Causal) Agent, object = Agent
- The horse raced past the barn: subject = Agent
• Change of State
- The cook melted the butter: subject = (Causal) Agent, object = Theme
- The butter melted: subject = Theme
• Creation/Transformation
- The contractors built the house: subject = Agent, object = Theme
- The contractors built all summer: subject = Agent

Transitive Use • Transitivity by causation: MoM, CoS • Agentive object : MoM

Relationship between frequency and transitivity • Transitivity by causation: MoM, CoS • - Greater complexity, two events • Agentive object: MoM (transitive unergative) • - Infrequent in English: only MoM and SE • - Infrequent typologically (* Italian, French, German, Portuguese, and Czech; Vietnamese only comitative) • - Difficult to process (Stevenson and Merlo 97, Filip et al. CUNY 98) • Explains frequency of transitive use: MoM < CoS < C/T

Other frequency facts • Are there other properties specific to verb classes that we can expect to surface as statistical differences?

Animacy • Themes are more likely to be inanimate

Animacy and thematic hierarchies • Thematic hierarchy: AGENT > THEME • Animacy hierarchy: 1, 2 > 3, Proper > Human > Animate > Inanimate • Harmonic Alignment • - 1,2/AG > 3,Proper/AG > Human/AG > Animate/AG > Inanimate/AG • - 1,2/TH < 3,Proper/TH < Human/TH < Animate/TH < Inanimate/TH • Expected frequency of animacy: CoS < MoM, C/T
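The two aligned scales on this slide follow mechanically from the two hierarchies. The sketch below is a minimal illustration of that construction; the function name `harmonic_alignment` is hypothetical, and the output writes both scales with ">" (so the Theme scale appears in reversed order relative to the slide's "<" notation).

```python
def harmonic_alignment(binary_scale, scale):
    """Align a two-point scale (high, low) with a longer scale.

    The high element combines with the longer scale in its given order;
    the low element combines with it in reverse order.
    """
    high, low = binary_scale
    aligned_high = " > ".join(f"{x}/{high}" for x in scale)
    aligned_low = " > ".join(f"{x}/{low}" for x in reversed(scale))
    return aligned_high, aligned_low


# Hierarchies as given on the slide.
ANIMACY = ["1,2", "3,Proper", "Human", "Animate", "Inanimate"]
agent_scale, theme_scale = harmonic_alignment(("AG", "TH"), ANIMACY)
print(agent_scale)
print(theme_scale)
```

Read this way, inanimate Themes are the most harmonic Theme configuration, which is the basis for expecting less animacy with CoS subjects.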

Causative use • Transitivity by causation: MoM, CoS • Causer subject, same thematic role between subj intr and obj trans • Expected frequency of overlap: MoM, C/T < CoS

Empirical Validation • How do we verify empirically that the distributional properties are as predicted based on the verb class representation? • The properties we have hypothesized are abstract, how do we count them in a sufficiently large corpus? • - by hand, sampling • - automatically by approximation with indicators

Data Collection – Materials • Verbs • Manner of motion (20) -- jump, march • Change of state (19) -- open, explode • Creation/Transformation/Performance (C/T) (20) -- played, painted • Verb form: the "-ed" form is assumed to be representative • Corpora • 65 million words: tagged Brown + tagged WSJ corpus (LDC) • 29 million words: parsed WSJ (LDC corpus, Collins 97 parser)

Data Collection -- Method • TRANS: A verb token immediately followed by a potential object is counted as transitive, else intransitive. • - Potential object = closest nominal group after the verb token. • - (or also count passive or past participle frequency) • CAUS: Calculate the overlap of the multiset of subjects and the multiset of objects. • - Take the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets. • ANIM: Ratio of occurrences of pronoun subjects to all subjects
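The three indicators can be sketched as simple ratios. This is an illustration under stated assumptions, not the authors' extraction code: the function names are hypothetical, the pronoun list is assumed (the slides give none), and the overlap is taken to be plain multiset intersection (minimum count per item), which the slide does not spell out.

```python
from collections import Counter

# Assumed pronoun list for the ANIM indicator; not given in the slides.
PRONOUNS = {"i", "we", "you", "he", "she", "they"}


def trans_ratio(followed_by_nominal):
    """TRANS: share of verb tokens immediately followed by a potential object.

    `followed_by_nominal` holds one boolean per verb token.
    """
    flags = list(followed_by_nominal)
    return sum(flags) / len(flags) if flags else 0.0


def caus_ratio(subjects, objects):
    """CAUS: |overlap| / (|subjects| + |objects|) over multisets of head nouns.

    Assumption: overlap = multiset intersection (minimum count per item).
    """
    subj, obj = Counter(subjects), Counter(objects)
    overlap = subj & obj
    total = sum(subj.values()) + sum(obj.values())
    return sum(overlap.values()) / total if total else 0.0


def anim_ratio(subjects):
    """ANIM: share of subjects that are pronouns, used as a proxy for animacy."""
    subjects = [s.lower() for s in subjects]
    return sum(s in PRONOUNS for s in subjects) / len(subjects) if subjects else 0.0


# Toy inputs, invented for illustration only.
print(trans_ratio([True, False, False, True]))
print(caus_ratio(["butter", "ice", "butter"], ["butter", "door"]))
print(anim_ratio(["She", "cook", "they", "rider"]))
```

A different definition of the overlap multiset would change only the numerator of `caus_ratio`.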

Statistical Analysis of the Data • Mean relative frequencies • All statistically significant at p < .01

Conclusions • Answer to Q1 • - Different lexical semantic classes have different frequency distributions of properties systematically related to the verb's thematic assignments

Generalising • How well do these distributional properties generalise • - across verbs • - across classes • - across languages?

The Classification Problem • Given: statistics reflecting thematic information about a given set of verb classes • Goal: automatically classify unseen verbs • Experimental Setup • Materials • - Vector template: [verb, TRANS, …, CAUS, ANIM, class] • - Example: [open, .69, …, .16, .36, CoS] • Method • - Learner: C5.0 (decision tree induction algorithm) • - Training/Testing: 10-fold cross-validation repeated 50 times
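The evaluation protocol (k-fold cross-validation, repeated with fresh shuffles) can be sketched with the standard library alone. Two caveats: the learner below is a nearest-centroid stand-in, not C5.0, and the feature vectors are invented toy values in the order (TRANS, CAUS, ANIM), not the study's data. All function names are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(rows):
    """Stand-in learner (NOT C5.0): one mean feature vector per class."""
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append(features)
    return {lab: [mean(col) for col in zip(*vecs)] for lab, vecs in by_class.items()}


def predict(model, features):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(features, model[lab])))


def repeated_cv_accuracy(rows, k=10, repeats=50, seed=0):
    """Mean accuracy over `repeats` runs of k-fold cross-validation."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        correct = 0
        for held_out in k_fold_indices(len(rows), k, rng):
            held = set(held_out)
            model = fit_centroids([r for i, r in enumerate(rows) if i not in held])
            correct += sum(predict(model, rows[i][0]) == rows[i][1] for i in held_out)
        scores.append(correct / len(rows))
    return mean(scores)


# Invented toy vectors: ((TRANS, CAUS, ANIM), class).
TOY = [
    ((0.20, 0.05, 0.25), "MoM"), ((0.25, 0.08, 0.30), "MoM"),
    ((0.18, 0.04, 0.22), "MoM"), ((0.22, 0.06, 0.28), "MoM"),
    ((0.45, 0.30, 0.10), "CoS"), ((0.40, 0.35, 0.08), "CoS"),
    ((0.50, 0.28, 0.12), "CoS"), ((0.42, 0.32, 0.09), "CoS"),
    ((0.65, 0.05, 0.30), "C/T"), ((0.70, 0.06, 0.33), "C/T"),
    ((0.62, 0.04, 0.28), "C/T"), ((0.68, 0.07, 0.31), "C/T"),
]
print(repeated_cv_accuracy(TOY, k=4, repeats=5))
```

Every verb is held out exactly once per repetition, so each prediction is made on a verb the model has not seen in training.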

Results • Overall results: accuracy 69.8% (baseline 33.9%, expert upper bound 86.5%) • - 54% reduction in error rate on previously unseen verbs • - (recent extensions range from 62% to 82% accuracy) • Effectiveness of frequency distributions • - All distributions are useful in classification • Class-by-class accuracy • - MoM verbs are most accurately classified • Analysis of errors • - TRANS sharpens the 3-way distinction • - ANIM particularly helpful in discriminating CoS • Relation between frequencies and thematic assignments is confirmed
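The 54% figure follows from the accuracy and baseline reported on this slide, taking error rate as 100% minus accuracy. A minimal check (the function name `error_reduction` is hypothetical):

```python
def error_reduction(accuracy, baseline):
    """Relative reduction in error rate; both arguments are percentages."""
    baseline_error = 100.0 - baseline
    model_error = 100.0 - accuracy
    return (baseline_error - model_error) / baseline_error


# Values from the slide: accuracy 69.8%, baseline 33.9%.
print(round(100 * error_reduction(69.8, 33.9), 1))  # 54.3
```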

Generalising to a new class
• New class: Psychological State Verbs
• New thematic roles: Experiencer, Stimulus
• Examples
- The rich love money: subject = Experiencer, object = Stimulus
- The rich love too: subject = Experiencer
• Properties: TRANS, CAUS, ANIM, and PROG (use of the progressive, stative/non-stative)
• Results: 74.6% accuracy (baseline 57%)
• TRANS, CAUS, ANIM best features

Discussion • Relationship between frequencies and thematic properties holds across classes • Some specific frequency distributions carry across thematic roles • Discovery: We do not need to investigate new frequency distributions for each new class • Conjecture: Thematic roles are decomposed into more primitive features

Multi-lingual Generalisations • Accurate investigation of the relation between grammar and frequency requires • - a well-founded theory of lexical representation • - a distributional analysis of language • Multi-linguality provides • - abstract, general level of linguistic description • - more data • Greater coverage and accuracy are possible by looking at several languages

Multi-lingual Generalisations • Extension of the mono-lingual method to a new language (Italian) • - Shows similarities in the relations between frequency distributions and thematic relations across languages • - Extends coverage to new languages • Extension to the use of multi-lingual data to classify verbs in a given language (Chinese and English data to classify English verbs) • - Shows that surface differences across languages are related to a similar underlying representation • - Improves accuracy in the classification of a given language

Exploiting similarities: Extension to Italian (Merlo, Stevenson, Tsang, and Allaria 2002, Allaria 2001) • Verbs: 20 CoS, 20 C/T, 19 Psych • Properties: TRANS, CAUS, ANIM, PROG • Corpus: PAROLE, 22 million words (CNR, Pisa) • Counts: relative frequencies, hand counts (exact)

Results • 79% reduction in error rate on unseen verbs • TRANS and ANIM are the best features • Relationship between frequencies and thematic properties holds across languages

Leveraging Cross-language Differences (Tsang, Stevenson and Merlo, 2002) • What is abstract/underlying in one language might be explicit in another • - Revealing an underlying common/similar classification • - e.g. causative forms in Chinese are morphologically marked • Data from several languages classify one language • - Training: Chinese and English • - Testing: English

Materials and Method • English verb classes: MoM, CoS, C/T, 20 verbs in each class • English properties: TRANS, PASS, VBN, CAUS, ANIM • Chinese translations of the verbs (several) • Counts of new frequencies adapted to Chinese: relative frequencies of • - POS tags (indication of subcategorization and stative/active) • - passive particle • - periphrastic causative particle

Materials and Method • English data from BNC (tagged and chunked) • Chinese data from Mandarin News (165 million characters) • Learning: C5.0 (decision tree induction) • Training/Testing: 10-fold cross-validation, 50 repeats

Results • Best result in classification of English verbs: combination of Chinese and English frequencies • - ANIM, TRANS, CKIP: 83.5% accuracy (English frequencies: 67.6%) • Suggests the same or at least a similar underlying abstract classification; otherwise the different views would make the classifier diverge • Shows the advantage of working at different levels of description

Conclusions • Distributional properties are correlated with thematic properties for several verb classes, several thematic roles, several languages • Relevant for • - Notion of verb class: point in a multi-dimensional space? • - Representation and inventory of thematic roles • - Language acquisition studies: what are the properties necessary for learning verb meaning? (Gillette et al. 99)

Thank you to our students Gianluca Allaria (Geneva) Eva Esteve Ferrer (Geneva) Eric Joanis (Toronto) Vivian Tsang (Toronto)

THANK YOU
