LIN 3098 Corpus Linguistics Lecture 7. Albert Gatt. In this lecture. We look at some ways in which corpora can be useful in morphological research. Main focus: morphological productivity. Part 1. Morphology, corpora and productivity. Productivity in linguistics.
LIN 3098 Corpus Linguistics Lecture 7 Albert Gatt
In this lecture • We look at some ways in which corpora can be useful in morphological research. • Main focus: morphological productivity
Part 1 Morphology, corpora and productivity
Productivity in linguistics • The term “productivity” is used in a wide variety of contexts. • Syntactic rules are “productive” in the sense that they can be used to generate new phrases. • The same can be said of some morphological rules.
A definition of productivity • A linguistic process is productive if: • It can be used to produce novel forms. • If a rule is productive, then: • Novel forms (previously unheard) can be understood and produced; • There is no need to store all forms in the mental lexicon.
A couple of examples • Imagine an English adjective garmy. How would you derive a noun out of this adjective? • Many speakers might say garminess • This suggests that –ness suffixation is a productive derivational process. • E.g. Imagine a Maltese verb intoffa. How would you produce a noun from it? • Speakers might say intoffar or intoffament or intoffazzjoni • This suggests that –ar and –ment suffixation are productive derivational processes in Maltese.
Productive vs non-productive • Some morphological processes or categories seem to have greater potential to form new words than others • e.g. English -able, -ness • compare to English –th: warmth, strength… (much less productive)
Classical approaches to productivity • Jackendoff (1975): • morphological rules are called redundancy rules: • They capture the relationship between related forms • E.g. warm → warmth (ADJ → N via addition of –th) • E.g. desire → desirable (N → ADJ via addition of –able) • If a rule is productive, then it can be used to create novel forms. • e.g. adjectives with –able can be produced “online”
Features of classical approaches • They rely on a binary distinction (productive vs unproductive) • Productive rules are typically regular; sub-regularities are not considered much (Dressler 2003) • Most of these approaches do not look at corpus data
Productive vs regular • Usually, productive morphological rules are regular. Irregular forms are likely to be stored in the lexicon. • However, we can sometimes detect “sub-regularities”: • sing-sang • ring-rang • bring-brang (?) • Speakers can sometimes generalise these sub-regular processes, perhaps by analogy. • What’s the past tense of tring or spling?
“Possible” vs “attested” • Our tentative definition of productivity focuses on production of novel forms. • By definition, novel forms are: • Possible words of the language; • Previously unattested. • This would suggest that we can’t use corpora to study productivity. • Corpora only contain attested forms.
The problem of frequency • Suppose we find that a corpus contains lots of words ending in some suffix –X. • This doesn’t necessarily imply that the –X suffix is productive. • It could have been productive in the past, but no longer be productive. • In that case, the likelihood of a new word ending in –X is low, despite the high frequency.
Getting around the problem • Frequency can’t give us all the answers. However, one interesting solution is to look at hapax legomena. • A corpus will usually contain lots of words occurring only once. • We can think of hapaxes as “one-offs”. • It seems likely that some hapaxes will be “new formations” • NB: We can only make this assumption if the corpus is very large.
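To make the idea of a hapax concrete, here is a minimal sketch of how hapaxes might be extracted from a tokenised corpus. The function name `hapax_legomena` and the toy token list are invented for illustration; a real study would need a very large corpus and proper tokenisation and lemmatisation.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the set of word types that occur exactly once in `tokens`."""
    counts = Counter(tokens)
    return {word for word, freq in counts.items() if freq == 1}

# Toy corpus (invented; far too small to support any real conclusion).
tokens = "the darkness and the garminess of the night and the kindness".split()

hapaxes = hapax_legomena(tokens)
print(sorted(hapaxes))
```

In this toy example "the" and "and" are excluded because they recur, while one-off items such as "garminess" are retained as candidate new formations.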
Corpus-based approaches • View productivity as a gradable phenomenon: • some forms become ingrained through frequent usage • category can still be productive to some extent • productivity estimated in terms of a category’s potential to produce new forms • can account for sub-regularities: productivity of a category is due to a lot of factors, including analogy to existing words
The continuum • [Figure: a continuum between “lexicalised word” and “productive morphological process”; labelled examples: ADJ+ness → Noun, ADJ+th → Noun] • Productive processes tend to: • be compositional • result in a lot of new words
Why is productivity interesting? • No finite lexicon can contain all words of a language at a certain time • productive processes can be exploited to parse new/unseen lexical items • this is helped by the compositionality of productive processes • can also help to distinguish creative neologism from systematic rule-application. compare: • well-defined, well-intentioned, well-specified • lots of adjectives with a well- prefix • YouTube • a one-off
Theoretical implications • raises interesting questions about the relationship between corpus-based measures and psycholinguistic data • likelihood of a morphological process being applied depends on style, genre, speech community… • can give an indication of language change over time (some processes are fossilised, others become more productive)
What we need • A measure of productivity of a process/category C should reflect: • our intuitions about how frequently we encounter C • how easily native speakers can form new words using C • Is it easier to produce a noun with –th (like warmth) or one with –ness (like goodness)?
An analogy • We can compare morphological processes to companies. • All try to dominate a market where the number of clients (words) is limited. • Productivity reflects the extent to which these companies: • have managed to dominate in the past (how many words they’ve formed) • are expanding into new areas of the market (how many new words they’re forming) • may expand in the future (how many as yet unseen words they’re going to form)
Realised productivity (RP) • Given a morphological category C, RP gives a rough indication of the past utility of C in forming new words. • Measured as the number of distinct types in C in a corpus of size N • E.g. regular past tense –ed displays many more types than sub-regular forms such as keep-kept/sleep-slept
Realised productivity cont/d • Why types, not tokens? • Productive processes have lots of types which are hapaxes, or are very infrequent (low token frequency). • Words formed from irregular processes tend to be very frequent (have high token frequency). • Some limitations: • a high RP for a category does not imply that it will keep forming lots of new words • RP is heavily dependent on corpus size
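As a rough sketch of the type-counting idea, RP can be written as RP = V(C, N): the number of distinct types of category C in a corpus of N tokens. The code below assumes, purely for illustration, that membership in C can be approximated by a suffix match (which would wrongly admit words like "witness"); the function name and toy corpus are hypothetical.

```python
def realised_productivity(tokens, suffix):
    """Count distinct types ending in `suffix` (a crude stand-in for category C)."""
    return len({t for t in tokens if t.endswith(suffix) and len(t) > len(suffix)})

# Toy corpus, invented for illustration.
tokens = ["goodness", "warmth", "goodness", "kindness", "strength",
          "darkness", "goodness", "warmth", "sadness", "kindness"]

rp_ness = realised_productivity(tokens, "ness")            # goodness, kindness, darkness, sadness
rp_th = realised_productivity(tokens, "th")                # warmth, strength
rp_ness_small = realised_productivity(tokens[:5], "ness")  # same measure on a smaller sample

print(rp_ness, rp_th, rp_ness_small)
```

Running the same function on the first five tokens only gives a lower count for -ness, which illustrates why RP values cannot be compared across corpora of different sizes.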
Expanding productivity (P*) • P* gives a rough indication of the rate of expansion of C. • Focuses on the number of hapaxes produced using C in the corpus. • aka hapax-conditioned productivity • NB: P* is still heavily dependent on corpus size!
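The slide gives no formula for P*. A common formulation, following Baayen's work, is P* = V(1, C, N) / V(1, N): the number of hapaxes belonging to C divided by the number of all hapaxes in the corpus. The sketch below assumes that formulation and again approximates category membership by a suffix match; the names and the toy corpus are invented.

```python
from collections import Counter

def expanding_productivity(tokens, suffix):
    """Share of all corpus hapaxes that end in `suffix` (assumed definition of P*)."""
    counts = Counter(tokens)
    hapaxes = [w for w, freq in counts.items() if freq == 1]
    if not hapaxes:
        return 0.0
    in_category = [w for w in hapaxes if w.endswith(suffix) and len(w) > len(suffix)]
    return len(in_category) / len(hapaxes)

# Toy corpus, invented for illustration.
tokens = ("goodness warmth goodness kindness strength "
          "darkness warmth sadness the the of").split()

p_star_ness = expanding_productivity(tokens, "ness")  # 3 of the 5 hapaxes end in -ness
p_star_th = expanding_productivity(tokens, "th")      # 1 of the 5 hapaxes ends in -th

print(p_star_ness, p_star_th)
```

Because the denominator is the hapax count of the whole corpus, the value changes as the corpus grows, in line with the NB above.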
Potential productivity (P) • Gives an indication of how likely a category C is to form new words in future. • I.e. it indicates whether C is already saturated (low P) or still has room to grow (high P) • aka category-conditioned productivity
Some more on P • Unlike RP and P*, P is not very sensitive to corpus size as such • However, very sensitive to frequency of the category. • e.g. if C is realised only once in a corpus of size N, then P = 1! • Recent empirical work has shown that RP and P* may correlate very strongly, but both exhibit a weak correlation with P (Vegnaduzzo 2009) • pattern non-X has high RP and P*, but low P • pattern X-ish has low RP and P*, but high P
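The "P = 1" case follows directly if P is computed as P = V(1, C, N) / N(C), i.e. the number of hapaxes in C divided by the total number of tokens of C. This is the usual formulation in Baayen's work and is assumed here, since the slide does not state it. In the sketch below, category membership is again approximated by a suffix match, and all names and data are invented for illustration.

```python
from collections import Counter

def potential_productivity(tokens, suffix):
    """Hapaxes in the category divided by the category's token count (assumed P)."""
    counts = Counter(t for t in tokens if t.endswith(suffix) and len(t) > len(suffix))
    n_tokens = sum(counts.values())
    if n_tokens == 0:
        return 0.0
    n_hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return n_hapaxes / n_tokens

# Toy corpus, invented for illustration.
tokens = ("goodness goodness goodness kindness kindness darkness "
          "warmth warmth strength greenish").split()

p_ness = potential_productivity(tokens, "ness")  # 1 hapax out of 6 tokens
p_th = potential_productivity(tokens, "th")      # 1 hapax out of 3 tokens
p_ish = potential_productivity(tokens, "ish")    # a single token, which is also a hapax

print(p_ness, p_th, p_ish)
```

The -ish category occurs only once, so its one token is necessarily a hapax and P comes out as 1.0. This is why P values for very rare categories need to be read with caution.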
P vs. RP and P* • A category C can have low RP and P*, but high P. • In this case, C hasn’t been used much in the past, but is being used quite productively at the moment. • Corresponds to the “ease” with which new words can be formed using the category. • If a category has high RP, it may nevertheless be saturated, and so have low P.
The psycholinguistic connection • Rule vs. direct access: • To produce a word (e.g. illegal), you can either retrieve it directly from storage, or apply the rule on the fly. • Evidence suggests that the frequency of the base form relative to the derived form is related to which of the two alternatives applies.
The psycholinguistic connection • Complexity-based affix ordering: • Corpus research: more productive affixes follow less productive ones in word formation • It seems that more highly predictable (low productivity) affixes are processed first. • High productivity may also imply less likelihood of entering into further derivational processes.
Works cited • S. Vegnaduzzo (2009). Morphological productivity rankings of complex adjectives. Proc. NAACL-HLT Workshop on Computational Approaches to Linguistic Creativity. • K. Moilanen and S. Pulman (2008). The good, the bad and the unknown: Morphosyllabic sentiment tagging of unseen words. Proc. ACL 2008. • Baayen 2006 linked from web page