LIN 3098 – Corpus Linguistics Lecture 5

LIN 3098 – Corpus Linguistics Lecture 5 Albert Gatt

In this lecture… • Corpora and the Lexicon • uses of corpora in lexicography • Counting words • lemmatisation and other issues • types versus tokens • word frequency distributions in corpora

Part 1 Corpora and lexicography

Why corpora are useful • Lexicographic work has long relied on contextual cues to identify meanings. • e.g. Samuel Johnson used examples from literature to exemplify uses of a word. • Corpora make this procedure much easier • not only to provide examples but: • to actually identify meanings of a word given its context • definitions of word meanings should therefore be more precise, if based on large amounts of data

Specific applications • Grammatical alternations of words • E.g. Verb diathesis alternations: • Atkins and Levin (1995) found that verbs such as quiver and quake have both intransitive and transitive uses. (see Lecture 1) • E.g. uses of prepositions such as on, with… • Regional variations in word use • relying on corpora which include gender/region/dialect/date information

Specific applications - II • Identification of occurrences of a specific homograph, e.g. house (Verb) • examination of the contexts in which it occurs • relies on POS tagging • Keeping track of changes in a language through a monitor corpus • Identifying how common a word is, through frequency counts. • many dictionaries include such information now • this shall be our starting point

Part 2 Counting words in corpora: types versus tokens

Running example • Throughout this lecture, reference is made to data from a corpus of Maltese texts: • ca. 51,000 words • all from Maltese-language newspapers • various topics and article types

How to count words: types versus tokens • token = any word in the corpus • (also counting words that occur more than once) • type = all the individual, different words in the corpus • (grouping occurrences of a word together as representatives of a single type) • Example: • I spoke to the chap who spoke to the child • 10 tokens • 7 types (I, spoke, to, the, chap, who, child)
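
The example sentence can be checked with a minimal sketch, assuming the simplest possible tokenisation (splitting on whitespace). The function name count_tokens_and_types is made up for illustration.

```python
def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are whatever is separated by whitespace.
    """
    tokens = text.split()   # every running word, repeats included
    types = set(tokens)     # each distinct word form counted once
    return len(tokens), len(types)


sentence = "I spoke to the chap who spoke to the child"
print(count_tokens_and_types(sentence))  # (10, 7)
```

Real corpora need more careful tokenisation than this, for example to handle case, punctuation and numbers.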

More on types and tokens • The number of tokens in the corpus is an estimate of overall corpus size • Maltese corpus: 51,000 tokens • The number of types is an estimate of vocabulary size • gives an idea of the lexical richness of the corpus • Maltese corpus: 8193 types

Type/token ratio • A (rough!) way of measuring the amount of variation in the vocabulary in the corpus. • Roughly, can be interpreted as the “rate at which new types are introduced, as a function of number of tokens”

Difficult decisions - I • Do we distinguish upper- and lower-case words? • is New in New York the same as new in new car? • but what of New in New cars are expensive? (sentence-initial caps) • in practice, it’s not straightforward to distinguish the two accurately, but it can be done

Difficult decisions - II • What about morphological variants? • man – men: one type or two? • go – went: one type or two? • If we map all morphological (inflectional) variants to a single type, our counts will be cleaner (lemmatisation). • depends on availability of automated methods to do this • Maltese also presents problems with variants of the definite article (ir-, is-, ix- etc) • ir-raġel (DEF-man): one token or two?

Difficult decisions - III • Do numbers count? • e.g. is 1,500 a word? • may artificially inflate frequency counts • one approach is to treat all numbers as tokens of a single type “NUMBER” or “###” • Punctuation • can compromise frequency counts • a computer will treat “woman!” as different from “woman” • needs to be stripped • problematic for languages that rely on non-alphabetic symbols: Maltese ‘l (“to”) vs l- (“the”)
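
The decisions on case, numbers and punctuation can be sketched as one normalisation step applied to every token before counting. This is only one possible set of choices, not a recommended pipeline: the function name normalise, the regular expression for numerals and the use of ASCII punctuation only are all assumptions made for illustration.

```python
import re
import string

NUMBER_TYPE = "NUMBER"  # single placeholder type for all numerals


def normalise(token, lowercase=True):
    """Strip edge punctuation, map numerals to one type, optionally fold case."""
    token = token.strip(string.punctuation)      # "woman!" -> "woman"
    if re.fullmatch(r"\d+([.,]\d+)*", token):    # 1500, 1,500, 3.14
        return NUMBER_TYPE
    return token.lower() if lowercase else token


print([normalise(t) for t in "The woman! paid 1,500 euro".split()])
# ['the', 'woman', 'paid', 'NUMBER', 'euro']

# The Maltese problem in miniature: both forms collapse to the same string.
print(normalise("'l"), normalise("l-"))  # l l
```

Note how blind punctuation stripping merges ‘l and l-, which is exactly why such rules have to be adapted to the language.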

Part 2 Representing word frequencies

Raw frequency lists (data from Maltese) • A simple list, pairing each word with its frequency

Frequency ranks • Word counts can get very big. • most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small) • Raw frequency lists can be hard to process. • Useful to represent words in terms of rank: • count the words • sort by frequency (most frequent first) • assign a rank to the words: • rank 1 = most frequent • rank 2 = next most frequent • …
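
The three steps (count, sort, assign ranks) can be sketched as follows. The function name rank_frequency_list is hypothetical, and the treatment of ties is an assumption: words with equal frequency simply receive consecutive ranks in order of first appearance.

```python
from collections import Counter


def rank_frequency_list(tokens):
    """Return [(rank, word, frequency), ...] with rank 1 = most frequent."""
    counts = Counter(tokens)          # step 1: count the words
    ordered = counts.most_common()    # step 2: sort, most frequent first
    # step 3: assign ranks starting from 1
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


tokens = "I spoke to the chap who spoke to the child".split()
for row in rank_frequency_list(tokens)[:3]:
    print(row)
```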

Rank-frequency list example (data from Maltese) • Column 1: rank of type, according to frequency • Column 2: number of times the type occurs

Frequency spectrum (data from Maltese) • A representation that shows, for each frequency value, the number of different types that occur with that frequency.
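
A frequency spectrum can be obtained by counting twice: first words, then the frequencies themselves. A minimal sketch, with a made-up function name:

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency value to the number of types with that frequency."""
    type_counts = Counter(tokens)               # word -> frequency
    spectrum = Counter(type_counts.values())    # frequency -> no. of types
    return dict(sorted(spectrum.items()))


tokens = "I spoke to the chap who spoke to the child".split()
print(frequency_spectrum(tokens))  # {1: 4, 2: 3}
```

Here four types occur once (I, chap, who, child) and three types occur twice (spoke, to, the).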

Normalised frequency counts • A raw frequency for a word isn’t necessarily informative. • E.g. it is difficult to compare the frequency of a word across corpora of different sizes. • We often take a “normalised” count. • typical to divide the frequency by the corpus size (no. of tokens) and multiply by some constant, such as 10,000 or 1,000,000 • this gives “frequency of word per million” (or per 10,000) rather than a raw count.
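
As a worked sketch, normalisation scales a raw count to a common base so that corpora of different sizes become comparable. The function name and the default base of one million are illustrative choices; the figures used are the raw frequency of the top-ranked Maltese word (2195) and the approximate corpus size (ca. 51,000 tokens).

```python
def normalised_frequency(raw_count, corpus_size, base=1_000_000):
    """Frequency per `base` tokens (default: per million)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * base / corpus_size


# Most frequent Maltese word: 2195 occurrences in roughly 51,000 tokens.
per_million = normalised_frequency(2195, 51_000)
print(round(per_million))  # 43039
```

So a word seen 2195 times in this small corpus corresponds to roughly 43,000 occurrences per million words.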

Type/token ratio revisited • (no. of types)/(no. of tokens) • Another way of estimating “vocabulary richness” of a corpus, instead of just looking at vocabulary size. • E.g. if a corpus consists of 1000 words, and there are 400 types, then the TTR is 40%
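
A minimal sketch of the ratio, assuming the tokens have already been extracted; type_token_ratio is a made-up name.

```python
def type_token_ratio(tokens):
    """Number of types divided by number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty text")
    return len(set(tokens)) / len(tokens)


tokens = "I spoke to the chap who spoke to the child".split()
print(type_token_ratio(tokens))  # 0.7
```

For the Maltese corpus, 8193 types over roughly 51,000 tokens gives a TTR of about 16%.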

Type/token ratio • Ratio varies enormously depending on corpus size! • If the corpus is 1000 words, it’s easy to see a TTR of, say, 40%. • With 4 million words, it’s more likely to be in the region of 2%. • Reasons: • vocab size grows with corpus size but • large corpora will contain a lot of types that occur many times, so the number of tokens grows faster than the number of types

Standardised type/token ratio • One way to account for TTR variations due to corpus size is to compute an average TTR for chunks of a constant size. Example: • compute the TTR for every 1000 words of running text • then, take an average over all the 1000-word chunks • This is the approach used, for example, in WordSmith.
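
The chunk-and-average procedure can be sketched as below. This is not WordSmith's code: the function name is hypothetical, and discarding a trailing chunk shorter than the chunk size is an assumption of this sketch.

```python
def standardised_ttr(tokens, chunk_size=1000):
    """Average TTR over consecutive full chunks of `chunk_size` tokens.

    Assumption: an incomplete final chunk is ignored.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)


# Tiny demonstration with 2-word chunks: TTRs are 0.5 and 1.0, "x" is dropped.
print(standardised_ttr(["a", "a", "b", "c", "x"], chunk_size=2))  # 0.75
```

Because every chunk has the same length, the result no longer depends on the overall size of the corpus.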

Part 3 Frequency distributions, or “few giants, many midgets”

Non-linguistic case study • Suppose we are interested in measuring people’s height. • population = adult, male/female, European • sample: N people from the relevant population • measure height of each person in the sample • Results: • person 1: 1.6 m • person 2: 1.5 m • …

Measures of central tendency • Given the height of individuals in our sample, we can calculate some summary statistics: • mean (“average”): sum of all heights in sample, divided by N • mode: most frequent value • median: the middle value when the heights are sorted • What are your expectations?
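
The three measures can be computed with Python's standard statistics module. The heights below are invented for illustration and are not the lecture's data; summarise is a made-up helper name.

```python
import statistics


def summarise(values):
    """Return the mean, median and mode of a sample."""
    return {
        "mean": statistics.mean(values),      # sum of values divided by N
        "median": statistics.median(values),  # middle value once sorted
        "mode": statistics.mode(values),      # most frequent value
    }


heights_cm = [150, 152, 158, 160, 160, 165, 170]  # hypothetical sample
print(summarise(heights_cm))
```

In this made-up sample the three measures land close together, which is what we expect from roughly bell-shaped data.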

The data (example) • Mean: 158.8cm • This is the expected value in the long run. • If our sample is good, we would expect that most people would have a height at or around the mean. • Mode: 160cm • Median: 160cm

Plotting height/frequency • Observations: • 1. Extreme values are less frequent. • 2. Most people fall at or near the mean • 3. Mode is approximately the same as the mean • 4. Bell-shaped curve (“normal” distribution)

Plotting height/frequency • This shape characterises the Normal Distribution. • A “bell curve” • Quite typical for a lot of data sampled from humans (but not all data)

What about language? • Typical observations about word frequencies in corpora: • there are a few words with extremely high frequency • there are many more words with extremely low frequency • the mean is not a good indicator: most words will have an actual value that is very far above or below the mean

A closer look at the Maltese data • Out of 51,000 tokens: • 8016 tokens belong to just the 5 most frequent types (the types at ranks 1–5) • ca. 15% of our corpus size is made up of only 5 different words! • Out of 8193 types: • 4382 are hapax legomena, occurring only once (bottom ranks) • 1253 occur only twice • … • In this data, the mean won’t tell us very much. • it hides huge variations!
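
A quick calculation with the figures above shows why the mean is uninformative here. All numbers come from the slide; the corpus size is approximate (ca. 51,000), so the results are rough, and the variable and function names are made up.

```python
N_TOKENS = 51_000     # approximate corpus size
N_TYPES = 8193        # vocabulary size
TOP5_TOKENS = 8016    # tokens belonging to the 5 most frequent types
HAPAXES = 4382        # types occurring exactly once


def maltese_summary():
    """Mean frequency per type versus the shares at the two extremes."""
    return {
        "mean_frequency": N_TOKENS / N_TYPES,
        "top5_share_of_tokens": TOP5_TOKENS / N_TOKENS,
        "hapax_share_of_types": HAPAXES / N_TYPES,
    }


for name, value in maltese_summary().items():
    print(f"{name}: {value:.3f}")
```

The mean frequency comes out at about 6.2 occurrences per type, yet more than half of the types occur only once, while the top-ranked word alone occurs 2195 times.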

Ranks and frequencies (Maltese) • Frequencies at the top ranks: 2195, 2080, 1277, … • Frequencies at the bottom ranks: 1, 1, … • Among top ranks, frequency drops very dramatically • Among bottom ranks, frequency drops very gradually

General observations • In corpora: • there are always a few very high-frequency words, and many low-frequency words • among the top ranks, frequency differences are big • among bottom ranks, frequency differences are very small

So what are the high-frequency words? • Top-ranked words in the Maltese data: • li (“that”), l- (DEF), il- (DEF), u (“and”), ta’ (“of”), tal- (“of the”) • Bottom ranked words: • żona (“zone”) f = 1 • yankee f = 1 • żwieten (“Zejtun residents”) f = 1 • xortih (“luck.POSS-3SGM”) f = 1 • widnejhom (“ear.POSS-3PL”) f = 1

Zipf’s law • George K. Zipf (1902 – 1950) established a mathematical model for describing frequency data: Frequency decreases with rank. More precisely, frequency is inversely proportional to rank. • We can plot this in a chart: • Y-axis = frequency • X-axis = rank • each dot on the chart represents the lexical item (type) at a given rank
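
The inverse proportionality can be written out as a formula. The symbols are notational assumptions, not taken from the slide: f(r) is the frequency of the type at rank r, C is a corpus-dependent constant, and a is an exponent that is close to 1 in the simplest version of the law.

```latex
f(r) \approx \frac{C}{r^{a}}, \qquad a \approx 1
\quad\Longrightarrow\quad
\log f(r) \approx \log C - a \log r
```

Read this way, plotting frequency against rank with both axes on a logarithmic scale should give roughly a straight line with slope -a. Real data only follows this approximately.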

How Zipf’s law pans out (Maltese data) • A few high-frequency, low-rank words • Hundreds of low-frequency, high-rank words

Zipf’s law cross-linguistically • Empirical work has shown that the Zipfian distribution is observable: • independent of the language • irrespective of corpus size (for reasonably large corpora) • The bigger your corpus: • the bigger your vocabulary size (no. types) • the more words of frequency 1 (hapax legomena) • Why?

Some reasons • If words were completely random, every word would be equally likely. • Our plot would be completely flat: all words at all ranks have same frequency. • Language is absolutely non-random: • occurrence of words governed by: • syntax • author/speaker intentions • ... • Some words are the basic “skeleton” for our sentences. They are the most frequent.

Implications • Traditional measures of central tendency (mean etc) not very useful. • No two corpora can be directly compared if they are of different size: • vocab size increases with corpus size • most of the vocab made up of hapax legomena • most of the corpus size (no. tokens) made up of a few, very frequent types, typically function words.

Summary • We’ve introduced some of the uses of corpora for lexicography. • Focused today on word frequencies, especially Zipf’s law • looked at some of the implications • Next up: • collocations and why they’re useful

References • Baroni, M. (2007). Distributions in text. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook. Berlin: Mouton de Gruyter.
