Measures from Information Retrieval to Find the Words which are Characteristic of a Corpus. Michael Oakes, University of Sunderland, England.
Contents • Background and the ICAME disk • Two traditional measures: chi-squared and G-squared (Log-likelihood) • Information Retrieval
 
                
Looking for discriminating vocabulary • Two classic papers: Kilgarriff (1996), Which words are particularly characteristic of a text? A survey of statistical approaches. • Yang and Pedersen (1997), A comparative study on feature selection in text categorization. • Identify discriminants: linguistic features more typical of one form of English than another. • Automatic categorisation of text types is akin to automatic topic, genre and author identification (Souter, 1994). • Vocabulary differences reveal cultural differences (Leech and Fallon, 1992).
Leech and Fallon (1992) compared the vocabulary in Brown and LOB • Linguistic contrasts: • Spelling differences: color / colour • Lexical choice: gasoline / petrol • Proper nouns (Chicago more common in Brown) • Non-linguistic contrasts: indicators of socio-cultural differences between the two countries.
Number of sections of approx. 2000 words in 5 comparable corpora (1)
The chi-squared (X²) test • See Rayson, Leech & Hodges (1997). • Case study: Is the word “lovely” used more often in speech by men or women? • Experiment: In the BNC conversational corpus, men say “lovely” 414 times while women say “lovely” 1214 times. • Statistics: Is this due to chance, or does the use of this word genuinely vary with the gender of the speaker? Use the chi-squared test. • Contingency table of observed values O: see next slide
The chi-squared test (2) • Expected frequencies E • E = row total × column total / grand total • e.g. E (lovely, men) = 1628 × 1714443 / 4307895 = 647.9 • See previous table • X² = Σ (O – E)² / E • Find (O – E)² / E for every box in the table, • e.g. (O – E)² / E for (lovely, men) = • (414 – 647.9)² / 647.9 = 84.4. • X² = sum (Σ) for all four boxes • = 84.4 + 55.8 + 0.0 + 0.0 = 140.2
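The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `chi_squared` is hypothetical, and the total for women is assumed to be the grand total minus the total for men, since the contingency table itself is not reproduced here. Because the sketch sums unrounded cell contributions, its total can differ in the last digit from the slide's sum of rounded values.

```python
def chi_squared(word_counts, corpus_sizes):
    """Pearson X² for one word across corpora (the word vs. all other words).

    word_counts[i]  -- occurrences of the word in corpus i
    corpus_sizes[i] -- total number of words in corpus i
    """
    grand_total = sum(corpus_sizes)
    word_total = sum(word_counts)
    other_total = grand_total - word_total
    x2 = 0.0
    for count, size in zip(word_counts, corpus_sizes):
        # Two cells per corpus: the word itself, and every other word.
        for observed, row_total in ((count, word_total),
                                    (size - count, other_total)):
            expected = row_total * size / grand_total
            x2 += (observed - expected) ** 2 / expected
    return x2


men_total = 1714443
women_total = 4307895 - men_total  # assumption: derived from the grand total
x2 = chi_squared([414, 1214], [men_total, women_total])
print(round(x2, 1))
```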
G² vs. Chi-squared • The chi-squared test is an approximation to the G² test, easier to calculate in the days before PCs and pocket calculators (Wikipedia) • Both can be used to compare corpora of different sizes • The only restriction is that the expected values must be >= 5 (Moore 2004, Rayson et al., 2004)
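The slides do not spell out the G² formula, so the sketch below uses the standard log-likelihood ratio form, G² = 2 Σ O ln(O / E), with the same expected values E as in the chi-squared test. The function name `log_likelihood` is hypothetical, and the women's total is again assumed to be the grand total minus the men's total.

```python
import math


def log_likelihood(word_counts, corpus_sizes):
    """G² (log-likelihood ratio) for one word across corpora."""
    grand_total = sum(corpus_sizes)
    word_total = sum(word_counts)
    other_total = grand_total - word_total
    total = 0.0
    for count, size in zip(word_counts, corpus_sizes):
        for observed, row_total in ((count, word_total),
                                    (size - count, other_total)):
            expected = row_total * size / grand_total
            if observed > 0:  # a cell with O = 0 contributes nothing
                total += observed * math.log(observed / expected)
    return 2 * total


men_total = 1714443
women_total = 4307895 - men_total  # assumption: derived from the grand total
g2 = log_likelihood([414, 1214], [men_total, women_total])
print(round(g2, 1))
```

For the “lovely” example both statistics are far above the 10.83 threshold, so they lead to the same conclusion here.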
Bonferroni Correction • Controls the family-wise error rate (the chance of even one false positive across all the tests) • For a single test, X² or G² > 10.83 is significant at the 0.1% level. • In comparing the vocabulary across the five corpora, we effectively perform 101,984 tests because there are 101,984 unique word types across the 5 corpora. • To find the appropriate critical value we divided 0.001 by 101,984 to give an adjusted significance level of 9.805 × 10⁻⁹. • We then identify words with chi-squared contributions > 32.9 • The probability that even one word selected in this way has been incorrectly identified is at most 0.1%; the Bonferroni correction is conservative. • We are more interested in ranking than absolute values.
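The two critical values quoted above can be checked with the Python standard library. This is a sketch under one assumption: the test has one degree of freedom, in which case a chi-squared variable is the square of a standard normal variable and the critical value is the square of the two-sided normal quantile. The helper name `chi2_critical_df1` is hypothetical.

```python
from statistics import NormalDist


def chi2_critical_df1(alpha):
    """Critical value of chi-squared with 1 degree of freedom at level alpha."""
    # With 1 d.f., P(X² > c) = P(|Z| > sqrt(c)) for a standard normal Z.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


n_tests = 101_984              # unique word types across the five corpora
alpha = 0.001                  # the 0.1% level
adjusted_alpha = alpha / n_tests

single = chi2_critical_df1(alpha)              # threshold for one test
corrected = chi2_critical_df1(adjusted_alpha)  # Bonferroni-corrected threshold
print(adjusted_alpha, round(single, 2), round(corrected, 1))
```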
Dispersion • Dispersion measures show how evenly or otherwise a word is distributed throughout a corpus (Lyne 1985, 1986). • In this study, we should only consider words which are relatively evenly spread throughout the corpus. • E.g. thalidomide, ranked 15th most typical of UK, has all 55 of its occurrences in a single medical article.
Juilland’s D (1) • Divide the corpus into n contiguous subsections (we used 5). • Commonwealth was found 31, 8, 32, 88, 5 times respectively in the Australian corpus. • The standard deviation of the number of times the word is found in each subsection = 29.79, and the mean frequency is 32.8.
Juilland’s D (2) • To account for the fact that the standard deviation tends to be higher for more frequent words, it is divided by the mean frequency to give the coefficient of variation V = 29.79 / 32.8 = 0.908 • The coefficient of dispersion falls in the range 0 to 1. • D = 1 - V / sqrt (n-1) = 0.546 for commonwealth • Empirical finding: keep if D >= 0.3, range >= 3.
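The worked example for commonwealth can be reproduced as follows. The figure of 29.79 quoted above corresponds to the population standard deviation (dividing by n rather than n - 1), so the sketch uses `statistics.pstdev`; the function name `juilland_d` is hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def juilland_d(freqs):
    """Juilland's D for a word, given its frequency in each subsection."""
    n = len(freqs)
    # Coefficient of variation: population standard deviation over the mean.
    v = pstdev(freqs) / mean(freqs)
    return 1 - v / sqrt(n - 1)


commonwealth = [31, 8, 32, 88, 5]  # Australian corpus, 5 subsections
d = juilland_d(commonwealth)
print(round(d, 3))
```

A word spread perfectly evenly has V = 0 and therefore D = 1, while a word confined to one subsection has D = 0.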
The Australian list • 18 of top 19 people and places • Exception is Commonwealth (of Australia) • Politics: Premier, Senator, Hawke, Whitlam, ALP, Labor, BHP • Employment rights: unions, unemployed, superannuation
The British list • People and places • Institutions: NHS, BBC • Politics: Tory, Labour • EC (European Community) • Historical epochs: century, eighteenth • Aristocratic titles: Duke, Lord(s), Prince, Royal
The Indian List • People and places • Currency: Rs (rupees) • Numbers: mn (million), crores (ten million), lakhs (hundred thousand). • Function words: the, of, in, upto (single word) • Religion: Buddha (86.0), Buddhism (45.4), divine (150.6), Gita (119.3), God (37.8), Gods (78.6), Goddess (44.4), Hindu (299.5), Hindus (148.1), Karma (61.4), Muslim (151.8), Muslims (42.2), mystic (53.1), Mystics (100.7), pandit (104.4), Saints (35.6), Sikh (80.0), Swami (131.2), temple (248.8), temples (104.2), Vedas (101.4), Vedic (102.9), yoga (97.7).
The New Zealand list • Place names • Pakeha (person of European descent) • The natural world: bay, forest, harbour, island(s), landscape. • Rugby
The U.S. list • Few people and places • Spelling variants: toward, percent, programs, defense, program, color, behavior, labor, fiber, gray, theater, favorite, favor, colors, organization • Inclusiveness: black, gender, white
Measures from Information Retrieval • The main difference from corpus linguistics is that we are interested in the information itself rather than its linguistic style. • Raw frequency with stoplisting • TF.IDF • Deviation from Randomness • Kullback-Leibler Divergence
Raw Frequency • Most frequent words in the New Zealand corpus: • the (67355), of (32182), and (28678), to (26552), a (23558), in (20519), is (10284), was (10081), it (9814), that (9743), for (9341), I (7844), on (7629), ‘s (7585), with (7185), as (7027), he (6716), be (6297), at (5530), by (5207)…
The Glasgow Stoplist • a, about, above, across, adj, after, again, against, all, almost, alone, along, also, although, always, am, among, an, and, another, any, anybody, anyone, anything, anywhere, apart, are, around, as, aside, at, away, be … yourself.
Raw Frequency with Stoplisting • ‘s (7875), he (6716), you (3838), New (3319), we (3292), one (3267), my (2078), Zealand (1985), time (1920), like (1607), me (1602), two (1589), people (1583), first (1393), now (1285), back (1208), years (1145), way (1079), work (1041), and made (1019) • Only New and Zealand appeared typical of the corpus of New Zealand English. • This shows the need for more sophisticated measures.
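Raw frequency with stoplisting amounts to counting tokens after discarding those on the stoplist. The sketch below is illustrative only: `STOPLIST` is a small hypothetical fragment rather than the Glasgow stoplist, the token sequence is invented, and lower-casing tokens before the lookup is an assumption about how matching is done.

```python
from collections import Counter

# Hypothetical fragment of a stoplist; the Glasgow stoplist is much longer.
STOPLIST = {"a", "about", "above", "all", "and", "the", "of", "to", "in", "is"}


def top_words(tokens, stoplist, k):
    """Return the k most frequent tokens that are not on the stoplist."""
    counts = Counter(t for t in tokens if t.lower() not in stoplist)
    return counts.most_common(k)


# Invented toy text, not taken from any of the corpora.
tokens = "the people of New Zealand and the people of the bay".split()
top = top_words(tokens, STOPLIST, 3)
print(top)
```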
TF.IDF • Takes into account both the frequency of a word in a corpus (TF, term frequency) and the inverse of the number of corpora the word appears in (IDF, inverse document frequency). • The highest scores are given to words which are common in the corpus we are looking at, but do not occur in many other corpora.
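The slides do not give the exact weighting formula, so the following sketch assumes one common variant: TF multiplied by the natural logarithm of (number of corpora / number of corpora containing the word). The function name `tf_idf` and the toy counts are hypothetical and are not drawn from the corpora discussed here.

```python
import math


def tf_idf(term, corpus_name, freqs):
    """TF.IDF of a term in one corpus.

    freqs maps each corpus name to a {term: frequency} dictionary.
    Assumed weighting: tf * ln(N / df), N = number of corpora,
    df = number of corpora in which the term occurs.
    """
    tf = freqs[corpus_name].get(term, 0)
    df = sum(1 for counts in freqs.values() if counts.get(term, 0) > 0)
    if df == 0:
        return 0.0
    return tf * math.log(len(freqs) / df)


# Invented toy frequencies for three corpora.
freqs = {
    "NZ": {"the": 100, "marae": 4},
    "UK": {"the": 120},
    "US": {"the": 110},
}
score_marae = tf_idf("marae", "NZ", freqs)  # occurs in one corpus only
score_the = tf_idf("the", "NZ", freqs)      # occurs in every corpus
print(score_marae, score_the)
```

Under this weighting a word that appears in every corpus scores zero however frequent it is, which is why function words drop out without a stoplist.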
20 Words in the NZ Corpus with Highest TF.IDF • Maori (1504.8), pakeha (339.5), Auckland (304.4), Otago (180.2), Dunedin (136.8), Waikato (135.1), Christchurch (127.7), Wellington (112.0), Waitangi (107.8), Aotearoa (91.7), Hutt (91.7), Ngati (83.6), Rotorua (75.6), Maoris (74.2), moa (72.4), Te (68.7), NZPA (67.5), marae (65.9), ANZUS (62.7), TVNZ (62.7), Waitaki (59.5) and Invercargill (57.9) • This suggests that TF.IDF is a good measure for finding words typical of a corpus.
Deviation from Randomness • One component is Bose-Einstein probability • If λ is the mean frequency of term t across all the corpora, the Bose-Einstein probability is the probability that a term occurs exactly f times in one of the corpora • Words which occur much more often in one corpus than they do on average across the corpora are typical of that corpus, and have low Bose-Einstein probability.
Inf1 is the negative of log base 2 of the Bose-Einstein probability, so words typical of a corpus will have high Inf1.
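The formula itself is not reproduced in this text. As a hedged sketch, the code below assumes the geometric approximation to the Bose-Einstein distribution used in the Deviation from Randomness framework, P(f) = (1 / (1 + λ)) · (λ / (1 + λ))^f, and takes its negative base-2 logarithm. The function name `inf1` is hypothetical, and this sketch is not claimed to reproduce the scores reported for the New Zealand corpus.

```python
import math


def inf1(f, lam):
    """-log2 of the (assumed) geometric Bose-Einstein probability.

    f   -- frequency of the term in the corpus of interest
    lam -- mean frequency of the term across all corpora (must be > 0)
    """
    # Work in log space so that large f does not underflow to zero:
    # -log2(P) = log2(1 + lam) + f * (log2(1 + lam) - log2(lam))
    log_one_plus = math.log2(1 + lam)
    return log_one_plus + f * (log_one_plus - math.log2(lam))


# With the same mean frequency, a higher observed count gives a higher Inf1.
low = inf1(2, 2.0)
high = inf1(10, 2.0)
print(low, high)
```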
The 20 words with highest Inf1 for the corpus of NZ English were: • Maori (28.66), Auckland (28.52), Pakeha (28.47), Otago (28.46), Wellington (28.16), Dunedin (28.12), Waikato (28.11), Christchurch (28.10), Waitangi (28.11), Maoris (27.85), Aotearoa (27.84), Hutt (27.74), Ngati (27.76), Zealand (27.76), Rotorua (27.67), moa (27.62), NZPA (27.55), Zealanders (27.53), marae (27.52), Te (27.52). • On its own, Inf1 appears to be a good indicator of which words are typical of a corpus.
Kullback-Leibler Divergence and Relevance Feedback (“more like this”)
KLD(t) • pR(t) is the number of times the word t is found in relevant documents, divided by the total number of words in relevant documents • pC(t) is the number of times the word t is found in the entire document collection, divided by the total number of words in the entire document collection • μ is a tuning parameter, which worked best when set to 0.5 • Instead of relevant documents we discuss the corpus of interest, and instead of non-relevant documents we have the other comparison corpora.
The 20 highest scoring words for NZ English were: • Zealand (1141), Maori (567), Auckland (359), Wellington (297), Te (175), Christchurch (148), Pakeha (128), Canterbury (89), Zealanders (82), Otago (67), Pacific (57), Rugby (52), Dunedin (51), Waikato (50), Maoris (48), NZ (44), Bay (44), Waitangi (40), Aotearoa (34), Hutt (34). Values in millionths. • All these words appear typical of NZ English • KLD(t) is a value for a single word. We can add together the KLD(t) values for every word, to derive a single value KLD(Dr, Dc) showing the divergence between relevant documents and non-relevant documents. It thus gives a measure of corpus similarity.
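The per-word score and its sum over the vocabulary can be sketched as below. The exact formula, including the role of μ, is not reproduced in this text, so the sketch assumes the plain, unsmoothed form KLD(t) = pR(t) · log2(pR(t) / pC(t)) and leaves μ out entirely. The function names and the toy counts are hypothetical.

```python
import math


def kld_term(term, target_counts, collection_counts):
    """Unsmoothed contribution of one word to the KL divergence.

    target_counts     -- {word: frequency} in the corpus of interest
    collection_counts -- {word: frequency} in the whole collection
                         (which includes the corpus of interest)
    """
    p_r = target_counts.get(term, 0) / sum(target_counts.values())
    if p_r == 0:
        return 0.0  # a word absent from the target corpus contributes nothing
    p_c = collection_counts[term] / sum(collection_counts.values())
    return p_r * math.log2(p_r / p_c)


def kld_total(target_counts, collection_counts):
    """Sum of KLD(t) over every word in the corpus of interest."""
    return sum(kld_term(t, target_counts, collection_counts)
               for t in target_counts)


# Invented toy counts, not taken from the corpora.
target = {"maori": 3, "the": 7}
collection = {"maori": 3, "the": 37}
score_maori = kld_term("maori", target, collection)
divergence = kld_total(target, collection)
print(score_maori, divergence)
```

Individual scores can be negative for words that are rarer in the corpus of interest than in the collection, but the total is never negative.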
Information Gain • Whereas the other measures tell us something about the strength of the association between a word and a corpus, IG is a single value for the power of a word to discriminate between corpora. • As an exercise in judging the usefulness of this measure, look at the 20 words in all five corpora with highest IG, and try to guess the corpora they are most typical of: • Zealand (332), Maori (213), India (153), Auckland (130), Australian (104), Wellington (98), Rs (Rupees) (75), Gandhi (73), Pounds (68), Clinton (67), Janata (65), Australia (64), Delhi (54), Singh (54), Queensland (50), Bombay (50), Aboriginal (50), Christchurch (49), pakeha (48), NSW (40). These IG values are in millionths.
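The slides do not state the IG formula. The sketch below assumes the definition used in Yang and Pedersen (1997), which equals the entropy of the class distribution minus its conditional entropy given the presence or absence of the word. Treating each corpus as a class and each roughly 2000-word section as a document is an assumption, and the function names and toy counts are hypothetical.

```python
import math


def entropy(counts):
    """Entropy in bits of the distribution proportional to counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(sections_with_term, sections_total):
    """IG of a word for discriminating between corpora.

    sections_with_term[i] -- sections of corpus i that contain the word
    sections_total[i]     -- all sections of corpus i
    """
    n = sum(sections_total)
    without = [tot - w for w, tot in zip(sections_with_term, sections_total)]
    p_t = sum(sections_with_term) / n
    # IG = H(corpus) - P(t) H(corpus | t) - P(not t) H(corpus | not t)
    return (entropy(sections_total)
            - p_t * entropy(sections_with_term)
            - (1 - p_t) * entropy(without))


# Invented toy counts for two corpora of 10 sections each.
perfect = information_gain([10, 0], [10, 10])  # word found in one corpus only
useless = information_gain([5, 5], [10, 10])   # word spread evenly
print(perfect, useless)
```

Unlike the other scores, the result says how well a word separates the corpora but not which corpus it is typical of, which is why the slide poses that as a guessing exercise.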
Conclusions (1) • In corpus linguistics, interest is mainly in the language used in corpora, while in information retrieval we are mainly interested in the information conveyed by a document • In IR, function words on a “stoplist” are routinely discarded, since these are not related to the topic of a document, but in CL, such words tell us a great deal about the grammatical structures used in a corpus. • The question of “which words are characteristic of a text” is common to both IR and CL. A number of statistical measures are thus relevant to both fields of study.
Conclusions (2) • Our initial results suggest that the IR measures of TF.IDF, Bose-Einstein probability and Kullback-Leibler Divergence (with μ = 0.5) are all good measures for finding the words most typical of New Zealand English. • A variant of KLD measures the divergence between two corpora • Information Gain provides a single score for a word, reflecting its ability to discriminate between corpora.