
Finding Predominant Word Senses in Untagged Text






Presentation Transcript


  1. Finding Predominant Word Senses in Untagged Text Diana McCarthy & Rob Koeling & Julie Weeds & John Carroll Department of Informatics, University of Sussex {dianam, robk, juliewe, johnca}@sussex.ac.uk ACL 2004

  2. Introduction • In word sense disambiguation, the heuristic of choosing the most common sense is extremely powerful. • It does not take the surrounding context into account. • It assumes the availability and quality of hand-tagged data. • One would expect the frequency distribution of the senses to depend on the domain of the text. • We present work on the use of an automatically acquired thesaurus and the WordNet Similarity package to find the predominant sense. • Does not require any hand-tagged text, such as SemCor.

  3. In SENSEVAL-2, even systems which show superior performance to the above heuristic often fall back on it where evidence from the context is not sufficient. • There is a strong case for obtaining a predominant sense from untagged corpus data, so that a WSD system can be tuned to the domain.

  4. SemCor comprises a relatively small sample of 250,000 words. • tiger -> audacious person / carnivorous animal • Our work is aimed at discovering the predominant senses from raw text. • Hand-tagged data is not always available. • Can produce predominant senses for the domain type required. • We believe that automatic means of finding a predominant sense can be useful for systems that use it for backing off, and for lexical acquisition where hand-tagged sources are of limited size.

  5. Many researchers are developing thesauruses from automatically parsed data. • Each target word is entered with an ordered list of “nearest neighbors”, ranked by their “distributional similarity” with the target word. • Distributional similarity is a measure indicating the degree to which two words share co-occurrence contexts. • The quantity and similarity of the neighbors pertaining to different senses will reflect the dominance of each sense. • For example, star in a corpus provided by Lin has the ordered neighbors: superstar, player, teammate, …, galaxy, sun, world, …

  6. Method • We use a thesaurus based on the method of Lin (1998), which provides the k nearest neighbors to each target word along with distributional similarity scores. We then use the WordNet Similarity package to weight the contribution that each neighbor makes to the various senses of the target word. • We rank each sense ws_i using: • Let N_w = {n_1, n_2, …, n_k} be the top-scoring k neighbors, with distributional similarity scores {dss(w, n_1), dss(w, n_2), …, dss(w, n_k)}
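The ranking formula itself did not survive the transcript; reconstructed from the definitions on this slide (and hedged as a reconstruction of the paper's notation), the prevalence score for sense ws_i of word w is:

```latex
\mathrm{Prevalence}(ws_i) \;=\;
\sum_{n_j \in N_w} \mathrm{dss}(w, n_j) \times
\frac{\mathrm{wnss}(ws_i, n_j)}
     {\sum_{ws_{i'} \in \mathrm{senses}(w)} \mathrm{wnss}(ws_{i'}, n_j)}
```

where wnss(ws_i, n_j) is the maximum WordNet similarity between sense ws_i and any sense of the neighbor n_j; each neighbor's vote is weighted by its distributional similarity to w and normalized over all senses of w.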

  7. Acquiring the Automatic Thesaurus • The thesaurus was acquired using the method described by Lin (1998). • For input we use grammatical relation data extracted using an automatic parser. • A noun w is described using a set of co-occurrence triples <w, r, x> and associated frequencies, where r is a grammatical relation and x is a possible co-occurrence with w in that relation. • For every pair of nouns where each noun has a total frequency greater than 9 in the triple data, compute their distributional similarity. • If T(w) is the set of co-occurrence types (r, x) such that I(w, r, x) is positive, then the distributional similarity of two nouns w and n is dss(w, n):
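The dss formula is missing from the transcript; reconstructed from Lin (1998), as cited on this slide:

```latex
\mathrm{dss}(w, n) \;=\;
\frac{\sum_{(r,x) \in T(w) \cap T(n)} \bigl( I(w,r,x) + I(n,r,x) \bigr)}
     {\sum_{(r,x) \in T(w)} I(w,r,x) \;+\; \sum_{(r,x) \in T(n)} I(n,r,x)}
```

Intuitively, this is the information carried by the co-occurrence types the two nouns share, as a fraction of the information carried by all of their co-occurrence types.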

  8. Automatic Retrieval and Clustering of Similar Words, Dekang Lin, Proceedings of COLING-ACL 98 • The meaning of an unknown word can often be inferred from its context. • We use a broad-coverage parser to extract dependency triples from a text corpus. A dependency triple consists of two words and the grammatical relationship between them. • The triples extracted from “I have a brown dog” are: • The description of a word w consists of the frequency counts of all dependency triples that match the pattern (w, *, *). • For example, the description of the word cell is:
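The slide's example triples and the description of cell did not survive the transcript. As a minimal sketch of the (w, *, *) pattern, here is a Python illustration; the relation names for the example sentence are assumptions, not the parser's actual output:

```python
from collections import Counter

# Hypothetical dependency triples for "I have a brown dog",
# as (head, relation, dependent); relation labels are illustrative.
triples = [
    ("have", "subj", "I"),
    ("have", "obj", "dog"),
    ("dog", "det", "a"),
    ("dog", "adj-mod", "brown"),
]

def description(word, triples):
    """Frequency counts of all dependency triples matching (word, *, *)."""
    return Counter((r, x) for w, r, x in triples if w == word)

print(description("dog", triples))  # counts for (det, a) and (adj-mod, brown)
```

In a real system the triples would be extracted by a broad-coverage parser over the whole corpus, and the counts would feed the information measure on the next slides.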

  9. Measure the amount of information in the statement that a randomly selected triple is (w, r, w’), when we do not know the value of ||w, r, w’||. • An occurrence of the triple (w, r, w’) can be regarded as the co-occurrence of three events. • A: a randomly selected word is w. • B: a randomly selected dependency type is r. • C: a randomly selected word is w’. • Assume that A and C are conditionally independent given B, thus the probability is given by:
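The probability formula is missing from the transcript; under the stated independence assumption, with maximum-likelihood estimates from the triple counts, it is (a reconstruction from Lin 1998):

```latex
P(w, r, w') = P(B)\,P(A \mid B)\,P(C \mid B)
\;\approx\;
\frac{\lVert *, r, * \rVert}{\lVert *, *, * \rVert}
\times
\frac{\lVert w, r, * \rVert}{\lVert *, r, * \rVert}
\times
\frac{\lVert *, r, w' \rVert}{\lVert *, r, * \rVert}
```

Here ||w, r, w'|| denotes the frequency of the triple, and a * is a wildcard summed over.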

  10. Measure the amount of information when we do know the value of ||w, r, w'||; the difference is the information contained in ||w, r, w'|| = c. • Let T(w) be the set of pairs (r, w') such that I(w, r, w') is positive, and define the similarity sim(w1, w2) between words w1 and w2 as follows:
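The resulting information measure, again lost from the transcript, works out (following Lin 1998) to:

```latex
I(w, r, w') \;=\; \log
\frac{\lVert w, r, w' \rVert \times \lVert *, r, * \rVert}
     {\lVert w, r, * \rVert \times \lVert *, r, w' \rVert}
```

i.e. the log-ratio of the observed triple frequency to the frequency expected under the independence assumption of the previous slide; the similarity sim(w1, w2) then takes the same form as the dss formula on slide 7.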

  11. The WordNet Similarity Package • The WordNet Similarity package supports a range of similarity scores. • lesk: maximizes the number of overlapping words between the glosses, or definitions, of the senses. • jcn: each synset is incremented with the frequency counts, from the corpus, of all words belonging to that synset. • Calculate the “information content” IC(s) = -log(p(s)). • Djcn(s1, s2) = IC(s1) + IC(s2) - 2 * IC(s3), where s3 is the most informative superordinate synset of s1 and s2. • jcn(s1, s2) = 1/Djcn(s1, s2)
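The jcn computation on this slide can be sketched directly. A minimal Python illustration, using made-up synset probabilities rather than real corpus counts, and assuming the two synsets are distinct so that Djcn > 0:

```python
import math

def information_content(p):
    """IC(s) = -log p(s), where p(s) is the corpus-estimated synset probability."""
    return -math.log(p)

def jcn(ic_s1, ic_s2, ic_lcs):
    """jcn(s1, s2) = 1 / (IC(s1) + IC(s2) - 2 * IC(s3)), where s3 is the
    most informative superordinate synset of s1 and s2 (assumed Djcn > 0)."""
    d_jcn = ic_s1 + ic_s2 - 2 * ic_lcs
    return 1.0 / d_jcn

# Illustrative values only (not derived from a real corpus):
ic_a = information_content(0.001)    # a rare, specific synset
ic_b = information_content(0.002)
ic_lcs = information_content(0.05)   # their more frequent common ancestor
print(jcn(ic_a, ic_b, ic_lcs))
```

A more informative (rarer) common superordinate makes Djcn smaller and the jcn similarity larger, which matches the intuition that senses sharing a specific ancestor are closer than those sharing only a generic one.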

  12. Experiment with SemCor • We generated a thesaurus entry for all polysemous nouns which occurred more than twice in SemCor and more than 9 times in the BNC in the grammatical relations. jcn uses the BNC corpus, and the thesaurus entry size k is set to 50. • The accuracy of finding the predominant sense in SemCor, and the WSD accuracy on SemCor when using our first sense in all contexts, are as follows:

  13. We chose jcn for the remaining experiments because it gave good results for finding the predominant sense and is more efficient. • There are cases where the acquired first sense disagrees with SemCor yet is intuitively plausible. • pipe -> tobacco pipe / tube made of metal or plastic used to carry water, oil or gas, etc., with nearest neighbors tube, cable, wire, tank, hole, cylinder, … • soil -> filth, stain, the state of being unclean / dirt, ground, earth; the latter seems intuitive given expected usage in modern British English.

  14. Experiment with SENSEVAL-2 English All-Words Data • To see how well the predominant sense performs on a WSD task, we use the SENSEVAL-2 all-words data. • We do not claim that ours is a method of WSD in itself; however, it is important to know its performance for any system that uses it. • Generate a thesaurus entry for all polysemous nouns in WordNet, and compare the results using the first sense in SemCor and in the SENSEVAL-2 all-words data itself. • Trivially label all monosemous items.

  15. The automatically acquired predominant sense performs nearly as well as the hand-tagged SemCor first sense. • It uses only raw text, with no manual labeling. • The items not covered by our method were those with insufficient grammatical relations for the triples employed. • today and one each occurred 5 times in the test data. • Extending the grammatical relations used for building the thesaurus should improve the coverage.

  16. Experiment with Domain-Specific Corpora • A major motivation is to try to capture changes in the ranking of senses for documents from different domains. • We selected two domains, SPORTS (35,317 docs) and FINANCE (117,734 docs), from the Reuters corpus and acquired thesauruses from these corpora. • We selected a number of words and evaluated them qualitatively. The words were not chosen randomly, since we anticipated different predominant senses for these words. • Additionally, we evaluated quantitatively using the Subject Field Codes resource, which annotates WordNet synsets with domain labels. We selected words that have at least one SPORTS and one ECONOMY label, resulting in 38 words.

  17. The results are summarized below with the WordNet sense number for each word. • Most words show a change in predominant sense. • The first senses of words like division, tie and goal shift towards the more domain-specific senses. • The first sense of the word share remains the same; however, the sense stock certificate ends up ranked higher for the FINANCE domain.

  18. The figure shows the distribution of domain labels of the predominant senses acquired from the SPORTS and FINANCE corpora for the set of 38 words. • Both domains have a similar percentage of factotum (domain-independent) labels. • As we expect, the other peaks correspond to the economy label for the FINANCE corpus and the sports label for the SPORTS corpus.

  19. Conclusions • We have devised a method that uses raw corpus data to automatically find a predominant sense for nouns in WordNet. • We have demonstrated the possibility of finding predominant senses in domain-specific corpora. • In the future we will investigate the effect of frequency and of the choice of distributional similarity measure, and apply our method to words of parts of speech other than noun. • It would be possible to use this method with another sense inventory, given a measure of semantic relatedness between the neighbors and the senses.
