Annotating the WordNet Glosses
Ben Haskell <ben@clarity.princeton.edu>
Annotating the Glosses
• Annotating open-class words with their WordNet sense tag (a.k.a. sense-tagging)
• A disambiguation task: the process of linking an instance of a word to the WordNet synset representing its context-appropriate meaning, e.g. run a company vs. run an errand
Glosses as node points in the network of relations
• Once a word’s gloss is annotated, the synsets for all conceptually related words used in the gloss can be accessed via their sense tags
• Situates the word in an expanded network of links to other semantically related words/concepts in WordNet (see the sketch below)
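To make the disambiguation task and the resulting network links concrete, here is a minimal sketch using NLTK's WordNet interface; this is an illustration for readers, not the project's own tooling:

    # List the candidate verb senses of 'run'; sense-tagging means picking
    # the one that fits the context ("run a company" vs. "run an errand").
    from nltk.corpus import wordnet as wn

    for syn in wn.synsets('run', pos=wn.VERB)[:5]:
        print(syn.name(), '-', syn.definition())

    # Once a gloss word carries a sense tag, its synset's own relations
    # (hypernyms, hyponyms, ...) become reachable from the gloss.
    chosen = wn.synsets('run', pos=wn.VERB)[0]
    print(chosen.hypernyms())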
Annotating the Glosses
• Automatically tag monosemous words/collocations
• For gold-standard quality, sense-tagging of polysemous words must be done manually
• More accurate sense-tagged data means better results for WSD systems, which in turn means better performance from applications that depend on WSD
System overview
• Preprocessor
• Gloss “parser” and tokenizer/lemmatizer
• Semantic class recognizer
• Noun phrase chunker
• Collocation recognizer (globber)
• Automatic sense tagger for monosemous terms
• Manual tagging interface
Logical structure of a Gloss
• Smallest unit is a word, contracted form, or non-lexical punctuation
• Collocations are decomposed into their constituent parts
• Allows coding of discontinuous collocations (see the sketch below)
• A collocation can be treated either as a single unit or as a sequence of forms
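A hypothetical sketch of how decomposed, possibly discontinuous collocations could be coded; the field names and id scheme here are illustrative, not the project's actual schema:

    # Each word form carries zero or more collocation ids, so 'North America'
    # and 'South America' can share the constituent 'America' even though
    # 'North ... America' is discontinuous in the text.
    gloss = [
        {'wf': 'North',   'colls': {'c1'}},
        {'wf': 'and',     'colls': set()},
        {'wf': 'South',   'colls': {'c2'}},
        {'wf': 'America', 'colls': {'c1', 'c2'}},  # shared constituent
    ]

    def collocation(gloss, cid):
        # Recover a collocation as the sequence of its constituent forms.
        return [w['wf'] for w in gloss if cid in w['colls']]

    print(collocation(gloss, 'c1'))  # ['North', 'America']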
Example glosses
• n. pass, toss, flip: (sports) the act of throwing the ball to another member of your team; "the pass was fumbled"
• n. brace, suspender: elastic straps that hold trousers up (usually used in the plural)
• v. kick: drive or propel with the foot
Gloss “parser”
• Regularization & clean-up of the gloss
• Recognize & XML-tag <def>, <aux>, <ex>, <qf>, verb arguments, and domain <classif>
• <aux> and <classif> contents do not get sense-tagged
• Replace XML-unfriendly characters (&, <, >) with XML entities (see the sketch below)
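As an illustration, the "pass" gloss from the earlier slide might come out of the parser looking roughly like this; the tag names follow the slide, but the exact structure is an assumption, and Python's standard library stands in for the entity-replacement step:

    from xml.sax.saxutils import escape

    marked_up = (
        '<gloss>'
        '<classif>(sports)</classif>'    # domain label: not sense-tagged
        '<def>the act of throwing the ball to another member of your team</def>'
        '<ex>"the pass was fumbled"</ex>'
        '</gloss>'
    )

    # escape() replaces the XML-unfriendly characters with entities.
    print(escape('AT&T uses <angle brackets>'))
    # AT&amp;T uses &lt;angle brackets&gt;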
Tokenizer
• Isolate word forms
• Differentiate non-lexical from lexical punctuation (see the sketch below)
  • E.g., sentence-ending periods vs. periods in abbreviations
• Recognize apostrophes vs. quotation marks
  • E.g., states’ rights vs. ‘college-bound students’
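The in-house tokenizer's rules are not spelled out in the slides; this toy sketch shows the kind of test involved in separating sentence-ending periods from abbreviation periods (the abbreviation list is an assumption):

    import re

    ABBREVIATIONS = {'e.g.', 'etc.', 'vs.', 'n.', 'v.', 'a.k.a.'}

    def split_final_period(token):
        # Keep the period when the token is a known abbreviation (lexical
        # punctuation); otherwise split it off as sentence-ending.
        if token.lower() in ABBREVIATIONS:
            return [token]
        m = re.fullmatch(r'(.+?)(\.)', token)
        return [m.group(1), m.group(2)] if m else [token]

    print(split_final_period('etc.'))   # ['etc.']
    print(split_final_period('team.'))  # ['team', '.']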
Lemmatizer
• A lemma is the WordNet entry form plus WordNet part of speech
• Inflected forms are uninflected using a stemmer developed in-house specifically for this task
• A <wf> may be assigned multiple potential lemmas
  • saw: lemma=“saw%1|saw%2|see%2”
  • feeling: lemma=“feeling%1|feel%2”
Lemmatizer, cont.
• Exceptions: stopwords/phrases
  • Closed-class words (prepositions, pronouns, conjunctions, etc.)
  • Multi-word terms such as “by means of”, “according to”, “granted that”
• Hyphenated terms not in WordNet get split and separately lemmatized (see the sketch below)
  • E.g., over-fed becomes over + fed
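The in-house stemmer is not public; as a rough stand-in, NLTK's morphy machinery can produce the same kind of multi-lemma output for a surface form (wn._morphy is NLTK's internal helper that returns all candidate lemmas for a part of speech):

    from nltk.corpus import wordnet as wn

    def candidate_lemmas(form):
        # Collect every (lemma, pos) pair WordNet accepts for this surface
        # form, mirroring the multi-lemma <wf> assignments shown above.
        pairs = []
        for pos in (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV):
            for lemma in wn._morphy(form, pos):
                pairs.append((lemma, pos))
        return pairs

    print(candidate_lemmas('saw'))      # e.g. [('saw', 'n'), ('saw', 'v'), ('see', 'v')]
    print(candidate_lemmas('feeling'))  # e.g. [('feeling', 'n'), ('feel', 'v')]

    # Hyphenated terms absent from WordNet are split first,
    # then each part is lemmatized on its own.
    print([candidate_lemmas(part) for part in 'over-fed'.split('-')])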
Semantic class recognizer
• Recognizes and marks up parenthesized and free text belonging to a finite set of semantic classes
  • chem(ical symbol), curr(ency), date, d(ate)range, math, meas(ure phrase), n(umeric)range, num(ber), punc(tuation), symb(olic text), time, year
• Words and phrases in these classes will not be sense-tagged (see the sketch below)
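A small illustrative sketch of the pattern matching such a recognizer might use; the project's actual patterns are not given in the slides, so these regexes are assumptions covering just three of the classes:

    import re

    SEM_CLASSES = {
        'year': re.compile(r'\b(1\d{3}|20\d{2})\b'),
        'time': re.compile(r'\b\d{1,2}:\d{2}\b'),
        'curr': re.compile(r'[$]\d[\d,.]*'),
    }

    def classify(text):
        # Return (class, matched text) pairs; matched spans would be
        # marked up and excluded from sense-tagging.
        return [(name, m.group()) for name, rx in SEM_CLASSES.items()
                for m in rx.finditer(text)]

    print(classify('first minted for $1.50 in 1999'))
    # [('year', '1999'), ('curr', '$1.50')]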
Noun Phrase chunker
• Isolates noun phrases (“chunks”) in order to narrow the scope for finding noun collocations in the next stage
• Glosses are not otherwise syntactically parsed
• POS-tagged the glosses with Thorsten Brants’s TnT statistical tagger, trained for this task
Noun Phrase chunker, cont.
• Chunked noun phrases with Steven Abney’s partial parser Cass, trained for this task (see the sketch below)
• Enabled automatic recognition of otherwise ambiguous noun compounds and fixed expressions
  • E.g., opening move (JJ NN vs. VBG NN vs. VBG VB vs. NN VB), bill of fare (NN IN NN vs. VB IN NN)
• Increased noun collocation coverage by 25% (types) and 29% (tokens)
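TnT and Cass are the tools the project actually used; as a rough stand-in for readers without them, NLTK's RegexpParser can chunk a POS-tagged gloss with a toy NP grammar (the grammar here is an assumption):

    import nltk

    # Toy NP grammar: optional determiner, premodifiers, then nouns.
    chunker = nltk.RegexpParser('NP: {<DT>?<JJ|VBG>*<NN.*>+}')

    tagged = [('the', 'DT'), ('opening', 'VBG'), ('move', 'NN')]
    print(chunker.parse(tagged))
    # (S (NP the/DT opening/VBG move/NN))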
Collocation recognizer
• Bag-of-words approach (see the sketch below)
  • To find ‘North_America’, find glosses that have both ‘North’ and ‘America’
• Four passes
  • Ghost: ‘bring_home_the_bacon’
    • mark ‘bacon’ so it won’t be tagged as monosemous
  • Contiguous: ‘North_America’
  • Disjoint: North (and) [(South) America]
  • Examples: tag the synset’s collocations in its gloss
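A minimal sketch of the bag-of-words filtering step, assuming a simple inverted index from words to gloss ids (the data structures are illustrative):

    def candidate_glosses(collocation, index):
        # Only glosses containing ALL constituent words can contain the
        # collocation, so intersect their posting sets.
        words = collocation.split('_')
        ids = set(index.get(words[0], set()))
        for w in words[1:]:
            ids &= index.get(w, set())
        return ids

    index = {'North': {3, 7, 12}, 'America': {7, 12, 40}}
    print(candidate_glosses('North_America', index))  # {7, 12}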
Automatic sense-tagger
• Tag monosemous words (see the sketch below)
• Words that have…
  • …only one lemmatized form
  • …only one WordNet sense
  • …not been marked as possibly ambiguous
    • i.e., non-wait-list words, non-‘bacon’ words
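The monosemy test itself is easy to state; here it is sketched against NLTK's WordNet interface as a stand-in for the project's own lookup, ignoring the wait-list and ghost checks described above:

    from nltk.corpus import wordnet as wn

    def auto_taggable(lemma, pos):
        # A form can be tagged automatically when its lemma has exactly
        # one WordNet sense (wait-list and 'bacon'-style ghost words
        # would already have been filtered out).
        return len(wn.synsets(lemma, pos)) == 1

    print(auto_taggable('oxygen', wn.NOUN))  # True: a single noun sense
    print(auto_taggable('run', wn.VERB))     # False: many verb senses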
The mantag interface
• Simplicity
  • Taggers will repeat the same actions hundreds of times per day
• Automation
  • Instead of typing the 148,000 search terms, use a centralized list
  • Also allows easy tracking of the double-checking process
Aim of ISI Effort
• Jerry Hobbs, Ulf Hermjakob, Nishit Rathod, Fahad al-Qahtani
• Gold-standard translation of glosses into first-order logic with reified events
ISI Effort examples
In: gloss for dance, v, 2: move in a graceful and rhythmic way
  move → move#v#2
  in → ignore
  a → ignore
  graceful → graceful#a#1
  and → ignore
  rhythmic → rhythmic#a#1
  way → way#n#8
Out: dance-V-2'(e0,x) -> move-V-2'(e1,x) & in'(e2,e1,y) & graceful-A-1'(e3,y) & rhythmic-A-1'(e4,y) & way-N-8'(e5,y)
ISI Effort examples, cont.
In: gloss for allegro, n, 2: a musical composition or passage performed quickly
  a → ignore
  musical composition → musical_composition#n#1
  or → ignore
  passage → musical_passage#n#1
  performed → perform#v#2
  quickly → quickly#r#4
Out:
  allegro-N-2'(e0,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x) & perform-V-2'(e2,y,x) & quick-D-4'(e3,e2)
  musical_composition-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)
  musical_passage-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)
ISI Method
• Identify the most common gloss patterns and convert them first
• Parse
  • Using Charniak’s parser: uneven, sometimes bizarre results (“aspen” tagged as VBN)
  • Using Hermjakob’s CONTEX parser: greater local control
ISI Progress
• Completed glosses of nouns with the patterns:
  • NG (P NG)*: 45% of nouns
  • NG ((VBN | VING) NG): an additional 15% of nouns
  • 45 + 15 = 60% complete
• But gloss patterns follow a Zipf distribution, so each additional pattern covers progressively fewer glosses