

  1. Corpora and Statistical Methods, Lecture 5 • Albert Gatt

  2. In this lecture • We begin to consider the problem of lexical acquisition beyond collocations • syntax-semantics interface: • verb subcategorisation frames • prepositional phrase attachment ambiguity • verb subcat preferences • semantic similarity (“thesaurus relations”) • We also introduce some measures for evaluation

  3. The problem of evaluation: How are the results of automatic acquisition to be assessed?

  4. Basic rationale • For a given classification problem, we have: • a “gold standard” against which to compare • our system’s results, compared to the target gold standard: • false positives (fp) • false negatives (fn) • true positives (tp) • true negatives (tn) • Performance typically measured in terms of precision and recall.

  5. Precision • Definition: • proportion of the items selected by the system that are correct • i.e. proportion of true positives out of all the system’s positive classifications
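In symbols, using the counts from slide 4:

$$\text{precision} = \frac{tp}{tp + fp}$$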

  6. Recall • Definition: • proportion of the actual target (“gold standard”) items that our system classifies correctly • the denominator is the total no. of items that should be classified as positive, including those the system doesn’t get
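In symbols:

$$\text{recall} = \frac{tp}{tp + fn}$$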

  7. Combining precision and recall • Typically use the F-measure as a global estimate of performance against the gold standard • We need some factor (alpha) to weight precision against recall; 0.5 gives them equal weighting
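The standard formulation is the weighted harmonic mean:

$$F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}}$$

With $\alpha = 0.5$ this reduces to $F = \frac{2PR}{P+R}$.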

  8. Fallout • We can also measure fallout: the proportion of mistaken positive classifications (fp) out of the total no. of items that are negative in the gold standard (fp + tn)
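In symbols:

$$\text{fallout} = \frac{fp}{fp + tn}$$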

  9. Why precision and recall? • We could also use simpler measures: • accuracy: % of things we got right • error: % of things we got wrong • Problems: • tn is usually very large, whereas tp, fn, fp are smaller. Precision and recall are more sensitive to these small figures. • Accuracy is only sensitive to the number of errors. F-measure distinguishes true positives from false positives.
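As a quick illustration of why accuracy misleads when true negatives dominate, here is a minimal sketch with hypothetical counts:

```python
# Hypothetical confusion-matrix counts: tn dwarfs the other cells.
tp, fp, fn, tn = 10, 40, 50, 9900

precision = tp / (tp + fp)                                  # 0.20
recall = tp / (tp + fn)                                     # ~0.17
f_measure = 2 * precision * recall / (precision + recall)   # alpha = 0.5
accuracy = (tp + tn) / (tp + fp + fn + tn)                  # ~0.99, despite poor P and R

print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f} acc={accuracy:.2f}")
```

Accuracy is near-perfect because the huge tn count swamps the errors, while precision, recall, and F expose the weak classifier.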

  10. Evaluation with humans in the loop • Precision and recall rely on a “gold standard”, i.e. a pre-annotated corpus. • Another form of evaluation is against human subjects: • Correlational: correlation of output against human judgements; • Task-based: use of the output by humans in a task. • e.g. how easily can humans read generated text? • depends on whether there is a well-defined task

  11. Lexical acquisition: overview

  12. Lexical acquisition • Involves discovering properties of words or classes of words. • Examples: • verbs like eat take an object NP denoting some kind of food • nouns like house, theatre and shack denote kinds of edifices, are intuitively “related”, so should behave similarly in syntax • modifiers like with the icing are likely candidates for attachment to cake but not to eat

  13. What is a Lexicon? • Early generative grammar: • lexicon = words + exceptional behaviour • The idea was: • we have general principles governing syntax, morphology etc • the lexicon is rather “boring”: it is only a repository of whatever isn’t covered by the general principles

  14. What is a lexicon? • Contemporary theories: • grammar knowledge is knowledge of the lexicon (HPSG, Tree Adjoining Grammar, Categorial Grammar) • lexicon as interface between all the components of the language faculty (Jackendoff 2002) • Semantic Bootstrapping: Pinker 1989 suggests that lexical acquisition is a prerequisite to syntax acquisition

  15. Applications (sample) • PP Attachment ambiguities: • the children ate the cake with a spoon • the children ate the cake with the icing • seems to depend on different lexical preferences: cake + icing vs. eat + spoon • Verb subcategorisation preferences: • I (gave/sent) the book to Melanie • I (gave/sent) Melanie the book • Lexicography: • semantic classes, e.g. HUMAN/ROLE like {professor, lecturer, reader}, should exhibit the same syntactic behaviour

  16. Application 1: Verb Subcategorisation

  17. Problem definition • Verbs have subcategorisation frames: • verbs with similar semantic arguments (AGENT, PATIENT etc) can be grouped together • different semantic arguments can be expressed differently in syntax • e.g. send, give etc allow the dative alternation: • send X to Y / send Y X • give X to Y / give Y X • should be distinguished from donate etc, which don’t (cf. I donated money to the charity vs. *I donated the charity money)

  18. Uses for parsing • Example: • she told the lady where she had grown up • she found the place where she had grown up • Is the where-clause a clausal argument, or an adverbial adjunct? • depends on the verb: tell has a [V NP S] subcat frame, find doesn’t.

  19. Existing resources: Verbnet • Verbnet: online verb lexicon for English • groups verbs into semantic classes • gives subcat information and thematic roles • http://verbs.colorado.edu/~mpalmer/projects/verbnet.html • Verbnet is based on Levin’s (1993) classification of English verbs.

  20. Verbnet example: class admit-65 • Members: admit, allow, include, permit, welcome • Frame 1 (e.g. she allowed us here): <FRAME> … <SYNTAX> <NP value="Agent"/> <VERB/> <NP value="Theme"/> <NP value="Location"/> </SYNTAX> … </FRAME> • Frame 2 (e.g. she admitted us): <FRAME> … <SYNTAX> <NP value="Agent"/> <VERB/> <NP value="Theme"/> </SYNTAX> … </FRAME>

  21. Verbnet and other resources • Other resources: Framenet • http://framenet.icsi.berkeley.edu/ • verbs annotated with detailed semantic and syntactic info • lexical database + annotated corpus examples • Though very large, such resources are not exhaustive. • Automatic acquisition would help to expand them.

  22. Brent’s (1993) algorithm • Aim: discover the subcat frames of verbs from a corpus. • Ingredients: • Cues: a set of patterns of words & syntactic categories which indicate the presence of a frame: essentially a regular expression • Hypothesis testing: test the null hypothesis (H0) that a given frame is not appropriate for a verb; reject H0 if the cue co-occurs with the verb too often for chance to explain it.

  23. Example cue • [NP NP] frame (e.g. the woman entered the room) • (OBJ|SUBJ_OBJ|CAP) (PUNC|CC) → [NP NP] frame • OBJ = object personal pronoun (him etc) • SUBJ_OBJ = subject or object pers. pro (you) • CAP = word in uppercase • PUNC = punctuation mark • CC = subordinating conjunction (if, because etc) • Example match: greet Steve-CAP ,-PUNC
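A minimal sketch of this kind of cue matching over raw tokens; the category word lists are illustrative, not Brent’s actual definitions:

```python
import re

OBJ_PRONOUNS = {"him", "her", "them", "me", "us"}   # OBJ
SUBJ_OBJ_PRONOUNS = {"you", "it"}                   # SUBJ_OBJ
CONJUNCTIONS = {"if", "because", "although"}        # CC

def label(token):
    """Assign one of the cue categories from slide 23 to a token."""
    low = token.lower()
    if low in OBJ_PRONOUNS:
        return "OBJ"
    if low in SUBJ_OBJ_PRONOUNS:
        return "SUBJ_OBJ"
    if re.fullmatch(r"[^\w\s]+", token):
        return "PUNC"
    if low in CONJUNCTIONS:
        return "CC"
    if token[0].isupper():
        return "CAP"
    return "OTHER"

def matches_np_np_cue(tokens, verb_index):
    """True if the two tokens after the verb match (OBJ|SUBJ_OBJ|CAP)(PUNC|CC)."""
    window = [label(t) for t in tokens[verb_index + 1 : verb_index + 3]]
    return (len(window) == 2
            and window[0] in {"OBJ", "SUBJ_OBJ", "CAP"}
            and window[1] in {"PUNC", "CC"})

# Example from the slide: "... greet Steve , ..."
print(matches_np_np_cue(["they", "greet", "Steve", ","], 1))  # True
```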

  24. Rationale behind cues • If the cue applies to a verb very frequently, we conclude that the corresponding frame applies to it. • It is very unlikely for a phrase to match the cue [NP NP] in the absence of a transitive verb.

  25. Hypothesis Testing • Let c be a cue for frame F • Let v be a verb occurring n times in the corpus • Suppose v occurs m ≤ n times with cue c • Note: the cue may be wrong, i.e. a false positive!

  26. Hypothesis testing – Step 1 • Assume a binomial distribution, based on the indicator random variable v(f): • v(f) = 1 if the combination of v+c is a true indicator of the presence of frame f • v(f) = 0 if the v+c combination is there, but we don’t really have frame f • ε = the probability of error (false positive), i.e. the probability that v(f) = 0 given v+c

  27. Hypothesis testing – step 2 • Calculate the probability of error: • likelihood that v does not permit frame f given that v occurs with cue c m times or more • basically an “n choose k” problem: • what are the chances that v doesn’t permit f given m occurrences of v+c? • need an estimate of the error rate of cue c, i.e. the probability that cue c is a false indicator of frame F

  28. Probability of error in rejecting H0 • Let ε be the error rate of the cue (false positives): the chance of finding c when f is not the case • Let m be the frequency of v with cue c, out of n occurrences of v • Then the probability that v does not permit frame f, despite the observed cues, is: $$p_E = \sum_{r=m}^{n} \binom{n}{r} \epsilon^r (1 - \epsilon)^{n-r}$$

  29. Explanation • If ε is the probability that cue c falsely indicates frame f, then: • given that v+c occurs m times or more out of n, • we risk an incorrect rejection of H0 with probability pE, having observed v+c m times

  30. Accepting or rejecting H0 • Brent (1993) proposed a threshold value. If the probability of error is less than the threshold, then we reject H0 • e.g. set threshold at 0.02 • System has good precision, but low recall • many low-frequency verbs not assigned frames due to lack of evidence.
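A minimal sketch of the test, using the binomial error model from slides 26–28 (function names are illustrative):

```python
from math import comb

def p_error(n, m, eps):
    """Probability of seeing cue c with verb v m or more times out of n
    occurrences, under H0 (v does not permit the frame); eps is the
    cue's false-positive rate."""
    return sum(comb(n, r) * eps**r * (1 - eps)**(n - r)
               for r in range(m, n + 1))

def assign_frame(n, m, eps, threshold=0.02):
    """Reject H0, i.e. assign the frame to v, if the probability of the
    observed cue count under H0 falls below the threshold."""
    return p_error(n, m, eps) < threshold
```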

  31. Improvements • Manning (1993): applies POS tagging before running Brent’s cue detection. • NB: this combines two error-prone systems (cues + tagger)! • Example: • cue c has ε=0.25. • c occurs 11/80 times with v • then pE = 0.011 < 0.02, so H0 is still rejected • I.e. given appropriate hypothesis testing, an unreliable cue can be useful if it occurs enough times.

  32. Application 2: PP Attachment ambiguity

  33. PP Attachment • Pervasive problem for NL Parsing: • PP follows an object NP • Problem is whether PP attaches to VP or NP • Heuristics for improvement: • lexical co-occurrence likelihoods (cake + (with) icing vs. eat + (with) spoon) • local operations: preference for attaching PP as low as possible in the tree (i.e. to the NP)

  34. Approach 1 • Moscow sent 5000 soldiers into Afghanistan • Compute co-occurrence counts between: • verb & preposition (send + into) • noun & preposition (soldier + into) • Compare the two hypotheses using a log-likelihood ratio: $$\lambda(v, n, p) = \log_2 \frac{P(p \mid v)}{P(p \mid n)}$$ where the probabilities are estimated from co-occurrence counts, e.g. $P(p \mid v) = C(v, p) / C(v)$

  35. Limitations of Approach 1 • Lexical co-occurrence stats ignore syntactic preferences. • The preference seems to be to attach new material to the “last seen” syntactic node • Lynn Frazier’s minimal attachment principle • This predicts preference for PP attachment to object NP, unless there is strong evidence for the contrary.

  36. Why minimal attachment is important • Chrysler confirmed that it would end its venture with Maserati. • PP of interest: with Maserati • occurs frequently with end (e.g. the play ended with a song) • occurs frequently with venture too • So simple frequencies of lexical co-occurrence will not be able to decide (or risk the wrong decision)

  37. Approach 2: Hindle & Rooth (1993) • Event space of interest: • potentially ambiguous sentences with PPs • Given a PP headed by p, a VP headed by v and an NP headed by n, define two indicator random variables: • VA = 1 iff PP attaches to VP • NA = 1 iff PP attaches to NP • possible in principle for both to be 1: • he put the book [on WW2] [on the table] • VA = 1, NA = 1

  38. Hindle & Rooth - II • Given the sequence [v… n… PP], we calculate the probability that VA = 1 and NA = 1, given the verb and noun: • P(VA, NA | v, n) = P(VA | v) P(NA | n) • NB: we assume that attachment to the NP and attachment to the VP are independent

  39. Hindle and Rooth - III • To determine whether on WW2 attaches to the NP (the book) or the VP (put): • P(attach(p)=n | v, n) = P(VA=0 ∨ VA=1 | v) · P(NA=1 | n) = 1 · P(NA=1 | n) = P(NA=1 | n) • similarly, P(attach(p)=v | v, n) = P(VA=1 | v) · P(NA=0 | n)

  40. Some explanation • Why do we only need to consider NA for P(attach(p)=n | v, n)? • any one PP can only attach to the VP or the NP, not both • (VA = 1 and NA = 1 can only both hold if a sentence has multiple PPs) • If VA = 1 and NA = 1 for a sentence, then: • the first PP must attach to the NP • the second PP must attach to the VP • otherwise, we’d have crossing branches • However, to determine whether a specific PP attaches to the VP, we must exclude the possibility that NA = 1 (hence the factor P(NA=0 | n)) • this accounts for cases where there are two PPs, both attaching to the NP

  41. Final step • Once we’ve computed, for a given PP, the probabilities of VP attachment and NP attachment, we compare them using a log-likelihood ratio: $$\lambda(v, n, p) = \log_2 \frac{P(\mathrm{attach}(p)=v \mid v, n)}{P(\mathrm{attach}(p)=n \mid v, n)}$$ • If the value is negative, we choose NP attachment; if positive, we choose VP attachment
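A minimal sketch of the resulting decision rule, assuming we already have estimates of P(VA=1|v) and P(NA=1|n) (all names are illustrative):

```python
from math import log2

def attachment_score(p_va, p_na):
    """lambda = log2 of P(attach=v | v,n) / P(attach=n | v,n),
    i.e. log2 [P(VA=1|v) * P(NA=0|n) / P(NA=1|n)]."""
    return log2(p_va * (1 - p_na) / p_na)

def choose_attachment(p_va, p_na):
    """Positive lambda -> VP attachment; negative -> NP attachment."""
    return "VP" if attachment_score(p_va, p_na) > 0 else "NP"
```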

  42. Estimating the initial probabilities • The Hindle & Rooth model needs prior estimates for: • P(VA=1 | v) • P(NA=1 | n) • This is plain old conditional probability, but where do the frequencies come from? • We would need to disambiguate all ambiguous PPs in order to count them. • But that’s exactly the initial problem! • OK if we have a treebank, but often we don’t.

  43. Hindle and Rooth’s solution • Build an initial model by looking only at unambiguous cases: • The road to London is… • She sent him into the nursery… • Apply the initial model to ambiguous cases whenever the λ value exceeds a threshold (e.g. 0.2 for VP and -0.2 for NP) • For each remaining ambiguous case, divide it between the two counts for NA and VA: • i.e. add 0.5 to each count
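A minimal sketch of this count-collection heuristic, under the same assumptions (the threshold value and all names are illustrative; p_va and p_na map verbs and nouns to the initial model’s probabilities):

```python
from math import log2

def collect_counts(ambiguous_cases, p_va, p_na, threshold=0.2):
    """ambiguous_cases: (verb, noun) pairs with an ambiguous PP.
    Returns attachment counts for verbs (VA) and nouns (NA)."""
    va_counts, na_counts = {}, {}
    for v, n in ambiguous_cases:
        # Score the case with the initial model (cf. slide 41).
        lam = log2(p_va[v] * (1 - p_na[n]) / p_na[n])
        if lam > threshold:                        # confident VP attachment
            va_counts[v] = va_counts.get(v, 0) + 1
        elif lam < -threshold:                     # confident NP attachment
            na_counts[n] = na_counts.get(n, 0) + 1
        else:                                      # undecided: split the case
            va_counts[v] = va_counts.get(v, 0) + 0.5
            na_counts[n] = na_counts.get(n, 0) + 0.5
    return va_counts, na_counts
```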

  44. Other attachment ambiguities • Noun compounds: • left-branching: • [[statistical parsing] practitioner] • = “someone who does statistical parsing” • right-branching: • [statistical [parsing algorithm]] • = “a parsing algorithm which is statistical” • We could apply a Hindle & Rooth-style solution, but the data sparseness problem is severe for these complex N-compounds.

  45. Indeterminacy • we signed an agreement with X • VP-attachment: • we signed the agreement in the presence of / in the company of / together with X • NP-attachment: • we signed an agreement between us and X • Probably both are true, and one must be true for the other to be true. • So is this a real ambiguity, or an indeterminacy?
