Penn Treebank • The Penn Treebank Project annotates naturally occurring text for linguistic structure. It produces skeletal parses showing rough syntactic and semantic information: a bank of linguistic trees. It also annotates text with POS tags. • Bracketing (strictly POS vs. syntax and predicates): (Mary) (visited a very nice boy) (1) (A very nice boy) (visited Mary) (2) Tree for (1): (S (NP Mary) (VP (V visited) (NP (ART a) (ADJP (ADV very) (ADJ nice)) (N boy))))
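To make the bracketing notation concrete, here is a minimal sketch of a reader for labelled bracketings. The names `parse_bracketing` and `leaves` are hypothetical helpers introduced for illustration, not part of any Treebank tooling, and the sketch assumes every constituent has a label and that the input holds exactly one well-formed tree.

```python
def parse_bracketing(text):
    """Parse a labelled bracketing into nested (label, children) tuples.

    Children are either sub-tuples (constituents) or plain strings (words).
    Assumes one well-formed tree; unbalanced input raises IndexError.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def constituent():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]          # first token after "(" is the category label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(constituent())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    tree = constituent()
    if pos != len(tokens):
        raise ValueError("trailing material after the first tree")
    return tree


def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_bracketing(
    "(S (NP Mary) (VP (V visited) (NP (ART a) (ADJP (ADV very) (ADJ nice)) (N boy))))"
)
print(tree[0])        # category of the root
print(leaves(tree))   # the sentence read off the tree
```

Nested tuples keep the sketch dependency-free; a real project would more likely use an existing tree library.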
syntactic tags • ADJP - Adjective phrase. Example: “outrageously expensive”. • ADVP - Adverb phrase. Examples: “rather timidly”, “very well indeed”. • NP - Noun phrase. • PNP - Proper noun phrase. • PP - Prepositional phrase. • S - Simple declarative clause • SBAR - Clause introduced by a subordinating conjunction. • SBARQ - Direct question introduced by a wh-word or wh-phrase.
syntactic tags • SINV - Inverted declarative sentence, one in which the subject follows the verb. • SQ - That part of an SBARQ that excludes the wh-word or wh-phrase. • VP - Verb phrase. Phrasal category headed by a verb. • WHADVP - Wh-adverb phrase. Example: “how” or “where”. • WHNP - Wh-noun phrase. Examples: “who”, “whose daughter”, “which book”. • WHPP - Wh-prepositional phrase. Example: “on what”. • QP - Quantifier phrase used within NP. • X - Constituent of unknown or uncertain type.
examples • adverb and preposition: (S (NP He) was (VP (ADVP very hurriedly) throwing (NP clothes) (PP into (NP a suitcase))) .) • apposition: (NP (NP Mr. Smith) , (ADJP (NP 65 years) old) , (NP chairman (PP of (NP the board)))) • comparative: (S (NP He) (VP is (ADJP as tall (SBAR as (S (NP John) (VP is))))) .) (S (SBAR (X the sooner) (S our vans hit the road)) , (S (X the easier) (S we will fulfill that obligation)) .)
function tags • Subject and predicate NPs: (S (NP-SBJ I) (VP consider (S (NP-SBJ Kris) (NP-PRD a fool)))) • Benefactive: (S (NP-SBJ I) (VP baked (NP a cake) (PP-BNF for (NP Doug)))) • Other tags: ADV (adverbial noun: “a little bit”), VOC (vocative), DTV (dative), DIR (direction with PP like from-to), LOC (locative with PP), MNR (manner), TMP (temporal), CLR (closely related: predication adjuncts or phrasal verbs), HLN (headline or dateline), TTL (title), etc.
gapping • gap coindexing: (S (S (NP-SBJ-1 Mary) (VP likes (NP-2 Bach))) and (S (NP-SBJ=1 Susan) , (NP=2 Beethoven))) Predicate-argument structure: LIKES(Mary, Bach) & LIKES(Susan, Beethoven) • (S (NP-SBJ I) (VP (VP eat (NP-1 breakfast) (PP-TMP-2 in (NP the morning))) and (VP (NP=1 lunch) (PP-TMP=2 in (NP the afternoon)))))
empty categories • Empty categories or null elements are used for non-local dependencies, discontinuous constituents, and missing elements. They are coindexed with their antecedents in the same sentence. • In addition, if a node has a particular grammatical function (such as subject) or semantic role (such as location), it has a function tag indicating that role; empty categories may also have function tags.
indexing & *T* examples • Indices are used to express coreference, binding (wh-movement), and close association (it-extraposition). • (S (NP-SBJ Willie) (VP knew (SBAR (WHNP-1 who) (S (NP-SBJ *T*-1) (VP threw (NP the ball)))))) • (SBARQ (WHNP-1 what) (SQ are (NP-SBJ you) (VP thinking (PP-CLR about (NP *T*-1)))) ?)
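The indexing convention above can be read off mechanically. The following is a rough sketch that pairs each numeric index with the label that carries it and with the null elements that point back to it. The function name `coindexation` and the two regular expressions are assumptions made for illustration; the sketch only handles `-N` indices on upper-case labels and the null elements `*` and `*T*`, not the `=N` gapping notation.

```python
import re

# A label that carries an index, e.g. "(WHNP-1 " or "(NP-SBJ-1 ".
ANTECEDENT = re.compile(r"\(([A-Z]+(?:-[A-Z]+)*)-(\d+)\s")
# A null element that points at an index, e.g. "*T*-1" or "*-1".
NULL = re.compile(r"(\*T\*|\*)-(\d+)")


def coindexation(bracketing):
    """Map each index to the label that bears it and the null elements that refer to it."""
    links = {}
    for label, index in ANTECEDENT.findall(bracketing):
        links.setdefault(index, {"antecedent": None, "nulls": []})["antecedent"] = label
    for kind, index in NULL.findall(bracketing):
        links.setdefault(index, {"antecedent": None, "nulls": []})["nulls"].append(kind)
    return links


wh = "(S (NP-SBJ Willie) (VP knew (SBAR (WHNP-1 who) (S (NP-SBJ *T*-1) (VP threw (NP the ball))))))"
print(coindexation(wh))
```

Working on the raw string keeps the example short; a tree-based version would also be able to return the antecedent's words.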
NP * • object of passive verb: (S (NP-SBJ-1 John) (VP was (VP hit (NP *-1) (PP by (NP a ball))))) • reduced relative clause: (NP (NP an agreement) (VP signed (NP *) (PP by (NP everyone)))) • subjects of participial clauses and gerunds: (S (NP-SBJ-1 I) (VP stopped (S (NP-SBJ *-1) (VP eating (NP chocolate))))) (S (NP *) (VP Having (VP carefully considered (NP his options)))) • adverbial: (S (NP-SBJ-1 She) (VP left , (S-ADV (NP-SBJ-2 *-1) (VP offended (NP *-2) (PP by (NP their remarks))))))
0 and *U* • 0 for omitted that: (S (NP-SBJ I) (VP believe (SBAR 0 (S (NP-SBJ he) (VP will (VP stay)))))) • WHNP 0: (NP (NP a movie) (SBAR (WHNP-1 0) (S (NP-SBJ *) (VP to (VP see (NP *T*-1)))))) • WHADVP 0: (S (NP-SBJ That) (VP is (NP-PRD (NP a good way) (SBAR (WHADVP-1 0) (S (NP-SBJ *) (VP to (VP keep (ADJP-PRD warm) (ADVP-MNR *T*-1)))))))) • units: (NP US$ 5 *U*) (NP (QP between 12% to 13%) *U*)
recovery of empty categories [Campbell 2004] Recovery refers to: • detection: locate empty categories in the parse tree • resolution: coindex them with their antecedents and assign them function tags. The approach is NOT learning- or corpus-based, but based on syntactic rules.
algorithm for recovery • Walk the tree top-down. At each node X, try to insert every empty category c. If the syntactic context of c (rule-based) is met by X, insert c. Assign function tags to X. If X = NP *, try to find an antecedent for X. • rule to insert NP *:
  if X is a passive VP and X has no complement S
    if X has a postmodifying PP Y
      insert NP * before the postmodifiers of Y
    else
      insert NP * before the postmodifiers of X
  else if X is a non-finite S and X has no subject
    insert NP-SBJ * after all premodifiers of X
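The NP * rule can be sketched in code as follows. This is an illustration under stated assumptions, not Campbell's implementation: the `Node` class and the fields `passive`, `nonfinite`, `premod_end` and `postmod_start` are hypothetical annotations that some other component is assumed to supply. The slide does not say which postmodifying PPs qualify as Y; the sketch assumes a PP whose preposition has lost its object (as in "was talked about"), marked by a hypothetical `dangling` flag. Function-tag assignment and antecedent search are left out.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)
    # Hypothetical annotations, assumed to come from earlier analysis steps.
    passive: bool = False      # VP headed by a passive participle
    nonfinite: bool = False    # S without a finite verb
    dangling: bool = False     # assumption: PP whose preposition lacks an object
    premod_end: int = 0        # index just after the last premodifier
    postmod_start: int = -1    # index of the first postmodifier, -1 if none


def base(label):
    return label.split("-")[0]


def kids(node):
    return [c for c in node.children if isinstance(c, Node)]


def postmod_index(node):
    return len(node.children) if node.postmod_start < 0 else node.postmod_start


def insert_np_star(x):
    """Apply the NP * rule at node x; return True if an NP * was inserted."""
    if base(x.label) == "VP" and x.passive and not any(base(c.label) == "S" for c in kids(x)):
        post = x.children[postmod_index(x):]
        pps = [c for c in post if isinstance(c, Node) and base(c.label) == "PP" and c.dangling]
        target = pps[0] if pps else x
        target.children.insert(postmod_index(target), Node("NP", ["*"]))
        return True
    if base(x.label) == "S" and x.nonfinite and not any("SBJ" in c.label.split("-") for c in kids(x)):
        x.children.insert(x.premod_end, Node("NP-SBJ", ["*"]))
        return True
    return False


def walk(node):
    """Visit the tree top-down, trying the rule at every node."""
    insert_np_star(node)
    for child in list(node.children):
        if isinstance(child, Node):
            walk(child)


def show(node):
    if isinstance(node, str):
        return node
    return "(" + " ".join([node.label] + [show(c) for c in node.children]) + ")"


hit = Node("VP", ["hit", Node("PP", ["by", Node("NP", ["a", "ball"])])],
           passive=True, postmod_start=1)
passive_tree = Node("S", [Node("NP-SBJ", ["John"]), Node("VP", ["was", hit])])
walk(passive_tree)
print(show(passive_tree))

eating = Node("S", [Node("VP", ["eating", Node("NP", ["chocolate"])])], nonfinite=True)
gerund_tree = Node("S", [Node("NP-SBJ", ["I"]), Node("VP", ["stopped", eating])])
walk(gerund_tree)
print(show(gerund_tree))
```

The two example trees correspond to the passive and gerund sentences shown on the NP * slide, with the empty categories removed before the walk.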
parameters • rules make no use of lexical information. • only some function words are used (auxiliaries, infinitival to), no content words. • for WHADVP: check the nature of the head of the NP that the relative clause modifies and add a function tag to *T* (why: PRP, how: MNR, when: TMP, etc.) (NP (NP the country) (SBAR (WHADVP-1 where) (S (NP-SBJ I) (VP live (ADVP-LOC *T*-1))))) “time to go”? • the method depends on the system’s ability to detect passives, infinitives, modifiers, and functional information such as subject.
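The WHADVP bullet above amounts to a small lookup, sketched below. The wh-word entries for why, how and when come from the slide, and where: LOC is read off the example tree. The head-noun table is purely illustrative and an assumption of this sketch, as are the names `WH_TAG`, `HEAD_TAG`, `trace_tag` and `trace_label`; it only suggests how a null WHADVP, as in “time to go”, might fall back on the head noun.

```python
# Wh-word to function tag; "where" -> LOC is inferred from the example tree.
WH_TAG = {"why": "PRP", "how": "MNR", "when": "TMP", "where": "LOC"}

# Illustrative head-noun hints for a null WHADVP (assumption, not from the slides).
HEAD_TAG = {"time": "TMP", "way": "MNR", "country": "LOC"}


def trace_tag(wh_word, head_noun):
    """Pick a function tag for the *T* trace of a WHADVP, or None if unknown."""
    if wh_word is not None and wh_word.lower() in WH_TAG:
        return WH_TAG[wh_word.lower()]
    return HEAD_TAG.get(head_noun.lower())


def trace_label(wh_word, head_noun, index):
    """Render the trace constituent, e.g. (ADVP-LOC *T*-1)."""
    tag = trace_tag(wh_word, head_noun)
    label = "ADVP-" + tag if tag else "ADVP"
    return f"({label} *T*-{index})"


print(trace_label("where", "country", 1))
print(trace_label(None, "way", 1))
```

When neither the wh-word nor the head noun gives a hint, the sketch leaves the trace untagged rather than guessing.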
more rules • an extra rule inserts NP * as the subject of an imperative: (S (NP-VOC Chris) , (NP-SBJ *) (VP go (ADVP-DIR home)) !) • to find antecedents of NP *: If NP * is not a subject, assign the local subject (“John was hit (NP *) by a ball”). If NP * is the subject of a non-finite S, search the tree for another NP subject (“I stopped (NP-SBJ *) eating chocolate”).
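A minimal sketch of the two antecedent rules follows. The slide only says to "search the tree" for the subject case; this sketch assumes that means climbing to the nearest enclosing clause that has a subject, which is one possible reading. The `Node` class with parent pointers and the helpers `subject_of`, `enclosing_clause`, `find_antecedent` and `words` are hypothetical names introduced for illustration.

```python
class Node:
    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            if isinstance(child, Node):
                child.parent = self


def is_subject(node):
    return "SBJ" in node.label.split("-")


def subject_of(clause):
    """Return the first subject child of a clause, or None."""
    for child in clause.children:
        if isinstance(child, Node) and is_subject(child):
            return child
    return None


def enclosing_clause(node):
    """Climb to the nearest S above node, or None at the root."""
    node = node.parent
    while node is not None and node.label.split("-")[0] != "S":
        node = node.parent
    return node


def find_antecedent(np_star):
    clause = enclosing_clause(np_star)
    if clause is None:
        return None
    if not is_subject(np_star):
        return subject_of(clause)          # non-subject NP *: local subject
    outer = enclosing_clause(clause)       # subject NP *: look in higher clauses
    while outer is not None:
        subject = subject_of(outer)
        if subject is not None:
            return subject
        outer = enclosing_clause(outer)
    return None


def words(node):
    out = []
    for child in node.children:
        out.extend(words(child) if isinstance(child, Node) else [child])
    return out


obj = Node("NP", "*")
passive = Node("S", Node("NP-SBJ", "John"),
               Node("VP", "was", Node("VP", "hit", obj,
                                      Node("PP", "by", Node("NP", "a", "ball")))))
print(words(find_antecedent(obj)))

subj = Node("NP-SBJ", "*")
gerund = Node("S", Node("NP-SBJ", "I"),
              Node("VP", "stopped",
                   Node("S", subj, Node("VP", "eating", Node("NP", "chocolate")))))
print(words(find_antecedent(subj)))
```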
evaluation • perfect input: PTB w/o empty categories. Correct recovery: label + string position. Precision (%): correct empty categories / empty categories detected. Recall (%): correct empty categories / empty categories in corpus. F1: 2PR/(P+R). • perfect input w/o function tags.
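The scoring scheme can be sketched as set comparison, treating each empty category as a (label, string position) pair, which is how the slide defines a correct recovery. The function name `score` and the toy gold and predicted sets are made up for illustration and are not results from the evaluation.

```python
def score(gold, predicted):
    """Return (precision, recall, F1) over (label, position) pairs."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)            # label and position both match
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: positions are token offsets in the sentence.
gold = {("NP *", 3), ("NP *T*", 7), ("0", 5), ("*U*", 9)}
predicted = {("NP *", 3), ("NP *T*", 6), ("0", 5)}

p, r, f1 = score(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

In the toy data the predicted NP *T* has the right label but the wrong position, so it counts as both a false positive and a miss.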
evaluation • Charniak’s parser output as input. PCFG parser based on the PTB for training/testing. correct recovery: label + string position. • low results: errors introduced by the parser and no function tags on parser output.
learning & lexical-based? • lexical info is needed in some cases: in VP (S (VP to…)…), is the empty subject of S NP * or NP *T*? “I’d like (NP-SBJ *) to have.” “Everyone seems (NP-SBJ *) to dislike him.” “John designed telescopes (NP-SBJ *T*) to sit on Kitt Peak.” “We bought a broom (NP-SBJ *) to sweep the floor with (NP *T*)” • in the last two examples, verb + to expresses purpose (PRP). • combine learning with the rule-based approach for function or subject tags on NP * and for their antecedents (resolution).