probabilistic and lexicalized parsing n.
Skip this Video
Loading SlideShow in 5 Seconds..
Probabilistic and Lexicalized Parsing PowerPoint Presentation
Download Presentation
Probabilistic and Lexicalized Parsing

Probabilistic and Lexicalized Parsing

135 Vues Download Presentation
Télécharger la présentation

Probabilistic and Lexicalized Parsing

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Probabilistic and Lexicalized Parsing CS 4705

  2. Probabilistic CFGs: PCFGs • Weighted CFGs • Attach weights to rules of CFG • Compute weights of derivations • Use weights to choose preferred parses • Utility: Pruning and ordering the search space, disambiguate, Language Model for ASR • Parsing with weighted grammars: find the parse T’ which maximizes the weights of the derivations in the parse tree for all the possible parses of S • T’(S) = argmaxT∈τ(S) W(T,S) • Probabilistic CFGs are one form of weighted CFGs

  3. Rule Probability • Attach probabilities to grammar rules • Expansions for a given non-terminal sum to 1 R1: VP  V .55 R2: VP  V NP .40 R3: VP  V NP NP .05 • Estimate probabilities from annotated corpora • E.g. Penn Treebank • P(R1)=counts(R1)/counts(VP)

  4. Derivation Probability • For a derivation T= {R1…Rn}: • Probability of the derivation: • Product of probabilities of rules expanded in tree • Most likely probable parse: • Probability of a sentence: • Sum over all possible derivations for the sentence • Note the independence assumption: Parse probability does not change based on where the rule is expanded.

  5. One Approach: CYK Parser • Bottom-up parsing via dynamic programming • Assign probabilities to constituents as they are completed and placed in a table • Use the maximum probability for each constituent type going up the tree to S • The Intuition: • We know probabilities for constituents lower in the tree, so as we construct higher level constituents we don’t need to recompute these

  6. CYK (Cocke-Younger-Kasami) Parser • Bottom-up parser with top-down filtering • Uses dynamic programming to store intermediate results (cf. Earley algorithm for top-down case) • Input: PCFG in Chomsky Normal Form • Rules of form Aw or ABC; no ε • Chart: array [i,j,A] to hold probability that non-terminal A spans input i-j • Start State(s): (i,i+1,A) for each Awi+1 • End State: (1,n,S) where n is the input size • Next State Rules: (i,k,B) (k,j,C)  (i,j,A) if ABC • Maintain back-pointers to recover the parse

  7. S  NP VP VP  V NP NP  NP PP VP  VP PP PP  P NP NP  John | Mary | Denver V -> called P -> from S VP NP PP VP V NP P NP called from John Mary Denver Structural Ambiguity John called Mary from Denver S VP NP NP V NP PP called John Mary P NP from Denver

  8. Example

  9. Base Case: Aw

  10. Recursive Cases: ABC

  11. Problems with PCFGs • Probability model just based on rules in the derivation. • Lexical insensitivity: • Doesn’t use words in any real way • But structural disambiguation is lexically driven • PP attachment often depends on the verb, its object, and the preposition • I ate pickles with a fork. • I ate pickles with relish. • Context insensitivity of the derivation • Doesn’t take into account where in the derivation a rule is used • Pronouns more often subjects than objects • She hates Mary. • Mary hates her. • Solution: Lexicalization • Add lexical information to each rule • I.e. Condition the rule probabilities on the actual words

  12. An example: Phrasal Heads • Phrasal heads can ‘take the place of’ whole phrases, defining most important characteristics of the phrase • Phrases generally identified by their heads • Head of an NP is a noun, of a VP is the main verb, of a PP is preposition • Each PFCG rule’s LHS shares a lexical item with a non-terminal in its RHS

  13. Increase in Size of Rule Set in Lexicalized CFG • If R is the number of binary branching rules in CFG and ∑ is the lexicon, O(2*|∑|*|R|) • For unary rules: O(|∑|*|R|)

  14. Example (correct parse) Attribute grammar

  15. Example (less preferred)

  16. Computing Lexicalized Rule Probabilities • We started with rule probabilities as before • VP  V NP PP P(rule|VP) • E.g., count of this rule divided by the number of VPs in a treebank • Now we want lexicalized probabilities • VP(dumped)  V(dumped) NP(sacks) PP(into) • i.e., P(rule|VP ^ dumped is the verb ^ sacks is the head of the NP ^ into is the head of the PP) • Not likely to have significant counts in any treebank

  17. Exploit the Data You Have • So, exploit the independence assumption and collect the statistics you can… • Focus on capturing • Verb subcategorization • Particular verbs have affinities for particular VPs • Objects’ affinity for their predicates • Mostly their mothers and grandmothers • Some objects fit better with some predicates than others

  18. Verb Subcategorization • Condition particular VP rules on their heads • E.g. for a rule r VP -> V NP PP • P(r|VP) becomes P(r ^ V=dumped | VP ^ dumped) • How do you get the probability? • How many times was rule r used withdumped, divided by the number of VPs thatdumpedappears in, in total • How predictive of r is the verbdumped? • Captures affinity between VP heads (verbs) and VP rules

  19. Example (correct parse)

  20. Example (less preferred)

  21. Affinity of Phrasal Heads for Other Heads: PP Attachment • Verbs with preps vs. Nouns with preps • E.g. dumped with into vs. sacks with into • How often is dumped the head of a VP which includes a PP daughter with into as its head relative to other PP heads or… what’s P(into|PP,dumped is mother VP’s head)) • Vs…how often is sacksthe head of an NP with a PP daughter whose head isinto relative to other PP heads or… P(into|PP,sacks is mother’s head))

  22. But Other Relationships do Not Involve Heads (Hindle & Rooth ’91) • Affinity of gusto for eat is greater than for spaghetti; and affinity of marinara for spaghetti is greater than for ate Vp (ate) Vp(ate) Np(spag) Vp(ate) Pp(with) np Pp(with) v np v Ate spaghetti with marinara Ate spaghetti with gusto

  23. Log-linear models for Parsing • Why restrict to the conditioning to the elements of a rule? • Use even larger context…word sequence, word types, sub-tree context etc. • Compute P(y|x); where fi(x,y) tests properties of context and li is weight of feature • Use as scores in CKY algorithm to find best parse

  24. S N NP S Adj N NP VP Adv S N underground V NP now poachers control S VP NP N N S S NP VP Adv VP Det NP N N N N NP NP VP VP V NP now the e S NP Adj V V poachers trade VP underground S VP Adv control control S NP NP VP now NP N V NP trade e N e trade Supertagging: Almost parsing Poachers now control the underground trade S S NP VP S NP V NP NP VP e N V NP e poachers : : e Adj : : : underground

  25. Summary • Parsing context-free grammars • Top-down and Bottom-up parsers • Mixed approaches (CKY, Earley parsers) • Preferences over parses using probabilities • Parsing with PCFG and PCKY algorithms • Enriching the probability model • Lexicalization • Log-linear models for parsing • Super-tagging