
Probabilistic and Lexicalized Parsing


Presentation Transcript


  1. Probabilistic and Lexicalized Parsing

  2. Handling Ambiguities • The Earley Algorithm is equipped to represent ambiguities efficiently but not to resolve them. • Methods available for resolving ambiguities include: • Semantics (choose the parse that makes sense). • Statistics (choose the parse that is most likely). • Probabilistic context-free grammars (PCFGs) offer a solution.

  3. Issues for PCFGs • The probabilistic model • Getting the probabilities • Parsing with probabilities • Task: find the maximum-probability tree for an input string

  4. PCFG • A PCFG is a 5-tuple (NT, T, P, S, D) where D is a function that assigns a probability to each rule p ∈ P. • A PCFG augments each rule with a conditional probability: A → β [p] • Formally this is the probability of a given expansion given the LHS non-terminal. • Read this as P(specific rule | LHS). • VP → V [.55] • What’s the probability that VP will expand to V, given that we have a VP?
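A minimal sketch of how such a rule table might be stored, assuming a simple nested-dictionary representation (the S probabilities come from the fragment on slide 6; the VP → V NP probability is invented so the VP distribution sums to 1):

```python
# Sketch of a PCFG rule table: each LHS non-terminal maps its possible
# RHS expansions to P(LHS -> RHS | LHS).
pcfg = {
    "S":  {("NP", "VP"): 0.80, ("Aux", "NP", "VP"): 0.15, ("VP",): 0.05},
    "VP": {("V",): 0.55, ("V", "NP"): 0.45},   # 0.45 is an invented value for illustration
}

# Sanity check: the conditional probabilities for each LHS must sum to 1.
for lhs, expansions in pcfg.items():
    assert abs(sum(expansions.values()) - 1.0) < 1e-9, lhs
```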

  5. Example PCFG

  6. Example PCFG Fragment • S → NP VP [.80] • S → Aux NP VP [.15] • S → VP [.05] • The sum of the conditional probabilities for a given non-terminal = 1 • A PCFG can be used to estimate the probability of each parse tree for a sentence S.

  7. Probability of a Derivation • For a sentence S, the probability assigned by a PCFG to a parse tree T is P(T) = ∏ P(r(n)), the product taken over all nodes n in T • i.e. the product of the probabilities of all the rules r used to expand each node n in T • Note the independence assumption
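As a small illustration of this product, the sketch below multiplies the rule probabilities over an assumed (label, children) tree representation, reusing the rule-table format sketched earlier (terminals are plain strings):

```python
# Sketch: P(T) as the product of P(r(n)) over every node n in the tree.
# A node is (label, children); children are either nodes or terminal strings.
def tree_prob(node, pcfg):
    label, children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = pcfg[label][rhs]                     # P(this expansion | label)
    for child in children:
        if not isinstance(child, str):       # recurse into non-terminal children
            p *= tree_prob(child, pcfg)
    return p
```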

  8. Ambiguous Sentence • P(TL) = 1.5 × 10⁻⁶ • P(TR) = 1.7 × 10⁻⁶ • P(S) = 3.2 × 10⁻⁶
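Since the sentence can be produced by either tree, its probability is the sum of the two parse probabilities: P(S) = P(TL) + P(TR) = 1.5 × 10⁻⁶ + 1.7 × 10⁻⁶ = 3.2 × 10⁻⁶.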

  9. Getting the Probabilities • From an annotated database • E.g. the Penn Treebank • To get the probability for a particular VP rule just count all the times the rule is used and divide by the number of VPs overall • What if you have no treebank (e.g. for a ‘new’ language)?
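A sketch of this relative-frequency estimate, assuming the treebank has already been flattened into a list of (LHS, RHS) rule occurrences (a hypothetical input format, not a Penn Treebank reader):

```python
from collections import Counter

# Sketch: maximum-likelihood rule probabilities from rule occurrences.
def estimate_rule_probs(treebank_rules):
    rule_counts = Counter(treebank_rules)                    # count(LHS -> RHS)
    lhs_counts = Counter(lhs for lhs, _ in treebank_rules)   # count(LHS)
    return {
        (lhs, rhs): count / lhs_counts[lhs]                  # P(RHS | LHS)
        for (lhs, rhs), count in rule_counts.items()
    }
```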

  10. The Parsing Problem for PCFGs • The parsing problem for PCFGs is to produce the most likely parse for a given sentence, i.e. to compute the T spanning the sentence whose probability is maximal.

  11. Typical Approach • Bottom-up (CKY) dynamic programming approach • Assign probabilities to constituents as they are completed and placed in the table • Use the max probability for each constituent going up the tree • The CKY algorithm assumes that the grammar is in Chomsky Normal Form: • No ε-productions • Rules of the form A → B C or A → a

  12. CKY Algorithm – Base Case • Base case: covering input strings of length 1 (i.e. individual words). In CNF, the probability p has to come from that of the corresponding rule • A → w [p]

  13. CKY Algorithm – Recursive Case • Recursive case: input strings of length > 1: A ⇒* wij if and only if there is a rule A → B C and some k with i < k < j such that B derives wik and C derives wkj • In this case P(wij) is obtained by multiplying together P(A → B C), P(wik) and P(wkj) • These probabilities are in other parts of the table • Take the max value
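A compact sketch of the probabilistic CKY procedure described on slides 11–13, under the stated CNF assumption; the table layout and the unary/binary rule formats here are choices made for this illustration, not something fixed by the slides:

```python
# Sketch of probabilistic (Viterbi) CKY for a CNF grammar.
# unary[A][w] = P(A -> w); binary is a list of (A, B, C, P(A -> B C)) tuples.
def cky(words, unary, binary, start="S"):
    n = len(words)
    # table[i][j] maps a non-terminal A to the best probability of A deriving words[i:j]
    table = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Base case: spans of length 1 get their probability from the lexical rule A -> w.
    for i, w in enumerate(words):
        for a, lexicon in unary.items():
            if w in lexicon:
                table[i][i + 1][a] = lexicon[w]

    # Recursive case: A covers words[i:j] if A -> B C, B covers words[i:k],
    # C covers words[k:j]; keep the maximum probability for each A.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, b, c, p_rule in binary:
                    if b in table[i][k] and c in table[k][j]:
                        p = p_rule * table[i][k][b] * table[k][j][c]
                        if p > table[i][j].get(a, 0.0):
                            table[i][j][a] = p

    return table[0][n].get(start, 0.0)   # probability of the best parse rooted in start
```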

  14. Problems with PCFGs • Fundamental independence assumption: a PCFG assumes that the expansion of any one non-terminal is independent of the expansion of any other non-terminal. • Hence rule probabilities are multiplied together.

  15. Problems with PCFGs (cont.) • Doesn’t use the words in any real way – e.g. PP attachment often depends on the verb, its object, and the preposition (I ate pickles with a fork. I ate pickles with relish.) • Doesn’t take into account where in the derivation a rule is used – e.g. pronouns more often subjects than objects (She hates Mary. Mary hates her.)

  16. Dependencies cannot be stated • These dependencies could be captured if it were possible to say that the probabilities associated with, e.g., NP → Pronoun or NP → Det Noun depend on whether the NP is subject or object • However, this cannot normally be said in a standard PCFG.

  17. Lack of Sensitivity to the Properties of Individual Words • Lexical information can play an important role in selecting between alternative interpretations. Consider the sentence: • "Moscow sent soldiers into Afghanistan." • NP → NP PP • VP → VP PP • These give rise to 2 parse trees

  18. Attachment Ambiguity • [Two parse trees for "Moscow sent soldiers into Afghanistan": one attaching the PP "into Afghanistan" to the VP, the other attaching it to the object NP.] • 33% of PPs attach to VPs • 67% of PPs attach to NPs

  19. Lexical Properties • In this case the raw statistics are misleading and yield the wrong conclusion. • The correct parse should be decided on the basis of the lexical properties of the verb "send" alone, since we know that the basic pattern for this verb is (NP) send (NP) (PPinto), where the PPinto attaches to the VP.

  20. Lexicalised PCFGs • Add lexical dependencies to the scheme… • Add the predilections of particular words into the probabilities in the derivation • I.e. Condition the rule probabilities on the actual words • Basic idea: each syntactic constituent is associated with a head which is a single word. • Each non-terminal in a parse tree is annotated with that single word.

  21. Phrasal Heads • Heads • The head of an NP is its noun • The head of a VP is its verb • The head of a PP is its preposition • Can ‘take the place of’ whole phrases, in some sense • Define most important characteristics of the phrase • Phrases are generally identified by their heads
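A toy illustration of these head assignments, using a simplified head table (real head-finding rules are considerably more detailed):

```python
# Toy head table from the slide: NP -> noun, VP -> verb, PP -> preposition.
HEAD_CHILD = {"NP": "N", "VP": "V", "PP": "P"}

def find_head(label, children):
    """Return the word of the child that supplies the head of this phrase.

    children is a list of (category, word) pairs (hypothetical format).
    """
    wanted = HEAD_CHILD.get(label)
    for child_label, child_word in children:
        if child_label == wanted:
            return child_word
    return None   # no head found under this toy scheme
```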

  22. Lexicalised Tree

  23. Example (wrong)

  24. How? • We started with rule probabilities • VP → V NP PP with probability P(rule | VP) • E.g., the count of this rule divided by the number of VPs in a treebank • Now we want lexicalized probabilities • VP(dumped) → V(dumped) NP(sacks) PP(into) • P(r | VP ^ dumped is the verb ^ sacks is the head of the NP ^ into is the head of the PP) • Not likely to have significant counts in any treebank

  25. Probabilities of Lexicalised Rules • Need to establish the probabilities of, e.g. • VP(dumped) → V(dumped) NP(sacks) PP(into) • VP(dumped) → V(dumped) NP(cats) PP(into) • VP(dumped) → V(dumped) NP(hats) PP(into) • VP(dumped) → V(dumped) NP(sacks) PP(above) • Problem – no corpus is big enough to train with this number of rules

  26. Declare Independence • So, exploit independence assumption and collect the statistics you can… • Focus on capturing two things • Verb subcategorization • Particular verbs have affinities for particular VP rules • Objects have affinities for their predicates (mostly their mothers and grandmothers) • Some objects fit better with some predicates than others

  27. Verb Subcategorization • Condition particular VP rules on their head… so for r: VP → V NP PP, P(r | VP) becomes P(r | VP ^ dumped) • What’s the count? How many times this rule was used with dumped, divided by the total number of VPs in which dumped appears
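A sketch of that relative-frequency estimate, with hypothetical count tables assumed to have been collected from a lexicalized treebank:

```python
# Sketch: P(rule | VP, head verb), e.g. P(VP -> V NP PP | VP headed by "dumped").
# rule_head_counts[(rule, head)] = times the rule was used with that head verb;
# head_counts[head] = total number of VPs headed by that verb (both hypothetical tables).
def p_rule_given_head(rule, head, rule_head_counts, head_counts):
    return rule_head_counts[(rule, head)] / head_counts[head]

# e.g. p_rule_given_head(("VP", ("V", "NP", "PP")), "dumped", rule_head_counts, head_counts)
```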

  28. Preferences • The issue here is the attachment of the PP • So the affinities we care about are the ones between dumped and into vs. sacks and into. • Count the times dumped is the head of a constituent that has a PP daughter with into as its head, and normalize • vs. the situation where sacks is the head of a constituent with a PP daughter headed by into
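A sketch of how the two affinities could be compared, again with hypothetical count tables:

```python
# Sketch: how strongly a candidate head word attracts a PP headed by a given preposition.
# pp_daughter_counts[(head_word, prep)] = times head_word heads a constituent with a
# PP daughter headed by prep; head_counts[head_word] = times head_word heads any constituent.
def pp_attachment_score(head_word, prep, pp_daughter_counts, head_counts):
    return pp_daughter_counts[(head_word, prep)] / head_counts[head_word]

# Attach the PP to whichever candidate gives the higher score, e.g. compare
# pp_attachment_score("dumped", "into", ...) with pp_attachment_score("sacks", "into", ...).
```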
