Learning syntactic patterns for automatic hypernym discovery. Rion Snow, Daniel Jurafsky and Andrew Y. Ng. Prepared by Ang Sun, 2009-02-17.
Introduction
• Hypernym/hyponym relation: a relationship between two nouns X and Y. Y is a hypernym of X if X is a member of Y. Example: “author” (Y) is a hypernym of “Shakespeare” (X).
• Previous work: hand-crafted patterns are used to automatically label the hypernym relation between nouns. E.g., the pattern “NPx and other NPy” implies that NPx is a hyponym of NPy.
• Novelty of this paper: it uses known hypernym pairs to automatically identify useful lexico-syntactic patterns, and then trains a high-accuracy hypernym classifier by using these patterns as features in a supervised learning algorithm.
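A hand-crafted pattern such as “NPx and other NPy” can be approximated with a regular expression. The sketch below is illustrative only: it treats a single word as a noun phrase, and the names `AND_OTHER` and `and_other_pairs` are hypothetical, not taken from the paper.

```python
import re

# Toy approximation: each NP is a single word; a real system would match
# parsed noun phrases rather than raw tokens.
AND_OTHER = re.compile(r"\b(\w+),? and other (\w+)")

def and_other_pairs(sentence):
    """Return (hyponym, hypernym) candidates matched by 'NPx and other NPy'."""
    return [(m.group(1), m.group(2)) for m in AND_OTHER.finditer(sentence)]

print(and_other_pairs("They sell apples and other fruits."))
```

Patterns of this kind are precise but have to be written by hand, which is the limitation the paper addresses.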
Introduction (cont’)
• Overview of the approach
• 1. Training:
  (a) Collect noun pairs from corpora, identifying pairs of nouns in a hypernym/hyponym relation using WordNet.
  (b) For each noun pair, collect the sentences in which both nouns occur.
  (c) Parse the sentences and automatically extract patterns from the parse trees.
  (d) Train a binary hypernym classifier based on these features.
• 2. Test:
  (a) Given a pair of nouns in the test set, extract features and use the classifier to determine whether the noun pair is in the hypernym/hyponym relation.
Representing lexico-syntactic patterns with dependency paths
• What does dependency parsing do? A dependency parser produces a dependency tree that represents the syntactic relations between words as a list of edge tuples of the form (word1, CATEGORY1:RELATION:CATEGORY2, word2). Example: (Herrick, -N:conj:N, Shakespeare)
• The space of lexico-syntactic patterns is therefore defined as the shortest paths between any two nouns in a dependency tree. Example: the dependency path between 'authors' and 'Herrick' is: Herrick, N:pcomp-n:-Prep, as, as, Prep:mod:-N, authors
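Finding the shortest path between two nouns in a dependency tree can be sketched as a breadth-first search over the edge tuples. This is a minimal sketch under assumptions: edges are treated as undirected, words serve as node identifiers (so a repeated word in one sentence would collide), and the edge labels in the toy data are illustrative rather than real MINIPAR output.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over dependency edges, treated as undirected.

    edges: list of (word1, label, word2) tuples.
    Returns the alternating word/label sequence from start to goal, or None.
    """
    graph = {}
    for w1, label, w2 in edges:
        graph.setdefault(w1, []).append((label, w2))
        graph.setdefault(w2, []).append((label, w1))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        word, path = queue.popleft()
        if word == goal:
            return path
        for label, nxt in graph.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label, nxt]))
    return None

# Toy edges for "... such authors as Herrick and Shakespeare" (labels illustrative).
edges = [
    ("authors", "N:mod:Prep", "as"),
    ("as", "Prep:pcomp-n:N", "Herrick"),
    ("Herrick", "N:conj:N", "Shakespeare"),
    ("authors", "N:pre:PreDet", "such"),
]
print(shortest_path(edges, "Herrick", "authors"))
```

Because a dependency tree has exactly one path between any two words, breadth-first search here simply recovers that path.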
Representing lexico-syntactic patterns with dependency paths (cont’)
• Generalization and extension of dependency paths
• Generalization: remove the original nouns. Example: Herrick, N:pcomp-n:-Prep, as, as, Prep:mod:-N, authors => N:pcomp-n:-Prep, as, as, Prep:mod:-N
• Extension:
  (a) Capture function words like ‘such’ (in “such NP as NP”) by adding optional ‘satellite links’ to each shortest path, since such words are important parts of lexico-syntactic patterns. Example: N:pcomp-n:-Prep, as, as, Prep:mod:-N => N:pcomp-n:-Prep, as, as, Prep:mod:-N, (such, PreDet:pre:-N)
  (b) Capitalize on the distributive nature of the syntactic conjunction relation (nouns linked by ‘and’ or ‘or’, or in comma-separated lists) by distributing dependency links across such conjunctions. Example: see the red dotted link on the next slide.
Representing lexico-syntactic patterns with dependency paths (cont’)
• Let’s look at how the dependency representation of the pattern “NPy such as NPx” is generated:
  Herrick, N:pcomp-n:-Prep, as, as, Prep:mod:-N, authors
  => (generalization) N:pcomp-n:-Prep, as, as, Prep:mod:-N
  => (extension a) N:pcomp-n:-Prep, as, as, Prep:mod:-N, (such, PreDet:pre:-N)
  => (extension b) N:PCOMP-N:PREP, as, as, PREP:MOD:N, (such, PREDET:PRE:N)
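The generalization step and extension (a) can be sketched as simple operations on a path. This is a minimal illustration, assuming a path is stored as a list whose first and last items are the two nouns; the function names `generalize` and `with_satellite` are hypothetical.

```python
def generalize(path):
    """Drop the two end nouns so the pattern applies to any noun pair."""
    return tuple(path[1:-1])

def with_satellite(pattern, satellite):
    """Append one optional satellite link, e.g. ('such', 'PreDet:pre:-N')."""
    return pattern + (satellite,)

path = ["Herrick", "N:pcomp-n:-Prep", "as", "as", "Prep:mod:-N", "authors"]
pattern = generalize(path)
print(pattern)
print(with_satellite(pattern, ("such", "PreDet:pre:-N")))
```

Storing patterns as tuples makes them hashable, so they can be counted and used directly as feature keys.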
Experimental paradigm
• Corpus: 6 million newswire sentences.
• Preprocessing: parse each sentence using MINIPAR and extract noun pairs.
• Dev/test data:
  1) WordNet-labeled data: label the hypernym/hyponym relation between two nouns according to WordNet’s hypernym taxonomy. Known Hypernym Set: 14,387 pairs. Known Non-Hypernym Set: 737,924 pairs.
  2) Hand-labeled data (key): 5,387 noun pairs in total, of which 5,122 are “unrelated”, 134 are hypernym pairs, and 131 are “coordinate” (explained later in the paper).
• Evaluation: compare the binary classifier’s performance against WordNet’s judgments and against the hand-labeled data (which involves annotator disagreement, shown later).
Features: pattern discovery
• Focus on discovering which dependency paths might prove useful as features for the binary hypernym classifier (details later).
• Rediscovered hand-designed patterns (Hearst’s patterns, marked in red): they lie on the high-performance boundary of precision and recall for individual features.
• Discovered new patterns (marked in blue): these also score highly.
A hypernym-only classifier
• The classifier:
  1) Create a feature lexicon of 69,592 dependency paths, consisting of every dependency path that occurred between at least five unique noun pairs in the corpus.
  2) Record in a noun pair lexicon each noun pair that occurs with at least five unique paths from the feature lexicon.
  3) Create a feature count vector for each such noun pair.
  4) Each entry of the 69,592-dimensional vector represents a particular dependency path and contains the total number of times that path was the shortest path connecting that noun pair in some dependency tree in the corpus. The task thus becomes binary classification of a noun pair as a hypernym pair based on its feature vector of dependency paths.
  5) Train a number of classifiers: perform 10-fold cross-validation on the WordNet-labeled data and evaluate each model by its maximum F-score averaged across all folds.
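Steps 1 to 4 above can be sketched as follows. This is a minimal sketch, assuming the corpus has already been reduced to (noun pair, path) observations, one per sentence occurrence; the function name `build_vectors`, the toy path names, and the lowered thresholds in the demo are all illustrative.

```python
from collections import Counter, defaultdict

def build_vectors(observations, min_pairs=5, min_paths=5):
    """observations: iterable of (noun_pair, path), one per sentence occurrence.

    Returns (lexicon, vectors): lexicon is a sorted list of kept paths, and
    vectors maps each kept noun pair to a list of path counts.
    """
    observations = list(observations)
    pairs_per_path = defaultdict(set)
    for pair, path in observations:
        pairs_per_path[path].add(pair)
    # Feature lexicon: paths seen between at least `min_pairs` unique noun pairs.
    lexicon = sorted(p for p, pairs in pairs_per_path.items() if len(pairs) >= min_pairs)
    index = {p: i for i, p in enumerate(lexicon)}

    counts = defaultdict(Counter)
    for pair, path in observations:
        if path in index:
            counts[pair][path] += 1
    # Noun pair lexicon: pairs seen with at least `min_paths` unique lexicon paths.
    vectors = {}
    for pair, c in counts.items():
        if len(c) >= min_paths:
            vec = [0] * len(lexicon)
            for path, n in c.items():
                vec[index[path]] = n
            vectors[pair] = vec
    return lexicon, vectors

# Tiny demo with thresholds lowered to 2 so the toy data survives filtering.
obs = [
    (("dog", "animal"), "such_as"), (("dog", "animal"), "such_as"),
    (("dog", "animal"), "and_other"),
    (("cat", "animal"), "such_as"), (("cat", "animal"), "and_other"),
    (("cat", "table"), "rare_path"),
]
lexicon, vectors = build_vectors(obs, min_pairs=2, min_paths=2)
print(lexicon)
print(vectors)
```

With a lexicon of tens of thousands of paths, a sparse representation would be the practical choice; dense lists are used here only to keep the sketch short.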
A hypernym-only classifier (cont’)
• Comparison of performance:
• The first four are hypernym-only classifiers. The “Hearst's patterns” classifier simply detects the presence of at least one of Hearst's patterns; it is arguably the previous best classifier consisting only of lexico-syntactic patterns. The “And/or other” classifier uses only the “NP and/or other NP” subset of Hearst's patterns.
• Clearly, the hypernym-only classifiers perform much better than the hand-designed pattern classifiers. But the performance is still NOT very good. WHY?
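The maximum F-score used to compare classifiers can be sketched as a sweep over decision thresholds. This is a minimal sketch under assumptions: it uses the balanced F1 measure and takes candidate thresholds from the classifier scores themselves; the function name `max_f_score` and the toy scores are hypothetical.

```python
def max_f_score(scores, labels):
    """Best F1 over all decision thresholds taken from the scores themselves."""
    best = 0.0
    total_pos = sum(labels)
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        if tp == 0:
            continue  # F1 is zero when nothing is correctly predicted
        precision = tp / sum(predicted)
        recall = tp / total_pos
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Four noun pairs with classifier scores and gold labels (1 = hypernym pair).
print(max_f_score([0.9, 0.8, 0.4, 0.1], [1, 0, 1, 0]))
```

In a cross-validation setting, this value would be computed per fold and then averaged.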
Using coordinate terms to improve hypernym classification
• Problem with patterns: a pattern can only handle noun pairs that happen to occur in the same sentence, and many hypernym/hyponym pairs may never occur in the same sentence.
• Solution: consider coordinate terms.
• Coordinate terms: nouns or verbs that have the same hypernym.
• Assumption: if two nouns (ni, nj) are coordinate terms and nj is a hyponym of nk, we may infer with higher probability that ni is also a hyponym of nk, despite never having encountered the pair (ni, nk) within a single sentence.
• Expectation: using coordinate information will increase the recall of the hypernym classifier.
Using coordinate terms to improve hypernym classification (cont’)
• 3 classifiers:
  1) Distributional similarity vector space model
  2) Thresholded conjunction pattern classifier (pattern “X, Y and Z”)
  3) Best WordNet classifier
• Comparison of performance
Hybrid hypernym-coordinate classification
• 1) Hypernym probability: the probability that noun ni has nj as an ancestor in its hypernym hierarchy.
• 2) Coordinate probability: the probability that nouns ni and nj are coordinate terms.
• 3) Old hypernym probability: the probability produced by the hypernym-only classifier.
• 4) Linear interpolation of these probabilities is used to compute the new probability that nk is a hypernym of ni (the interpolation weights are fixed for the final evaluation).
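The interpolation idea can be sketched as below. This is one plausible reading, not a confirmed reproduction of the paper's formula: it assumes the new score is a weighted sum of the direct hypernym probability and the hypernym probabilities of ni's coordinate terms. The function name `interpolate`, the weights `lam1` and `lam2` (placeholder values), and the toy probabilities are all hypothetical.

```python
def interpolate(p_old, p_coord, ni, nk, lam1=0.5, lam2=0.5):
    """Unnormalized new score that nk is a hypernym of ni.

    p_old[(a, b)]: hypernym-only classifier probability that b is a hypernym of a.
    p_coord[(a, b)]: probability that a and b are coordinate terms.
    lam1, lam2: interpolation weights (placeholder values, not the paper's).
    """
    direct = p_old.get((ni, nk), 0.0)
    # Evidence borrowed from each coordinate term nj of ni.
    via_coordinates = sum(
        p_c * p_old.get((nj, nk), 0.0)
        for (a, nj), p_c in p_coord.items()
        if a == ni
    )
    return lam1 * direct + lam2 * via_coordinates

p_old = {("oak", "tree"): 0.9}    # evidence from sentences containing both nouns
p_coord = {("elm", "oak"): 0.8}   # "elm" and "oak" look like coordinate terms
print(interpolate(p_old, p_coord, "elm", "tree"))
```

In this toy case the pair (elm, tree) never co-occurs, yet it receives a nonzero score through the coordinate term "oak", which is the recall gain the hybrid model is after.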
Result
• The logistic regression hypernym-only model has a 16% relative F-score improvement over the best WordNet classifier.
• The combined hypernym/coordinate model has a 40% relative F-score improvement.
• The best-performing classifier is a hypernym-only model additionally trained on the Wikipedia corpus, with an expanded feature lexicon of 200,000 dependency paths; this classifier shows a 54% improvement over WordNet.