Gibbs Sampling with Treeness Constraint in Unsupervised Dependency Parsing. David Mareček and Zdeněk Žabokrtský, Institute of Formal and Applied Linguistics, Charles University in Prague. September 15, 2011, Hissar, Bulgaria.
Motivations for unsupervised parsing
• We want to parse texts for which we do not have any manually annotated treebanks
  • texts from different domains
  • different languages
• We want to learn sentence structures from the corpus only
  • What if the structures produced by linguists are not suitable for NLP?
  • Annotations are expensive
• It's a challenge: can we beat the supervised techniques in some application?
Outline
• Parser description
  • Priors
  • Models
  • Sampling
  • Sampling constraints
    • Treeness
    • Root fertility
    • Noun-root dependency repression
• Evaluation
  • on the Czech treebank
  • on all 19 CoNLL treebanks from the shared tasks 2006-2007
• Conclusions
Basic features of our approach
• Learning is based on Gibbs sampling
• We approximate the probability of a tree by a product of probabilities of individual edges
• We used only POS tags for predicting a dependency relation
  • but we plan to use lexicalization and unsupervised POS tagging in the future
• We introduce treeness as a hard constraint in the sampling procedure
• It allows non-projective edges
Models
• We use two simple models in our experiments
  • the parent POS tag conditioned on the child POS tag
  • the edge length (signed distance between the two words) conditioned on the child POS tag
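As a rough sketch of how the two models might combine under the edge-factored approximation: assuming the tree probability is simply the product of both model terms over all words (the exact formula from the slides is not reproduced here), with $t_i$ the POS tag of word $i$ and $\pi(i)$ the position of its parent. The sign convention chosen for the distance $d_i$ is an illustrative assumption.

```latex
P(T) \approx \prod_{i=1}^{n} P\bigl(t_{\pi(i)} \mid t_i\bigr)\, P\bigl(d_i \mid t_i\bigr),
\qquad d_i = i - \pi(i)
```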
Gibbs sampling
• We sample each dependency edge independently
  • 50 iterations
• The rich get richer (self-reinforcing behavior)
  • counts are taken from the history
• Exchangeability
  • we can treat each edge as if it were the last one in the corpus
  • numerators and denominators in the product are exchangeable
• Dirichlet hyperparameters α1 and α2 were set experimentally
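The count-based, self-reinforcing behavior can be illustrated with a small sketch. It assumes the usual Dirichlet-smoothed predictive form (count plus α, divided by context count plus α times the number of outcomes); the slides do not preserve the exact formula, and the class name `EdgeModel` and its methods are hypothetical.

```python
from collections import Counter


class EdgeModel:
    """Predictive distribution from history counts with a symmetric
    Dirichlet prior. The smoothing form is an assumption, not taken
    from the slides."""

    def __init__(self, alpha, num_outcomes):
        self.alpha = alpha                # Dirichlet hyperparameter
        self.num_outcomes = num_outcomes  # size of the outcome vocabulary
        self.pair_counts = Counter()      # (context, outcome) counts
        self.context_counts = Counter()   # context counts

    def prob(self, outcome, context):
        # Outcomes seen often in the history get a higher probability.
        num = self.pair_counts[(context, outcome)] + self.alpha
        den = self.context_counts[context] + self.alpha * self.num_outcomes
        return num / den

    def add(self, outcome, context):
        self.pair_counts[(context, outcome)] += 1
        self.context_counts[context] += 1

    def remove(self, outcome, context):
        # Thanks to exchangeability, an edge can be taken out of the
        # history and resampled as if it were the last one in the corpus.
        self.pair_counts[(context, outcome)] -= 1
        self.context_counts[context] -= 1


model = EdgeModel(alpha=1.0, num_outcomes=3)
before = model.prob("V", "N")   # parent tag V given child tag N, no history
model.add("V", "N")
after = model.prob("V", "N")    # higher after observing one such edge
```

One such model could be kept for parent tags (with α1) and one for edge lengths (with α2); in a Gibbs step the current edge is removed, a new parent is sampled, and the new edge is added back.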
Basic sampling
• For each node, sample its parent with respect to the probability distribution
• The sampling order of the nodes is random
• Problem: this may create cycles and disconnected graphs
[Figure: candidate edges with their probabilities over the example sentence "ROOT Její dcera byla včera v zoologické zahradě." ("Her daughter was at the zoo yesterday."), with the nodes labeled 5 3 2 6 7 1 4]
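The unconstrained procedure and the problem it causes can be sketched as follows. This is a minimal illustration, assuming nodes are numbered 1..n with 0 standing for ROOT; the `weight(child, parent)` callback is a hypothetical stand-in for the model probabilities.

```python
import random


def sample_parents(n, weight, rng):
    """Sample a parent for every node 1..n in random order (0 is ROOT).
    `weight(child, parent)` is a hypothetical positive scoring function."""
    parents = {}
    order = list(range(1, n + 1))
    rng.shuffle(order)
    for child in order:
        candidates = [p for p in range(0, n + 1) if p != child]
        weights = [weight(child, p) for p in candidates]
        parents[child] = rng.choices(candidates, weights=weights, k=1)[0]
    return parents


def find_cycle(parents):
    """Return the nodes of one cycle, or None if every node reaches ROOT."""
    for start in parents:
        seen = []
        node = start
        while node != 0 and node not in seen:
            seen.append(node)
            node = parents[node]
        if node != 0:
            # We came back to a node already on the path: that is a cycle.
            return seen[seen.index(node):]
    return None


rng = random.Random(0)
sampled = sample_parents(7, lambda child, parent: 1.0, rng)
cycle = find_cycle(sampled)  # may be a list of nodes or None
```

Because each parent is drawn independently, nothing prevents two nodes from choosing each other, which is exactly the situation the treeness constraint has to repair.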
Treeness constraint
• In case a cycle is created:
  • choose one edge in the cycle (by sampling) and delete it
  • take the formed subtree and attach it to one of the remaining nodes (by sampling)
[Figure: edge probabilities over the example sentence "ROOT Její dcera byla včera v zoologické zahradě."]
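The two repair steps can be sketched in code. This is only an illustration under stated assumptions: the tree is valid before the new edge is added, the edge to delete is chosen uniformly from the cycle (the slides say only "by sampling"), and `weight(child, parent)` is a hypothetical scoring function.

```python
import random


def descendants(parents, root):
    """All nodes whose parent chain passes through `root`, plus root."""
    result = {root}
    changed = True
    while changed:
        changed = False
        for child, parent in parents.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def cycle_through(parents, node):
    """Return the cycle containing `node`, or None if it reaches ROOT (0)."""
    path = [node]
    cur = parents[node]
    while cur != 0 and cur != node and len(path) <= len(parents):
        path.append(cur)
        cur = parents[cur]
    return path if cur == node else None


def attach_with_treeness(parents, child, new_parent, weight, rng):
    """Set child's parent; if that closes a cycle, break it and reattach."""
    parents[child] = new_parent
    cycle = cycle_through(parents, child)
    if cycle is None:
        return parents
    # Step 1: delete one edge of the cycle (uniform choice is an assumption).
    detached = rng.choice(cycle)
    parents[detached] = None
    # Step 2: attach the resulting subtree to a node outside of it.
    inside = descendants(parents, detached)
    candidates = [p for p in [0] + list(parents) if p not in inside]
    weights = [weight(detached, p) for p in candidates]
    parents[detached] = rng.choices(candidates, weights=weights, k=1)[0]
    return parents


tree = {1: 0, 2: 1, 3: 2, 4: 0}
# Attaching node 1 under node 3 would close the cycle 1 -> 3 -> 2 -> 1.
tree = attach_with_treeness(tree, 1, 3, lambda c, p: 1.0, random.Random(1))
```

Since the reattachment target is always taken from outside the detached subtree, the result is a tree again after every step.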
Root fertility constraint
• Individual phrases tend to be attached to the technical root
• A sentence usually has only one word (the main verb) that dominates the others
• We constrain the root fertility to be one
• If the root has more than one child, we do the resampling:
  • sample one child that will stay under the root
  • resample the parents of the other children
[Figure: edge probabilities over the example sentence "ROOT Její dcera byla včera v zoologické zahradě."]
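A minimal sketch of this resampling step, assuming the input is already a valid tree (0 is ROOT), that the surviving root child is chosen uniformly, and that `weight(child, parent)` is a hypothetical scoring function; none of these details are specified on the slide.

```python
import random


def subtree(parents, root):
    """All nodes whose parent chain passes through `root`, plus root."""
    result = {root}
    changed = True
    while changed:
        changed = False
        for child, parent in parents.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def enforce_single_root_child(parents, weight, rng):
    """Keep one sampled child under ROOT and resample the others' parents."""
    root_children = [c for c, p in parents.items() if p == 0]
    if len(root_children) <= 1:
        return parents
    keep = rng.choice(root_children)  # uniform choice is an assumption
    for child in root_children:
        if child == keep:
            continue
        # New parent must be a real word outside the child's own subtree,
        # so the structure stays a tree and ROOT is excluded.
        inside = subtree(parents, child)
        candidates = [p for p in parents if p not in inside]
        weights = [weight(child, p) for p in candidates]
        parents[child] = rng.choices(candidates, weights=weights, k=1)[0]
    return parents


tree = {1: 0, 2: 0, 3: 0, 4: 3}
tree = enforce_single_root_child(tree, lambda c, p: 1.0, random.Random(2))
root_fertility = sum(1 for head in tree.values() if head == 0)
```

The kept child is never inside a sibling's subtree, so there is always at least one candidate parent for each child that gets resampled.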
Noun-ROOT dependency repression
• Nouns (especially subjects) often substitute for verbs in the governing positions
• The majority of grammars are verbocentric
• Nouns can be easily recognized as the most frequent coarse-grained tag category in the corpus
• We add the following model: [formula not preserved]
• This model is useless when unsupervised POS tagging is used
Evaluation measures
• Evaluating an unsupervised parser on gold data is problematic
  • many linguistic decisions had to be made before annotating each corpus
  • how should coordination structures, auxiliary verbs, prepositions, and subordinating conjunctions be handled?
• We use the following three measures:
  • UAS (unlabeled attachment score): the standard metric for evaluating dependency parsers
  • UUAS (undirected unlabeled attachment score): edge direction is disregarded (it is not a mistake if governor and dependent are switched)
  • NED (neutral edge direction, Schwartz et al., 2011): treats not only a node's gold parent and child as the correct answer, but also its gold grandparent
• UAS ≤ UUAS ≤ NED
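The three measures can be computed side by side with a short sketch that follows the definitions as worded above. The function name `attachment_scores` and the dictionary representation (node to head, with 0 as ROOT) are assumptions made for illustration.

```python
def attachment_scores(gold, pred):
    """Return (UAS, UUAS, NED) for trees given as {node: head}, 0 = ROOT."""
    uas = uuas = ned = 0
    for node, head in pred.items():
        # Predicted head is the gold parent.
        is_parent = gold[node] == head
        # Predicted head is a gold child of the node (reversed edge).
        is_child = head != 0 and gold[head] == node
        # Predicted head is the gold grandparent of the node.
        is_grandparent = gold[node] != 0 and gold[gold[node]] == head
        uas += is_parent
        uuas += is_parent or is_child
        ned += is_parent or is_child or is_grandparent
    n = len(pred)
    return uas / n, uuas / n, ned / n


gold = {1: 2, 2: 0, 3: 2}
pred = {1: 0, 2: 1, 3: 1}
scores = attachment_scores(gold, pred)
# Node 1 is attached to its gold grandparent (counts for NED only),
# node 2 to its gold child (counts for UUAS and NED), node 3 is wrong.
```

Each measure accepts everything the previous one accepts, which is why the scores can only grow from UAS to UUAS to NED.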
Evaluation on Czech
• Czech dependency treebank from the CoNLL 2007 shared task
• Punctuation removed
• Sentences of at most 15 words
Error analysis for Czech
• Many errors are caused by reversed dependencies
  • preposition – noun
  • subordinating conjunction – verb
Evaluation on 19 CoNLL languages
• We took the dependency treebanks from the CoNLL shared tasks 2006 and 2007
• POS tags from the fifth column were used
• The parser was run on the concatenated training and development sets
• Punctuation was removed
• Evaluation was done on the development sets only
• We compare our results with the state-of-the-art system, which is based on DMV (Spitkovsky et al., 2011)
Conclusions
• We introduced a new approach to unsupervised dependency parsing
• Even though only a couple of experiments have been done so far and only POS tags with no lexicalization are used, the results seem to be competitive with the state-of-the-art unsupervised parsers (DMV)
  • We have better UAS for 12 languages out of 19
  • If we do not use noun-root dependency repression, which is useful only with supervised POS tags, we have better scores for 7 languages out of 19
Future work
• We would like to add:
  • A word fertility model, to model the number of children for each node
  • Lexicalization
    • the word forms themselves should be useful
  • Unsupervised POS tagging
    • some recent experiments show that using word classes instead of supervised POS tags can improve the parsing accuracy