Universal Dependencies. Joakim Nivre, Uppsala University. Background: Treebank annotation schemes vary across languages. Hard to compare results across languages [Nivre et al. 2007]. Hard to evaluate cross-lingual learning [McDonald et al. 2013].
Universal Dependencies
Joakim Nivre, Uppsala University
Universal Dependencies
• Background:
  • Treebank annotation schemes vary across languages
  • Hard to compare results across languages [Nivre et al. 2007]
  • Hard to evaluate cross-lingual learning [McDonald et al. 2013]
  • Hard to build multilingual systems
• Universal Dependencies (http://universaldependencies.github.io/docs/):
  • Stanford universal dependencies [de Marneffe et al. 2014]
  • Google universal part-of-speech tags [Petrov et al. 2012]
  • Interset morphological features [Zeman 2008]
• First guidelines released Oct 1, 2014
• First 10 treebanks released Jan 15, 2015
Universal Dependencies
• Syntactic words – explicit splitting of clitics and contractions
• Universal part-of-speech tags + morphological features
• Dependency tree + augmented dependencies (not shown)
Goals
• Cross-linguistically consistent grammatical annotation
• Support multilingual NLP and linguistic research
• Build on common usage and existing de facto standards
• Complement – not replace – language-specific schemes
• Open community effort – anyone can contribute
Guiding Principles
• Maximize parallelism
• Don't annotate the same thing in different ways
• Don't make different things look the same
• Don't annotate things that are not there
• Languages select from a universal pool of categories
• Allow language-specific extensions
Design Principles
• Dependency
  • Widely used in practical NLP systems
  • Available in treebanks for many languages
• Lexicalism
  • Basic annotation units are words – syntactic words
  • Words have morphological properties
  • Words enter into syntactic relations
• Recoverability
  • Transparent mapping from input text to word segmentation
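As an illustration of how lexicalism and recoverability fit together, the sketch below splits contractions into syntactic words while recording which surface token each word came from, so the input text can be rebuilt. The lookup table `CONTRACTIONS` and the names `SyntacticWord`, `segment` and `recover_text` are hypothetical, invented for this example; real treebanks rely on much richer, language-specific segmentation rules.

```python
from dataclasses import dataclass

# Hypothetical toy lookup: two French contractions and their syntactic words.
CONTRACTIONS = {"du": ["de", "le"], "au": ["à", "le"]}


@dataclass
class SyntacticWord:
    form: str           # the syntactic word that receives annotation
    surface_index: int  # index of the surface token it was derived from


def segment(tokens):
    """Split surface tokens into syntactic words, keeping a link to the source token."""
    words = []
    for i, tok in enumerate(tokens):
        for part in CONTRACTIONS.get(tok.lower(), [tok]):
            words.append(SyntacticWord(part, i))
    return words


def recover_text(tokens, words):
    """Rebuild the surface token sequence by following the word-to-token links."""
    seen = []
    for w in words:
        if not seen or seen[-1] != w.surface_index:
            seen.append(w.surface_index)
    return " ".join(tokens[i] for i in seen)


tokens = ["le", "prix", "du", "livre"]
words = segment(tokens)
print([w.form for w in words])       # ['le', 'prix', 'de', 'le', 'livre']
print(recover_text(tokens, words))   # le prix du livre
```

Annotation attaches to the five syntactic words, while the stored indices keep the mapping back to the four input tokens transparent.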
Morphological Annotation
• The lemma represents the semantic content of a word
• The part-of-speech tag represents its grammatical class
• Features represent lexical and grammatical properties of the lemma or the particular word form
Syntactic Annotation
• Content words are related by dependency relations
• Function words attach to the content word they modify
• Punctuation attaches to the head of the phrase or clause
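The attachment rules above can be sketched as a simple check over a dependency tree. The sentence, its hand-made annotation, and the names `SENTENCE`, `NON_CONTENT`, `dependents` and `content_heads_only` are assumptions made for this example. Treating function words as strict leaves is a simplification of the principle, not a rule stated on the slide.

```python
# Each token: (id, form, upos, head, deprel); head 0 marks the root.
# Hypothetical example sentence, annotated by hand for illustration.
SENTENCE = [
    (1, "the", "DET", 2, "det"),
    (2, "dog", "NOUN", 4, "nsubj"),
    (3, "has", "AUX", 4, "aux"),
    (4, "barked", "VERB", 0, "root"),
    (5, ".", "PUNCT", 4, "punct"),
]

# Assumed set of tags treated as function words or punctuation in this sketch.
NON_CONTENT = {"DET", "AUX", "ADP", "PUNCT"}


def dependents(sentence, head_id):
    """Return all tokens whose head is the token with the given id."""
    return [tok for tok in sentence if tok[3] == head_id]


def content_heads_only(sentence):
    """True if no function word or punctuation mark governs another token."""
    return all(
        not dependents(sentence, tok[0])
        for tok in sentence
        if tok[2] in NON_CONTENT
    )


print(content_heads_only(SENTENCE))  # True
```

Here the content words "dog" and "barked" are linked directly, and the determiner, auxiliary and full stop each hang off a content word.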
Dependency Structure
• Keeping content words as heads promotes parallelism
• Function words often correlate with morphology
[Figure: example dependency structures for English and Swedish]
Dependency Relations [de Marneffe et al. 2014]
• Taxonomy of 42 universal grammatical relations, broadly attested across languages in linguistic typology
• Language-specific subtypes can be added
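A minimal sketch of how a tool might separate a universal relation from a language-specific subtype, assuming the colon-separated `universal:subtype` notation of the UD guidelines (for example `acl:relcl`), which the slide itself does not show. The function name `split_relation` is hypothetical.

```python
def split_relation(deprel):
    """Split 'universal:subtype' into (universal, subtype); subtype is None if absent."""
    universal, _, subtype = deprel.partition(":")
    return universal, subtype or None


print(split_relation("acl:relcl"))  # ('acl', 'relcl')
print(split_relation("nsubj"))      # ('nsubj', None)
```

Falling back to the universal part lets cross-lingual tools ignore subtypes they do not know while still comparing treebanks on the shared taxonomy.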
Morphology: POS
• Taxonomy of 17 universal part-of-speech tags, based on the Google Universal Tagset [Petrov et al. 2012]
Morphology: Universal Features
• Standardized inventory of morphological features, based on the Interset system [Zeman 2008]
Morphology: Examples
• la Definite=Def|Gender=Fem|Number=Sing|PronType=Art
• hanno Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
• fatto Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part
• casa Gender=Fem|Number=Sing
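The feature strings in these examples follow a `Name=Value|Name=Value` pattern, so they are easy to read into a dictionary. The sketch below uses the four Italian forms from the slide. The helper names `parse_feats` and `agree` are hypothetical, and the agreement check is only an illustration of what the parsed features make possible.

```python
def parse_feats(feats):
    """Parse a 'Name=Value|Name=Value' string into a dict of feature names to values."""
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# The four word forms and feature strings shown on the slide.
EXAMPLES = {
    "la": "Definite=Def|Gender=Fem|Number=Sing|PronType=Art",
    "hanno": "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin",
    "fatto": "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part",
    "casa": "Gender=Fem|Number=Sing",
}

parsed = {form: parse_feats(feats) for form, feats in EXAMPLES.items()}


def agree(a, b, names=("Gender", "Number")):
    """True if two feature dicts carry the same value for every listed feature."""
    return all(a.get(n) == b.get(n) for n in names)


print(parsed["hanno"]["Person"])              # 3
print(agree(parsed["la"], parsed["casa"]))    # True
print(agree(parsed["fatto"], parsed["casa"])) # False
```

Note that all values stay strings (`"3"`, not the integer 3), which matches how they appear in the annotation.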