
Linking Etymological Database: A case study in Germanic

LDL 2014, LREC, Reykjavik, Iceland, 27th May 2014. Linking Etymological Database: A case study in Germanic. Christian Chiarcos, Maria Sukhareva, Goethe University Frankfurt am Main.


Presentation Transcript


  1. LDL 2014, LREC, Reykjavik, Iceland, 27th May 2014. Linking Etymological Database: A case study in Germanic. Christian Chiarcos, Maria Sukhareva, Goethe University Frankfurt am Main

  2. Overview • Background • Linked Etymological Dictionaries • Enriching of Linked Etymological Dictionaries • Application • Conclusion

  3. Background

  4. Background • Processing of Old Germanic languages at Goethe University Frankfurt, in collaboration between: • Empirical Linguistics, Thesaurus of Indo-European Text and Language Materials (TITUS) • ACoLi Lab (Applied Computational Linguistics) • LOEWE Cluster “Digital Humanities” • DFG-funded Old German Reference Corpus (DDD) [Logos: TITUS; DDD Referenzkorpus Althochdeutsch]

  5. Linked Etymological Data

  6. Linked Etymological Data

  7. Linked Etymological Data: Conversion of etymological dictionaries to RDF • Linkability: representation of relations within and beyond lexicons • Interoperability: (meta)data representation through community-maintained vocabularies (lexvo, Glottolog, OLiA, lemon) • Inference: filling the logical gaps of the original XML representation, e.g., symmetric closure of cross-references

  8. Linked Etymological Data • All language identifiers were mapped from the original abbreviations to ISO 639-3 codes wherever possible. • lemonet:translates: a relation between lemon:LexicalEntry instances. • lemonet:etym: etymological links between languages, transitive and symmetric; a subproperty of lemon:lexicalVariant.

  9. Linked Etymological Data • Original XML (lemma) • RDF triples • Symmetric closure of etymological relations, generated by a SPARQL pattern • Links to external resources
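The symmetric closure mentioned on this slide can be illustrated with a minimal Python sketch. It is not the authors' conversion pipeline: triples are modeled as plain tuples, the entry identifiers (`ex:osx_entry_1`, `ex:goh_entry_1`) are hypothetical, and the SPARQL update in the comment is only an assumed equivalent of the pattern the slide refers to.

```python
# Minimal sketch: triples are plain (subject, predicate, object) tuples.
# The entry identifiers below are hypothetical, not taken from the dictionaries.
ETYM = "lemonet:etym"


def symmetric_closure(triples, predicate=ETYM):
    """Return the input triples plus the inverse of every `predicate` triple."""
    closed = set(triples)
    for s, p, o in triples:
        if p == predicate:
            closed.add((o, p, s))  # add the missing back-link
    return closed


# Assumed SPARQL counterpart (not quoted from the slides):
#   INSERT { ?b lemonet:etym ?a } WHERE { ?a lemonet:etym ?b }
triples = {
    ("ex:osx_entry_1", ETYM, "ex:goh_entry_1"),
    ("ex:osx_entry_1", "rdfs:label", "hypothetical label"),
}
closed = symmetric_closure(triples)
```

Only triples with the etymological predicate are mirrored; other statements, such as labels, are left untouched.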

  10. Enriching Etymological Dictionaries

  11. Enriching Etymological Dictionaries • Germanic parallel Bible corpus (parentheses indicate marginal fragments with fewer than 50,000 tokens)

  12. Enriching Etymological Dictionaries • Statistical word alignment of parallel texts (GIZA++) • Lexical translation tables as basis for the extracted word lists: • Unidirectional: maximum of $P(w_t \mid w_s)$ • Bidirectional: maximum of $P(w_t \mid w_s)\,P(w_s \mid w_t)$ • Pruning by frequency
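The selection step on this slide can be sketched as follows. This is a hedged illustration, not the authors' code: the nested-dictionary layout of the translation tables, the function name `best_translations`, the `min_freq` threshold, and the toy probabilities are all assumptions made for the example, and reading the tables from GIZA++ output is not shown.

```python
def best_translations(p_t_given_s, p_s_given_t=None, source_freq=None, min_freq=1):
    """Pick one target word per source word from lexical translation tables.

    p_t_given_s[ws][wt] holds P(wt | ws); p_s_given_t[wt][ws] holds P(ws | wt).
    Without p_s_given_t the score is unidirectional, otherwise bidirectional.
    """
    pairs = {}
    for ws, candidates in p_t_given_s.items():
        # Pruning by frequency: skip source words that are too rare.
        if source_freq is not None and source_freq.get(ws, 0) < min_freq:
            continue

        def score(wt):
            forward = candidates[wt]
            if p_s_given_t is None:
                return forward  # unidirectional: P(wt | ws)
            backward = p_s_given_t.get(wt, {}).get(ws, 0.0)
            return forward * backward  # bidirectional: P(wt | ws) * P(ws | wt)

        pairs[ws] = max(candidates, key=score)
    return pairs


# Toy tables with made-up probabilities (hypothetical words "a", "b", "x", "y").
p_t_given_s = {"a": {"x": 0.6, "y": 0.4}, "b": {"y": 0.9, "x": 0.1}}
p_s_given_t = {"x": {"a": 0.2, "b": 0.1}, "y": {"a": 0.7, "b": 0.3}}

unidirectional = best_translations(p_t_given_s)
bidirectional = best_translations(p_t_given_s, p_s_given_t)
pruned = best_translations(p_t_given_s, source_freq={"a": 5, "b": 1}, min_freq=2)
```

In the toy data the two scores disagree for "a": the forward probability alone prefers "x", while the product of both directions prefers "y".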

  13. Application

  14. Application: Thematic alignment of Bible paraphrases • E.g., cross-references within the Bible and between the Bible and gospel harmonies • An interlinked index of thematically similar sections in the gospels and the Old Saxon (OS) and Old High German (OHG) gospel harmonies • The section-level alignment of the OS Heliand and the OHG Tatian (Sievers, 1872) has been digitized • 4560 inter-text groups based on the Eusebian canon • Basis for a more fine-grained level of alignment

  15. Application: similarity metrics $\delta(w_{OS}; w_{OHG})$ for every OS word $w_{OS}$ and its potential OHG cognate $w_{OHG}$
      Character-based similarity measures:
      • GEOMETRY: $\delta$ = difference between the relative positions of $w_{OS}$ and $w_{OHG}$
      • IDENTITY: $\delta(w_{OS}; w_{OHG}) = 1$ iff $w_{OHG} = w_{OS}$ (0 otherwise)
      • ORTHOGRAPHY: relative Levenshtein distance and statistical character replacement probability (Neubig et al., 2012)
      • NORMALIZATION: $\delta_{norm}(w_{OS}; w_{OHG}) = \delta(w'_{OS}; w_{OHG})$, with $w'_{OS}$ being the OHG ‘normalization’ of $w_{OS}$ (Bollmann et al., 2011)
      • COOCCURRENCES: $\delta(w_{OS}; w_{OHG}) = P(w_{OS} \mid w_{OHG})\,P(w_{OHG} \mid w_{OS})$
      Lexicon-based similarity measures: $\delta_{lex}(w_{OS}; w_{OHG}) = 1$ iff $w_{OHG} \in W$ (0 otherwise), where $W$ is the set of possible OHG translations for $w_{OS}$ suggested by a lexicon, i.e., either:
      • ETYM: etymological link in (the symmetric closure of) the etymological dictionaries,
      • ETYM-INDIRECT: shared German gloss in the etymological dictionaries,
      • TRANSLATIONAL DIRECT: link in the translational dictionaries,
      • TRANSLATIONAL INDIRECT: indirectly linked in the translational dictionaries through a third language.
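Three of these measures can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: "relative Levenshtein distance" is read here as edit distance divided by the length of the longer word and turned into a similarity by subtracting from 1, the statistical character replacement model is left out, and the one-entry lexicon is a toy stand-in for the dictionaries.

```python
def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def delta_identity(w_os, w_ohg):
    """IDENTITY: 1 iff both word forms are the same string."""
    return 1.0 if w_os == w_ohg else 0.0


def delta_orthography(w_os, w_ohg):
    """ORTHOGRAPHY (Levenshtein part only), assumed normalization by max length."""
    longest = max(len(w_os), len(w_ohg))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(w_os, w_ohg) / longest


def delta_lex(w_os, w_ohg, lexicon):
    """Lexicon-based: 1 iff w_ohg is among the translations W listed for w_os."""
    return 1.0 if w_ohg in lexicon.get(w_os, set()) else 0.0


# Toy lexicon (illustrative): maps an OS form to a set W of OHG candidates.
toy_lexicon = {"word": {"wort"}}

identity = delta_identity("word", "wort")
orthography = delta_orthography("word", "wort")
lexical = delta_lex("word", "wort", toy_lexicon)
```

The same `delta_lex` function covers all four lexicon-based variants; they differ only in how the candidate set is derived from the dictionaries.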


  17. Conclusion & Discussion

  18. Conclusion • Application of the Linked Data paradigm to the modeling of etymological dictionaries • Adoption of the lemon core model • Representation of Köbler’s dictionary in a machine-readable format • Enrichment of etymological dictionaries with automatically obtained translation pairs • Initial experiment on using the dictionaries for the alignment of quasi-parallel texts

  19. lemon & etymology: A square peg for a round hole? • lemon gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD.

  20. lemon & etymology: A square peg for a round hole? • lemon gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD. • … but many of these resources are created by (or for) linguists rather than ontologists. • The original motivation for lemon was to lexicalize ontologies. • Quite a different problem from the interoperability issues that linguists are trying to solve by using it.

  21. lemon & etymology: A square peg for a round hole? • lemon gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD. • But obviously, our usage of lemon is slightly abusive. • Etymological and translational links between WordForms? • No external ontology to ground senses? • No word senses at all? • But that is symptomatic of linguistic resources in a strict sense. • Similar problems were observed by Cysouw & Moran for multilingual dictionaries of South American indigenous languages.

  22. lemon & etymology: A square peg for a round hole? • lemon gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD. • But obviously, our usage of lemon is slightly abusive. • Etymological and translational links between word forms? • No external ontology to ground senses? • No word senses at all? • But that is symptomatic of linguistic resources in a strict sense. • What can we do about this state of affairs? • Would there have been alternative ways to model our data? • Shall we extend/abandon/replace/adjust lemon?

  23. Takk fyrir! (Thank you!)
