
Types and Tokens Distribution in TITUS (Distribution of Word Forms in the TITUS Corpus)

Dr. Svetlana Ahlborn, Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main. E-Mail: l.ahlborn@em.uni-frankfurt.de


Presentation Transcript


  1. Types and Tokens Distribution in TITUS (Distribution of Word Forms in the TITUS Corpus). Dr. Svetlana Ahlborn, Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main. E-Mail: l.ahlborn@em.uni-frankfurt.de

  2. Outline • TITUS Resource Data • Peculiarities of TITUS texts • Tokens and Types calculation in TITUS Resources • Metadata for Tokens and Types distribution 26.06.2013

  3. TITUS Resource Data • TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) http://titus.uni-frankfurt.de • TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens. A token represents a concrete occurrence of a linguistic unit; a type bundles the tokens that are associated with each other.
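The token/type distinction on this slide can be illustrated with a minimal sketch. It assumes that a token is a maximal run of word characters and that two tokens belong to the same type when they are identical after lowercasing. Both are simplifying assumptions, not the TITUS definition, and the function name and sample string are made up for illustration.

```python
import re
from collections import Counter

def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a plain-text string."""
    # Assumption: a token is a run of word characters; tokens that are
    # equal after lowercasing are bundled into one type.
    tokens = re.findall(r"\w+", text.lower())
    types = Counter(tokens)
    return len(tokens), len(types)

# Invented sample string, not taken from a TITUS text.
sample = "in principio erat verbum et verbum erat apud deum"
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)
```

In the sample, "erat" and "verbum" each occur twice, so the number of types is smaller than the number of tokens.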

  4. TITUS Data http://www.clarin.eu/node/1512 Added by J. Gippert, R. Mittmann

  5. TITUS Search Engine • The TITUS search engine does not determine the number of tokens in a given text, but the number of quotations of a word.

  6. Peculiarities of TITUS texts: Gothic • The Biblia Gothica contains additional parallel passages in Latin and Greek. Biblia Gothica (http://titus.uni-frankfurt.de/texte/etcs/germ/got/gotnt/gotnt.htm).

  7. Peculiarities of TITUS texts: Old Church Slavonic • Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet, the original form of the text, and in Cyrillic. Codex Marianus (http://titus.uni-frankfurt.de/texte/etcs/slav/aksl/marianus/maria.htm).

  8. Peculiarities of TITUS texts: Old Polish • Old Polish texts display several editions side by side that were produced at different times. Kazania Świętokrzyskie (http://titus.uni-frankfurt.de/texte/etcs/slav/apoln/kazania/kazan.htm).

  9. Peculiarities of TITUS texts: Ossetian • The Ossetian Nart epic is represented in Latin script and in extended Cyrillic. Ossetian: Nart epic (http://titus.uni-frankfurt.de/texte/etcs/iran/niran/oss/nart/nart.htm).

  10. Peculiarities of TITUS texts: Russian-Low German • Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language variants.

  11. Peculiarities of TITUS texts: Old Prussian • The Old Prussian corpus consists of at least 21 different languages or language variants (including Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).
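Because a single resource can mix several languages, token and type counts are more informative when kept per language. The following is a minimal sketch under a strong assumption: every token already carries a language label. How TITUS assigns such labels is not described on the slides, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

def counts_by_language(labeled_tokens):
    """Map each language label to a (token count, type count) pair."""
    token_counts = defaultdict(int)
    type_sets = defaultdict(set)
    for language, word in labeled_tokens:
        token_counts[language] += 1
        # Assumption: tokens equal after lowercasing form one type.
        type_sets[language].add(word.lower())
    return {lang: (token_counts[lang], len(type_sets[lang]))
            for lang in token_counts}

# Invented (language, word) pairs, used only to show the bookkeeping.
labeled = [
    ("Latin", "et"), ("Latin", "verbum"), ("Latin", "et"),
    ("Greek", "kai"), ("Greek", "logos"),
]
print(counts_by_language(labeled))
```

Types are tracked separately per language, so the same spelling in two languages counts as two types.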

  12. Creation • A digitized source consists not only of words in the source language; it also contains various information that does not originally belong to the document: numbers, tags, punctuation marks, edition information, etc.
      $zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi;  # removes an optional number, whitespace, and the marker <†‡„>
      $zeile =~ s/\d*\s+<\W<?ConvertCheck:\s+LevelNameTooLong>//g;  # removes an optional number, whitespace, and <?ConvertCheck: LevelNameTooLong>
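The kind of cleanup described on this slide can be sketched in Python as follows. The three patterns are hypothetical generalizations (any tag, any digit run, any punctuation), not a port of the two substitutions shown above, and the actual TITUS scripts may differ. The input line is invented.

```python
import re

# Hypothetical cleanup rules for illustration only.
TAG = re.compile(r"<[^>]*>")     # markup such as <?ConvertCheck: LevelNameTooLong>
NUMBER = re.compile(r"\d+")      # line, verse, or page numbers
PUNCT = re.compile(r"[^\w\s]")   # punctuation marks and other symbols

def clean_line(line):
    """Remove tags, numbers, and punctuation, then normalize whitespace."""
    line = TAG.sub(" ", line)
    line = NUMBER.sub(" ", line)
    line = PUNCT.sub(" ", line)
    return " ".join(line.split())

# Invented input line mixing text with material that is not part of it.
raw = "12 <?ConvertCheck: LevelNameTooLong> jah qaþ: 3 <b>sa</b> ist."
print(clean_line(raw))
```

Removing every digit run is a blunt rule: it would also damage word forms that legitimately contain digits, so a real pipeline would need narrower patterns, as the targeted substitutions on the slide suggest.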

  13. Examples: Gothic

  14. Examples: Gothic

  15. Examples: Tönnies Fenne's Manual (17th century) • This textbook of spoken Russian is written mainly in Russian in Latin transcription and in Low German.

  16. Examples: further application

  17. Metadata • DC – Dublin Core • TEI – Text Encoding Initiative • CEI – Corpus Encoding Initiative • IMDI – ISLE Meta Data Initiative • OLAC – Open Language Archives Community • CMDI – Component MetaData Infrastructure

  18. CMDI – Component MetaData Infrastructure http://www.clarin.eu/cmdi

  19. TITUS Metadata: HTML Format

  20. New Metadata Set for TITUS

  21. Metadata Example for TITUS – XML CMDI
      <ResourcePublicationTimeElectronic>16.6.2002</ResourcePublicationTimeElectronic>
      <ResourceWordcountGeneral>
        <Tokens>1629 Tokens</Tokens>
        <Types>893 Types</Types>
      </ResourceWordcountGeneral>
      <ResourceWordcountTT>
        <Language></Language>
        <LanguageTokensTypes> Tokens | Types</LanguageTokensTypes>
      </ResourceWordcountTT>
      <ResourceWordcountTT>
        <Language>Language 1_General</Language>
        <LanguageTokensTypes>10 Tokens | 9 Types</LanguageTokensTypes>
      </ResourceWordcountTT>
      <ResourceWordcountTT>
        <Language>Language 2_Gothic</Language>
        <LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes>
      </ResourceWordcountTT>
      <ResourceWordcountTT>
        <Language>Language 4_Latin</Language>
        <LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes>
      </ResourceWordcountTT>
      <ResourceWordcountTT>
        <Language>Language 5_Greek</Language>
        <LanguageTokensTypes>627 Tokens | 319 Types</LanguageTokensTypes>
      </ResourceWordcountTT>
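As a rough illustration of how the counts in this fragment could be read back, the sketch below parses it with Python's standard `xml.etree.ElementTree`. The element names are taken from the slide. The `<Wrapper>` root and the helper `first_int` are made up here, since the fragment has several top-level elements and is not a complete CMDI record.

```python
import re
import xml.etree.ElementTree as ET

fragment = """
<ResourceWordcountGeneral><Tokens>1629 Tokens</Tokens><Types>893 Types</Types></ResourceWordcountGeneral>
<ResourceWordcountTT><Language></Language><LanguageTokensTypes> Tokens | Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 1_General</Language><LanguageTokensTypes>10 Tokens | 9 Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 2_Gothic</Language><LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 4_Latin</Language><LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 5_Greek</Language><LanguageTokensTypes>627 Tokens | 319 Types</LanguageTokensTypes></ResourceWordcountTT>
"""

# Hypothetical wrapper element so the fragment parses as one document.
root = ET.fromstring("<Wrapper>" + fragment + "</Wrapper>")

def first_int(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None

general = root.find("ResourceWordcountGeneral")
total_tokens = first_int(general.findtext("Tokens"))
total_types = first_int(general.findtext("Types"))

per_language = {}
for entry in root.findall("ResourceWordcountTT"):
    language = entry.findtext("Language")
    numbers = re.findall(r"\d+", entry.findtext("LanguageTokensTypes") or "")
    # Skip the empty placeholder entry, which has no language and no numbers.
    if language and len(numbers) == 2:
        per_language[language] = (int(numbers[0]), int(numbers[1]))

print(total_tokens, total_types)
print(per_language)
```

In this example the per-language token counts (10, 420, 572, 627) add up to the general total of 1629, which makes a simple consistency check possible.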

  22. Metadata for TITUS – Browser

  23. Metadata for TITUS – Browser

  24. Metadata for TITUS – Browser

  25. Thank you for your attention! Links • ARBIL (metadata editor) http://tla.mpi.nl/tools/tla-tools/arbil/ • CLARIN http://www.clarin.eu • CMDI http://www.clarin.eu/cmdi • Dublin Core http://dublincore.org/documents/dcmi-terms/ • IMDI http://www.mpi.nl/IMDI/ • OLAC http://www.language-archives.org/ • TEI http://www.tei-c.org/index.xml • TITUS http://titus.uni-frankfurt.de
