How a scholar working with widely available tools can process textual sources for linguistic and literary research, illustrated by two projects: a re-edition of a folklore collection and a corpus of medieval manuscript transcriptions. The slides cover the challenges of encoding folklore data and of representing punctuation in historical texts, and the solutions adopted.
Processing Textual Sources for Linguistic and Literary Research: What a 'Solitary Scholar' Can Do. Alexei Lavrentiev (Alexei.Lavrentev@ens-lsh.fr), Ecole Normale Supérieure Lettres et Sciences humaines, Lyon, France. University of Kentucky, October 24, 2007
Two projects • Scholarly re-edition of an 1861 “Anonymous” folklore collection • Corpus of Medieval French manuscript transcriptions for the study of punctuation
Folklore Project 2/14 Project Team • Vera Kuznetsova • Senior Researcher, Institute of Philology SB RAS • Specialist in Russian folklore • Olga Laguta • Professor, Novosibirsk State University • Linguist • Alexei Lavrentiev
Folklore Project 3/14 Objectives • Verify the authenticity of the folklore texts in the collection • Analyze the linguistic features of the texts • Learn more about the author of the collection • Make these texts available to the scholarly community
Folklore Project 4/14 Challenges • Encode data in a sustainable format (TEI XML) using available tools: Microsoft Office (Word, Access), XML processing software (XML Spy), Perl • Configure the tools for users with virtually no IT experience
Folklore Project 5/14 Workflow [Diagram; labels: Metadata, Word documents, Perl script, Tokenized XML-TEI documents, XSL stylesheets, Lemmatized XML-TEI documents, Access database, Printed edition, Linguistic analysis, Vocabulary with contexts]
Folklore Project 6/14 Word document
Folklore Project 7/14 Metadata file [1. File name] chtochelovekzakhochet ; [number] 20 ; [2. Text title (in the source)] What a man wants, he will do ; [3. Text title (working)] What a man wants ; [4. Collective editor of the electronic version] Russian Language in Siberia Section, Institute of Philology SB RAS ; [5. Persons responsible] : [role] Text entry and preliminary markup ; [full name] Kuznetsova Vera Stanislavovna, Aleshina Olga Nikolaevna ; [role] Conversion to XML-TEI format, validation ; [full name] Lavrentiev Alexei Mikhailovich . [6. Project information] : Corpus of Russian folklore prose texts (legends) ; [7. Source information] : [Information on editor(s), compiler(s), etc.] : [role] preparation for publication ; [full name] Kuznetsova Vera Stanislavovna ; [role] compiler of the collection ; [full name] anonymous ; [role] author of the recording ; [full name] not specified . [Place of recording] not specified ; [Publisher] F. Ivanov printing house; [Place of publication] Saint Petersburg ; [Year of publication] 1861 ; [ISBN] ???? .
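The metadata file above is a flat sequence of bracketed labels followed by values, with ";", ":" and "." as separators. The following is a minimal Python sketch of how such a file could be read into label/value pairs. It is an illustration only: the project used a Perl script whose code is not shown in the slides, and the function name `parse_metadata` and the separator handling are assumptions.

```python
import re

# One "[label] value" pair; a value runs until the next "[" or the end of the text.
FIELD = re.compile(r"\[([^\]]+)\]\s*([^\[]*)")

def parse_metadata(text):
    """Return (label, value) pairs in document order (hypothetical helper)."""
    pairs = []
    for label, value in FIELD.findall(text):
        # Assumption: ";", ":" and "." around a value are separators, not data.
        pairs.append((label.strip(), value.strip(" ;:.\n\t")))
    return pairs

sample = "[1. File name] chtochelovekzakhochet ; [ISBN] ???? ."
for label, value in parse_metadata(sample):
    print(label, "=", value)
```

Group labels that are followed only by ":" come back with an empty value, so a later step could use them to rebuild the nesting.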
Folklore Project 8/14 Perl script • Takes a Word document saved in HTML (filtered) format • Takes the metadata • Produces an XML-TEI document • Tokenizes the text and assigns IDs to <w> and <s> elements • Transforms analytical markup into <seg type="…"> elements
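The tokenization step listed above can be sketched as follows. This is a Python illustration, not the project's Perl script: the ID scheme (`t1_s1`, `t1_w1`), the naive sentence splitting rule, and the sample sentence are all assumptions, and punctuation is simply dropped to keep the sketch short.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialized as xml:id

def tokenize(text, prefix="t1"):
    """Wrap sentences in <s> and words in <w>, each with an xml:id (hypothetical scheme)."""
    root = ET.Element("p")
    # Naive sentence split: break after ".", "!" or "?" followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    w_count = 0
    for s_num, sentence in enumerate(sentences, start=1):
        s_el = ET.SubElement(root, "s", {XML_ID: f"{prefix}_s{s_num}"})
        # \w+ is Unicode-aware in Python 3, so Cyrillic words are matched as well.
        for word in re.findall(r"\w+", sentence):
            w_count += 1
            w_el = ET.SubElement(s_el, "w", {XML_ID: f"{prefix}_w{w_count}"})
            w_el.text = word
    return root

# Made-up sample text, not taken from the collection.
doc = tokenize("One word. Two words here!")
print(ET.tostring(doc, encoding="unicode"))
```

Word IDs run through the whole text rather than restarting in each sentence, so a word can be addressed without knowing its sentence.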
Folklore Project 9/14 XML Document
Folklore Project 10/14 XSLT Stylesheets • Produce legible text for proofreading • Produce tables to be exported to the database
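The second task, producing tables for the database, was done with XSLT in the project. As a rough stand-in, the Python sketch below flattens tokenized XML into one row per word and serializes the rows as CSV. The column names (`s_id`, `w_id`, `form`) and the sample document are assumptions, not the project's actual table layout.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def words_to_rows(xml_text):
    """Return one (sentence id, word id, form) row per <w> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for s_el in root.iter("s"):
        for w_el in s_el.iter("w"):
            rows.append((s_el.get(XML_ID), w_el.get(XML_ID), w_el.text or ""))
    return rows

def rows_to_csv(rows):
    """Serialize rows as CSV text with a header line (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["s_id", "w_id", "form"])
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up sample document.
sample = '<p><s xml:id="s1"><w xml:id="w1">One</w><w xml:id="w2">word</w></s></p>'
print(rows_to_csv(words_to_rows(sample)))
```

Keeping the word ID in every row is what would allow lemmas entered in the database to be matched back to the XML later.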
Folklore Project 11/14 Access Database
Folklore Project 12/14 Access Database
Folklore Project 13/14 Access Database
Folklore Project 14/14 Results • Printed edition: texts, linguistic analysis supplement, indexes • XML-TEI lemmatized text corpus • XSLT stylesheets • Access database: morphological table, forms for lemmatization and dictionary • Problem: no direct connection between the printed edition and the XML texts
Punctuation Project 1/12 Challenges • Create an adequate representation of linguistically relevant data from a medieval manuscript • Multiple visualizations according to various editing traditions • Annotate and analyze the use of punctuation marks
Punctuation Project 2/12 Project “History” • 1994-1999: first transcriptions using ASCII special characters • 2001: first annotation using Excel • 2003: XML-TEI (Charrette-style) transcriptions • 2005-2007: XML-TEI (Menota-style) transcriptions
Punctuation Project 3/12 “Special” data to be encoded • Variant character glyphs • Abbreviations • Large initials • “Abnormal” word spacing
Punctuation Project 4/12 Multiple visualizations
“Normalized” Presentation: [ § 7] Endementres qu'il parloient einsi si entra laienz uns vaslez qui dist au roi: « Sire noveles vos aport mout merveilleuses. – Queles ?
XML Transcription: <p n="7"> <lb n="6"/> <w xml:id="w016_0251"> <norm>Endementres</norm> <dipl>ENdementres</dipl> <facs><mdv_dropcap letter="E" color="blue" size="2" sizeAct="2">E</mdv_dropcap>Ndementre&slong;</facs> </w> <w aggl="elision" xml:id="w016_0252"> <norm>qu</norm> <dipl>qu</dipl> <facs>qu</facs> </w>
“Diplomatic” Presentation: [ § 7] ENdementres qu'il parloient einsi si entralaienz uns uaslez qui dist au roi. Sire noueles uos aport mout merueilleuses. Queles
“Imitative” Presentation: [ § 7] ENdementreſ quıl parloıent eínſı ſı entͣ laıenz unſ uaſlez quı dıſt au roı . Sıre noueleſ uoſ apot mout merueılleuſeſ . Queleſ
Extract from Ms. Lyon BM, P.A. 77, Queste del saint Graal. Photo: BM Lyon. Transcription: Graal Project.
Punctuation Project 5/12 Encoding choices • “Menota-style” TEI extension • Multiple representations at the word level (norm, dipl, facs, pal?) • Additional elements: punct, mdv_dropcap, mdv_lb… • Additional attributes: w/@aggl, punct/@force...
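Storing several representations inside each word means that a presentation is obtained by picking one layer per <w>. The Python sketch below shows the idea on a simplified sample: the drop-cap markup and the `&slong;` entity of the real transcription are replaced by plain characters, and the function name `render` is an assumption. A real rendering would also have to honour `w/@aggl` (for example, no space after an elided word), which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Simplified sample: same word-level layers as in the project, without drop caps or entities.
SAMPLE = """<p n="7">
  <w xml:id="w016_0251"><norm>Endementres</norm><dipl>ENdementres</dipl><facs>ENdementreſ</facs></w>
  <w aggl="elision" xml:id="w016_0252"><norm>qu</norm><dipl>qu</dipl><facs>qu</facs></w>
</p>"""

def render(xml_text, layer):
    """Join the text of the chosen layer (norm, dipl or facs) for every <w>."""
    root = ET.fromstring(xml_text)
    forms = []
    for w_el in root.iter("w"):
        layer_el = w_el.find(layer)
        # itertext() also collects text nested in child elements of the layer.
        forms.append("".join(layer_el.itertext()) if layer_el is not None else "")
    return " ".join(forms)

for layer in ("norm", "dipl", "facs"):
    print(layer, "->", render(SAMPLE, layer))
```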
Punctuation Project 6/12 Workflow • Compact syntax transcription: XML + “shortcut” characters (cf. Wiki) • Text description using an Access database: manuscript description, text typology • Expansion to a standard XML format using a Perl script • Export to tabular format for annotation • Re-integration of the annotation into the XML documents • Export and analysis using Weblex software
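The re-integration step of this workflow can be pictured as a join on element IDs: each row of the annotated table names an element and carries a value to write back. The Python sketch below is an assumption-laden illustration rather than the project's Perl code; the column names (`id`, `annotation`), the target attribute `ana`, the annotation values, and the sample data are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def reintegrate(xml_text, table_text, attribute="ana"):
    """Copy annotation values from a tab-separated table onto elements with a matching xml:id."""
    root = ET.fromstring(xml_text)
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    missing = []  # IDs present in the table but absent from the XML
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        target = by_id.get(row["id"])
        if target is None:
            missing.append(row["id"])
        else:
            target.set(attribute, row["annotation"])
    return root, missing

# Made-up sample data.
xml_text = '<p><w xml:id="w1">roi</w><punct xml:id="pc1">.</punct></p>'
table_text = "id\tannotation\npc1\tstrong\npc9\tweak\n"
root, missing = reintegrate(xml_text, table_text)
print(ET.tostring(root, encoding="unicode"))
print("unmatched ids:", missing)
```

Reporting unmatched IDs instead of silently skipping them makes it easier to notice when the table and the transcription have drifted apart.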
Punctuation Project 7/12 Compact syntax
Punctuation Project 8/12 Manuscript description
Punctuation Project 9/12 Expanded XML
Punctuation Project 10/12 Annotation
Punctuation Project 11/12 Weblex
Punctuation Project 12/12 Results • 25 fragments of manuscripts transcribed and described • Encoding guidelines • Integrated database of text descriptors (editions and transcriptions) • Perl scripts for conversions • XSLT stylesheets