This presentation discusses the challenges of and approaches to multilingual information access in digital libraries, focusing on the Digital Library of India, which holds about 300,000 books in multiple languages. With only 20% of India's literate population understanding English, the need for cross-lingual information retrieval is evident. It covers cross-lingual retrieval and machine translation (knowledge-based, corpus-based, and hybrid approaches) and the importance of resources such as the Universal Dictionary. It also reports current capabilities and what is still needed to improve multilingual access to information.
Multilingual Information Access in a Digital Library
Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy
International Institute of Information Technology, Hyderabad, India
Context • Digital Library of India • 155,000 English books • 145,000 Other language books • Population of literates • 20% of India understand English • 80% cannot IIIT Hyderabad - http://dli.iiit.ac.in
Multilingual Access to Information • Retrieve a book • By metadata • By keyword / content • Cross Lingual Information Retrieval • Read a book • Help understand sentences in a language • Help understand sentences across languages • Machine Translation
Approaches to Multilingual Access • Cross Lingual Retrieval • Translate Query to Document Language • Translate Document to Query Language • Machine Translation • Knowledge Based Approaches • Corpus Based Approaches • Hybrid Approaches
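The "Translate Query to Document Language" option can be illustrated with a minimal sketch. It assumes a toy bilingual dictionary and a tiny in-memory inverted index; the dictionary entries, document texts, and function names below are hypothetical and are not taken from the DLI system.

```python
from collections import defaultdict

# Hypothetical toy dictionary (English -> romanized Hindi), for illustration only.
EN_TO_HI = {
    "water": ["pani", "jal"],
    "book": ["kitab", "pustak"],
}

# Hypothetical document collection in the document language.
DOCS = {
    "d1": "jal hi jeevan hai",
    "d2": "pustak aur kitab",
    "d3": "ganga jal",
}


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged (e.g. proper nouns).
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def build_index(docs):
    """Build an inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(query, dictionary, index):
    """Rank documents by how many translated query terms they contain."""
    scores = defaultdict(int)
    for term in translate_query(query, dictionary):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    index = build_index(DOCS)
    print(search("water book", EN_TO_HI, index))
```

Translating the query is usually the cheaper of the two options, since only a few words are translated per search, whereas translating documents requires processing the whole collection in advance.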
Challenges in Multilingual Access • Corpus Based Approaches • Unavailability of Parallel Corpus for pairs of languages • Unavailability of Computational Linguistics Resources • Dictionary Based Approaches • Unavailability of multiple bilingual dictionaries
Resources • Universal Dictionary • Conceived and implemented by Michael Shamos at CMU, USA • ITRANS • A transcription scheme and associated tool built by IISc, IIIT and CMU • Corpus • Data Entry by TTD and DLI project • TIDES project
Universal Dictionary
How are we doing it? • Cross Lingual Search (Identify Information) • Dictionary lookup • User feedback based • Lucene Search Engine • Machine Translation (Understand Information) • Corpus based technique (EBMT) • Dictionary based word-word lookup • Good-enough translation vs Perfect translation
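The two machine translation bullets above can be combined in a heavily simplified sketch: first look for a close match in a base of example sentence pairs (the retrieval step of EBMT, without the fragment alignment and recombination a real EBMT system performs), then fall back to dictionary-based word-for-word lookup. The example pairs, dictionary entries, similarity measure, and threshold are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical example base of (source, target) sentence pairs.
EXAMPLES = [
    ("where is the library", "pustakalay kahan hai"),
    ("this is a book", "yah ek kitab hai"),
]

# Hypothetical word-level dictionary used as the fallback.
WORD_DICT = {"book": "kitab", "library": "pustakalay", "water": "pani"}


def jaccard(a, b):
    """Word-set overlap between two token lists, in the range [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def translate(sentence, examples, word_dict, threshold=0.8):
    """Return the target side of the closest example if it is similar enough;
    otherwise produce a word-for-word gloss of the input."""
    tokens = sentence.lower().split()
    best_score, best_target = 0.0, None
    for source, target in examples:
        score = jaccard(tokens, source.split())
        if score > best_score:
            best_score, best_target = score, target
    if best_target is not None and best_score >= threshold:
        return best_target
    # Fallback: translate known words, pass unknown words through unchanged.
    return " ".join(word_dict.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(translate("Where is the library", EXAMPLES, WORD_DICT))
    print(translate("where is the book", EXAMPLES, WORD_DICT))
```

The fallback output is a partial gloss rather than a fluent sentence, which is in the spirit of the "good-enough translation" bullet: a reader may still recover the gist of a sentence even when no close example exists.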
Cross Lingual Retrieval
Reading Assistant System
Reading Assistant
Status Today • CLIR for 6 languages • MT for 3 languages • Shakti (a knowledge-based MT system) • Parallel Corpus for Hindi-English • UDICT • About 40 Foreign Languages • 6 Indian Languages
What more is needed? • UDICT • Improving coverage of existing languages • Adding new languages • Machine Translation • Corpus acquisition • State-of-the-art techniques applied to Indian Languages • Multi-way parallel corpus development • Textual format for the books • Books are currently in image formats • OCR should be developed for textual content
Thank You Questions?