This paper explores the components of digital libraries and their role in language technology research and applications. It discusses the efforts made by CDAC, Noida in using digital libraries for next-generation operational systems.
Digital Library: Language Centered Research, Test Beds and Applications
V N Shukla*, Karunesh Kr. Arora, Vijay Gugnani
Natural Language Processing Lab, Centre for Development of Advanced Computing (Ministry of Communications & Information Technology)
‘Anusandhan Bhawan’, C 56/1 Sector 62, Noida – 201 307, India
vnshukla@cdacnoida.com
What the paper is all about
• This paper describes:
  • Digital library components
  • Initiatives the world over
  • Research and application areas
  • Efforts made by CDAC, Noida in using digital libraries for research and applications in language technology
• Since digital libraries can serve as intellectual infrastructure, research activities should stimulate the creation of next-generation operational systems.
What does Digital Library mean?
A computerized "library" that supplements, adds functionality to, and may even replace traditional libraries.
Digital library vs. traditional library
• Provides efficient, high-quality services by collecting, organizing, storing, disseminating, retrieving and preserving information.
• Supports preservation while making information retrieval and delivery more convenient.
• Provides online access to historical and cultural documents whose existence is endangered by physical decay.
• Digital libraries necessarily focus strongly on the management of digital content, just as traditional libraries have long focused on the management of content in physical form.
Digital Content Management
• Most of the digital content being managed is human language in various forms: character-coded electronic text, scanned images, printed or handwritten text, and human speech.
• Language technology helps manage digital content in better ways.
• Learning from past experience helps manage content even better.
• The major areas that can be exploited are:
  • Information retrieval
  • Multimedia
  • Databases
  • Data mining
  • Data warehouses
  • Online information repositories
  • Image processing and hypertext
  • World Wide Web and wide area information services (WAIS)
A few advantages of digital libraries
• Access from anywhere
• Reduced delays
• Distributed storage with central access
• Better cataloguing
• Cross-references to other documents
• Full-text search
• Protected information source
• Wide exploration and exploitation of information
The information explosion, high-bandwidth data networks and the potential of Internet-based technologies such as the Web make digital libraries one of the important application areas of computer science.
MAJOR DIGITAL LIBRARY INITIATIVES THE WORLD OVER
• VD17 project in Germany (http://www.forwiss.tu-muenchen.de/-vd17): All 17th-century prints published in the German-speaking area will be recorded and, as a first step, partially digitized. About 300,000 catalog entries and 1.2 million pixel images of scanned pages will be accessible worldwide via the Internet.
• Medoc in Germany (http://medoc.informatik.tu-muenchen.de/): Provides access to computer science literature via the Web.
• Digital Library Projects at the National Diet Library in Japan (http://www.ndl.go.jp/index-e.html): A digital library system for children. In this project, children's books published in Japan during the 19th century will be collected and made available worldwide.
• The Tsinghua University Central Library in China: The first public online full-text search service in China, based on IBM's Digital Library and ATM technologies.
• Bibliotheca Universalis, a G7 project: Started by the G7 countries to provide access to the world's cultural and scientific heritage. It also aims to enhance international cooperation by establishing a global digital library system.
• Library of Congress in USA (http://lcweb2.loc.gov/ammem): One of the Library of Congress projects is American Memory, a historical and cultural collection of America that includes photographs, documents, motion pictures, sound recordings, etc.
MAJOR DIGITAL LIBRARY INITIATIVES THE WORLD OVER (contd.)
• NSW Parliament's Newspaper Clippings and Press Releases Imaging Project in Australia (http://www.sirsi.com/): A large repository of newspaper articles for the use of members and their staff. About 1 million clippings stored on a mixture of paper and microfiche have to be converted to digital form.
• Internet Archive, USA (http://www.archive.org): A digital library of Internet sites and other cultural artifacts in digital form. Like a paper library, it provides free access to researchers, historians, scholars and the general public.
• The California Digital Library (CDL), California (http://www.cdlib.org/): Provides access to scholarly materials, databases of journal article abstracts and citations, electronic journals, publishing tools, and reference databases for the University of California community. In addition, the CDL pursues technological innovations that enhance services for accessing, sharing, manipulating, and integrating scholarly content in all forms.
MAJOR DIGITAL LIBRARY INITIATIVES THE WORLD OVER (contd.)
Digital Library of India (http://www.dli.gov.in)
• Aimed to digitize 1 million books, predominantly in Indian languages, by 2005 (less than 1% of all books ever published in all languages).
• Provides a test bed to support other researchers working on improved scanning techniques, improved optical character recognition, and improved indexing.
• Will provide a gateway to Indian digital libraries in science, arts, culture, music, movies, traditional medicine, palm-leaf manuscripts and many more.
• The result will be a unique resource accessible to anyone in the world 24x7, without regard to socioeconomic background or nationality.
DIGITAL LIBRARY RESEARCH ACTIVITIES AND APPLICATIONS
RESEARCH AREAS
• For ease of exposition, research may be broadly grouped into three areas:
  • User Centric
  • Content Centric
  • System Centric
User Centric Research
• This research is primarily focused on methods, algorithms, and software leading to information discovery, search, retrieval, manipulation and presentation capabilities, such as:
  • Designing software tools and applications
  • Developing browsing and navigation software for ease of use
  • Exploring theories and models related to semantic search and retrieval
  • Enabling multilingual information access
  • Writing software applications for searching, filtering, abstracting and summarizing large volumes of data, images, and other kinds of information
  • Intelligent user interfaces
  • Exploring user/system learning and adaptation
  • Processes associated with interactive use
  • Managing information presentation and visualization
  • User and usability studies, including human-machine interaction
  • Education, learning and capacity building
Content Centric Research
• Research is focused on content from various knowledge domains, to develop:
  • Techniques for efficient data capture, representation, preservation and archiving
  • Intelligent systems and algorithms for indexing, abstracting, interpreting, classifying and cataloging
  • Intelligent text processing; natural language analysis for data extraction and for structure and style
  • Structuring and linking of information objects and documents
Systems-Centric Research
• Systems-centric digital library research focuses on component technologies and their integration to realize information environments that are dynamic and flexible.
• The major areas are:
  • Open, networked architectures for new information environments capable of supporting complex information access and analysis and collaborative work
  • Systems scalability and extensibility
  • Interoperability
  • New approaches and protocols for high-bandwidth applications; metadata services; reliability and integrity of services; quality of service; payment models and related issues
  • Advanced multimedia information capture, representation and digitization
  • Systems evaluation and performance studies
TEST BEDS
• The focus is on the development of digital library test beds for technology testing, demonstration and validation.
• This approach focuses on:
  • Integration of functional components into useful systems
  • Applications that enhance the general functionality of existing and future digital libraries
  • Specialized digital library applications designed for specific knowledge domains
  • Improving processes that support education, learning, scholarly communication and collaboration
EXPERIMENTS AND OBSERVATIONS: CDAC NOIDA
RESEARCH AREAS
• The research experiments carried out and the observations made by CDAC Noida are described below.
User Centric Research
• Gyan Nidhi: A multilingual parallel corpus is being developed for English and 12 Indian languages, funded by DIT, Ministry of Communications and Information Technology, Govt. of India.
• Documents and books available as parallel text in more than one language (translated versions of the same book) were scanned and converted to text in UNICODE format.
• An application front end, “Prabandhika”, manages the corpus and provides information on:
  • Number of pages in each book
  • Author
  • Abstract of the book
  • Keywords
  • Number of languages in which the book is available in parallel
Other features
• Displays the aligned text for Indian languages.
• The metadata of the files is stored as XML, along with the text extracted from the scanned images of books.
Use of Gyan Nidhi
• Creation of multilingual dictionaries
• Spell checkers for Indian languages
• Creation of translation memories for Example Based Machine Translation systems
• This multilingual parallel-aligned corpus is the first such attempt for Indian languages.
• It is the start of several efforts that will follow to enhance research in the field of Computational Linguistics.
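The translation-memory use of an aligned corpus can be illustrated with a minimal sketch. It assumes the corpus is already available as aligned sentence pairs; the sample pair, the function name `lookup` and the similarity cutoff are all hypothetical and are not taken from Gyan Nidhi. Fuzzy matching here uses Python's standard `difflib`, which is only one of many possible similarity measures.

```python
import difflib

# Hypothetical aligned sentence pairs (source sentence -> stored translation).
TRANSLATION_MEMORY = {
    "the library is open today": "पुस्तकालय आज खुला है",
    "books are preserved digitally": "पुस्तकें डिजिटल रूप में संरक्षित हैं",
}

def lookup(sentence, memory, cutoff=0.6):
    """Return (matched_source, stored_translation) for the closest source, or None."""
    matches = difflib.get_close_matches(sentence, list(memory), n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    return best, memory[best]

result = lookup("the library is open", TRANSLATION_MEMORY)
if result is not None:
    print("closest stored sentence:", result[0])
    print("stored translation:", result[1])
```

An Example Based Machine Translation system would typically adapt the retrieved translation further; the sketch stops at retrieval.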
Content Centric Research
• The text analyzer tool “Vishleshika” can:
  • Search for patterns of word combinations (N-grams)
  • Check word frequencies, cluster statistics and character statistics
  • Provide examples of all uses of particular words
• Academic linguists, language teachers, translators and students can use this corpus analysis tool to deepen their understanding of the vocabulary and grammar of a language.
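The kinds of analysis listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Vishleshika implementation; the function names, the whitespace-based tokenizer and the sample sentence are assumptions. Tokens are split on whitespace rather than with a `\w+` pattern, because that pattern can split Devanagari words at vowel signs.

```python
from collections import Counter
import string

# ASCII punctuation plus the Devanagari danda and double danda.
PUNCT = string.punctuation + "\u0964\u0965"

def tokenize(text):
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def word_frequencies(tokens):
    """Count how often each word occurs."""
    return Counter(tokens)

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, word, window=2):
    """Return each occurrence of word with up to `window` tokens of context per side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == word:
            hits.append(" ".join(tokens[max(0, i - window): i + window + 1]))
    return hits

sample = "The library serves the reader, and the library preserves the text."
tokens = tokenize(sample)
print(word_frequencies(tokens).most_common(2))
print(Counter(ngrams(tokens, 2)).most_common(1))
print(concordance(tokens, "reader", window=1))
```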
Sample results from Vishleshika
[Figure: percentage distribution of different types of consonants with respect to the total number of consonants in the sample text]
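A statistic of this kind could be computed along the following lines. The slide does not say which consonant types Vishleshika distinguishes, so the grouping below (the traditional Devanagari articulation classes) is an assumption, as are the function name and the sample text.

```python
from collections import Counter

# Assumed grouping: traditional articulation classes of Devanagari consonants.
CONSONANT_CLASSES = {
    "velar": "कखगघङ",
    "palatal": "चछजझञ",
    "retroflex": "टठडढण",
    "dental": "तथदधन",
    "labial": "पफबभम",
    "semivowel": "यरलव",
    "sibilant": "शषस",
    "glottal": "ह",
}
CHAR_TO_CLASS = {
    ch: name for name, chars in CONSONANT_CLASSES.items() for ch in chars
}

def consonant_distribution(text):
    """Return each class's share (in percent) of all consonants found in text."""
    counts = Counter(CHAR_TO_CLASS[ch] for ch in text if ch in CHAR_TO_CLASS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}

for name, share in sorted(consonant_distribution("कमल नमन").items()):
    print(f"{name}: {share:.1f}%")
```

Characters outside the table (vowels, vowel signs, digits, spaces) are simply ignored, so the percentages always sum to 100 over the consonants that were counted.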
System Centric Research
• The data has been created using XML technologies and the UNICODE standard format.
• The DTDs for the XML have been defined so that information extraction is easy.
• Storing the data in XML and UNICODE makes it usable on any platform.
• The application was designed so that users can extend the corpus data to include other languages and domains.
Systems evaluation and performance studies
• Another application tool, “Test Bed for Machine Translation System”, makes it possible to evaluate MT systems and will help in benchmarking their performance.
• The data for this test bed will be a collection of linguistically rich and diversified sentences taken from the Gyan Nidhi parallel corpus.
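To illustrate why XML plus UNICODE makes extraction straightforward, here is a minimal sketch that reads one metadata record with Python's standard `xml.etree.ElementTree`. The element and attribute names are hypothetical: the actual Gyan Nidhi DTDs are not given in the slides, and the fields simply mirror the metadata that Prabandhika is said to report.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout; not the actual Gyan Nidhi DTD.
SAMPLE_RECORD = """
<book id="example-001">
  <title xml:lang="hi">उदाहरण पुस्तक</title>
  <author>Example Author</author>
  <pages>120</pages>
  <abstract>A short abstract of the book.</abstract>
  <keywords>
    <keyword>library</keyword>
    <keyword>corpus</keyword>
  </keywords>
  <languages>
    <language code="en"/>
    <language code="hi"/>
  </languages>
</book>
"""

def parse_record(xml_text):
    """Extract the metadata fields of one book record into a dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "author": root.findtext("author"),
        "pages": int(root.findtext("pages")),
        "keywords": [k.text for k in root.findall("keywords/keyword")],
        "languages": [lang.get("code") for lang in root.findall("languages/language")],
    }

record = parse_record(SAMPLE_RECORD)
print(record["title"], record["pages"], len(record["languages"]))
```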
TEST BEDS
• One of the major hurdles in transcribing material printed in Indian language scripts into electronic form is the unavailability of perfect OCR systems for most of the major Indian language scripts.
• The output of available systems is below threshold accuracy on old and poor-quality documents.
• To make any digital library searchable, OCR of the text is the first and prime requirement.
• CDAC Noida has been working on OCR for the Devanagari script (based on technology provided by ISI Kolkata), with test data taken from the Digital Library.
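One common way to put a number on OCR accuracy is to compare the OCR output with a hand-corrected reference using edit distance. The sketch below is an assumption about how such a check could be done, not the evaluation method used at CDAC Noida; the threshold value and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_accuracy(reference, ocr_output):
    """Return 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))

THRESHOLD = 0.95  # hypothetical acceptance threshold

accuracy = character_accuracy("digital library", "digital libraly")
print(f"accuracy = {accuracy:.3f}, acceptable = {accuracy >= THRESHOLD}")
```

For Devanagari, a conjunct or a consonant with a vowel sign spans several code points, so a single misread glyph may count as more than one error under this measure.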
The complexity of rare documents can be illustrated with the help of the following images.
• Based on the OCR test results, a training module was added to the OCR “Chitraksharika”. It can be trained to recognize character glyphs that have been slowly disappearing from the current writing style and that were counted using “Vishleshika”.
• Variations in printing technologies indicate that the spell checker and dictionary support of OCR software needs to be rich and diversified.
[Figure: screenshot of “Chitraksharika” (OCR for Devanagari)]
[Figure: training module for “Chitraksharika”]
[Figure: sample old characters from scanned text]
APPLICATION OF DIGITAL LIBRARY: KNOWLEDGE DISSEMINATION
• Knowledge dissemination is integral to the success and popularity of digital libraries.
• CDAC Noida initiated a project, “Dware Dware Gyan Sampada”, funded by DIT (http://mobilelibrary.cdacnoida.com/).
• The technology used in this project makes it possible to download public domain books from the Internet via satellite and print them anytime, anywhere, for anyone.
Coverage
• Places such as village schools and other remote areas are covered under this programme to promote literacy and demonstrate the use of technology for the masses.
• The project aims to provide universal access to human knowledge; given the advancement of digital storage and communications, this goal is now achievable.
FUTURE DEVELOPMENTS
• Alignment of the multilingual corpus at sentence and word level
• Part-of-speech tagging; development of tools for automatic sentence and chunk alignment and for translation memories
• Creation of a test bed for OCRs
• Automatic text summarisation in Indian languages
• Development of cross-language information retrieval for searching the web
• Semantic indexing for contextual search
CONCLUSION
• To exploit the benefits of digital libraries in Indian languages, there is an urgent need for tools and applications such as OCRs and machine translation systems.
• Quantitative and qualitative analysis of text will boost the development of lexical and terminology databases, lexicography, knowledge acquisition, and studies of language and writing variation.
• The creation of digital libraries has been a good test bed for OCRs.
• The world is moving towards multi-modal information access and speech-to-speech translation; all these tools together will help build the same for Indian languages.