S S Agrawal Advisor,C-DAC,Noida & Executive Director KIIT,Gurgaon, India

Recent Developments in Speech Corpora in Indian Languages
Country Report: India
O-COCOSDA 2010, Kathmandu, Nepal, Nov. 25, 2010
S S Agrawal, Advisor, C-DAC, Noida & Executive Director, KIIT, Gurgaon, India
ssagrawal@cdacnoida.in, ss_agrawal@hotmail.com

Speech Corpora Development: Continuing Activities
CDAC-Noida, A-STAR/U-STAR Project
250 Hindi sentences, 2 utterances, 2 channels, 70 speakers, 4 age groups, S/N 50 dB, 16 bit, 44.1 kHz. Used for development of Hindi ASR.
1250 phonetically rich Hindi sentences by one male speaker, phonetically labeled using the HTK toolkit (manually corrected), for developing Hindi TTS.
50 hrs of annotated speech corpora for six Indian languages, TDIL (DIT/MCIT):
  Hindi, Marathi, Punjabi by CDAC, Noida
  Bengali, Assamese & Manipuri by CDAC, Kolkata
  Tamil, Telugu, Malayalam and Kannada by CDAC, Thiru
Phonetically rich database from a large number of multilingual speakers in 3 languages: Hindi, Marathi and Indian-accented English. 50k phonetically rich Hindi sentences by TIFR.
Multi-channel, multi-lingual database of 100 speakers in contemporary/non-contemporary situations: CFSL, Chandigarh
Database for dialectal variations and emotional variations in Hindi: KIIT, AMU, IPU
Speech database for Bodo (Assamese) language, 1000 phonetically rich sentences: IIT-G

CIIL-LDCIL Size of Speech Corpora

KIIT: Mobile Text & Speech Database Collection in Hindi and Indian Spoken English (contracted by Nokia Research Center, China)
Collection of 2 million words (approx. 200,000 mobile messages in each of Hindi and Indian Spoken English).
Cleaning and expansion of raw data with reference to grammar rules and context.
Creation of 13 prompt sheets containing 630 phonetically rich sentences in each language, based on the text data collected.
Recording of the prompt sheet sentences by 100 speakers through three channels simultaneously.
Audio annotation of the recorded sentences.
Database for Emotional Speech: AMU, IPU, KIIT
Recognition of emotional speech using ANN and human perception
  Emotions: Happiness, Anger, Sadness, Fear, Neutral
  600 sentences: 6 students of a drama club, 5 sentences, 4 times, in 5 emotional conditions.
Database for transformation of emotions (Hindi speech)
  Emotions: Happiness, Sadness, Anger, Surprise, Neutral
  1500 sentences: 15 speakers, 20 sentences, 5 emotions
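The totals quoted for the two emotional speech databases follow from multiplying out the recording design. A minimal Python sketch of that arithmetic is given here as an illustration only: the function name `corpus_size` is hypothetical, and the single repetition per sentence in the transformation database is an assumption, since the slide does not state a repetition count.

```python
def corpus_size(speakers, sentences, repetitions, emotions):
    """Total recorded sentences for a fully crossed recording design."""
    return speakers * sentences * repetitions * len(emotions)

# Emotion sets as listed on the slide.
recognition_emotions = ["Happiness", "Anger", "Sadness", "Fear", "Neutral"]
transformation_emotions = ["Happiness", "Sadness", "Anger", "Surprise", "Neutral"]

# Recognition database: 6 speakers, 5 sentences, each repeated 4 times.
recognition_total = corpus_size(6, 5, 4, recognition_emotions)

# Transformation database: 15 speakers, 20 sentences.
# Assumption: each sentence is recorded once per emotion (repetitions = 1).
transformation_total = corpus_size(15, 20, 1, transformation_emotions)

print(recognition_total, transformation_total)  # 600 1500
```

Both results match the sentence counts given on the slide (600 and 1500).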

Speech Recognition and Databases
Continuing ASR systems:
  LVSR and language models for Tamil and Telugu speech recognition: Anna University
  Telephone speech recognition system for Hindi: IBM (I) Research Lab
  Speech-to-text system for Hindi, Shrutlekhan (prototype): CDAC Pune
  Indian language speech recognition systems: H.P. Labs India
  Manner-based, lexically driven Bengali speech recognition system: CDAC Kolkata
Consortium project on speech recognition:
  Speech-based access for agricultural commodity information
  Wired or wireless enquiry by telephone or mobile phone
  Six languages in the first phase: Hindi (IIT Kanpur), Assamese (IIT Guwahati), Bengali (CDAC Kolkata), Marathi (TIFR + IIT Mumbai), Telugu (IIIT Hyderabad), Tamil (IIT Madras)
  A database of sentences from 3000 farmers in each language.

Text to Speech Synthesis: Consortium Project
Collection of Text & Speech Corpora (Festival based)

Language    Lexicon    No. of syllables    Speech (hrs)    Institute
Hindi       0.1 M      4300                6               IIT-M
Tamil       0.45 M     4200                6               IIT-M
Telugu      0.1 M      3000                45              IIIT-H
Marathi     0.11 M     9200                10              CDAC-Mum
Malayalam   0.05 M     6561                6               CDAC-TRV
Bengali     0.023 M    7400                10              IIT-Kh

Other projects:
Vaachak: Hindi, going on for Indian English, followed by Indian languages; SAPI compliant: Prologix Software
Hindi Vani: TTS for Hindi based on Klatt's formant synthesizer: CEERI
Bangla Vani: concatenative Bangla and Nepali (ESNOLA based) TTS, developing using E/rock: CDAC Kolkata
Subhasini: TTS for Malayalam, based on diphone concatenation; supports ISCII, ISFOC & Unicode: CDAC, Thiruvananthapuram
