Cooperation for Arabic Language Resources and Tools – The MEDAR Project
Bente Maegaard, Mohamed Attia, Khalid Choukri, Olivier Hamon, Steven Krauwer, Mustafa Yaseen
Presented by: Bente Maegaard, University of Copenhagen, Co-ordinator of MEDAR
MEDAR: Background and mission
Mission
• Support the development of language technology, language resources and tools for the Arabic language
• Important for the people, the economy and the culture in the Arab countries
But current efforts are too small and too fragmented
• MEDAR is funded by the European Commission and focuses on the Mediterranean area, but our scope for collaboration is much broader – all Arab countries, all continents – and we also want to include other Semitic languages in the future.
MEDAR partners
• University of Copenhagen, Denmark (coord.)
• ELDA, France
• University of Balamand, Lebanon
• Al-Ahlyya Amman University, Jordan
• Universiteit Utrecht, The Netherlands
• ILSP - Athena, Greece
• RDI, Egypt
• Birzeit University, West Bank and Gaza Strip
• ENSIAS, University of Mohammed V Soussi, Morocco
• CEA, France
• CNRS, France
• The Open University, United Kingdom
• Université Lumière Lyon 2, France
• IBM, Egypt
• Sakhr, Egypt
MEDAR Objectives and ‘streams’
1) Technical stream
• Survey of players, projects, products
• BLARK for Arabic
• Focus on multilingual tools, develop MT
2) Roadmap stream
• Cooperation roadmap
• Network creation
3) Dissemination stream
Multilingual sub-project
• Focus: Machine Translation
• English-Arabic
• Into Arabic
• Important to use Open Source
• Education and training
MT system, corpora
• MOSES was chosen as the MT system
• Wide community
• English-Arabic experiments already carried out
• Previous experience of consortium partners
• Basic MOSES system developed by Balamand
• Enhanced system provided by IBM Cairo and Dublin City University
• Partners collected a parallel corpus and monolingual corpora
Evaluation - 1
Automatic evaluation
• 10,000-word evaluation corpus
• Hidden in a 200,000-word masking corpus
• Four human translations have been produced and validated
Human evaluation
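The slides do not say which automatic metric was used. As a purely illustrative sketch of why several validated human translations help automatic evaluation, the Python below computes clipped n-gram precision (the building block of BLEU-style metrics) for one system output against four references. The function names and the toy English sentences are assumptions made for this example and are not taken from the MEDAR evaluation package.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n=1):
    """Clipped n-gram precision of one hypothesis against several references.

    Each hypothesis n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Toy data, invented for illustration (not from the MEDAR corpus).
hypothesis = "the cat sat on a mat".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
    "a cat sat on the mat".split(),
    "the cat was sitting on the mat".split(),
]
unigram_p = clipped_precision(hypothesis, references, n=1)
bigram_p = clipped_precision(hypothesis, references, n=2)
print(unigram_p, bigram_p)
```

With more references, an acceptable wording choice is more likely to be matched by at least one of them, which is one reason to produce four translations instead of a single one.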
Evaluation - 2
• Second evaluation campaign will take place in June
• External participants have been invited and have expressed interest
Resources for the community
• MT systems: the baselines developed in the project will be made publicly available under their original licenses (MOSES, Giza++ ..)
• Training data: through ELRA, on fair conditions
• Evaluation package: through ELRA, on fair conditions
Cooperation roadmap
Roadmap concept
• Set goals
• Define the steps to get there
• Define timeline
The MEDAR roadmap covers 3 periods
• 2010-2012
• 2012-2014
• 2013-2015
Elements of the roadmap
• Players and human resources, education
• Technology and R&D
• E-infrastructure: internet penetration, mobile penetration
• Market
A few examples are presented here; for the rest, please refer to the booklet.
Players and human resources, Education
Players need a skilled workforce, but there are not enough HLT experts
• We need HLT-enabled professionals
• Typically one could add:
• Linguistics, phonetics, language or speech processing – to engineers’ education
• Computing, machine learning, language or speech processing – to linguists’ education
• Do this in collaboration with other universities in the region, and with e.g. universities in Europe or the US
Players and human resources, Education - 2
• Staff exchange
• Student grants
• Participation of (more) Arabic partners in EU-funded projects
MEDAR has chosen this as an area to investigate further
Partners will elaborate a cooperation scheme
Technology
• BLARK - Basic building blocks: LR and tools
• Reusable
• Can be shared with other players
• Follow standards
• We need more resources and tools for Semitic languages, and they need to be shared. Free or cheap.
• Essential for education, research and first development
Technology - 2
Driving applications
• Fight illiteracy through HLT – speech-enabled software, etc.
• Collaborate to make this happen
• Governments could introduce eGovernment, etc.
• Many basic technologies are needed
• Discussion ongoing with other parties
• Agree what they are
• Agree on distribution of tasks, if possible
Market
Important factors
• Piracy (38% worldwide, 60% in the Middle East and Africa)
• Fight piracy – this is ongoing
• Provide IT services, not products which can be copied
Conclusions
• Long-term goal of MEDAR
• Create better conditions for the development of language and speech technology for Arabic – in order to support the people, the culture, the economy
• Through collaboration and networking
• Therefore we welcome all comments and invite broad cooperation
• Not only for Arabic, but also for other Semitic languages
• And also with partners outside the EU/Mediterranean Arabic countries
MEDAR
Acknowledgement: All MEDAR partners
Mediterranean Arabic Language and Speech Technology
See the full Roadmap report and other information at www.medar.info