This paper provides an overview of the achievements, challenges, and major advancements in the development of language resources. It discusses the progress made in research and technology development, standardization, and resource coverage. It also highlights the major challenges in evaluation, benchmarking, and integration of different types of resources. The centralization of resources and the interlinking of morpho-syntactic lexicons on top of a semantic backbone are identified as major challenges for the future.
A centralized approach to language resources
Piek Vossen
S&T Forum on Multilingualism, Luxembourg, June 6th 2005
Overview
• What has been achieved?
• What has not been achieved?
• What are the major challenges?
What has been achieved?
• Research and technology development:
  • Lexical representations
  • Large-scale and medium-scale lexical acquisition:
    • Machine Readable Dictionaries
    • Corpora
  • Acquilex, Multilex, Parole, Simple, EuroWordNet, BalkaNet, MEANING, etc.
• Standardization:
  • early initiatives EAGLES, ISLE
  • best practices and descriptions
• Medium-scale shallow resources for a number of languages, e.g. Parole lexicons and wordnets for about 15 languages.
• Small-scale deep resources for a few languages, i.e. Acquilex, Simple
What has not been achieved (1)?
• Evaluation and benchmarking:
  • No well-defined and commonly accepted criteria
  • No benchmark data to validate language resources
• Insufficient coverage:
  • 100K entries and 200K concepts per language are needed for realistic applications; only half of that has been achieved
  • Many European languages still do not have the basic resources
• Insufficiently rich in data coverage:
  • Language coverage: mainly English
  • Size: e.g. Simple, FrameNet: 10,000 concepts
What has not been achieved (2)?
• Most resources are developed in a distributed way, i.e. a common project but national groups with different approaches:
  • Insufficient conceptual overlap and matching across languages:
    • very low intersection of concepts (all wordnets: about 10,000 concepts)
    • diverging interpretations and definitions of relations and concepts
  • Insufficient overlap and consensus in the representation of lexical knowledge
• Not enough progress in integrating and merging different types of resources:
  • Ontological resources (Semantic Web)
  • Lexical semantic resources (Wordnets)
  • Morpho-syntactic & semantic (Simple, Acquilex)
  • Morpho-syntactic (Parole)
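As an illustration of what "intersection of concepts" means in practice, the following is a minimal sketch that measures how many concepts a set of wordnets share. The concept identifiers, the language codes, and the function names are hypothetical toy values, not data from any of the projects named above.

```python
def shared_concepts(wordnets):
    """Return the concept IDs that occur in every wordnet."""
    concept_sets = [set(ids) for ids in wordnets.values()]
    if not concept_sets:
        return set()
    return set.intersection(*concept_sets)


def overlap_ratio(wordnets):
    """Shared concepts divided by all concepts seen in any wordnet."""
    concept_sets = [set(ids) for ids in wordnets.values()]
    if not concept_sets:
        return 0.0
    union = set().union(*concept_sets)
    if not union:
        return 0.0
    return len(set.intersection(*concept_sets)) / len(union)


# Toy data (assumption): each language maps to the concept IDs it covers.
toy_wordnets = {
    "en": {"c1", "c2", "c3", "c4"},
    "nl": {"c2", "c3", "c5"},
    "es": {"c2", "c3", "c4", "c6"},
}

print(sorted(shared_concepts(toy_wordnets)))    # ['c2', 'c3']
print(round(overlap_ratio(toy_wordnets), 2))    # 0.33
```

A low ratio on real data would correspond to the situation the slide describes: each wordnet is sizeable on its own, but few concepts can be linked across all languages.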
What has not been achieved (3)?
• Integration in real applications:
  • Evidence of added value, i.e. scientific proof that language technology and resources help -> more deep-thought applications
• More acceptance by the general public (showcases):
  • The positive effects of language technology should be visible to the general public
  • Be aware of the language myth! The negative effects and limitations should be clear too...
• More awareness among the general public of the limitations:
  • create awareness of how poorly current systems perform (precision and recall)
  • explain the undemocratic limitations of the current Internet
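For readers unfamiliar with the two measures mentioned on this slide, here is a minimal sketch of how precision and recall are computed for a retrieval result. The document IDs are hypothetical and the function name is an assumption chosen for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = correct results / all results returned
    recall    = correct results / all results that should have been returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example (assumption): the system returns 4 documents, 2 of them correct,
# while 5 documents are actually relevant.
returned_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5", "d6", "d7"]

print(precision_recall(returned_docs, relevant_docs))  # (0.5, 0.4)
```

In this toy case half of what the user sees is noise, and more than half of the relevant material is never shown.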
What is the major challenge (1)?
• Critical issues:
  • Languages that are not well-supported:
    • lower economic value
    • fewer speakers
  • Divergence of resources and lack of semantic and conceptual intersection
  • Integration of semantic-conceptual knowledge (more language neutral and sharable) with morpho-syntactic knowledge (language-specific)
What is the major challenge (2)?
• Centralized development of a semantic conceptual backbone:
  • Maximizes sharing and re-use of lexical knowledge and tools across languages;
  • Maximizes intersection of concepts and thus interlinking of languages;
  • Stimulates the standardization of lexical knowledge representation;
  • Enables the early development of impressive Europe-wide applications in the short term:
    • Good showcases (information retrieval or dialogues in all European languages)
    • Application-based evaluation and benchmarking
What is the major challenge (3)?
• Interlinking and developing morpho-syntactic lexicons on top of the semantic backbone:
  • Captures the valuable non-sharable, idiosyncratic properties of languages (also has cultural value)
  • Enables long-term high-quality applications such as Machine Translation
  • Should be corpus-based, but large-scale comparable corpora must also be developed
  • Can be achieved gradually (phase-by-phase) with intermediate results
[Figure: layered architecture, from language-neutral and sharable to language-specific and non-sharable: Semantic Web, Semantic Backbone, Wordnets, Morpho-syntactic Lexicons, Corpora. Example words shown: violin, violist, play, bank.]
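The layering proposed in the closing slides, a shared language-neutral backbone with language-specific lexicons linked to it, can be sketched as a small data structure. This is only an illustrative sketch under stated assumptions: the class names, field names, and concept IDs are hypothetical, and the English and Dutch entries are toy data rather than content from an actual wordnet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """Language-neutral, sharable unit in the backbone."""
    concept_id: str
    gloss: str


@dataclass(frozen=True)
class LexicalEntry:
    """Language-specific entry: morpho-syntactic data plus links to concepts."""
    lemma: str
    language: str
    pos: str
    plural: str
    concept_ids: tuple


class Backbone:
    def __init__(self):
        self.concepts = {}   # concept_id -> Concept
        self.entries = []    # all language-specific entries

    def add_concept(self, concept_id, gloss):
        self.concepts[concept_id] = Concept(concept_id, gloss)

    def add_entry(self, entry):
        # Every entry must link to concepts that already exist centrally.
        unknown = [c for c in entry.concept_ids if c not in self.concepts]
        if unknown:
            raise KeyError(f"unknown concept ids: {unknown}")
        self.entries.append(entry)

    def translations(self, lemma, source, target):
        """Lemmas in `target` that share at least one concept with `lemma`."""
        source_concepts = set()
        for entry in self.entries:
            if entry.lemma == lemma and entry.language == source:
                source_concepts.update(entry.concept_ids)
        return sorted({
            e.lemma for e in self.entries
            if e.language == target and source_concepts.intersection(e.concept_ids)
        })


# Toy data (assumption): three concepts and a few English/Dutch entries.
backbone = Backbone()
backbone.add_concept("C_VIOLIN", "bowed string instrument")
backbone.add_concept("C_BANK_FIN", "financial institution")
backbone.add_concept("C_BANK_RIVER", "land alongside a river")

backbone.add_entry(LexicalEntry("violin", "en", "noun", "violins", ("C_VIOLIN",)))
backbone.add_entry(LexicalEntry("viool", "nl", "noun", "violen", ("C_VIOLIN",)))
backbone.add_entry(LexicalEntry("bank", "en", "noun", "banks", ("C_BANK_FIN", "C_BANK_RIVER")))
backbone.add_entry(LexicalEntry("bank", "nl", "noun", "banken", ("C_BANK_FIN",)))
backbone.add_entry(LexicalEntry("oever", "nl", "noun", "oevers", ("C_BANK_RIVER",)))

print(backbone.translations("violin", "en", "nl"))  # ['viool']
print(backbone.translations("bank", "en", "nl"))    # ['bank', 'oever']
```

The design choice illustrated here is that cross-language links are never stored pairwise: each language only links to the central concepts, so adding a new language makes it reachable from all existing ones.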