Chu-Ren Huang Academia Sinica cwn.ling.sinica.tw/huang/huang.htm

From Synergy to Knowledge: Integrating multiple language resources Part I: Language Resources and Tools. Chu-Ren Huang Academia Sinica http://cwn.ling.sinica.edu.tw/huang/huang.htm. Outline: Language Resources and Tools.

Presentation Transcript


  1. From Synergy to Knowledge: Integrating multiple language resources. Part I: Language Resources and Tools. Chu-Ren Huang, Academia Sinica http://cwn.ling.sinica.edu.tw/huang/huang.htm

  2. Outline: Language Resources and Tools • Introduction: 10 Years in Chinese Language Processing: A Mirror for Other Asian Languages • The Starting Point: Resources and Resources Sharing • OLAC: The Open Language Archives Community • Asian Language Resources Committee of AFNLP • Standards: ISO TC37 Language Resources Management • Language Archives Project of Taiwan • Tools: Getting Started in NLP with NLTK C.R. Huang, 4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007

  3. Why Resources and Tools Language Resources • Foundation and empirical basis of scientific studies of natural languages • The only reliable source for language specific features • Infrastructure for knowledge representation and knowledge engineering • Essential to preserve linguistic and cultural diversity Tools • Needed to ‘process’ • General enough for multilingual processing and cross-lingual comparison • Robust enough to deal with language specific issues

  4. Chinese Language Processing as a Mirror For the development of Asian Language Processing • Unlike Japanese, which has enjoyed being one of the leaders in technological innovation • The development of Chinese language processing coincides with the developing economies of Taiwan and China • Especially the availability of Chinese language PCs • Similar to the situation of many Asian languages now

  5. CLP in the past 10 years A review of what happened in the past ten years in Chinese Language Processing (1992-2002) from a somewhat personal perspective 1992 – Corpora Completion of the first Chinese corpus for linguistic research (Huang and Chen, COLING ’92, pp. 1214-1217) - untagged, non-segmented - but searchable

  6. CLP 1992 –1993 1992 –Segmentation Standard Announcement of the first national standard for word segmentation by PRC government. 《GB 13715-信息處理用現代漢語分詞規範》. 1993 –Lexicon Completion and Release of the first version of CKIP lexicon (with the category set and ICG thematic roles) First version of K. Chen’s parser for Chinese

  7. CLP Corpus 1994 – 1995 1994 10th anniversary of the automation of Chinese historical textual databases. Completion of the pre-Qin Classical Chinese corpus at Academia Sinica. 1995 Completion of Sinica Corpus (v. 1.0, 1 million words), the first balanced and tagged Chinese corpus.

  8. CLP 1996 –Research Institutes 10th Anniversary of the Institute of Computational Linguistics at Peking University 10th Anniversary of the Chinese Knowledge Information Processing Group at Academia Sinica –Anthology of Papers Readings in Chinese Natural Language Processing (Journal of Chinese Linguistics Monograph) Editors: Huang, Chen, and T’sou

  9. CLP 1996 November-1997 Sinica Corpus on Web One of the first fully searchable language corpora on the WWW http://www.sinica.edu.tw/ftms-bin/kiwi.sh (old webpage in web archives) http://www.sinica.edu.tw/SinicaCorpus/ (current page) 1997 Publication of the first Chinese dictionary compiled directly from a corpus (Huang et al.’s Mandarin Daily Classifier Dictionary and Noun-Classifier Collocation Dictionary) The Tenth Annual ROCLING conference

  10. CLP 1998 –KnowledgeNet Release of HowNet, the first full-fledged Chinese and English-Chinese LKB http://www.keenage.com/ -Segmentation Standard Official announcement of CNS14366 for Taiwan

  11. CLP 2000 –Treebanks Simultaneous completion and announcement of two Chinese Treebanks: *Penn Chinese Treebank *Sinica Treebank ACL Workshop on Chinese Language Processing

  12. CLP 2001-2002 2001 –Society Formal approval of the formation of ACL SigHAN, the first international organization on Chinese Language Processing 2002 First SigHAN workshop on Chinese Language Processing Formal launch of Hsieh’s Intelligent Character Encoding System (a sustainable solution to the missing character problem) COLING2002 in Taipei

  13. CLP 2003 - 2003 • The First International Chinese Word Segmentation Bakeoff http://www.sighan.org/bakeoff2003/ 2002-2005 • Chinese Proposition Bank http://www.cis.upenn.edu/~chinese/cpb/ 2003, 2005, 2007 • Chinese Gigaword Corpus v.1, v.2, and tagged version

  14. What Did CLP Development Show? • Resources lead • When tools and standards complete a comprehensive infrastructure • Research will bloom

  15. Resources Development • Towards a Sharable and Sustainable Model of Resources Development OLAC Open Language Archives Community http://www.language-archives.org

  16. OLAC Aims OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: • developing consensus on best current practice for the digital archiving of language resources; • developing a network of interoperating repositories and services for housing and accessing such resources.

  17. OLAC Organization Coordinators: Steven Bird & Gary Simons Council: Anthony Aristar (Linguist List), Christopher Cieri (LDC), Gary Holton (Alaska Native Language Center), Chu-Ren Huang (Academia Sinica), Heidi Johnson (Archive of the Indigenous Languages of Latin America), Laurent Romary (Atilf, University of Nancy), Joan Spanne (SIL), Martin Wynne (Oxford Text Archive) Participating Archives & Services: 39 archives including LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht, Academia Sinica, TalkBank, Rosetta, MPI Individual Members: ~120

  18. Types of Language Resource DATA: any information which documents or describes a language, such as a: • monograph, data file, shoebox of index cards, unanalyzed recordings, heavily annotated texts, complete descriptive grammar TOOLS: computational resources that facilitate creating, viewing, querying, or otherwise using language data • includes fonts, stylesheets, DTDs, Schemas ADVICE: any information about: • reliable data sources, appropriate tools and practices

  19. The Gap

  20. Coordinated Approach OLAC OAI "A shared architectural vision, having many components, and implemented in stages by the community, will bridge the gap" Analogies: federated databases; semantic web

  21. [Diagram: OLAC architecture. Labels: Initiatives (OLAC, OAI, DC); Standards; Software; Recommendations; MS, MHP, PROC; CREATE, CONVERT, EXPORT, FORMAT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES, OLAC SERVICES, OLAC USER SERVICES]

  22. The Foundation: 3 initiatives Dublin Core Metadata Initiative (DC) • founded in 1995 (Dublin, Ohio) • conventions for resource discovery on the web Open Archives Initiative (OAI) • founded in 1999 (Santa Fe) • interoperability of e-print services Open Language Archives Community (OLAC) • founded in 2000 (Philadelphia) • a partnership of institutions and individuals • creating a worldwide virtual library of language resources

  23. Foundation 1: DC Elements 15 metadata elements: • broad interdisciplinary consensus • each element is optional and repeatable • applies to digital and traditional formats • Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights. dublincore.org
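
A minimal sketch of what "optional and repeatable" means for a record built from the 15 elements above. The element names come from the slide; the function name `make_record` and the sample values are hypothetical and only illustrate the idea, not any particular archive's implementation.

```python
# The 15 Dublin Core elements, as listed on the slide.
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def make_record(pairs):
    """Build a record from (element, value) pairs.

    Every element is optional and repeatable, so values are stored in
    lists and nothing is required. Unknown element names are rejected.
    """
    record = {}
    for element, value in pairs:
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        record.setdefault(element, []).append(value)
    return record


# Hypothetical record: values are invented for illustration.
example = make_record([
    ("Title", "Sample word list"),
    ("Creator", "A. Author"),
    ("Creator", "B. Author"),   # repeated element
    ("Language", "en"),
])                              # no Publisher, Date, ...: all optional
```

Storing each element as a list keeps repeated values (two creators here) without needing a separate schema for single-valued and multi-valued fields.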

  24. Foundation 1: DC Qualifiers Encoding Schemes: • a controlled vocabulary or notation used to express the value of an element • helps a client system to interpret the element content • e.g. Language = "en" (not "English", "Anglais", ...) Refinements: • makes the meaning of an element more specific • e.g. Subject.language, Type.linguistic
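
The two kinds of qualifier can be sketched as follows. This is an illustration under stated assumptions: the three-code vocabulary is a tiny hypothetical subset (a real client would load the full controlled vocabulary), and the helper names `split_qualified` and `is_valid_language` are invented here.

```python
# Hypothetical, deliberately tiny subset of two-letter language codes.
LANGUAGE_CODES = {"en", "fr", "zh"}


def split_qualified(name):
    """Split a qualified name into (element, refinement).

    'Subject.language' -> ('Subject', 'language'); an unqualified name
    such as 'Title' -> ('Title', None).
    """
    element, _, refinement = name.partition(".")
    return element, (refinement or None)


def is_valid_language(value):
    """Encoding scheme check: the value must come from the vocabulary."""
    return value in LANGUAGE_CODES


print(split_qualified("Subject.language"))
print(is_valid_language("en"), is_valid_language("English"))
```

The encoding scheme constrains the value ("en" passes, "English" does not), while the refinement narrows the meaning of the element name itself.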

  25. Foundation 2: OAI Repository

  26. Foundation 2: OAI Standards To implement the OAI infrastructure, an archive must comply with two standards: 1. The OAI Shared Metadata Set • Dublin Core • interoperability across all repositories 2. The OAI Metadata Harvesting Protocol • HTTP requests - 6 verbs: • Identify, ListIdentifiers, ListMetadataFormats, ListSets, ListRecords, GetRecord • XML responses
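
As a rough illustration of "HTTP requests with 6 verbs", the sketch below builds harvesting request URLs without contacting any server. The base URL is hypothetical, and the argument names (`metadataPrefix`, `identifier`) and the `oai_dc` prefix follow the later OAI-PMH 2.0 specification, which is an assumption relative to the protocol version the slide describes.

```python
from urllib.parse import urlencode

# The six verbs named on the slide.
OAI_VERBS = {
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
}


def build_request(base_url, verb, **arguments):
    """Return the URL for one harvesting request (no network access)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository address, for illustration only.
BASE_URL = "http://archive.example.org/oai"

identify_url = build_request(BASE_URL, "Identify")
records_url = build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(identify_url)
print(records_url)
```

A harvester would issue these requests over HTTP and parse the XML responses; that part is omitted here because it depends on the repository.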

  27. Foundation 2: OAI Service Providers and Data Providers

  28. Foundation 3: OLAC & OAI Recall: OAI data providers must support: • Dublin Core Metadata • OAI Metadata harvesting protocol BUT: OAI data providers can support: • a more specialized metadata format • a more specialized harvesting protocol What OLAC does: • specialized metadata for language resources • specialized harvesting (extra validation)

  29. OLAC Standards Aside: • standards = the protocols and interfaces that allow the community to function • recommendations = "standards" for representing linguistic content OLAC has three primary standards: • OLACMS: the OLAC Metadata Set (Qualified DC) • OLAC MHP: refinements to the OAI protocol • OLAC Process: a procedure for identifying Best Common Practice Recommendations

  30. The OLAC Metadata Set The three categories of metadata: • Work language: describes information entities and their intellectual attributes • e.g. names of works and their creators • Document language: describes and provides access to the physical manifestation of information • e.g. format, publisher, date, rights • Subject language: describes what a document is about • e.g. subject, description

  31. OLACMS and Controlled Vocabularies Language: A language of the intellectual content of the resource (OLAC-Language) Subject.language: A language which the content of the resource describes or discusses (OLAC-Language) OLAC-Language: A vocabulary for identifying the language(s) that the data is in, or that a piece of linguistic description is about, or that a particular tool can process

  32. [Diagram: OLAC architecture. Labels: Initiatives (OLAC, OAI, DC); Standards; Software; Recommendations; MS, MHP, PROC; CREATE, CONVERT, EXPORT, FORMAT, DELIVER; CONTENT, METADATA] Summary: With the software in place, we have a complete platform

  33. [Diagram: OLAC architecture. Labels: Initiatives (OLAC, OAI, DC); Standards; Software; Recommendations; MS, MHP, PROC; CREATE, CONVERT, EXPORT, FORMAT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES] Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources

  34. [Diagram: OLAC architecture. Labels: Initiatives (OLAC, OAI, DC); Standards; Software; Recommendations; MS, MHP, PROC; CREATE, CONVERT, EXPORT, FORMAT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES, OLAC SERVICES, OLAC USER SERVICES] Acknowledgements: ISLE and TalkBank projects (NSF), participants of the Philadelphia workshop, Eva Banik (programmer), Hernando de Soto (the analogy)

  35. OLACMS helps archive versatility Given a shared metadata standard: • New language archives can be created on the fly by harvesting existing archives • Rich information can be inferred by establishing temporal and geographic anchors for each document.

  36. OLAC Infrastructure Helps to Solve Language Archive Problems such as • Language Identification and • Metadata Set for Multi-lingual Language Archives

  37. The Language Identification Problem The DC code (e.g. ‘en’ for English) is not enough to describe all the languages in the world Ethnologue (http://www.ethnologue.org) is comprehensive but not complete Potential problems of using Ethnologue (or any existing language list) • over-splitting • over-chunking • omission

  38. A Fundamental Solution to Language Identification Problems Registering language groups with an OLAC registration service. An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes) AS:Amis = {ALV, AIS} ALV = Amis, AIS = Nataoran
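
The extensional definition above can be sketched as a lookup table from a user-defined group label to a set of codes. Only the `AS:Amis` entry and the two code names come from the slide; the function names and the idea of matching a query against a resource code are illustrative assumptions, not a description of an actual OLAC service.

```python
# Group label -> set of language codes (entry taken from the slide).
LANGUAGE_GROUPS = {"AS:Amis": {"ALV", "AIS"}}

# Code -> language name (from the slide).
CODE_NAMES = {"ALV": "Amis", "AIS": "Nataoran"}


def expand(label):
    """Return the set of codes a label stands for.

    A registered group expands to its members; any other label is
    treated as a single code that denotes only itself.
    """
    return set(LANGUAGE_GROUPS.get(label, {label}))


def matches(query_label, resource_code):
    """True if a resource tagged with resource_code falls under the query."""
    return resource_code in expand(query_label)


print(sorted(expand("AS:Amis")))
print(matches("AS:Amis", "AIS"), matches("ALV", "AIS"))
```

Searching by the group label finds resources tagged with either member code, while searching by one code stays narrow, which is one way a registry could soften over-splitting in the underlying code list.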

  39. Describing Multi-Lingual Resources in OLACMS • Directionality is crucial in multilingual resources • However, OLAC metadata is flat and unordered Bi-directional MT <Language code="X"/> <Language code="Y"/> <Subject.language code="X"/> <Subject.language code="Y"/>

  40. Multi-lingual Resources II Text: language Bitext (bilingual aligned corpus) • There is always a directionality • Original: language • Translation: Subject.language Language Description (Field Notes) • Elicitation, transcription, translation, notes Multiple related resources
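
A small sketch of the convention described on slides 39 and 40: because the metadata is flat and unordered, direction has to be carried by which element a code appears in. The function names are hypothetical and "X" and "Y" are placeholder language codes, as on the slides.

```python
def bitext_metadata(original, translation):
    """Bitext: the original goes in Language, the translation in Subject.language."""
    return {("Language", original), ("Subject.language", translation)}


def bidirectional_mt_metadata(x, y):
    """Bi-directional MT: both codes appear under both elements."""
    return {
        ("Language", x), ("Language", y),
        ("Subject.language", x), ("Subject.language", y),
    }


# Sets are used because the metadata is unordered.
x_to_y = bitext_metadata("X", "Y")
y_to_x = bitext_metadata("Y", "X")

# Direction survives for bitexts, since it is encoded by element choice...
print(x_to_y != y_to_x)
# ...while a bi-directional resource reads the same either way round.
print(bidirectional_mt_metadata("X", "Y") == bidirectional_mt_metadata("Y", "X"))
```

Under this reading, a client can distinguish an X-to-Y bitext from a Y-to-X one without relying on element order.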

  41. Language Archives Project of Taiwan • Part of the National Digital Archives Project (NDAP) • Pilot Stage 2000-2001 • First Phase: 2002-2006 • Both Language Archives • And Linguistic Anchor

  42. Language and Digital Archives

  43. Digital Archives are Linguistically Anchored • Archives are anchored with a Lexical KnowledgeBase (LKB) - because an LKB, as a collection of lexical types instantiated in archives, uniquely defines each archive - and each lexical item is the conceptual atom projecting knowledge from archive to archive

  44. Multi-anchor Knowledge Linking • Geographical anchor based on GIS (geographic information system) - Ecology (Fauna, Weather, Geology etc.) - Socio-Anthropological classification • Linguistic anchor based on LKB - etymology, language grouping, loan words

  45. Institute of Linguistics Language Archives

  46. Two branch projects: • 1 Chinese Archives -- 5 sub-projects: • Early Mandarin Chinese Lexicon • Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts • Modern Chinese Corpus and Treebank • New Age Corpus: Linguistic Representations and Archives of Multimedia Data • Southern-Min Archive: A Database of Historical Change in Language Distribution • 2 Formosan Language Archives.

  47. Early Mandarin Chinese Lexicon • GOAL: • Collect the corpus and the lexicon of the Early Mandarin Chinese period. • Provide a systematic knowledge thesaurus as well as a powerful instrument for the study of grammatical development. • Archives Description: • Digitization of texts (10,000,000 characters). • Tagging of grammatical markers (3,500,000 characters). • Construction of the lexical database. • http://www.sinica.edu.tw/Early_Mandarin

  49. Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts • Archives Description: • Digitization of the bronze inscriptions from the Shang to the Eastern Chou dynasties. • Construction of a typological lexicon of bronze inscriptions and bamboo scripts. • Accurate encoding and analysis of the bronze inscriptions and Chu scripts. • Achievement: • Proofread bronze inscriptions (12,113 pieces). • http://Inscription.sinica.edu.tw
