The Open Language Archives Community (OLAC) and Metadata Standards for Language Archives • Chu-Ren Huang (黃居仁), Ru-Yng Chang (張如瑩)
Outline • Introduction to OLAC • Dublin Core & OAI • OLAC Standards • OLAC Metadata Set • OLAC and Asian Languages • Examples • Some Related Web Sites • OLAC Launch
OLAC Aims • OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: • developing consensus on best current practice for the digital archiving of language resources; • developing a network of interoperating repositories and services for housing and accessing such resources.
OLAC Organization • Coordinators: • Steven Bird & Gary Simons • Advisory Board: • Helen Aristar Dry, Susan Hockey, Chu-Ren Huang, Mark Liberman, Brian MacWhinney, Michael Nelson, Nicholas Ostler, Henry Thompson, Hans Uszkoreit, Antonio Zampolli • Participating Archives & Services: • LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht • Prospective Participants: • ASEDA, Academia Sinica, AISRI, INALF, LCAAJ, Linguist, MPI, NAA, OTA, Rosetta, Tibetan Digital Library (UVA) • Individual Members: ~120 • www.language-archives.org
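Introduction to OLAC • Many communities need language resources (linguists, engineers, teachers, speakers), and many institutions provide pieces of the infrastructure (archivists, software developers, publishers). • An unprecedented opportunity: • Extensible Markup Language (XML) and Unicode provide flexible representation and long-term storage of structured data. • Digital publication, online or offline, is an effective and practical way to share language resources. • The Dublin Core metadata set (a standard model for describing resources), together with the exchange method provided by the Open Archives Initiative, makes it possible to build a framework that spans multiple repositories and archives.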
The Vision for an Open Language Archives Community • Users search and view OLAC metadata fields through the web site of an OLAC service provider.
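The Vision for an Open Language Archives Community #2 In theory: users can obtain any resource they need • DATA: • Any information that describes a language. • Survey result: 25% is digitized, but without shared metadata fields. • TOOLS: • Computational resources that help create, browse, query, or use language data. • ADVICE: Which resources are reliable? Which tool suits this situation? How should new data be created?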
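The Vision for an Open Language Archives Community #3 In practice • Users cannot get the resources they want. • The same thing goes by different names at different sites, which causes low recall. • The same term carries another meaning in other fields, which causes low precision. • Is the right software being used, and how can the value of ADVICE be judged? • Many language resources are not text-based. • Language resources are scattered across different web sites.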
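The Vision for an Open Language Archives Community - Bridging the gap through community infrastructure • Gateway: a single portal where users can obtain data, tools, and advice. • Metadata: a uniform description of data, tools, and advice, including links to every item and an explanation of how to access it. • Review: browse evaluations of data, tools, and advice. • Standards: the foundation for the processes and protocols above, for example the metadata schema and the harvesting protocol.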
The Vision for an Open Language Archives Community - Summary: Seven Layers Complete the Bridge • [Figure: layered diagram. Recoverable labels: USER, SERVICES, OLAC SERVICES, OLAC REPOSITORIES, CONTENT, METADATA, OAI; CREATE, CONVERT, EXPORT, DELIVER, FORMAT; OLAC PROC, OLAC MHP, OLAC MS, OAI, DC; Recommendations, Initiatives, Software, Standards.]
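Dublin Core Metadata Initiative • Began in 1995 at a workshop on web resource discovery: http://dublincore.org/ • The Dublin Core metadata element set is a broad, cross-disciplinary core set of elements that supports resource discovery and can describe any resource, whether it exists in digital or traditional form. • It contains fifteen elements, all optional and repeatable: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights. • 2002/01/07: expressed in RDF/XML: http://dublincore.org/documents/2001/11/28/dcmes-xml/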
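The "optional and repeatable" property of the fifteen elements can be sketched in a few lines of Python. This is only an illustration: the record below is hypothetical, and the function name `unknown_elements` is an assumption, not part of any Dublin Core tool.

```python
# The fifteen Dublin Core element names listed on the slide, lowercased.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def unknown_elements(record):
    """Return the names used in `record` that are not Dublin Core elements.

    `record` is a list of (element, value) pairs. A list of pairs is used
    instead of a dict because every element is optional and may repeat.
    """
    return sorted({name for name, _ in record if name.lower() not in DC_ELEMENTS})


# Hypothetical record, for illustration only.
record = [
    ("title", "An example corpus"),
    ("language", "zh"),
    ("language", "en"),        # repetition is allowed
    ("annotator", "someone"),  # not one of the fifteen elements
]
print(unknown_elements(record))
```

No element is required, so an empty record also passes this check.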
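The Open Archives Initiative #1 • Founded in October 1999 as a general framework across e-print archives, applicable to digital repositories of any kind of scholarly material. • The OAI infrastructure requires two standards: • OAI Shared Metadata Set (Dublin Core): makes interoperation across repositories easy. • OAI Metadata Harvesting Protocol: lets software query repositories over HTTP.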
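A harvesting request is an ordinary HTTP query. The sketch below builds one, assuming the `verb` and `metadataPrefix` parameter names and the `oai_dc` prefix of the OAI protocol. The base URL is a made-up placeholder, and `harvest_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def harvest_url(base_url, verb="ListRecords", metadata_prefix="oai_dc"):
    """Build the URL a harvester would fetch from a repository.

    `oai_dc` requests the shared Dublin Core metadata format.
    """
    query = urlencode({"verb": verb, "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Placeholder repository address, not a real archive.
print(harvest_url("http://archive.example.org/oai"))
```

A service provider would fetch this URL from each registered repository and index the XML records it gets back.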
The Open Archives Initiative #2 • The Relationship Between an OAI Repository and an Archive
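Applying the OAI to Language Resources • Features of the OAI: • Metadata-based search across all data providers through a single interface. • Distributed and bottom-up, like the Web. • The structured nature of a centralized database suits users who need access to rapidly growing resources and to large amounts of user-oriented resource description. • Supports metadata that extends Dublin Core. • Gathers meta-archives in a single place, so users can search several archives at once. • [Figure labels: OAI service provider, OAI archive]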
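The Open Language Archives Community • Founded in December 2000 at the Workshop on Web-Based Language Documentation and Description by linguists and software developers from North America, South America, Europe, Africa, the Middle East, Asia, and Australia. • OLAC gateway: http://www.language-archives.org/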
Foundation: OLAC & OAI • Recall: OAI data providers must support: • Dublin Core Metadata • OAI Metadata harvesting protocol • BUT: OAI data providers can support: • a more specialized metadata format • a more specialized harvesting protocol • What OLAC does: • specialized metadata for language resources • specialized harvesting (extra validation)
OLAC Standards • Aside: • standards = the protocols and interfaces that allow the community to function • recommendations = "standards" for representing linguistic content • OLAC has three primary standards: • OLACMS: the OLAC Metadata Set (Qualified DC) • OLAC MHP: refinements to the OAI protocol • OLAC Process: a procedure for identifying Best Common Practice Recommendations
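OLAC Metadata Set #1 • Based on the 15 Dublin Core elements, which are further organized and defined. The principles for qualifying elements are given in [DC-Q], with examples in [DCQ-HTML]. • Records can be encoded and validated with an XML DTD or Schema. • Latest version of the OLAC XML Schema: http://www.language-archives.org/OLAC/0.4/olac.xsd • Example: http://www.language-archives.org/OLAC/0.4/olac.xml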
The OLAC Metadata Set #2 The three categories of metadata: • Work language: describes information entities and their intellectual attributes e.g. names of works and their creators • Document language: describes and provides access to the physical manifestation of information e.g. format, publisher, date, rights • Subject language: describes what a document is about e.g. subject, description
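OLAC Metadata Set #3 • refine: a more precise or more specific meaning for the element. • code: a value from an encoding scheme that precisely controls the metadata value. • scheme: the name of the standard that the element's text content follows. • lang: the language in which the element content is written. • langs: an attribute of the <olac> element that specifies the languages in which the metadata can be read. • [Figure: an element with the attributes refine, code, scheme, and lang, and their controlled vocabularies] • Example: <creator refine="editor">Smith</creator>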
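As a rough illustration of how these attributes sit on an element, the sketch below parses the slide's `<creator>` example with Python's standard `xml.etree.ElementTree`. The function name `describe` and the dictionary layout are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def describe(xml_text):
    """Split one OLAC-style element into its name, attributes, and content.

    Attributes that are absent come back as None.
    """
    el = ET.fromstring(xml_text)
    return {
        "element": el.tag,
        "refine": el.get("refine"),
        "code": el.get("code"),
        "lang": el.get("lang"),
        "content": (el.text or "").strip(),
    }


print(describe('<creator refine="editor">Smith</creator>'))
```

Here `refine` narrows the creator role to editor, and the element content holds the free-text name.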
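OLAC Metadata Set #3 • Name: the formal name of the tag. • Definition: a one-line statement of how the element is used. • Comments: a detailed description of how the element is used, including how DCMS and OLAC use it. • Attributes: the element's XML attributes. • Examples: sample usage. • Every element may be repeated.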
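OLAC Metadata Set: Language #1 • Name: Audience Language • Definition: the language in which the content of the resource is expressed. • Comments: • The language the creator uses so that the audience can understand the work. • Compare with Subject.language. • Examples: the language of a literary work or of a monolingual document, a specific language used as an aid for speakers, the language of a sound recording, the language in which a syntactic description is written, and the language of annotations or of the definitions in a bilingual dictionary. The text being annotated and the words being defined in a bilingual dictionary are marked with Subject.language instead. • Attributes: • code: for the controlled vocabulary see [OLAC-Language]. When the controlled vocabulary is insufficient, or its term differs from the preferred one, describe the language in the element content.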
OLAC Metadata Set: Language #2 Examples • A resource in English about the Sikaiana language: <language code="en"/> <subject.language code="x-sil-sky"/> • A Yemba-French dictionary, where the alternate name Dschang is preferred: <language code="fr"/> <subject.language code="x-sil-ban">Dschang</subject.language> • The American Heritage Dictionary, which is both in and about American English: <language code="en-us"/> <subject.language code="en-us"/> • A resource about a language for which the controlled vocabulary does not yet provide a code: <subject.language>Ancient Sumerian</subject.language>
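The split between the language a resource is written in and the language it is about can be read back out of a record. The sketch below wraps the slide's Yemba-French example in an `<olac>` root for parsing. The wrapper, the function name `languages`, and the fallback to element content when no code exists are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# The Yemba-French dictionary example, wrapped in a root element.
RECORD = """
<olac>
  <language code="fr"/>
  <subject.language code="x-sil-ban">Dschang</subject.language>
</olac>
"""


def languages(xml_text):
    """Return the audience and subject languages found in a record."""
    root = ET.fromstring(xml_text)

    def label(el):
        # Prefer the controlled code; fall back to the free-text content.
        return el.get("code") or (el.text or "").strip()

    return {
        "audience": [label(el) for el in root if el.tag == "language"],
        "subject": [label(el) for el in root if el.tag == "subject.language"],
    }


print(languages(RECORD))
```

For the Ancient Sumerian example, which has no code, the same function returns the element content as the subject language.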
OLAC and Asian Languages: Two Issues • Language Identification • Is the current OLAC/Ethnologue vocabulary rich enough to describe all Asian languages? • Multilingual Resources • Are the current OLACMS and Process comprehensive enough to describe multilingual resources?
Language Identification • The DC two-letter code (e.g. ‘en’ for English) is not enough to describe all the languages in the world • Ethnologue (http://www.ethnologue.org) is currently the most comprehensive description of the world’s languages • Potential problems of using Ethnologue (or any existing language list): • over-splitting • over-chunking • omission
Solutions to LI Problems #1 Use controlled vocabulary for elaboration: <language code="x-sil-BNN">Northern/Takituduh</language> <language code="x-sil-BNN">Northern/Takibakha</language> <language code="x-sil-BNN">Central/Takbanuaz</language> <language code="x-sil-BNN">Central/Takivatan</language> <language code="x-sil-BNN">Southern/Isbukun</language>
Solutions to LI Problems #2 • Registering language groups with an OLAC registration service: • An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes) • AS:Amis = {ALV, AIS}
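An extensional definition is simply a named set of codes, so a search for a group can be expanded into its members. The sketch below uses the slide's `AS:Amis = {ALV, AIS}` entry. The in-memory table and the `expand` and `matches` helpers are hypothetical stand-ins for whatever a registration service would actually offer.

```python
# Hypothetical in-memory registry: group name -> set of Ethnologue codes.
LANGUAGE_GROUPS = {
    "AS:Amis": {"ALV", "AIS"},
}


def expand(name):
    """Return the codes a query term stands for.

    A registered group expands to its member codes. Any other term is
    treated as a single code.
    """
    return LANGUAGE_GROUPS.get(name, {name})


def matches(query, resource_code):
    """True if a resource tagged with `resource_code` satisfies `query`."""
    return resource_code in expand(query)


print(matches("AS:Amis", "ALV"))  # ALV is a member of the group
print(matches("AS:Amis", "BNN"))  # BNN is not
```

This keeps each resource tagged with a fine-grained Ethnologue code while letting users search by a broader, user-defined group.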
Multilingual Resources #1 • Directionality is crucial in multilingual resources • However, OLAC metadata is flat and unordered • In MT systems: directional information is lost, but the metadata is still sufficient for resource harvesting • Bi-directional MT: <language code="X"/> <language code="Y"/> <subject.language code="X"/> <subject.language code="Y"/>
Multilingual Resources #2 • One-to-many MT: <subject.language code="S"/> <language code="T1"/> <language code="T2"/> <language code="T3"/> • Many-to-one MT: <subject.language code="S1"/> <subject.language code="S2"/> <subject.language code="S3"/> <language code="T"/>
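The loss of directionality in flat metadata can be shown with a small sketch. It follows the MT convention on these slides (source as subject.language, target as language). The pair-set representation and the function name `flat_metadata` are assumptions made for illustration.

```python
def flat_metadata(pairs):
    """Flatten a set of (source, target) translation pairs into elements.

    Sources become subject.language and targets become language, so the
    pairing between a given source and a given target is discarded.
    """
    sources = sorted({s for s, _ in pairs})
    targets = sorted({t for _, t in pairs})
    return ([f'<subject.language code="{s}"/>' for s in sources]
            + [f'<language code="{t}"/>' for t in targets])


# Two hypothetical systems that translate in different directions...
system_a = {("S1", "T1"), ("S2", "T2")}
system_b = {("S1", "T2"), ("S2", "T1")}

# ...produce identical flat metadata.
print(flat_metadata(system_a) == flat_metadata(system_b))  # True
```

Both systems are still found by a search for any of the four languages, which is why the flat description remains sufficient for harvesting.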
Multilingual Resources #3 • Text: language • Bitext (bilingual aligned corpus) • There is always a directionality • Original -> language, Translation -> Subject.language • Language Description (Field Notes) • Elicitation, transcription, translation, notes: multiple related resources
Examples #1 • Academia Sinica Balanced Corpus of Modern Chinese (中央研究院現代漢語平衡語料庫) • http://corpus.ling.sinica.edu.tw/project/LanguageArchive/process/SinicaCorpus.xml • Academia Sinica Tagged Corpus of Early Mandarin Chinese (中央研究院近代漢語標記語料) • http://corpus.ling.sinica.edu.tw/project/LanguageArchive/process/Early_Mandarin.xml • Academia Sinica Formosan Languages Corpus (中央研究院台灣南島語語料庫) • http://corpus.ling.sinica.edu.tw/project/LanguageArchive/process/Formosan.xml
Examples #2 • OLAC Metadata Editor • http://wave.ldc.upenn.edu/OLAC/minirepo/home.php4 • OLAC Service Provider: • http://wave.ldc.upenn.edu:8082/olac/index.html
Some Related Web Sites • OLAC • http://www.language-archives.org/ • Dublin Core Metadata Initiative • http://dublincore.org/ • Language Archive • http://corpus.ling.sinica.edu.tw/project/LanguageArchive/
OLAC Launch • OLAC will be officially launched at the next meeting of the Linguistic Society of America (San Francisco, January 2002) http://www.language-archives.org/docs/lsa-symposium.html