
Sangwon Park January 12, 2011

KKAP: KAIST Korean Analysis Platform. Morphological Analyzer, POS Tagger, Parser.


Presentation Transcript


  1. KKAP: KAIST Korean Analysis Platform • Morphological Analyzer, POS Tagger, Parser • Sangwon Park • January 12, 2011

  2. Research Goal • The goal of the research is to develop KKAP (KAIST Korean Analysis Platform), an infrastructure for Korean natural language analysis. • KKAP will be flexible and easy to use so that it can be widely adopted in various areas. The platform will include a morphological analyzer, a POS tagger, a parser, etc.

  3. Contents • 1. Introduction to Korean Morphological Analysis • 2. HanNanum Korean Morphological Analyzer & POS Tagger • 3. Extension to KKAP (KAIST Korean Analysis Platform)

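  4. Difficulties of Korean morphological analysis • Features of Korean morphological analysis: • Ambiguity of part of speech • Ambiguity of morpheme segmentation • Example: the word phrase 가시는 has four analyses (josa = particle, eomi = ending): • 가시/noun + 는/josa (thorn, prickle) • 가시/verb + 는/eomi (leave, disappear) • 가/verb + 시/eomi + 는/eomi (go) • 갈/verb + 시/eomi + 는/eomi (grind, sharpen) • Example sentences, one per analysis: • 그 선인장의 가시는 참 따가웠다. (The thorns of that cactus were really prickly.) • 물을 마셨더니 갈증이 가시는 기분이다. (After drinking water, I feel my thirst going away.) • 할머니께서는 집에 가시는 길이었다. (Grandmother was on her way home.) • 아저씨의 칼을 가시는 모습은 인상적이다. (The sight of the man sharpening his knife is impressive.)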
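The two kinds of ambiguity on this slide can be illustrated with a minimal Python sketch. The toy lexicon, the coarse tags, and the connection rules below are assumptions made up for this example (they are not HanNanum's dictionaries or rule tables). The entry mapping the surface form 가 to the stem 갈 stands in for phoneme restoration.

```python
from typing import Dict, List, Tuple

# Toy lexicon (hypothetical): surface string -> possible (lemma, tag) readings.
# The surface form "가" may also come from the stem "갈" (the final ㄹ is dropped).
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "가시": [("가시", "noun"), ("가시", "verb")],
    "가": [("가", "verb"), ("갈", "verb")],
    "시": [("시", "eomi")],
    "는": [("는", "josa"), ("는", "eomi")],
}

# Toy connection rules (hypothetical): which tag may follow which.
ALLOWED = {("noun", "josa"), ("verb", "eomi"), ("eomi", "eomi")}
INITIAL = {"noun", "verb"}  # tags allowed at the start of a word phrase


def analyze(word: str) -> List[str]:
    """Enumerate every segmentation of `word` that the lexicon and rules allow."""
    results: List[str] = []

    def extend(pos: int, path: List[Tuple[str, str]]) -> None:
        if pos == len(word):
            results.append("+".join(f"{lemma}/{tag}" for lemma, tag in path))
            return
        for end in range(pos + 1, len(word) + 1):
            for lemma, tag in LEXICON.get(word[pos:end], []):
                if path:
                    ok = (path[-1][1], tag) in ALLOWED
                else:
                    ok = tag in INITIAL
                if ok:
                    extend(end, path + [(lemma, tag)])

    extend(0, [])
    return results


if __name__ == "__main__":
    for candidate in analyze("가시는"):
        print(candidate)
```

With this toy data the function returns the four analyses listed on the slide. Choosing among them requires sentence context, which is the job of the POS tagger.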

  5. HanNanum Korean Morphological Analyzer • HanNanum has been under development since the 1990s. • Written in the C programming language • Module-based architecture • Based on the KAIST morphologically analyzed corpus • HMM-based and Maximum Entropy-based POS taggers

  6. HanNanum Architecture • [Architecture diagram; only its labels survive] • Processing components: INPUT, Sentence Divisor, Code Conversion, Morphological Analyzer, Analyzer, Dictionary Search, Connection Check, Phoneme Restoration, Morpheme Chart, Chart (lattice form), Tagger, Computation, Tag Mapper, OUTPUT • Data resources: System Dictionary (Trie), User Dictionary (Trie), Number Dictionary, Frequency Dictionary, Connection Info. Table, Bigram Info., Tag Set, Tag Set Table • Other labels: Segment Position, Inverse Segment Position

  7. HMM-based POS Tagger • Reference: Shin Jung-ho, Han Young-seok, Park Young-chan, Choi Key-Sun, “An HMM Part-of-Speech Tagger for Korean Based on Wordphrase”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 389-394, 1994. • The model uses three kinds of probability: • Transition probability between word-phrase tags • Transition probability between morpheme tags within a word phrase • Probability of occurrence of a morpheme with its POS
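As a rough illustration of how the three probabilities could be combined, here is a minimal Viterbi-style sketch in Python. It is not the model from the cited paper: the function names, the probability tables, and the simplification that the transition between word phrases links the last tag of one word phrase to the first tag of the next are all assumptions for this example, and the numbers are invented.

```python
import math
from typing import Dict, List, Tuple

Morpheme = Tuple[str, str]        # (morpheme, tag)
Analysis = Tuple[Morpheme, ...]   # one candidate analysis of a word phrase
FLOOR = 1e-6                      # hypothetical floor for unseen events


def analysis_score(a: Analysis, intra: Dict, emit: Dict) -> float:
    """Log-score inside one word phrase: emissions plus tag-to-tag transitions."""
    score = 0.0
    for i, (morph, tag) in enumerate(a):
        score += math.log(emit.get((morph, tag), FLOOR))
        if i > 0:
            score += math.log(intra.get((a[i - 1][1], tag), FLOOR))
    return score


def viterbi(cands: List[List[Analysis]], inter: Dict, intra: Dict,
            emit: Dict) -> List[Analysis]:
    """Pick one analysis per word phrase so that the total log-score is maximal."""
    best = [(math.log(inter.get(("<s>", c[0][1]), FLOOR))
             + analysis_score(c, intra, emit), [c]) for c in cands[0]]
    for options in cands[1:]:
        new_best = []
        for c in options:
            # Assumed simplification: link last tag of previous phrase to first tag here.
            prev_score, prev_path = max(
                ((s + math.log(inter.get((p[-1][-1][1], c[0][1]), FLOOR)), p)
                 for s, p in best),
                key=lambda item: item[0])
            new_best.append((prev_score + analysis_score(c, intra, emit),
                             prev_path + [c]))
        best = new_best
    return max(best, key=lambda item: item[0])[1]


# Invented toy probabilities for the two word phrases "집에 가시는".
INTER = {("<s>", "noun"): 0.8, ("josa", "verb"): 0.6, ("josa", "noun"): 0.3}
INTRA = {("noun", "josa"): 0.6, ("verb", "eomi"): 0.7, ("eomi", "eomi"): 0.3}
EMIT = {("집", "noun"): 0.02, ("에", "josa"): 0.2, ("가", "verb"): 0.05,
        ("갈", "verb"): 0.005, ("가시", "noun"): 0.001, ("가시", "verb"): 0.0005,
        ("시", "eomi"): 0.1, ("는", "eomi"): 0.2, ("는", "josa"): 0.3}
CANDIDATES = [
    [(("집", "noun"), ("에", "josa"))],
    [(("가시", "noun"), ("는", "josa")),
     (("가시", "verb"), ("는", "eomi")),
     (("가", "verb"), ("시", "eomi"), ("는", "eomi")),
     (("갈", "verb"), ("시", "eomi"), ("는", "eomi"))],
]

if __name__ == "__main__":
    for analysis in viterbi(CANDIDATES, INTER, INTRA, EMIT):
        print("+".join(f"{m}/{t}" for m, t in analysis))
```

Working in log space avoids numeric underflow on long sentences. A real tagger would estimate the tables from a tagged corpus rather than hard-code them.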

  8. Analysis Example • HMM-based tagger: finds the most suitable result among the candidates • Candidate generation: - POS-tagged dictionary - Connection rule check - Phoneme restoration

  9. Plug-In Component-based System • Each function of Korean morphological analysis is implemented as a plug-in. • Users can set up a workflow from existing plug-ins to suit their own goals. • [Diagram; only its labels survive] • Plug-in pool: HMM POS Tagger, CRF POS Tagger, Chart-based Morph Analyzer, Corpus-based Morph Analyzer, Tag Mapper, Transliteration, Noun Extractor, … • Workflow slots: Phase1 Supplement Plugin, Phase2 Morphological Analyzer, Phase2 Supplement Plugin, Phase3 POS Tagger, Phase3 Supplement Plugin • Example plug-ins shown in the slots: Sentence Splitter, Auto Spacing, Input Filter, Tag Mapping, Unknown Noun Proc., Noun Extracting, …
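The plug-in idea can be sketched in a few lines of Python. This is a hypothetical illustration of a phase-ordered workflow, not the HanNanum API: the `Workflow` class and the three stand-in plug-ins are invented for this example, and the stand-ins do no real linguistic analysis.

```python
from typing import Any, Callable, Dict, List, Tuple

Plugin = Callable[[Any], Any]  # a plug-in maps the previous output to a new output


class Workflow:
    """Runs plug-ins phase by phase, in registration order within each phase."""

    PHASES = (1, 2, 3)  # preprocessing, morphological analysis, POS tagging

    def __init__(self) -> None:
        self._plugins: Dict[int, List[Plugin]] = {p: [] for p in self.PHASES}

    def add(self, phase: int, plugin: Plugin) -> "Workflow":
        self._plugins[phase].append(plugin)
        return self

    def run(self, data: Any) -> Any:
        for phase in self.PHASES:
            for plugin in self._plugins[phase]:
                data = plugin(data)
        return data


# Stand-in plug-ins (hypothetical placeholders for real components).
def sentence_splitter(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def whitespace_analyzer(sentences: List[str]) -> List[List[str]]:
    return [s.split() for s in sentences]


def dummy_tagger(sentences: List[List[str]]) -> List[List[Tuple[str, str]]]:
    return [[(token, "unk") for token in tokens] for tokens in sentences]


if __name__ == "__main__":
    workflow = (Workflow()
                .add(1, sentence_splitter)
                .add(2, whitespace_analyzer)
                .add(3, dummy_tagger))
    print(workflow.run("a b. c"))
```

Because each plug-in only depends on the shape of its input, a plug-in can be swapped (for example, one tagger for another) without touching the rest of the workflow.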

  10. Flexible Workflow • Example 1: Analysis of Announcement on Web • Plug-ins used: InformalInputFilter, AutoSpacing, SentenceSplitter, Chart-based MorphologicalAnalyzer, UnknownProcessor, HMM-based POSTagger • Plug-in types: Plain Text Processor, Morphological Analyzer, Morpheme Processor, POS Tagger • Input: $$$$$장소$$$$$ 서울코엑스3층 ("$$$$$Venue$$$$$ Seoul COEX 3rd floor") • Output: $$$$$ → $/su+$/su+$/su+$/su+$/su; 장소 → 장소/ncn; $$$$$ → $/su+$/su+$/su+$/su+$/su; 서울 → 서울/nq; 코엑스 → 코엑스/ncn; 3층 → 3/nnc+층/nbu • Example 2: Indexing of News Articles • Plug-ins used: SentenceSplitter, Chart-based MorphologicalAnalyzer, NounExtractor • Plug-in types: Plain Text Processor, Morphological Analyzer, Morpheme Processor • Input: 지난 9월 거제도에서 열린 축제 … ("The festival held on Geoje Island last September …") • Output: 9월/n 거제도/ncn 축제/ncn
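The tagged output on this slide uses a `morpheme/tag+morpheme/tag` notation, which is easy to post-process. The Python sketch below parses that notation and keeps only nouns, roughly what a noun-extraction step for indexing might do. The function names are hypothetical, and the choice of the tag prefixes `nc` and `nq` as "nouns worth indexing" is an assumption for illustration, not the rule used by the NounExtractor plug-in. The parser is naive: it assumes morphemes contain no `+`.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface form, tag)


def parse_eojeol(tagged: str) -> List[Morpheme]:
    """Split 'a/tag1+b/tag2' into [('a', 'tag1'), ('b', 'tag2')]."""
    morphemes: List[Morpheme] = []
    for piece in tagged.split("+"):
        # rpartition splits on the LAST slash, so a surface form may contain '/'.
        surface, _, tag = piece.rpartition("/")
        morphemes.append((surface, tag))
    return morphemes


def extract_nouns(tagged_line: str,
                  prefixes: Tuple[str, ...] = ("nc", "nq")) -> List[str]:
    """Return surface forms whose tag starts with one of `prefixes` (assumed noun tags)."""
    nouns: List[str] = []
    for eojeol in tagged_line.split():
        for surface, tag in parse_eojeol(eojeol):
            if tag.startswith(prefixes):
                nouns.append(surface)
    return nouns


if __name__ == "__main__":
    print(parse_eojeol("3/nnc+층/nbu"))
    print(extract_nouns("서울/nq 코엑스/ncn 3/nnc+층/nbu"))
```

Passing a different `prefixes` tuple changes what counts as a noun, for example to include numerals or bound nouns.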

  11. HanNanum Korean Morphological Analyzer: Workflow for Morphological Analysis • Purpose: Korean document analysis, extracting part-of-speech information from Korean text • Phase 1. Text Preprocessing (supplement plug-ins) • Phase 2. Morphological Analysis (major plug-in, supplement plug-ins) • Phase 3. POS Tagging (major plug-in, supplement plug-ins) • [Diagram; only its labels survive] Plug-in pool: Sentence Segmentation, Auto Spacing, Input Filter, Chart-based Morph Analyzer, Unknown Term Processing, Tag Mapper, HMM-based POS Tagging, CRF-based POS Tagging, Noun Extraction • Input: 7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. … (English: "The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, to be announced on the evening of the 7th. AP reported that, among Nobel Prize watchers in Sweden, the Korean poet Ko Un, together with the Syrian poet Adonis, was mentioned most often as a candidate likely to win this year's prize. …") • Output: 7/nnc+일/nbu저녁/ncn발표예정/ncpa+이/jp+ㄴ/etm노벨문학상/nq+의/jcm유력/ncps수상자/ncn+로/jca고은/nq시인/ncn+이/jcc거론/ncpa+되/xsv+고/ecc있/paa+다/ef./sf통신은 통/ncn+신/ncn+은/jxc스웨덴/nq+의/jcm노벨상/ncn관측통/ncn+들/xsn사이/ncn+에/jca….

  12. Open Source Project • http://kldp.net/projects/hannanum/ • jhannanum 0.8.2 was released on January 10, 2011.

  13. GUI Demo • [Screenshot; only its labels survive] Panels: Workflow, Information of a plug-in, Plug-in Pool, Workflow control, Input & Output

  14. KKAP: KAIST Korean Analysis Platform: Workflow for Korean Analysis • Purpose: Korean document analysis, producing an analyzed Korean document • Phase 1. Text Preprocessing (supplement plug-ins) • Phase 2. Morphological Analysis (major plug-in, supplement plug-ins) • Phase 3. POS Tagging (major plug-in, supplement plug-ins) • Phase 4. Parsing (major plug-in, supplement plug-ins) • [Diagram; only its labels survive] Plug-in pool: Sentence Segmentation, Auto Spacing, Input Filter, Chart-based Morph Analyzer, Unknown Term Processing, Tag Mapper, HMM-based POS Tagging, Noun Extraction, Chart Parser, Noun Phrase Extractor, Verb Phrase Extractor • Input: 7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. … (English: "The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, to be announced on the evening of the 7th. AP reported that, among Nobel Prize watchers in Sweden, the Korean poet Ko Un, together with the Syrian poet Adonis, was mentioned most often as a candidate likely to win this year's prize. …") • Output: 7/nnc+일/nbu저녁/ncn발표예정/ncpa+이/jp+ㄴ/etm노벨문학상/nq+의/jcm유력/ncps수상자/ncn+로/jca고은/nq시인/ncn+이/jcc거론/ncpa+되/xsv+고/ecc있/paa+다/ef./sf통신은 통/ncn+신/ncn+은/jxc스웨덴/nq+의/jcm노벨상/ncn관측통/ncn+들/xsn사이/ncn+에/jca….

  15. Korean Syntactic Tree Tagged Corpus • Registered at BoRA (Bank of Resource for Language and Annotation) • http://bora.or.kr • Corpus 5. Manual sentence analysis corpus • 31,091 sentences from 97 different sources • Length: 1 to 33 eojeols (average 11.35 eojeols) • Related documents: • Kong Joo Lee, ByungGyu Chang, Gil Chang Kim, “Bracketing Guidelines for Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department Technical Report, CS/TR-97-112, 1997 (in Korean) • ByungGyu Chang, Kong Joo Lee, Gil Chang Kim, “Design and Implementation of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 421-429, 1997 (in Korean)

  16. Questions & Comments
