
HanNanum Project

HanNanum Project. Sangwon Park, 2010.11.24. Contents: the result of applying a plug-in component based architecture; key differences from the previous HanNanum (jhannanum ver.0.7.4); GUI demo; a measurement of the morphological analyzer; features of Korean morphological analysis.


Presentation Transcript


  1. HanNanum Project Sangwon Park 2010.11.24

  2. Contents
     • The result of applying plug-in component based architecture
       • Key differences from the previous HanNanum (jhannanum ver.0.7.4)
       • GUI demo
     • A measurement of the morphological analyzer
       • Features of Korean morphological analysis
       • Measurement 1. Strict criteria
       • Measurement 2. Loose criteria

  3. The result of applying plug-in component based architecture
     • HanNanum ver.0.8 was released
       • Plug-in component based architecture
       • Faster analysis speed
         • Object based communication
         • Reduced overhead between components
       • More accurate results
         • Several bugs were fixed.
     • GUI Demo
       • It helps people understand the concept of the HanNanum workflow
       • People can test various workflows for their own purposes
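The idea of object based communication between plug-ins can be sketched as follows. This is a minimal illustration in Python, not HanNanum's actual API (HanNanum itself is a Java library): the names `Sentence`, `Workflow`, `toy_analyzer`, and `keep_first`, and the two-entry toy lexicon, are all hypothetical. The point is only that each plug-in receives and returns the same in-memory object, so no stage has to serialize and re-parse text.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sentence:
    """Object handed from one plug-in to the next (hypothetical type)."""
    eojeols: List[str]
    analyses: List[List[str]] = field(default_factory=list)


# A plug-in is modeled here as any callable that maps Sentence -> Sentence.
Plugin = Callable[[Sentence], Sentence]


def toy_analyzer(sentence: Sentence) -> Sentence:
    """Stand-in morphological analyzer backed by a tiny toy lexicon."""
    lexicon = {"가시는": ["가시/ncn+는/jxc", "가/pvg+시/ep+는/etm"]}
    sentence.analyses = [lexicon.get(e, [e + "/unk"]) for e in sentence.eojeols]
    return sentence


def keep_first(sentence: Sentence) -> Sentence:
    """Stand-in tagger: keeps only the first candidate for each eojeol."""
    sentence.analyses = [cands[:1] for cands in sentence.analyses]
    return sentence


class Workflow:
    """Runs plug-ins in the order they were added."""

    def __init__(self) -> None:
        self.plugins: List[Plugin] = []

    def add(self, plugin: Plugin) -> "Workflow":
        self.plugins.append(plugin)
        return self

    def run(self, text: str) -> Sentence:
        sentence = Sentence(eojeols=text.split())
        for plugin in self.plugins:
            sentence = plugin(sentence)  # the object is passed along as-is
        return sentence


result = Workflow().add(toy_analyzer).add(keep_first).run("집에 가시는")
print(result.analyses)
```

Swapping `keep_first` for another callable changes the workflow without touching the other stages, which is the kind of experimentation the GUI demo is described as supporting.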

  4. GUI Demo [screenshot; labeled areas: Workflow, Information of a plug-in, Plug-in Pool, Workflow control, Input & Output]

  5. A measurement of the morphological analyzer POS Tagger

  6. A measurement of the morphological analyzer
     • Features of Korean morphological analysis
       • 가시는
         • 가시/noun + 는/josa (thorn, prickle)
         • 가시/verb + 는/eomi (leave, disappear)
         • 가/verb + 시/eomi + 는/eomi (go)
         • 갈/verb + 시/eomi + 는/eomi (grind, sharpen)
       • Ambiguity of part-of-speech
       • Ambiguity of morpheme segmentation
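The two kinds of ambiguity can be told apart mechanically. The sketch below takes the four analyses of 가시는 listed on the slide and groups them by morpheme sequence: a group with more than one tag sequence shows part-of-speech ambiguity, and the existence of more than one group shows segmentation ambiguity. The function name `group_by_segmentation` and the tuple representation are illustrative choices, not part of HanNanum.

```python
from collections import defaultdict

# The four candidate analyses of 가시는 from the slide, as (morpheme, category) pairs.
CANDIDATES = [
    [("가시", "noun"), ("는", "josa")],
    [("가시", "verb"), ("는", "eomi")],
    [("가", "verb"), ("시", "eomi"), ("는", "eomi")],
    [("갈", "verb"), ("시", "eomi"), ("는", "eomi")],
]


def group_by_segmentation(candidates):
    """Map each morpheme sequence to the list of tag sequences seen for it."""
    groups = defaultdict(list)
    for analysis in candidates:
        morphemes = tuple(m for m, _ in analysis)
        tags = tuple(t for _, t in analysis)
        groups[morphemes].append(tags)
    return dict(groups)


groups = group_by_segmentation(CANDIDATES)
for morphemes, tag_sequences in groups.items():
    print("+".join(morphemes), tag_sequences)
```

Here 가시+는 carries two tag sequences (noun+josa and verb+eomi), while 가+시+는 and 갈+시+는 are distinct segmentations of the same surface form.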

  7. Evaluation Metrics (POS Tagger)
     • Input
       • 집에 가시는
     • Output
       • 집에
         • 집/pvg+에/ecx
         • 집/pvg+에/jca
       • 가시는
         • 가시/ncn+는/jxc
         • 갈/pvg+시/ep+는/etm
         • 가/pvg+시/ep+는/etm
         • 가/px+시/ep+는/etm
     • Correct Analysis
       • 집에
         • 집/ncn+에/jca
       • 가시는
         • 가/pvg+시/ep+는/etm
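Checking the output against the correct analysis amounts to parsing each `morpheme/tag+morpheme/tag` string and testing membership. The sketch below uses exactly the strings from the slide; the helper names `parse` and `gold_found` are hypothetical, and the parser assumes that morphemes contain no `+` character.

```python
def parse(analysis):
    """Split '가/pvg+시/ep+는/etm' into [('가', 'pvg'), ('시', 'ep'), ('는', 'etm')]."""
    return [tuple(part.rsplit("/", 1)) for part in analysis.split("+")]


# Analyzer output and correct analyses, copied from the slide.
OUTPUT = {
    "집에": ["집/pvg+에/ecx", "집/pvg+에/jca"],
    "가시는": [
        "가시/ncn+는/jxc",
        "갈/pvg+시/ep+는/etm",
        "가/pvg+시/ep+는/etm",
        "가/px+시/ep+는/etm",
    ],
}
GOLD = {"집에": "집/ncn+에/jca", "가시는": "가/pvg+시/ep+는/etm"}


def gold_found(eojeol):
    """True if the correct analysis appears among the generated candidates."""
    candidates = [parse(a) for a in OUTPUT[eojeol]]
    return parse(GOLD[eojeol]) in candidates


for eojeol in OUTPUT:
    print(eojeol, gold_found(eojeol))
```

With the slide's data, the correct analysis of 가시는 is among the candidates, while 집/ncn+에/jca is not among the two candidates generated for 집에.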

  8. Evaluation Metrics
     • Measurement 1. Strict criteria
       • An analysis result is considered correct only when it is exactly the same as the corpus.
       • The measurement can be performed automatically on a large amount of test data.
       • This has not been used in papers on Korean morphological analyzers.
     • Measurement 2. Loose criteria
       • There can be several correct answers for an input eojeol.
       • Only a few tags, such as {N, P, M, I, J, E, X, F, S}, are considered.
       • Most papers use these criteria and report that their analyzers show around 98% accuracy.
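The difference between the two criteria can be sketched in a few lines. This is an illustration under a stated assumption, not the evaluation script used for the slides: it assumes that the coarse tags {N, P, M, I, J, E, X, F, S} correspond to the upper-cased first letter of the fine-grained tags (for example `pvg` and `px` both map to `P`). The function names are hypothetical.

```python
def parse(analysis):
    """Split 'morpheme/tag+morpheme/tag' into a list of (morpheme, tag) pairs."""
    return [tuple(part.rsplit("/", 1)) for part in analysis.split("+")]


def coarse(tag):
    # Assumption: the coarse class is the upper-cased first letter of the fine tag.
    return tag[0].upper()


def strict_match(predicted, gold):
    """Strict criteria: morphemes and fine-grained tags must match exactly."""
    return parse(predicted) == parse(gold)


def loose_match(predicted, gold_answers):
    """Loose criteria: compare coarse tags only, against any accepted answer."""
    pred = [(m, coarse(t)) for m, t in parse(predicted)]
    return any(
        pred == [(m, coarse(t)) for m, t in parse(gold)]
        for gold in gold_answers
    )


gold = "가/pvg+시/ep+는/etm"
candidate = "가/px+시/ep+는/etm"
print(strict_match(candidate, gold))   # False: px differs from pvg
print(loose_match(candidate, [gold]))  # True under the first-letter assumption
```

Under this reading, a candidate that differs from the corpus only in a fine-grained subtype fails the strict test but passes the loose one, which is one way the two criteria can produce very different accuracy figures.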

  9. Measurement 1. Strict criteria
     Input Data
     • Test Corpus: BORA Corpus 2, Aligned morpheme analysis corpus
     • Test Set: 20 sentences with more than 10 eojeols from each of 68 documents
     • # of sentences: 1360
     • # of eojeols: 25515
     Result
     • # of generated eojeols: 74415
     • # of eojeols which are restored and segmented correctly: 23605
     • # of eojeols which are tagged correctly: 19147
     • Precision: 19147/25515 (0.75)
     • Recall: 19147/74415 (0.26)
     • F-measure: 0.38
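The reported figures can be reproduced from the counts. The sketch below follows the slide's own labeling, in which "Precision" divides by the number of corpus eojeols and "Recall" by the number of generated eojeols; under the more common convention the two names would be swapped, but the F-measure is the same either way because it is symmetric in the two ratios. The function name `ratio_scores` is hypothetical.

```python
def ratio_scores(correct, corpus_eojeols, generated_eojeols):
    """Return (precision, recall, f_measure) using the slide's labeling."""
    precision = correct / corpus_eojeols    # labeled "Precision" on the slide
    recall = correct / generated_eojeols    # labeled "Recall" on the slide
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts from Measurement 1: 19147 correctly tagged, 25515 in corpus, 74415 generated.
p, r, f = ratio_scores(19147, 25515, 74415)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.75 0.26 0.38
```

The low F-measure is driven by the second ratio: the analyzer generates roughly three candidate analyses per eojeol (74415 for 25515 eojeols), and at most one of them can match the corpus.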

  10. Measurement 2. Loose criteria
     Larger Morpheme Dictionary
     • The morpheme dictionary was extended with the corpus
     • 29098 morpheme+tag entries were added
     Input
     • Test Corpus: BORA Corpus 2, Aligned morpheme analysis corpus
     • Test Set: 2 sentences with more than 10 eojeols from each of 68 documents
     • # of sentences: 136
     • # of eojeols: 2527
     Result
     • # of generated eojeols: 30536
     • # of eojeols which are restored and segmented correctly: 2340
     • # of eojeols which are tagged correctly: 2041
     • Precision: 2041/2527 (0.81)
     • Recall: 2041/30536 (0.07)
     • F-measure: 0.12
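As a check on the figures above, the F-measure can be worked out directly from the counts. Writing c for the number of correctly tagged eojeols, n for the number of corpus eojeols, and g for the number of generated eojeols (symbols introduced here for illustration), and using the slide's labeling P = c/n and R = c/g:

```latex
F = \frac{2PR}{P + R}
  = \frac{2c}{n + g}
  = \frac{2 \cdot 2041}{2527 + 30536}
  = \frac{4082}{33063}
  \approx 0.12
```

Because the denominator is dominated by g = 30536, the F-measure stays low even though 2041 of the 2527 corpus eojeols (0.81) have a correctly tagged candidate.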

  11. Measurement 2. Loose criteria

  12. Appendix. Segmentation

  13. Appendix. Spacing

  14. Thank you HAPPY CILAB
