
How Large a Corpus do We Need: Statistical Method vs. Rule-based Method


Presentation Transcript


  1. How Large a Corpus do We Need: Statistical Method vs. Rule-based Method Hai Zhao, Yan Song and Chunyu Kit Department of Computer Science and Engineering Shanghai Jiao Tong University, China zhaohai@cs.sjtu.edu.cn 2010.05.20

  2. Motivation • If corpus scale is the only factor that affects learning performance, then how large an annotated corpus do we need to reach a given level of performance?

  3. Zipf’s Law • Word frequencies follow a long-tailed rank-frequency distribution: most word types occur only a few times, so data sparseness remains serious no matter how large the corpus grows.
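To make the Zipf’s-law point concrete, here is a minimal Python sketch (my own illustration, not from the slides): it counts word frequencies and shows that count × rank stays roughly constant near the head while most types occur only once.

```python
from collections import Counter

def zipf_profile(tokens):
    """Rank-frequency profile: under Zipf's law, frequency is roughly
    proportional to 1/rank, so the tail of rare types stays long."""
    freq = Counter(tokens)
    ranked = freq.most_common()
    hapax = sum(1 for _, c in ranked if c == 1)
    print(f"types: {len(ranked)}, tokens: {len(tokens)}, "
          f"hapax: {hapax} ({hapax / len(ranked):.0%} of types)")
    for rank, (word, count) in enumerate(ranked[:5], start=1):
        # Near the head, count * rank should stay roughly constant.
        print(rank, word, count, count * rank)

zipf_profile("the cat sat on the mat and the dog sat too".split())
```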

  4. Choosing the Task: Chinese Word Segmentation • A special case of tokenization in natural language processing (NLP), needed for languages that have no explicit word delimiters such as spaces. • Original: 她来自苏格兰 (“She comes from SU GE LAN”): without segmentation the name is just a string of characters, meaningless to a machine. • Segmented: 她/来/自/苏格兰 (“She / comes / from / Scotland”): meaningful.

  5. Why This Task (CWS)? • A simple task. • Both statistical and rule-based methods are available for it. • Multiple standard large-scale annotated corpora are also available. • A word-oriented task, exactly the kind of data that Zipf’s law describes.

  6. Performance Metric • Evaluation metric, F-score: F = 2RP/(R+P) • R: recall, the proportion of correctly segmented words among all words in the gold-standard segmentation • P: precision, the proportion of correctly segmented words among all words in the segmenter’s output
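A short sketch of how this metric can be computed for word segmentation (my own illustration; the standard trick of comparing character-span sets is assumed, not spelled out in the slides):

```python
def spans(words):
    """Turn a word sequence into a set of (start, end) character spans."""
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def f_score(gold_words, pred_words):
    """F = 2RP/(R+P), counting a word as correct when its span matches."""
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    r = correct / len(gold)      # recall over gold-standard words
    p = correct / len(pred)      # precision over the segmenter's output
    return 2 * r * p / (r + p) if r + p else 0.0

# Gold from the slide's example vs. a hypothetical segmenter output:
print(f_score(["她", "来", "自", "苏格兰"], ["她", "来自", "苏格兰"]))  # ~0.571
```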

  7. Data Sets and Approaches • Data sets: [table of corpus sizes in number of characters; not preserved in this transcript] • Approaches: • CRFs as the statistical method: learns from an annotated corpus • Forward maximal matching (FMM) as the rule-based method: segments with a predefined lexicon • Comparable setting: the FMM lexicon is extracted from the same annotated corpus that the CRF model is trained on
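A minimal implementation of forward maximal matching, assuming a plain Python set as the lexicon (the lexicon contents below are hypothetical, just for the demo):

```python
def fmm_segment(text, lexicon):
    """Forward maximal matching: from each position, take the longest
    lexicon entry that matches; fall back to a single character."""
    max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in lexicon:
                words.append(text[i:i + l])
                i += l
                break
    return words

# Hypothetical lexicon, standing in for one extracted from the corpus:
print(fmm_segment("她来自苏格兰", {"她", "来自", "苏格兰"}))
# ['她', '来自', '苏格兰']
```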

  8. Data Splitting • The effect of data sparseness is examined by splitting the training corpus into portions of increasing size (one plausible way to build such splits is sketched below).
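The slides do not show how the splits were constructed; a sketch under the assumption of nested subsets whose sizes double at each step, matching a log-scale corpus axis on the learning curves:

```python
def nested_subsets(sentences, n_points=10):
    """Nested training subsets whose sizes double at each step, so that
    successive points sit evenly on a log-scale corpus axis."""
    sizes = [max(1, len(sentences) // 2 ** (n_points - 1 - k))
             for k in range(n_points)]
    return [sentences[:s] for s in sizes]

corpus = [f"sent{i}" for i in range(1024)]
print([len(sub) for sub in nested_subsets(corpus, n_points=5)])
# [64, 128, 256, 512, 1024]
```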

  9. Learning Curves: CRFs vs. FMM

  10. CRFs Performance vs. Corpus Scale • Exponential enlargement of the corpus gives a linear performance improvement.
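That observation is equivalent to saying F grows linearly in log s. A quick fitting sketch (the numbers are illustrative placeholders, not the paper’s measurements):

```python
import numpy as np

# "Exponential corpus growth gives linear F-score gain" is equivalent
# to F(s) ~ a * log(s) + b. Numbers below are illustrative placeholders.
sizes = np.array([1e4, 1e5, 1e6, 1e7])         # corpus sizes (characters)
f_scores = np.array([0.80, 0.85, 0.90, 0.95])  # hypothetical F-scores
a, b = np.polyfit(np.log(sizes), f_scores, 1)
print(f"F(s) ~= {a:.4f} * ln(s) + {b:.4f}")
```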

  11. FMM: About the Lexicon • Let L denote the size of the lexicon and s the size of the corpus from which the lexicon is extracted; the slide then gives a formula relating L to s, followed by the F-score achieved by FMM as a function of scale. Both equations are missing from this transcript; a hedged reconstruction follows.
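Since the transcript drops both equations, here is a reconstruction that is not verified against the original paper: lexicon growth is typically modeled with a Heaps’-law power law, and the “negative inverse increase” phrasing in the conclusions suggests an inverse-form F-score fit.

```latex
% Plausible lexicon-growth relation (Heaps'-law form; k, \beta fitted):
L = k \, s^{\beta}, \qquad 0 < \beta < 1
% Plausible FMM F-score form ("negative inverse" growth per the
% conclusions; a is the asymptotic F-score, b a fitted constant):
F(s) = a - \frac{b}{s}
```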

  12. FMM Performance vs. Corpus Scale

  13. FMM Lexicon Size vs. Performance

  14. The OOV Issue • Of special interest in CWS: out-of-vocabulary (OOV) words are those that appear in the test corpus but are absent from the training corpus. • The OOV rate, the proportion of OOV words among all words in the test corpus, heavily affects segmentation performance (a sketch of its computation follows).
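A small sketch of the OOV-rate computation as defined above (toy word lists, for illustration only):

```python
def oov_rate(train_words, test_words):
    """Proportion of test-corpus word tokens never seen in training."""
    vocab = set(train_words)
    return sum(1 for w in test_words if w not in vocab) / len(test_words)

# Toy word lists; "苏格兰" never occurs in training:
train = ["她", "来自", "中国", "他", "来自", "英国"]
test = ["她", "来自", "苏格兰"]
print(f"OOV rate: {oov_rate(train, test):.2%}")  # 33.33%
```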

  15. OOV rate vs. Corpus Scale

  16. Fitting OOV Rate
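The transcript does not preserve the fitted curve. If one assumes a power-law decay of OOV rate with training size (a natural companion to Heaps’ law, but an assumption on my part), the fit could look like this, with illustrative numbers:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_decay(s, c, beta):
    """Assumed form: OOV rate decaying as a power law of training size."""
    return c * s ** (-beta)

sizes = np.array([1e4, 1e5, 1e6, 1e7])    # training sizes (characters)
oov = np.array([0.20, 0.12, 0.07, 0.04])  # illustrative OOV rates
(c, beta), _ = curve_fit(power_decay, sizes, oov, p0=(1.0, 0.3))
print(f"OOV(s) ~= {c:.3f} * s^(-{beta:.3f})")
```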

  17. Conclusions • The bad news: the statistical method demands an exponential increase in annotated corpus size to overcome the sparseness caused by Zipf’s law. • Enlarging the annotated corpus is therefore not a good way to improve the statistical method’s performance. • A small surprise: the rule-based method only requires a negative-inverse increase of corpus (lexicon) scale. • Is the rule-based method, then, more effective than the statistical one? • A lexicon is much cheaper to build than an annotated corpus (text).

  18. Thanks!
