Small-Corpus-Based Automatic Chinese Unknown Word Extraction

National Yunlin University of Science and Technology (國立雲林科技大學) • Automatic Chinese unknown word extraction using small-corpus-based method • Advisor: Dr. Hsu • Graduate: Chien-Shing Chen • Authors: Tao-Hsing Chang, Chia-Hoang Lee • Proceedings of the 2003 International Conference on Natural Language Processing and Knowledge Engineering, IEEE

Outline • N.Y.U.S.T. • I.M. • Motivation • Objective • Introduction • Extracting possible unknown words • SPLR • Modification • Prefixed/suffixed, Compound word selection • Experiment • Conclusion • Opinion

Motivation • Any Chinese character can either be a word by itself or be part of other words • There are no blanks between Chinese words to mark the boundaries • Statistics-based and rule-based methods each have drawbacks • “拍打皮卡丘” (“slap Pikachu”) • “觀光協會” (“tourism association”), “神奇寶貝” (“Pokémon”)

Objective • Extract Chinese unknown words • Efficiency • Accuracy • Words that occur rarely • Small set of documents for training

1-1. Introduction • Unknown words are words that do not exist in the dictionary or vocabulary • Identifying the boundaries: “拍打皮卡丘” (“slap Pikachu”), “資料探勘非常有意思” (“data mining is very interesting”) • Semantic ambiguity: “觀光協會” (“tourism association”), “神奇寶貝” (“Pokémon”)

1-2. Introduction • Restrict the scope to particular types of unknown words • Prefixes/suffixes identify proper names • Hybrid method to estimate the probability • Identifying general unknown words is difficult • “熱鬧非凡” (“extraordinarily lively”), “回味無窮” (“leaving a lasting aftertaste”), “神奇寶貝” (“Pokémon”) • “發生什麼” (“what happened”), “老師問問題” (“the teacher asks a question”)

1-3. Introduction • Statistics-based methods • Small documents cause low accuracy • Develop a method that offers • The efficiency advantage of statistics-based methods • Accurate identification even when the document set is small

2. Previous Works • Proper names that are compound words cannot be identified • “中國國際商業銀行” (“International Commercial Bank of China”) • “中國” (“China”), “國際” (“international”), “商業” (“commercial”), “銀行” (“bank”) • Statistics-based method • Occurrence frequency • PLU-based likelihood ratio (PLR) • Efficient and fast • Words that occur rarely cannot be extracted

3-1. Extracting Possible Unknown Words • Preprocessing • Retrieve possible character sequences • The maximum length of a character sequence is limited • Eliminate stop words from the character sequences • Frequently occurring character sequences are then regarded as possible unknown words.
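
The preprocessing steps on this slide can be sketched as a character n-gram count. This is only an illustration under stated assumptions: the function names, the maximum length of 4, the frequency cut-off of 2, and the small stop-character list are all hypothetical, and "eliminating stop words" is interpreted here as cutting each sentence at stop characters so that no candidate spans one. The slides do not give these details.

```python
from collections import Counter

def split_on_stop_chars(sentence, stop_chars):
    """Cut a sentence into chunks at every stop character (assumed interpretation)."""
    chunks, current = [], []
    for ch in sentence:
        if ch in stop_chars:
            if current:
                chunks.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        chunks.append("".join(current))
    return chunks

def candidate_sequences(sentences, max_len=4, min_freq=2, stop_chars="的了是在"):
    """Count character sequences of length 2..max_len and keep the frequent ones."""
    counts = Counter()
    for sentence in sentences:
        for chunk in split_on_stop_chars(sentence, stop_chars):
            for n in range(2, max_len + 1):
                for start in range(len(chunk) - n + 1):
                    counts[chunk[start:start + n]] += 1
    # Only sequences that reach the (hypothetical) frequency cut-off survive.
    return {seq: c for seq, c in counts.items() if c >= min_freq}

candidates = candidate_sequences(["我去福利社", "他去福利社"])
```

With these two toy sentences, “福利社” and “去福利社” both reach the cut-off, which is why a later filtering step is needed to separate real words from their accidental extensions.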

3-2. Extracting Possible Unknown Words • If a sequence occurs as an extension of its subsequence, the sequence should not be an unknown word • “去福利社” (“go to the school store”) occurs as an extension of “福利社” (“school store”), so “去福利社” is not a possible unknown word

3-3. Extracting Possible Unknown Words • Defined:

3-4. Extracting Possible Unknown Words • “去福利社” occurs 200 times • “福利社” occurs 1000 times • SPLR(tp) = … • Tolerated-error coefficients
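
The SPLR formula itself is not preserved in these slides, so the sketch below does not implement it. As a stand-in it uses the plain frequency ratio f(tp)/f(ts) between a sequence tp and a subsequence ts, which is an assumption suggested by the counts on this slide (200 and 1000). The function names and the 0.5 cut-off are hypothetical, and the tolerated-error coefficients are left out.

```python
def frequency_ratio(freq, tp, ts):
    """Return f(tp) / f(ts): the share of ts occurrences that extend to tp."""
    if ts not in tp:
        raise ValueError("ts must be a subsequence of tp")
    return freq[tp] / freq[ts]

def filter_extensions(freq, threshold=0.5):
    """Drop a sequence when it is rare relative to a one-character-shorter subsequence."""
    kept = {}
    for tp, count in freq.items():
        shorter = [tp[:-1], tp[1:]]  # drop the last or the first character
        ratios = [count / freq[ts] for ts in shorter if len(ts) >= 2 and ts in freq]
        # A sequence with no counted subsequence has nothing to be compared against.
        if all(r >= threshold for r in ratios):
            kept[tp] = count
    return kept

freq = {"去福利社": 200, "福利社": 1000, "福利": 1000}
ratio = frequency_ratio(freq, "去福利社", "福利社")
kept = filter_extensions(freq)
```

Here the ratio for “去福利社” is 200/1000, well under the assumed cut-off, so the sequence is dropped while “福利社” is kept.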

4. Modification • 1. One-character prefix (前綴) or suffix (字尾): the word “導師” (“homeroom teacher”) results in a low SPLR for “導師室” (“homeroom teachers' office”) • 2. Familiar sequences: “從教室裡衝出來” (“rush out of the classroom”) is not an unknown word but would be identified as one by the simple SPLR method

4-1-1. Prefixed/Suffixed Word Revising • Some words that contain prefixes or suffixes have already been collected in available dictionaries. • For example, for an unknown word: • “總領隊” includes a prefix, “ocw + mcw” • “導師室” includes a suffix, “mcw + ocw”

4-1-2. Prefixed/Suffixed Word Revising • The one-character prefixes/suffixes can be extracted in advance from available dictionaries.
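
One way to extract one-character prefixes and suffixes from a dictionary in advance is sketched below. It assumes that “ocw” and “mcw” stand for one-character word and multi-character word, which the slides do not spell out. The function names, the toy lexicon, and the `min_count` cut-off are hypothetical, and how the result is used to revise a candidate's score is not specified in the slides.

```python
from collections import Counter

def extract_affixes(lexicon, min_count=2):
    """Collect characters that repeatedly attach to the front or back of known words."""
    prefix_counts, suffix_counts = Counter(), Counter()
    for word in lexicon:
        if len(word) < 3:
            continue  # need one affix character plus a multi-character word
        if word[1:] in lexicon:
            prefix_counts[word[0]] += 1   # pattern "ocw + mcw"
        if word[:-1] in lexicon:
            suffix_counts[word[-1]] += 1  # pattern "mcw + ocw"
    prefixes = {ch for ch, n in prefix_counts.items() if n >= min_count}
    suffixes = {ch for ch, n in suffix_counts.items() if n >= min_count}
    return prefixes, suffixes

def affix_pattern(candidate, lexicon, prefixes, suffixes):
    """Return the affix pattern of a candidate, or None if it has none."""
    if len(candidate) < 3:
        return None
    if candidate[0] in prefixes and candidate[1:] in lexicon:
        return "ocw + mcw"
    if candidate[-1] in suffixes and candidate[:-1] in lexicon:
        return "mcw + ocw"
    return None

# Toy lexicon standing in for an available dictionary.
lexicon = {"領隊", "經理", "總經理", "教練", "總教練",
           "導師", "辦公", "辦公室", "休息", "休息室"}
prefixes, suffixes = extract_affixes(lexicon)
```

With this toy lexicon, “總” is collected as a prefix and “室” as a suffix, so “總領隊” and “導師室” are recognised as affixed candidates even though neither is in the lexicon.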


4-2-1. Compound Word Selection • A familiar sequence in the document includes one or more common words, while a compound word consists of particular words • “從教室裡衝出來” (“rush out of the classroom”) consists of the common words “教室” (“classroom”) and “出來” (“come out”) • “文具用品” (“stationery supplies”) 100 times • “文具” (“stationery”) 100 times • “用品” (“supplies”) 100 times

4-2-2. Compound Word Selection • ts is a word included in tp that is not a one-character word • A threshold is used • A sequence that consists of common words should not be a possible unknown word

4-2-3. Compound Word Selection • Familiar sequences and compound words can be differentiated efficiently • “神奇寶貝” 200 times • “神奇” 230 times • “寶貝” 250 times • 200/230 • “發生什麼” 200 times • “發生” 2000 times • “什麼” 4000 times • 200/2000
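
The two ratios on this slide (200/230 and 200/2000) suggest comparing the frequency of a sequence with the frequency of each multi-character word it contains. The sketch below follows that reading; it is an assumption, not the exact rule from the paper. The function name, the 0.5 threshold, and the decision to require every component to pass are hypothetical.

```python
def classify_sequence(tp, components, freq, threshold=0.5):
    """Label tp as a compound word or a familiar sequence of common words."""
    # One-character components are ignored, as on slide 4-2-2.
    ratios = [freq[tp] / freq[ts] for ts in components if len(ts) > 1]
    # If any component is far more frequent than tp, that component is a common
    # word appearing in many other contexts, so tp is treated as familiar.
    if ratios and min(ratios) >= threshold:
        return "compound"
    return "familiar"

freq = {
    "神奇寶貝": 200, "神奇": 230, "寶貝": 250,
    "發生什麼": 200, "發生": 2000, "什麼": 4000,
    "文具用品": 100, "文具": 100, "用品": 100,
}
pokemon = classify_sequence("神奇寶貝", ["神奇", "寶貝"], freq)
what_happened = classify_sequence("發生什麼", ["發生", "什麼"], freq)
stationery = classify_sequence("文具用品", ["文具", "用品"], freq)
```

For “神奇寶貝” the smallest ratio is 200/250, so it is kept as a compound word; for “發生什麼” the smallest ratio is 200/4000, so it is treated as a familiar sequence.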

5. Experiments • Data set: 1,285 student essays • Theme: “Recess at School” • Characters: 470,665

5-1. Experiments - SPLR

5-2. Experiments - Familiar

5-3. Experiments - Prefixed/Suffixed • Prefixed or suffixed patterns in the CKIP lexicon (Chinese Knowledge Information Processing Group, Institute of Information Science, Academia Sinica)

6. Conclusion • Efficiency • Accuracy • Words that occur rarely • Small training corpus

Opinion • Information retrieval • Unknown word • Compound word • Semantic web
