A New Lexicon Mechanism for Chinese Word Segmentation

A New Lexicon Mechanism for Chinese Word Segmentation Advisor : Dr. Hsu Graduate : Kuo-min Wang 2006 PACIS .

Outline • Motivation • Objective • Introduction • A New Lexicon Mechanism • Experiments • Conclusion • Personal Opinions

Motivate • Under the development of global networking through Internet, the amount of articles in Chinese or other oriental languages is increasing rapidly. • As the lack of explicit separator, word segmentation is a precondition for the processing of these character-based languages and thus affecting the whole system in performance.

Objective • This paper propose a new solution for Chinese word segmentation problem based on lexicon named double-character-and-long-world-hash-indexing (DCLWHI). • This method can improve the speed and efficiency of word segmentation without extra memory spending, and gains the same accuracy.

Introduction • The current methods of Chinese word segmentation are divided into two kinds • Lexicon • Easily accomplished, high level arithmetic efficiency • Out of vocabulary problem (OOV) • (new words, names of people, organizations and locations) • Frequency statistic • Has the advantage on OOV problems • But the arithmetic efficiency is much lower than the lexicon based method.

A New Lexicon Mechanism • The Double-Character words hold large proportion in Chinese words. • 70% are double-character word [4] • Make a hash indexing for the first two characters of the lexicon words, then add the remaining string into a special long word table, which has a hash indexing.

A New Lexicon Mechanism • First-Double-Character-Hash-Indexing • Flag Bit(2Bytes) • If the two-character is a prefix of a word which length is N, the big N-1 of the 2 bytes will be set 1; • Exaple “圖籍” , which is a double-character word, but can’t be the prefix of other words, So the Flag Big of 圖籍 is set 0000000000000010(0x0002) • 電老 is not a Chinese word, but it can be a prefix of a word 電老虎. So the Flag Bit is 0000000000000100(0x0004) • Similar examples : 春夏(ox000A)，君子(x0006)、敢作(x0008) • Long Word Hash Indexing • Similar to the First-Double-Character-Hash-Indexing. 2-character 0000 0000 0000 0010 3-character 0000 0000 0000 0100 4-character 0000 0000 0000 1000

A New Lexicon Mechanism • Example of Search－＞君子當圖籍是電老虎 • Pick up first two characters “君子”, Flag Big is x0006 can be a 2-character or a prefix of a Treble-Character word. • Then shift to the character “當”, compute the hash value of the substring “君子當”, search in the long word • Find the marching index, confirm the string , marching succeed. • Shift to Character “圖籍” (0x002) • Shift to Character “是電” • There is no value in hash-indexing, 2 situations may happen • First, there is no value in hash-indexing, return one character “是” • Second, there is a substring in the index, but value unequally; return one character “是” • Shift to Character電老”(0x004) • Shift to Character “虎” 君子當圖籍是電老虎君子當圖籍是電老虎君子當圖籍是電老虎君子當圖籍是電老虎君子當圖籍是電老虎君子當圖籍是電老虎

Experiments • Comparison of Searching Cycles • Comparison of Memory Space Cost • Comparison of Speed

Background • Binary-Seek-by-Word • Composed of three parts • Lexicon text, word-index-table, first-character-index-table

Background (cont.) • TRIE indexing tree • is a multi-chain-table tree, the mechanism is composed of two parts: • First-character-index table and TRIE index-tree node • Didn’t need to predict the length of the word , only need to match the word by chain-tree

Background (cont.) • Binary-Seek-by-Characters • Absorbs the search-advantage in TRIE indexing tree, using searching by characters not searching by words

Background (cont.) • Summary above methods’ drawbacks • Binary-seek-by-word is using full-words marching, the efficiency is evidently low. • The design and maintenance of the TRIE tree is very complex, wastes mass memory space • Binary-seek-by characters • Improves some aspects, but it doesn’t change the data structure of the binary-seek-by-word which restrict the efficiency.

Some novel schemes • Double-Character-Hash- indexing[4] • An new searching tree improved from the TRIE indexing tree. • Composed of two parts: Hashing index, remaining strings. • Can avoids the deep searching , increases the segment speed without complex increasing.

Some novel schemes (cont.) • A new lexicon mechanism based on PATRICIA[3] • Use of the ISN (internal statement number) of the words as the key words bit-string, • Constructs the PATRICIA tree by comparing the big-string. • Advantage • The searching process only need some cycles of bit comparison and some cycles of string comparison. • Double-Array Trie[1] • Even node in the tree stands for a status of an auto-machine, • Which changes according to the difference of the variable. • This new structure actually is an improved scheme of the TRIE tree, using 2 linear arrays to express the TRIE tree

Some novel schemes (cont.) 用負的base值表示該位置為詞語。如果狀態i對應某一個詞，而且Base[i]=0，那麼令Base[i]=（-1）*i，如果Base[i]的值不是0，那麼令Base[i]=（-1）*Base[i]。得到雙陣列如下：例如設“阿根”的下標為i=8，那麼check[i]的內容是“阿”的下標，而base[i]是“阿根廷”的下標的基值。 “廷”的序列碼為x=8，那麼“阿根廷”的下標為base[i]+x=base[8]+8=12。

Some novel schemes (cont.) • Double-code scheme [1] • Basic idea is mapping the 6768 Chinese characters in GB-2312 into the sequence-code from 1 to 6768. • Every string written in Chinese can only maps to a number string, • Composed of two steps: • Switch from number-sequence into even-coding • Establish indexing mechanism

Analysis of the novel schemes • Double-Character-Hash-Indexing • Improvement of TRIE index tree, while it is easier structured and maintained than the former mechanism. • PATRICIA • Is a super arithmetic in segment speed, but it waste on the memory space and reduce the efficiency. • Double-Array Trie • When decrease or increase the lexicon, the whole double-array should be adjusted. • Double-code scheme • The extract rate of the arithmetic is not good enough, which result in a very big array, restrict the performance of the search efficiency

大白 0x00E 大白日夢大白日大白大白日大白日夢 Experiment Detail

Experiment Detail

Conclusions • Our mechanism DCLWHI farther improves the speed and efficiency of segmentation. • The scheme A has a very high process speed but costs too much memory space, while scheme B costs less storage with a high efficiency. We think it a good eclectic mechanism for Chinese word segmentation.

Opinions • Experiments are not enough to evidence this method is very well. • …..

A New Lexicon Mechanism for Chinese Word Segmentation

A New Lexicon Mechanism for Chinese Word Segmentation

Presentation Transcript

Simple Features for Chinese Word Sense Disambiguation

Translating the English legal lexicon into Chinese

A Chemehuevi Lexicon

The Second International Chinese Word Segmentation Bakeoff

A new molecular mechanism for HIV-1 latency

Optimizing Chinese Word Segmentation for MT performance

Make a new word

Rethinking Chinese Word Segmentation:

Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation

Position Graph: A New Mechanism for Concurrency Control

A computational Lexicon for Contemporary Hebrew

Chinese Word Segmentation and Statistical Machine Translation

A New Mechanism for Core-Collapse Supernova Explosions

A new mechanism for heating the solar corona

Chinese Word Segmentation Adaptation for Statistical Machine Translation

Crowdfunding : a new mechanism for donation and investment

Word Segmentation Models: Overview

A Chemehuevi Lexicon

A New Lexicon Mechanism for Chinese Word Segmentation

Chinese Word Segmentation and Statistical Machine Translation