Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation, for ROCLING 2011. Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu. Institute of Information Science, Academia Sinica
Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation, for ROCLING 2011 • Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu • Institute of Information Science, Academia Sinica • Department of Computer Science & Engineering, Yuan Ze University
Introduction • Term Contributed Boundary (TCB) feature using Conditional Random Fields, introduced in 2010 • A unified view of several unsupervised feature selection methods based on frequent strings
Toolkit • SRILM • YASA
SRILM • C++ libraries • The toolkit provides N-gram statistics for language modeling
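To make "N-gram statistics" concrete, here is a minimal sketch of character N-gram counting. It is a plain Python stand-in and does not call SRILM; the function name `char_ngram_counts` and the two toy sentences are assumptions made for illustration only.

```python
from collections import Counter


def char_ngram_counts(sentences, max_n=3):
    """Count every character N-gram of length 1..max_n in the sentences.

    Hypothetical helper for illustration; SRILM computes comparable
    counts with its own tools, which are not used here.
    """
    counts = Counter()
    for sentence in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n over the sentence.
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


# Toy unlabeled corpus (assumed data, not from the experiments).
counts = char_ngram_counts(["欲速則不達", "欲速不達"], max_n=2)
print(counts["欲速"], counts["則不"])
```

Overlapping N-grams are all counted, which is the property the CNG feature relies on.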
YASA • Automatically extracts frequent strings from an unlabeled corpus
Extended Label • [0-9]+(B1|B2|B3|M|E|S)
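A minimal sketch of how such extended labels could be produced, assuming the digits are a score and the suffix is the usual 6-tag position label (B1, B2, B3, M, E, S). The helper names `six_tags` and `extended_labels` are hypothetical, and the exact label layout used in the experiments may differ.

```python
def six_tags(length):
    """Return 6-tag position labels for a string of the given length."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if length == 1:
        return ["S"]
    # The first three non-final characters get B1, B2, B3; the rest get M.
    head = ["B1", "B2", "B3"][:length - 1]
    middle = ["M"] * (length - 1 - len(head))
    return head + middle + ["E"]


def extended_labels(string, score):
    """Prefix each position tag with the (assumed) numeric score."""
    return [f"{score}{tag}" for tag in six_tags(len(string))]


labels = extended_labels("塑膠原料的", 3)
print(labels)
```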
Score • N-gram score • Frequent string score • Accessor variety score
Score • Converted from term frequency and N-gram frequency • Logarithm ranking mechanism
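The slide does not spell out the logarithm ranking mechanism. One common form, assumed here, maps a count v to the rank r with 2^r <= v < 2^(r+1), so that raw frequencies collapse into a small set of discrete scores. The function name `log_rank` is hypothetical.

```python
def log_rank(value):
    """Return r such that 2**r <= value < 2**(r + 1).

    Assumed base-2 ranking; uses integer bit length to avoid
    floating-point rounding near powers of two.
    """
    if value < 1:
        raise ValueError("value must be a positive integer")
    return int(value).bit_length() - 1


ranks = [log_rank(v) for v in (1, 2, 3, 4, 1000)]
print(ranks)
```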
Score • Considers the score of the outer pattern • Equation of AV
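The equation itself did not survive on this slide. Accessor variety is usually defined as AV(s) = min(L_av(s), R_av(s)), where L_av and R_av are the numbers of distinct characters seen immediately before and after s. The sketch below assumes that definition and, as a simplification, treats every sentence start as one marker `<s>` and every sentence end as one marker `</s>`; the function name and toy corpus are hypothetical.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Compute AV(s) = min(distinct left contexts, distinct right contexts)."""
    left = defaultdict(set)
    right = defaultdict(set)
    for sentence in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(sentence) - n + 1):
                sub = sentence[i:i + n]
                # Sentence boundaries count as a single shared context each
                # (an assumption of this sketch).
                left[sub].add(sentence[i - 1] if i > 0 else "<s>")
                end = i + n
                right[sub].add(sentence[end] if end < len(sentence) else "</s>")
    return {sub: min(len(left[sub]), len(right[sub])) for sub in left}


av = accessor_variety(["我愛台灣", "他愛台灣人", "台灣很美"])
print(av["台灣"], av["愛"])
```

A substring that appears in many different contexts on both sides gets a high AV and is more likely to be a word.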
Score • Scores are also used for filtering overlapping patterns
Non-overlapping • “塑膠原料的” (score 3) conflicts with “的生產” (score 1) • “的生產” is labeled as unseen
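The conflict above can be sketched as a greedy filter: candidate strings are placed in descending score order, and a candidate that overlaps an already placed one is dropped, leaving its uncovered characters marked as unseen. This is only one plausible reading of the slide; the function name, the `"U"` marker for unseen, and the tie-breaking rule are assumptions.

```python
def non_overlapping_labels(sentence, scored_strings, unseen="U"):
    """Label each character with the score of the winning string, else unseen."""
    matches = []
    for string, score in scored_strings.items():
        start = sentence.find(string)
        while start != -1:
            matches.append((score, start, start + len(string)))
            start = sentence.find(string, start + 1)

    labels = [unseen] * len(sentence)
    # Higher score first; ties go to the longer, then the earlier match.
    for score, start, end in sorted(
        matches, key=lambda m: (-m[0], -(m[2] - m[1]), m[1])
    ):
        if all(label == unseen for label in labels[start:end]):
            labels[start:end] = [str(score)] * (end - start)
    return labels


labels = non_overlapping_labels("塑膠原料的生產", {"塑膠原料的": 3, "的生產": 1})
print(labels)
```

Here “的生產” loses because it shares “的” with the higher-scoring string, so “生產” stays unlabeled.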
Character-based N-Gram (CNG) • Character-based N-grams extracted by SRILM • Keeps overlapping information
Term Contributed Boundary (TCB) • Uses frequent strings from YASA • Strings are selected by the forward maximum matching algorithm
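A minimal sketch of forward maximum matching, assuming the frequent strings are available as a plain set. At each position the longest matching string wins, and a single character is emitted when nothing matches. The lexicon below is toy data, not YASA output.

```python
def forward_maximum_matching(sentence, lexicon, max_len=None):
    """Segment left to right, always taking the longest lexicon match."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    segments = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until something matches.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + n]
            if n == 1 or candidate in lexicon:
                segments.append(candidate)
                i += n
                break
    return segments


# Toy frequent-string list (assumed for illustration).
lexicon = {"塑膠", "塑膠原料", "原料", "生產"}
segments = forward_maximum_matching("塑膠原料的生產", lexicon)
print(segments)
```

Because each position is consumed by exactly one match, the result carries no overlapping information, which is the contrast with TCF.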
Term Contributed Frequency (TCF) • Uses frequent strings from YASA • Keeps overlapping information • Converts the score from the frequent string
Accessor Variety-based String (AVS) • Uses SRILM to generate N-grams • Measures how likely a substring is to be a Chinese word • Uses the logarithm ranking mechanism
AVS+TCB and AVS+TCF • Combine AVS with TCB or TCF
Conditional Random Fields • Undirected graphical models trained to maximize the conditional probability of the label sequence Y given the observation sequence X • Feature instances are generated from a template file
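For reference, the conditional probability being maximized can be written in the standard linear-chain form. The symbols below (feature functions f_k, weights λ_k, sequence length T, normalizer Z) follow the usual textbook notation and are not taken from the slides.

```latex
P(Y \mid X) = \frac{1}{Z(X)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \Big)
```

Training chooses the weights λ_k that maximize this probability over the labeled training sequences.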
Conditional Random Fields • Feature template (feature: function) • C-1, C0, C1: previous, current, or next token • C-1C0: previous and current tokens • C0C1: current and next tokens • C-1C1: previous and next tokens
Conditional Random Fields • Feature template applied to the example sentence 欲速則不達 ("more haste, less speed")
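The template can be sketched as a mapping from feature names to token offsets, expanded at each position of the example sentence. The `"_"` padding symbol for positions outside the sentence and the `"/"` separator are assumptions of this sketch, not necessarily what the CRF toolkit emits.

```python
# Feature name -> offsets relative to the current token (from the slide).
TEMPLATE = {
    "C-1": (-1,),
    "C0": (0,),
    "C1": (1,),
    "C-1C0": (-1, 0),
    "C0C1": (0, 1),
    "C-1C1": (-1, 1),
}


def expand_template(tokens, position, pad="_"):
    """Instantiate every template feature at one token position."""
    def token_at(offset):
        i = position + offset
        return tokens[i] if 0 <= i < len(tokens) else pad

    return {
        name: "/".join(token_at(offset) for offset in offsets)
        for name, offsets in TEMPLATE.items()
    }


tokens = list("欲速則不達")
features = expand_template(tokens, 2)  # current token is 則
print(features)
```

Running the expansion at every position yields the feature instances that the CRF is trained on.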
Experiment • Data set • Academia Sinica (AS) • City University of Hong Kong (CityU) • Microsoft Research (MSR) • Peking University (PKU)
Conclusion • Feature collections that contain AVS obtain better F1 • TCB/TCF improves the out-of-vocabulary recall of the 6-tag approach • Overlapping labels keep useful information only when the features are of high quality