
Investigating Linguistic Knowledge In A Maximum Entropy Token-Based Language Model



Presentation Transcript


  1. Investigating Linguistic Knowledge In A Maximum Entropy Token-Based Language Model
Jia Cui, Yi Su, Keith Hall, Frederick Jelinek @clsp.jhu

Example: a sentence in the bigram METLM, shown as two candidate token sequences over the same words:
but_IN stocks_VBZ kept_VBN </s>
<s> but_CC stocks_NNS kept_VBD falling_VBG

Abstract: We propose METLM, a novel maximum entropy token-based language model capable of incorporating various types of linguistic information encoded in the form of a token, a (word, label) tuple. Using tokens as hidden states, the model is effectively a hidden Markov model (HMM) with maximum entropy (ME) transition distributions. We investigated different types of labels with a wide range of linguistic implications. These models outperform Kneser-Ney smoothed n-gram models both in perplexity on standard datasets and in word error rate for a large-vocabulary speech recognition system.

ME Training With Labeled Training Data:
Example features for predicting falling_VBG after but_CC stocks_NNS kept_VBD:
• WW: kept falling
• W-T: falling-VBG
• WT: kept VBG
• TWT: NNS kept VBG
• AA: WT, TW, TT, W-T, T, ...
(Sketches of the ME transition distribution and of these feature templates follow the conclusions below.)

Data Sparseness and Sharing: "Colin plays chess" has no Google results, but it is a possible sentence. Consider the phrases "Colin plays chess", "he plays chess", "Colin takes basketball", "Colin plays".
• Data sharing depends on knowledge:
• Lexical: "Colin plays" and "he plays" share the word plays
• Syntactic: takes and plays are both VERBS
• Semantic: basketball and chess are both SPORTS
• ...
Word/label-based features will not increase data sparseness.

Word Classes/Labels:
• PI-CLS: word classification using the algorithm proposed by Brown et al., 1992
• PD-CLS: position-dependent word classes, classifying words at three positions simultaneously; classes generated by Ahmad Emami
• Proximity-based word classes: word distances computed by Dekang Lin (stock, C1={cost, currency, credit, salary, refund, hourly})
• Dependency-based word classes: word distances computed by Dekang Lin (stock, C2={bond, stock, cash, capacity, decoration})
• Topic-based word classes: distances computed by Yonggang Deng (stock, C3={indexes, exchange, Chicago, crash, broker, unfolded})

Perplexity Experiments:
• Data: Treebank WSJ, 24 sections
• Development: sections 0-19, 41K sentences, 1M words
• Held-out: sections 20-21, 4.3K sentences, 110K words
• Test: sections 22-23, 4.2K sentences, 106K words
• 10K vocabulary baseline

Experiments on the ASR System:
• Fisher data (dialog), 22M training data, 4167 reference sentences
• Lattice re-scoring: 2.7M predictions
• Use dominant POS tags
• Basic features + AA, WTW, WWT, WWTW, WTWW

Conclusions:
• Addressing data sharing in both history and future.
• Enabling training on unlabeled data (tokens are hidden states).
• Effectively applying syntactic word labels to improve WER.
• A platform to integrate different word labels.
• Computationally expensive.
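
The abstract describes the model as an HMM over tokens whose transitions are ME distributions. For reference, the sketch below shows the standard conditional maximum entropy form such a transition distribution would take; the notation (h_i for the token history, f_k for binary feature functions, Lambda for the weight vector) is ours for illustration, not quoted from the paper.

```latex
% A sketch of a conditional maximum entropy transition distribution over
% tokens t_i = (word, label); notation is illustrative, not the authors'.
P_{\Lambda}\!\left(t_i \mid h_i\right)
  = \frac{\exp\!\left(\sum_{k}\lambda_k\, f_k(h_i, t_i)\right)}
         {\sum_{t'}\exp\!\left(\sum_{k}\lambda_k\, f_k(h_i, t')\right)},
\qquad t_i = (w_i,\ \ell_i)
```

In the bigram METLM the history h_i is just the previous token t_{i-1} = (w_{i-1}, l_{i-1}), and each f_k is a binary feature instantiated from a template such as WW, WT, W-T, or TWT.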
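The feature examples on the slide (WW: kept falling, WT: kept VBG, W-T: falling-VBG, TWT: NNS kept VBG) can be reproduced with a small feature extractor. The sketch below is a minimal illustration inferred from those examples; the exact template definitions and the function names are assumptions, not the authors' implementation.

```python
# A minimal sketch of instantiating feature templates (WW, WT, W-T, TWT)
# from a token sequence, where each token is a (word, label) pair.
# Template definitions are inferred from the slide's examples and are
# illustrative only.

TOKENS = [("<s>", "<s>"), ("but", "CC"), ("stocks", "NNS"),
          ("kept", "VBD"), ("falling", "VBG"), ("</s>", "</s>")]

def features(tokens, i):
    """Feature strings for predicting token i from its bigram history."""
    w_prev, t_prev = tokens[i - 1]                         # history token
    _, t_pprev = tokens[i - 2] if i >= 2 else ("<s>", "<s>")
    w, t = tokens[i]                                       # predicted token
    return {
        "WW":  f"{w_prev} {w}",            # history word -> future word
        "WT":  f"{w_prev} {t}",            # history word -> future label
        "W-T": f"{w}-{t}",                 # future word with its own label
        "TWT": f"{t_pprev} {w_prev} {t}",  # earlier label, history word, future label
    }

if __name__ == "__main__":
    # Predicting "falling_VBG" given the history "... stocks_NNS kept_VBD"
    for name, value in features(TOKENS, 4).items():
        print(f"{name}: {value}")
```

Running this prints WW: kept falling, WT: kept VBG, W-T: falling-VBG, and TWT: NNS kept VBG, matching the slide; because such features fire on shared words or labels across sentences, they enable the data sharing discussed above without increasing sparseness.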
