This presentation describes a hybrid language modeling approach that combines traditional n-gram models with dependency grammar to address challenges such as data sparseness and long-distance dependencies. We first discuss the limitations of n-gram models and then propose our hybrid method. The model is evaluated on the Brown Corpus, split into training, development, and test sets, using perplexity and a good-vs-bad sentence classification task. The results show that the combined model does not yet improve over the n-gram baseline, highlighting the need for further feature exploration.
Statistical language modeling combining n-gram and dependency grammar • Eran Chinthaka, Ikhyun Park
Introduction • Statistical language models and n-grams • Problems with n-gram models • Data sparseness • Long-distance dependencies • Proposed solution • A hybrid model combining n-grams and dependency grammar for language modeling
System Architecture • [Process diagram: Training Data, Evaluator, Test Data (good and bad), Optimal Parameters, Perplexity]
Experimental Setup • Data • Brown Corpus • Train: 28,671 sentences • Development: 9,557 sentences • Test: 9,556 sentences • Tools • Smoother and language model builder: CMU-Cambridge Statistical Language Modeling Toolkit v2 (http://www.speech.cs.cmu.edu/SLM/toolkit.html) • Dependency parser: Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml)
Sentence Evaluation • N-gram score • Dependency score • Combined score (combination sketched below)
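The slides do not preserve the formula that combines the two scores. A minimal sketch, assuming the combined score is a linear interpolation of sentence-level log probabilities with a weight tuned on the development set (the names `combined_score`, `ngram_logprob`, `dep_logprob`, and `lam` are hypothetical, not from the original slides):

```python
def combined_score(ngram_logprob, dep_logprob, lam=0.5):
    """Interpolate sentence-level log scores from the two models.

    ngram_logprob: total log probability of the sentence under the n-gram model
    dep_logprob:   total log probability under the dependency model
    lam:           interpolation weight, tuned on the development set
    """
    return lam * ngram_logprob + (1.0 - lam) * dep_logprob
```

In the classification experiment later in the slides, such a score would presumably be thresholded to separate good sentences from bad ones.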
Smoothing – Absolute Discounting • N-gram language model [piecewise discounting/backoff formula; not preserved in this version, see the sketch below]
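The case-by-case formula from this slide was lost in conversion. For reference, a standard backoff form of absolute discounting for a bigram model, with a fixed discount D and a backoff weight α(w_{i-1}) chosen so the distribution renormalizes, can be written as follows; this is a generic textbook form, not necessarily the exact variant on the original slide:

```latex
P_{\text{abs}}(w_i \mid w_{i-1}) =
\begin{cases}
  \dfrac{c(w_{i-1} w_i) - D}{c(w_{i-1})} & \text{if } c(w_{i-1} w_i) > 0,\\[1.5ex]
  \alpha(w_{i-1})\, P_{\text{abs}}(w_i) & \text{otherwise,}
\end{cases}
\qquad 0 < D < 1 .
```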
Smoothing – Absolute Discounting • Dependency language model [two-case formula; not preserved in this version, see the sketch below]
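The two-case formula for the dependency model is likewise missing. Assuming the dependency model conditions each word w on its head word h as produced by the Stanford parser, the same discounting scheme would take a form like the following; again, this is a reconstruction under that assumption, not the authors' exact formula:

```latex
P_{\text{dep}}(w \mid h) =
\begin{cases}
  \dfrac{c(h, w) - D}{c(h)} & \text{if } c(h, w) > 0,\\[1.5ex]
  \alpha(h)\, P(w) & \text{otherwise.}
\end{cases}
```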
Assessment • Perplexity (n-gram only) • Perplexity (combined): inappropriate
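For the n-gram model, perplexity is computed in the usual way from per-word log probabilities; a small sketch (natural log here, though toolkits often report log base 2 or 10):

```python
import math

def perplexity(logprobs):
    """Perplexity of a test set given per-word natural-log probabilities.

    PP = exp(-(1/N) * sum_i log p(w_i | history_i))
    """
    return math.exp(-sum(logprobs) / len(logprobs))
```

The combined score, being a mixture of two separately trained models, does not necessarily define a properly normalized distribution over words, which is presumably why the slide marks combined perplexity as inappropriate.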
Assessment • Classification of sentences (good vs. bad) • Bad sentence generation: shuffle the words of good sentences • E.g.: "The election will be Dec. 4 from 8 a.m. to 8 p.m. ." → "The election will be 8 8 from 4 a.m. to Dec. p.m. ." • Shuffle degree = 7 (number of original bigrams lost); a sketch of the procedure follows below
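A minimal sketch of the bad-sentence generation and shuffle-degree computation described above (function names are mine; the slide defines shuffle degree as the number of bigrams lost):

```python
import random

def make_bad_sentence(tokens, rng=random):
    """Produce a 'bad' sentence by shuffling the tokens of a good one."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled

def shuffle_degree(original, shuffled):
    """Number of bigrams of the original sentence that no longer occur."""
    orig_bigrams = set(zip(original, original[1:]))
    new_bigrams = set(zip(shuffled, shuffled[1:]))
    return len(orig_bigrams - new_bigrams)
```

For the example sentence above, 7 of the original 12 bigrams disappear after shuffling, matching the slide's shuffle degree of 7.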
Results • Distribution of sentence scores [plots of n-gram and dependency score distributions; avg. shuffle degree: 12.357225]
Results • Classification (n-gram vs. n-gram+dependency) [false-accept vs. false-reject curves for the n-gram and n-gram+dep. models; avg. shuffle degree: 12.357225; the combined model is not improved]
Discussion • Why no improvement? • Insufficient feature exploration • Statistical nature of the dependency parser • Any ideas?