html5-img
1 / 7

Classifying Reading Levels with Statistical Language Models

Can we accurately classify English text at the appropriate reading level? This research uses statistical language models and a dataset of English novels from different reading lists to explore this problem. The approach involves building language models for each reading level and classifying new text based on the most likely generating model. Results show that language models can capture information that helps differentiate between reading levels. Additionally, using these models as features in a multinomial logistic regression model improves performance. Future work includes exploring higher-order language models and investigating language model overfitting.

towles
Télécharger la présentation

Classifying Reading Levels with Statistical Language Models

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Classifying Reading Levels with Statistical Language Models Johnson Hsieh Sameer Shariff

  2. Problem • Given a passage of English text, can we classify the text at the appropriate reading level? • Dataset – English novels from various reading lists • 5th-7th grade - 15 books • Tom Sawyer, Black Beauty, etc. • 8th-10th grade - 16 books • A Tale of Two Cities, The Call of the Wild, etc. • 11th-12th grade - 17 books • Pride and Prejudice, The Awakening, etc.

  3. Approach • Build language models for each class • Classify new text based on model that was most likely to generate this text (generative model) • Model 1 • Classify text based purely on these language models with some interesting smoothing techniques • Model 2 • Build a discriminative multinomial logistic regression model that uses these language models as just one of many features

  4. Data Separability • A hard problem

  5. Language Model Results Accuracy = (# of books predicted correctly)/(total # of books) Weighted Accuracy = ((# predicted correctly) + 0.5 * (# off by one))/(total # of books)

  6. Multinomial Logistic Regression Results • Without language model: • With language model:

  7. Conclusions and Future Work • Statistical language models do capture information that can help differentiate between different reading levels, better than traditional measures such as Flesch-Kinkaid • Multinomial logistic regression models with additional features outperform the pure language model approach, though using the language model as a feature greatly improves performance • Future Work • Explore higher order language models • Investigate language model overfitting

More Related