Detecting Missing Hyphens in Learner Text
This study presents a logistic regression model for identifying missing hyphens in learner text, helping students improve the quality of their writing. The model is trained on well-edited text and on error-corrected data mined from Wikipedia revisions, and is tested on both artificial and learner-generated errors. In general it outperforms the baselines, especially in terms of precision. The evaluation notes that missing hyphens are a very low-frequency error in learner text, that the errors are dominated by a single lexical item, and that they are not independent events. A precision-only manual evaluation uses judgments from two native speakers.
Presentation Transcript
Detecting Missing Hyphens in Learner Text
Aoife Cahill*, Martin Chodorow**, Susanne Wolff* and Nitin Madnani*
*Educational Testing Service, 660 Rosedale Road, Princeton, NJ 08541, USA {acahill, swolff, nmadnani}@ets.org
**Hunter College and the Graduate Center, City University of New York, NY 10065, USA martin.chodorow@hunter.cuny.edu
Outline • Motivation • Baselines • New Model • Experiments and Results • Conclusion
Motivation • Hyphen errors are infrequent • But they are an important consideration for students aiming to improve the overall quality of their writing
Example: "Dogs are lucky… most of them have built in fur coats! Brrrr!" (From: http://daughternumberthree.blogspot.com)
Motivation • Missing hyphen errors are not all lexical • Schools may have more after school sports. • I went to the dentist after school today. • Language Learner text introduces additional complications • My father like play basketball with me.
Baselines • Baseline 1: Collins Dictionary [5,246] • predicts a missing hyphen between bigrams that appear hyphenated in the dictionary • Baseline 2: Wiki (counts) [1,095] • predicts a missing hyphen between bigrams that occur hyphenated more than 1,000 times in Wikipedia • Baseline 3: Wiki (probs) [673,269] • predicts a missing hyphen between bigrams where the probability of the hyphenated form as estimated from Wikipedia is greater than 0.66
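The three baseline rules can be sketched as simple lookups. This is a minimal illustration, not the authors' code: the toy counts, the function names, and the way the Wikipedia probability is estimated (hyphenated count divided by hyphenated plus unhyphenated count) are all assumptions; only the thresholds (1,000 and 0.66) come from the slide.

```python
from collections import Counter

# Hypothetical toy counts standing in for real Wikipedia statistics.
hyphenated_counts = Counter({
    ("after", "school"): 1500,
    ("built", "in"): 4000,
    ("well", "known"): 900,
})
open_counts = Counter({
    ("after", "school"): 3000,
    ("built", "in"): 1000,
    ("well", "known"): 300,
})

COUNT_THRESHOLD = 1000  # Baseline 2 threshold from the slide
PROB_THRESHOLD = 0.66   # Baseline 3 threshold from the slide


def baseline_dictionary(bigram, dictionary):
    """Baseline 1: flag the bigram if the dictionary lists it hyphenated."""
    return bigram in dictionary


def baseline_wiki_counts(bigram):
    """Baseline 2: flag the bigram if its hyphenated form is frequent enough."""
    return hyphenated_counts[bigram] > COUNT_THRESHOLD


def baseline_wiki_probs(bigram):
    """Baseline 3: flag the bigram if the hyphenated form is likely enough.

    Assumption: probability = hyphenated / (hyphenated + unhyphenated).
    """
    hyphenated = hyphenated_counts[bigram]
    total = hyphenated + open_counts[bigram]
    if total == 0:
        return False
    return hyphenated / total > PROB_THRESHOLD
```

With these toy counts, "after school" passes the count rule but not the probability rule, while "well known" does the opposite, which shows how the two Wikipedia baselines can disagree on the same bigram.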
New Model • Logistic Regression Model • assigns to each pair of adjacent words w_i and w_{i+1} the probability that a hyphen should occur between them
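As a rough sketch of how a logistic regression model turns features of a word pair into a probability: the weighted feature sum is passed through the logistic (sigmoid) function. The feature names and weights below are hypothetical, since the slides do not list the features the authors used.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hyphen_probability(features, weights, bias=0.0):
    """Probability that a hyphen belongs between w_i and w_{i+1}.

    features: dict mapping feature name to value for this word pair
    weights:  dict mapping feature name to learned weight
    """
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return sigmoid(z)


# Hypothetical features and weights, for illustration only.
weights = {"pair_hyphenated_in_dictionary": 2.0, "next_word_is_noun": 1.0}
features = {"pair_hyphenated_in_dictionary": 1.0, "next_word_is_noun": 1.0}
probability = hyphen_probability(features, weights, bias=-1.5)
```

A system built this way would report a missing hyphen when the probability exceeds some chosen threshold; raising that threshold trades recall for precision.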
Data • Training • Well-edited text (San Jose Mercury News) • Error-corrected data mined from Wikipedia Revisions • Combination • Test • Artificial errors: Brown corpus • Learner text: CLC-FCE corpus, TOEFL/GRE essays
Evaluation on Artificial Errors • Brown Corpus: 24,243 sentences; hyphens automatically removed from 2,072 words • Each system predicts, for every bigram, whether a hyphen should appear between the two words • precision: the proportion of missing hyphen errors predicted by the system that were true errors • recall: the proportion of artificially removed hyphens that the system detected as errors • f-score: the harmonic mean of precision and recall
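The artificial-error setup and the three metrics can be sketched as follows. This is an illustrative assumption about the procedure, not the authors' script: hyphenated tokens are split into separate words, the position of each removed hyphen is kept as the gold standard, and a system's predicted positions are scored against it. The function names are hypothetical.

```python
def remove_hyphens(tokens):
    """Split hyphenated tokens into separate words.

    Returns the new token list and the set of gold positions, where
    position i means a hyphen was removed between tokens i and i+1.
    """
    out, gold = [], set()
    for tok in tokens:
        # Keep tokens that consist only of hyphens unchanged.
        parts = [p for p in tok.split("-") if p] or [tok]
        for j, part in enumerate(parts):
            if j > 0:
                gold.add(len(out) - 1)  # index of the word left of the gap
            out.append(part)
    return out, gold


def precision_recall_f(predicted, gold):
    """Score predicted hyphen positions against the removed ones."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


tokens, gold = remove_hyphens("Schools may have more after-school sports .".split())
# A hypothetical system that flags positions 3 and 4 gets one of two right.
precision, recall, f_score = precision_recall_f({3, 4}, gold)
```

Here the single removed hyphen sits at position 4 (between "after" and "school"), so the hypothetical system has precision 0.5 and recall 1.0.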
Evaluation on Learner Text (1) • CLC-FCE corpus: 173 instances of missing hyphen errors
Evaluation on Learner Text (1) • Some observations: • Very low frequency error (173) • Dominated by one lexical item: make-up • Errors are not independent events
Evaluation on Learner Text (2) • Precision-only manual evaluation • Random sample of 100 errors per system detected in 1,000 student essays • 2 native speaker judgements (0.79)
Evaluation on Learner Text (2) • Native Speaker Judgements (Precision)
Conclusions • Logistic Regression Model for predicting missing hyphens in learner text • Trained on: • A corpus of well-edited text • A corpus of automatically mined corrections • In general, the classifiers outperform the baselines, especially in terms of precision
Thanks! Questions? Comments?