Detecting Missing Hyphens in Learner Text
This study addresses the detection of missing hyphen errors in learner text. Such errors are infrequent, but they matter to students who want to improve the overall quality of their writing. We present a logistic regression model trained on well-edited text (San Jose Mercury News) and on corrections mined automatically from Wikipedia revisions. The model is evaluated on artificial errors created in the Brown corpus and on learner text (the CLC-FCE corpus and TOEFL/GRE essays). In general, the classifiers outperform dictionary-based and Wikipedia-based baselines, especially in terms of precision.
Presentation Transcript
Detecting Missing Hyphens in Learner Text Aoife Cahill*, Martin Chodorow**, Susanne Wolff* and Nitin Madnani* *Educational Testing Service, 660 Rosedale Road, Princeton, NJ 08541, USA {acahill, swolff, nmadnani}@ets.org **Hunter College and the Graduate Center, City University of New York, NY 10065, USA martin.chodorow@hunter.cuny.edu
Outline • Motivation • Baselines • New Model • Experiments and Results • Conclusion
Motivation • Hyphen errors are infrequent • But they are an important consideration for students aiming to improve the overall quality of their writing • Example of a missing hyphen ("built in" should be "built-in"): "Dogs are lucky… most of them have built in fur coats! Brrrr!" From: http://daughternumberthree.blogspot.com
Motivation • Missing hyphen errors are not all lexical; the same word pair may or may not need a hyphen depending on context • Schools may have more after school sports. (hyphen needed: after-school) • I went to the dentist after school today. (no hyphen needed) • Language learner text introduces additional complications • My father like play basketball with me.
Baselines • Baseline 1: Collins Dictionary [5,246] • predicts a missing hyphen between bigrams that appear hyphenated in the dictionary • Baseline 2: Wiki (counts) [1,095] • predicts a missing hyphen between bigrams that occur hyphenated more than 1,000 times in Wikipedia • Baseline 3: Wiki (probs) [673,269] • predicts a missing hyphen between bigrams where the probability of the hyphenated form as estimated from Wikipedia is greater than 0.66
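The three baselines above can be sketched as simple lookups. This is a minimal illustration, not the authors' implementation: the function names, the toy counts, and in particular the way the Wikipedia probability is estimated (hyphenated count divided by hyphenated plus open count) are assumptions, since the slides do not say how it was computed.

```python
# Hypothetical sketch of the three baselines; all names and data are illustrative.

def dictionary_baseline(bigram, hyphenated_entries):
    """Baseline 1: predict a missing hyphen if the bigram is listed hyphenated."""
    return "-".join(bigram) in hyphenated_entries

def count_baseline(bigram, hyphenated_counts, min_count=1000):
    """Baseline 2: predict a missing hyphen if the hyphenated form
    occurs more than min_count times in the reference corpus."""
    return hyphenated_counts.get("-".join(bigram), 0) > min_count

def prob_baseline(bigram, hyphenated_counts, open_counts, threshold=0.66):
    """Baseline 3: predict a missing hyphen if the estimated probability of
    the hyphenated form exceeds the threshold.
    Assumption: P(hyphen) = hyphenated / (hyphenated + open)."""
    hyphenated = hyphenated_counts.get("-".join(bigram), 0)
    unhyphenated = open_counts.get(bigram, 0)
    total = hyphenated + unhyphenated
    if total == 0:
        return False
    return hyphenated / total > threshold

# Toy data (invented for illustration only).
entries = {"built-in", "make-up"}
hyph_counts = {"built-in": 5000, "after-school": 400}
open_counts = {("built", "in"): 1000, ("after", "school"): 1600}

flag_dict = dictionary_baseline(("built", "in"), entries)
flag_count = count_baseline(("after", "school"), hyph_counts)
flag_prob = prob_baseline(("built", "in"), hyph_counts, open_counts)
```

None of these lookups sees the surrounding sentence, so a baseline that flags "after school" would flag it in both of the Motivation examples.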
New Model • Logistic Regression Model • estimates the probability of a hyphen occurring between words w_i and w_{i+1}
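As a rough sketch of how a logistic regression model turns features of a word pair into a hyphen probability: the feature names and weights below are hypothetical, since the slides do not list the features or the learned parameters.

```python
import math

def hyphen_probability(features, weights, bias=0.0):
    """Logistic regression score for one bigram position.
    features: dict mapping feature name -> value (hypothetical features)
    weights:  dict mapping feature name -> learned weight (hypothetical)
    Returns the sigmoid of bias + sum_j weight_j * value_j."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented weights, for illustration only.
weights = {"bigram=after_school": 0.4, "next_tag=NOUN": 1.5}

# "after school sports": the pair is followed by a noun.
p_modifier = hyphen_probability(
    {"bigram=after_school": 1.0, "next_tag=NOUN": 1.0}, weights, bias=-1.0)

# "after school today": the noun-context feature does not fire.
p_adverbial = hyphen_probability(
    {"bigram=after_school": 1.0}, weights, bias=-1.0)
```

With contextual features of this kind, the same word pair can receive different probabilities in different sentences, which a purely lexical lookup cannot do.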
Data • Training • Well-edited text (San Jose Mercury News) • Error-corrected data mined from Wikipedia Revisions • Combination • Test • Artificial errors: Brown corpus • Learner text: CLC-FCE corpus, TOEFL/GRE essays
Evaluation on Artificial Errors • Brown Corpus: 24,243 sentences, automatically remove hyphens from 2,072 words • Each system makes a prediction for all bigrams about whether a hyphen should appear between the pair of words • precision: how many of the missing hyphen errors predicted by the system were true errors • recall: how many of the artificially removed hyphens the system detected as errors • f-score: the harmonic mean of precision and recall
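The artificial-error setup and the three metrics can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are whitespace-separated, every hyphen inside a token is removed, and the helper names are hypothetical. The slides do not describe how the 2,072 words were chosen.

```python
def remove_hyphens(tokens):
    """Split hyphenated tokens into separate words and record, for each
    removed hyphen, the index of the word to its left in the output."""
    output, gold = [], set()
    for token in tokens:
        parts = [part for part in token.split("-") if part]
        if len(parts) < 2:
            output.append(token)  # no internal hyphen to remove
            continue
        start = len(output)
        output.extend(parts)
        for offset in range(len(parts) - 1):
            gold.add(start + offset)
    return output, gold

def precision_recall_f(predicted, gold):
    """predicted, gold: sets of positions where a hyphen is missing."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

words, gold = remove_hyphens(["a", "well-known", "built-in", "coat"])
# A hypothetical system flags positions 1 (correct) and 4 (false alarm).
precision, recall, f_score = precision_recall_f({1, 4}, gold)
```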
Evaluation on Learner Text (1) • CLC-FCE corpus: 173 instances of missing hyphen errors
Evaluation on Learner Text (1) • Some observations: • Very low frequency error (173) • Dominated by one lexical item: make-up • Errors are not independent events
Evaluation on Learner Text (2) • Precision-only manual evaluation • Random sample of 100 errors per system detected in 1,000 student essays • 2 native speaker judgements (0.79)
Evaluation on Learner Text (2) • Native Speaker Judgements (Precision)
Conclusions • Logistic Regression Model for predicting missing hyphens in learner text • Trained on: • A corpus of well-edited text • A corpus of automatically mined corrections • In general, the classifiers outperform the baselines, especially in terms of precision Thanks! Questions? Comments? http://blog.ezinearticles.com