Detecting Missing Hyphens in Learner Text

Aoife Cahill*, Martin Chodorow**, Susanne Wolff* and Nitin Madnani*
*Educational Testing Service, 660 Rosedale Road, Princeton, NJ 08541, USA {acahill, swolff, nmadnani}@ets.org
**Hunter College and the Graduate Center, City University of New York, NY 10065, USA martin.chodorow@hunter.cuny.edu

Outline
• Motivation
• Baselines
• New Model
• Experiments and Results
• Conclusion

Motivation
• Hyphen errors are infrequent
• But they are an important consideration for students aiming to improve the overall quality of their writing
• Example: "Dogs are lucky… most of them have built in fur coats! Brrrr!" (From: http://daughternumberthree.blogspot.com)

Motivation
• Missing hyphen errors are not all lexical
  • Schools may have more after school sports.
  • I went to the dentist after school today.
• Language Learner text introduces additional complications
  • My father like play basketball with me.

Baselines
• Baseline 1: Collins Dictionary [5,246]
  • predicts a missing hyphen between bigrams that appear hyphenated in the dictionary
• Baseline 2: Wiki (counts) [1,095]
  • predicts a missing hyphen between bigrams that occur hyphenated more than 1,000 times in Wikipedia
• Baseline 3: Wiki (probs) [673,269]
  • predicts a missing hyphen between bigrams where the probability of the hyphenated form as estimated from Wikipedia is greater than 0.66
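The three baselines can be sketched as simple lookups. This is a minimal illustration, not the authors' implementation: the function names, the toy counts, and in particular the way the Wikipedia probability is estimated (relative frequency of the hyphenated form versus the open form) are assumptions made for the example.

```python
from collections import Counter

def dictionary_baseline(w1, w2, hyphenated_entries):
    """Baseline 1 (sketch): flag the bigram if its hyphenated form is a dictionary entry."""
    return f"{w1}-{w2}".lower() in hyphenated_entries

def count_baseline(w1, w2, hyphen_counts, min_count=1000):
    """Baseline 2 (sketch): flag the bigram if the hyphenated form occurs
    more than min_count times in the reference corpus."""
    return hyphen_counts[f"{w1}-{w2}".lower()] > min_count

def prob_baseline(w1, w2, hyphen_counts, open_counts, threshold=0.66):
    """Baseline 3 (sketch): flag the bigram if the estimated probability of the
    hyphenated form exceeds threshold. ASSUMPTION: the probability is the
    relative frequency hyphenated / (hyphenated + open)."""
    hyphenated = hyphen_counts[f"{w1}-{w2}".lower()]
    open_form = open_counts[f"{w1} {w2}".lower()]
    total = hyphenated + open_form
    if total == 0:
        return False  # unseen bigram: no evidence, so no prediction
    return hyphenated / total > threshold

# Toy counts, invented purely for illustration.
toy_dictionary = {"well-known", "built-in"}
toy_hyphen_counts = Counter({"well-known": 5000, "after-school": 300})
toy_open_counts = Counter({"well known": 1000, "after school": 100})
```

With the toy counts above, `count_baseline("after", "school", toy_hyphen_counts)` does not fire because 300 is below the cut-off, while `prob_baseline("after", "school", toy_hyphen_counts, toy_open_counts)` does, since 300 of 400 occurrences are hyphenated. None of the three rules looks at context, which is the limitation the "after school" examples in the Motivation slide point to.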

New Model
• Logistic Regression Model
  • estimates the probability of a hyphen occurring between w_i and w_{i+1}
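As a rough sketch of how a logistic regression model turns features of a word pair into a hyphen probability: the slides do not list the features or learned weights, so the feature names and weight values below are hypothetical and chosen only to illustrate the mechanics.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hyphen_probability(features, weights, bias=0.0):
    """Probability of a hyphen between w_i and w_{i+1}, given a dict of
    feature values and a dict of weights. Features with no weight count as 0."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)

# HYPOTHETICAL weights and features, not taken from the presentation.
toy_weights = {
    "bigram=after_school": 1.2,
    "next_pos=NOUN": 1.5,   # pair precedes a noun ("after school sports")
    "next_pos=ADV": -2.0,   # pair precedes an adverb ("after school today")
}

p_before_noun = hyphen_probability(
    {"bigram=after_school": 1.0, "next_pos=NOUN": 1.0}, toy_weights, bias=-1.0
)
p_before_adverb = hyphen_probability(
    {"bigram=after_school": 1.0, "next_pos=ADV": 1.0}, toy_weights, bias=-1.0
)
```

Under these made-up weights the same bigram scores above 0.5 before a noun and below 0.5 before an adverb, which is the kind of context sensitivity a purely lexical baseline cannot express.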

Data
• Training
  • Well-edited text (San Jose Mercury News)
  • Error-corrected data mined from Wikipedia Revisions
  • Combination
• Test
  • Artificial errors: Brown corpus
  • Learner text: CLC-FCE corpus, TOEFL/GRE essays

Evaluation on Artificial Errors
• Brown Corpus: 24,243 sentences; hyphens automatically removed from 2,072 words
• Each system predicts, for every bigram, whether a hyphen should appear between the pair of words
• precision: how many of the missing hyphen errors predicted by the system were true errors
• recall: how many of the artificially removed hyphens the system detected as errors
• f-score: the harmonic mean of precision and recall
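The evaluation setup can be sketched as follows: strip hyphens from tokenized text while recording where they were, then score a system's predicted positions against those gold positions. The function names and the position convention (index i means "a hyphen belongs between token i and token i+1") are assumptions made for this illustration.

```python
def strip_hyphens(tokens):
    """Split hyphenated tokens into their parts and return (new_tokens, gold),
    where gold holds each index i such that a hyphen was removed between
    new_tokens[i] and new_tokens[i + 1]."""
    out, gold = [], set()
    for tok in tokens:
        parts = tok.split("-")
        # Only split real compounds; leave bare dashes such as "-" or "--" alone.
        if len(parts) > 1 and all(parts):
            for j, part in enumerate(parts):
                if j > 0:
                    gold.add(len(out) - 1)
                out.append(part)
        else:
            out.append(tok)
    return out, gold

def precision_recall_f(predicted, gold):
    """Precision, recall and their harmonic mean over sets of positions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

tokens, gold = strip_hyphens(["more", "after-school", "sports"])
# Suppose a system flags positions 1 and 2; only position 1 is a true error.
precision, recall, f_score = precision_recall_f({1, 2}, gold)
```

In this toy case one of the two predictions is correct and the single removed hyphen is found, so precision is 0.5, recall is 1.0, and the f-score is 2/3.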

Artificial Error Results

Evaluation on Learner Text (1)
• CLC-FCE corpus: 173 instances of missing hyphen errors

Evaluation on Learner Text (1)
• Some observations:
  • Very low frequency error (173)
  • Dominated by one lexical item: make-up
  • Errors are not independent events

Evaluation on Learner Text (2)
• Precision-only manual evaluation
• Random sample of 100 errors per system detected in 1,000 student essays
• 2 native speaker judgements (0.79)

Evaluation on Learner Text (2)
• Native Speaker Judgements (Precision)

Conclusions
• Logistic Regression Model for predicting missing hyphens in learner text
• Trained on:
  • A corpus of well-edited text
  • A corpus of automatically mined corrections
• In general, the classifiers outperform the baselines, especially in terms of precision

Thanks! Questions? Comments?
http://blog.ezinearticles.com

Brown Corpus: Precision/Recall

CLC-FCE: Precision/Recall
