The Montclair Electronic Language Learner Database (MELD) is a pioneering corpus of non-native speaker (NNS) writing, begun in the early 1990s. Focused on ESL student essays, MELD supports language acquisition research and second language pedagogy by annotating text for grammatical errors and making the data publicly available. The database contains over 44,000 words of annotated text and addresses critical gaps in NNS corpus creation. Future development includes tools for error analysis and instructional materials, supporting a better understanding of ESL writing characteristics.
The Montclair Electronic Language Learner Database (MELD)
www.chss.montclair.edu/linguistics/MELD/
Eileen Fitzpatrick & Steve Seegmiller, Montclair State University
Non-native speaker (NNS) corpora
• Begun in the early 1990s
• Data
  • written performance only
  • essays of students of English as a foreign language
• Corpus development (academic)
  • in Europe: Louvain, Lodz, Uppsala
  • in Asia: Tokyo Gakugei University, Hong Kong University of Science and Technology
• Annotation
  • Lodz: part of speech
  • HKUST, Lodz: error tags
Gaps in NNS Corpus Creation
• No NNS corpus in America, so no corpus of English as a Second Language (ESL)
• No NNS corpus is publicly available
• No NNS corpus annotates errors without a predetermined list of error types
MELD Goals
• Initial goals
  • collect ESL student writing
  • tag the writing for errors
  • provide publicly available NNS data
• These goals support
  • second language pedagogy
  • language acquisition research
  • tool building (grammar checkers, student editing aids, parallel texts from NS and NNS writers)
MELD Overview
• Data
  • 44,477 words of annotated text
  • 53,826 more words of raw data
  • language and education background data for each student author
  • upper-level ESL students
• Tools written to
  • link essays to student background data
  • produce an error-free version from the tagged text
  • allow fast entry of background data
Annotation
• Annotators "reconstruct" a grammatical form using {error/reconstruction} tags (see the sketch below):
  • school systems {is/are}
  • since children {0/are} usually inspired
  • becoming {a/0} good citizens
• Agreement between annotators is an issue
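To make the tag format concrete, here is a minimal Python sketch of how an {error/reconstruction} annotation could be split into the learner's original text and the error-free version that MELD's tools produce. The function name and regular expression are illustrative assumptions, not MELD's actual implementation.

```python
import re

# Hypothetical helper (not part of MELD's released tools): splits a
# MELD-annotated string into the learner's original text and the
# annotators' reconstruction. "0" marks a null form on either side
# of the slash, as in {0/are} and {a/0}.
TAG = re.compile(r"\{([^/{}]*)/([^/{}]*)\}")

def split_versions(annotated):
    original = TAG.sub(lambda m: m.group(1) if m.group(1) != "0" else "", annotated)
    reconstructed = TAG.sub(lambda m: m.group(2) if m.group(2) != "0" else "", annotated)
    # Collapse the double spaces left behind where a null form was removed.
    clean = lambda s: re.sub(r"\s+", " ", s).strip()
    return clean(original), clean(reconstructed)

print(split_versions("children {0/are} usually inspired"))
# -> ('children usually inspired', 'children are usually inspired')
print(split_versions("becoming {a/0} good citizens"))
# -> ('becoming a good citizens', 'becoming good citizens')
```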
Error Classification from a Predetermined List
• Benefit
  • annotators agree on what counts as an error: only the items in the classification scheme
• Problems
  • annotators have to learn a classification scheme
  • the existence of a classification scheme means that annotators can misclassify
  • errors not in the scheme will be missed
Error Identification & Reconstruction
• Benefits
  • speed in annotating, since there is no classification scheme to learn
  • no chance of misclassifying
  • less common errors will be captured
  • a reconstructed text can be more easily parsed and tagged for part of speech
• Question
  • How well can we agree on what is an error?
Agreement Measures
• Reliability: What percentage of the errors do both taggers tag?
  Reliability = |T1 ∩ T2| / ((|T1| + |T2|) / 2)
• Precision: What percentage of the non-expert's (T2) tags are accurate?
  Precision = |T1 ∩ T2| / |T2|
• Recall: What percentage of the true errors did the non-expert (T2) find?
  Recall = |T1 ∩ T2| / |T1|
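The three measures can be read as set comparisons between the two annotators' tag sets. The sketch below illustrates those definitions only; the (essay, token position) representation of a tag and all of the sample data are assumptions, not MELD's own tooling.

```python
# Illustrative computation of the three agreement measures, assuming
# each annotator's output has been reduced to a set of tagged
# positions, e.g. (essay_id, token_index) pairs. T1 is the expert
# annotator and T2 the non-expert, as on the slide.
def agreement(t1, t2):
    both = len(t1 & t2)  # errors tagged by both annotators
    return {
        "reliability": both / ((len(t1) + len(t2)) / 2),
        "precision": both / len(t2),   # share of T2's tags that are accurate
        "recall": both / len(t1),      # share of true (T1) errors that T2 found
    }

t1 = {("essay1", 3), ("essay1", 9), ("essay2", 4), ("essay2", 7)}
t2 = {("essay1", 3), ("essay2", 4), ("essay2", 12)}
print(agreement(t1, t2))
# -> {'reliability': 0.5714..., 'precision': 0.6666..., 'recall': 0.5}
```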
[Figure: Venn diagram comparing the non-expert's and expert's tag sets, illustrating a case of high precision but low recall and low reliability]
Agreement Measures

Pair   Essays   Recall   Precision   Reliability
J&L    1-10     .54      .58         .39
J&L    11-22    .57      .78         .49
J&N    1-10     .58      .48         .23
J&N    11-22    .37      .54         .27
L&N    1-10     .65      .70         .37
L&N    11-22    .60      .78         .36
Conclusions on Tagging Agreement
• Unsatisfactory level of agreement as to what counts as an error
• Disagreements are resolved through regular meetings
• There are now two types of tags: one for lexico-syntactic errors and one for stylistic errors
• The tags are transparent to the user and can be deleted or ignored
The Future
• Immediate
  • Internet access to the data and tools
  • an error concordancer
  • automatic part-of-speech and syntactic markup
  • data from different ESL skill levels
• Long range
  • a statistical tool to correlate error frequency with student background
  • a student editing aid
  • a grammar checker
  • NNS speech data
Some Possible Applications
• Preparation of instructional materials
• Studies of progress over a semester
• Research on error types by L1
• Research on writing characteristics by L1
Writing Characteristics by L1

L1 Spanish: tense errors (total: 6; word count: 2,305)
  1 {would/will}    1 {went/go}       1 {stay/stayed}
  1 {gave/give}     1 {cannot/could}  1 {can/could}

L1 Gujarati: tense errors (total: 31; word count: 2,500)
  5 {was/is}        3 {were/are}      2 {would/will}
  2 {is/was}        2 {have/had}      2 {had/have}
  1 {would start/started}             1 {will/0}
  1 {will/were to}  1 {was/were}      1 {wanted/want}
  1 {spend/spent}   1 {passes/passed} 1 {love/loved}
  1 {left/leave}    1 {kept/keeps}    1 {involved/involves}
  1 {get/got}       1 {do/did}        1 {can/could}
  1 {are/were}
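A table like this can be derived mechanically from the annotated essays. The following sketch is a hypothetical illustration, assuming each essay is available as an (L1, annotated text) pair in the {error/reconstruction} format; it simply counts tags per L1, and the sample data is invented.

```python
import re
from collections import Counter

# Hypothetical tally of error tags per first language (L1).
TAG = re.compile(r"\{[^/{}]*/[^/{}]*\}")  # no capture groups: findall returns whole tags

def tally_by_l1(essays):
    counts = {}
    for l1, text in essays:
        counts.setdefault(l1, Counter()).update(TAG.findall(text))
    return counts

essays = [
    ("Spanish", "I {cannot/could} not stay, so I {went/go} home."),
    ("Gujarati", "She {was/is} happy because she {was/is} home."),
]
for l1, tags in tally_by_l1(essays).items():
    print(l1, tags.most_common())
# Spanish [('{cannot/could}', 1), ('{went/go}', 1)]
# Gujarati [('{was/is}', 2)]
```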
Acknowledgments
Jacqueline Cassidy, Jennifer Higgins, Norma Pravec, Lenore Rosenbluth, Donna Samko, Jory Samkoff, Kae Shigeta