Analyzing Learner Corpora: Error Annotation and Integration into CALL Programs
Learner corpus analysis and error annotation • Xiaofei Lu • CALPER 2010 Summer Workshop • July 13, 2010
Overview • Analyzing raw corpora • Error annotation • Issues in corpus annotation • Granger (2003)
Analyzing raw corpora • Concordancing software • GOLD • AntConc • Other software • CLAN
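As a rough illustration of what concordancing software does with a raw corpus, the following minimal Python sketch builds keyword-in-context (KWIC) lines. The function name `kwic`, the whitespace tokenization, and the sample sentence are assumptions made for illustration; none of this reflects how AntConc or any other tool named on the slide is implemented.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit.

    Matching is case-insensitive and works on whole tokens only.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical learner sentence, split on whitespace for simplicity.
text = "I am agree with the idea because the idea is good"
hits = kwic(text.split(), "idea", window=2)

for left, key, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>15}  {key}  {right}")
```

A real concordancer adds proper tokenization, sorting on left or right context, and regular-expression search, but the core idea is the same windowed lookup.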
Issues in corpus annotation • Annotation scheme and format • Annotation procedure • Annotation quality
Annotation scheme and format • What are the categories you are using? • Linguistically consensual • Overspecification vs. underspecification • Use short, meaningful codes for your categories • Annotation format considerations • Compatible with annotation scheme • Facilitates corpus query
Annotation procedure and quality • Annotator training • Scheme and format • Problematic cases and disagreements • Computer-assisted manual annotation • Stanford annotation tool • UAM Corpus Tool and NoteTab • Inter-annotator agreement • Cohen’s Kappa • Online Kappa calculator
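For readers who want to see what an online Kappa calculator computes, here is a minimal sketch of Cohen's Kappa for two annotators who labeled the same items: observed agreement corrected for the agreement expected by chance. The function name `cohens_kappa` and the category codes (`GRAM`, `LEX`, `FORM`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)

    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label proportions,
    # summed over all categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for six error tokens.
annotator_1 = ["GRAM", "LEX", "GRAM", "FORM", "GRAM", "LEX"]
annotator_2 = ["GRAM", "LEX", "FORM", "FORM", "GRAM", "GRAM"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this example the annotators agree on 4 of 6 items (about 0.67), but chance agreement is 13/36, so Kappa comes out noticeably lower than raw agreement.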
Granger (2003) • Learner corpora • Error annotation • Error statistics and analysis • Integration of results into CALL • Conclusion
Learner corpora • What is a learner corpus? • Difference from traditional data in SLA • Difference from native language data • Frequencies • Errors • From error annotation to error detection
Computer-aided error annotation • Dagneaux, Denness and Granger (1998) • Manual correction of L2 French corpus • Elaboration of an error tagging system • Insertion of error tags and corrections • Retrieval of lists of error types and statistics • Concordance-based error analysis • Tagging system • Informative but manageable • Reusable, flexible, consistent
Error tagging system • Dulay, Burt & Krashen (1982) • System based on linguistic categories (e.g., syntax) • Surface structure alterations (e.g., omission) • Granger’s (2003) three-dimensional taxonomy • Error domain • Error category • Word category
Error tagging system (cont.) • Error domain and category • General level: grammatical, lexical, etc. • Domains subdivided into error categories • Table 1, page 468 • Word category • A POS tagset with 11 major and 54 sub-categories • Makes it possible to sort errors by POS categories
Error tagging system (cont.) • Correct forms inserted next to erroneous forms • Facilitates interpretation of error annotations • Allows for automatic sorting on correct forms • Tag insertion using a menu-driven editor
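To make the idea of corrections stored next to erroneous forms concrete, the sketch below parses a made-up inline format, `<DOMAIN-CATEGORY-WORDCLASS> erroneous $correct$`, and sorts the extracted errors on the correct form. The tag syntax, the codes (`G`, `F`, `AGR`, `SPE`, etc.), and the sample sentence are all assumptions for illustration; they are not the notation used in Granger (2003).

```python
import re
from typing import NamedTuple


class ErrorTag(NamedTuple):
    domain: str       # e.g. G = grammar, F = form (hypothetical codes)
    category: str     # e.g. AGR = agreement, SPE = spelling (hypothetical)
    word_class: str   # part of speech of the erroneous word
    erroneous: str    # what the learner wrote
    correct: str      # the annotator's correction


# <DOMAIN-CATEGORY-WORDCLASS> erroneous form $correct form$
TAG_PATTERN = re.compile(r"<(\w+)-(\w+)-(\w+)>\s*(.+?)\s*\$(.*?)\$")


def extract_errors(text):
    """Return one ErrorTag per annotated error, in text order."""
    return [ErrorTag(*m.groups()) for m in TAG_PATTERN.finditer(text)]


sample = (
    "Les <G-AGR-ADJ> petit $petits$ enfants "
    "<G-AGR-VERB> joue $jouent$ dans le <F-SPE-NOUN> jardain $jardin$."
)
errors = extract_errors(sample)

# Because the correction is stored with the error, sorting on the
# correct form is a one-liner.
by_correction = sorted(errors, key=lambda e: e.correct)
for e in by_correction:
    print(e.correct, "<-", e.erroneous, f"({e.domain}-{e.category}-{e.word_class})")
```

Keeping the three dimensions as separate fields is what makes it possible to regroup the same annotations by domain, by category, or by word class.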
Error statistics and analysis • Error frequency by domain or (word) category • Highest ranked domains: grammar and form • Error trigrams • Concordancers for searching error codes • AntConc • WordSmith Tools
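Error frequencies by domain or word category can be obtained by simple counting once the tags are extracted. The sketch below assumes hyphen-separated three-part codes (`DOMAIN-CATEGORY-WORDCLASS`); both the code format and the tag list are invented for illustration and are not taken from Granger's data.

```python
from collections import Counter

# Hypothetical three-part error codes extracted from an annotated corpus.
tags = [
    "G-AGR-ADJ", "G-AGR-VERB", "F-SPE-NOUN",
    "G-TEN-VERB", "L-MEA-NOUN", "F-SPE-VERB",
]


def frequency_by(tags, level):
    """Count tags by one dimension: 0 = domain, 1 = category, 2 = word class."""
    return Counter(tag.split("-")[level] for tag in tags)


by_domain = frequency_by(tags, 0)
by_word_class = frequency_by(tags, 2)

# Rank domains from most to least frequent.
ranked_domains = by_domain.most_common()
print(ranked_domains)
print(by_word_class.most_common())
```

Raw counts like these are usually normalized (for example, errors per 1,000 words) before corpora of different sizes are compared.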
Integrating results into CALL • Goal: a hypermedia CALL program • Using NLP and Communicative approaches to SLA • Traditional and NLP-enabled exercises • Automatic error diagnosis and feedback generation • Error statistics and analysis used to • Select linguistic areas to focus on • Adapt exercises as a function of attested error types • Adapt NLP tools for error diagnosis
Integrating results into CALL (cont.) • Most error-prone linguistic areas • Tense and mood, agreement • Articles, complementation, prepositions • Adapting exercises • Exercises reflect type of error-prone context • Formal errors through dictation and exercises targeting specific difficulties • Attention to punctuation
Integrating results into CALL (cont.) • Adapting NLP tools for error diagnosis • Spell checker and parser • Handle orthographic, grammatical, syntactic, and lexical errors • Do not handle punctuation, semantic, or tense errors
Granger (2003) summary • Effective 3-tier error annotation system • Limited number of categories per tier • Versatile automated data manipulation • Limitations of error-tagging • Element of subjectivity in annotation • Focuses on misuse • Usefulness of error-tagged learner corpus • Error statistics help understand learner interlanguage • Helps adapt pedagogical materials and programs
Activity • Using the Stanford annotation tool • Annotate a short text using your own scheme, or • Annotate a short learner text using Granger’s (2003) scheme • Query the annotated text using AntConc