Detecting Erroneous Sentences using Automatically Mined Sequential Patterns
This study presents an approach to identifying erroneous sentences using labeled sequential patterns (LSPs) mined automatically from data. It targets automatic error detection in English, especially for ESL learners. LSPs, together with complementary linguistic features, are fed to classification models such as SVM and Naive Bayes. Intended applications include feedback for ESL writers, quality control of parallel bilingual sentences mined from the Web, and evaluation of machine translation output.
Detecting Erroneous Sentences using Automatically Mined Sequential Patterns Advisor: Hsin-Hsi Chen Reporter: Chi-Hsin Yu Date: 2007.12.04
Outline • Introduction • Related Work • Proposed Technique • Experimental Evaluation • Conclusions and Future Work
Introduction • Summary • Problem: Identifying erroneous/correct sentences • Algorithm: Classification (SVM, NB) • Approach: Sequential patterns (Data Mining) • Applications • Providing feedback for writers of English as a Second Language (ESL) • Controlling the quality of parallel bilingual sentences mined from the Web • Evaluating the MT results
Introduction (cont.) • Common mistakes made by ESL learners (Yukio et al., 2001; Gui and Yang, 2003) • spelling, verb formation • lexical collocation, tense, agreement, wrong Part-Of-Speech (POS), article usage • sentence structure (grammar structure) • Example • “If Maggie will go to supermarket, she will buy a bag for you.” • The pattern: “if...will...will” (would) • N-grams: consider only contiguous sequences of words; very expensive if N > 3
Related Work • Category 1: the use of hand-crafted rules • Heidorn, 2000; Michaud et al., 2000; Bender et al., 2004 • Difficulties • Rules are expensive to write manually • It is difficult to produce and maintain a large number of non-conflicting rules that cover a wide range of grammatical errors • Learners with different first-language backgrounds and skill levels make different errors • Some grammatical errors are hard to capture with rules
Related Work (cont.) • Category 2: statistical approaches • Chodorow and Leacock, 2000; Izumi et al., 2003; Brockett et al., 2006; Nagata et al., 2006 • Problems • They focus on a few pre-defined error types • The reported results are not attractive • Errors must be specified and tagged in the training sentences • Parallel tagged data are required
Proposed Technique • Classification model • Using SVM (light SVM) • Features • Labeled Sequential Patterns (LSP) – 1 feature • Complementary features • Lexical Collocation (LC) – 3 features • Perplexity from Language Model (PLM) – 2 features • Syntactic Score (SC) – 1 feature • Function Word Density (FWD) – 5 features
Proposed Technique – LSP (1) • A labeled sequential pattern (LSP), p, has the form <LHS, c> • LHS is a sequence <a1, ..., am> • each ai is called an “item” • c is a class label (correct/incorrect here) • Sequence database D • a collection of labeled sequences (tuples)
Proposed Technique – LSP (2) • “Contain” relation (subsequence) • a sequence s1 = <a1, ..., am> is contained in a sequence s2 = <b1, ..., bn> if there exist integers i1, ..., im such that 1 <= i1 < i2 < ... < im <= n and aj = b_ij (the item of s2 at position ij) for all j in {1, ..., m}. • A = <abcdefgh> has a subsequence B = <bdeg> • A contains B. • An LSP p1 is contained by p2 if the sequence p1.LHS is contained by p2.LHS and p1.c = p2.c.
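The “contain” relation can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. The function names `is_contained` and `lsp_contained` are hypothetical, and an LSP is assumed to be represented as an (LHS, label) pair.

```python
def is_contained(s1, s2):
    """Return True if s1 is a subsequence of s2 (order kept, gaps allowed)."""
    remaining = iter(s2)
    # `item in remaining` advances the iterator, so matches must occur in order.
    return all(item in remaining for item in s1)


def lsp_contained(p1, p2):
    """An LSP p1 = (LHS, label) is contained by p2 if its LHS is a
    subsequence of p2's LHS and both carry the same class label."""
    lhs1, c1 = p1
    lhs2, c2 = p2
    return c1 == c2 and is_contained(lhs1, lhs2)


print(is_contained("bdeg", "abcdefgh"))  # True: B = <bdeg> is contained in A
print(is_contained("ba", "abcdefgh"))    # False: order matters
```

Unlike an n-gram match, the check allows arbitrary gaps between matched items, which is what lets a pattern such as “if...will...will” span a whole sentence.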
Proposed Technique – LSP (3) • An LSP p has two associated measures, support and confidence. • The support of p (the generality of the pattern p) • denoted by sup(p) • the percentage of tuples in database D that contain the LSP p • The confidence of p (the predictive ability of p) • denoted by conf(p) • computed as conf(p) = sup(p) / sup(p.LHS), where sup(p.LHS) is the percentage of tuples in D whose sequence contains p.LHS regardless of label
Proposed Technique – LSP (4) • Example: • t1 = (<a, d, e, f>, E) • t2 = (<a, f, e, f>, E) • t3 = (<d, a, f>, C) • One example LSP p1 = (<a, e, f>, E) • is contained in t1 and t2 • sup(p1) = 2/3 = 66.7% • conf(p1) = (2/3)/(2/3) = 100% • LSP p2 = (<a, f>, E) • is contained in t1 and t2; t3 contains the LHS <a, f> but carries label C • sup(p2) = 2/3 = 66.7% • conf(p2) = (2/3)/(3/3) = 66.7%
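The support and confidence of an LSP can be computed directly from the three example tuples. The sketch below assumes that a tuple contains an LSP only when the tuple's sequence contains the LHS and the class labels agree, and that confidence is the LSP's support divided by the support of its LHS alone. The names `contains` and `support_and_confidence` are hypothetical.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support_and_confidence(database, lhs, label):
    """Return (sup, conf) of the LSP (lhs, label) over (sequence, label) tuples."""
    total = len(database)
    # Tuples whose sequence contains the LHS, whatever their label.
    lhs_hits = sum(1 for seq, _ in database if contains(seq, lhs))
    # Tuples that contain the LHS and also carry the LSP's label.
    lsp_hits = sum(1 for seq, c in database if c == label and contains(seq, lhs))
    sup = lsp_hits / total
    conf = lsp_hits / lhs_hits if lhs_hits else 0.0
    return sup, conf


database = [
    (("a", "d", "e", "f"), "E"),  # t1
    (("a", "f", "e", "f"), "E"),  # t2
    (("d", "a", "f"), "C"),       # t3
]

print(support_and_confidence(database, ("a", "e", "f"), "E"))  # p1
print(support_and_confidence(database, ("a", "f"), "E"))       # p2
```

For p1 the LHS occurs only in the two tuples labeled E, so the confidence is 1. For p2 the LHS also occurs in t3, which is labeled C, so the confidence drops to 2/3.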
Proposed Technique – LSP (5) • Generating the sequence database • applying a Part-Of-Speech (POS) tagger to tag each training sentence • MXPOST (Maximum Entropy Part of Speech Tagger Toolkit) for POS tags • keeping function words and time words; other words are replaced by their POS tags • each sentence together with its label becomes a database tuple • “In the past, John was kind to his sister” • “In the past, NNP was JJ to his NN” • LSP Examples • (<a, NNS>, Error), NNS: plural noun • (<yesterday, is>, Error)
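The conversion from a tagged sentence to a database sequence might look like the following sketch. It assumes the sentence has already been POS-tagged (no tagger is called here), and the word lists `FUNCTION_WORDS` and `TIME_WORDS` are small hypothetical stand-ins for whatever lists the authors used.

```python
# Hypothetical, deliberately tiny word lists for illustration only.
FUNCTION_WORDS = {"in", "the", "was", "is", "to", "his", "a", "this", "if", "will"}
TIME_WORDS = {"past", "yesterday", "tomorrow"}


def to_sequence(tagged_sentence):
    """Keep function words, time words and punctuation; replace every
    other word by its POS tag."""
    items = []
    for word, tag in tagged_sentence:
        lowered = word.lower()
        if lowered in FUNCTION_WORDS or lowered in TIME_WORDS or not word.isalnum():
            items.append(word)
        else:
            items.append(tag)
    return items


tagged = [("In", "IN"), ("the", "DT"), ("past", "NN"), (",", ","),
          ("John", "NNP"), ("was", "VBD"), ("kind", "JJ"), ("to", "TO"),
          ("his", "PRP$"), ("sister", "NN")]

# One database tuple: the item sequence plus the sentence's class label.
database_tuple = (to_sequence(tagged), "Correct")
print(" ".join(database_tuple[0]))
```

Replacing content words by POS tags keeps the item vocabulary small, so patterns such as <a, NNS> generalize across many different nouns.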
Proposed Technique – LSP (6) • Mining LSPs • adapting the frequent sequence mining algorithm of (Pei et al., 2001) • setting minimum support at 0.1% and minimum confidence at 75% • Converting LSPs to Features • the corresponding feature is set to 1 if a sentence includes an LSP
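The feature conversion can be sketched as follows, assuming one binary indicator per mined LSP (an assumption; the slides do not spell out the exact encoding). The mining step itself is not reproduced here, and `lsp_features` and the toy pattern list are hypothetical.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def lsp_features(sequence, lsps):
    """Return one 0/1 indicator per LSP: 1 if the sentence's item
    sequence includes the LSP's left-hand side, else 0."""
    return [1 if contains(sequence, lhs) else 0 for lhs, _label in lsps]


# Toy pattern list in (LHS, label) form, as if already mined and filtered
# by the minimum support and confidence thresholds.
mined_lsps = [
    (("this", "NNS"), "Error"),
    (("past", "is"), "Error"),
    (("would", "VB"), "Correct"),
]

# Item sequence for "this books is stolen." after POS replacement.
sentence = ["this", "NNS", "is", "VBN", "."]
print(lsp_features(sentence, mined_lsps))  # [1, 0, 0]
```

The resulting vector can be concatenated with the other feature groups before being passed to the classifier.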
Proposed Technique – LSP (7) • LSPs for erroneous sentences • “<this, NNS>” (“this books is stolen.”) • “<past, is>” (“in the past, John is kind to his sister.”) • “<one, of, NN>” (“it is one of important working language”) • “<although, but>” (“although he likes it, but he can’t buy it.”) • “<only, if, I, am>” (“only if my teacher has given permission, I am allowed to enter this room.”) • LSPs for correct sentences • “<would, VB>” (“he would buy it.”) • “<VBD, yesterday>” (“I bought this book yesterday.”)
Proposed Technique – Other Linguistic Features (1) • Lexical Collocation (LC) • Lexical collocation (“strong tea”/濃茶, not “powerful tea”) • collecting five types of collocations • verb-object, adjective-noun, verb-adverb, subject-verb, and preposition-object from a general English corpus • Correct LCs • extracting collocations of high frequency • Erroneous LC candidates • generated by replacing a word in a correct collocation with its confusion words, obtained from WordNet • reviewed by experts to decide whether a candidate is a truly erroneous collocation
Proposed Technique – Other Linguistic Features (2) • computing three LC features for each sentence • (1) • m is the number of CLs • n is the number of collocations in each sentence • Probability p(co_i) of each CL co_i is calculated using the method of (Lu and Zhou, 2004) • (2) the ratio of the number of unknown collocations (neither correct LCs nor erroneous LCs) to the number of collocations in each sentence • (3) the ratio of the number of erroneous LCs to the number of collocations in each sentence
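Features (2) and (3) are plain ratios and can be sketched as below. The sketch assumes the collocations of a sentence have already been extracted as word pairs and that the correct and erroneous collocation lists are available as sets. The function name and the toy data are hypothetical, and feature (1) is not covered.

```python
def lc_ratio_features(collocations, correct_lcs, erroneous_lcs):
    """Return (unknown_ratio, erroneous_ratio) for one sentence."""
    n = len(collocations)
    if n == 0:
        return 0.0, 0.0  # arbitrary choice for sentences without collocations
    n_erroneous = sum(1 for c in collocations if c in erroneous_lcs)
    n_unknown = sum(
        1 for c in collocations
        if c not in correct_lcs and c not in erroneous_lcs
    )
    return n_unknown / n, n_erroneous / n


# Toy collocation lists for illustration only.
correct_lcs = {("strong", "tea"), ("drink", "tea")}
erroneous_lcs = {("powerful", "tea")}

sentence_collocations = [("powerful", "tea"), ("drink", "tea"), ("green", "idea")]
unknown_ratio, erroneous_ratio = lc_ratio_features(
    sentence_collocations, correct_lcs, erroneous_lcs
)
print(unknown_ratio, erroneous_ratio)
```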
Proposed Technique – Other Linguistic Features (3) • Perplexity from Language Model (PLM) • extracted from a trigram language model • using SRILM, the SRI Language Modeling Toolkit (Stolcke, 2002) • Calculating two values for each sentence: • lexicalized trigram perplexity • POS trigram perplexity • Erroneous sentences are expected to have higher perplexity
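As a rough illustration of the perplexity feature, the sketch below computes perplexity from per-token base-10 log probabilities, as a trigram model would assign them. It does not call SRILM. The function name is hypothetical, the input values are made up, and toolkit-specific details such as end-of-sentence tokens and out-of-vocabulary handling are ignored.

```python
def perplexity(log10_probs):
    """Perplexity = 10 ** (-average log10 probability per token)."""
    if not log10_probs:
        raise ValueError("need at least one token probability")
    average = sum(log10_probs) / len(log10_probs)
    return 10 ** (-average)


# Made-up log10 probabilities for a fluent and a less fluent sentence.
fluent = [-1.0, -1.0, -1.0]
disfluent = [-1.0, -3.0, -2.0]

print(perplexity(fluent))     # lower perplexity
print(perplexity(disfluent))  # higher perplexity
```

The same computation applies to both values on the slide: once over word trigrams and once over POS-tag trigrams.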
Proposed Technique – Other Linguistic Features (4) • Syntactic Score (SC) • using a statistical parser toolkit (Collins, 1997) • assigning each sentence a parser score • the related log probability of parsing • assuming that erroneous sentences with undesirable sentence structures are more likely to receive lower scores
Proposed Technique – Other Linguistic Features (5) • Function Word Density (FWD) • the ratio of function words to content words • inspired by the work of (Corston-Oliver et al., 2001) • effective in distinguishing between human references and machine outputs • seven kinds of function words
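The basic density ratio can be sketched as follows. The function-word list is a small hypothetical stand-in, and the sketch computes a single overall ratio rather than one value per kind of function word, since the slides do not give that breakdown.

```python
# Hypothetical function-word list for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "is", "was", "he", "it"}


def function_word_density(tokens):
    """Ratio of function words to content words in a tokenized sentence."""
    n_function = sum(1 for t in tokens if t.lower() in FUNCTION_WORDS)
    n_content = len(tokens) - n_function
    if n_content == 0:
        return 0.0  # arbitrary choice when a sentence has no content words
    return n_function / n_content


print(function_word_density("the cat sat on the mat".split()))  # 3 / 3 = 1.0
```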
Experimental Evaluation (1) – Experimental setup • Classification model: SVM • For a non-binary feature X: its value x is normalized by z-score. • Two data sets: Japanese Corpus (JC) and Chinese Corpus (CC)
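Z-score normalization of a non-binary feature can be sketched with the standard library. The sketch assumes the mean and the population standard deviation are taken over all values of that feature. The slides do not say whether the population or sample deviation was used, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Normalize feature values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no signal
    return [(x - mu) / sigma for x in values]


# Made-up perplexity values of three sentences.
normalized = z_scores([10.0, 20.0, 30.0])
print(normalized)
```

Normalizing this way puts features with very different ranges, such as perplexities and ratios, on a comparable scale before SVM training.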
Experimental Evaluation (3) • ALEK (Chodorow and Leacock, 2000) from Educational Testing Service (ETS) • 694 parallel sentences • 1671 non-parallel sentences • Different cultures (Japanese/Chinese as first language)
Experimental Evaluation (4) • Two LDC data sets: low-ranked and high-ranked • 14,604 low-ranked (score 1-3) MT outputs • 808 high-ranked (score 3-5) MT outputs • Both with corresponding human reference translations • human references (correct), MT outputs (erroneous)
Conclusions and Future Work • Conclusions • This paper proposed to mine LSPs as the input of classification models. • LSPs were shown to be much more effective than the other linguistic features. • Other features were also beneficial. • Future work • To use LSPs to provide detailed feedback for ESL learners • To integrate the features effectively • To further investigate the application for MT evaluation