Detecting Erroneous Sentences using Automatically Mined Sequential Patterns

Advisor: Hsin-Hsi Chen • Reporter: Chi-Hsin Yu • Date: 2007.12.04

Outline • Introduction • Related Work • Proposed Technique • Experimental Evaluation • Conclusions and Future Work

Introduction • Summary • Problem: Identifying erroneous/correct sentences • Algorithm: Classification (SVM, NB) • Approach: Sequential patterns (Data Mining) • Applications • Providing feedback for writers of English as a Second Language (ESL) • Controlling the quality of parallel bilingual sentences mined from the Web • Evaluating machine translation (MT) results

Introduction (cont.) • Common mistakes made by ESL learners (Yukio et al., 2001; Gui and Yang, 2003) • spelling, verb formation • lexical collocation, tense, agreement, wrong Part-Of-Speech (POS), article usage • sentence structure (grammar structure) • Example • “If Maggie will go to supermarket, she will buy a bag for you.” • The pattern: “if...will...will” (would) • N-grams: consider only contiguous sequences of words and become very expensive if N > 3

Related Work • Category 1: the use of hand-crafted rules • Heidorn, 2000; Michaud et al., 2000; Bender et al., 2004 • Difficulties • expensive to write rules manually • difficult to produce and maintain a large number of non-conflicting rules that cover a wide range of grammatical errors • learners with different first-language backgrounds and skill levels make different errors • hard to write rules for some grammatical errors

Related Work (cont.) • Category 2: statistical approaches • Chodorow and Leacock, 2000; Izumi et al., 2003; Brockett et al., 2006; Nagata et al., 2006 • Problems • they focus on a few pre-defined error types • the reported results are not strong • errors must be specified and tagged in the training sentences • parallel tagged data is required

Proposed Technique • Classification model • Using SVM (SVM-light) • Features • Labeled Sequential Patterns (LSP) – 1 feature • Complementary features • Lexical Collocation (LC) – 3 features • Perplexity from Language Model (PLM) – 2 features • Syntactic Score (SC) – 1 feature • Function Word Density (FWD) – 5 features

Proposed Technique —LSP (1) • A labeled sequential pattern (LSP), p, has the form <LHS, c> • LHS is a sequence <a1, ..., am> • each ai is called an “item” • c is a class label (correct/incorrect here) • Sequence database D • a collection of tuples of the same form, each a sequence paired with a class label

Proposed Technique —LSP (2) • “Contain” relation (subsequence) • a sequence s1 = <a1, ..., am> is contained in a sequence s2 = <b1, ..., bn> if there exist integers i1, ..., im such that 1 <= i1 < i2 < ... < im <= n and aj = b_ij for all j in {1, ..., m} • A = <a, b, c, d, e, f, g, h> has a subsequence B = <b, d, e, g> • A contains B • An LSP p1 is contained by p2 if the sequence p1.LHS is contained by p2.LHS and p1.c = p2.c
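The containment test above can be sketched in a few lines of Python. This is only an illustration of the definition, not code from the paper; the names `contains` and `lsp_contained_by` are hypothetical.

```python
from typing import Hashable, Sequence, Tuple

def contains(s2: Sequence[Hashable], s1: Sequence[Hashable]) -> bool:
    """Return True if s1 is a (possibly non-contiguous) subsequence of s2."""
    it = iter(s2)
    # `item in it` consumes the iterator up to the first match, so each
    # later item can only match at a strictly larger index (i1 < i2 < ...).
    return all(item in it for item in s1)

# An LSP is modelled here as (LHS, class label); this layout is an assumption.
LSP = Tuple[Sequence[Hashable], str]

def lsp_contained_by(p1: LSP, p2: LSP) -> bool:
    """p1 is contained by p2 if p1.LHS is contained by p2.LHS and labels match."""
    (lhs1, c1), (lhs2, c2) = p1, p2
    return c1 == c2 and contains(lhs2, lhs1)

# The slide's example: A = <a..h> contains B = <b, d, e, g>.
a_contains_b = contains("abcdefgh", "bdeg")

# A gapped pattern matches where a contiguous n-gram would not.
sentence = "if Maggie will go to supermarket , she will buy a bag for you".split()
matches_if_will_will = contains(sentence, ["if", "will", "will"])
```

Because items may be separated by arbitrary gaps, a three-item pattern such as `if ... will ... will` matches a 14-token sentence without enumerating long n-grams.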

Proposed Technique —LSP (3) • An LSP p has two associated measures, support and confidence. • The support of p (the generality of the pattern p) • denoted by sup(p) • the percentage of tuples in database D that contain the LSP p • The confidence of p (the predictive ability of p) • denoted by conf(p) • computed as conf(p) = sup(p) / sup(p.LHS), where sup(p.LHS) is the percentage of tuples in D whose sequence contains p.LHS, regardless of class label

Proposed Technique —LSP (4) • Example: • t1 = (< a, d, e, f >, E) • t2 = (< a, f, e, f >, E) • t3 = (< d, a, f >, C) • One example LSP p1 = (< a, e, f >, E) • is contained in t1 and t2 • sup(p1) = 2/3 = 66.7% • conf(p1) = (2/3)/(2/3) = 100% • LSP p2 = (< a, f >, E) • is contained in t1 and t2 (t3 contains < a, f > but has label C) • sup(p2) = 2/3 = 66.7% • conf(p2) = (2/3)/(3/3) = 66.7%
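The confidence values in this example can be checked with a small sketch. It assumes that the numerator counts tuples whose sequence contains the LHS and whose label equals the pattern's label, and that the denominator counts tuples whose sequence contains the LHS under any label. The function names are hypothetical.

```python
def contains(seq, pattern):
    """True if pattern is a (possibly gapped) subsequence of seq."""
    it = iter(seq)
    return all(item in it for item in pattern)

def support(db, lhs, label=None):
    """Fraction of tuples whose sequence contains lhs.

    If label is given, only tuples with that class label are counted.
    """
    hits = sum(
        1 for seq, c in db
        if contains(seq, lhs) and (label is None or c == label)
    )
    return hits / len(db)

def confidence(db, lhs, label):
    """Label-matching support divided by label-independent support of lhs."""
    return support(db, lhs, label) / support(db, lhs)

# The three tuples from the slide.
db = [
    (list("adef"), "E"),  # t1
    (list("afef"), "E"),  # t2
    (list("daf"), "C"),   # t3
]

sup_p1 = support(db, list("aef"), "E")      # p1 = (<a, e, f>, E)
conf_p1 = confidence(db, list("aef"), "E")
conf_p2 = confidence(db, list("af"), "E")   # p2 = (<a, f>, E)
```

With these definitions `sup_p1` is 2/3, `conf_p1` is 1.0, and `conf_p2` is 2/3, matching the slide's 66.7%, 100%, and 66.7%.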

Proposed Technique —LSP (5) • Generating Sequence Database • applying a Part-Of-Speech (POS) tagger to each training sentence • MXPOST, the Maximum Entropy Part of Speech Tagger Toolkit, for POS tags • keeping function words and time words, and replacing the other words with their POS tags • each sentence together with its label becomes a database tuple • “In the past, John was kind to his sister” • “In the past, NNP was JJ to his NN” • LSP Examples • (<a, NNS>, Error), NNS: plural noun • (<yesterday, is>, Error)
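The abstraction step can be sketched as follows. The POS tags are supplied by hand instead of calling MXPOST, and `KEEP_WORDS` is a tiny hypothetical stand-in for the real lists of function words and time words, which the slides do not give.

```python
from typing import List, Set, Tuple

# Hypothetical keep-list; the actual function/time word lists are not in the slides.
KEEP_WORDS: Set[str] = {"in", "the", "past", "was", "is", "to", "his", ","}

def to_items(tagged: List[Tuple[str, str]], keep: Set[str] = KEEP_WORDS) -> List[str]:
    """Keep listed words as they are; replace every other word with its POS tag."""
    return [word if word.lower() in keep else tag for word, tag in tagged]

# Hand-tagged version of the slide's sentence (Penn Treebank style tags assumed).
tagged = [
    ("In", "IN"), ("the", "DT"), ("past", "NN"), (",", ","),
    ("John", "NNP"), ("was", "VBD"), ("kind", "JJ"),
    ("to", "TO"), ("his", "PRP$"), ("sister", "NN"),
]

items = to_items(tagged)
db_tuple = (items, "Correct")  # the sequence plus its class label
```

Here `items` is `["In", "the", "past", ",", "NNP", "was", "JJ", "to", "his", "NN"]`, the same abstraction as the slide's second example sentence.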

Proposed Technique —LSP (6) • Mining LSPs • adapting the frequent sequence mining algorithm in (Pei et al., 2001) • setting minimum support at 0.1% and minimum confidence at 75% • Converting LSPs to Features • the feature corresponding to an LSP is set to 1 if the sentence contains that LSP
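A minimal sketch of the two steps above, assuming candidate patterns already carry their support and confidence (the mining algorithm itself is not reproduced here) and assuming one binary feature per retained LSP. The candidate statistics below are invented purely for illustration.

```python
MIN_SUP = 0.001   # 0.1% minimum support, as on the slide
MIN_CONF = 0.75   # 75% minimum confidence, as on the slide

def contains(seq, pattern):
    """True if pattern is a (possibly gapped) subsequence of seq."""
    it = iter(seq)
    return all(item in it for item in pattern)

def select_lsps(candidates):
    """Keep (lhs, label) pairs that meet both thresholds.

    candidates: iterable of (lhs, label, sup, conf) tuples.
    """
    return [
        (lhs, label)
        for lhs, label, sup, conf in candidates
        if sup >= MIN_SUP and conf >= MIN_CONF
    ]

def lsp_features(items, lsps):
    """One binary feature per LSP: 1 if the sentence contains its LHS, else 0."""
    return [1 if contains(items, lhs) else 0 for lhs, _ in lsps]

# Hypothetical candidates; the sup/conf numbers are made up.
candidates = [
    (("this", "NNS"), "E", 0.004, 0.90),
    (("past", "is"), "E", 0.002, 0.80),
    (("the", "NN"), "C", 0.300, 0.55),   # fails the confidence threshold
]
lsps = select_lsps(candidates)

items = ["in", "the", "past", ",", "NNP", "is", "JJ", "to", "his", "NN"]
features = lsp_features(items, lsps)
```

For this sentence `features` is `[0, 1]`: only the `<past, is>` pattern fires.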

Proposed Technique —LSP (7) • LSPs for erroneous sentences • “<this, NNS>” (“this books is stolen.”) • “<past, is>” (“in the past, John is kind to his sister.”) • “<one, of, NN>” (“it is one of important working language”) • “<although, but>” (“although he likes it, but he can’t buy it.”) • “<only, if, I, am>” (“only if my teacher has given permission, I am allowed to enter this room.”) • LSPs for correct sentences • “<would, VB>” (“he would buy it.”) • “<VBD, yesterday>” (“I bought this book yesterday.”)

Proposed Technique —Other Linguistic Features (1) • Lexical Collocation (LC) • Lexical collocation (“strong tea”/濃茶, not “powerful tea”) • collecting five types of collocations • verb-object, adjective-noun, verb-adverb, subject-verb, and preposition-object from a general English corpus • Correct LCs • extracting collocations of high frequency • Erroneous LC candidates • generated by replacing a word in a correct collocation with its confusion words, obtained from WordNet • reviewed by experts to confirm whether a candidate is a true erroneous collocation

Proposed Technique —Other Linguistic Features (2) • computing three LC features for each sentence • (1) the sum of p(co_i) for i = 1, ..., m, divided by n • m is the number of correct LCs (CLs) • n is the number of collocations in each sentence • the probability p(co_i) of each CL co_i is calculated using the method of (Lu and Zhou, 2004) • (2) the ratio of the number of unknown collocations (neither correct LCs nor erroneous LCs) to the number of collocations in each sentence • (3) the ratio of the number of erroneous LCs to the number of collocations in each sentence
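The three LC features can be sketched as below. Feature (1) is assumed here to be the summed probability of the sentence's correct collocations divided by n; the probability table and the erroneous-collocation set are hypothetical placeholders, not values from the paper.

```python
from typing import Dict, List, Set, Tuple

Collocation = Tuple[str, str]

def lc_features(
    collocations: List[Collocation],
    correct_prob: Dict[Collocation, float],
    erroneous: Set[Collocation],
) -> Tuple[float, float, float]:
    """Return (correct-LC score, unknown-LC ratio, erroneous-LC ratio)."""
    n = len(collocations)
    if n == 0:
        # No collocations in the sentence: all three features default to 0.
        return (0.0, 0.0, 0.0)
    # (1) assumed form: sum of p(co_i) over correct collocations, divided by n
    correct_score = sum(correct_prob[c] for c in collocations if c in correct_prob) / n
    # (3) share of collocations listed as erroneous
    n_err = sum(1 for c in collocations if c in erroneous)
    # (2) share of collocations that are neither correct nor erroneous
    n_unknown = sum(
        1 for c in collocations if c not in correct_prob and c not in erroneous
    )
    return (correct_score, n_unknown / n, n_err / n)

# Hypothetical resources for illustration only.
correct_prob = {("strong", "tea"): 0.5}
erroneous = {("powerful", "tea")}

sentence_collocations = [("strong", "tea"), ("powerful", "tea"), ("drink", "tea")]
f1, f2, f3 = lc_features(sentence_collocations, correct_prob, erroneous)
```

In this toy case each of the three collocations falls into a different bucket, so `f2` and `f3` are both 1/3 and `f1` is 0.5/3.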

Proposed Technique —Other Linguistic Features (3) • Perplexity from Language Model (PLM) • extracted from a trigram language model • using the SRILM-SRI Language Modeling Toolkit (Stolcke, 2002) • calculating two values for each sentence: • lexicalized trigram perplexity • POS trigram perplexity • erroneous sentences are expected to have higher perplexity
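For intuition, sentence perplexity can be computed from per-token log probabilities as sketched below. This uses the generic textbook definition with natural logarithms; it does not reproduce SRILM's exact counting conventions, and the probabilities are invented for illustration.

```python
import math
from typing import Sequence

def perplexity(log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not log_probs:
        raise ValueError("need at least one token log probability")
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token probabilities from some trigram model.
fluent = [math.log(p) for p in (0.30, 0.25, 0.40, 0.20)]
disfluent = [math.log(p) for p in (0.30, 0.02, 0.40, 0.01)]

ppl_fluent = perplexity(fluent)
ppl_disfluent = perplexity(disfluent)
```

A sentence containing improbable trigrams gets a larger value, which is the signal the PLM features rely on. The same function applies to a POS-tag sequence scored by a POS trigram model.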

Proposed Technique —Other Linguistic Features (4) • Syntactic Score (SC) • using a statistical parser (Collins, 1997) • assigning each sentence the parser’s score • the log probability of the parse • assuming that erroneous sentences with undesirable sentence structures are more likely to receive lower scores

Proposed Technique —Other Linguistic Features (5) • Function Word Density (FWD) • the ratio of function words to content words • inspired by the work of (Corston-Oliver et al., 2001) • shown there to be effective in distinguishing human references from machine outputs • seven kinds of function words
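A single overall density can be sketched as below. The slides do not list the function-word classes, so `FUNCTION_WORDS` is a small hypothetical set, and the per-class densities are not modelled.

```python
from typing import Iterable, Set

# Hypothetical function-word list; the real inventory is not given in the slides.
FUNCTION_WORDS: Set[str] = {
    "a", "an", "the", "he", "she", "it", "i", "to", "of", "in",
    "would", "will", "is", "was", "but", "although", "if",
}

def function_word_density(tokens: Iterable[str],
                          function_words: Set[str] = FUNCTION_WORDS) -> float:
    """Ratio of function-word tokens to content-word tokens."""
    tokens = [t.lower() for t in tokens]
    n_function = sum(1 for t in tokens if t in function_words)
    n_content = len(tokens) - n_function
    if n_content == 0:
        # Avoid division by zero for sentences with no content words.
        return float(n_function)
    return n_function / n_content

fwd = function_word_density("he would buy it".split())
```

For "he would buy it" the sketch counts three function words and one content word, giving a density of 3.0.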

Experimental Evaluation (1) – Experimental setup • Classification model: SVM • For a non-binary feature X: its value x is normalized by z-score. • Two data sets: Japanese Corpus (JC) and Chinese Corpus (CC)
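The z-score normalization of a non-binary feature can be sketched as follows. The population standard deviation is used here as an assumption, since the slides do not say which variant was applied.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def z_scores(values: Sequence[float]) -> List[float]:
    """Normalize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical perplexity values for three sentences.
normalized = z_scores([120.0, 200.0, 280.0])
```

After normalization, features on very different scales (for example perplexities and ratios) become comparable inputs to the SVM.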

Experimental Evaluation (2)

Experimental Evaluation (3) • ALEK (Chodorow and Leacock, 2000) from Educational Testing Service (ETS) • 694 parallel sentences • 1,671 non-parallel sentences • Different cultures (Japanese/Chinese as first language)

Experimental Evaluation (4) • Two LDC data sets: low-ranked and high-ranked • 14,604 low-ranked (score 1-3) MT outputs • 808 high-ranked (score 3-5) MT outputs • both with corresponding human reference translations • human references labeled correct, MT outputs labeled erroneous

Conclusions and Future Work • Conclusions • This paper proposed to mine LSPs as the input of classification models. • LSPs were shown to be much more effective than the other linguistic features. • Other features were also beneficial. • Future work • To use LSPs to provide detailed feedback for ESL learners • To integrate the features effectively • To further investigate the application for MT evaluation

Thanks!!
