
Detecting Erroneous Sentences using Automatically Mined Sequential Patterns


Presentation Transcript


1. Detecting Erroneous Sentences using Automatically Mined Sequential Patterns Advisor: Hsin-Hsi Chen Reporter: Chi-Hsin Yu Date: 2007.12.04

2. Outline • Introduction • Related Work • Proposed Technique • Experimental Evaluation • Conclusions and Future Work

3. Introduction • Summary • Problem: identifying erroneous/correct sentences • Algorithm: classification (SVM, Naive Bayes) • Approach: sequential patterns (data mining) • Applications • providing feedback for writers of English as a Second Language (ESL) • controlling the quality of parallel bilingual sentences mined from the Web • evaluating machine translation (MT) results

4. Introduction (cont.) • Common mistakes made by ESL learners (Yukio et al., 2001; Gui and Yang, 2003) • spelling, verb formation • lexical collocation, tense, agreement, wrong part-of-speech (POS), article usage • sentence structure (grammatical structure) • Example • “If Maggie will go to supermarket, she will buy a bag for you.” • The pattern “if...will...will” signals the error (the “will” in the if-clause should be dropped or replaced, e.g. with “would”) • N-grams consider only contiguous sequences of words and become very expensive for N > 3

5. Related Work • Category 1: the use of hand-crafted rules • Heidorn, 2000; Michaud et al., 2000; Bender et al., 2004 • Difficulties • writing rules manually is expensive • it is difficult to produce and maintain a large number of non-conflicting rules that cover a wide range of grammatical errors • learners with different first-language backgrounds and skill levels make different errors • some grammatical errors are hard to capture with rules at all

6. Related Work (cont.) • Category 2: statistical approaches • Chodorow and Leacock, 2000; Izumi et al., 2003; Brockett et al., 2006; Nagata et al., 2006 • Problems • they focus on a few pre-defined error types • the reported results are not attractive • errors must be specified and tagged in the training sentences • parallel tagged data are needed

7. Proposed Technique • Classification model • SVM (the SVMlight implementation) • Features • Labeled Sequential Patterns (LSP) – 1 feature • Complementary features • Lexical Collocation (LC) – 3 features • Perplexity from Language Model (PLM) – 2 features • Syntactic Score (SC) – 1 feature • Function Word Density (FWD) – 5 features

8. Proposed Technique — LSP (1) • A labeled sequential pattern (LSP) p is of the form <LHS, c> • LHS is a sequence <a1, ..., am> • each ai is called an “item” • c is a class label (here, correct/erroneous) • Sequence database D • the collection of labeled sequences (tuples) built from the training sentences, from which the LSPs are mined

9. Proposed Technique — LSP (2) • “Contain” relation (subsequence) • a sequence s1 = <a1, ..., am> is contained in a sequence s2 = <b1, ..., bn> if there exist integers i1, ..., im such that 1 ≤ i1 < i2 < ... < im ≤ n and a_j = b_{i_j} for all j in {1, ..., m} • A = <a b c d e f g h> has the subsequence B = <b d e g>, i.e. A contains B • an LSP p1 is contained by p2 if the sequence p1.LHS is contained by p2.LHS and p1.c = p2.c (see the sketch below)
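A minimal Python sketch of the “contain” relation, assuming sequences are plain iterables (the helper names are mine, not the paper's):

```python
def contains(seq, sub):
    """True iff sub is a subsequence of seq: all items of sub occur in seq
    in the same relative order, gaps allowed (unlike contiguous n-grams)."""
    it = iter(seq)
    return all(item in it for item in sub)  # 'in' advances the iterator

def lsp_contained(p1, p2):
    """An LSP p1 = (lhs, label) is contained by p2 if p1's LHS is a
    subsequence of p2's LHS and the class labels agree."""
    return p1[1] == p2[1] and contains(p2[0], p1[0])

# The slide's example: A = <a b c d e f g h> contains B = <b d e g>.
assert contains("abcdefgh", "bdeg")
```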

10. Proposed Technique — LSP (3) • Each LSP p is attached with two measures, support and confidence • The support of p (the generality of pattern p) • denoted sup(p) • the percentage of tuples in database D that contain the LSP p • The confidence of p (the predictive ability of p) • denoted conf(p) • computed as conf(p) = sup(p) / sup(p.LHS), where sup(p.LHS) is the percentage of tuples whose sequence contains p.LHS regardless of label

11. Proposed Technique — LSP (4) • Example • t1 = (<a, d, e, f>, E) • t2 = (<a, f, e, f>, E) • t3 = (<d, a, f>, C) • LSP p1 = (<a, e, f>, E) • contained in t1 and t2 • sup(p1) = 2/3 = 66.7% • conf(p1) = (2/3)/(2/3) = 100% • LSP p2 = (<a, f>, E) • sup(p2) = 2/3 = 66.7% (its LHS <a, f> appears in all three tuples, so sup(p2.LHS) = 3/3) • conf(p2) = (2/3)/(3/3) = 66.7%
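These numbers can be reproduced mechanically; a sketch over this slide's toy database (function names are illustrative):

```python
from fractions import Fraction

# The toy sequence database from this slide: (sequence, class-label) tuples.
D = [(("a", "d", "e", "f"), "E"),
     (("a", "f", "e", "f"), "E"),
     (("d", "a", "f"), "C")]

def contains(seq, sub):  # subsequence test, as in the earlier sketch
    it = iter(seq)
    return all(item in it for item in sub)

def sup(lhs, label):
    """Fraction of tuples containing the whole LSP (LHS and matching label)."""
    return Fraction(sum(contains(s, lhs) and c == label for s, c in D), len(D))

def conf(lhs, label):
    """sup(p) divided by the fraction of tuples containing just p.LHS."""
    lhs_sup = Fraction(sum(contains(s, lhs) for s, _ in D), len(D))
    return sup(lhs, label) / lhs_sup

print(sup(("a", "e", "f"), "E"), conf(("a", "e", "f"), "E"))  # 2/3 1
print(sup(("a", "f"), "E"), conf(("a", "f"), "E"))            # 2/3 2/3
```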

12. Proposed Technique — LSP (5) • Generating the sequence database • apply a part-of-speech (POS) tagger to each training sentence • MXPOST (Maximum Entropy POS Tagger Toolkit) supplies the POS tags • function words and time words are kept as-is; other words are replaced by their POS tags • each sentence together with its label becomes a database tuple • “In the past, John was kind to his sister” → “In the past, NNP was JJ to his NN” • Example LSPs • (<a, NNS>, Error), NNS: plural noun • (<yesterday, is>, Error)
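A sketch of this conversion step. The paper uses MXPOST; here NLTK's tagger stands in (assuming the nltk data models are installed), and the small KEEP set is a hypothetical sample of function/time words, not the paper's actual list:

```python
import nltk  # assumes 'punkt' and 'averaged_perceptron_tagger' are downloaded

# Hypothetical sample of the function/time words kept verbatim.
KEEP = {"in", "the", "to", "his", "was", "past", "yesterday", ",", "."}

def to_db_tuple(sentence, label):
    """Turn a training sentence into a (sequence, label) database tuple:
    function/time words stay as words, other words become their POS tag."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    seq = tuple(w.lower() if w.lower() in KEEP else t for w, t in tagged)
    return seq, label

print(to_db_tuple("In the past, John was kind to his sister", "Correct"))
# roughly: (('in', 'the', 'past', ',', 'NNP', 'was', 'JJ', 'to', 'his', 'NN'), 'Correct')
```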

13. Proposed Technique — LSP (6) • Mining LSPs • adapt the frequent-sequence mining algorithm of (Pei et al., 2001) • minimum support set at 0.1%, minimum confidence at 75% • Converting LSPs to features • the corresponding binary feature is set to 1 if the sentence contains the LSP (sketch below)
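Feature conversion is then one containment test per mined pattern; a minimal sketch (names are mine):

```python
def lsp_feature_vector(sentence_seq, mined_lsps):
    """One binary feature per mined LSP: 1 iff the sentence's tagged
    sequence contains the pattern's LHS (gaps allowed)."""
    def contains(seq, sub):
        it = iter(seq)
        return all(item in it for item in sub)
    return [int(contains(sentence_seq, lhs)) for lhs, _label in mined_lsps]

# e.g. lsp_feature_vector(("this", "NNS", "VBZ"),
#                         [(("this", "NNS"), "Error")])  ->  [1]
```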

14. Proposed Technique — LSP (7) • LSPs for erroneous sentences • <this, NNS> (“this books is stolen.”) • <past, is> (“in the past, John is kind to his sister.”) • <one, of, NN> (“it is one of important working language.”) • <although, but> (“although he likes it, but he can’t buy it.”) • <only, if, I, am> (“only if my teacher has given permission, I am allowed to enter this room.”) • LSPs for correct sentences • <would, VB> (“he would buy it.”) • <VBD, yesterday> (“I bought this book yesterday.”)

15. Proposed Technique — Other Linguistic Features (1) • Lexical Collocation (LC) • lexical collocations (“strong tea” / 濃茶, not “powerful tea”) • five types of collocations are collected • verb-object, adjective-noun, verb-adverb, subject-verb, and preposition-object, from a general English corpus • Correct LCs (CLs) • high-frequency collocations extracted from the corpus • Erroneous LC candidates • generated by replacing a word in a correct collocation with one of its confusion words, obtained from WordNet • experts then judge whether a candidate is a true erroneous collocation

16. Proposed Technique — Other Linguistic Features (2) • Three LC features are computed for each sentence (sketch below) • (1) (Σ_{i=1..m} p(co_i)) / n, the summed probability of the sentence's CLs divided by its collocation count • m is the number of CLs in the sentence • n is the number of collocations in the sentence • the probability p(co_i) of each CL co_i is calculated with the method of (Lu and Zhou, 2004) • (2) the ratio of unknown collocations (neither correct nor erroneous LCs) to all collocations in the sentence • (3) the ratio of erroneous LCs to all collocations in the sentence
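A sketch of the three features under the reconstruction above; the correct/erroneous LC sets and the probability function p are assumed inputs (stand-ins for the corpus statistics and for Lu and Zhou, 2004):

```python
def lc_features(collocations, correct_lcs, erroneous_lcs, p):
    """The three collocation features sketched on this slide.
    `collocations`: collocations extracted from one sentence;
    `correct_lcs` / `erroneous_lcs`: the precompiled CL / erroneous-LC sets;
    `p`: probability of a CL (stand-in for the Lu-and-Zhou method)."""
    n = len(collocations) or 1          # guard against empty sentences
    cls_ = [co for co in collocations if co in correct_lcs]
    errs = [co for co in collocations if co in erroneous_lcs]
    unknown = [co for co in collocations
               if co not in correct_lcs and co not in erroneous_lcs]
    f1 = sum(p(co) for co in cls_) / n  # (1) summed CL probability / n
    f2 = len(unknown) / n               # (2) unknown-collocation ratio
    f3 = len(errs) / n                  # (3) erroneous-LC ratio
    return f1, f2, f3
```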

17. Proposed Technique — Other Linguistic Features (3) • Perplexity from Language Model (PLM) • extracted from a trigram language model • built with SRILM, the SRI Language Modeling Toolkit (Stolcke, 2002) • two values are calculated for each sentence • lexicalized trigram perplexity • POS trigram perplexity • erroneous sentences are expected to have higher perplexity
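The paper computes these values with SRILM; purely as an illustration, per-token trigram perplexity from an arbitrary log-probability lookup (the `logprob` hook is hypothetical, e.g. backed by an ARPA file that SRILM produced) might be sketched as:

```python
def trigram_perplexity(tokens, logprob):
    """Perplexity of a sentence under a trigram model.
    `logprob(w3, (w1, w2))` must return log10 P(w3 | w1 w2), e.g. looked
    up from a trained trigram model (hypothetical hook, not an SRILM API)."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    lp = sum(logprob(padded[i], tuple(padded[i - 2:i]))
             for i in range(2, len(padded)))
    return 10 ** (-lp / (len(tokens) + 1))  # per-token, counting </s>
```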

18. Proposed Technique — Other Linguistic Features (4) • Syntactic Score (SC) • a statistical parser (Collins, 1997) assigns each sentence a score • the log probability of the best parse • assumption: erroneous sentences with undesirable structures are more likely to receive lower scores

19. Proposed Technique — Other Linguistic Features (5) • Function Word Density (FWD) • the ratio of function words to content words • inspired by (Corston-Oliver et al., 2001), where it was effective in distinguishing human references from machine outputs • seven kinds of function words are used (sketch below)
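A sketch of the density itself; the word list here is a tiny hypothetical sample, not the paper's seven categories:

```python
# Hypothetical sample; the paper uses seven kinds of function words
# (determiners, prepositions, pronouns, conjunctions, auxiliaries, ...).
FUNCTION_WORDS = {"the", "a", "an", "in", "on", "to", "of", "and", "but",
                  "he", "she", "it", "is", "was"}

def function_word_density(tokens):
    """Ratio of function words to content words in a token list."""
    fn = sum(t.lower() in FUNCTION_WORDS for t in tokens)
    content = len(tokens) - fn
    return fn / content if content else float("inf")
```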

20. Experimental Evaluation (1) – Experimental Setup • Classification model: SVM • each non-binary feature X is normalized by its z-score, x → (x − μX)/σX, over the training data • two data sets: Japanese Corpus (JC) and Chinese Corpus (CC)
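The z-score is the standard (x − μ)/σ transform; a minimal sketch:

```python
import statistics

def z_normalize(values):
    """z-score normalization: x -> (x - mean) / stdev, computed over the
    training values of one non-binary feature."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(x - mu) / sigma for x in values]

print(z_normalize([1.0, 2.0, 3.0, 4.0]))  # roughly [-1.16, -0.39, 0.39, 1.16]
```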

  21. Experimental Evaluation (2)

22. Experimental Evaluation (3) • ALEK (Chodorow and Leacock, 2000) from the Educational Testing Service (ETS) • 694 parallel sentences • 1,671 non-parallel sentences • different cultures (Japanese/Chinese as first language)

23. Experimental Evaluation (4) • Two LDC data sets: low-ranked and high-ranked machine translations • 14,604 low-ranked MT outputs (score 1-3) • 808 high-ranked MT outputs (score 3-5) • both with corresponding human reference translations • human references treated as correct, MT outputs as erroneous

24. Conclusions and Future Work • Conclusions • the paper proposed mining LSPs as features for classification models • LSPs proved much more effective than the other linguistic features • the other features were still beneficial • Future work • use LSPs to provide detailed feedback for ESL learners • integrate the features more effectively • further investigate the application to MT evaluation

  25. Thanks!!
