
A New Approach for Cross-Language Plagiarism Analysis

Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante. Universidade Federal do Rio Grande do Sul. CLEF 2010.


Presentation Transcript


  1. A New Approach for Cross-Language Plagiarism Analysis Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante Universidade Federal do Rio Grande do Sul CLEF 2010

  2. Outline • Introduction • Related Work • The Proposed Approach • Experiments • Summary and Future Work

  3. Introduction • Plagiarism is one of the most serious forms of academic misconduct • It is defined as “the use of another person's written work without acknowledging the source” • A study with over 80,000 students in the US and Canada found that many of them had already committed a plagiarism offense • 36% of undergraduate students • 24% of graduate students • Several types • Word-for-word, paraphrasing, text translation, etc.

  4. Introduction • Cross-language plagiarism is becoming more common • Evolution of automatic translation systems • Increasing availability of textual content in many different languages • Common scenario • A student downloads a paper, translates it using an automatic translation tool, corrects some translation errors and presents it as their own work • It can also involve self-plagiarism • Usually aims at increasing the number of publications

  5. Introduction • What is the task? • Detect the plagiarized passages in the suspicious documents and their corresponding text fragments in the source documents even if the documents are written in different languages • Known as External plagiarism analysis

  6. Outline • Introduction • Related Work • The Proposed Approach • Experiments • Summary and Future Work

  7. Related Work • Monolingual Plagiarism Analysis • Fingerprints, fuzzy-fingerprints, ... • Cross-Language Plagiarism Analysis • Statistical bilingual dictionary + bilingual text alignment • Use EuroWordNet to transform words into a language-independent representation • PAN competition • Enables different methods to be compared against each other • It was held as an evaluation lab in conjunction with CLEF 2010

  8. Outline • Introduction • Related Work • The Proposed Approach • Experiments • Summary and Future Work

  9. The Proposed Approach • [Diagram of the five-stage pipeline] • Inputs: Original Documents, Suspicious Documents, Training Corpus • (1) Language Normalization, producing normalized suspicious and original documents • (2) Retrieval: for each suspicious document, candidate documents are retrieved from the index • (3) Feature Selection + Classifier Training, producing the classification model • (4) Plagiarism Analysis, producing a preliminary result • (5) Result Post-Processing, producing the final result

  10. (1) Language Normalization • All documents are converted into a common language • English was chosen • More translation resources • One of the easiest languages to translate into • Used a language guesser and an automatic translation tool

  11. The Proposed Approach • [Pipeline diagram repeated from slide 9]

  12. (2) Retrieval of Candidate Documents • Problem: It is not feasible to perform exhaustive comparisons • Solution: Use passages from the suspicious document as a query to be sent to an IR system • Note that documents are divided into subdocuments (paragraphs) in order to reduce the amount of text that must be analyzed • At the end of this phase, we have a list of at most ten candidate subdocuments for each passage in the suspicious document
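
The retrieval phase on slide 12 can be sketched as follows. This is only an illustration: the slides use the Terrier IR system, whereas this sketch substitutes a hand-rolled TF-IDF overlap score, and every function name here (`split_into_subdocuments`, `build_index`, `retrieve`) is hypothetical. Splitting paragraphs on blank lines is also an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (no stemming or stop-word removal)."""
    return re.findall(r"[a-z]+", text.lower())


def split_into_subdocuments(doc_id, text):
    """Split a source document into paragraph-level subdocuments.

    Assumption: paragraphs are separated by blank lines.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [((doc_id, i), p) for i, p in enumerate(paragraphs)]


def build_index(subdocs):
    """Return per-subdocument term counts and inverse document frequencies."""
    counts = {sid: Counter(tokenize(text)) for sid, text in subdocs}
    doc_freq = Counter(term for c in counts.values() for term in c)
    n = len(counts)
    idf = {term: math.log(1 + n / df) for term, df in doc_freq.items()}
    return counts, idf


def retrieve(passage, counts, idf, k=10):
    """Use a suspicious passage as the query; keep at most k ranked subdocuments."""
    query = Counter(tokenize(passage))
    scored = []
    for sid, c in counts.items():
        score = sum(q * c[t] * idf[t] for t, q in query.items() if t in c)
        if score > 0:
            scored.append((score, sid))
    # Highest score first; ties broken by subdocument id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(sid, score) for score, sid in scored[:k]]
```

Each suspicious passage is queried separately, so the output of this phase is one ranked list of at most ten subdocument identifiers per passage.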

  13. The Proposed Approach • [Pipeline diagram repeated from slide 9]

  14. (3) Feature Selection and Classifier Training • The goal is to build a classification model that can learn how to distinguish between a plagiarized and a non-plagiarized text passage • Annotated synthetic examples used for training • J48 classification algorithm • Features • The cosine similarity between the suspicious passage and the candidate subdocument • The similarity score assigned by the IR system • The position of the candidate subdocument in the ranking generated by the IR system • The lengths (in characters) of the suspicious passage and the candidate subdocument
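
The features listed on slide 14 could be assembled roughly as below. This is a sketch under assumptions: the slides do not say how terms are weighted for the cosine similarity, so raw term frequencies are used here, and the function and key names are hypothetical. The resulting dictionaries would then be exported to whatever format the classifier expects (the slides use Weka's J48, which is not reproduced here).

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (assumed weighting: raw counts)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    va, vb = term_vector(text_a), term_vector(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def feature_vector(passage, candidate, ir_score, rank):
    """One training or test instance for a (passage, candidate) pair."""
    return {
        "cosine": cosine_similarity(passage, candidate),
        "ir_score": ir_score,                # score assigned by the IR system
        "rank": rank,                        # position in the ranked list
        "passage_length": len(passage),      # length in characters
        "candidate_length": len(candidate),  # length in characters
    }
```

For training, each instance would additionally carry the annotated label (plagiarized or non-plagiarized) from the synthetic examples.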

  15. The Proposed Approach • [Pipeline diagram repeated from slide 9]

  16. (4) Plagiarism Analysis • Submit the test instances to the trained classifier and let it decide whether the suspicious passage is, in fact, plagiarized from one of the candidate subdocuments • [Diagram] Each passage of the suspicious document (Passage 1, Passage 2, ...) is sent to the index; the retrieved subdocuments (SubDoc 1 to SubDoc 10) are passed to the classifier, which outputs a Plagiarized or Non-Plagiarized class label
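
The analysis step on slide 16 amounts to a loop over passages and their candidates. The sketch below is illustrative only: `analyze` and `threshold_rule` are hypothetical names, the threshold rule is a stand-in for the trained J48 model (not the authors' classifier), and stopping at the first positive candidate per passage is an assumption the slides do not confirm.

```python
def analyze(candidates_by_passage, is_plagiarized):
    """Return (passage_id, subdoc_id) pairs that the classifier flags.

    candidates_by_passage maps each passage id to its ranked list of
    (subdoc_id, features) pairs. Assumption: only the first positive
    candidate per passage is reported.
    """
    detections = []
    for passage_id, candidates in candidates_by_passage.items():
        for subdoc_id, features in candidates:
            if is_plagiarized(features):
                detections.append((passage_id, subdoc_id))
                break
    return detections


def threshold_rule(features, min_cosine=0.5):
    """Stand-in classifier: a single hypothetical threshold on cosine similarity."""
    return features["cosine"] >= min_cosine
```

Any callable that maps a feature dictionary to a boolean can be passed as `is_plagiarized`, so a real trained model can replace the threshold rule without changing the loop.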

  17. The Proposed Approach • [Pipeline diagram repeated from slide 9]

  18. (5) Result Post-Processing • Join the contiguous plagiarized passages detected by the method in order to decrease the granularity score • The granularity score is a measure that assesses whether the plagiarism method reports a plagiarized passage as a whole or as several small plagiarized passages
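
The effect of joining contiguous detections on granularity can be illustrated with a small sketch. Assumptions: detections and true cases are represented as hypothetical `(source_id, start, end)` tuples with character offsets in the suspicious document only, and granularity is computed in a simplified form (the average number of detections overlapping each detected case); the PAN measure also takes source offsets into account.

```python
def merge_contiguous(detections, max_gap=0):
    """Join detections from the same source that touch or overlap.

    max_gap (hypothetical parameter) allows joining passages separated
    by up to that many characters.
    """
    merged = []
    for source_id, start, end in sorted(detections):
        if merged and merged[-1][0] == source_id and start <= merged[-1][2] + max_gap:
            prev = merged[-1]
            merged[-1] = (source_id, prev[1], max(prev[2], end))
        else:
            merged.append((source_id, start, end))
    return merged


def granularity(cases, detections):
    """Average number of detections overlapping each detected true case."""
    per_case = []
    for c_src, c_start, c_end in cases:
        hits = sum(
            1
            for d_src, d_start, d_end in detections
            if d_src == c_src and d_start < c_end and c_start < d_end
        )
        if hits:
            per_case.append(hits)
    return sum(per_case) / len(per_case) if per_case else 1.0
```

A plagiarized case reported as three adjacent fragments has granularity 3 under this definition; after merging, it is reported once and granularity drops to the ideal value of 1.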

  19. Outline • Introduction • Related Work • The Proposed Approach • Experiments • Summary and Future Work

  20. Experiments • Multilingual Test Collection • ECLaPA collection assembled from the Europarl Parallel Corpus (English, Portuguese and French) • An analogous monolingual corpus was also assembled • Available at http://www.inf.ufrgs.br/~viviane/eclapa.html • Terrier IR System (Porter Stemmer + Stop-Word Removal) • Weka (J48 classification algorithm) • Google Translator (as language guesser) • LEC Power Translator • Evaluation Measures (PAN competition)

  21. Experiments - Results • Monolingual vs. Multilingual • Recall was the most affected measure • Loss of information due to the translation process • The cross-language run reached 86% of the overall score of the monolingual baseline
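
Assuming the "overall score" refers to the PAN competition's plagdet measure, it combines precision, recall and granularity as F1 divided by log2(1 + granularity). The sketch below shows that formula only; it does not reproduce any number from the slides.

```python
import math


def plagdet(precision, recall, granularity):
    """Overall detection score: F1 discounted by granularity.

    With granularity = 1 (each case reported exactly once) the
    denominator is log2(2) = 1 and the score equals F1.
    """
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```

Because recall enters through F1, a drop in recall caused by translation lowers the overall score even when precision and granularity are unchanged.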

  22. Experiments - Results • Detailed analysis • The larger the passage, the easier the detection • Plagiarized passages detected: 90% (monolingual) vs. 77% (multilingual)

  23. Outline • Introduction • Related Work • The Proposed Approach • Experiments • Summary and Future Work

  24. Summary • We proposed and evaluated a new method for cross-language plagiarism analysis (CLPA) • Used a classification algorithm to decide whether a text passage is plagiarized or not • We assembled an artificial cross-language plagiarism test collection to evaluate the method • It is freely available • Cross-language experiment achieved 86% of the performance of the monolingual baseline

  25. Future Work • Reduce the time spent analyzing each suspicious document • Analyze each suspicious passage on a different computer? • Test other features during the classifier training phase • Evaluate the method on plagiarism between documents written in unrelated languages • English vs. Chinese/Japanese • Many plagiarism cases happen between these pairs of languages • Citation analysis

  26. Outline • Introduction • Related Work • The Proposed Approach • Experiments • Summary and Future Work

  27. A New Approach for Cross-Language Plagiarism Analysis Questions? Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante Universidade Federal do Rio Grande do Sul CLEF 2010
