A New Approach for Cross-Language Plagiarism Analysis

A New Approach for Cross-Language Plagiarism Analysis Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante Universidade Federal do Rio Grande do Sul CLEF 2010

Outline • Introduction • Related Work • The Proposed Approach • Experiments • Summary and Future Work

Introduction • Plagiarism is one of the most serious forms of academic misconduct • It is defined as “the use of another person's written work without acknowledging the source” • A study with over 80,000 students in the US and Canada found that many of them had already committed a plagiarism offense • 36% of undergraduate students • 24% of graduate students • Several types • Word-for-word, paraphrasing, text translation, etc.

Introduction • Cross-language plagiarism is becoming more common • Evolution of automatic translation systems • Increasing availability of textual content in many different languages • Common scenario • A student downloads a paper, translates it using an automatic translation tool, corrects some translation errors, and presents it as his or her own work • It can also involve self-plagiarism • Usually aims at increasing the number of publications

Introduction • What is the task? • Detect the plagiarized passages in the suspicious documents and their corresponding text fragments in the source documents, even if the documents are written in different languages • Known as external plagiarism analysis

Related Work • Monolingual Plagiarism Analysis • Fingerprints, fuzzy-fingerprints, ... • Cross-Language Plagiarism Analysis • Statistical bilingual dictionary + bilingual text alignment • Use EuroWordNet to transform words into a language-independent representation • PAN competition • Enables different methods to be compared against each other • It was held as an evaluation lab in conjunction with CLEF 2010

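The Proposed Approach [Figure: pipeline overview] • (1) Language Normalization: original and suspicious documents are converted into normalized documents • (2) Retrieval: for each suspicious document, the index is queried to obtain candidate documents • (3) Feature Selection + Classifier Training: a training corpus is used to build the classification model • (4) Plagiarism Analysis: the classification model is applied to the candidate documents to produce a preliminary result • (5) Post-Processing: the preliminary result is turned into the final result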

(1) Language Normalization • All documents are converted into a common language • English was chosen • More translation resources • One of the easiest languages to translate into • Used a language guesser and an automatic translation tool

(2) Retrieval of Candidate Documents • Problem: It is not feasible to perform exhaustive comparisons • Solution: Use passages from the suspicious document as a query to be sent to an IR system • Note that documents are divided into subdocuments (paragraphs) in order to reduce the amount of text that must be analyzed • At the end of this phase, we have a list of at most ten candidate subdocuments for each passage in the suspicious document
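The retrieval phase above can be illustrated with a minimal sketch. It assumes paragraphs are separated by blank lines and replaces the IR system with a toy TF-IDF scorer; the names `split_into_subdocuments` and `ToyIndex` are hypothetical and are not part of the system presented in the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_subdocuments(doc_id, text):
    """Split a document into paragraph-level subdocuments.

    Assumption: paragraphs are separated by blank lines.
    Returns a list of ((doc_id, paragraph_number), paragraph_text) pairs.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [((doc_id, i), p) for i, p in enumerate(paragraphs)]


class ToyIndex:
    """Toy stand-in for an IR system (hypothetical, illustration only)."""

    def __init__(self, subdocs):
        self.subdocs = dict(subdocs)
        self.tf = {key: Counter(tokenize(text)) for key, text in self.subdocs.items()}
        df = Counter()
        for counts in self.tf.values():
            df.update(counts.keys())
        n = len(self.subdocs)
        self.idf = {term: math.log(1 + n / d) for term, d in df.items()}

    def search(self, passage, k=10):
        """Use a suspicious passage as the query; return at most k (key, score) hits."""
        query = Counter(tokenize(passage))
        scored = []
        for key, counts in self.tf.items():
            score = sum(query[t] * counts[t] * self.idf[t] for t in query if t in counts)
            if score > 0:
                scored.append((score, key))
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [(key, score) for score, key in scored[:k]]
```

The default `k=10` mirrors the limit of ten candidate subdocuments per passage; the scoring formula itself is only a placeholder for whatever weighting model the IR system applies.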

(3) Feature Selection and Classifier Training • The goal is to build a classification model that can learn how to distinguish between a plagiarized and a non-plagiarized text passage • Annotated synthetic examples used for training • J48 classification algorithm • Features • The cosine similarity between the suspicious passage and the candidate subdocument • The similarity score assigned by the IR system • The position of the candidate subdocument in the generated ranking • The length (in characters) of the suspicious passage and of the candidate subdocument
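As a rough illustration of the feature list above, the sketch below builds one feature record for a (suspicious passage, candidate subdocument) pair. It assumes raw term-frequency vectors for the cosine similarity, which the slides do not specify; `build_features` and its field names are hypothetical. Training the classifier itself is left out.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity over raw term-frequency vectors (weighting is an assumption)."""
    va = Counter(re.findall(r"\w+", text_a.lower()))
    vb = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_features(passage, candidate, ir_score, rank):
    """Assemble the features named on the slide (field names are hypothetical)."""
    return {
        "cosine": cosine_similarity(passage, candidate),
        "ir_score": ir_score,             # similarity score assigned by the IR system
        "rank": rank,                     # position of the candidate in the ranking
        "passage_len": len(passage),      # length in characters
        "candidate_len": len(candidate),  # length in characters
    }
```

Each record, paired with a plagiarized or non-plagiarized label from the annotated training examples, would form one training instance for the classifier.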

(4) Plagiarism Analysis • Submit the test instances to the trained classifier and let it decide whether the suspicious passage is, in fact, plagiarized from one of the candidate subdocuments • [Figure: each passage of the suspicious document is used to query the index; the retrieved subdocuments (SubDoc 1 to SubDoc 10) are passed to the classifier, which outputs a plagiarized or non-plagiarized class label]
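The decision loop described above might look like the following sketch. The classifier is passed in as a plain callable, and stopping at the first candidate labeled as plagiarized (in rank order) is an assumption; the slides do not say how several positive candidates are resolved. `analyze_passages` and `threshold_rule` are hypothetical names.

```python
def analyze_passages(candidates_by_passage, classify):
    """Label each suspicious passage using its ranked candidate subdocuments.

    candidates_by_passage: {passage_id: [(subdoc_id, features), ...]} in rank order.
    classify: callable mapping a feature record to "plagiarized" or "non-plagiarized".
    Returns {passage_id: subdoc_id} for passages judged to be plagiarized.
    Assumption: the first candidate labeled "plagiarized" is taken as the source.
    """
    detections = {}
    for passage_id, candidates in candidates_by_passage.items():
        for subdoc_id, features in candidates:
            if classify(features) == "plagiarized":
                detections[passage_id] = subdoc_id
                break
    return detections


def threshold_rule(features):
    """Hypothetical stand-in for a trained model: a single cosine threshold."""
    return "plagiarized" if features["cosine"] >= 0.8 else "non-plagiarized"
```

In practice `classify` would wrap the trained classification model rather than a fixed threshold.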

(5) Result Post-Processing • Join the contiguous plagiarized passages detected by the method in order to decrease the granularity score • The granularity score is a measure that assesses whether the plagiarism method reports a plagiarized passage as a whole or as several small plagiarized passages
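The joining step above can be sketched as an interval merge. The sketch assumes each detection is a `(start, end, source_doc)` tuple with half-open character offsets in the suspicious document, and that only detections pointing to the same source document are joined; both the representation and the `max_gap` parameter are assumptions made for illustration.

```python
def merge_contiguous(detections, max_gap=0):
    """Join contiguous plagiarized passages that point to the same source.

    detections: iterable of (start, end, source_doc) with half-open offsets.
    max_gap: largest number of characters allowed between two passages
             for them to still be joined (hypothetical parameter).
    """
    merged = []
    for start, end, source in sorted(detections):
        if merged and merged[-1][2] == source and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            # Extend the previous detection instead of reporting a new one.
            merged[-1] = (prev_start, max(prev_end, end), source)
        else:
            merged.append((start, end, source))
    return merged
```

Reporting one long detection instead of several short ones is what lowers the granularity score, since fewer detections then cover each plagiarized passage.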

Experiments • Multilingual Test Collection • ECLaPA collection assembled from the Europarl Parallel Corpus (English, Portuguese and French) • An analogous monolingual corpus was also assembled • Available at http://www.inf.ufrgs.br/~viviane/eclapa.html • Terrier IR System (Porter Stemmer + Stop-Word Removal) • Weka (J48 classification algorithm) • Google Translator (as language guesser) • LEC Power Translator • Evaluation Measures (PAN competition)

Experiments - Results • Monolingual vs. Multilingual • Recall was the most affected measure • Loss of information due to the translation process • The cross-language experiment achieved 86% of the overall score of the monolingual baseline

Experiments - Results • Detailed analysis • The larger the passage, the easier the detection • Plagiarized passages detected: 90% (monolingual) vs. 77% (multilingual)

Summary • We proposed and evaluated a new method for cross-language plagiarism analysis (CLPA) • Used a classification algorithm to decide whether a text passage is plagiarized or not • We assembled an artificial cross-language plagiarism test collection to evaluate the method • It is freely available • Cross-language experiment achieved 86% of the performance of the monolingual baseline

Future Work • Improve the time spent during the analysis of each suspicious document • Analyze each suspicious passage on a different computer? • Test other features during the classifier training phase • Evaluate the method while detecting plagiarism between documents written in unrelated languages • English vs. Chinese/Japanese • Many plagiarism cases happen between these pairs of languages • Citation analysis

A New Approach for Cross-Language Plagiarism Analysis Questions? Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante Universidade Federal do Rio Grande do Sul CLEF 2010
