Enhanced Authorship Attribution Using Probabilistic Context-Free Grammars for Document Analysis

Authorship Attribution Using Probabilistic Context-Free Grammars • Sindhu Raghavan, Adriana Kovashka, Raymond Mooney • The University of Texas at Austin

Authorship Attribution • Task of identifying the author of a document • Applications • Forensics (Luyckx and Daelemans, 2008) • Cyber crime investigation (Zheng et al., 2009) • Automatic plagiarism detection (Stamatatos, 2009) • The Federalist Papers study (Mosteller and Wallace, 1984) • The Federalist Papers are a set of essays on the US Constitution • The authorship of these papers was unknown at the time of publication • Statistical analysis was used to identify the authors of these documents

Existing Approaches • Style markers (function words) as features for classification (Mosteller and Wallace, 1984; Burrows, 1987; Holmes and Forsyth, 1995; Joachims, 1998; Binongo and Smith, 1999; Stamatatos et al., 1999; Diederich et al., 2000; Luyckx and Daelemans, 2008) • Character-level n-grams (Peng et al., 2003) • Syntactic features from parse trees (Baayen et al., 1996) • Limitations • Capture mostly lexical information • Do not necessarily capture the author’s syntactic style

Our Approach • Use a probabilistic context-free grammar (PCFG) to capture the syntactic style of the author • Construct a PCFG based on the documents written by the author and use it as a language model for classification • Requires annotated parse trees of the documents • How do we obtain these annotated parse trees?

Algorithm – Step 1 • Training documents for each author (Alice, Bob, Mary, John) • Treebank each document using a statistical parser trained on a generic corpus • Stanford parser (Klein and Manning, 2003) • WSJ or Brown corpus from Penn Treebank (http://www.cis.upenn.edu/~treebank)

Algorithm – Step 2 • Probabilistic Context-Free Grammars • Train a PCFG for each author using the treebanked documents from Step 1 • Example grammars, one per author:
Alice: S → NP VP .8; S → VP .2; NP → Det A N .4; NP → NP PP .35; NP → PropN .25; ...
Bob: S → NP VP .7; S → VP .3; NP → Det A N .6; NP → NP PP .25; NP → PropN .15; ...
Mary: S → NP VP .9; S → VP .1; NP → Det A N .3; NP → NP PP .5; NP → PropN .2; ...
John: S → NP VP .5; S → VP .5; NP → Det A N .8; NP → NP PP .1; NP → PropN .1; ...
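As a rough illustration of Step 2, the sketch below estimates rule probabilities by relative frequency from a toy treebank. The slides do not say which estimator was used, so relative-frequency estimation is an assumption here, and the nested-tuple tree format, the function names `productions` and `train_pcfg`, and the toy trees are all hypothetical.

```python
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple parse tree.

    Assumed (hypothetical) tree format: (label, child, child, ...), where each
    child is either another tuple (a subtree) or a string (a word).
    """
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield label, rhs
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)

def train_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for tree in treebank:
        for lhs, rhs in productions(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Toy treebank for one author; real input would be the parser output from Step 1.
alice_trees = [
    ("S", ("NP", ("PropN", "Alice")), ("VP", ("V", "writes"))),
    ("S", ("VP", ("V", "run"))),
]
alice_pcfg = train_pcfg(alice_trees)
```

Running `train_pcfg` once per author on that author's treebank would give one grammar per author, in the spirit of the four example grammars on the slide.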

Algorithm – Step 3 • Multiply the probability of the top parse for each sentence in the test document under each author's PCFG • The author whose PCFG gives the highest score is the label for the test document (John in this example)
Alice: S → NP VP .8; S → VP .2; NP → Det A N .4; NP → NP PP .35; NP → PropN .25; score for test document: .6
Bob: S → NP VP .7; S → VP .3; NP → Det A N .6; NP → NP PP .25; NP → PropN .15; score for test document: .5
Mary: S → NP VP .9; S → VP .1; NP → Det A N .3; NP → NP PP .5; NP → PropN .2; score for test document: .33
John: S → NP VP .5; S → VP .5; NP → Det A N .8; NP → NP PP .1; NP → PropN .1; score for test document: .75
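The scoring rule in Step 3 can be sketched as follows. The per-sentence top-parse probabilities are assumed to come from a parser run with each author's grammar; they are plain numbers here. Summing logarithms instead of multiplying raw probabilities is an implementation choice added for numerical stability, not something the slides state, and the function names are hypothetical.

```python
import math

def document_log_score(sentence_probs):
    """Log of the product of top-parse probabilities, one per sentence.

    Assumes every probability is > 0; summing logs avoids floating-point
    underflow on long documents.
    """
    return sum(math.log(p) for p in sentence_probs)

def attribute(scores_by_author):
    """Return the author whose grammar gives the test document the highest score."""
    return max(scores_by_author, key=scores_by_author.get)

# Document-level scores taken from the slide's example.
slide_scores = {"Alice": 0.6, "Bob": 0.5, "Mary": 0.33, "John": 0.75}
label = attribute(slide_scores)

# Hypothetical per-sentence probabilities for a two-sentence document.
log_scores = {
    "Alice": document_log_score([0.02, 0.01]),
    "John": document_log_score([0.05, 0.03]),
}
label_from_logs = attribute(log_scores)
```

Because the logarithm is monotonic, ranking authors by summed log probability gives the same label as ranking by the product.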

Experimental Evaluation

Data • [Table of data sets: blue entries are news articles, red entries are literary works] • Data sets available at www.cs.utexas.edu/users/sindhu/acl2010

Methodology • Bag-of-words model (baseline) • Naïve Bayes, MaxEnt • N-gram models (baseline) • N=1,2,3 • Basic PCFG model • PCFG-I (Interpolation)
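For orientation, an n-gram baseline of the kind listed above might look like the sketch below: one bigram language model per author, with the test document assigned to the author whose model gives it the highest likelihood. The slides do not describe tokenization or smoothing, so whitespace tokenization, lowercasing, and add-one smoothing are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(docs):
    """Count bigrams and their left contexts over whitespace-tokenized documents."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for doc in docs:
        tokens = ["<s>"] + doc.lower().split()
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, word)] += 1
    return contexts, bigrams, vocab

def log_likelihood(model, doc, vocab_size):
    """Add-one smoothed bigram log-likelihood of a document (smoothing is an assumption)."""
    contexts, bigrams, _ = model
    tokens = ["<s>"] + doc.lower().split()
    return sum(
        math.log((bigrams[(p, w)] + 1) / (contexts[p] + vocab_size))
        for p, w in zip(tokens, tokens[1:])
    )

def classify(models, doc):
    """Pick the author whose bigram model assigns the document the highest likelihood."""
    # Shared vocabulary across authors, plus one slot for unseen words.
    vocab_size = len(set().union(*(m[2] for m in models.values()))) + 1
    return max(models, key=lambda a: log_likelihood(models[a], doc, vocab_size))

# Toy training data for two hypothetical authors.
models = {
    "A": train_bigram(["the cat sat on the mat"]),
    "B": train_bigram(["stocks fell sharply on monday"]),
}
prediction = classify(models, "the cat sat")
```

Such a model captures word sequences only, which is the lexical information the PCFG models are meant to complement.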

Basic PCFG • Train a PCFG based only on the documents written by the author • Poor performance when few documents are available for training • One remedy: increase the number of documents in the training set • Forensics: we do not always have access to many documents written by the same author • Need for alternate techniques when few documents are available for training

PCFG-I • Uses the method of interpolation for smoothing • Augment the training data by adding sections of the WSJ/Brown corpus • Up-sample data for the author
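One way to picture the PCFG-I idea is to pool rule counts from the author's treebank and a generic treebank, weighting the author's counts more heavily, and then normalize. The sketch below does exactly that; the slides give no details on how the up-sampling or interpolation was carried out, so the pooling scheme, the `upsample` factor of 3, and the function names are assumptions for illustration only.

```python
from collections import Counter

def estimate(rule_counts):
    """Normalize rule counts into probabilities per left-hand side."""
    lhs_totals = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

def train_pcfg_i(author_rules, generic_rules, upsample=3):
    """Pool generic-corpus rule counts with author rule counts repeated `upsample` times.

    Each input is a list of (lhs, rhs) productions read off a treebank.
    """
    counts = Counter(generic_rules)
    for rule, n in Counter(author_rules).items():
        counts[rule] += upsample * n
    return estimate(counts)

# Toy productions: the author only ever uses S -> NP VP.
author_rules = [("S", ("NP", "VP")), ("S", ("NP", "VP"))]
generic_rules = [("S", ("NP", "VP")), ("S", ("VP",)), ("S", ("VP",)), ("S", ("VP",))]
grammar = train_pcfg_i(author_rules, generic_rules, upsample=3)
```

In this toy case the rule S → VP, which never occurs in the author's data, still receives non-zero probability from the generic corpus, which is the smoothing effect the slide is after.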

Results

Performance of Baseline Models • [Chart: accuracy in % by data set] • Inconsistent performance for baseline models – the same model does not necessarily perform poorly on all data sets

Performance of PCFG and PCFG-I • [Chart: accuracy in % by data set] • PCFG-I performs better than the basic PCFG model on most data sets

PCFG Models vs. Baseline Models • [Chart: accuracy in % by data set] • Best PCFG model outperforms the worst baseline for all data sets, but does not outperform the best baseline for all data sets

PCFG-E • PCFG models do not always outperform N-gram models • Lexical features from N-gram models useful for distinguishing between authors • PCFG-E(Ensemble) • PCFG-I (best PCFG model) • Bigram model (best N-gram model) • MaxEnt based bag-of-words (discriminative classifier)
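The slides list the three ensemble members but do not say how their outputs are combined. Purely as an illustration, the sketch below uses majority voting over the three predicted labels, deferring to one designated model when all three disagree. Majority voting, the model keys, and the function name are assumptions, not the authors' stated method.

```python
from collections import Counter

def majority_vote(predictions, fallback):
    """Combine three models' predicted authors by majority vote.

    predictions: dict mapping model name -> predicted author (three models assumed).
    fallback: name of the model to trust when all three models disagree.
    """
    counts = Counter(predictions.values())
    top_author, top_votes = counts.most_common(1)[0]
    # With three models, fewer than two votes means every model chose a different author.
    if top_votes < 2:
        return predictions[fallback]
    return top_author

# Hypothetical predictions from the three ensemble members.
votes = {"pcfg_i": "John", "bigram": "John", "maxent_bow": "Alice"}
ensemble_label = majority_vote(votes, fallback="pcfg_i")
```

Any scheme that lets the lexical models correct the syntactic one (and vice versa) would serve the same illustrative purpose.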

Performance of PCFG-E • [Chart: accuracy in % by data set] • PCFG-E outperforms or matches the best baseline on all data sets

Significance of PCFG (PCFG-E – PCFG-I) • [Chart: accuracy in % by data set] • Drop in performance on removing PCFG-I from PCFG-E on most data sets

Conclusions • PCFGs are useful for capturing the author’s syntactic style • Novel approach for authorship attribution using PCFGs • Both syntactic and lexical information are necessary to capture an author’s writing style

Thank You
