Authorship Attribution CS533 – Information Retrieval Systems Metin KOÇ Metin TEKKALMAZ Yiğithan DEDEOĞLU 7 April 2006
Outline • Overview • What is Authorship Attribution? • Brief History • Where and How to use it? • Stylometry • Style Markers • Classification Methods • Naïve Bayes • Support Vector Machine • k-Nearest Neighbor
What is Authorship Attribution? • Determining who wrote a text when its authorship is unclear. • It is useful when two or more people claim to have written something, or when no one is willing (or able) to say that (s)he wrote the piece. • In a typical scenario, a set of documents with known authorship is used for training; the problem is then to identify which of these authors wrote the unattributed documents.
A Brief History • The advent of non-traditional authorship attribution techniques can be traced back to 1887, when Mendenhall first proposed counting features such as word length. • His work was followed by Yule (1938) and Morton (1965), who used sentence lengths to judge authorship.
Where to use it? • Authorship attribution can be used in a broad range of applications • To analyze anonymous or disputed documents/books, such as the plays of Shakespeare (shakespeareauthorship.com) • Plagiarism detection - establishing whether claimed authorship is valid.
Where to use it? (Cont’d) • Criminal investigation - Ted Kaczynski was targeted as a primary suspect in the Unabomber case because authorship attribution methods determined that he could have written the Unabomber’s manifesto • Forensic investigations - verifying the authorship of e-mails and newsgroup messages, or identifying the source of a piece of intelligence.
Motivation • Many publications exist, but no detailed work has been done for Turkish literature • Idea originated from: “Kayıp Yazarın İzi, Elias’ın Gizi” by S. Oğuzertem • Our work aims to test whether his attribution claim holds
How to do it? • When authors write, they use certain words unconsciously. • Find an underlying ‘fingerprint’ for an author’s style. • The fundamental assumption of authorship attribution is that each author has habits in wording that make their writing unique.
How to do it? (Cont’d) • It is well known that certain writers can be quickly identified by their writing style. • Extract features from text that distinguish one author from another • Apply a statistical or machine learning technique to training data • Show examples and counterexamples of an author’s work
How to do it - Problems? • Highly interdisciplinary area • Expertise needed in linguistics, statistics, text authentication, literature • Too many style measures to choose from • Which statistical method - a complicated one or a simple one? Many exist in the literature as well
How to do it? (Cont’d) • Determine style markers. • Parse all of the documents and extract the features • Combine the results to obtain characteristics of each author • Apply each of the statistical/machine learning approaches to assign a given document to the most likely author.
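The steps above can be sketched as a minimal end-to-end pipeline. This is a hypothetical illustration, not the study's actual implementation: the style markers and the nearest-profile classifier are stand-ins for the markers and methods discussed later in the slides.

```python
# Hypothetical authorship-attribution pipeline sketch.
# extract_features and the 1-NN "classifier" are illustrative
# placeholders for the real style markers and methods.

def extract_features(text):
    """Turn a document into a vector of simple style markers."""
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]
    return [
        len(words) / max(len(sentences), 1),                       # avg sentence length
        sum(len(w) for w in words) / max(len(words), 1),           # avg word length
        len(set(w.lower() for w in words)) / max(len(words), 1),   # type-token ratio
    ]

def train(corpus):
    """corpus: list of (author, text) pairs. Returns per-author profiles."""
    return [(author, extract_features(text)) for author, text in corpus]

def attribute(model, unknown_text):
    """Assign the unknown text to the author with the nearest profile."""
    x = extract_features(unknown_text)
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(model, key=lambda item: dist(item[1], x))[0]
```

A real system would extract the richer marker set described in the following slides and feed it to Naïve Bayes, an SVM, or k-NN instead of this toy nearest-profile rule.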
Stylometry • The science of measuring literary style • What are the distinguishing styles? • Study the rarest, most striking features of the writer? • Study how writers use bread-and-butter words (e.g. "to", "with" etc. in English)?
Stylometry • "People's unconscious use of everyday words comes out with a certain stamp", David Holmes - stylometrist at the College of New Jersey • "Rare words are noticeable words, which someone else might pick up or echo unconsciously. It's much harder for someone to imitate my frequency pattern of 'but' and 'in'.", John Burrows - emeritus English professor at the University of Newcastle in Australia
Style Markers in Our Study • Frequency of Most Frequent Words • Token and Type Lengths • Token: all words • Type: unique words • For the sentence “I cannot bear to see a bear” • 7 tokens, 6 (context-free) types • Sentence Lengths • Syllable Count in Tokens • Syllable Count in Types
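The token/type distinction on this slide can be demonstrated directly on the example sentence:

```python
# Token vs. type counts for the slide's example sentence.
sentence = "I cannot bear to see a bear"
tokens = sentence.lower().split()   # all words
types = set(tokens)                 # unique words only

print(len(tokens))  # 7 tokens
print(len(types))   # 6 types ("bear" occurs twice)
```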
Style Markers in General • Some commonly used style markers • Average sentence length • Average syllables per word • Average word length • Distribution of parts of speech • Function word usage • The Type-Token ratio • Word frequencies • Vocabulary distributions
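One marker from this list, function word usage, can be sketched as relative frequencies over a fixed word list. The list below is a tiny illustrative sample, not the function-word inventory used in the study:

```python
from collections import Counter

# Relative frequency of function words - a commonly used style marker.
# FUNCTION_WORDS is a small illustrative sample list.
FUNCTION_WORDS = {"the", "a", "an", "and", "but", "in", "to", "of", "with"}

def function_word_profile(text):
    """Fraction of all tokens taken up by each function word."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = len(tokens)
    return {w: counts[w] / total for w in FUNCTION_WORDS}

profile = function_word_profile("the cat and the dog ran to the barn")
print(profile["the"])  # 3 of the 9 tokens are "the"
```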
Test Set
Classification Methods • How are the style markers used? • Several methods exist, such as • k-NN (k Nearest Neighbor) • Bayesian analysis • SVM (Support Vector Machines) • PCA (Principal Components Analysis) • Markovian Models • Neural Networks • Decision Trees • We are planning to use • Naïve Bayes • SVM • k-NN
Naïve Bayes Approach • In general each style marker is considered to be a feature or a feature set • Existing text whose author is known is used for training • Several choices are possible for estimating the distributions of the feature values in a text with a known author, such as • Maximum likelihood estimation • Bayesian density estimation • Expectation-Maximization (EM), etc.
Naïve Bayes Approach • The values of the features (x) for the unattributed text are computed • Since the probability densities are known for each author, Bayes’ formula is used to find the author of the “anonymous” text • A* = argmax_Ai P(Ai | x), where P(Ai | x) ∝ p(x | Ai) P(Ai)
An Oversimplified Sample Scenario • Assume that • There are texts from two authors (two classes) • The only style marker is the number of words with 3 characters (one feature) • The classifier is trained with the texts • pdfs are obtained
An Oversimplified Sample Scenario • Assume that the unattributed text has 10 words with 3 characters • Check whether author 1 or author 2 has the higher probability of producing 10 words with 3 characters • The unattributed text is assigned to the author with the higher probability
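This scenario can be sketched in a few lines. The Gaussian class-conditional pdfs and their parameters below are invented for illustration; the slides do not give numbers:

```python
import math

# Toy version of the slide's scenario: one feature (number of
# 3-character words), two authors, Gaussian class-conditional pdfs.
# The means, standard deviations, and priors are invented.

def gaussian_pdf(x, mean, std):
    """Value of the normal density N(mean, std^2) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

params = {"author1": (12.0, 2.0), "author2": (5.0, 2.0)}  # p(x | author)
prior = {"author1": 0.5, "author2": 0.5}                  # equal priors

x = 10  # the unattributed text has 10 three-character words
posterior_score = {a: gaussian_pdf(x, *params[a]) * prior[a] for a in params}
best = max(posterior_score, key=posterior_score.get)
print(best)  # author1: x = 10 is closer to mean 12 than to mean 5
```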
Support Vector Machines (SVMs) • Supervised learning method for classification and regression • Quite popular and successful in text categorization (Joachims) • Seeks a hyperplane separating two classes by: • Maximizing the margin • Minimizing the classification error • The solution is obtained using quadratic optimization techniques
Support Vector Machines (SVMs) [Figures: two classes of points, labelled +1 and -1, with candidate separating hyperplanes and the margin between the classes. Sample adapted from Andrew Moore’s SVM slides]
Support Vector Machines (SVMs) • The maximum-margin linear classifier is the simplest SVM • Support vectors lie on the margin and carry all the relevant information • Support vectors define the hyperplane
Support Vector Machines (SVMs) • How to find the hyperplane when the classes are not linearly separable? • Move the training data into a higher dimension with kernel functions • The hyperplane may not be linear in the original space [Figures: one-dimensional +1/-1 data around x = 0 mapped into a higher-dimensional space where the classes become separable]
Support Vector Machines (SVMs) • Basis functions are of the form K(x, y) = φ(x) · φ(y) for some feature map φ • Common kernel functions (standard forms): • Polynomial: K(x, y) = (x · y + c)^d • Sigmoidal: K(x, y) = tanh(κ x · y + c) • Radial basis: K(x, y) = exp(−‖x − y‖² / (2σ²))
Multi-class SVM • SVMs only perform binary classification; how to handle multi-class (N classes) cases? • Create N SVMs • SVM 1 learns “Output == 1” vs “Output != 1” • SVM 2 learns “Output == 2” vs “Output != 2” • : • SVM N learns “Output == N” vs “Output != N” • When predicting the output, assign the label of the SVM that puts the input point furthest into the positive region
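The one-vs-rest prediction rule above can be sketched as follows. The per-class decision functions here are invented linear scorers standing in for trained binary SVMs:

```python
# One-vs-rest multi-class prediction sketch. Each entry stands in
# for a trained binary SVM's decision function f_i(x); the linear
# scorers below are invented for illustration.

def predict_one_vs_rest(decision_functions, x):
    """Pick the class whose binary classifier gives the largest margin."""
    scores = {label: f(x) for label, f in decision_functions.items()}
    return max(scores, key=scores.get)

classifiers = {
    1: lambda x: x[0] - x[1],             # "class 1" vs rest
    2: lambda x: x[1] - x[0],             # "class 2" vs rest
    3: lambda x: -abs(x[0]) - abs(x[1]),  # "class 3" vs rest
}
print(predict_one_vs_rest(classifiers, (2.0, 0.5)))  # 1
```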
SVM Issues • Choice of kernel functions • Computational complexity of the optimization problem
k-Nearest Neighbour Classification Method • Key idea: keep all the training instances • Given a query example, take a vote amongst its k nearest neighbours • Neighbours are determined using a distance function
k-Nearest Neighbour Classification Method [Figure: a query point classified with k = 1 and k = 4 neighbourhoods] • Probability interpretation: estimate p(y | x) as the fraction of the k nearest neighbours of x that have label y. Sample adapted from Rong Jin’s slides
k-Nearest Neighbour Classification Method • Advantages: • Training is really fast • Can learn complex target functions • Disadvantages: • Slow at query time: efficient data structures are needed to speed up queries
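The voting rule from these slides can be sketched directly. The training points are invented toy data:

```python
from collections import Counter

# k-NN classification sketch: keep all training instances and vote
# among the k nearest by squared Euclidean distance (majority vote).

def knn_predict(training, x, k):
    """training: list of (feature_vector, label) pairs."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = sorted(training, key=lambda item: dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(training, (1, 1), 3))  # "A": two of the three nearest are A
```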
How to choose k? • Use validation with the leave-one-out method • For k = 1, 2, …, K • Err(k) = 0 • Select a training data point and hide its class label • Use the remaining data and the given k to predict the label of the held-out point • Err(k) = Err(k) + 1 if the predicted label differs from the true label • Repeat the procedure until all training examples have been tested • Choose the k whose Err(k) is minimal
Worked example from the slides: Err(1) = 3, Err(2) = 2, Err(3) = 6, so k = 2 is chosen
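The leave-one-out procedure can be sketched end to end. The training points are invented toy data, and `knn_predict` is a minimal nearest-neighbour voter included to make the sketch self-contained:

```python
from collections import Counter

# Leave-one-out selection of k: for each candidate k, classify every
# training point using all the others, count the errors, and keep the
# k with the fewest errors.

def knn_predict(training, x, k):
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = sorted(training, key=lambda item: dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def choose_k(training, candidates):
    errors = {}
    for k in candidates:
        err = 0
        for i, (x, label) in enumerate(training):
            rest = training[:i] + training[i + 1:]   # hide one point
            if knn_predict(rest, x, k) != label:
                err += 1
        errors[k] = err
    return min(errors, key=errors.get)   # k with minimal Err(k)

training = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
            ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(choose_k(training, [1, 2, 3]))
```

On these well-separated toy clusters every candidate k gets zero errors, so the smallest candidate wins; on real data, as in the slides' example, the error counts differ and pick out a specific k.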
Future Work & Conclusion • Preliminary feature distributions seem discriminative • Will apply the classification methods on the feature set • Will rank the features by success rate • May come up with new style markers