
Authorship Attribution


Presentation Transcript


  1. Authorship Attribution CS533 – Information Retrieval Systems Metin KOÇ Metin TEKKALMAZ Yiğithan DEDEOĞLU 7 April 2006

  2. Outline • Overview • What is Authorship Attribution? • Brief History • Where and How to use it? • Stylometry • Style Markers • Classification Methods • Naïve Bayes • Support Vector Machine • k-Nearest Neighbour

  3. What is Authorship Attribution? • The task of determining who wrote a text when its authorship is unclear. • It is useful when two or more people claim to have written something, or when no one is willing (or able) to say that (s)he wrote the piece. • In a typical scenario, a set of documents with known authorship is used for training; the problem is then to identify which of these authors wrote the unattributed documents.

  4. A Brief History • The advent of non-traditional authorship attribution techniques can be traced back to 1887, when Mendenhall first proposed counting features such as word length. • His work was followed by Yule (1938) and Morton (1965), who used sentence lengths to judge authorship.

  5. Where to use it? • Authorship attribution can be used in a broad range of applications • To analyze anonymous or disputed documents/books, such as the plays of Shakespeare (shakespeareauthorship.com) • Plagiarism detection - it can be used to establish whether claimed authorship is valid

  6. Where to use it? (Cont’d) • Criminal investigation - Ted Kaczynski was targeted as a primary suspect in the Unabomber case because authorship attribution methods determined that he could have written the Unabomber’s manifesto • Forensic investigations - Verifying the authorship of e-mails and newsgroup messages, or identifying the source of a piece of intelligence

  7. Motivation • Many publications exist, but no detailed work has been done for Turkish literature • Idea originated from: “Kayıp Yazarın İzi, Elias’ın Gizi” by S. Oğuzertem • Will our work support his idea?

  8. How to do it? • When authors write, they use certain words unconsciously. • Find some underlying ‘fingerprint’ of an author’s style. • The fundamental assumption of authorship attribution is that each author has habits in wording that make their writing unique.

  9. How to do it? (Cont’d) • It is well known that certain writers can be quickly identified by their writing style. • Extract features from the text that distinguish one author from another • Apply a statistical or machine learning technique to training data, showing it examples and counterexamples of an author’s work

  10. How to do it – Problems? • Highly interdisciplinary area • Expertise needed in linguistics, statistics, text authentication, literature? • Too many style measures to choose from? • Statistical method – too complicated or too simple? Too many exist in the literature as well

  11. How to do it? (Cont’d) • Determine style markers • Parse all of the documents and extract the features • Combine the results to obtain characteristics of each author • Apply each of the statistical/machine learning approaches to assign a given document to the most likely author (a rough pipeline sketch follows below)
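The slides contain no code; a minimal Python sketch of this four-step pipeline might look like the following, where `featurize` and `classify` are placeholders for the style-marker extractor and the classifier (both are illustrative assumptions, not the authors' implementation):

```python
def attribute(train_texts, train_authors, unknown_text, featurize, classify):
    """Minimal attribution pipeline: featurize every training text,
    then let `classify` assign the unknown text to an author.
    featurize: str -> list[float]; classify: (X, y, x_new) -> label."""
    X = [featurize(t) for t in train_texts]          # one vector per text
    x_new = featurize(unknown_text)                  # vector for the query
    return classify(X, train_authors, x_new)         # most likely author
```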

  12. Stylometry • The science of measuring literary style • What are the distinguishing styles? • Study the rarest, most striking features of the writer? • Study how writers use bread-and-butter words (e.g. "to", "with" etc. in English)?

  13. Stylometry • "People's unconscious use of everyday words comes out with a certain stamp", David Holmes - stylometrist at the College of New Jersey • "Rare words are noticeable words, which someone else might pick up or echo unconsciously. It's much harder for someone to imitate my frequency pattern of 'but' and 'in'.", John Burrows - emeritus professor of English at the University of Newcastle in Australia

  14. Style Markers in Our Study • Frequency of Most Frequent Words • Token and Type Lengths • Token: all words • Type: unique words • For the sentence “I cannot bear to see a bear”: 7 tokens, 6 (context-free) types • Sentence Lengths • Syllable Count in Tokens • Syllable Count in Types
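A quick check of the slide's token/type example (the exact tokenization the authors used is not stated; whitespace splitting suffices here):

```python
# "I cannot bear to see a bear" has 7 tokens (all words) but only
# 6 types (unique words), since "bear" occurs twice. Types are
# compared case-insensitively in this sketch.
sentence = "I cannot bear to see a bear"
tokens = sentence.split()
types = {t.lower() for t in tokens}

print(len(tokens))  # 7
print(len(types))   # 6
```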

  15. Style Markers in General • Some commonly used style markers: • Average sentence length • Average syllables per word • Average word length • Distribution of parts of speech • Function word usage • The type-token ratio • Word frequencies • Vocabulary distributions
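A minimal sketch of computing three of the listed markers. The sentence-splitting and punctuation handling here are deliberate simplifications, not the feature extraction used in the study:

```python
import re

def style_markers(text):
    """Three common markers: average sentence length (in words),
    average word length (in characters), and the type-token ratio.
    Splitting sentences on . ! ? is a simplification."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = text.split()
    types = {t.lower().strip(".,;:!?\"'") for t in tokens}
    return [
        len(tokens) / len(sentences),         # average sentence length
        sum(map(len, tokens)) / len(tokens),  # average word length
        len(types) / len(tokens),             # type-token ratio
    ]

print(style_markers("I cannot bear to see a bear. It hurts."))
```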

  16–19. Test Set [four slides of tables describing the test corpus; the table contents did not survive the transcript]

  20. Classification Methods • How are the style markers used? • Several methods exist, such as • k-NN (k-Nearest Neighbour) • Bayesian analysis • SVM (Support Vector Machines) • PCA (Principal Components Analysis) • Markovian models • Neural networks • Decision trees • We are planning to use • Naïve Bayes • SVM • k-NN

  21. Naïve Bayes Approach • In general, each style marker is considered to be a feature or a feature set • Existing text whose author is known is used for training • Several choices are possible for estimating the distributions of the feature values in a text with a known author, such as • Maximum likelihood estimation • Bayesian density estimation • Expectation-Maximization (EM), etc.

  22. Naïve Bayes Approach • The values of the features (x) for the unattributed text are found • Since the probability densities are known for each author, Bayes’ formula is used to find the author of the “anonymous” text: $A^* = \arg\max_{A_i} P(A_i \mid x) = \arg\max_{A_i} p(x \mid A_i)\,P(A_i)$

  23. An Oversimplified Sample Scenario • Assume that • There are texts from two authors (two classes) • The only style marker used is the number of words with 3 characters (one feature) • The classifier is trained with the texts • pdfs are obtained

  24. An Oversimplified Sample Scenario • Assume that the unattributed text has 10 words with 3 characters • Check whether author 1 or author 2 has the higher probability of having 10 words with 3 characters • The unattributed text is assigned to the author with the higher probability (a numeric sketch follows below)
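The slide's pdfs are not given, so the following sketch uses made-up Gaussian parameters and equal priors purely to make the scenario concrete:

```python
import math

# Illustrative numbers only: suppose training yielded a Gaussian pdf
# per author over the single feature "number of 3-character words".
params = {"author1": (6.0, 2.0), "author2": (12.0, 3.0)}  # (mean, std)
prior = {"author1": 0.5, "author2": 0.5}

def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

x = 10  # the unattributed text has 10 three-character words
scores = {a: gaussian_pdf(x, *params[a]) * prior[a] for a in params}
print(max(scores, key=scores.get))  # author2: p(10|A2) P(A2) > p(10|A1) P(A1)
```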

  25. Support Vector Machines (SVMs) • Supervised learning method for classification and regression • Quite popular and successful in text categorization (Joachims et al.) • Seeks a hyperplane separating two classes by: • Maximizing the margin • Minimizing the classification error • The solution is obtained using quadratic optimization techniques

  26–32. Support Vector Machines (SVMs) [a sequence of figures showing candidate linear separators for two classes (+1 / −1) and the margin between them; sample adapted from Andrew Moore’s SVM slides]

  33. Support Vector Machines (SVMs) • The maximum-margin linear classifier is the simplest SVM • Support vectors lie on the margin and carry all the relevant information • The support vectors define the hyperplane

  34. Support Vector Machines (SVMs) [formula slide; the equations did not survive the transcript]
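The equations on this slide were lost in transcription; the standard hard-margin formulation that such slides usually show (an assumption, not recovered from the original) is:

```latex
% Maximum-margin linear classifier: minimizing \|w\|^2 maximizes the
% margin 2/\|w\|, subject to every point lying outside the margin.
\min_{w,\,b}\; \tfrac{1}{2}\lVert w \rVert^2
\quad \text{s.t.} \quad y_i \left( w^{\top} x_i + b \right) \ge 1,
\qquad i = 1, \dots, n
```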

  35–37. Support Vector Machines (SVMs) [three figures of one-dimensional data, labelled +1 / −1 around x = 0, that no linear threshold separates] • How to find the hyperplane? • Move the training data into a higher dimension with kernel functions • The hyperplane may not be linear in the original space

  38. Support Vector Machines (SVMs) • Basis functions are of the form of a kernel $K(x, y)$ evaluated against training points • Common kernel functions: • Polynomial: $K(x, y) = (x \cdot y + c)^d$ • Sigmoidal: $K(x, y) = \tanh(\kappa\, x \cdot y + c)$ • Radial basis: $K(x, y) = \exp(-\lVert x - y \rVert^2 / 2\sigma^2)$
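A sketch of the three kernels named on the slide; the hyperparameter values (degree `d`, scale `kappa`, offset `c`, width `sigma`) are illustrative defaults, not values from the study:

```python
import numpy as np

def polynomial(x, y, d=3, c=1.0):
    # (x . y + c)^d
    return (np.dot(x, y) + c) ** d

def sigmoidal(x, y, kappa=0.1, c=0.0):
    # tanh(kappa * x . y + c)
    return np.tanh(kappa * np.dot(x, y) + c)

def radial_basis(x, y, sigma=1.0):
    # exp(-||x - y||^2 / 2 sigma^2)
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial(x, y), sigmoidal(x, y), radial_basis(x, y))
```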

  39. Multi-class SVM • SVM only works for binary classification; how to handle multi-class (N classes) cases? • Create N SVMs • SVM 1 learns “Output == 1” vs “Output != 1” • SVM 2 learns “Output == 2” vs “Output != 2” • : • SVM N learns “Output == N” vs “Output != N” • When predicting the output, assign the label of the SVM which puts the input point into the furthest positive region (sketched below)
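A sketch of this one-vs-rest scheme using scikit-learn's `LinearSVC` as the binary SVM (the slides do not name a library; `LinearSVC` already does one-vs-rest internally, but it is spelled out here to mirror the slide):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, classes):
    # One binary SVM per class: class c vs everything else.
    return {c: LinearSVC().fit(X, (y == c).astype(int)) for c in classes}

def predict(svms, x):
    # Pick the class whose SVM places x deepest in its positive region.
    scores = {c: svm.decision_function([x])[0] for c, svm in svms.items()}
    return max(scores, key=scores.get)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]])
y = np.array([0, 0, 1, 1, 2, 2])
svms = train_one_vs_rest(X, y, classes=[0, 1, 2])
print(predict(svms, [2, 2.5]))  # expected: 2
```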

  40. SVM Issues • Choice of kernel functions • Computational complexity of the optimization problem

  41. k-Nearest Neighbour Classification Method • Key idea: keep all the training instances • Given a query example, take a vote amongst its k nearest neighbours • Neighbours are determined by using a distance function (see the sketch below)
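A minimal sketch of this voting scheme; Euclidean distance is an assumed default, and any other distance function can be passed in:

```python
from collections import Counter

def knn_classify(X, y, x_new, k=3, distance=None):
    """Vote amongst the k training instances nearest to x_new."""
    if distance is None:  # default: Euclidean distance
        distance = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    neighbours = sorted(zip(X, y), key=lambda pair: distance(pair[0], x_new))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

X = [[1, 1], [1, 2], [5, 5], [6, 5]]
y = ["A", "A", "B", "B"]
print(knn_classify(X, y, [2, 1], k=3))  # "A": two of the three nearest are A
```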

  42. k-Nearest Neighbour Classification Method [figure: a query point with its k = 1 and k = 4 neighbourhoods; sample adapted from Rong Jin’s slides] • Probability interpretation: estimate $p(y \mid x)$ as the fraction of the $k$ nearest neighbours of $x$ that have label $y$

  43. k-Nearest Neighbour Classification Method • Advantages: • Training is really fast • Can learn complex target functions • Disadvantages: • Slow at query time: efficient data structures are needed to speed up the query

  44. How to choose k? • Use validation with the leave-one-out method (a code sketch follows the worked example below) • For k = 1, 2, …, K • Err(k) = 0 • Select a training data point and hide its class label • Use the remaining data and the given k to predict the class label of the held-out point • Err(k) = Err(k) + 1 if the predicted label differs from the true label • Repeat the procedure until all training examples have been tested • Choose the k whose Err(k) is minimal

  45–49. How to choose k? [the same procedure animated on a sample dataset, holding out one point at a time; the final tallies are Err(1) = 3, Err(2) = 2, Err(3) = 6, so k = 2 is chosen]

  50. Future Work & Conclusion • Preliminary feature distributions seem discriminative • Will apply the classification methods on the feature set • Will rank the features’ success rates • May come up with new style markers
