Authorship Attribution and Stylometry Patrick Juola Duquesne University www.jgaap.com juola@mathcs.duq.edu
Whodunit? • Authorship Attribution (aka Stylometry, cf. Authorship Profiling) : identifying an author from his/her writings • Did Shakespeare really write those plays? • Or was it the Earl of Oxford? • Or Francis Bacon? • Or Roger Bacon? • Or Kevin Bacon? • &c.
More technical definition • Authorship attribution : inferring the identity of the author of a document by examination. • Stylometry : inferring properties of the author by examination. • E.g. the author was a male native English speaker aged 25 to 35 with no college education but with theater training
Important problem Long history (Book of Judges, “shibboleth”) Key to literature and to history and journalism and teaching (catching cheaters) and law/investigation (Unabomber) and psychology (inferring personality from writing) and security and,… and,…
Computers are problematic Handwriting is easy, anyone can do it. Typewriting is still pretty easy if you know what you’re looking for But one 12pt Times Roman ‘A’ looks identical to any other. What cues to authorship exist?
Looking for clues What is this object?
Looking for clues (2) How far does light travel in 1/300,000 of a second?
Looking for clues (3) Where is the dinner fork?
Finding clues The object is a “couch.”
Looking for clues (2) • How far does light travel in 1/300,000 of a second? • Approximately one kilometer. • Note that other answers are not wrong, just individual. • E.g. “kilometre” is a standard spelling • ‘km’ is standard abbreviation • “click” or “k” are commonly-understood slang
Finding clues (3) The dinner fork is to the left of the plate
Finding clues (3b) The dinner fork is on the immediate left of the plate
Another example The paradigmatic and systematic utilization of sesquipedalian lexical items can be an informative element of individual and idiosyncratic patterns of linguistic variation Or, some people use big words
History Judges 12:6 Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him at the passages of Jordan, and there fell at that time of the Ephraimites forty and two thousand.
The “stylome” The underlying theoretical assumption is that language is not completely controllable (i.e. it’s hard to lose your accent) Obviously, some parts (e.g. lexicon) are more controllable than others (e.g. accent). Van Halteren has coined the term “stylome” to describe these specific individual differences. Others use “fingerprint.”
Some early candidates • Authorial vocabulary may be a stylome. • You can’t use words you don’t know. • Can we measure vocabulary size? • Similarly, average word length may be a stylome (first proposed by De Morgan). • … But neither of these works especially well.
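Both candidate measures are easy to compute, which is part of their appeal. A minimal sketch in Python, assuming a naive regex tokenizer (the tokenization rule and the function names are illustrative assumptions, not taken from the slides):

```python
import re

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def vocabulary_size(text):
    # Number of distinct word types observed in this sample.
    return len(set(tokenize(text)))

def average_word_length(text):
    # Mean number of characters per word token; 0.0 for an empty sample.
    words = tokenize(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0

sample = "Some people use big words and some people use small words"
print(vocabulary_size(sample), average_word_length(sample))
```

Note that the observed vocabulary grows with the length of the sample, so this counts the words an author happened to use here, not the words the author knows.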
Federalist Papers • Modern stylometry more or less starts with Mosteller and Wallace. • They studied the Federalist Papers using multivariate statistics. • They took frequencies of specific high-frequency function words. • They classified the disputed papers as Hamilton’s or Madison’s (H/M) based on Bayesian analysis.
Successes and Failures • Mosteller and Wallace’s results generally confirmed accepted scholarship • But it’s also a largely artificial problem! • Federalist Papers have become “standard” • Other examples have produced noted failures • E.g. Foster’s attribution of “A Funeral Elegy”
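The idea of weighing function-word frequencies in a Bayesian way can be illustrated for a single word. The sketch below assumes word counts follow a Poisson distribution and that both authors are equally likely a priori. This is a deliberate simplification for illustration, not a reconstruction of the actual Federalist analysis, and the per-author rates are invented.

```python
import math
import re

def rate_per_thousand(text, word):
    # Occurrences of `word` per 1,000 word tokens (naive regex tokenizer).
    tokens = re.findall(r"[a-z']+", text.lower())
    return 1000.0 * tokens.count(word) / len(tokens) if tokens else 0.0

def poisson_log_likelihood(count, n_tokens, rate_per_1000):
    # log P(count) under a Poisson whose mean scales with document length.
    mean = rate_per_1000 * n_tokens / 1000.0
    return count * math.log(mean) - mean - math.lgamma(count + 1)

def log_odds(count, n_tokens, rate_a, rate_b):
    # Positive values favor author A, negative values favor author B.
    return (poisson_log_likelihood(count, n_tokens, rate_a)
            - poisson_log_likelihood(count, n_tokens, rate_b))

# Hypothetical rates for one function word, per 1,000 words (made-up numbers).
RATE_A, RATE_B = 3.0, 0.2

print(rate_per_thousand("to be or not to be", "to"))
print(log_odds(count=6, n_tokens=2000, rate_a=RATE_A, rate_b=RATE_B))  # favors A
print(log_odds(count=0, n_tokens=2000, rate_a=RATE_A, rate_b=RATE_B))  # favors B
```

With several words, the per-word log odds would be summed under an independence assumption, which is itself only an approximation.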
Lots of ways to study Rudman has suggested that more than 1000 different features have been proposed over the past 100+ years. Most “work” in the sense of better than chance. But “better than chance” isn’t very good in the real world.
The Ur-study • Find a document, with presumptive author • Collect uncontroversial corpus of author’s writings • Collect set of distractor authors, with sample corpora for each author • Identify something found in author’s writings and test document but not in distractors’ • Publish
Textual considerations • First question : How confident are we that we have a valid text to study? • Issues include corruption, editorial changes, formatting (e.g. running heads), printers’ errors • Second question : How confident are we of our “uncontroversial” stuff? • Third question : Do we have the right distractor authors?
Technical considerations First question : How good is the technique we’re using? Second question : Are there representativeness issues involved? Third question : Do we have enough data? Fourth question : How do we interpret the results?
Search for best practices • First question : How good is the technique we’re using? • Development of “good” techniques is an open research question. • … hence JGAAP
JGAAP Single framework allows comparative testing under controlled conditions Modular, object-oriented approach makes extension to new methods easy. Simple GUI for ease of use Simple 3-phase model under the hood
Under the hood • Canonicization : perform necessary conversions, strip out irrelevant and confusing differences • Event Set Generation : partition document into “Events” • Statistical Analysis : k-NN, LDA, SVM, Naïve Bayes, whatever you like….
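The first phase can be pictured as a chain of small text-to-text functions. The following Python sketch is an assumption about what such conversions might look like. The function names are hypothetical and are not JGAAP's own classes or API.

```python
import re
import string

def normalize_whitespace(text):
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()

def unify_case(text):
    # Treat "The" and "the" as the same token.
    return text.lower()

def strip_punctuation(text):
    # Remove ASCII punctuation only; all other characters are left alone.
    return text.translate(str.maketrans("", "", string.punctuation))

def canonicize(text, steps):
    # Apply each conversion in the order given.
    for step in steps:
        text = step(text)
    return text

raw = "  The  Dinner-Fork\nis ON the LEFT!  "
print(canonicize(raw, [normalize_whitespace, unify_case, strip_punctuation]))
```

Which differences count as "irrelevant" is a judgment call: stripping punctuation removes printers' noise, but it also discards punctuation habits that might themselves be a clue to authorship.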
Event Set Generation • Documents contain “events” (also called “features,” but “events” stresses ordering). • E.g. words are events • Make bag-of-words, or bag of word-bigrams • Properties of words are also events • POS, word lengths, frequencies • Phase II – convert document to (ordered) Event Set.
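As a rough illustration of Phase II, the Python sketch below turns one already-canonicized document into three different ordered event sets. Whitespace tokenization and the function names are assumptions made for the example.

```python
def word_events(text):
    # Each whitespace-separated token is one event, kept in document order.
    return text.split()

def word_bigram_events(text):
    # Each pair of adjacent words is one event.
    words = text.split()
    return list(zip(words, words[1:]))

def word_length_events(text):
    # A property of each word (its length) is the event.
    return [len(w) for w in text.split()]

doc = "some people use big words"
print(word_events(doc))
print(word_bigram_events(doc))
print(word_length_events(doc))
```

Because every generator returns a plain ordered list, the same downstream analysis can be run unchanged on any of them, which is what makes comparing event types under controlled conditions possible.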
Analysis Classify Event Sets based on statistical properties. Again, many different ways to do this
A simple example Build histogram of Events based on (normalized) frequencies. Convert histogram to vector space by enumerating over elements Calculate distances between various histograms using distance formula Assign authorship of unknown document to closest document of known authorship
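The four steps of this example can be sketched in a few lines of Python. The slide does not say which distance formula is used, so Euclidean distance is assumed here. The toy documents and function names are likewise invented for illustration.

```python
from collections import Counter
import math

def histogram(events):
    # Normalized (relative) frequency of each event.
    counts = Counter(events)
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}

def to_vectors(h1, h2):
    # Enumerate the union of events in a fixed order to get aligned vectors.
    keys = sorted(set(h1) | set(h2))
    return [h1.get(k, 0.0) for k in keys], [h2.get(k, 0.0) for k in keys]

def euclidean(h1, h2):
    # Assumed distance formula; other distances could be swapped in here.
    v1, v2 = to_vectors(h1, h2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def attribute(unknown_events, known):
    # known: list of (author, events) pairs; return the author of the closest document.
    target = histogram(unknown_events)
    return min(known, key=lambda item: euclidean(target, histogram(item[1])))[0]

known = [
    ("A", "the cat sat on the mat".split()),
    ("B", "a dog ran in a park".split()),
]
unknown = "the cat sat".split()
print(attribute(unknown, known))
```

Nearest-neighbor assignment always returns some author, even when none of the candidates is a good match, so the distance itself is worth inspecting alongside the answer.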
Getting JGAAP JGAAP is available at www.jgaap.com Also available at http://www.mathcs.duq.edu/~fa08rutenbar/jgaap.zip Requires Java (JDK) 1.5 or better Also requires ‘ant’ Freeware, so we can (and will) be developing during the course
Plans for rest of course Details of JGAAP Details of some of the models JGAAP includes Developing new test corpora Developing new models based on analysis of test corpora Extension to profiling….