NATURAL LANGUAGE PROCESSING

NATURAL LANGUAGE PROCESSING: STYLOMETRY

WHO WROTE THIS? “On the far side of the river valley the road passed through a stark black burn. Charred and limbless trunks of trees stretching away on every side. Ash moving over the road and the sagging hands of blind wire strung from the blackened lightpoles whining thinly in the wind.”

STYLOMETRY • Studying properties of the writers of documents based only on the linguistic style they exhibit. • In particular using computational tools • The best known type of stylometric task: “who wrote this document?” • “Linguistic Style” Features: sentence length, word choices, syntactic structure, etc. • Handwriting, content-based features, and contextual features are not considered.

Applications of Stylometry • Digital Humanities: • Author attribution: Identification of unknown authors • Genre classification • Historical study of language change (diachronic linguistics) • Literary analysis • But many other applications as well: Forensics, Anonymity, Plagiarism … • “In some criminal, civil, and security matters, language can be evidence… When you are faced with a suspicious document, whether you need to know who wrote it, or if it is a real threat or real suicide note, or if it is too close for comfort to some other document, you need reliable, validated methods.”

Who wrote this? “On the far side of the river valley the road passed through a stark black burn. Charred and limbless trunks of trees stretching away on every side. Ash moving over the road and the sagging hands of blind wire strung from the blackened lightpoles whining thinly in the wind.” Cormac McCarthy

Authorship attribution • Has been a topic of research since at least the mid-19th century (predates computers) • Interest in • resolving issues of disputed authorship • identifying authorship of anonymous texts • may be useful in detecting plagiarism, and authorship of computer viruses • used in forensic settings, e.g. to detect genuine confessions

Classical examples • Did Homer write both the Iliad and the Odyssey? • both generally attributed to a single individual named “Homer”, but both are derived from long oral tradition • Did Paul write all the NT Letters of St Paul? • Especially, the authorship of Hebrews has long been debated on theological grounds • Plato developed his philosophy in the form of dialogues, putting his own doctrines into the mouth of Socrates his teacher. • Ascertaining the correct chronological order of these dialogues would help to understand how Plato developed his philosophy • Did Shakespeare write all of his plays? • Various authors including Bacon and Marlowe are said to have written parts or all of several plays • “Shakespeare” may even be a nom-de-plume for a group of writers • two more plays – Edward III and Two Noble Kinsmen – may have been written partly by Shakespeare

The Federalist Papers • 85 articles published in 1787-88 with the aim of promoting the ratification of the new US constitution. • written by three authors, Jay, Hamilton and Madison, under the pseudonym “Publius” • Some are of known (and in some cases joint) authorship but 12 are disputed • Pioneering stylometric methods were famously used by Mosteller and Wallace in the early 1960s to attempt to establish who wrote the disputed papers • The question is now considered settled (Madison the author of the disputed papers) • The Federalist Papers present a difficult but solvable test case, and are seen as a benchmark to test new ideas

Some modern examples • Similarities with private letters helped to identify the style of the Unabomber’s manifesto • Unabomber Theodore Kaczynski perpetrated a number of bomb attacks on universities and airlines between 1978 and 1995 • Promised to stop if his 35,000-word anti-industrialist “manifesto” was published in major newspapers • Distinctive writing style and turns of phrase enabled him to be identified • Authorship of Primary Colors, a work of fiction about preparations for the Democratic primaries which showed the Bill Clinton character in a bad light

Some modern examples • Derek Bentley and his disputed murder ‘confession’ (1953) • Bentley (an illiterate man of low IQ) and another man were involved in an armed robbery in which a policeman was shot • Bentley was found guilty and hanged in January 1953 • In 1971 the author Yallop looked closely at the case • As well as conflicting ballistic evidence and some procedural errors in the trial, Bentley’s statement was found to have been doctored by police: • The contested statement used “then” once every 58 words on average and repeatedly used “I then”. • The Bank of English corpus (BoE) uses “then” once every 500 words, and “then I” ten times more often than “I then”. Importantly, witness statement frequencies overall are similar to BoE. • The police statement ‘genre’ of the time used “then” once every 78 words, and typically used the “I then” form. • Bentley’s conviction was quashed posthumously in 1998, in an appeal assisted by a linguistics professor

Five approaches to authorship attribution • Physical evidence • e.g. carbon dating and handwriting analysis, as in the case of the Hitler Diaries. Not relevant to linguistics/stylistics • Historical evidence • e.g. did Marlowe or Shakespeare write Edward III? It was published in 1596, three years after Marlowe’s death, but contains references to the defeat of the Armada (1588) • “knowledge intensive”, not feasible for computers

Authorship attribution • Cipher-based decryption • idea that authors deliberately encode their names in text • especially widespread in Bible studies, but also in the Shakespeare-Bacon debate • Penn (1987) used computer analysis to show Bacon had written a lot of Shakespeare’s plays • easily debunked: see http://shakespeareauthorship.com/#5b: Ross showed that using the same techniques “proved” that Bacon also wrote Spenser’s Faerie Queene, the Bible, Caesar’s Gallic Wars, Hiawatha, Moby Dick and The Federalist Papers (see later)

Authorship attribution • Manual analysis • Much used in forensic linguistics • Detailed analysis of unlimited linguistic traits • Not suitable for computational analysis, but we’ll look at some examples later • Computational stylometry

Computational stylometry • Involves counting things • So can only look at what is easily countable • Modern computational stylometry is based on machine learning • SVMs, Genetic Algorithms, Neural Networks, Bayesian Classifiers… used extensively.

Stylometry • Assumes that the essence of the individual style of an author can be captured with reference to a number of quantitative criteria, called discriminators • Obviously, some (many) aspects of style are conscious and deliberate • as such they can be easily imitated and indeed often are • many famous pastiches, either humorous or as a sort of homage • Computational stylometry is focused on subconscious elements of style less easy to imitate or falsify

Stylometry is not foolproof • We should be aware of shortcomings • Discriminators are mostly lexical, though some recent work has looked also at syntactic discriminators • Authors’ styles change, either over time, or deliberately, e.g. when writing in different literary genres • Many techniques rely on large quantities of data • Most of the following techniques are better at dealing with closed questions • Who wrote this, A or B? • If A wrote these, did they also write this? • How likely is it that A wrote this? • but not the open question “Who wrote this?”

Basic methodologies • Word or sentence length: too obvious and easy to manipulate • Frequencies of letter pairs: strangely successful, though limited • Distribution of words of a given length (in syllables), especially relative frequencies, i.e. length of gaps between words of the same syllable length.
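As a rough illustration of the letter-pair idea, the sketch below counts adjacent letter pairs inside words and converts them to relative frequencies. The function name and the decision to ignore pairs that span word boundaries are assumptions made for this example, not details taken from the slides.

```python
from collections import Counter

def letter_pair_frequencies(text):
    """Relative frequency of each adjacent letter pair within words.

    Assumption for this sketch: pairs are counted inside words only,
    and non-alphabetic characters are dropped before pairing.
    """
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        for a, b in zip(letters, letters[1:]):
            counts[a + b] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}

# Toy input: "th" and "he" each occur in all three words.
freqs = letter_pair_frequencies("the then there")
for pair, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(pair, round(freq, 3))
```

Two texts can then be compared by looking at how far their frequency profiles differ on the most common pairs.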

How does it work? Linguistic Features • Basic Measurements: • Average syllable/word/sentence count, letter distribution, punctuation. • Lexical Density • Unique_Words / Total_Words • Gunning-Fog Readability Index: • 0.4 * ( Average_Sentence_Length + 100 * Complex_Word_Ratio ) • Result: years of formal education required to read the text.
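The two formulas on this slide can be sketched in a few lines. The tokenizer, the sentence splitter and the vowel-group syllable counter below are crude assumptions chosen to keep the example self-contained; "complex word" is taken here to mean a word with three or more syllables.

```python
import re

def tokenize(text):
    """Lowercased word tokens (assumption: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(text):
    """Naive split on ., ! and ? (assumption: no abbreviations)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def count_syllables(word):
    """Heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lexical_density(text):
    """Unique_Words / Total_Words, as defined on the slide."""
    words = tokenize(text)
    return len(set(words)) / len(words)

def gunning_fog(text):
    """0.4 * (Average_Sentence_Length + 100 * Complex_Word_Ratio)."""
    words = tokenize(text)
    sentences = split_sentences(text)
    average_sentence_length = len(words) / len(sentences)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    complex_word_ratio = len(complex_words) / len(words)
    return 0.4 * (average_sentence_length + 100 * complex_word_ratio)

sample = "The cat sat. The cat ran."
print(lexical_density(sample))
print(gunning_fog(sample))
```

A dictionary-based syllable counter would give more reliable readability scores than the vowel-group heuristic used here.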

Vocabulary richness • Based on the idea that an author’s vocabulary is more or less constant • Various measures • Type-token ratio • Simpson’s index (the chance that two words arbitrarily chosen from the text will be the same) • Yule’s K (assumes that the occurrence of a given word is a chance event that can be modelled by a Poisson distribution) • Entropy (measure of uniformity)
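The four measures can be computed from a simple word-frequency table. The sketch below uses the commonly quoted formulas (Simpson's index as the probability of drawing the same word twice without replacement, Yule's K scaled by 10^4, Shannon entropy in bits); these exact variants are an assumption, since the slide does not give formulas.

```python
import math
from collections import Counter

def vocabulary_richness(tokens):
    """Return type-token ratio, Simpson's index, Yule's K and entropy."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)

    type_token_ratio = n_types / n_tokens

    # Chance that two tokens drawn without replacement are the same word.
    simpson = sum(c * (c - 1) for c in counts.values()) / (n_tokens * (n_tokens - 1))

    # Yule's K = 10^4 * (sum of squared frequencies - N) / N^2.
    yule_k = 1e4 * (sum(c * c for c in counts.values()) - n_tokens) / (n_tokens ** 2)

    # Shannon entropy of the word distribution, in bits.
    entropy = -sum((c / n_tokens) * math.log2(c / n_tokens) for c in counts.values())

    return {
        "type_token_ratio": type_token_ratio,
        "simpson": simpson,
        "yule_k": yule_k,
        "entropy": entropy,
    }

print(vocabulary_richness("a a b c".split()))
```

Note that the type-token ratio falls as texts get longer, so it is only comparable between samples of similar size.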

The Federalist Papers • 85 papers arguing for the adoption of the US constitution • written by three authors (Jay, Hamilton, Madison) • 5 authored by Jay • 51 authored by Hamilton • 14 authored by Madison • 3 jointly by Hamilton and Madison • authorship of 12 of them disputed (Hamilton or Madison?) • Mosteller and Wallace (1964) employed function words such as prepositions, conjunctions, and articles as discriminators. • e.g., the word “upon” averaged 3.24 appearances per 1,000 words in the known writings of Hamilton but only 0.23 in the writings of Madison • 30 “marker words” identified as discriminative of the two contested authors, including: upon, whilst, there, on, while, vigor, by, consequently, would, voice
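The rate-per-1,000-words figures quoted above are straightforward to compute. The sketch below does so for the ten marker words listed on the slide; the tokenizer and the function name are assumptions for illustration only.

```python
import re
from collections import Counter

# The ten marker words listed on the slide.
MARKER_WORDS = ["upon", "whilst", "there", "on", "while",
                "vigor", "by", "consequently", "would", "voice"]

def rates_per_thousand(text, markers=MARKER_WORDS):
    """Occurrences of each marker word per 1,000 running words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {m: 1000 * counts[m] / len(words) for m in markers}

# Hypothetical ten-word snippet, only to show the call.
snippet = "Upon reflection there would be no vigor upon which to rely"
print(rates_per_thousand(snippet))
```

Computing these rates separately for the known Hamilton and Madison papers gives the per-author profiles against which a disputed paper can be compared.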

Bayesian probability • Bayes’ theorem reconciles prior hypotheses (in this case based on historical observation) with conditional probabilities based on measurements • If the prior hypothesis (e.g. that there is a 1:3 chance that Madison wrote the paper) is confirmed by the measurements (e.g. of features associated with Madison’s style), the measurements change the conclusion little • If the prior hypothesis is contradicted by the measurements, the result will be much more striking
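In odds form, Bayes' theorem says posterior odds = prior odds × likelihood ratio. The sketch below applies this to a single marker word, using the "upon" rates quoted for Hamilton (3.24 per 1,000 words) and Madison (0.23). Modelling the count with a Poisson distribution, the even prior odds and the 2,000-word paper length are all simplifying assumptions for illustration; this is not a reconstruction of Mosteller and Wallace's actual model.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing k events when lam are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def posterior_odds(prior_odds, count, n_words, rate_a, rate_b):
    """Odds for author A over author B after seeing one word count.

    rate_a and rate_b are occurrences per 1,000 words for each author.
    """
    expected_a = rate_a * n_words / 1000
    expected_b = rate_b * n_words / 1000
    likelihood_ratio = poisson_pmf(count, expected_a) / poisson_pmf(count, expected_b)
    return prior_odds * likelihood_ratio

# Hypothetical disputed paper: 2,000 words, "upon" never appears.
odds_hamilton = posterior_odds(prior_odds=1.0, count=0, n_words=2000,
                               rate_a=3.24, rate_b=0.23)
print(odds_hamilton)  # far below 1, i.e. strongly favours Madison
```

With several marker words, the likelihood ratios are multiplied together (assuming the words are independent), which is why many weak discriminators can add up to strong evidence.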

Cumulative sum charts • Method • Assume authorial “fingerprints” such as percentage of short words, or words beginning with a vowel • Put two texts together and plot the number of items per sentence against the cumulative average • If graph has a sharp divergence at the point where the texts are joined, this shows the authors differ • Highly controversial • Interpretation of graphs very subjective • But much used in courts! • Weighted cusum • Slightly sounder footing statistically – eliminates need for subjective judgment • Still not very accurate compared to other measures
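A minimal sketch of the quantities behind a cusum chart follows. For each sentence it records the sentence length and the number of "habit" words, then accumulates the deviations of each series from its own mean; the two resulting curves are what would be plotted and compared. The habit definition used here (words of two or three letters, or words beginning with a vowel) and the naive sentence splitter are assumptions for this example.

```python
import re

def cusum(values):
    """Cumulative sum of deviations from the mean of the series."""
    mean = sum(values) / len(values)
    running = 0.0
    result = []
    for value in values:
        running += value - mean
        result.append(running)
    return result

def is_habit_word(word):
    """Assumed habit: 2-3 letter word, or word starting with a vowel."""
    return len(word) in (2, 3) or word[0] in "aeiou"

def cusum_series(text):
    """Return (sentence-length cusum, habit-count cusum) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths, habits = [], []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        lengths.append(len(words))
        habits.append(sum(1 for w in words if is_habit_word(w)))
    return cusum(lengths), cusum(habits)

length_curve, habit_curve = cusum_series(
    "The road ran on. Ash moved over it. An old wire whined in the wind."
)
print(length_curve)
print(habit_curve)
```

Each curve ends at zero by construction, so any evidence lies in how the two curves track each other along the way, which is where the subjective interpretation criticised above comes in.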

Multivariate analysis • Thanks to computers it is now possible to collect large numbers of different measurements, of a variety of features • Variants of multivariate analysis • Cluster analysis • Correspondence analysis • Principal components analysis

Cluster analysis • Group objects according to their similarity with respect to a given feature • Produces a tree diagram or “dendrogram”
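The merge order that a dendrogram displays can be produced by a small agglomerative clustering routine. The sketch below uses Euclidean distance and average linkage, both of which are assumptions (the slide does not name a distance or linkage), and the four feature vectors are invented for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise distance between members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(vectors):
    """Repeatedly merge the two closest clusters.

    vectors maps a text label to its feature list. Returns the merge
    history as (members_a, members_b, distance) tuples, which is the
    information a dendrogram draws.
    """
    clusters = [(label,) for label in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], vectors)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical marker-word rates (per 1,000 words) for four texts.
texts = {"A1": [3.2, 0.1], "A2": [3.0, 0.2], "B1": [0.3, 1.1], "B2": [0.2, 0.8]}
for left, right, distance in agglomerate(texts):
    print(left, "+", right, round(distance, 3))
```

In this toy data the two A texts merge first, then the two B texts, and only then the two groups, which is the pattern one hopes to see when the features separate two authors.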

Correspondence analysis • Example of superlatives in Dickens’ and Smollett’s works • Tabata 2007: http://www.digitalhumanities.org/dh2007/abstracts/xhtml.xq?id=259 • Count frequency of 242 superlatives in 30 texts • CA allows classification of associations between variables in a 2D matrix, rows x columns • D1 distinguishes Dickens from Smollett • D2

Principal components analysis • Like cluster analysis but can work with a much larger range of variables • PCA is a statistical method for arranging large arrays of data into interpretable patterns • “principal components” are computed by calculating the correlations between all the variables, then grouping them into sets that show the most correspondence • each “set” is a “component”, or “dimension”

Final word • Many of these techniques are also used to identify different genres rather than different authors • especially PCA, where the dimensions can be characterised • (In fact, cluster analysis and PCA illustrations were taken from such a study!) • An interesting question: how well do they work on pastiches? • If interested, see H Somers & F Tweedie “Authorship attribution and pastiche”, Computers and the Humanities 37 (2003), 407-429.

REFERENCES • D. Holmes “Authorship attribution” Computers and the Humanities 28 (1994), 87-106. • D. Holmes “The Evolution of Stylometry in Humanities Scholarship” Literary and Linguistic Computing 13 (1998), 111-117. http://llc.oxfordjournals.org/cgi/reprint/13/3/111.pdf • T. McEnery & M. Oates “Authorship identification and computational stylometry” in Dale et al (eds) Handbook of Natural Language Processing, New York (2000): Dekker, chapter 23.
