Quantitative aspects of literary texts



  1. Quantitative aspects of literary texts
University of Massachusetts Dartmouth Sigma Xi Research Exhibition, April 29th & 30th, 2008
Adam J. Callahan & Gary E. Davis, Department of Mathematics
http://physicsoftext.wordpress.com

  2. What sort of curve is this? Type-token ratio
• The distribution of word frequencies in text has been studied extensively, from at least the time of Zipf in 1936 to the present.
• For a text, the type-token ratio is τ(n) = types(n)/n, where types(n) is the number of word types in the first n words of the text.
• The type-token ratio is just the running average of the number of new words in an initial text segment of length n.
[Figure: typical decay of the type-token ratio with the number of words, n. Data for the text With the Turks in Palestine, by A. Aaronsohn.]
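A minimal Python sketch of this computation (not the authors' code; the file name and the regex tokenizer are assumptions):

```python
# Compute the type-token ratio tau(n) = types(n)/n for every prefix of a text.
import re

def type_token_ratio(words):
    """Return tau where tau[n-1] = types(n)/n."""
    seen = set()
    tau = []
    for n, w in enumerate(words, start=1):
        seen.add(w)                # 'seen' holds the word types so far
        tau.append(len(seen) / n)  # fraction of the first n words that are types
    return tau

text = open("aaronsohn.txt", encoding="utf-8").read()  # hypothetical file name
words = re.findall(r"[a-z']+", text.lower())           # crude tokenizer
tau = type_token_ratio(words)
```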

  3. Power laws
• A log-log plot of τ(n) versus n yields a good straight-line fit (r² = 0.964). The line might not look geometrically quite straight, but the correlation coefficient is quite high: r = 0.982.
• This gives an analytical expression for the type-token ratio: τ(n) ≈ A·n^(−d). In the case of the Aaronsohn text, A ≈ 3.150 and d ≈ 0.270.
• This is an approximate power-law decay of the type-token ratio with the number of words.
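A sketch of the corresponding least-squares fit, continuing from the previous sketch (an ordinary straight-line fit in log-log coordinates, using numpy):

```python
# Fit log tau(n) = log A - d log n by least squares, for n >= n_min.
import numpy as np

def fit_power_law(tau, n_min=1):
    n = np.arange(1, len(tau) + 1)
    mask = n >= n_min
    x, y = np.log(n[mask]), np.log(np.asarray(tau)[mask])
    slope, intercept = np.polyfit(x, y, 1)   # straight-line fit
    r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
    return np.exp(intercept), -slope, r**2   # A, d, r^2

A, d, r2 = fit_power_law(tau)
print(f"A = {A:.3f}, d = {d:.3f}, r^2 = {r2:.3f}")
```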

  4. Very slowly varying tails
• A power law for the type-token ratio, τ(n) ≈ A·n^(−d), says that the product n^d·τ(n) should be approximately constant, equal to A.
• A plot of n^d·τ(n) versus n shows that, typically, this is only true from some point on. The apparent downward slope from about 5,000 words on is something of an illusion due to scale: the slope of the line is approximately 0.000042, a slope of 1 in 24,000.
How can we determine a “turnover” point n*, beyond which the type-token ratio is a genuine power law?
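A sketch of the constancy check, reusing tau and d from the sketches above; the 5,000-word tail cutoff mirrors the plot and is not prescribed:

```python
# How constant is the product n^d * tau(n)?
import numpy as np

n = np.arange(1, len(tau) + 1)
product = n**d * np.asarray(tau)
tail = n >= 5000                              # assumed cutoff for "the tail"
slope = np.polyfit(n[tail], product[tail], 1)[0]
print(f"tail slope = {slope:.6f}")            # of order 1 in 24,000 here
```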

  5. n*: Regression coefficient analysis
• We plot the r² for a straight-line fit to log τ(n) versus log n for n ≥ n₀, against n₀. For the Aaronsohn text we see a local maximum for r² of 0.9975 (r = 0.9988) at n* = 4293. The corresponding least-squares value for the index d is 0.383.
• For n ≥ n*, r² ≈ 1: an almost perfect fit to a power law.
• For n ≤ n*, the type-token ratio is better described as a decreasing logarithmic function of n.
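A sketch of this scan over candidate cutoffs n₀, reusing fit_power_law from above; the step size, the scanned range, and taking the best r² in that range (rather than a formal local-maximum test) are all simplifications:

```python
# Scan cutoffs n0 and record (n0, d, r^2) for the power-law fit on n >= n0.
def scan_turnover(tau, n0_values):
    return [(n0, *fit_power_law(tau, n_min=n0)[1:]) for n0 in n0_values]

candidates = range(100, len(tau) // 2, 100)   # assumed step and range
scan = scan_turnover(tau, candidates)
n_star, d_star, r2_star = max(scan, key=lambda t: t[2])
print(f"n* = {n_star}, d = {d_star:.3f}, r^2 = {r2_star:.4f}")
```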

  6. Entropy
• The ith word has a relative frequency of occurrence π(i, n) in the first n words of a text. We regard π(i, n) as the probability of occurrence of the ith word in the first n words of the text. For this probability distribution, the Shannon entropy of the initial segment of text of length n is H(n) = −Σᵢ π(i, n) log₂ π(i, n).
• This amounts to treating each initial text segment as a self-contained text, for statistical purposes, the point being to examine how the entropy changes as the text is enlarged by the addition of a new word or a previously used word. We examined the variation of H(n) with n for a variety of literary texts.
• When a new word is added to an existing text segment, the entropy necessarily increases. When a previously used word is added to an initial segment of text, the entropy will generally rise if the word has been used rarely, but fall if it has been used often.
• How does the entropy H(n) vary with n? Empirically, we find that H(n) increases approximately logarithmically with n (r² = 0.923 for the Aaronsohn text): the average statistical “surprise” of the text, with the addition of a new or previously used word, rises approximately logarithmically with the length of the text.
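A sketch of the entropy curve, continuing from the first sketch; maintaining a running sum of c·log₂c makes the whole curve cost time linear in the text length (the base-2 logarithm is an assumption):

```python
# H(n) = log2(n) - (1/n) * sum_i c_i log2(c_i), where c_i is the count of
# word type i in the first n words; the sum s is updated as counts change.
import math
from collections import Counter

def entropy_curve(words):
    counts, H, s = Counter(), [], 0.0
    for n, w in enumerate(words, start=1):
        c = counts[w]                        # count before this occurrence
        if c:
            s -= c * math.log2(c)
        counts[w] = c + 1
        s += (c + 1) * math.log2(c + 1)
        H.append(math.log2(n) - s / n)
    return H

H = entropy_curve(words)
```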

  7. The Voynich manuscript
• The Voynich manuscript is MS 408 of the Beinecke library at Yale University. It is a still mysterious, undeciphered manuscript written using unusual symbolic forms, but apparently representing a text with linguistic structure [G. Landini, Evidence of linguistic structure in the Voynich manuscript using spectral analysis, Cryptologia, 2001]. Using the Takahashi transcription of these symbolic forms, we plotted the entropy H(n) of the first n words of the Voynich text as a function of n. As for all the other texts we examined, H(n) varies approximately logarithmically with n.
• However, there is a large block of the Voynich text, about 5,000 words beginning roughly 12,000 words into the text (approximately 16% of the total text), for which the entropy decreases. This necessarily indicates a large degree of repetition of words that have been used significantly often in the text before this point. The Voynich text becomes significantly less surprising, statistically, between 12,000 and 17,000 words.
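A sketch of one way to flag such a region, by looking for windows over which H(n) falls; the window length is an assumption, not something taken from the poster:

```python
# Positions n at which the entropy is lower than it was `window` words earlier.
import numpy as np

def decreasing_regions(H, window=500):
    H = np.asarray(H)
    dH = H[window:] - H[:-window]            # change over `window` words
    return np.where(dH < 0)[0] + window

drops = decreasing_regions(H)
if drops.size:
    print(f"entropy decreasing around words {drops.min()}..{drops.max()}")
```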

  8. The Voynich manuscript also shows unusual behavior when we plot the r² for a straight-line fit to log τ(n) versus log n for n ≥ n₀, against n₀. The shaded area corresponds approximately to the region of decreasing entropy. The successive local maxima and local minima in the plot suggest a variety of different stages of usage of new word types throughout the manuscript. A similar, but less variable, situation holds for Darwin’s Origin of Species: the dip around 53,000 words is approximately where Darwin starts Chapter 6, “Difficulties on Theory”.

  9. Distribution of log returns
• The distribution of word frequencies is highly skewed, a fact well known even before Zipf quantified it in 1936. Borrowing an idea from finance, we look at the distribution of the log returns, log π(i+1) − log π(i), where π(i) is the frequency of the ith word in the entire text.
• This distribution is typically highly symmetric, with mean close to 0, but with low kurtosis (broad shoulders), reminiscent of a modified raised cosine distribution rather than a normal distribution.
[Figures: distribution of frequencies; distribution of −log frequencies; distribution of log returns.]
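A sketch under the assumption that the log return at position i compares consecutive words of the running text, log(π(i+1)/π(i)), by analogy with log returns of consecutive prices; this reading is not spelled out on the poster:

```python
# Log returns of whole-text word frequencies along the running text.
import math
from collections import Counter

def log_returns(words):
    freq = Counter(words)                    # whole-text counts
    total = len(words)
    p = [freq[w] / total for w in words]     # pi(i) for each position i
    return [math.log(p[i + 1] / p[i]) for i in range(len(p) - 1)]

r = log_returns(words)
print(f"mean log return = {sum(r) / len(r):.4f}")  # typically close to 0
```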

References
G. K. Zipf, Human Behaviour and the Principle of Least Effort, Addison-Wesley, 1949.
G. Landini, Evidence of linguistic structure in the Voynich manuscript using spectral analysis, Cryptologia, 4, 2001.
L. L. Goncalves, L. B. Goncalves, Fractal power laws in literary English, Physica A, 360(2), 557-575, 2006.
S. I. Resnick, Heavy-Tail Phenomena, Springer, 2007.