Digital Text and Data Processing
This week's session of Digital Text and Data Processing covers the computational analysis of literary texts through text mining. It surveys several approaches, including text analysis, digital literary studies, and algorithmic criticism. Key topics include vocabulary research, tokenisation, frequency lists, stylometrics, and authorship attribution methods such as Delta. The session also addresses challenges in text analysis, such as case sensitivity and polysemous words, and introduces regular expressions for pattern matching in text data.
Text Mining Research
• This class: focus is mostly on computational analysis of literary texts
• Different names:
  • ‘Text analysis’
  • Digital Literary Studies
  • Literary informatics (Martin Mueller)
  • Algorithmic Criticism (Stephen Ramsay)
• Two approaches: research based on vocabulary and research based on data about the words
Studies based on vocabulary
• Segmentation or tokenisation
• Often based on the fact that there are spaces in between words (at least since scriptura continua was abandoned in the late 9th C.)
Source: Christopher Kelty, Abracadabra: Language, Memory, Representation
Frequency lists
• Tokens and types
• Frequency lists
• ‘Bag of words’ model: original word order is ignored

Word	Frequency
the	2782
and	1646
to	1604
of	1293
a	1152
was	950
I	902
that	799
she	776
in	733
her	698
you	652
he	628
had	606
it	518
not	510
is	489
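The steps on this slide (tokenise, count each type, ignore word order) can be sketched in a few lines of Perl. This is only a minimal illustration: the sample sentence is invented, tokens are split on whitespace alone, and lowercasing every token is an assumption rather than something the slide prescribes.

```perl
use strict;
use warnings;

# Hypothetical sample text; any string would do.
my $text = "the cat and the dog and the bird";

# Tokenise on whitespace and count how often each type occurs.
my %freq;
foreach my $token ( split( " ", $text ) ) {
    $freq{ lc $token }++;
}

# Sort types by descending frequency, then alphabetically.
my @types = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;

foreach my $type (@types) {
    print $type . "\t" . $freq{$type} . "\n";
}
```

For the sample sentence this lists "the" (3) first, then "and" (2), then "bird", "cat" and "dog" (1 each). A real text would also need punctuation stripped before counting.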
Stylometrics
• Study of style on the basis of quantitative aspects
• Analyses of differences and similarities between texts in different genres, in different periods, texts by different authors
Source: David Hoover, Textual Analysis
Vocabulary Diversity
• Type/token ratio
• Normalisation for the number of words in a text
Source: Peter Garrard, Textual Pathology
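As a rough sketch, the type/token ratio is the number of distinct words (types) divided by the total number of words (tokens). The subroutine name type_token_ratio and the sample sentence below are invented for illustration; tokenisation is by whitespace only.

```perl
use strict;
use warnings;

# Returns the type/token ratio of a string, tokenised on whitespace.
sub type_token_ratio {
    my ($text) = @_;
    my @tokens = split( " ", lc $text );
    return 0 unless @tokens;    # avoid dividing by zero on empty input

    my %types;
    $types{$_}++ foreach @tokens;

    return scalar( keys %types ) / scalar(@tokens);
}

# 5 types / 8 tokens
my $ttr = type_token_ratio("the cat and the dog and the bird");
printf "%.3f\n", $ttr;
```

The ratio tends to fall as a text gets longer, which is why the slide mentions normalisation: comparisons are usually more meaningful between samples of the same length.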
Zipf’s law
• A small number of words have a high frequency, while a large number of words are ‘hapax legomena’ (words that appear only once)
• Function words and lexical words
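Hapax legomena are easy to pick out once a frequency hash exists. The snippet below is a small sketch on an invented sentence; it assumes whitespace tokenisation and lowercasing.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = "the cat and the dog and the bird";

# Build the frequency hash.
my %freq;
$freq{$_}++ foreach split( " ", lc $text );

# Hapax legomena: types that occur exactly once.
my @hapaxes = grep { $freq{$_} == 1 } sort keys %freq;

print join( ", ", @hapaxes ) . "\n";
```

Here the function words "the" and "and" are repeated, while the three lexical words each occur once.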
Authorship attribution
• Suggesting an author for texts whose authorship is disputed
• One possible method: Delta (developed by John Burrows)
Source: John Burrows, Never Say Always Again: Reflections on the Numbers Game
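The slide does not spell out how Delta is computed. As it is commonly described, the frequencies of the most frequent words are first converted to z-scores (relative to a reference corpus), and Delta is then the mean absolute difference between the z-scores of two texts. The sketch below covers only that last step; the subroutine name delta and all z-score values are invented for illustration.

```perl
use strict;
use warnings;

# Delta between two texts, given z-scores for the same words in each.
# Both arguments are hash references: word => z-score.
sub delta {
    my ( $z1, $z2 ) = @_;
    my @words = keys %$z1;
    my $sum   = 0;
    foreach my $w (@words) {
        $sum += abs( $z1->{$w} - $z2->{$w} );
    }
    return $sum / scalar(@words);
}

# Invented z-scores for three function words (illustration only).
my %disputed  = ( the => 0.5, and => -1.0, of => 0.2 );
my %candidate = ( the => 0.1, and => -0.4, of => 0.8 );

my $d = delta( \%disputed, \%candidate );
printf "%.2f\n", $d;
```

A lower value means the two texts use these words in more similar proportions; in an attribution study the candidate with the lowest Delta is the suggested author.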
Applications
• Authorship attribution
• Formal similarities and differences between genres, literary periods, authors
• ‘Thematic summaries’ by creating lists of significant function words (e.g. inverse document frequency)
• Allusions; intertextual references
• Investigation of the structure of a book, cf. Tanya Clement’s study of Gertrude Stein’s The Making of Americans
Challenges
• Case sensitivity, e.g. ‘his’ or ‘His’
• Compound words and phrasal verbs, e.g. ‘carry out’, ‘look after’, ‘swimming pool’, ‘bus stop’
• Different spellings (diachronic and synchronic)
• Polysemous words
• ‘Reductionist’ approach
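Regular expressions
• Text patterns
• Simplest regular expression: a simple sequence of characters
• Example: /sun/
  Also matches: disunited, sunk, asunder (Sunday is matched only when the comparison is case-insensitive, because of its capital S)
• / sun / (with a space on either side)
  Does NOT match: […] the gate of the eastern sun, […] gloom beneath the noonday sun.
  In both cases ‘sun’ is followed by a punctuation mark rather than a space.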
• \b can be used in regular expressions to represent word boundaries
• If you place “i” directly after the second forward slash, the comparison will take place in a case-insensitive manner.
/\bsun\b/i matches:
  […] Points to the unrisen sun! […]
  […] Startles the dreamer, sun-like truth […]
  […] stamped upon the sun; […]
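The difference between a bare character sequence and a pattern with word boundaries and the i flag can be checked with a short script. The first test line comes from the slide; the other two are invented for illustration.

```perl
use strict;
use warnings;

my @lines = (
    "Points to the unrisen sun!",
    "They were torn asunder",      # invented example
    "On Sunday the SUN rose",      # invented example
);

my @results;
foreach my $line (@lines) {
    # Plain sequence: case-sensitive, matches inside longer words too.
    my $plain = ( $line =~ /sun/ ) ? "yes" : "no";

    # Whole word only, ignoring case.
    my $word = ( $line =~ /\bsun\b/i ) ? "yes" : "no";

    push @results, "$plain $word";
    print "$plain\t$word\t$line\n";
}
```

The first column shows whether /sun/ matched, the second whether /\bsun\b/i matched. The second line matches only the plain pattern (inside "asunder"), and the third matches only the case-insensitive whole-word pattern (on "SUN").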
Character classes
.	Any character, except the newline
\w	Any alphanumerical character: alphabetical characters, numbers and underscore
\d	Any digit
\s	White space: space, tab, newline
[..]	Any of the characters supplied within square brackets
Quantifiers
{n,m}	Pattern must occur at least n times, at most m times
{n,}	At least n times
{n}	Exactly n times
?	Is the same as {0,1}
+	Is the same as {1,}
*	Is the same as {0,}
Examples
/\d{4}/
  Matches: 1234, 2013, 1066
/[a-zA-Z]+/
  Matches any word that consists of alphabetical characters only
  Does not FULLY match: e-mail, catch22, can’t
/b[aeiou]{1,2}t\w*/
  Matches: bit, but, beat, boathouse
  Does not match: beauty, blister
  Does not FULLY match: boat-house (only ‘boat’ is matched, because \w does not include the hyphen)
Anchors
Do not match characters, but locations within strings.
\b	Word boundaries
^	Start of a line
$	End of a line
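Anchors make it possible to tell a full match from a partial one. The sketch below reuses the pattern b[aeiou]{1,2}t\w* and wraps it in ^ and $ so that it has to cover the whole string; the word list is taken from the examples, while the three labels are invented for this illustration.

```perl
use strict;
use warnings;

my @words = ( "bit", "beat", "boathouse", "beauty", "blister", "boat-house" );

my %result;
foreach my $w (@words) {
    if ( $w =~ /^b[aeiou]{1,2}t\w*$/ ) {
        # The pattern spans the string from start to end.
        $result{$w} = "full";
    }
    elsif ( $w =~ /(b[aeiou]{1,2}t\w*)/ ) {
        # Only part of the string matches; $1 holds that part.
        $result{$w} = "partial ($1)";
    }
    else {
        $result{$w} = "none";
    }
    print "$w: $result{$w}\n";
}
```

Without the anchors, "boat-house" would count as a match even though the pattern stops at the hyphen.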
Match variables
• Parentheses create substrings within a regular expression
• In Perl, the first such substring is stored in the variable $1
• Example:
    my $keyword = "computer-aided" ;
    if ( $keyword =~ /(\w+)-\w+/ ) {
        print $1 ;   # This will print "computer"
    }
• Regular expressions can be combined with the vertical bar (‘|’)
    /\bsun\b|\bstar\b|\bmoon\b/
• ‘Special characters’ need to be escaped with the backslash (‘\’)
    /\?/
    /\[/
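A short sketch of both points: alternation with the vertical bar and an escaped question mark. The sample sentence is invented, and the g flag with a list assignment (which collects every match) goes slightly beyond what the slides cover.

```perl
use strict;
use warnings;

# Invented sample sentence.
my $line = "Is the moon brighter than a star?";

# Alternation: collect every whole-word occurrence of sun, star or moon.
my @found = ( $line =~ /\b(sun|star|moon)\b/g );
print join( ", ", @found ) . "\n";

# The question mark is a quantifier, so it must be escaped to match literally.
my $is_question = ( $line =~ /\?$/ ) ? 1 : 0;
print "Question: $is_question\n";
```

Grouping the alternatives inside one pair of \b anchors is equivalent to the longer form on the slide, and a little easier to extend.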
Exercise Download “concordance.pl” and experiment with regular expressions
Recapitulation W1
• Variables begin with a dollar sign. Two types: strings and numbers
• Statements end in a semi-colon
• “use strict” has the effect that all variables need to be declared on first use with the “my” keyword
• “use warnings” means that programmers will be warned when there are errors, even when these are “non-fatal”
Operators
• Concatenation of strings with the dot
    $string1 = "Hello" ;
    $string2 = "World" ;
    $string3 = $string1 . " " . $string2 ;   # "Hello World"
• Mathematical operators:
    $sum = 5 + 1 ;
    $sum++ ;          # $sum is now 7
    $number = 2 ;
    $number += 3 ;    # $number is now 5
Three types of variables
• Scalars: a single value; start with $
• Arrays: multiple values; start with @
• Hashes: multiple values which can be referenced with ‘keys’; start with %
    my $line = "If music be the food of love, play on" ;
    my @array = split( " " , $line ) ;
    # $array[0] contains "If"
    # $array[4] contains "food"
    my %freqList ;             # a hash is declared with %, not $
    $freqList{"if"}++ ;
    $freqList{"music"}++ ;
    print $freqList{"if"} ;    # prints 1
Looping through an array
    foreach my $w ( @words ) {
        print $w ;
    }
Looping through a hash
    foreach my $w ( keys %freq ) {
        print $w . "\t" . $freq{$w} ;
    }