Digital Text and Data Processing

Digital Text and Data Processing Week 2

Text Mining Research • This class: focus is mostly on computational analysis of literary texts • Different names: • ‘Text analysis’ • Digital Literary Studies • Literary informatics (Martin Mueller) • Algorithmic Criticism (Stephen Ramsay) • Two approaches: research based on vocabulary and research based on data about the words

Studies based on vocabulary • Segmentation or tokenisation • Often based on the fact that there are spaces in between words (at least since scriptura continua was abandoned in the late 9th century) Source: Christopher Kelty, Abracadabra: Language, Memory, Representation

Frequency lists • Tokens and types • Frequency lists • ‘Bag of words’ model: original word order is ignored • Example:
the	2782
and	1646
to	1604
of	1293
a	1152
was	950
I	902
that	799
she	776
in	733
her	698
you	652
he	628
had	606
it	518
not	510
is	489
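A frequency list of this kind can be produced in a few lines of Perl. The following is a minimal sketch, not the script used to produce the figures on the slide: the sample sentence is made up, and the tokenisation rule (lowercase the text, then treat every run of the letters a-z as a token) is an assumption that would be too crude for real texts.

```perl
use strict;
use warnings;

# Hypothetical sample text; any string would do.
my $text = "The cat sat on the mat. The dog sat too.";

# Tokenise: lowercase the text and extract every run of letters.
my @tokens = lc($text) =~ /[a-z]+/g;

# Bag of words: count how often each type occurs; word order is ignored.
my %freq;
$freq{$_}++ for @tokens;

# Print the types by descending frequency, ties in alphabetical order.
foreach my $type ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
    print "$type\t$freq{$type}\n";
}
```

The sample contains 10 tokens but only 7 types, because "the" occurs three times and "sat" twice.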

Stylometrics • Study of style on the basis of quantitative aspects • Analyses of differences and similarities between texts in different genres, in different periods, texts by different authors David Hoover, Textual Analysis

Hugh Craig, Stylistic Analysis and Authorship Studies

Vocabulary Diversity • Type/token ratio • Normalisation for the number of words in a text Peter Garrard, Textual Pathology
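The type/token ratio tends to fall as a text gets longer, because frequent words keep recurring, which is why some normalisation is needed before texts of different lengths are compared. The sketch below uses one simple option, which is an assumption here and not necessarily the method used in the cited study: compute the ratio over the first N tokens only. The helper name type_token_ratio is hypothetical.

```perl
use strict;
use warnings;

# Type/token ratio over the first $limit tokens of a token list.
sub type_token_ratio {
    my ( $limit, @tokens ) = @_;
    $limit = @tokens if $limit > @tokens;    # never read past the end
    return 0 if $limit == 0;
    my %seen;
    $seen{ $tokens[$_] } = 1 for 0 .. $limit - 1;
    return scalar( keys %seen ) / $limit;
}

my @tokens = qw(the cat sat on the mat the dog sat too);
printf "TTR, all 10 tokens:  %.2f\n", type_token_ratio( 10, @tokens );
printf "TTR, first 5 tokens: %.2f\n", type_token_ratio( 5,  @tokens );
```

For this toy list the ratio is 0.70 over all ten tokens and 0.80 over the first five, which illustrates why the sample size has to be held constant.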

Zipf’s law • A small number of words have a very high frequency; a large number of words are ‘hapax legomena’ (words that appear only once) • Function words and lexical words
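Both ends of this distribution can be read off a frequency hash. The sketch below assumes a token list that has already been lowercased and tokenised (the ten-word list is invented for illustration); it ranks the types by frequency and collects the hapax legomena.

```perl
use strict;
use warnings;

my @tokens = qw(the cat sat on the mat the dog sat too);
my %freq;
$freq{$_}++ for @tokens;

# Rank-frequency list: rank 1 is the most frequent type.
my @ranked = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
my $rank = 0;
foreach my $type (@ranked) {
    $rank++;
    print "$rank\t$type\t$freq{$type}\n";
}

# Hapax legomena: types that occur exactly once.
my @hapaxes = grep { $freq{$_} == 1 } sort keys %freq;
print "Hapax legomena: @hapaxes\n";
```

Even in this tiny sample, five of the seven types are hapax legomena, while the single function word "the" accounts for three of the ten tokens.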

Authorship attribution • Suggesting an author for texts whose authorship is disputed • One possible method: Delta (developed by John Burrows) John Burrows, Never Say Always Again: Reflections on the Numbers Game

Applications • Authorship attribution • Formal similarities and differences between genres, literary periods, authors • ‘Thematic summaries’ by creating lists of significant function words (e.g. inverse document frequency) • Allusions; intertextual references • Investigation of the structure of a book, cf. Tanya Clement’s study of Gertrude Stein’s The Making of Americans
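Inverse document frequency gives a word a higher weight the fewer documents it occurs in. A minimal sketch follows, assuming the common definition idf = log( N / df ), where N is the number of documents and df the number of documents that contain the word. The three miniature "documents" are invented, and real applications usually combine this weight with the word's frequency within each text.

```perl
use strict;
use warnings;

# Three hypothetical documents, already tokenised and lowercased.
my %docs = (
    a => [qw(the sun and the moon)],
    b => [qw(the sea and the shore)],
    c => [qw(the sun on the sea)],
);

# Document frequency: in how many documents does each type occur?
my %df;
foreach my $tokens ( values %docs ) {
    my %seen;
    $seen{$_} = 1 for @$tokens;
    $df{$_}++ for keys %seen;
}

# Inverse document frequency, using the natural logarithm.
my $n = keys %docs;
my %idf;
$idf{$_} = log( $n / $df{$_} ) for keys %df;

printf "%s\t%.3f\n", $_, $idf{$_} for sort keys %idf;
```

A word that occurs in every document, such as "the", receives a weight of zero, while words confined to a single document receive the highest weight.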

Challenges • Case sensitivity, e.g. ‘his’ or ‘His’ • Compound words and phrasal verbs, e.g. ‘carry out’, ‘look after’, ‘swimming pool’, ‘bus stop’ • Different spellings (diachronic and synchronic) • Polysemous words • ‘Reductionist’ approach

Regular expressions • Text patterns • Simplest regular expression: a simple sequence of characters • Example: /sun/ matches sun, and also disunited, sunk, asunder (but not Sunday, because matching is case-sensitive by default) • / sun / (with a space on either side) does NOT match: […] the gate of the eastern sun, […] gloom beneath the noonday sun.

\b can be used in regular expressions to represent word boundaries • If you place “i” directly after the second forward slash, the comparison will take place in a case-insensitive manner • Example: /\bsun\b/i matches: […] Points to the unrisen sun! […] Startles the dreamer, sun-like truth […] stamped upon the sun; […]
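The effect of \b and the i flag can be checked with a short test script. In this sketch the first two lines are taken from the examples above and the third is invented; the helper name matches_word_sun is hypothetical.

```perl
use strict;
use warnings;

# True if the line contains "sun" as a separate word, in any case.
sub matches_word_sun {
    my ($line) = @_;
    return $line =~ /\bsun\b/i ? 1 : 0;
}

my @lines = (
    "Points to the unrisen sun!",
    "Startles the dreamer, sun-like truth",
    "On Sunday the clouds were torn asunder",    # invented example
);

foreach my $line (@lines) {
    my $anywhere = $line =~ /sun/i ? "yes" : "no";
    my $as_word  = matches_word_sun($line) ? "yes" : "no";
    print "as word: $as_word\tanywhere: $anywhere\t$line\n";
}
```

The third line is matched by /sun/i, through "Sunday" and "asunder", but not by /\bsun\b/i. Punctuation and hyphens count as word boundaries, so "sun!" and "sun-like" are both matched.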

Character classes • . Any character, except the newline • \w Any alphanumerical character: alphabetical characters, numbers and underscore • \d Any digit • \s White space: space, tab, newline • [...] Any of the characters supplied within square brackets

Quantifiers • {n,m} Pattern must occur at least n times, at most m times • {n,} At least n times • {n} Exactly n times • ? is the same as {0,1} • + is the same as {1,} • * is the same as {0,}

Examples • /\d{4}/ Matches: 1234, 2013, 1066 • /[a-zA-Z]+/ Matches any word that consists of alphabetical characters only. Does not FULLY match: e-mail, catch22, can’t • /b[aeiou]{1,2}t\w*/ Matches: bit, but, beat, boathouse. Does not match: beauty, blister. Matches only the ‘boat’ part of boat-house
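Patterns like these are easiest to understand by running them against a list of test words. The sketch below captures what /b[aeiou]{1,2}t\w*/ actually matches in each word, and then uses the anchors ^ and $ to test for a full match with /[a-zA-Z]+/. The helper name first_match and the choice of test words are assumptions made for illustration.

```perl
use strict;
use warnings;

# Return the part of $word matched by the pattern, or undef if none.
sub first_match {
    my ($word) = @_;
    return $word =~ /(b[aeiou]{1,2}t\w*)/ ? $1 : undef;
}

foreach my $w (qw(bit but beat boathouse beauty blister boat-house)) {
    my $m = first_match($w);
    print defined $m ? "$w: matched '$m'\n" : "$w: no match\n";
}

# Anchoring with ^ and $ forces the pattern to cover the whole string.
foreach my $w ( "email", "e-mail", "catch22", "can't" ) {
    my $result = $w =~ /^[a-zA-Z]+$/ ? "full match" : "no full match";
    print "$w: $result\n";
}
```

Without anchors, a pattern succeeds as soon as it matches any part of the string, which is why "boat-house" yields the partial match "boat".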

Anchors Do not match characters, but locations within strings. \b Word boundaries ^ Start of a line $ End of a line

Match variables • Parentheses create substrings within a regular expression • In Perl, the text matched by the first pair of parentheses is stored in the variable $1 • Example:
my $keyword = "computer-aided" ;
if ( $keyword =~ /(\w+)-\w+/ ) {
    print $1 ; # This will print "computer"
}

Regular expressions can be combined with the vertical bar (‘|’), e.g. /\bsun\b|\bstar\b|\bmoon\b/ • ‘Special characters’ need to be escaped with the backslash (‘\’), e.g. /\?/ and /\[/
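A short sketch of both points, using an invented test sentence. The alternation is written here as /\b(sun|star|moon)\b/, which groups the alternatives inside parentheses so that \b only has to be typed once. It matches the same words as the longer pattern above.

```perl
use strict;
use warnings;

my $line = "Is the sun a star? [Yes, and the moonlight is sunlight.]";

# Alternation: collect every occurrence of one of the three words.
my @found = $line =~ /\b(sun|star|moon)\b/g;
print "Found: @found\n";

# Escaped special characters: count literal '?' and '[' characters.
my $question_marks = () = $line =~ /\?/g;
my $open_brackets  = () = $line =~ /\[/g;
print "Question marks: $question_marks, opening brackets: $open_brackets\n";
```

Because of the word boundaries, "moonlight" and "sunlight" are not counted. Without the backslash, the ? would be read as a quantifier and the [ as the start of a character class.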

Exercise Download “concordance.pl” and experiment with regular expressions
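For readers without access to the course script, the idea of a concordance can be sketched as follows. This is not the contents of “concordance.pl”, which are not reproduced here; it is a hypothetical minimal version in which the sub name concordance, the sample text and the context width of 12 characters are all assumptions.

```perl
use strict;
use warnings;

# Return one context string per match of $pattern in $text, with up to
# $width characters of context on either side of the match.
sub concordance {
    my ( $text, $pattern, $width ) = @_;
    my @hits;
    while ( $text =~ /$pattern/g ) {
        my $start = $-[0];    # offset where the match begins
        my $end   = $+[0];    # offset just past the end of the match
        my $from  = $start > $width ? $start - $width : 0;
        push @hits, substr( $text, $from, $end - $from + $width );
    }
    return @hits;
}

my $text = "The sun rose. We watched the sun set over the sunlit bay.";
print "$_\n" for concordance( $text, qr/\bsun\b/i, 12 );
```

Replacing the qr// pattern with other regular expressions is a quick way to experiment: with /sun/i instead of /\bsun\b/i, for instance, "sunlit" is found as well.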

Recapitulation W1 • Scalar variables begin with a dollar sign. Two types of values: strings and numbers • Statements end in a semi-colon • “use strict” has the effect that all variables need to be declared on first use with the “my” keyword • “use warnings” means that programmers will be warned when there are problems, even when these are “non-fatal”

Operators • Concatenation of strings with the dot:
my $string1 = "Hello" ;
my $string2 = "World" ;
my $string3 = $string1 . " " . $string2 ;
• Mathematical operators:
my $sum = 5 + 1 ;
$sum++ ; # $sum is now 7
my $number = 2 ;
$number += 3 ; # $number is now 5

Three types of variables • Scalars: a single value; start with $ • Arrays: multiple values; start with @ • Hashes: multiple values which can be referenced with ‘keys’; start with %

my $line = "If music be the food of love, play on" ;
my @array = split( " " , $line ) ;
# $array[0] contains "If"
# $array[4] contains "food"
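Splitting on spaces leaves punctuation attached to the neighbouring word, which matters as soon as the tokens are counted. The sketch below compares the split on a space with a split on /\W+/ (one or more non-word characters). The second pattern is only one possible choice: it would also break "can't" into two tokens.

```perl
use strict;
use warnings;

my $line = "If music be the food of love, play on";

# Split on white space: the comma stays attached to "love".
my @array = split( " ", $line );
print "$array[6]\n";    # prints "love,"

# Split on runs of non-word characters: the comma is discarded.
my @words = split( /\W+/, $line );
print "$words[6]\n";    # prints "love"
```

Both splits produce nine elements here, but only the second would count "love," and "love" as the same type.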

my %freqList ;
$freqList{"if"}++ ;
$freqList{"music"}++ ;
print $freqList{"if"} ; # prints 1

Looping through an array
foreach my $w ( @words ) {
    print $w . "\n" ;
}
Looping through a hash
foreach my $w ( keys %freq ) {
    print $w . "\t" . $freq{$w} . "\n" ;
}
