Text Mining Presidential Speeches – Brand Management

Timothy D’Auria
BostonDecision.com
May 14th, 2012
Disclaimer: Boston Decision believes the information contained herein to be accurate. However, Boston Decision, LLC makes no guarantees and no warranties, written, oral or implied, including without limitation any implied warranties of merchantability, fitness, or accuracy. Recipient assumes all responsibility for use of the information contained herein.

Boston Decision MINE – PREDICT – AUTOMATE Provide the skills, resources, expertise

How It All Started

Thought… Track how a candidate changes positions over time.

How to Do It

Sentiment Analysis
Score each statement on a scale from -1 (negative) to +1 (positive)

Standard Deviation of Sentiment
Average Sentiment = (+1 - 1) / 2 = 0
Squared Difference of +1 from Mean = (+1 - 0)^2 = 1
Squared Difference of -1 from Mean = (-1 - 0)^2 = 1
Standard Deviation = sqrt((1 + 1) / (2 - 1)) = sqrt(2) ≈ 1.41
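The arithmetic above can be checked in R. This is a minimal sketch, assuming two hypothetical sentiment scores of +1 and -1 on the same issue. Base R's sd() uses the sample (n - 1) denominator, which matches the denominator in the calculation above.

```r
# Two hypothetical sentiment scores for one issue: one positive, one negative
sentiment <- c(1, -1)

avg <- mean(sentiment)                  # (+1 - 1) / 2 = 0
sq_diff <- (sentiment - avg)^2          # squared differences: 1 and 1

# Sample standard deviation computed by hand (denominator n - 1)
manual_sd <- sqrt(sum(sq_diff) / (length(sentiment) - 1))

# Base R's sd() uses the same n - 1 denominator
builtin_sd <- sd(sentiment)

print(round(manual_sd, 2))
```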

Sentiment Analysis

The Flip-Flop Score FFScore = sum of sentiment standard deviations across pertinent issues.
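One way the score might be computed is sketched below. The issue names, the sentiment values, and the helper name ff_score are all hypothetical; the talk does not show its implementation.

```r
# Hypothetical sentiment scores per issue (illustrative values, not from the talk)
sentiment_by_issue <- list(
  economy = c(0.5, 0.5, 0.5),  # consistent position, standard deviation 0
  energy  = c(1, -1),          # full reversal, standard deviation sqrt(2)
  health  = c(0.2, 0.4)        # small drift
)

# Flip-Flop Score: sum of the per-issue standard deviations of sentiment
ff_score <- function(scores) {
  sum(vapply(scores, sd, numeric(1)))
}

ff <- ff_score(sentiment_by_issue)
print(ff)
```

A candidate who never changes tone on any issue scores 0; each reversal pushes the score up.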

The Problem
Flip-flops are rare
Flip-flops are rarely clean-cut

Finding Candidates are fairly consistent in message

Candidate Fingerprint Given a random speech where we are unsure who said it…

Prediction Can we predict who said it? Accuracy?

Why bother?
Automatic brand consistency
Plagiarism detection
With a simple change to the text used…
Predict effective campaign messaging
Predict profitable content
Identify comments indicating readiness to buy
Optimal keyword selection

Speech Sources http://obamaspeeches.com/ http://mittromneycentral.com/speeches/

The Technology
R (free)
Packages: tm, wordcloud, kernlab, plyr, class, Snowball
RStudio (free)

Stored the Speeches

Speeches

Define Speech Directories in R
candidates <- c("romney", "obama")
pathname <- "C:/Users/tdauria/Google Drive/meetups/05/speeches"

Create A Corpus A corpus is a container for documents

What is a document? A document contains text information

What is a document?
A text file
A paragraph
A sentence
Etc.

Speech Corpus
1 corpus per candidate
Each document is a single speech
s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))

Clean the Corpus
What is the value of text?
Upper vs. lower case
The, A, An, This, That
Telephone, phone, phones

Cleanup Function
cleanCorpus <- function(corpus) {
  # Apply Text Mining Clean-up Functions
  corpus.tmp <- tm_map(corpus, convertPrettyApostrophe)
  corpus.tmp <- tm_map(corpus.tmp, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # Note: tm 0.6 and later need content_transformer(tolower) here
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp <- tm_map(corpus.tmp, stemDocument, language = "english")
  return(corpus.tmp)
}

Remove Punctuation
corpus.tmp <- tm_map(corpus.tmp, removePunctuation)

To Lowercase
corpus.tmp <- tm_map(corpus.tmp, tolower)

Remove English Stopwords
corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
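To see what these three steps do to a piece of text, here is a base R sketch that applies them to a single string. The sample sentence and the short stop-word list are hypothetical; tm's stopwords("english") list is much longer, and the talk's code operates on a whole corpus rather than one string.

```r
# Hypothetical sample sentence and a tiny stop-word list for illustration
speech <- "The economy is growing, and THIS is the time for change!"
stop_words <- c("the", "a", "an", "this", "that", "is", "and", "for")

no_punct <- gsub("[[:punct:]]", "", speech)  # remove punctuation
lower    <- tolower(no_punct)                # convert to lowercase
words    <- strsplit(lower, "\\s+")[[1]]     # split on whitespace
kept     <- words[!words %in% stop_words]    # drop stop words

cleaned <- paste(kept, collapse = " ")
print(cleaned)
```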

Term Document Matrix

Sparse Terms
Most terms are used infrequently and add little value to the analysis, so remove them.
s.tdm <- removeSparseTerms(s.tdm, 0.7)
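The idea behind a term-document matrix and sparse-term removal can be sketched in base R. The four "speeches" below are hypothetical, and the filter only approximates what removeSparseTerms does: it keeps terms that are absent from fewer than 70% of documents.

```r
# Four hypothetical, already-cleaned "speeches"
docs <- c(
  d1 = "economy jobs freedom",
  d2 = "economy hope change",
  d3 = "economy jobs hope",
  d4 = "economy hope jobs"
)

tokens <- strsplit(docs, " ")
terms  <- sort(unique(unlist(tokens)))

# Term-document matrix: rows are terms, columns are documents, cells are counts
tdm <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms

# Sparsity of a term = share of documents in which it does not occur
sparsity <- rowMeans(tdm == 0)

# Keep terms whose sparsity is below the 0.7 threshold
dense_tdm <- tdm[sparsity < 0.7, , drop = FALSE]
print(dense_tdm)
```

Here "change" and "freedom" each appear in only one of four documents (sparsity 0.75), so they are dropped.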

Word Cloud
A visual tool to explore frequency of word usage
wordcloud(term, freq)
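The two inputs to a word cloud are a vector of terms and a matching vector of frequencies. A minimal sketch of deriving them from a term-document matrix, using hypothetical counts; the plotting call is left commented out because it needs the wordcloud package.

```r
# Hypothetical term-document counts (rows = terms, columns = speeches)
tdm <- matrix(
  c(5, 7, 6,
    2, 0, 1,
    0, 3, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("economy", "freedom", "hope"), c("s1", "s2", "s3"))
)

# Total frequency of each term across all speeches, most frequent first
freq <- sort(rowSums(tdm), decreasing = TRUE)
term <- names(freq)

print(data.frame(term = term, freq = freq, row.names = NULL))

# With the wordcloud package installed, these two vectors are its inputs:
# wordcloud::wordcloud(term, freq)
```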

Obama Word Cloud

Romney Word Cloud

Relationship between concepts

Romney

Romney Take leadership Free economy Business opportunity Obama takes away freedom

Obama Caring for people Time for change Hope

Romney vs. Obama
Obama themes are broader (height of term on the dendrogram)
Romney themes are more business-oriented
Obama more personal-oriented
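The talk does not say how its dendrograms were built. One common approach, sketched here as an assumption, is hierarchical clustering of terms by how similarly they are used across speeches. The counts are hypothetical.

```r
# Hypothetical term counts (rows = terms, columns = speeches)
tdm <- matrix(
  c(4, 5, 4, 0,
    5, 4, 5, 1,
    0, 1, 0, 6,
    1, 0, 1, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("business", "economy", "hope", "change"),
                  paste0("speech", 1:4))
)

# Euclidean distance between term usage profiles, then complete-linkage clustering
term_dist <- dist(tdm)
fit <- hclust(term_dist, method = "complete")

# Cut the tree into two groups of related terms
groups <- cutree(fit, k = 2)
print(groups)

# plot(fit) draws the dendrogram; merge heights are stored in fit$height
```

Terms that merge low on the tree are used together across speeches; terms that only join near the top are used in different contexts.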

Concept Framing How does each candidate frame a topic in terms of other topics? Daniel P. Parker, U. Penn

Term Associations
Economy
Energy
Health
Military
findAssocs(tdm[[1]][[2]], 'economy', 0.50)
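A term association of this kind is a correlation between one term's counts and every other term's counts across documents. A base R sketch of that idea, using hypothetical counts and the same 0.50 lower limit:

```r
# Hypothetical counts: rows = speeches, columns = terms
dtm <- matrix(
  c(3, 3, 0,
    1, 2, 2,
    4, 5, 1,
    0, 1, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("speech", 1:4), c("economy", "jobs", "hope"))
)

# Correlation of "economy" with every other term across speeches
others <- dtm[, colnames(dtm) != "economy", drop = FALSE]
assoc <- cor(dtm[, "economy"], others)[1, ]

# Keep terms at or above the 0.50 limit, strongest first
strong <- sort(assoc[assoc >= 0.50], decreasing = TRUE)
print(round(strong, 2))
```

In this toy data "jobs" rises and falls with "economy" and is kept, while "hope" moves in the opposite direction and is filtered out.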

Economy

Energy

Health

Military

Create predictive model Input a speech Output a name

Term Document Matrix

Hypothesis
Candidates have unique linguistic patterns
These patterns can serve as a fingerprint

Predict an unknown

K-Nearest Neighbor Algorithm
Which past speech most closely matches the speech we are trying to identify?
[Figure: an unknown speech (?) among speeches labeled Romney and Obama]

K-Nearest Neighbor Algorithm
Closeness is measured by plotting each term by its frequency.
[Figure: an unknown speech (?) among speeches labeled Romney and Obama]
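To make "closeness" concrete, here is a base R sketch of the nearest-neighbor idea with k = 1. It assumes Euclidean distance over term frequencies; the two terms, the counts, and the labels are hypothetical.

```r
# Hypothetical term frequencies per past speech, with known speakers
train <- matrix(
  c(6, 1,
    5, 2,
    1, 6,
    2, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("business", "hope"))
)
speaker <- c("Romney", "Romney", "Obama", "Obama")

# The speech we are trying to identify
unknown <- c(business = 1, hope = 7)

# Euclidean distance from the unknown speech to each past speech
distances <- sqrt(rowSums(sweep(train, 2, unknown)^2))

# The closest past speech supplies the predicted speaker
nearest <- which.min(distances)
predicted <- speaker[nearest]
print(predicted)
```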

K-Nearest Neighbor Algorithm
K-Nearest Neighbor is one of hundreds of possible modeling approaches
Fast
Simple
Easy to conceptualize
Accurate?

Some manipulation s.mat <- t(as.matrix(tdm[["tdm"]]))

“Hold-Out Sample – Testing”
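A hold-out sample sets some speeches aside so the model is tested on data it has not seen. A minimal sketch of such a split; the 20 labels, the 70/30 ratio, and the seed are assumptions, since the talk does not state how its split was made.

```r
set.seed(42)  # arbitrary seed, only to make the split reproducible

# Hypothetical labels for 20 speeches
speaker <- rep(c("romney", "obama"), each = 10)
n <- length(speaker)

# Randomly pick 70% of the rows for training; the rest are held out for testing
train_idx <- sample(n, size = floor(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

print(length(train_idx))
print(length(test_idx))
```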

Run Model
knn(training data, test data, training answers)
Runs in microseconds
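The three arguments map onto the train, test, and cl parameters of knn() in the class package, a recommended package bundled with most R installations. A small sketch with hypothetical term frequencies; k = 1 is an assumption, as the talk does not state the value used.

```r
library(class)  # provides knn()

# Hypothetical training data: term frequencies for six past speeches
train <- matrix(
  c(6, 1,
    5, 2,
    7, 1,
    1, 6,
    2, 5,
    1, 7),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("business", "hope"))
)
train_answers <- factor(c("romney", "romney", "romney", "obama", "obama", "obama"))

# Hypothetical test data: two speeches whose speaker we pretend not to know
test <- matrix(
  c(6, 2,
    1, 5),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("business", "hope"))
)

# knn(training data, test data, training answers)
pred <- knn(train = train, test = test, cl = train_answers, k = 1)
print(pred)
```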

The Results

Confusion Matrix
[Table: ACTUAL speaker vs. PREDICTION]

Accuracy
[Table: ACTUAL speaker vs. PREDICTION]
Accuracy = sum of diagonal over n = 19 / 19 = 100%
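The same calculation can be done in R with table() and diag(). The labels below are hypothetical and deliberately include one error, so they do not reproduce the 19 / 19 result from the talk.

```r
# Hypothetical actual and predicted speakers for six held-out speeches
actual <- factor(c("obama", "obama", "romney", "romney", "romney", "obama"))
predicted <- factor(
  c("obama", "obama", "romney", "obama", "romney", "obama"),
  levels = levels(actual)
)

# Confusion matrix: predictions in rows, actual speakers in columns
confusion <- table(PREDICTION = predicted, ACTUAL = actual)
print(confusion)

# Accuracy = sum of the diagonal (correct predictions) over n
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```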

Validation
Resample new test cases and repeat model
Average accuracy results
Average accuracy = 95%
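The resample-and-average loop might look like the sketch below. The data is simulated, the helper names predict_1nn and one_run are hypothetical, and the number of repeats (25) and test-set size (6) are assumptions. A base R 1-nearest-neighbor stands in for the model so the example has no package dependencies.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data: 10 speeches per speaker, 2 term frequencies each
x <- rbind(
  matrix(rnorm(20, mean = 5), ncol = 2),
  matrix(rnorm(20, mean = 0), ncol = 2)
)
y <- rep(c("romney", "obama"), each = 10)

# 1-nearest-neighbor prediction written in base R
predict_1nn <- function(train_x, train_y, test_x) {
  apply(test_x, 1, function(row) {
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    train_y[which.min(d)]
  })
}

# One round: draw a new random test set, fit on the rest, return accuracy
one_run <- function() {
  test_idx <- sample(nrow(x), 6)
  pred <- predict_1nn(
    x[-test_idx, , drop = FALSE],
    y[-test_idx],
    x[test_idx, , drop = FALSE]
  )
  mean(pred == y[test_idx])
}

# Repeat with fresh test cases and average the accuracy results
accuracies <- replicate(25, one_run())
avg_accuracy <- mean(accuracies)
print(avg_accuracy)
```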

Score Algorithm
Created a program that takes a speech as input and outputs the speaker.
Accepts a file or URL
scoreSpeech(new speech, knn.train.data)

Take Aways
Data is all around us.
Shift towards unstructured data (80%)
Automation
Any business, any industry, any data

Next Event
End of May / Early June
Profits from Data Mining
Tim D’Auria
tdauria@bostondecision.com
Boston Decision, LLC
Automate & Predict Business
http://www.bostondecision.com
