Text Mining Presidential Speeches – Brand Management
Timothy D’Auria BostonDecision.com May 14th, 2012 Disclaimer: Boston Decision believes the information contained herein to be accurate. However, Boston Decision, LLC makes no guarantees and no warranties, written, oral or implied, including without limitation any implied warranties of merchantability, fitness, or accuracy. Recipient assumes all responsibility for use of the information contained herein.
Boston Decision MINE – PREDICT – AUTOMATE Provide the skills, resources, expertise
How It All Started
Thought… Track how a candidate changes positions over time.
How to Do It
Sentiment Analysis -1 +1
Standard Deviation of Sentiment: Average Sentiment = (+1 + (-1)) / 2 = 0. Squared Difference of +1 from Mean = (+1 – 0)^2 = 1. Squared Difference of -1 from Mean = (-1 – 0)^2 = 1. Standard Deviation (sample form, dividing by n – 1 = 1) = sqrt((1 + 1) / 1) = sqrt(2) ≈ 1.41
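The arithmetic on this slide can be checked in a few lines of R. This is a minimal sketch; the variable names are illustrative and not from the original deck.

```r
# Two sentiment scores for one issue: one fully negative, one fully positive.
sentiment <- c(-1, 1)

avg     <- mean(sentiment)           # average sentiment
sq_diff <- (sentiment - avg)^2       # squared differences from the mean

# Sample standard deviation: divide by n - 1, which is 1 for two scores.
sd_manual  <- sqrt(sum(sq_diff) / (length(sentiment) - 1))
sd_builtin <- sd(sentiment)          # base R uses the same n - 1 divisor

print(c(manual = sd_manual, builtin = sd_builtin))
```

Both values equal sqrt(2), matching the 1.41 on the slide.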
The Flip-Flop Score FFScore = sum of sentiment standard deviations across pertinent issues.
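As a rough illustration of the FFScore definition, the sketch below sums per-issue standard deviations. The sentiment values and issue names are made up for the example, and the slide does not say how "pertinent issues" were chosen.

```r
# Hypothetical sentiment scores (-1 to +1), one row per statement.
statements <- data.frame(
  issue     = c("economy", "economy", "health", "health", "energy", "energy"),
  sentiment = c(-1, 1, 0.5, 0.5, 0.2, 0.6)
)

# Standard deviation of sentiment within each issue.
issue_sd <- tapply(statements$sentiment, statements$issue, sd)

# FFScore: the sum of the per-issue standard deviations.
ff_score <- sum(issue_sd)

print(issue_sd)
print(ff_score)
```

An issue with perfectly consistent sentiment (health here) contributes zero, so only shifts in position raise the score.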
The Problem: Flip-flops are rare, and they are rarely clean-cut.
Finding Candidates are fairly consistent in message
Candidate Fingerprint Given a random speech where we are unsure who said it…
Prediction Can we predict who said it? Accuracy?
Why bother? Automatic brand consistency; plagiarism detection. With a simple change to the text used… predict effective campaign messaging; predict profitable content; identify comments indicating readiness to buy; optimal keyword selection.
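To Lowercase corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower)) (tm 0.6 and later need the content_transformer() wrapper around base functions such as tolower)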
Remove English Stopwords corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
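To see what the two cleaning steps do without loading tm, here is a base R sketch of the same idea. The sample sentences, the short stopword list, and the helper name clean_speech are all assumptions for illustration; tm's stopwords("english") list is much longer.

```r
# Hypothetical speech fragments; the real corpus would hold full transcripts.
speeches <- c("We will create JOBS and grow the Economy",
              "It is time for Change and for Hope")

# A tiny illustrative stopword list, not tm's list.
stops <- c("we", "will", "and", "the", "it", "is", "for")

clean_speech <- function(text, stopwords) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]  # lowercase, split on non-letters
  words <- words[nzchar(words)]                     # drop empty tokens
  words[!words %in% stopwords]                      # remove stopwords
}

cleaned <- lapply(speeches, clean_speech, stopwords = stops)
print(cleaned)
```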
Term Document Matrix
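A term document matrix has one row per term, one column per document, and term counts in the cells. The base R sketch below builds one by hand from hypothetical tokenized speeches; in tm the equivalent step is TermDocumentMatrix() applied to the cleaned corpus.

```r
# Hypothetical cleaned, tokenized documents (one character vector per speech).
docs <- list(
  obama1  = c("hope", "change", "people", "hope"),
  romney1 = c("business", "economy", "freedom"),
  obama2  = c("change", "people", "economy")
)

terms <- sort(unique(unlist(docs)))

# Count how often each vocabulary term occurs in each document.
tdm <- vapply(
  docs,
  function(words) tabulate(match(words, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(tdm) <- terms

print(tdm)
```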
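Sparse Terms Most terms are used infrequently and won’t add value to the analysis, so remove them. s.tdm <- removeSparseTerms(s.tdm, 0.7) (keeps only terms whose sparsity is below 0.7, i.e. terms that appear in more than 30% of the speeches)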
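The effect of the 0.7 threshold can be mimicked in base R. This sketch approximates what removeSparseTerms() does on a small made-up matrix; the counts and term names are hypothetical.

```r
# Hypothetical term-document counts: rows are terms, columns are speeches.
tdm <- matrix(c(2, 1, 3, 1,
                0, 0, 1, 0,
                1, 0, 2, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("economy", "zeppelin", "jobs"),
                              paste0("speech", 1:4)))

# Sparsity of a term: the share of documents in which it never appears.
sparsity <- rowMeans(tdm == 0)

# Keep only terms whose sparsity is below the threshold used on the slide.
dense_tdm <- tdm[sparsity < 0.7, , drop = FALSE]

print(sparsity)
print(dense_tdm)
```

Here "zeppelin" is missing from three of four speeches (sparsity 0.75), so it is dropped.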
Word Cloud A visual tool to explore frequency of word usage wordcloud(term, freq)
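The term and freq arguments can be derived from the term document matrix. The sketch below uses made-up counts and leaves the wordcloud() call commented out, since it needs the separate wordcloud package.

```r
# Hypothetical term-document counts for two speeches.
tdm <- matrix(c(5, 3,
                1, 0,
                2, 4),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("economy", "jobs", "freedom"),
                              c("speech1", "speech2")))

# Total frequency of each term across all speeches, most frequent first.
freq <- sort(rowSums(tdm), decreasing = TRUE)
term <- names(freq)

print(data.frame(term = term, freq = unname(freq)))

# With the wordcloud package installed, the slide's call would then be:
# wordcloud::wordcloud(term, freq)
```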
Obama Word Cloud
Romney Word Cloud
Relationship between concepts
Romney Take leadership Free economy Business opportunity Obama takes away freedom
Obama Caring for people Time for change Hope
Romney vs. Obama: Obama themes are broader (height of term on the dendrogram). Romney themes are more business-oriented; Obama themes are more personal-oriented.
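A dendrogram of terms like the ones discussed here can be produced with hierarchical clustering. The slides do not state which distance or linkage was used, so this sketch assumes Euclidean distance and R's default complete linkage on a made-up matrix.

```r
# Hypothetical term-document counts: rows are terms, columns are speeches.
tdm <- matrix(c(4, 0, 3, 0,
                3, 0, 4, 1,
                0, 5, 0, 4,
                1, 4, 0, 5),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("hope", "change", "business", "economy"),
                              c("obama1", "romney1", "obama2", "romney2")))

# Distance between terms, based on how they are used across speeches.
term_dist <- dist(tdm)

# Hierarchical clustering; terms that merge at a low height are used similarly.
fit <- hclust(term_dist)

# Cut the tree into two groups of related terms.
groups <- cutree(fit, k = 2)
print(groups)

# plot(fit) would draw the dendrogram.
```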
Concept Framing How does each candidate frame a topic in terms of other topics? Daniel P. Parker, U. Penn
Term Associations Economy Energy Health Military findAssocs(tdm[][], 'economy', 0.50)
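findAssocs() reports terms whose usage correlates with a given term at or above a limit. The base R sketch below shows the same idea with cor() on a small made-up matrix; the counts are hypothetical and the 0.50 limit is the one from the slide.

```r
# Hypothetical counts: rows are speeches, columns are terms.
dtm <- matrix(c(3, 2, 0,
                1, 1, 2,
                4, 3, 1,
                0, 0, 3),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("speech", 1:4),
                              c("economy", "jobs", "military")))

# Correlation of "economy" with every other term across speeches.
others <- dtm[, colnames(dtm) != "economy", drop = FALSE]
assoc  <- cor(dtm[, "economy"], others)[1, ]

# Keep associations at or above the limit.
strong <- assoc[assoc >= 0.50]

print(round(assoc, 2))
print(names(strong))
```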
Create predictive model Input a speech Output a name
Term Document Matrix
Hypothesis Candidates have unique linguistic patterns These patterns can serve as a fingerprint
Predict an unknown
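K-Nearest Neighbor Algorithm Which past speech most closely matches the speech we are trying to identify? [Figure: past speeches plotted as points labeled Romney or Obama, with the unknown speech marked "?"]
K-Nearest Neighbor Algorithm Closeness is measured by treating each term's frequency as a coordinate, so every speech becomes a point and nearby points are similar speeches. [Figure: the same plot of Romney and Obama speeches around the unknown "?"]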
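The notion of closeness can be made concrete with two terms as axes. This sketch assumes Euclidean distance, which the slides do not name explicitly, and uses made-up counts.

```r
# Hypothetical term frequencies for four labeled speeches.
train <- matrix(c(5, 1,
                  4, 0,
                  1, 6,
                  0, 5),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("obama1", "obama2", "romney1", "romney2"),
                                c("hope", "business")))
labels <- c("Obama", "Obama", "Romney", "Romney")

# The unknown speech, described by the same two term frequencies.
unknown <- c(hope = 1, business = 4)

# Euclidean distance from the unknown speech to every labeled speech.
distances <- sqrt(rowSums(sweep(train, 2, unknown)^2))

# The label of the closest speech is the prediction (k = 1).
nearest <- labels[which.min(distances)]

print(round(distances, 2))
print(nearest)
```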
K-Nearest Neighbor Algorithm K-Nearest Neighbor is one of hundreds of possible modeling approaches Fast Simple Easy to conceptualize Accurate?
Some manipulation s.mat <- t(as.matrix(tdm[["tdm"]]))
“Hold-Out Sample – Testing”
Run Model knn(training data, test data, training answers) (the argument order matches train, test, cl in knn() from the class package) Runs in microseconds
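A runnable version of this call, assuming the knn() function from the class package (whose train, test, cl arguments match the slide), might look like the sketch below. The data are made up and k = 1 is an assumption; the slides do not give the value of k.

```r
library(class)  # knn(); class ships with standard R distributions

# Hypothetical training data: rows are speeches, columns are term frequencies.
train <- matrix(c(5, 1,
                  3, 0,
                  6, 3,
                  1, 6,
                  0, 4,
                  2, 8),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("hope", "business")))
train_labels <- factor(c("Obama", "Obama", "Obama", "Romney", "Romney", "Romney"))

# Hypothetical hold-out speeches with the same columns.
test <- matrix(c(5, 0,
                 1, 5),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("hope", "business")))

# knn(training data, test data, training answers)
pred <- knn(train = train, test = test, cl = train_labels, k = 1)
print(pred)
```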
Confusion Matrix ACTUAL PREDICTION
Accuracy ACTUAL PREDICTION Accuracy = sum of diagonal over n = 19 / 19 = 100%
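The confusion matrix and the "sum of diagonal over n" calculation can be reproduced with table(). The labels below are hypothetical and deliberately include one error, so the result differs from the 19 / 19 on the slide.

```r
# Hypothetical actual and predicted speakers for six hold-out speeches.
actual    <- factor(c("Obama", "Obama", "Romney", "Romney", "Obama", "Romney"))
predicted <- factor(c("Obama", "Obama", "Romney", "Obama", "Obama", "Romney"),
                    levels = levels(actual))

# Rows are predictions, columns are actual speakers.
confusion <- table(PREDICTION = predicted, ACTUAL = actual)

# Accuracy = sum of the diagonal (correct predictions) over n.
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```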
Validation: Resample new test cases and repeat the model. Average the accuracy results. Average accuracy = 95%
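The resample-and-average procedure can be sketched as repeated random hold-out splits. Everything here is an assumption for illustration: the synthetic data, the 30% test fraction, the 20 repetitions, k = 1, and the helper name holdout_accuracy. The averaged result will not match the 95% reported on the slide.

```r
library(class)  # knn()

set.seed(42)
n <- 30

# Synthetic term counts: one group favors the first term, the other the second.
features <- rbind(
  matrix(rpois(n * 2, lambda = rep(c(6, 1), each = n)), ncol = 2),
  matrix(rpois(n * 2, lambda = rep(c(1, 6), each = n)), ncol = 2)
)
labels <- factor(rep(c("Obama", "Romney"), each = n))

# One round: hold out a random test sample, fit on the rest, score accuracy.
holdout_accuracy <- function(features, labels, test_fraction = 0.3) {
  test_idx <- sample(nrow(features), size = round(test_fraction * nrow(features)))
  pred <- knn(train = features[-test_idx, , drop = FALSE],
              test  = features[test_idx, , drop = FALSE],
              cl    = labels[-test_idx],
              k     = 1)
  mean(pred == labels[test_idx])
}

# Repeat with fresh test samples and average the accuracy results.
accuracies   <- replicate(20, holdout_accuracy(features, labels))
avg_accuracy <- mean(accuracies)
print(avg_accuracy)
```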
Score Algorithm Created program where you feed in a speech, and it will output the speaker. Accepts a file or URL scoreSpeech(new speech, knn.train.data)
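The original scoreSpeech() is not shown in the deck, so the following is only a guess at its general shape. The name score_speech_sketch, its arguments, and the data are hypothetical. It takes the speech text directly; reading from a file or URL could be added with readLines().

```r
library(class)  # knn()

# Hypothetical scorer: count vocabulary terms in a new speech, then run knn.
score_speech_sketch <- function(text, train, train_labels, k = 1) {
  words  <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Words outside the training vocabulary match to NA and are ignored.
  counts <- tabulate(match(words, colnames(train)), nbins = ncol(train))
  new_row <- matrix(counts, nrow = 1, dimnames = list(NULL, colnames(train)))
  as.character(knn(train = train, test = new_row, cl = train_labels, k = k))
}

# Hypothetical training data: rows are past speeches, columns are terms.
train <- matrix(c(4, 3, 0, 1,
                  3, 4, 1, 0,
                  0, 1, 5, 4,
                  1, 0, 4, 5),
                nrow = 4, byrow = TRUE,
                dimnames = list(NULL, c("hope", "change", "business", "economy")))
train_labels <- factor(c("Obama", "Obama", "Romney", "Romney"))

guess1 <- score_speech_sketch("Hope and change. Hope for the people!",
                              train, train_labels)
guess2 <- score_speech_sketch("Business and the economy need free enterprise; business first",
                              train, train_labels)
print(c(guess1, guess2))
```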
Take Aways Data is all around us. Shift towards unstructured data (80%) Automation Any business, any industry, any data
End of May / Early June: Profits from Data Mining. Tim D’Auria, firstname.lastname@example.org. Boston Decision, LLC: Automate & Predict Business. http://www.bostondecision.com