
Criticism Mining: Text Mining Experiments on Book, Movie and Music Reviews


Presentation Transcript


  1. THE ANDREW W. MELLON FOUNDATION Criticism Mining: Text Mining Experiments on Book, Movie and Music Reviews Xiao Hu, J. Stephen Downie, M. Cameron Jones The International Music Information Retrieval Systems Evaluation Lab (IMIRSEL) University of Illinois at Urbana-Champaign

  2. Agenda • Motivation • Customer reviews in epinions.com • Experimental Setup • Data set • Results • Conclusions & Future Work

  3. Motivation • Critical consumer-generated reviews of humanities materials • a rich resource of reviewers’ opinions and background/contextual information • self-organized, which paves the way for automatic processing • Text mining: mature and ready to use • Criticism mining: provides a tool to assist humanities scholars in • locating • organizing • analyzing critical review content

  4. Customer Reviews • Published on www.epinions.com • Focused on the book, movie and music categories • Each review is associated with: • a genre label • a numerical quality rating

  5. The numerical rating associated with each review was used in our experiments

  6. Music Genres • 28 major genre categories: Jazz, Rock, Country, Classical, Blues, Gospel, Punk, … • Renaissance, Medieval, Baroque, Romantic, …

  7. Experimental Setup • To build and evaluate a prototype criticism mining system that could automatically: • predict the genre of the work being reviewed • predict the quality rating assigned to the reviewed item • differentiate book reviews and movie reviews, especially for items in the same genre • differentiate fiction and non-fiction book reviews

  8. Data set

  9. Genre Taxonomy

  10. Genre Taxonomy : Book

  11. Genre Taxonomy : Music • The genre labels and the rating information provided the ground truth for experiments

  12. Data Preprocessing • HTML tags were stripped out; • Stop words were NOT stripped out; • Punctuation was NOT stripped out; • They may contain stylistic information • Tokens were stemmed
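The preprocessing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the actual T2K pipeline; `naive_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's:

```python
import re

def strip_html(text):
    """Remove HTML tags but keep the text (and its punctuation) intact."""
    return re.sub(r"<[^>]+>", " ", text)

def naive_stem(token):
    """A toy suffix-stripping stemmer standing in for a real one (e.g. Porter)."""
    for suffix in ("ing", "edly", "ed", "ly", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(review_html):
    """Per the slide: strip HTML, keep stop words and punctuation
    (they may carry stylistic information), and stem the word tokens."""
    text = strip_html(review_html).lower()
    # Tokenize on letters; punctuation marks become tokens of their own.
    tokens = re.findall(r"[a-z]+|[!?.,;:]", text)
    return [naive_stem(t) if t.isalpha() else t for t in tokens]
```

Note that, unlike most text-classification pipelines, stop words and punctuation deliberately survive preprocessing here, since they can act as stylistic features.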

  13. Categorization Model & Implementation • Naïve Bayesian (NB) Classifier • Computationally efficient • Empirically effective • Text-to-Knowledge (T2K) Toolkit • A text mining framework • Ready-to-use modules and itineraries • Natural Language Processing tools integrated • Supporting fast prototyping of text mining
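A multinomial Naïve Bayes classifier with add-one smoothing can be sketched as below, to illustrate the model the T2K itinerary wraps. This is not the T2K API; the class and method names are invented for illustration:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.label_counts = Counter(labels)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.label_counts.values())
        v = len(self.vocab)

        def log_score(label):
            # log P(label) + sum over tokens of log P(token | label)
            total = sum(self.word_counts[label].values())
            prior = math.log(self.label_counts[label] / n_docs)
            return prior + sum(
                math.log((self.word_counts[label][t] + 1) / (total + v))
                for t in tokens
            )

        return max(self.label_counts, key=log_score)
```

Working in log space keeps the products of many small per-token probabilities from underflowing, which is part of why NB stays computationally efficient on large review collections.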

  14. NB itinerary in T2K: Data Preprocessing → NB Classifier

  15. Results & Discussions

  16. Genre Classification • 5-fold random cross validation for book and movie reviews • 3-fold random cross validation for music reviews
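The random k-fold protocol can be sketched as below. This is an assumed reconstruction, not the evaluation code used in the experiments; `train_and_eval` is a hypothetical caller-supplied function that trains a classifier on the training pairs and returns accuracy on the held-out fold:

```python
import random

def k_fold_accuracy(docs, labels, k, train_and_eval, seed=0):
    """Random k-fold cross validation: shuffle, split into k folds,
    train on k-1 folds, test on the held-out fold, average the accuracies."""
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k disjoint folds
    accuracies = []
    for i in range(k):
        test_idx = set(folds[i])
        train = [(docs[j], labels[j]) for j in indices if j not in test_idx]
        test = [(docs[j], labels[j]) for j in folds[i]]
        accuracies.append(train_and_eval(train, test))
    return sum(accuracies) / k
```

For example, `k_fold_accuracy(docs, labels, 5, run_nb)` would reproduce the 5-fold setup used for book and movie reviews, with `k=3` for the smaller music-review set.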

  17. Confusion : Book Reviews

  18. Confusion : Movie

  19. Confusion : Music

  20. Rating Classification • Five-class classification • 1 star vs. 2 stars vs. 3 stars vs. 4 stars vs. 5 stars • Binary group classification • 1 star + 2 stars vs. 4 stars + 5 stars • Ad extremis classification • 1 star vs. 5 stars • 5-fold random cross validation for all experiments
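The three labeling schemes can be expressed as one mapping. This is a sketch; the scheme names are invented labels, and returning `None` marks reviews that would be excluded from that experiment:

```python
def regroup_rating(stars, scheme):
    """Map a 1-5 star rating onto one of the three labeling schemes."""
    if scheme == "five-class":                 # 1 vs. 2 vs. 3 vs. 4 vs. 5
        return stars
    if scheme == "binary":                     # 1+2 stars vs. 4+5 stars
        if stars in (1, 2):
            return "negative"
        if stars in (4, 5):
            return "positive"
        return None                            # 3-star reviews excluded
    if scheme == "extremes":                   # ad extremis: 1 star vs. 5 stars
        return stars if stars in (1, 5) else None
    raise ValueError(f"unknown scheme: {scheme}")
```

The coarser the grouping, the easier the task: the binary and ad extremis setups discard the ambiguous middle ratings that drive most classifier confusion.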

  21. Rating : Book Reviews

  22. Rating : Movie Reviews

  23. Rating : Music Reviews

  24. Confusion : Book Reviews

  25. Confusion : Movie Reviews

  26. Confusion : Music Reviews

  27. Classification of Book and Movie Reviews 1 • Reviews on all available genres • Books : 9 genres; Movies : 11 genres • Reviews on individual, comparable genres

  28. Classification of Book and Movie Reviews 2 • Eliminated words that can directly suggest the categories: • "book", "movie", "fiction", "film", "novel", "actor", "actress", "read", "watch", "scene" • Words that occur frequently in one category but not the other • To make the task harder / avoid oversimplifying • Results suggest a stylistic difference in users’ criticisms of books and movies • 5-fold random cross validation for all experiments
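Filtering out the category-revealing words before training might look like this. A minimal sketch using the word list from the slide; in the actual experiments the tokens were also stemmed, which this illustration ignores:

```python
# Words that directly give away whether a book or a movie is being reviewed
# (list taken from the slide).
REVEALING = {"book", "movie", "fiction", "film", "novel",
             "actor", "actress", "read", "watch", "scene"}

def drop_revealing(tokens, revealing=REVEALING):
    """Remove giveaway words so the classifier must rely on stylistic cues."""
    return [t for t in tokens if t.lower() not in revealing]
```

The same helper with a different word set would cover the fiction vs. non-fiction experiment on slide 33.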

  29. Book vs. Movie Reviews 1

  30. Book vs. Movie Reviews 2

  31. Book vs. Movie Reviews 3

  32. Classification of Fiction and Non-fiction Book Reviews 1

  33. Classification of Fiction and Non-fiction Book Reviews 2 • Eliminated words that can directly suggest the categories: • "fiction", "non", "novel", "character", "plot", and "story" • Words that occur frequently in one category but not the other • To make the task harder / avoid oversimplifying • Results suggest a stylistic difference in users’ criticisms of fiction and non-fiction books • 5-fold random cross validation for all experiments

  34. Fiction vs. Non-fiction Book Reviews

  35. Confusion : Fiction vs. Non-fiction Book Reviews

  36. Conclusions • Customer reviews are an excellent resource for studying humanities materials • Successful experiments: • High classification precision: genres; ratings; book vs. movie reviews; fiction vs. non-fiction book reviews • Reasonable confusions • Text mining techniques can help find important information about the materials being reviewed • Criticism mining: make the ever-growing consumer-generated review resources useful to humanities scholars

  37. Future work • More text mining techniques • decision trees, frequent pattern mining • Other critical texts • blogs, wikis, etc. • Other facets of reviews • “usage” in music reviews • Feature studies • answer the “why” questions

  38. References • Argamon, S., and Levitan, S. (2005). Measuring the Usefulness of Function Words for Authorship Attribution. Proceedings of the 17th Joint International Conference of ACH/ALLC. • Downie, J. S., Unsworth, J., Yu, B., Tcheng, D., Rockwell, G., and Ramsay, S. J. (2005). A Revolutionary Approach to Humanities Computing?: Tools Development and the D2K Data-Mining Framework. Proceedings of the 17th Joint International Conference of ACH/ALLC. • Hu, X., Downie, J. S., West, K., and Ehmann, A. (2005). Mining Music Reviews: Promising Preliminary Results. Proceedings of the Sixth International Conference on Music Information Retrieval (ISMIR). • Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34, 1. • Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2000). Text Genre Detection Using Common Word Frequencies. Proceedings of the 18th International Conference on Computational Linguistics.

  39. THE ANDREW W. MELLON FOUNDATION Questions? IMIRSEL Thank you!
