Discover how to improve precision in data mining from surveys and questionnaires through techniques like clustering, pruning, and handling ordinal/Likert data. Learn from real datasets and explore future work areas.
Mining Rules from Surveys and Questionnaires
Scott Burton and Richard Morris
CS 676 Presentation, 12 April 2011
Surveys and Questionnaires
• Frequently used
• Problems for data mining:
  • Rarity
  • Related and dependent questions
  • Ordinal / Likert scales
Association Rule Mining
• Market basket analysis
• Example: Cookies -> Milk (support/confidence sketch below)
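To ground the example, here is a minimal market-basket sketch in Python; the transactions and item names are hypothetical, and support and confidence are computed directly rather than with a mining library:

```python
# Hypothetical transactions; each row is one "basket" of items.
transactions = [
    {"cookies", "milk", "bread"},
    {"cookies", "milk"},
    {"milk", "bread"},
    {"cookies", "bread"},
    {"cookies", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

# The classic rule from the slide:
print(support({"cookies", "milk"}))       # 0.6
print(confidence({"cookies"}, {"milk"}))  # 0.75
```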
Our Goal: Improve Precision
Standard algorithms/approaches:
• Apriori, MS-Apriori
• Produce too many rules
• Rules are not “interesting” or actionable
• Finding the needle in the haystack
Our goal:
• Improve precision
• How do you measure “interestingness”?
Interestingness Measures
• Mostly based on support or confidence (sketch below)
• Considered about 40 different metrics
• All seemed to favor the wrong types of rules
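For reference, a minimal sketch of three of the most common measures (support, confidence, lift) for a rule A -> C, computed from hypothetical contingency counts; it illustrates how a "true but obvious" rule can still score well on all of them:

```python
def measures(n, n_a, n_c, n_ac):
    """Interestingness measures for a rule A -> C.

    n    : total number of records
    n_a  : records matching the antecedent A
    n_c  : records matching the consequent C
    n_ac : records matching both A and C
    """
    support = n_ac / n
    confidence = n_ac / n_a
    lift = confidence / (n_c / n)   # >1 means A and C co-occur more than chance
    return support, confidence, lift

# Hypothetical counts: a "true but obvious" rule scores high on all three,
# which is why support/confidence-based metrics tend to favor uninteresting rules.
print(measures(n=13000, n_a=9000, n_c=9500, n_ac=8800))
```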
Our Datasets
• Smoking habits of middle school students in Mexico
  • Global Youth Tobacco Survey for the Pan American Health Organization (GYTSPAHO)
  • ~65 questions and ~13,000 responses
• HINTS (Health Information National Trends Survey)
  • hints.cancer.gov
  • 2007 response data: ~475 questions and ~8,000 responses
  • We focused on a subset of ~100 questions
Apriori vs. MS-Apriori
• Apriori results (Figure 1)
• MS-Apriori results (Figure 2)
(per-item minimum support sketch below)
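MS-Apriori's key difference from plain Apriori is a per-item minimum support (MIS), which is what makes rare survey answers reachable. A minimal sketch of the usual MIS assignment, MIS(i) = max(beta * f(i), LS), where f(i) is item i's frequency; beta and LS are tuning parameters and the values below are illustrative:

```python
def mis(item_frequencies, beta=0.5, least_support=0.01):
    """Per-item minimum support as usually defined for MS-Apriori:
    MIS(i) = max(beta * f(i), least_support).
    beta and least_support are tuning parameters; the values here are illustrative.
    """
    return {item: max(beta * f, least_support)
            for item, f in item_frequencies.items()}

# Hypothetical answer frequencies: the rare answer gets a proportionally lower
# threshold, so rules involving it are not pruned away up front.
print(mis({"smokes=yes": 0.08, "smokes=no": 0.92}))
# -> {'smokes=yes': 0.04, 'smokes=no': 0.46}
```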
Related and Dependent Questions
True but worthless rules:
• "Do you smoke = no" -> "Did you smoke last week = no"
Our approach (pruning sketch below):
• Cluster similar questions
• Remove any intra-cluster rules
[diagram of question clusters]
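A minimal sketch of the intra-cluster pruning step; the question names, cluster labels, and rules are hypothetical:

```python
def prune_intra_cluster_rules(rules, cluster_of):
    """Drop rules whose items all come from a single question cluster.

    rules      : iterable of (antecedent, consequent) pairs, where each side
                 is a set of question identifiers
    cluster_of : dict mapping a question identifier to its cluster label
    """
    kept = []
    for antecedent, consequent in rules:
        clusters = {cluster_of[q] for q in antecedent | consequent}
        if len(clusters) > 1:        # rule spans more than one cluster: keep it
            kept.append((antecedent, consequent))
    return kept

# Hypothetical example: "smoke" and "smoked_last_week" fall in the same cluster,
# so the trivial rule between them is pruned.
cluster_of = {"smoke": 0, "smoked_last_week": 0, "parents_smoke": 1}
rules = [({"smoke"}, {"smoked_last_week"}),
         ({"parents_smoke"}, {"smoke"})]
print(prune_intra_cluster_rules(rules, cluster_of))
# -> [({'parents_smoke'}, {'smoke'})]
```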
Creating Clusters
• Distance metrics:
  • Bi-conditional prediction (one possible sketch below)
  • Attribute vs. attribute-value pair
• Involving the subject matter expert
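The slides do not spell out the distance metric, so the following is only one plausible reading of "bi-conditional prediction": predict each question's answer from the other question's majority answer, average the two accuracies, and turn that into a distance. All question names and responses are hypothetical:

```python
from collections import Counter, defaultdict

def prediction_accuracy(xs, ys):
    """Accuracy of predicting ys from xs using the majority answer per x value."""
    by_x = defaultdict(list)
    for x, y in zip(xs, ys):
        by_x[x].append(y)
    correct = sum(Counter(group).most_common(1)[0][1] for group in by_x.values())
    return correct / len(xs)

def question_distance(xs, ys):
    """1 minus the averaged two-way prediction accuracy: 0 means the questions are redundant."""
    return 1.0 - (prediction_accuracy(xs, ys) + prediction_accuracy(ys, xs)) / 2.0

# Hypothetical responses to two closely related questions:
smoke     = ["no", "no", "yes", "no", "yes", "no"]
last_week = ["no", "no", "yes", "no", "no",  "no"]
print(question_distance(smoke, last_week))   # small value -> cluster together
```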
A Sample Clustering of Questions (see handout)
Effects of Cluster Pruning
• MS-Apriori (Figure 2)
• After cluster pruning (Figure 3)
Similar Rules
Abstract viewpoint (pruning sketch below):
• A B -> C D
• A -> C D
• A B -> C
• A B Z -> C D
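One common way to prune such a family is to keep only the most general rule: drop a rule when another rule has a subset of its antecedent and a superset of its consequent. The slides do not specify exactly which representative is kept, so this is only a sketch of that convention:

```python
def prune_redundant_rules(rules):
    """Drop a rule when another rule is at least as general.

    A rule (A2 -> C2) is considered redundant if some other rule (A1 -> C1)
    has A1 <= A2 and C1 >= C2: it needs no more conditions yet concludes at
    least as much.
    """
    kept = []
    for i, (a2, c2) in enumerate(rules):
        dominated = any(
            j != i and a1 <= a2 and c1 >= c2 and (a1, c1) != (a2, c2)  # ignore exact duplicates
            for j, (a1, c1) in enumerate(rules)
        )
        if not dominated:
            kept.append((a2, c2))
    return kept

# The family from the slide, written with single-letter items:
rules = [
    ({"A", "B"}, {"C", "D"}),
    ({"A"},      {"C", "D"}),
    ({"A", "B"}, {"C"}),
    ({"A", "B", "Z"}, {"C", "D"}),
]
print(prune_redundant_rules(rules))   # only ({'A'}, {'C', 'D'}) survives
```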
Effects of Similar Rule Pruning
• After cluster pruning (Figure 3)
• After similar rule pruning (Figure 4)
Ordinal and Likert Data
Two approaches (pre-binning sketch below):
• Pre-process
• Post-process
[example Likert and ordinal scales shown as figures]
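Pre-processing here means collapsing the scale before mining. A minimal sketch of pre-binning a 5-point Likert item into three coarser categories; the grouping (1-2 / 3 / 4-5) is an illustrative choice, not necessarily the one used in the presentation:

```python
# Map a 5-point Likert response onto three coarser bins before mining.
# The grouping (1-2 / 3 / 4-5) is an illustrative choice.
LIKERT_BINS = {
    1: "disagree", 2: "disagree",
    3: "neutral",
    4: "agree",    5: "agree",
}

def prebin(responses):
    """Replace raw Likert codes with their bin labels; leave missing values alone."""
    return [LIKERT_BINS.get(r, r) for r in responses]

print(prebin([1, 2, 3, 5, None]))
# -> ['disagree', 'disagree', 'neutral', 'agree', None]
```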
Effects of Pre-Binning (Figure 5)
Other Examples
• HINTS data (see handout, Figures 6-10)
Conclusions and Future Work
Conclusions:
• Increased precision of “interesting” rules
• More work to be done
Future work:
• Tuning of existing processes
• Handle numerical data
• Handle questions not asked of every respondent
• Handle questions with multiple responses
• Try other record-matching techniques for similar rule pruning