
Ranking Interesting Subgroups


Presentation Transcript


  1. Ranking Interesting Subgroups Stefan Rüping Fraunhofer IAIS stefan.rueping@iais.fraunhofer.de

  2.-4. Motivation (incremental build across three slides) name_score >= 1 & geoscore >= 1 & housing >= 5 → p = 41.6% Income_score >= 5 & name_score >= 5 & housing >= 5 → p = 36.0% Active_housholds >= 3 & queries_per_household >= 1 & housing >= 5 → p = 43.8% Families == 0 & name_score >= 1 & housing == 0 → p = 28.9% Financial_status == 0 & name_score >= 3 & housing <= 5 → p = 66.1% • Applying ranking to complex data: subgroup models • Optimization of data mining models for non-expert users

  5. Overview • Introduction to Subgroup Discovery • Interesting Patterns • Ranking Subgroups • Representation • Ranking SVMs • Iterative algorithm • Experiments • Conclusions

  6. Subgroup Discovery • Input • X defined by nominal attributes A1, …, Ad • Data (x1, y1), …, (xn, yn) with yi ∈ {0, 1} • Subgroup language • Propositional formulas Ai1 = vj1 ∧ Ai2 = vj2 ∧ … • For a subgroup S let • g(S) = #{xi ∈ S}/n, p(S) = #{xi ∈ S | yi = 1}/#{xi ∈ S}, p0 = #{yi = 1}/n • q(S) = g(S)^a (p(S) - p0) • Task • Find the k subgroups with highest significance (maximal quality q) [Margin notes: q trades off subgroup size and class probability; a = 0.5 → t-test; subgroup quality = significance of pattern]
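The quality function on this slide is simple to state in code; a minimal sketch (function and parameter names are my own, not the paper's):

```python
# Subgroup quality as defined on this slide: q(S) = g(S)^a * (p(S) - p0).
def subgroup_quality(g, p, p0, a=0.5):
    """g: relative subgroup size, p: class probability inside the subgroup,
    p0: overall class probability, a: size/precision trade-off (a = 0.5 ~ t-test)."""
    return (g ** a) * (p - p0)

# Slide 8's example S1 (Weather = good): g = 4/8, p = 4/4, p0 = 5/8
print(round(subgroup_quality(4/8, 4/4, 5/8), 3))  # 0.265
```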

  7. Subgroup Discovery: Example

  8.-10. Subgroup Discovery: Example (incremental build) S1: Weather = good → sales = high: g(S) = 4/8, p(S) = 4/4, q(S) = (4/8)^0.5 · (4/4 - 5/8) = 0.265 S2: Advertised = yes → sales = high: g(S) = 2/8, p(S) = 2/2, q(S) = (2/8)^0.5 · (2/2 - 5/8) = 0.187 ⇒ Significance ≠ Interestingness

  11. Interesting Patterns What makes a pattern interesting to the user? This depends on prior knowledge, but heuristics exist: • Attributes • Actionability • Acquaintedness • Sub-space • Novelty • Complexity • Not too complex • Not too simple

  12. Overview: Ranking Interesting Subgroups Process loop: Data → Subgroup Discovery → Subgroup Representation → Ranking SVM (learned from user feedback "S1 > S2") → Task Modification → back to Subgroup Discovery

  13. Subgroup Representation (1/3) • Subgroups become examples for the ranking learner! • Notation • Ai = original attribute • r(S) = representation of subgroup S • Remember the important properties of subgroups: attributes, examples, complexity • Representing complexity • r(S) includes g(S) and p(S) - p0

  14. Subgroup Representation (2/3) Representing attributes • For each attribute Ai of the original examples, include a corresponding attribute in the subgroup representation • Observation: a TF/IDF-like representation performs even better
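The exact TF/IDF-like formula appears only as an image in the original slides; a purely hypothetical sketch of what such an encoding could look like (the IDF-style weighting is my assumption, not the paper's):

```python
import math

# Hypothetical sketch: for each original attribute Ai, the subgroup
# representation gets a feature that is nonzero iff Ai occurs in the
# subgroup's description, downweighted by how often Ai occurs across
# all discovered subgroups (analogous to IDF over documents).
def attribute_representation(subgroup_attrs, all_subgroups):
    n = len(all_subgroups)
    rep = {}
    for a in subgroup_attrs:
        df = sum(1 for s in all_subgroups if a in s)  # "document frequency"
        rep[a] = math.log((1 + n) / (1 + df))         # smoothed IDF-style weight
    return rep
```

An attribute that appears in every discovered subgroup (like a ubiquitous `housing` condition) gets weight 0, so it no longer dominates the similarity between subgroups.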

  15. Subgroup Representation (3/3) Representing examples • The user may be more interested in a subset of the examples • Construct a list of known relevant and irrelevant subgroups from user feedback • For each subgroup S and each known relevant/irrelevant subgroup T, define the relatedness of S to T
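The relatedness measure itself is not in the transcript (it was an image); one natural, purely illustrative choice is the Jaccard overlap of the example sets the two subgroups cover:

```python
# Illustrative stand-in for the slide's relatedness measure (assumption,
# not the paper's definition): Jaccard overlap of covered example sets.
def relatedness(covered_S, covered_T):
    covered_S, covered_T = set(covered_S), set(covered_T)
    if not (covered_S | covered_T):
        return 0.0
    return len(covered_S & covered_T) / len(covered_S | covered_T)
```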

  16. Ranking Optimization Problem • Rationale • Subgroup discovery gives quality q(S) = g(S)^a (p(S) - p0) • The user defines a ranking by pairs "S1 > S2" (S1 is better than S2) • Find the true ranking q* such that S1 > S2 ⇔ q*(S1) > q*(S2) • Assumption (justified by assuming hidden interestingness labels on the examples) • Define a linear ranking function log q*(S) = (a, 1, w) · r(S)

  17.-19. Ranking Optimization Problem (2/2) (incremental build) • Solution similar to a ranking SVM • Optimization problem, with two annotations: one term penalizes the deviation from the parameter a0 used in subgroup discovery, and the constant weight for g(S) defines the margin • Equivalent problem in terms of z = r(Si,1) - r(Si,2) • Remember: log q*(S) = (a, 1, w) · r(S)
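The two optimization problems themselves appear only as images in the original slides. Given the annotations (a penalty for deviating from a0, and the fixed weight 1 on the g(S) component defining the margin) and the standard ranking-SVM template, a plausible but unverified reconstruction is:

```latex
% Hypothetical reconstruction of the slide's ranking-SVM problem:
\min_{a,\,w,\,\xi}\ (a - a_0)^2 + \lVert w\rVert^2 + C\sum_i \xi_i
\quad\text{s.t.}\quad (a, 1, w)\cdot\big(r(S_{i,1}) - r(S_{i,2})\big) \ge 1 - \xi_i,\qquad \xi_i \ge 0
% Equivalent unconstrained (hinge-loss) form with z_i = r(S_{i,1}) - r(S_{i,2}):
\min_{a,\,w}\ (a - a_0)^2 + \lVert w\rVert^2 + C\sum_i \max\big(0,\ 1 - (a, 1, w)\cdot z_i\big)
```

Because the middle weight is fixed to 1 rather than optimized, the scale of the ranking function is pinned down, which is what lets that component play the role of the margin.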

  20. Iterative Procedure • Why? • Google indexes ~10^12 web pages • A 12-dimensional data set with 9 distinct values per attribute has the same number of possible subgroups • ⇒ cannot compute all subgroups for single-step ranking • Approach • The optimization problem gives a new estimate of a • Transform the weights of the subgroup features into weights for the original examples • Idea: replace the binary y with a numeric value; an appropriate offset guarantees that the subgroup quality q approximates the optimized q*
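The loop described above can be sketched as follows; all four callables are placeholders for components the slides only name (the subgroup miner, the user-feedback step, the ranking SVM, and the label reweighting), not a real API:

```python
# Hedged skeleton of the iterative procedure, under assumed interfaces:
#   discover(data, y, a)      -> top-k subgroups for the current quality q
#   get_user_pairs(subgroups) -> user feedback pairs "S1 > S2"
#   solve_ranking(pairs, a)   -> new (a, w) from the ranking SVM
#   reweight(y, w)            -> numeric labels folding w back into the examples
def iterative_ranking(data, y, a0, rounds,
                      discover, get_user_pairs, solve_ranking, reweight):
    a, y_num, subgroups = a0, list(y), []
    for _ in range(rounds):
        subgroups = discover(data, y_num, a)   # mine with current parameters
        pairs = get_user_pairs(subgroups)      # collect pairwise preferences
        a, w = solve_ranking(pairs, a)         # update (a, w) via ranking SVM
        y_num = reweight(y_num, w)             # replace binary y with numeric values
    return a, subgroups
```

This avoids enumerating all ~10^12 candidate subgroups up front: each round only ranks the subgroups the miner actually returned.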

  21. Experiments • Simulation on UCI data • Replace the true label with its most correlated attribute (the proxy label) • Use the true label to simulate the user • Measure the correspondence of the algorithm's ranking with the subgroups found on the true label • This tests the ability of the approach to flexibly adapt to correlated patterns • Performance measures • Area under the curve: retrieval of the true top-100 subgroups • Kendall's τ: internal consistency of the returned ranking
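Kendall's τ, the consistency measure named above, counts concordant minus discordant pairs between two rankings; a minimal implementation (the simple τ-a variant, without tie correction):

```python
# Kendall's tau (tau-a): +1 for identical rankings, -1 for reversed ones.
def kendall_tau(r1, r2):
    n = len(r1)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (r1[i] - r1[j]) * (r2[i] - r2[j])
            if s > 0:
                concordant += 1   # pair ordered the same way in both rankings
            elif s < 0:
                discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
```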

  22. Results • A Wilcoxon signed-rank test confirms significance • The 3 data sets with minimal AUC are exactly the ones with minimal correlation between true and proxy label!

  23. Conclusions • An example of ranking on complex, knowledge-rich data • The interestingness of subgroup patterns can be significantly increased with an interactive, ranking-based method • A step toward automating machine learning for end users • Future work: • Validation with real users • An active learning approach
