
Mining Data Semantics (MDS'2011) Workshop

Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD 2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, joao.botelho@ist.utl.pt | Cláudia Antunes, claudia.antunes@ist.utl.pt



Presentation Transcript


  1. Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD 2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, joao.botelho@ist.utl.pt | Cláudia Antunes, claudia.antunes@ist.utl.pt

  2. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  3. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  4. FRAUD DETECTION IN TAX PAYMENTS • Fraud in Tax Payments • Improper tax payments due to fraud, waste and abuse; • Involves millions of possible fraud targets; • Need for effective tools to prevent fraud, or at least to identify it in time;

  5. CHALLENGES ON FRAUD DETECTION

  6. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  7. S2C+SNA METHODOLOGY

  8. WHY SEMI-SUPERVISED CLUSTERING?

  9. WHY SOCIAL NETWORKS?

  10. DATA PREPARATION > DATASET This methodology assumes the existence of two datasets: - a dataset with labeled and unlabeled instances; - social network data (describing interactions between these instances);

  11. DATA PREPARATION > SNOWBALL SAMPLING • In order to discard irrelevant components of the social network and save computational resources, the target population can be reached using snowball sampling.
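The snowball step can be sketched as a bounded breadth-first expansion from a set of seed members, keeping only nodes reached within a fixed number of waves. This is a minimal illustration, not the paper's implementation; the adjacency-dict representation and the `waves` parameter are assumptions:

```python
from collections import deque

def snowball_sample(graph, seeds, waves=2):
    """Breadth-first snowball sample: start from the seed nodes and
    expand through their neighbours for a fixed number of waves,
    discarding everything outside the sampled component."""
    sampled = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == waves:          # stop expanding past the last wave
            continue
        for neigh in graph.get(node, ()):
            if neigh not in sampled:
                sampled.add(neigh)
                frontier.append((neigh, depth + 1))
    return sampled
```

Starting from a single known-fraud seed, two waves reach its direct and second-degree contacts while the rest of the network is dropped.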

  12. DATA PREPARATION > BAD RANK • Derived from PageRank and HITS • Used by Google to detect web spam • BadRank allows us to estimate the risk associated with a member by analyzing their links to other “bad” members.

  13. DATA PREPARATION > BAD RANK (DEMO)

  14. DATA PREPARATION > BAD RANK • The application of BadRank results in a new attribute that enriches the entity description used in the classification process.
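The idea above can be sketched as an inverse-PageRank propagation: whereas PageRank pushes authority along outgoing links, a BadRank-style score flows backwards, so a member who links to known "bad" members inherits part of their risk. This is a simplified sketch, not the exact formula used in the paper; the damping factor, iteration count and seed encoding are assumptions:

```python
def bad_rank(graph, seed_bad, damping=0.85, iters=50):
    """Simplified BadRank-style score: a member's risk is a mix of its
    own seed badness and the badness of the members it links to."""
    # in-degree of each node: a node's badness is split among the
    # nodes that point to it (mirroring PageRank's out-degree split)
    in_deg = {n: 0 for n in graph}
    for outs in graph.values():
        for m in outs:
            in_deg[m] = in_deg.get(m, 0) + 1
    score = {n: seed_bad.get(n, 0.0) for n in graph}
    for _ in range(iters):
        new = {}
        for n, outs in graph.items():
            # badness gathered from the members n links to
            spread = sum(score[m] / max(in_deg.get(m, 1), 1)
                         for m in outs if m in score)
            new[n] = (1 - damping) * seed_bad.get(n, 0.0) + damping * spread
        score = new
    return score
```

The resulting score per member is exactly the kind of derived attribute the slide describes: it can be appended to each entity's feature vector before clustering.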

  15. MODELING > SEMI-SUPERVISED CLUSTERING • The most common semi-supervised algorithms studied in this paper are modifications of the (unsupervised) K-Means algorithm that incorporate domain knowledge. • Typically, this knowledge can be incorporated: • when the initial centroids are chosen (by seeding) • Seeded-KMeans • Constrained-KMeans • in the form of constraints that have to be satisfied when grouping similar objects (constrained algorithms). • PCK-Means • MPCK-Means
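The seeding variant can be illustrated compactly: instead of picking random initial centroids, Seeded-KMeans places each initial centroid at the mean of the labeled seed points of that class, then runs the usual assign/update loop. A minimal sketch (plain squared-Euclidean K-Means, not the paper's implementation; function and parameter names are assumptions):

```python
def seeded_kmeans(points, labeled, k, iters=20):
    """Seeded K-Means: initial centroids are the means of the labeled
    seed points of each class, rather than random points.
    `labeled` maps cluster index -> list of seed points."""
    centroids = [
        [sum(xs) / len(xs) for xs in zip(*labeled[c])] for c in range(k)
    ]
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # update step: recompute each centroid from its assigned points
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(xs) / len(xs) for xs in zip(*members)]
    return assign, centroids
```

Constrained-KMeans differs only in the assignment step: the labeled points are never allowed to leave their seed cluster.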

  16. MODELING > SEMI-SUPERVISED CLUSTERING

  17. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  18. CASE STUDY • Dataset: Fraud in Tax Payments; • Since the experiments in this work focus only on the problem of detecting fraud with small fractions of labeled data, a balanced dataset with an equal number of fraud and non-fraud instances was extracted. • 3000 instances; • 50% Fraud; 50% Non-Fraud;

  19. EXPERIMENTS SETUP • All experiments were conducted by randomly selecting 10 different sets of pre-labeled instances for each algorithm and for different fractions of incorporated labeled instances. • The results presented next report the best, worst and average accuracy obtained on these datasets.

  20. CLUSTERING RESULTS WITH AND WITHOUT BADRANK ATTRIBUTE

  21. BEST AND WORST RESULTS WITHOUT BADRANK

  22. BEST AND WORST RESULTS WITH BADRANK

  23. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  24. CONCLUSIONS • With only a small fraction of labeled instances, all the semi-supervised algorithms obtain a significant improvement over unsupervised clustering (K-Means). • Constrained K-Means has the best performance among the semi-supervised algorithms. • Semi-supervised clustering performs better when data is enriched with social network analysis. • With BadRank, the results show significant improvements in all experiments once at least 15% of labeled instances are used.

  25. CONCLUSIONS • This methodology can also be applied to other areas: • where supervised information is very difficult to obtain • where Social Network Analysis can provide important information about human entities, making visible patterns, linkages and connections that could not be discovered using only static (transactional) data. • Churn detection is a good candidate for this methodology.

  26. THE END QUESTIONS?
