
Mining Data Semantics (MDS'2011) Workshop

Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD 2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, joao.botelho@ist.utl.pt | Cláudia Antunes, claudia.antunes@ist.utl.pt



Presentation Transcript


  1. Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD 2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, joao.botelho@ist.utl.pt | Cláudia Antunes, claudia.antunes@ist.utl.pt

  2. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  3. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  4. FRAUD DETECTION IN TAX PAYMENTS • Fraud in Tax Payments • Improper tax payments due to fraud, waste and abuse; • Involves millions of possible fraud targets; • Need for effective tools to prevent fraud, or at least to identify it in time;

  5. CHALLENGES ON FRAUD DETECTION

  6. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  7. S2C+SNA METHODOLOGY

  8. WHY SEMI-SUPERVISED CLUSTERING?

  9. WHY SOCIAL NETWORKS?

  10. DATA PREPARATION > DATASET This methodology assumes the existence of two datasets: - a dataset with labeled and unlabeled instances; - social network data (describing interactions between these instances);

  11. DATA PREPARATION > SNOWBALL SAMPLING • In order to discard irrelevant components of the social network and save computational resources, the target population can be reached using snowball sampling.
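The snowball step can be sketched as a bounded breadth-first expansion from a set of seed members, keeping only nodes reached within a fixed number of waves. This is a minimal illustration, not the paper's implementation; the adjacency-dict representation and the `waves` parameter are assumptions:

```python
from collections import deque

def snowball_sample(graph, seeds, waves=2):
    """Breadth-first snowball sample: start from the seed nodes and
    expand through their neighbours for a fixed number of waves,
    discarding everything outside the sampled component."""
    sampled = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == waves:          # stop expanding past the last wave
            continue
        for neigh in graph.get(node, ()):
            if neigh not in sampled:
                sampled.add(neigh)
                frontier.append((neigh, depth + 1))
    return sampled
```

Starting from a single known-fraud seed, two waves reach its direct and second-degree contacts while the rest of the network is dropped.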

  12. DATA PREPARATION > BAD RANK • Derived from PageRank and HITS • Used by Google to detect web spam • BadRank allows us to estimate the risk associated with a member by analyzing their links to other “bad” members.

  13. DATA PREPARATION > BAD RANK (DEMO)

  14. DATA PREPARATION > BAD RANK • The application of BadRank results in a new attribute that enriches the entity description used in the classification process.
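The idea above can be sketched as an inverse-PageRank propagation: whereas PageRank pushes authority along outgoing links, a BadRank-style score flows backwards, so a member who links to known "bad" members inherits part of their risk. This is a simplified sketch, not the exact formula used in the paper; the damping factor, iteration count and seed encoding are assumptions:

```python
def bad_rank(graph, seed_bad, damping=0.85, iters=50):
    """Simplified BadRank-style score: a member's risk is a mix of its
    own seed badness and the badness of the members it links to."""
    # in-degree of each node: a node's badness is split among the
    # nodes that point to it (mirroring PageRank's out-degree split)
    in_deg = {n: 0 for n in graph}
    for outs in graph.values():
        for m in outs:
            in_deg[m] = in_deg.get(m, 0) + 1
    score = {n: seed_bad.get(n, 0.0) for n in graph}
    for _ in range(iters):
        new = {}
        for n, outs in graph.items():
            # badness gathered from the members n links to
            spread = sum(score[m] / max(in_deg.get(m, 1), 1)
                         for m in outs if m in score)
            new[n] = (1 - damping) * seed_bad.get(n, 0.0) + damping * spread
        score = new
    return score
```

The resulting score per member is exactly the kind of derived attribute the slide describes: it can be appended to each entity's feature vector before clustering.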

  15. MODELING > SEMI-SUPERVISED CLUSTERING • The most common semi-supervised algorithms studied in this paper are modifications of the (unsupervised) K-Means algorithm that incorporate domain knowledge. • Typically, this knowledge can be incorporated: • when the initial centroids are chosen (by seeding) • Seeded-KMeans • Constrained-KMeans • in the form of constraints that have to be satisfied when grouping similar objects (constrained algorithms). • PCK-Means • MPCK-Means
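The seeding variant can be illustrated compactly: instead of picking random initial centroids, Seeded-KMeans places each initial centroid at the mean of the labeled seed points of that class, then runs the usual assign/update loop. A minimal sketch (plain squared-Euclidean K-Means, not the paper's implementation; function and parameter names are assumptions):

```python
def seeded_kmeans(points, labeled, k, iters=20):
    """Seeded K-Means: initial centroids are the means of the labeled
    seed points of each class, rather than random points.
    `labeled` maps cluster index -> list of seed points."""
    centroids = [
        [sum(xs) / len(xs) for xs in zip(*labeled[c])] for c in range(k)
    ]
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # update step: recompute each centroid from its assigned points
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(xs) / len(xs) for xs in zip(*members)]
    return assign, centroids
```

Constrained-KMeans differs only in the assignment step: the labeled points are never allowed to leave their seed cluster.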

  16. MODELING > SEMI-SUPERVISED CLUSTERING

  17. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  18. CASE STUDY • Dataset: Fraud in Tax Payments; • Since the experiments in this work focus only on the problem of detecting fraud with small fractions of labeled data, a balanced dataset with an equal number of fraud and non-fraud instances was extracted. • 3000 instances; • 50% Fraud; 50% Non-Fraud;

  19. EXPERIMENTS SETUP • All experiments were conducted by randomly selecting 10 different sets of pre-labeled instances for each algorithm and for different fractions of incorporated labeled instances. • The results presented next report the best, worst and average accuracy obtained on these datasets.

  20. CLUSTERING RESULTS WITH AND WITHOUT BADRANK ATTRIBUTE

  21. BEST AND WORST RESULTS WITHOUT BADRANK

  22. BEST AND WORST RESULTS WITH BADRANK

  23. CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions

  24. CONCLUSIONS • With only a small fraction of labeled instances, all the semi-supervised algorithms obtain a significant improvement over unsupervised clustering (K-Means). • Constrained K-Means has the best performance among the semi-supervised algorithms. • Semi-supervised clustering performs better when data is enriched with social network analysis. • With BadRank, the results show significant improvements in all experiments once at least 15% of labeled instances are used.

  25. CONCLUSIONS • This methodology can also be applied to other areas: • where supervised information is very difficult to obtain • where Social Network Analysis can provide important information about human entities, making visible patterns, linkages and connections that could not be discovered using only static (transactional) data. • Churn detection is a good candidate for this methodology.

  26. THE END QUESTIONS?
