
Data Mining at ATO 2004 Canberra


Presentation Transcript


  1. Data Mining at ATO 2004 Canberra Warwick Graco Analytics Project Change Program ATO

  2. Outline • Some key Themes • ATO at a glance

  3. Data Mining in Government • White-Collar Crime – Dollar Figures Quoted are in the Stratosphere • Sophisticated Frauds and Internal Fraud • Slowness of Regulators

  4. Roles of Data Miner • Role of Analytics • Cost-effectiveness versus Precision with Detection • Medical versus Engineering Model • Inoculate your system against Security and Integrity Attacks • Use of agents for this purpose • Each agent designed to detect a specific breach

  5. Some New Points of View • Fraud is found at the edge or boundary of pockets of activity rather than appearing as outliers • [Diagram contrasting outliers with boundary cases]

  6. Some New Points of View • False Negatives • Perturbing Classifications • Determine the effects that different proportions of perturbed classifications have on hit rates and, by inference, on miss rates • This is potentially a method for estimating the incidence of fraud and abuse in society eg • Size of Black Economy • Amount lost to Health Fraud and Abuse • Amount lost to Social Security Fraud and Abuse
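The perturbation idea above can be sketched in stdlib Python: flip a chosen proportion of case labels and observe how the measured hit rate moves. The data and function names here are illustrative, not the ATO's actual method.

```python
import random

def hit_rate(labels, flagged):
    """Proportion of flagged cases that are labelled aberrant."""
    hits = sum(1 for l, f in zip(labels, flagged) if f and l)
    total_flagged = sum(flagged)
    return hits / total_flagged if total_flagged else 0.0

def perturbed_hit_rates(labels, flagged, proportions, seed=0):
    """Flip a given proportion of labels and measure the effect on hit rate."""
    rng = random.Random(seed)
    results = {}
    for p in proportions:
        perturbed = list(labels)
        n_flip = int(p * len(labels))
        for i in rng.sample(range(len(labels)), n_flip):
            perturbed[i] = not perturbed[i]
        results[p] = hit_rate(perturbed, flagged)
    return results
```

Comparing the hit-rate curve against perturbation proportion is what would, by inference, bound the miss rate.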

  7. Taylor-Russell Table

                        Flagged             Not Flagged
     Aberrant Cases     True Positives      False Negatives
     Acceptable Cases   False Positives     True Negatives

     (The horizontal baseline separates aberrant from acceptable cases; the vertical cutoff is the threshold used by the classifier)
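The four quadrants of the Taylor-Russell table are a confusion matrix at a chosen cutoff; a minimal sketch (the function and variable names are illustrative):

```python
def taylor_russell(scores, aberrant, cutoff):
    """Split cases into the four Taylor-Russell quadrants.

    scores   : risk score per case (higher = riskier)
    aberrant : True where the case sits above the baseline (truly aberrant)
    cutoff   : classifier threshold; cases scoring >= cutoff are flagged
    """
    tp = fp = fn = tn = 0
    for s, a in zip(scores, aberrant):
        flagged = s >= cutoff
        if a and flagged:
            tp += 1          # aberrant and caught
        elif a:
            fn += 1          # aberrant but missed
        elif flagged:
            fp += 1          # acceptable but flagged
        else:
            tn += 1          # acceptable and passed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "sensitivity": sensitivity, "specificity": specificity}
```

Moving the cutoff trades false positives against false negatives, which is the sensitivity/specificity trade-off discussed later.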

  8. What Tools are in Stock • Diverse Applications of Data Mining including • Hot Spots Methodology • Tree Stumps • Control Charts • Detection of Outliers • Hardware and Software Developments with Data Mining
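A control chart, one of the tools listed, can be implemented as a simple Shewhart chart that flags points outside the mean ± 3 standard deviations; a stdlib sketch, not the ATO's implementation:

```python
from statistics import mean, stdev

def control_chart_outliers(values, n_sigma=3.0):
    """Flag points outside the Shewhart control limits (mean +/- n_sigma * sd)."""
    centre = mean(values)
    sd = stdev(values)
    upper, lower = centre + n_sigma * sd, centre - n_sigma * sd
    return [i for i, v in enumerate(values) if v > upper or v < lower]
```

Applied to, say, weekly claim totals per provider, the indices returned are the periods worth a closer look.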

  9. Keynote Speakers • Professor Han • Covered Classical Mining and Modelling Methods • Covered Some New Developments eg web mining, stream mining and bioinformatics mining • One example he used was how data mining can be used in software engineering to debug programs

  10. Keynote Speakers • Usama Fayyad • Covered the Lessons Learned from Applying Data Mining in practice • Emphasised the importance of consulting experts and incorporating domain knowledge in the mining process • Discoveries from mining have to be related to experts' understanding and interpretation of issues • Have to both mine data and model processes to obtain ideal results eg reselling

  11. Keynote Speakers • Usama Fayyad • Emphasised the importance of presenting results in a way that managers understand and relate to eg lifetime value of employees versus churn rates • Covered many of the technical challenges facing the field eg complexity, scalability, validation and the need for a firm theoretical foundation

  12. Outline • Efficacy of Models • Recently Developed Models • Local versus Global Models • Features • Identifying Discriminatory Features • Sharing the Magic Few Features • Estimating Miss Rates and Showing Cost/Benefits

  13. Outline • Analytics – achieving synergies • Regulatory Work – need for proactive approach • Mapping the Detection Process • Static versus Dynamic Aspects • Capturing Expertise • Embedding Knowledge

  14. Efficacy of Models • Symbolic • Random Forests • Tree Stumps • MART • Statistical • MARS • Weighted KNN • Biological • Boosted ANNs • GA/ANN

  15. Efficacy of Models • There is hype about the versatility and effectiveness of many of these models • We need to clarify scientifically how well they perform and in what circumstances • They need to be tested in a variety of domains

  16. Local versus Global Models • Comparisons between those which are narrow and specific and those which are broad and general in focus • The former are important with transaction fraud and abuse and the latter with client profiling • We need to establish how each contributes to detection and how they can work in tandem • We also need to test a medical approach to model development against engineered approaches – does the former afford greater protection than the latter against security and integrity attacks?

  17. Discriminatory Features • Feature Selection versus Classification Trade Off with Identifying features • If you have the luxury of many classified cases that represent the important trends in the data, you can use a supervised approach to identify discriminatory features • Examples include filter and wrapper approaches
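A filter approach can be as simple as ranking features by how widely they separate the class means relative to their spread; a stdlib sketch (the function and data names are hypothetical):

```python
from statistics import mean, stdev

def filter_rank(features, labels):
    """Rank features by absolute standardised mean difference between classes.

    features : dict of feature name -> list of values (one per case)
    labels   : list of bools (True = aberrant case)
    """
    scores = {}
    for name, values in features.items():
        pos = [v for v, l in zip(values, labels) if l]
        neg = [v for v, l in zip(values, labels) if not l]
        sd = stdev(values)
        scores[name] = abs(mean(pos) - mean(neg)) / sd if sd else 0.0
    return sorted(scores, key=scores.get, reverse=True)
```

A wrapper approach would instead retrain the classifier on candidate feature subsets and keep the subset with the best validated performance; it is costlier but accounts for feature interactions.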

  18. Discriminatory Features • If you do not have this luxury, one option is to use an unsupervised approach to identify discriminatory features • Examples include taxonomic and clustering methods and anomaly detection • We need to do comparisons to see which methods work best and in what circumstances
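A minimal unsupervised example: score each case by its distance from the centre of the pack, so cases far from the pockets of normal activity surface as anomalies. This is an illustrative stdlib sketch, not a method from the talk:

```python
from statistics import mean, stdev

def anomaly_scores(cases):
    """Score each case by its largest per-feature z-score.

    cases : list of equal-length tuples of numeric feature values
    """
    n_features = len(cases[0])
    centres = [mean(c[j] for c in cases) for j in range(n_features)]
    # A spread of zero (constant feature) is replaced by 1.0 to avoid
    # division by zero; such features contribute a z-score of 0.
    spreads = [stdev(c[j] for c in cases) or 1.0 for j in range(n_features)]
    return [max(abs(c[j] - centres[j]) / spreads[j] for j in range(n_features))
            for c in cases]
```

The features on which the highest-scoring cases deviate are candidate discriminatory features, which can then be checked against expert judgement.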

  19. Magic Few Features • A small number of highly Discriminatory Features account for most high-risk cases • An example is SPP across age and gender groups with GPs • The remainder provide small contributions • Discriminating Features tend to be locked in vaults and not shared across regulatory agencies

  20. Miss Rates with Detection • Perturbation versus Tessellation • Sensitivity versus Specificity Trade-Off • A major challenge is to establish to what degree discovery and detection technology increases the strike rate above the industry benchmark of 1:10 with identification of security and integrity breaches
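The strike-rate comparison can be made concrete: compute hits per case investigated and divide by the 1:10 benchmark. An illustrative sketch:

```python
def strike_rate(flagged, aberrant):
    """Strike rate: aberrant cases found per case investigated (1:10 = 0.1)."""
    investigated = sum(flagged)
    hits = sum(1 for f, a in zip(flagged, aberrant) if f and a)
    return hits / investigated if investigated else 0.0

def lift_over_benchmark(flagged, aberrant, benchmark=0.1):
    """How many times better the classifier does than the industry benchmark."""
    return strike_rate(flagged, aberrant) / benchmark
```

A lift of 5.0, say, would mean the technology finds breaches at five times the 1:10 benchmark rate for the same investigative effort.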

  21. Taylor-Russell Table

                        Flagged             Not Flagged
     Aberrant Cases     True Positives      False Negatives
     Acceptable Cases   False Positives     True Negatives

     (The horizontal baseline separates aberrant from acceptable cases; the vertical cutoff is the threshold used by the classifier)

  22. Cost/Benefits • We also need studies to inform us of the cost benefits of using discovery and detection technology • That is, what cost-benefit ratio can we expect from using this technology compared with conventional compliance measures such as telephone tip-offs, random audits, purpose-based audits etc

  23. Cost-Benefit Approach • [Chart plotting benefits against costs, showing the point of optimal return]
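One simple way to locate the optimal return: work cases in descending order of expected yield and stop once the marginal case no longer covers its audit cost. This is a hypothetical sketch, not the ATO's costing model:

```python
def optimal_audit_volume(expected_yield, unit_cost):
    """Choose how many cases to audit for maximum net return.

    expected_yield : per-case expected recovery, sorted descending
    unit_cost      : fixed cost of auditing one case
    Returns (number of cases to audit, net return at the optimum).
    """
    best_n, best_net, net = 0, 0.0, 0.0
    for n, y in enumerate(expected_yield, start=1):
        net += y - unit_cost       # marginal return of auditing case n
        if net > best_net:
            best_n, best_net = n, net
    return best_n, best_net
```

Because yields are sorted descending, net return rises while each extra audit pays for itself and falls afterwards, which is the peak of the benefits-minus-costs curve on the chart.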

  24. Analytics • Analytics is a new field and embraces a variety of disciplines including intelligence, risk analysis, profiling, data matching, discovery (mining) and detection (modelling) work • Major challenge is achieving integration across these disciplines so that they work together and achieve synergies

  25. Analytics • Intelligence should drive compliance activities in terms of where those who do risk analysis, profiling, matching and discovery and detection focus their attention • A major failing with many organisations is that work done is not based on the results of sound intelligence • The decision making is often ad hoc and arbitrary

  26. Regulatory Work • Telephone and Banking work in real time and have to detect security and integrity breaches as they occur • Payment, Insurance and Revenue Collection Agencies work in past time and seek restitution after the event • Question: how do we get onto the front foot and become proactive rather than reactive with discovery, detection, treatment and prevention?

  27. Static versus Dynamic View • KDD historically has focused on retrospective and prospective score models • These give a static view of compliance issues – ie a picture of practice at a particular point in time • Fraud and abuse are usually not static but dynamic. They tend to change and sometimes change quickly

  28. Static versus Dynamic View • To identify and track these issues effectively requires an investigative approach • Our current models and approaches are not well suited to tracking changes and continuously adapting to new developments • One illustration of what is implied here is to work with and model the steps, procedures and routines that experts use to solve cases

  29. Role of Expertise • Need to develop procedures/methods for capturing the knowledge, skills and strategies experts employ to identify non-compliance or the "smell factor" with cases and to incorporate these as routines and models in our discovery and detection systems • Examples include the expertise used for • Crime Identification • Feature Selection • Classification of Cases

  30. Embedded Knowledge • The expertise captured can be included as metaknowledge and be linked to security and integrity breaches, features and cases • Knowledge needs to be embedded in discovery and detection processes
