Review of Fraud Classification Using Principal Components Analysis of RIDITS

Presentation Transcript


  1. Review of Fraud Classification Using Principal Components Analysis of RIDITS By Louise A. Francis Francis Analytics and Actuarial Data Mining, Inc.

  2. Objectives • Address question: Why use new method, PRIDIT? • Introduce other methods used in similar circumstances • Explain how PRIDIT adds to methods available • Explain limitations of PRIDIT/RIDIT

  3. A Key Problem in Fraud Modeling • Most data mining methods need a target (dependent) variable • Y = a + b1x1 + b2x2 + … + bnxn • Fraud (Yes/No or Fraud Score) = f(predictor variables) • Need a sample of data where claims have been determined to be fraudulent or legitimate

  4. Dependent variable hard to get • In a large sample of automobile insurance claims perhaps 1/3 may have an element of abuse or fraud • Scarce resources are not expended on such large volumes of claims to determine their legitimacy • Only a small percentage are referred to SIU investigators or other investigations • There are time lags in determining the outcome of investigations

  5. Unsupervised learning • Another approach that does not require a dependent variable • Two Key Kinds • Cluster Analysis • Principal Components/Factor Analysis • PRIDIT uses this approach • It is applied to ordered categorical variables

  6. Cluster Analysis • Records are grouped in categories that have similar values on the variables • Examples • Marketing: People with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing • Text analysis: Use words that tend to occur together to classify documents • Note: no dependent variable used in analysis

  7. Clustering • Common methods: k-means, hierarchical • No dependent variable – records are grouped into classes with similar values on the variables • Start with a measure of similarity or dissimilarity • Maximize dissimilarity between members of different clusters
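
The grouping idea on this slide can be sketched with a bare-bones k-means loop. This is only an illustration: it assumes squared Euclidean distance as the dissimilarity measure, and the function names, starting centers, and two-variable records are hypothetical, not taken from the presentation.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two records."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each record to its nearest center, then re-average."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for each record.
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        centers = new_centers
    return labels, centers


# Hypothetical records with two numeric variables each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

No dependent variable appears anywhere in the loop; the labels come only from the similarity of the records to each other.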

  8. Dissimilarity (Distance) Measure – Continuous Variables • Euclidean Distance • Manhattan Distance
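
The formulas for these two measures did not survive in the transcript. Their standard definitions are the square root of the summed squared differences (Euclidean) and the sum of absolute differences (Manhattan). A short sketch, with hypothetical record values:

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Two hypothetical records measured on two continuous variables.
record_a = (3.0, 4.0)
record_b = (0.0, 0.0)

print(euclidean(record_a, record_b))  # 5.0
print(manhattan(record_a, record_b))  # 7.0
```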

  9. Binary Variables

  10. Binary Variables • Simple Matching • Rogers and Tanimoto
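
The slide's formulas are missing from the transcript, so the sketch below uses the usual textbook definitions as an assumption: the simple matching coefficient is the share of variables on which two records agree, and the Rogers and Tanimoto coefficient counts each mismatch twice. Both are similarities; one minus the value gives a dissimilarity. The indicator vectors are hypothetical.

```python
def agreement_counts(x, y):
    """Return (matches, mismatches) between two 0/1 vectors of equal length."""
    matches = sum(1 for i, j in zip(x, y) if i == j)
    mismatches = len(x) - matches
    return matches, mismatches


def simple_matching(x, y):
    """Share of positions where the two records agree (1-1 or 0-0)."""
    matches, mismatches = agreement_counts(x, y)
    return matches / (matches + mismatches)


def rogers_tanimoto(x, y):
    """Like simple matching, but mismatches are weighted double."""
    matches, mismatches = agreement_counts(x, y)
    return matches / (matches + 2 * mismatches)


# Hypothetical binary indicators for two claims (1 = yes, 0 = no).
claim_a = [1, 1, 0, 0, 1, 0]
claim_b = [1, 0, 0, 1, 1, 0]

print(simple_matching(claim_a, claim_b))   # 4 matches out of 6
print(rogers_tanimoto(claim_a, claim_b))   # 4 / (4 + 2 * 2)
```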

  11. Example: Fraud Data • Data from 1993 closed claim study conducted by Automobile Insurers Bureau of Massachusetts • Claim files often have variables which may be useful in assessing suspicion of fraud, but a dependent variable is often not available • Variables used for clustering: • Legal representation • Prior Claim • SIU Investigation • At fault • Police report • Number of providers

  12. Statistics for Clusters • Based on descriptive statistics, Cluster 2 appears to have higher likelihood of fraudulent claims – more about this later

  13. Principal Components Analysis • A form of dimension (variable) reduction • Suppose we want to combine all the information related to the “financial” dimension of fraud • Medical provider bill (indicative of padding claim) • Hospital bill • Number of providers • Economic Losses • Claimed wages • Incurred Losses

  14. Principal Components • These variables are correlated but not perfectly correlated • We replace many variables with a weighted sum of the variables

  15. Correlation Matrix for Variables

  16. Finding Factor or Component • The correlation matrix is used to find the factor that explains the most variance (captures most of the correlation) for the set of variables • That component or factor extracted will be a weighted average of the variables • More than one Component or Factor may result from applying the method

  17. Evaluating Importance of Variables • Use factor loadings
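
The extraction of a first component and its loadings, as described on the preceding slides, can be sketched as follows. This is a minimal illustration, not the presenter's computation: it assumes the component is the dominant eigenvector of the correlation matrix, finds it by power iteration, and uses hypothetical values for three "financial" variables.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [x - mean for x in col]
        length = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / length for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]


def first_component(matrix, iterations=500):
    """Dominant eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    size = len(matrix)
    w = [1.0] * size
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    # Rayleigh quotient w'Rw gives the eigenvalue for a unit-length w.
    eigenvalue = sum(
        w[i] * matrix[i][j] * w[j] for i in range(size) for j in range(size)
    )
    return w, eigenvalue


# Hypothetical values for five claims (not from the AIB data).
medical_bill = [1.0, 2.0, 3.0, 4.0, 5.0]
hospital_bill = [2.0, 1.0, 4.0, 3.0, 6.0]
claimed_wages = [1.0, 3.0, 2.0, 5.0, 4.0]

corr = correlation_matrix([medical_bill, hospital_bill, claimed_wages])
weights, eigenvalue = first_component(corr)

# Loading = correlation between a variable and the component.
loadings = [math.sqrt(eigenvalue) * w for w in weights]
share_explained = eigenvalue / len(corr)
print(loadings, share_explained)
```

Variables with larger loadings contribute more to the component, which is how the loadings are used to judge importance.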

  18. Problem: Categorical Variables • It is not clear how best to perform Principal Components/Factor Analysis on categorical variables • The categories may be coded as a series of binary dummy variables • If the categories are ordered, dummy coding may lose important information about that order • This is the problem that PRIDIT addresses

  19. RIDIT • Variables are ordered so that lowest value is associated with highest probability of fraud • Use Cumulative distribution of claims at each value, i, to create RIDIT statistic for claim t, value i
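
The slide's formula is not in the transcript. One form commonly given in the PRIDIT literature, used here as an assumption, scores category i of a variable as the proportion of claims in lower categories minus the proportion in higher categories. The sketch below applies that form to hypothetical counts; the function name and data are illustrative.

```python
from collections import Counter


def ridit_scores(values, ordered_categories):
    """Map each category to (share of claims below it) - (share above it).

    ordered_categories lists the categories from most to least suspicious,
    following the slide's convention that the lowest value signals fraud.
    """
    n = len(values)
    counts = Counter(values)
    scores = {}
    below = 0.0
    for category in ordered_categories:
        share = counts.get(category, 0) / n
        above = 1.0 - below - share
        scores[category] = below - above
        below += share
    return scores


# Hypothetical data: 3 of 10 claims have legal representation.
legal_representation = ["yes"] * 3 + ["no"] * 7
scores = ridit_scores(legal_representation, ["yes", "no"])
print(scores)  # "yes" scores about -0.7, "no" about 0.3
```

Under this form the scores lie between -1 and 1, and the average score across all claims is zero, so rarer suspicious responses receive scores further from zero.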

  20. Example: RIDIT for Legal Representation

  21. PRIDIT • Use RIDIT statistics in Principal Components Analysis
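
A rough end-to-end sketch of the idea: convert each ordered categorical variable to RIDIT scores, take the first principal component of the scored matrix, and use the resulting weighted sum as the claim score. Everything specific here is an assumption for illustration: the RIDIT form (share below minus share above), the use of power iteration on the cross-product matrix, the category orderings, and the eight hypothetical claims.

```python
import math
from collections import Counter


def ridit_map(values, ordered_categories):
    """Category -> (share of claims below) - (share above), most suspicious first."""
    n = len(values)
    counts = Counter(values)
    scores, below = {}, 0.0
    for category in ordered_categories:
        share = counts.get(category, 0) / n
        scores[category] = below - (1.0 - below - share)
        below += share
    return scores


def dominant_weights(rows, iterations=200):
    """Unit-length dominant eigenvector of rows' * rows via power iteration."""
    size = len(rows[0])
    cross = [
        [sum(r[a] * r[b] for r in rows) for b in range(size)] for a in range(size)
    ]
    w = [1.0] * size
    for _ in range(iterations):
        v = [sum(cross[a][b] * w[b] for b in range(size)) for a in range(size)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    return w


# Hypothetical claims: (legal representation, police report, prior claim).
claims = [
    ("yes", "no", "yes"),
    ("yes", "no", "no"),
    ("yes", "yes", "yes"),
    ("no", "yes", "no"),
    ("no", "yes", "no"),
    ("no", "yes", "no"),
    ("no", "no", "no"),
    ("no", "yes", "yes"),
]
# Assumed orderings, most suspicious category first.
category_orders = [["yes", "no"], ["no", "yes"], ["yes", "no"]]

columns = list(zip(*claims))
maps = [ridit_map(col, order) for col, order in zip(columns, category_orders)]
ridit_rows = [[m[v] for m, v in zip(maps, claim)] for claim in claims]

weights = dominant_weights(ridit_rows)
if sum(weights) < 0:  # eigenvector sign is arbitrary; orient it consistently
    weights = [-w for w in weights]

scores = [sum(w * f for w, f in zip(weights, row)) for row in ridit_rows]
ranking = sorted(range(len(claims)), key=lambda i: scores[i])  # lowest first
print(ranking)
```

With suspicious categories scored negatively, the lowest scores mark the claims that would be reviewed first.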

  22. Scoring • Assign a score to each claim • The score can be used to sort claims • More effort expended on claims more likely to be fraudulent or abusive • In the case of AIB data, we can use additional information to test how well PRIDIT did, using the PRIDIT score • A suspicion score was assigned to each claim by an expert

  23. PRIDIT vs. Suspicion Score

  24. Clustering and Suspicion Score

  25. Result • There appears to be a strong relationship between PRIDIT score and suspicion that claim is fraudulent or abusive • The clusters resulting from the cluster procedure also appeared to be effective in separating legitimate from fraudulent or abusive claims

  26. Comparison: PRIDIT and Clustering • PRIDIT gives a score, which may be very useful for claims sorting. Clustering assigns claims to classes. They are either in or out of the assigned class. • Clustering ignores information about the order of values for categorical variables • Clustering can accommodate both categorical and continuous variables

  27. Comparison • Unordered categorical variables with many values (e.g., injury type): • Clustering has a procedure for measuring dissimilarity for these variables and can use them in clustering • If the values for the variables contain no meaningful order, PRIDIT will not help in creating variables to use in Principal Components Analysis.

  28. Review of Fraud Classification Using Principal Components Analysis of RIDITS By Louise A. Francis Francis Analytics and Actuarial Data Mining, Inc.
