
Data mining in Health Insurance


Presentation Transcript


  1. Data mining in Health Insurance

  2. Introduction • Rob Konijn, rob.konijn@achmea.nl • VU University Amsterdam • Leiden Institute of Advanced Computer Science (LIACS) • Achmea Health Insurance • Currently working here • Delivering leads for other departments to follow up • Fraud, abuse • Research topic keywords: data mining/ unsupervised learning / fraud detection

  3. Outline • Intro Application • Health Insurance • Fraud detection • Part 1: Subgroup discovery • Part 2: Anomaly detection (slides partly by Z. Slavik, VU)

  4. Intro Application • Health Insurance Data • Health Insurance in NL • Obligatory • Only private insurance companies • About 100 euro/month (everyone) + 170 euro (income-dependent) • Premium increase of 5-12% each year • Achmea: about 6 million customers

  5. Funding of Health Insurance Costs in the Netherlands • [Diagram: money flows through the risk-equalization fund (vereveningsfonds). Into the fund: government contribution for insured under 18 (~2 bln euro) and income-dependent employer contributions (~17 bln); out of the fund: equalization contributions to the health insurers (~18 bln). Insured adults (18+) pay the insurer a nominal premium: calculation premium ~€947 per insured (12 bln) plus a surcharge of ~€150 per insured (2 bln). Total healthcare expenditure: ~30 bln euro.]

  6. Verevenings-model (risk-equalization model) • By population characteristics • Age • Gender • Income, social class • Type of work • Calculation afterwards • High-cost compensation (>15,000 euro) • Table: amounts per insured by age group (euro)
     Age        Men    Women
     0-4 yr     1,400  1,210
     5-9 yr     1,026    936
     10-14 yr     907    918
     15-17 yr     964  1,062
     18-24 yr     892  1,214
     25-29 yr     870  1,768
     30-34 yr     905  1,876
     35-39 yr     980  1,476
     40-44 yr   1,044  1,232
     45-49 yr   1,183  1,366
     50-54 yr   1,354  1,532
     55-59 yr   1,639  1,713
     60-64 yr   1,885  1,905
     65-69 yr   2,394  2,201
     70-74 yr   2,826  2,560
     75-79 yr   3,244  2,886
     80-84 yr   3,349  3,018
     85-89 yr   3,424  3,034
     90+ yr     3,464  3,014

  7. Fraud in healthcare (Fraude in de zorg)

  8. Introduction Application: The Data • Transactional data • Records of an event • Visit to a medical practitioner • Charged directly by the medical practitioner • The patient is not involved • Risk of fraud

  9. Transactional Data • Transactions: Facts • Achmea: About 200 mln transactions per year • Info of customers and practitioners: dimensions

  10. Different levels of hierarchy • Records represent events • However, for fraud detection for example, we are interested in customers or medical practitioners • See examples on the next pages • Groups of records: Subgroup Discovery • Individual patients/practitioners: outlier detection

  11. Different types of fraud hierarchy • On a patient level, or on a hospital level:

  12. Handling the different hierarchy levels • Creating profiles from transactional data • Aggregating costs over a time period • Each record: a patient • Each attribute i = 1 to n: cost spent on treatment i • Feature construction, for example • The ratio of long/short consults (G.P.) • The ratio of 3-way and 2-way fillings (dentist) • Usually used for one-way analysis (see the sketch below)
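A minimal sketch of this profile construction in pandas, assuming a hypothetical transactions table with columns patient_id, treatment and cost (the column names, treatment labels and the ratio feature are illustrative, not the actual claims data model):

```python
import pandas as pd

# Hypothetical transactional data: one record per claimed treatment.
transactions = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2, 3],
    "treatment":  ["consult_short", "consult_long", "consult_short",
                   "filling_2way", "filling_3way", "consult_long"],
    "cost":       [12.0, 24.0, 12.0, 40.0, 55.0, 24.0],
})

# Profile: one row per patient, one column per treatment,
# holding the cost spent on that treatment over the period.
profiles = transactions.pivot_table(index="patient_id", columns="treatment",
                                    values="cost", aggfunc="sum", fill_value=0.0)

# Example constructed feature: the ratio of 3-way to 2-way fillings (by cost),
# with a small constant to avoid division by zero.
profiles["ratio_3way_2way"] = profiles["filling_3way"] / (profiles["filling_2way"] + 1e-9)
print(profiles)
```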

  13. Different types of fraud detection • Supervised • A labeled fraud set • A labeled non-fraud set • Credit cards, debit cards • Unsupervised • No labels • Health Insurance, Cargo, telecom, tax etc.

  14. Unsupervised learning in Health Insurance Data • Anomaly Detection (outlier detection) • Finding individual deviating points • Subgroup Discovery • Finding (descriptions of) deviating groups • Focus on differences and uncommon behavior • In contrast to other unsupervised learning methods • Clustering • Frequent Pattern mining

  15. Subgroup Discovery • Goal: Find differences in claim behavior of medical practitioners • To detect inefficient claim behavior • Actions: • A visit from the account manager • To include in contract negotiations • In the extreme case: fraud • Investigation by the fraud detection department • By describing deviations of a practitioner from its peers • Subgroups

  16. Patient-level, Subgroup Discovery • Subgroup (orange): group of patients • Target (red) • Indicates whether a patient visited a practitioner (1), or not (0)

  17. Subgroup Discovery: Quality Measures • Target dentist: 1,672 patients • Compare with the peer group, 100,000 patients in total • Subgroup V11 > 42 euro: 10,347 patients • V11: one-sided filling • Cross table

  18. The cross table • Cross table in data • Cross table expected: • Assuming independence

  19. Calculating WRAcc and Lift (see the sketch below) • Size subgroup = P(S) = 0.10347, size target dentist = P(T) = 0.01672 • Weighted Relative ACCuracy (WRAcc) = P(ST) – P(S)P(T) = (871 – 173)/100,000 = 698/100,000 • Lift = P(ST)/(P(S)P(T)) = 871/173 = 5.03
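A small sketch reproducing these numbers from the counts on the previous slides (the joint count of 871 patients is read off the cross table; the expected count of about 173 follows from independence):

```python
# Counts for the dentistry example (peer group of 100,000 patients).
N          = 100_000   # total patients in the peer group
n_subgroup = 10_347    # patients with V11 > 42 euro
n_target   = 1_672     # patients of the target dentist
n_both     = 871       # patients in both the subgroup and the target (from the cross table)

p_s, p_t, p_st = n_subgroup / N, n_target / N, n_both / N
expected_both = p_s * p_t * N      # about 173 patients expected under independence

wracc = p_st - p_s * p_t           # (871 - 173) / 100,000 = 0.00698
lift  = p_st / (p_s * p_t)         # 871 / 173 = 5.03
print(expected_both, wracc, lift)
```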

  20. Example dentistry, at depth 1, one target dentist

  21. ROC analysis, target dentist

  22. Making SD more useful: adding prior knowledge • Adding prior knowledge • Patient background variables (age, gender, etc.) • Practitioner specialism • For dentistry: choice of insurance • Adding already known differences • Already detected by domain experts themselves • Already detected during a previous data mining run

  23. Prior Knowledge, Motivation

  24. Example, influence of prior knowledge

  25. The idea: create an expected cross table using prior knowledge
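One way to realize this idea, sketched with a logistic-regression model of the target on the prior-knowledge variables (the model choice, the simulated columns and the cell layout are assumptions for illustration): each patient gets an expected probability of belonging to the target, and summing those probabilities inside and outside the subgroup gives the expected cross table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Hypothetical patient data: prior-knowledge variables, a subgroup indicator,
# and the binary target (visited the target practitioner or not).
patients = pd.DataFrame({
    "age":      rng.integers(18, 90, n),
    "female":   rng.integers(0, 2, n),
    "subgroup": rng.integers(0, 2, n),
    "target":   rng.integers(0, 2, n),
})

# Model the target from the prior knowledge only (not from the subgroup itself).
prior = patients[["age", "female"]]
model = LogisticRegression().fit(prior, patients["target"])
p_hat = model.predict_proba(prior)[:, 1]

# Expected cross table: sum the predicted probabilities within each subgroup cell.
in_s = patients["subgroup"] == 1
expected = pd.DataFrame(
    {"expected target":     [p_hat[in_s].sum(),       p_hat[~in_s].sum()],
     "expected not-target": [(1 - p_hat)[in_s].sum(), (1 - p_hat)[~in_s].sum()]},
    index=["in subgroup", "not in subgroup"])
print(expected)
```

The observed cross table can then be compared against this expected table with the same quality measures as before.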

  26. Quality Measures • Ratio (Lift) • Difference (WRAcc) • Squared sum (Chi-square statistic)
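Written out for a subgroup S and target T (standard formulations; the chi-square version shown is the usual observed-versus-expected form over the cells of the cross table, which is an assumption about the exact statistic used):

```latex
\mathrm{Lift}(S,T)  = \frac{P(S \cap T)}{P(S)\,P(T)}, \qquad
\mathrm{WRAcc}(S,T) = P(S \cap T) - P(S)\,P(T), \qquad
\chi^2 = \sum_{\text{cells } c} \frac{(O_c - E_c)^2}{E_c}
```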

  27. Example, iterative approach • Idea: add each found subgroup to the prior knowledge iteratively • Target = a single pharmacy • Patients that visited the hospital in the last 3 years are removed from the data • Compare with the peer group (400,000 patients), 2,929 patients of the target pharmacy • Top subgroup: “B03XA01 (Erythropoietin) > 0 euro” • Cross table: rows = target pharmacy vs. rest, columns = subgroup (B03XA01 > 0 euro) vs. rest

  28. Next iteration • Add “B03XA01 (EPO) >0 euro” to prior knowledge • Next best subgroup: “N05AX08 (Risperdal)>= 500 euro”
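A compact sketch of this loop, assuming the candidate subgroups and the accumulated prior knowledge are represented as binary 0/1 columns and a logistic regression supplies the per-patient expected rate given the prior knowledge (the helper structure and the WRAcc-style score are illustrative choices, not the exact procedure from the talk):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def expected_rate(prior_cols, data, target):
    """Per-patient expected P(target) given the current prior knowledge;
    with no prior knowledge yet, fall back to the overall base rate."""
    if not prior_cols:
        return np.full(len(data), data[target].mean())
    model = LogisticRegression().fit(data[prior_cols], data[target])
    return model.predict_proba(data[prior_cols])[:, 1]

def iterative_subgroup_discovery(data, candidates, target, iterations=3):
    """Greedy loop: pick the candidate subgroup whose observed target count
    exceeds the expected count the most (a WRAcc-style score), then add it
    to the prior knowledge so the next iteration must explain something new."""
    prior, found = [], []
    for _ in range(iterations):
        p_hat = expected_rate(prior, data, target)
        scores = {c: ((data[target] - p_hat) * data[c]).mean()
                  for c in candidates if c not in prior}
        if not scores:
            break
        best = max(scores, key=scores.get)
        found.append((best, scores[best]))
        prior.append(best)
    return found
```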

  29. Figure describing the subgroup N05AX08 > 500 euro. Left: target pharmacy, right: other pharmacies

  30. Addition: adding costs to the quality measure • M55: dental cleaning • V11: one-sided filling • V21: polishing • Average cost of the treatments in the subgroup: 370 euro • 791 more patients than expected • Total quality: 791 × 370 ≈ 292,469 euro

  31. Iterative approach, top 3 subgroups • V12: 2-sided filling • V21: polishing • V60: indirect pulp capping • V21 and V60 are not allowed on the same day • Claimed back (from all dentists): 1.3 million euro

  32. 3D isometric plots, cost-based quality measure

  33. Other target types: double binary target • Target 1: the year (2009 or 2008) • Target 2: the target practitioner • Pattern: • M59: extensive (expensive) dental cleaning • C12: second consult in one year • Cross table:

  34. Other target types: Multiclass target • Subgroup (orange): group of patients • Target (red), now is a multi-value column, one value per dentist

  35. Multiclass target, in ROC Space

  36. Anomaly Detection • The example above contains a contextual anomaly...

  37. Outline Anomaly Detection • Anomalies • Definition • Types • Technique categories • Examples • Lecture based on • Chandola et al. (2009). Anomaly Detection: A Survey • Paper in BB 38

  38. Definition • “Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior” • Anomalies, a.k.a. • Outliers • Discordant observations • Exceptions • Aberrations • Surprises • Peculiarities • Contaminants

  39. Anomaly types Point anomalies • A data point is anomalous with respect to the rest of the data

  40. Not covered today • Other types of anomalies: • Collective anomalies • Contextual anomalies • Other detection approaches: • Supervised learning • Semi supervised • Assume training data is from normal class • Use to detect anomalies in the future

  41. We focus on outlier scores • Scores • You get a ranked list of anomalies • “We investigate the top 10” • “An anomaly has a score of at least 134” • Leads are followed up by fraud investigators • The alternative output: labels, marking each instance as ANOMALY or normal

  42. Detection method categorisation • Model based • Depth based • Distance based • Information theory related (not covered) • Spectral theory related (not covered)

  43. Model based • Build a (statistical) model of the data • Normal data instances occur in high-probability regions of the stochastic model, while anomalies occur in low-probability regions • Or: data instances that have a high distance to the model are outliers • Or: data instances that have a high influence on the model are outliers

  44. Example: one-way outlier detection • Pharmacy records • Records represent patients • One attribute at a time • This example: the attribute describing the costs spent on fertility medication (gonadotropin) in a year • We could use such one-way detection for each attribute in the data

  45. Example, model = parametric probability density function
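A minimal sketch of scoring one attribute against a fitted parametric model, here a normal distribution (the distribution choice and the toy cost values are assumptions; skewed cost data often calls for a log transform or a different family):

```python
import numpy as np
from scipy import stats

# Toy yearly costs (euro) per patient on one medication group; the last value clearly deviates.
costs = np.array([0, 0, 12, 15, 14, 18, 16, 13, 17, 950], dtype=float)

# Fit a normal model and score each patient by how unlikely its value is under that model.
mu, sigma = costs.mean(), costs.std(ddof=1)
z_scores = np.abs(costs - mu) / sigma              # distance to the model
log_density = stats.norm.logpdf(costs, mu, sigma)  # low density = low-probability region

print(z_scores.round(2))
print(np.argsort(log_density)[:3])                 # indices of the three least likely patients
```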

  46. Example, model = non-parametric distribution • Left: kernel density estimate • Right: boxplot
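A sketch of both non-parametric views on the same toy attribute, using a Gaussian kernel density estimate and the standard 1.5 × IQR boxplot rule (both are textbook techniques, not necessarily the exact settings used in the talk):

```python
import numpy as np
from scipy.stats import gaussian_kde

costs = np.array([0, 0, 12, 15, 14, 18, 16, 13, 17, 950], dtype=float)

# Kernel density estimate: a low estimated density gives a high outlier score.
kde = gaussian_kde(costs)
outlier_score = -np.log(kde(costs) + 1e-12)

# Boxplot rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(costs, [25, 75])
iqr = q3 - q1
is_outlier = (costs < q1 - 1.5 * iqr) | (costs > q3 + 1.5 * iqr)

print(outlier_score.round(2))
print(is_outlier)
```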

  47. Example: regression model

  48. Other models possible • Probabilistic • Bayesian networks • Regression models • Regression trees/ random forests • Neural networks • Outlier score = prediction error (residual)
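A sketch of the residual-as-score idea with a random forest, predicting one cost attribute from two others on simulated profiles (the features, the injected anomalies and the use of out-of-bag predictions are all assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Simulated profiles: predict testing-material costs from medication costs and age.
X = np.column_stack([rng.gamma(2.0, 100.0, n),   # diabetes medication costs
                     rng.integers(18, 90, n)])   # patient age
y = 0.5 * X[:, 0] + rng.normal(0.0, 20.0, n)     # testing-material costs
y[:5] += 2000.0                                  # inject a few deviating patients

# Out-of-bag predictions avoid scoring each patient with trees that saw it during training.
model = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
residuals = np.abs(y - model.oob_prediction_)    # outlier score = |prediction error|

print(np.argsort(-residuals)[:5])                # the five highest-scoring patients
```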

  49. Depth based methods • Applied on 1-4 dimensional datasets • Or 1-4 attributes at a time • Objects that have a high distance to the “center of the data” are considered outliers • Example Pharmacy: • Records represent patients • 2 attributes: • Costs spent on diabetes medication • Costs spent on diabetes testing material
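True depth-based methods peel convex hulls, which is hard to show in a few lines; the closely related "distance to the center of the data" idea on the two pharmacy attributes can be sketched with the Mahalanobis distance (a stand-in technique, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Two correlated cost attributes per patient: diabetes medication and testing material.
medication = rng.gamma(2.0, 150.0, n)
testing = 0.4 * medication + rng.normal(0.0, 30.0, n)
X = np.column_stack([medication, testing])
X[:3] = [[50.0, 800.0], [900.0, 0.0], [2000.0, 2000.0]]   # inject deviating patients

# Mahalanobis distance to the center: a large distance means far from the bulk of the data.
center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

print(np.argsort(-dist)[:3])   # the three patients farthest from the center
```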
