Dimension Reduction in Workers Compensation • CAS Predictive Modeling Seminar • Louise Francis, FCAS, MAAA • Francis Analytics and Actuarial Data Mining, Inc. • Louise_francis@msn.com • www.data-mines.com
Objectives • Answer questions: What is dimension reduction and why use it? • Introduce key methods of dimension reduction • Illustrate with examples in Workers Compensation • There will be some formulas, but emphasis is on insight into basic mechanisms of the procedures
Introduction • “How do mere observations become data for analysis?” • “Specific variable values are never immutable characteristics of the data” • Jacoby, Data Theory and Dimension Analysis, Sage Publications • Many of the dimension reduction/measurement techniques originated in the social sciences and dealt with how to create scales from responses on attitudinal and opinion surveys
Unsupervised learning • Dimension reduction methods are generally unsupervised learning • Supervised learning • Has a dependent or target variable • Unsupervised learning • No target variable • Groups like variables or like records together
The Data • BLS Economic indexes • Components of inflation • Employment data • Health insurance inflation • Texas Department of Insurance closed claim data for 2002 and 2003 • Employment related injury • Excludes small claims • About 1800 records
What is a dimension? • Jacoby – The number of separate and interesting sources of variation • In many studies each variable is a dimension • However, we can also view each record in a database as a dimension
The Two Major Categories of Dimension Reduction • Variable reduction • Factor Analysis • Principal Components Analysis • Record reduction • Clustering • Other methods tend to be developments on these
Principal Components Analysis • A form of dimension (variable) reduction • Suppose we want to combine all the information related to the “inflation” dimension of insurance costs • Medical care costs • Employment (wage) costs • Other • Energy • Transportation • Services
Principal Components • These variables are correlated but not perfectly correlated • We replace many variables with a weighted sum of the variables • These are then used as independent variables in a predictive model
Factor/Principal Components Analysis • Linear methods – use the linear correlation matrix • The correlation matrix is decomposed to find a smaller number of factors that are related to the same underlying drivers • Highly correlated variables tend to have high loadings on the same factor
Factor/Principal Components Analysis • Uses eigenvectors and eigenvalues • $R V = V \Lambda$ • R is the correlation matrix, V the matrix of eigenvectors, $\Lambda$ the diagonal matrix of eigenvalues
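The eigen-decomposition of the correlation matrix can be sketched numerically. This is a minimal illustration, assuming NumPy and a small simulated data set standing in for correlated inflation indexes; the data, the seed, and all variable names are hypothetical and are not taken from the BLS or Texas data.

```python
import numpy as np

# Hypothetical data: 200 periods x 4 indexes that share one common driver
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
X = common + 0.5 * rng.normal(size=(200, 4))

# R is the correlation matrix of the variables
R = np.corrcoef(X, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, V = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # re-sort so the largest comes first
eigvals, V = eigvals[order], V[:, order]

# Each principal component is a weighted sum of the standardized variables,
# with the weights given by one column of V
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ V

print("eigenvalues:", np.round(eigvals, 3))
print("weights of first component:", np.round(V[:, 0], 3))
```

Because the four simulated indexes share a common driver, most of the variance should land on the first component, and its scores could then serve as a single "inflation" predictor.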
Factor Rotation • Find simpler more easily interpretable factors • Use notion of factor complexity
Factor Rotation • Quartimax Rotation • Maximize q • Varimax Rotation • Maximizes the variance of squared loadings for each factor rather than for each variable
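As a rough illustration of varimax, the sketch below uses a commonly circulated SVD-based iteration written with NumPy. It is one possible implementation under stated assumptions, not the routine used by any particular statistical package; the function names, the loading matrix, and the 30 degree mixing angle are all hypothetical.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like matrix for the varimax criterion
        col_ss = np.diag((rotated ** 2).sum(axis=0))
        target = loadings.T @ (rotated ** 3 - rotated @ col_ss / p)
        u, s, vt = np.linalg.svd(target)
        rotation = u @ vt                  # nearest orthogonal matrix
        current = s.sum()
        if previous != 0.0 and current / previous < 1.0 + tol:
            break
        previous = current
    return loadings @ rotation, rotation

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.var(loadings ** 2, axis=0).sum())

# Hypothetical simple structure, deliberately mixed by a 30 degree rotation
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.9], [0.0, 0.6]])
theta = np.pi / 6
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 3))
print(varimax_criterion(unrotated), "->", varimax_criterion(rotated))
```

The rotation is orthogonal, so each variable's communality (row sum of squared loadings) is unchanged; only the way that variance is spread across the factors changes.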
How Many Factors to Keep? • Eigenvalues provide information on how much variance is explained • Proportion of variance explained by a given component = corresponding eigenvalue / n, where n is the number of variables • Use a scree plot • Rule of thumb: keep all factors with eigenvalues > 1
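The arithmetic behind these rules is short enough to show directly. The eigenvalues below are made up for illustration (chosen so they sum to n = 5, as eigenvalues of a 5-variable correlation matrix must); the variable names are hypothetical.

```python
# Hypothetical eigenvalues of a 5-variable correlation matrix
eigenvalues = [2.6, 1.2, 0.6, 0.4, 0.2]
n = len(eigenvalues)

# Proportion of variance explained by each component: eigenvalue / n
proportion = [ev / n for ev in eigenvalues]

# Cumulative proportion, the numbers one would read off a scree-style table
cumulative = []
running = 0.0
for share in proportion:
    running += share
    cumulative.append(running)

# Rule of thumb: keep components whose eigenvalue exceeds 1
n_keep = sum(1 for ev in eigenvalues if ev > 1.0)

for i, (ev, share, cum) in enumerate(zip(eigenvalues, proportion, cumulative), 1):
    print(f"component {i}: eigenvalue={ev:.1f} share={share:.2f} cumulative={cum:.2f}")
print("components kept:", n_keep)
```

In this made-up case the rule of thumb keeps two components, which together explain 76% of the variance.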
What About Categorical Data? • Factor analysis is performed on numeric data • You could code data as binary dummy variables • Categorical Variables from Texas data • Injury • Cause of loss • Business Class • Health Insurance (Y/N)
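Binary dummy coding can be sketched in a few lines of plain Python. The injury categories below are hypothetical placeholders, not values from the Texas data, and `dummy_code` is an illustrative helper rather than a library function.

```python
def dummy_code(values):
    """Return the sorted category levels and one 0/1 column per level."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

# Hypothetical injury field for four claims
injury = ["Back", "Head", "Back", "Burn"]
levels, coded = dummy_code(injury)

print(levels)
for row in coded:
    print(row)
```

Each record gets exactly one 1 across the columns for a given variable, so the resulting numeric columns can be fed to a correlation-based method.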
Optimal Scaling • A method of dealing with categorical variables • Can be used to model nonlinear relationships • Uses regression to • Assign numbers to categories • Fit regression coefficients • Y*=f(X*) • In each round of fitting, a new Y* and X* is created
Row Reduction: Cluster Analysis • Records are grouped in categories that have similar values on the variables • Examples • Marketing: People with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing • Text analysis: Use words that tend to occur together to classify documents • Fraud modeling • Territory definition • Note: no dependent variable used in analysis
Clustering • Common methods: k-means, hierarchical • No dependent variable – records are grouped into classes with similar values on the variables • Start with a measure of similarity or dissimilarity • Maximize dissimilarity between members of different clusters
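A bare-bones k-means loop (Lloyd's algorithm) shows the mechanism: assign each record to its nearest center, recompute the centers, and repeat. This is a sketch assuming NumPy and Euclidean distance; the six records, their two columns (imagined here as report lag and settlement lag), and the function names are hypothetical.

```python
import numpy as np

def assign(X, centers):
    """Label each record with the index of its nearest center."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def k_means(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct records chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Move each center to the mean of its records (keep it if empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

# Hypothetical records: (report lag, settlement lag)
X = np.array([[1.0, 10.0], [2.0, 12.0], [1.0, 11.0],
              [30.0, 200.0], [32.0, 210.0], [31.0, 190.0]])
labels, centers = k_means(X, k=2)
print(labels)
print(centers)
```

In practice the variables would usually be standardized first, since a variable measured on a larger scale otherwise dominates the distance.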
Dissimilarity (Distance) Measure – Continuous Variables • Euclidean Distance • Manhattan Distance
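Both measures are easy to state in code. Euclidean distance is the square root of the sum of squared differences, and Manhattan distance is the sum of absolute differences. The two records below are hypothetical.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Two hypothetical claims described by two numeric variables
claim_a = (3.0, 10.0)
claim_b = (6.0, 14.0)

print(euclidean(claim_a, claim_b))  # 5.0
print(manhattan(claim_a, claim_b))  # 7.0
```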
Binary Variables • Simple Matching • Rogers and Tanimoto
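For two binary records, both coefficients are built from the counts of agreements and disagreements. The sketch below uses the standard textbook definitions: simple matching is agreements divided by the number of variables, while Rogers and Tanimoto counts each disagreement twice in the denominator. The example vectors and function names are hypothetical.

```python
def match_counts(x, y):
    """Return (a, b, c, d): 1/1, 1/0, 0/1 and 0/0 counts for two 0/1 vectors."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def simple_matching(x, y):
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)

def rogers_tanimoto(x, y):
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + d + 2 * (b + c))

# Two hypothetical records with five binary indicators each
x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]

print(simple_matching(x, y))   # 3 agreements out of 5 variables
print(rogers_tanimoto(x, y))   # 3 / (3 + 2 * 2)
```

Both are similarities; subtracting either from 1 gives a dissimilarity that can be passed to a clustering routine.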
Example: Texas Data • Data from the 2002 and 2003 closed claim database of the Texas Department of Insurance • Only claims over a threshold included • Variables used for clustering: • Report Lag • Settlement Lag • County (ranked by how often in data) • Injury • Cause of Loss • Business class
Results Using Only Numeric Variables • Used Euclidean distance measure
Two Stage Clustering With Categorical Variables • First compute dissimilarity measures • Then get clusters • Find optimum number of clusters
Tying Things Together: Multidimensional Scaling • A mathematical way to connect clustering and factor analysis • Data can be decomposed into key row dimensions times a diagonal weight matrix times key column dimensions
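The decomposition described here has the same form as the singular value decomposition, X = U D V', where U carries the row dimensions, D is the diagonal weight matrix, and V' carries the column dimensions. The sketch below assumes NumPy and a small made-up data matrix; it illustrates that form and is not a full multidimensional scaling procedure.

```python
import numpy as np

# Hypothetical data matrix: 4 records (rows) x 3 variables (columns)
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0],
              [3.0, 1.0, 1.0]])

# U: row dimensions, s: diagonal weights, Vt: column dimensions (transposed)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Multiplying the three pieces back together recovers the data
reconstructed = U @ np.diag(s) @ Vt

# Keeping only the two largest weights gives a reduced-dimension approximation
rank2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

print("weights:", np.round(s, 3))
print("rank-2 approximation error:", round(float(np.linalg.norm(X - rank2)), 3))
```

The columns of U relate to grouping records and the rows of Vt relate to grouping variables, which is the sense in which one decomposition links clustering and factor analysis.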
Modern dimension reduction • Hidden layer in neural networks is like a nonlinear principal components analysis • Projection Pursuit Regression – a nonlinear PCA • Kohonen self-organizing maps – a kind of neural network that does clustering • These can be understood as enhancements of factor analysis or clustering
Recommended References • Hatcher, 1994, A Step-by-Step Approach to Using the SAS System for Factor Analysis and Structural Equation Modeling, SAS Publications • Jacoby, 1991, Data Theory and Dimension Analysis, Sage Publications • Kaufman and Rousseeuw, 1990, Finding Groups in Data, Wiley • Kim and Mueller, 1978, Factor Analysis: Statistical Methods and Practical Issues, Sage Publications