Outlier Identification Procedures for Contingency Tables in Longitudinal Data
Danila Filipponi, Simonetta Cozzi
ISTAT, Italy
Roma, 8-11 July 2008

What is the problem?
► Starting from December 2006, ISTAT releases a statistical register of local units (LU) of enterprises (ASIA-LU), supplying every year information on local units that until 2001 was available only every ten years (Industry and Services Census).
► The register has been set up starting from an administrative/statistical information base of addresses, using statistical models to estimate the activity status and other attributes of the local units.
► ASIA-LU provides (mainly) the number of local units and of local-unit employees by municipality and economic activity.
Outline: What is the problem?; Outlier in contingency tables; Non parametric approach – Median Polish; Correlated count data; Correlated count data - GEE; Correlated count data - REGEE; Simulation of Correlated count data; Results; Outliers Detection in ASIA-UL

What is the problem?
► Because of the nature of the available information, selective editing to identify possibly anomalous counts (LU/employees) in some combinations of the classification variables is indispensable.
► The objective is to identify anomalous numbers of employees and/or local units classified by municipality and economic activity, taking into account the longitudinal information on LU, i.e. the local units registers (2004-2005) and the Census surveys (1991-1996-2001).
The contingency table cross-classifies the counts by municipality and economic activity at each available time point.

What is the problem?
► Outlying observations in a set of data are generally viewed as deviations from a model assumption: the majority of observations (inliers) are assumed to come from a selected model (null model); a few units (outliers) are thought of as coming from a different model.
► The outlier identification problem is then translated into the problem of identifying those observations that lie in an outlier region $\mathrm{out}(\alpha, P)$ defined according to the selected null model $P$, where $\mathcal{P}$ is a distribution family such that $P \in \mathcal{P}$ and $P$ has density $f$.

Outliers in Contingency Tables: Some Notation
Consider $T$ categorical variables, each with a finite set of possible outcomes. Each combination of outcomes defines a cell of a contingency table. Given a set of data, each observation belongs to one combination, and the frequency count of cell $i$ can be denoted as $y_i$, $i = 1, \dots, n$.
Under a loglinear Poisson model, the cell counts are considered as realizations of independent Poisson variables $Y_i$ with expected values $\mu_i$.
In a contingency table, a cell count $y_i$ is viewed as an outlier if it occurs with a small probability under the null model.

Outliers in Contingency Tables
► Assuming a loglinear Poisson model, the outlier region for each cell count $y_i$ is defined as
$\mathrm{out}(\alpha_i, \mathrm{Poi}(\mu_i)) = \{ y \in N : \mathrm{poi}(y, \mu_i) < K(\alpha_i) \}$,
where $N$ is the set of all non-negative integers, $\mathrm{poi}(y, \mu_i)$ is the Poisson probability of $y$, and
$K(\alpha_i) = \sup \{ K > 0 : \sum_{y \in N,\ \mathrm{poi}(y, \mu_i) < K} \mathrm{poi}(y, \mu_i) \le \alpha_i \}$.
► The cell count $y_i$ is then an $\alpha_i$-outlier if it lies in the $\alpha_i$-outlier region of the Poisson distribution with parameter $\mu_i$. The values $\alpha_i$ should be chosen so that the probability that one or more outliers occur in the contingency table under the null model does not exceed a given value $\alpha$. Assuming all the $\alpha_i$ to be the same, it can be shown that $\alpha_i = 1 - (1 - \alpha)^{1/n}$.
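The outlier region of a single Poisson cell can be computed numerically. The sketch below is only an illustration, not the authors' implementation: it assumes the usual level-set definition (the least likely counts whose total probability does not exceed the cell level) and the equal-level choice $1 - (1 - \alpha)^{1/n}$ for independent cells. The function names and the truncation point of the support are assumptions made for this example.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu > 0 (computed on the log scale)."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def cell_level(alpha, n_cells):
    """Common per-cell level so that, over n independent cells, the chance
    of at least one count falling in its outlier region is at most alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_cells)

def inlier_bounds(alpha_i, mu):
    """Smallest and largest count outside the alpha_i-outlier region of Poi(mu)."""
    upper = int(mu + 20 * math.sqrt(mu) + 50)      # truncation point (assumption)
    pmf = [poisson_pmf(y, mu) for y in range(upper + 1)]
    out_mass = max(0.0, 1.0 - sum(pmf))            # mass beyond the truncation point
    order = sorted(range(upper + 1), key=lambda y: pmf[y])  # least likely first
    outliers = set()
    i = 0
    while i < len(order):
        # Counts with identical probability enter the region together.
        j, group_mass = i, 0.0
        while j < len(order) and pmf[order[j]] == pmf[order[i]]:
            group_mass += pmf[order[j]]
            j += 1
        if out_mass + group_mass > alpha_i:
            break
        out_mass += group_mass
        outliers.update(order[i:j])
        i = j
    inliers = [y for y in range(upper + 1) if y not in outliers]
    return min(inliers), max(inliers)

def is_outlier(y, alpha_i, mu):
    """True if the observed count y lies in the alpha_i-outlier region."""
    low, high = inlier_bounds(alpha_i, mu)
    return not (low <= y <= high)

bounds = inlier_bounds(0.05, 4.0)   # inlier counts for a cell with mean 4
```

Because the Poisson distribution is unimodal, the inlier region is an interval of integers, so its two end points describe the outlier region completely.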

Outliers in Contingency Tables
► In practice, to define the outlier regions and identify the outlying cells, it is necessary to estimate the vector of parameters $\mu = (\mu_1, \dots, \mu_n)'$.
► Loglinear models for contingency tables are Generalized Linear Models (GLM) in which the expected cell count is $\mu = \exp(X\beta)$, where $X$ is a full-rank design matrix and $\beta$ a parameter vector.
► In the situation with only one measurement for each subject, i.e. without a correlation structure, the classical estimator for GLMs is the maximum likelihood (ML) estimator. Because of the nature of the ML estimator, the regression parameter estimates can be highly influenced by the presence of outlying cells. Some robust alternatives have been proposed in the literature.
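The sensitivity of ML to a single outlying cell can be seen in the simplest loglinear model, independence in a two-way table, where the ML fitted values have the closed form (row total × column total) / grand total. The counts below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def fit_independence(table):
    """ML fitted values of the loglinear independence model for a two-way
    table: mu_ij = (row total i) * (column total j) / grand total."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    return [[r * c / total for c in col_tot] for r in row_tot]

# Hypothetical counts that satisfy the independence model exactly.
clean = [[10, 20, 30],
         [20, 40, 60],
         [30, 60, 90]]

# Same table with one anomalous cell.
contaminated = [row[:] for row in clean]
contaminated[0][0] = 100

fit_clean = fit_independence(clean)
fit_cont = fit_independence(contaminated)

# Residuals (observed minus fitted) in the contaminated table.
resid_cont = [[y - m for y, m in zip(obs_row, fit_row)]
              for obs_row, fit_row in zip(contaminated, fit_cont)]
```

In this example the fitted value of the anomalous cell is pulled up to 50, and the untouched cell next to it (observed 20) gets a fitted value of 40. The outlier looks less extreme than it is, while regular cells acquire large residuals.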

Non parametric approach – Median Polish
► A procedure that supplies robust estimates in the analysis of contingency tables is the median polish method (Mosteller & Tukey, 1977; Emerson & Hoaglin, 1983).
► Given a contingency table with two factors, if an additive model is assumed, the value $y_{ij}$ can be expressed as the sum of a constant term, an effect for level $i$ of the row factor, an effect for level $j$ of the column factor, and a random term: $y_{ij} = m + a_i + b_j + e_{ij}$.
► The median polish procedure operates iteratively on the table, calculating and subtracting row and column medians, and ends when all the rows and columns have a median equal to zero.
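The iteration can be sketched in a few lines. This is a minimal version of the standard Tukey median polish for the additive model (overall term + row effect + column effect + residual). The example table, the iteration cap and the tolerance are assumptions made for illustration.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-9):
    """Decompose a two-way table as overall + row effect + column effect + residual
    by repeatedly sweeping out row and column medians."""
    resid = [[float(v) for v in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Row sweep: move each row median from the residuals into the row effect.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Column sweep: same operation on the columns.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop once every row and column median is (numerically) zero.
        if all(abs(m) <= tol for m in row_med + col_med):
            break
    return overall, row_eff, col_eff, resid

# Hypothetical additive table (overall 10, row effects -2, 0, 3,
# column effects -1, 0, 4) with 50 added to the top-right cell.
table = [[7, 8, 62],
         [9, 10, 14],
         [12, 13, 17]]
overall, row_eff, col_eff, resid = median_polish(table)
```

In this example the planted deviation stays in the residual of the top-right cell while the overall term and the effects are recovered, which is the robustness property that makes the residuals usable for outlier screening.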

Correlated count data
In longitudinal studies, each subject $i$ contributes $n_i$ repeated responses $y_{i1}, \dots, y_{in_i}$, each with its own covariates.
► Repeated responses on the same subject tend to be more alike (generally positively correlated) than responses on different subjects. Standard statistical procedures that ignore the within-subject correlation may produce invalid results.
► There are several ways to extend GLMs to take this correlation into account: the marginal modeling approach (GEE), random effects models for categorical responses (GLMM), and transitional models.

Correlated count data - GEE
► A reasonable alternative to ML estimation for longitudinal count data is a multivariate generalization of the quasi-likelihood. Let
$Y_i = (y_{i1}, \dots, y_{in_i})'$ be the $n_i$ vector of outcomes,
$X_i = (x_{i1}, \dots, x_{in_i})'$ be the $n_i \times p$ matrix of covariates.
► Rather than assuming a distribution for the response variable $Y$, the quasi-likelihood method specifies only the moments:
the mean $E(y_{ij}) = \mu_{ij}$, which is a function of the linear predictor, $g(\mu_{ij}) = x_{ij}'\beta$;
the variance $\mathrm{Var}(y_{ij}) = \phi\, v(\mu_{ij})$, which depends on the mean and a scale parameter $\phi$.

Correlated count data - GEE
The covariance matrix of $Y_i$ is $V_i = \phi A_i^{1/2} R_i(\rho) A_i^{1/2}$, where:
$A_i$ is an $n_i \times n_i$ diagonal matrix with $v(\mu_{ij})$ as the jth diagonal element;
$R_i(\rho)$ is an $n_i \times n_i$ correlation matrix.
In the quasi-likelihood method, the estimates of the regression and nuisance parameters are the solutions of the generalized quasi-score function, called Generalized Estimating Equation (GEE): $\sum_{i=1}^{K} D_i' V_i^{-1} (Y_i - \mu_i) = 0$, with $D_i = \partial \mu_i / \partial \beta'$ and $K$ the number of subjects.
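To make the working covariance concrete, the sketch below builds it for one subject as scale × (diagonal of standard deviations) × (working correlation) × (diagonal of standard deviations). It assumes the Poisson variance function v(mu) = mu and an AR(1) working correlation with parameter rho. The function names and the numeric values are illustrative only.

```python
import math

def ar1_correlation(n, rho):
    """AR(1) working correlation: entry (j, k) equals rho ** |j - k|."""
    return [[rho ** abs(j - k) for k in range(n)] for j in range(n)]

def working_covariance(mu, rho, phi=1.0):
    """Working covariance phi * A^(1/2) R A^(1/2) for one subject, with
    A = diag(mu) (Poisson variance function) and R the AR(1) correlation."""
    n = len(mu)
    corr = ar1_correlation(n, rho)
    return [[phi * math.sqrt(mu[j] * mu[k]) * corr[j][k] for k in range(n)]
            for j in range(n)]

# One subject observed on three occasions with means 4, 9 and 16.
V = working_covariance([4.0, 9.0, 16.0], rho=0.5)
```

The diagonal of `V` reproduces the Poisson variances (scaled by `phi`), and the off-diagonal entries decay geometrically with the distance between occasions.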

Correlated count data - REGEE
► Because the QL estimators have properties similar to the ML estimators, the estimates of the regression and nuisance parameters can be influenced by outliers.
► Preisser and Qaqish (1999), in order to provide robust estimation of $\beta$, introduced a generalization of GEE that includes weights in the estimating equations in order to downweight influential observations.
► They define the resistant generalized estimating equation (REGEE) as: $\sum_{i=1}^{K} D_i' V_i^{-1} \left[ W_i (Y_i - \mu_i) - c_i \right] = 0$.

Correlated count data - REGEE
where:
$W_i$ is an $n_i \times n_i$ diagonal weight matrix containing the robustness weights $w_{ij}$. The weights have been chosen as a function of the Pearson residuals, to ensure robustness with respect to outlying points in the y-space;
$c_i$ is a bias-eliminating constant determined by the marginal distribution of $Y$, where $c_i = E[W_i (Y_i - \mu_i)]$.

Correlated count data - REGEE
► Robust estimators are also needed for the nuisance parameters $\phi$ and $\rho$, to avoid consequences for the regression parameter estimates.
► Moment estimators of $\phi$ and $\rho$ are used, where an autoregressive AR(1) working correlation matrix has been specified (i.e. $\mathrm{corr}(y_{ij}, y_{ik}) = \rho^{|j-k|}$).

Simulation of Correlated count data
Outlier identification procedures, based on parameters previously estimated with the three different estimation methods, have been compared in a simulation study.
In the study, 4x4x5 tables are simulated from a loglinear Poisson model, $\log \mu_{ij} = x_{ij}'\beta$, where $\beta$ is the parameter vector and $x_{ij}$ is a row of the design matrix $X$ obtained by dummy coding.

Simulation of Correlated count data
► Correlated Poisson variables are simulated using the overlapping sum (OS) algorithm (Park and Shin, 1998).
► If $Y$ is a random vector with mean $\mu$ and covariance matrix $\Sigma$, in the OS method $Y$ is decomposed as $Y = TZ$, where $T$ is an $n \times l$ matrix of 0's and 1's and $Z$ is an $l$-vector of independent Poisson variables. The dimension $l$ depends on the structure of the covariance matrix, and the matrix $T$ is defined in such a way that $Y$ has the proper mean vector and covariance matrix.
► Once $T$ is defined, the means $\lambda$ of $Z$ can be obtained by solving the equation $\mu = T\lambda$.
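The overlapping-sum idea can be illustrated with two counts that share one Poisson component. The sketch below is a toy example under stated assumptions, not the Park and Shin procedure for choosing the 0/1 matrix: the matrix `T`, the component means `lam`, the seed and the simple Poisson sampler are all chosen by hand for illustration.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson variate by Knuth's multiplication method
    (adequate for the small means used here)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def os_moments(T, lam):
    """Mean vector T*lam and covariance T*diag(lam)*T' implied by Y = T Z
    when the components of Z are independent Poisson with means lam."""
    n, l = len(T), len(lam)
    mean = [sum(T[i][k] * lam[k] for k in range(l)) for i in range(n)]
    cov = [[sum(T[i][k] * T[j][k] * lam[k] for k in range(l)) for j in range(n)]
           for i in range(n)]
    return mean, cov

def os_sample(T, lam, rng):
    """One draw of the correlated vector Y = T Z."""
    z = [poisson_draw(v, rng) for v in lam]
    return [sum(t * zk for t, zk in zip(row, z)) for row in T]

# Y1 = Z1 + Z3 and Y2 = Z2 + Z3: the shared component Z3 induces the correlation.
T = [[1, 0, 1],
     [0, 1, 1]]
lam = [3.0, 6.0, 2.0]            # hypothetical component means
mean, cov = os_moments(T, lam)

rng = random.Random(2008)        # fixed seed for reproducibility
draws = [os_sample(T, lam, rng) for _ in range(20000)]
```

With these inputs the implied means are 5 and 8 and the implied covariance between the two counts is 2, the mean of the shared component. Because every entry of `T` and `lam` is non-negative, this construction can only produce non-negative correlations.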

Simulation of Correlated count data
Outliers in the simulated tables are produced by replacing the selected cell $Y_{ijt}$ by $\max(\mathrm{inl}(\alpha, \mu_{ij})) + 1$ or $\min(\mathrm{inl}(\alpha, \mu_{ij})) - 1$, where $\mathrm{inl}$ denotes the inlier region and $\alpha$ has been chosen as $10^{-2}$, $10^{-4}$, $10^{-8}$.

Results
Figure panels: ρ = 0.1, % outliers = 0.05; ρ = 0.1, % outliers = 0.01; ρ = 0.8, % outliers = 0.05; ρ = 0.8, % outliers = 0.01

Outliers Detection in ASIA-UL
The outlier identification procedures have been applied in the control process of the Statistical Register of the Local Units (ASIA-UL).
