330 likes | 527 Vues
SCREENING MULTIVARIATE COMORBIDITIES . by. Gilbert MacKenzie. Centre of Biostatistics. Introduction. There are two general problems The first is to understand the multivariate distribution of relevant comorbidities, say p in number. . . Descriptive Statistical Activity.
 
                
                E N D
SCREENING MULTIVARIATE COMORBIDITIES by Gilbert MacKenzie Centre of Biostatistics
Introduction There are two general problems The first is to understand the multivariate distribution of relevant comorbidities, say p in number.  Descriptive Statistical Activity The second is to understand which components of this distribution influence outcome, say Y (disease status)  Explanatory Statistical Activity These are of course inter-related.
Focus The main focus in on the first question - taking into account  Binary (1,0) nature of comorbidities  Multivariate distribution p – dimensional  Resulting non-Normal correlation structure Analysis is then in terms of a p-dimensional contingency table involving a log-linear model for the resulting frequencies, i.e., a GLM.
Remark However, the second question is also of interest –  Requires a model for Y in terms of x – subset of comorbidities  Burden of comorbidity - Index or Score – sometimes effective  But no information on non-main effect complexity  Also, may not take account of clinical severity Hence loss of scientific information
Primary data structure p =3 Example, consider 3 binary comorbidities = A, B & C NB: where ijknijk = n & ijkijk = 1.
Statistical Model Basic assumption is ijk = E{ Nijk } = n.ijk = exp{ ijk} (1) with log-linear parametrization loge {ijk} =  ijk =  + i + j + k + ()ij + ()ik + ()jk + ()ijk (2) and conditional Poisson Model Pr (Nijk= nijk) = exp{-ijk}.ijknijk/ nijk! (3)
Interpretation Main Effects Model => SI For example, when the 2-way & 3-way interaction terms in (2) are zero we have a main effects model ijk = exp{i + j + k}(4) then the comorbidities are SI, ie, ABC Conditional Independence Model A model in which, say ijk = exp{i+j + k + ()ij + ()jk} (5) implies that A & C are SI given B and is denoted AC | B.
How does SPSS handle these models?  Hiloginlear - finds mle’s of expected numbers in cells of a hierarchical loglinear model by IPF Loglinear - finds mle’s of parameters by Newton Raphson   Genlog - similar to Loglin– different interface Above procedures restrict p =10 !! IPF = Iterative proportional fitting (Haberman, 1972) NB:Some incompatibilities between procedures
Example when p = 10 Data Example: Obesity study data comprising 5550 cases and p = 100 comorbidities. Selecting Candidate Co-morbidities Usually based on clinical criteria – here p = 10 selected on basis of most frequently occurring. Clinical Relevance However, no suggestion that these 10 are the most clinically relevant – rather they provide a stern test of the technique.
Search Strategies for p = 10 Types Forwards, Backwards, Best Subset, Structured Search Structured Search Step 1: Find the order of the maximal interaction supported Step 2: Use backward elimination to find best subset in this sub-class. Step 3: Refit final model using Newton-Raphson for estimates. Known to be a reasonable procedure & fast search – with IPF Statistical Selection Criteria (very limited) The use of P = 0.05 tends to produce subsets which are too rich, better to use P=0.01, but equivalent when structure is succinct.
Results when p = 10 K-way Interactions Tests that K-way effects are zero. K df L.R. Prob 1 10 30847.677 .0000 2 45 7285.308 .0000 3 120 180.996 .0003 4 210 243.772 .0549 5 252 131.966 1.0000 6 210 14.521 1.0000 7 120 1.701 1.0000 8 45 .033 1.0000 9 10 .000 1.0000 10 1 .000 1.0000 Max order = 4 LR is a measure of discrepancy here. So the 4-way interaction is required to provide a minimal fit – maybe not verygood !!
Best Subset of terms p = 10 Backwards elimination with P=0.01 & maxorder =4. 1. COM3*COM20*COM89*COM90 2. COM38*COM90*COM99 3. COM9*COM38*COM99 4. COM6*COM21 5. COM9*COM20*COM89 6. COM3*COM38 7. COM6*COM9 8. COM6*COM8 9. COM3*COM99 10. COM9*COM21*COM90 11. COM89*COM99 12. COM6*COM89 13. COM3*COM9*COM90 14. COM8*COM9
Confirmation of Best Subset of terms L-Ratio test - elimination of any interaction DF L.R. Prob 1. COM3*COM20*COM89*COM90 1 8.072 .0045 2. COM38*COM90*COM99 1 13.313 .0003 3. COM9*COM38*COM99 1 9.493 .0021 4. COM6*COM21 1 16.863 .0000 5. COM9*COM20*COM89 1 8.990 .0027 6. COM3*COM38 1 299.468 .0000 7. COM6*COM9 1 27.956 .0000 8. COM6*COM8 1 42.888 .0000 9. COM3*COM99 1 149.623 .0000 10. COM9*COM21*COM90 1 18.485 .0000 11. COM89*COM99 1 13.916 .0002 12.COM6*COM89 1 9.810 .0017 13 COM3*COM9*COM90 1 11.838 .0006 14 COM8*COM9 1 11.239 .0008
Comorbidities Selected p = 8 Eliminate gender variables
Results when p = 8 K-way Interactions Tests that K-way effects are zero. K DF L.R. Prob 1 8 24821.005 .0000 2 28 6603.913 .0000 3 56 101.278 .0002 4 70 80.517 .1831 5 56 28.047 .9993 6 28 2.248 1.0000 7 8 .000 1.0000 8 1 .000 .9966 Max order = 3
Best Subset of terms p = 8 : Backwards elimination with P=0.05 & maxorder =3. Terms DF L.R. Prob 1. C3*C6*C9 1 10.649 .0011 2. C3*C9*C38 1 10.724 .0011 3. C6*C9*C38 1 10.149 .0014 4. C8*C9*C38 1 10.032 .0015 5. C3*C9*C90 1 10.306 .0013 6.C9*C21*C90 1 21.822 .0000 7. C3*C8*C99 1 6.951 .0084 8. C38*C90*C99 1 12.572 .0004 9. C6*C8 1 51.814 .0000 10 C6*C21 1 17.589 .0000 11.C9*C99 1 567.225 .0000
Best Subset of terms p = 8 Backwards elimination with P=0.01 & maxorder =3. DF L.R. Prob 1. C3*C9*C90 1 11.891 .0006 2. C9*C21*C90 1 14.609 .0001 3. C9*C38*C99 1 9.510 .0020 4. C38*C90*C99 1 13.539 .0002 5. C21*C38 1 7.403 .0065 6. C6*C21 1 16.766 .0000 7. C3*C38 1 299.649 .0000 8. C3*C99 1 156.962 .0000 9. C6*C8 1 42.889 .0000 10. c8*C9 1 11.236 .0008 11. C6*C9 1 22.880 .0000
Always some outliers … we are just looking for the largest effects with the crude selection methods on offer
Results p = 8 and structural zeros Not all 256 patterns are incident – only 140 of the 256 cells in the 8 – dimensional table are non-zero S-zeros ? Non-zero These empty cells may be structural zeros – not merely sampling zeros. Will check whether all zeros are structural – but SPSS is tricky here.
Flat Table Specification Tospecify structural zeros I must tell spss the address of each cell which is to be a structural zero. To do this correctly I should produce a flat contingency table. Suppose now I have 9 comorbities – the 8 in red below and C20. Then I need to output the 9- dimensional flat table comprising a column of frequencies and followed by columns of the design matrix (indices) of the table. Step1: Generate the flat–table Problem: Crosstabs is the procedure which writes out this flat table, but only in integer mode and only for an 8 dimensional table!! It writes a .dat file TEMPORARY. SELECT IF ( c20 EQ 0). PROCEDURE OUTPUT OUTFILE='myflat_table.dat’. CROSSTABS VARIABLES=C2 C3 C7 C10 C14 C17 C18 C19 (0,1) /TABLES=C2 BY C3 BY C7 BY C10 BY C14 BY C17 BY C18 BY C19 /WRITE=ALL /COUNT ROUND CELL . This is only way to force crosstabs to write out the cells which have zero frequencies - freq is the variable written in the 1st column followed by the table indicesin subsequent columns
Legacy Code Problems Step 2: Notice that I must repeat this operation for c20 =1 and then put the two flat tables together (ie by stacking them). Putting the tables together is most easily accomplished by converting the .dat files to .sav files and using the add files command Notice however that I have to add the table subscripts for C20 = 0 for the first table and c20 =1 for the second table in order to complete the 9 -dimensional table subscripts. C20 will be the last column and the table subscripts on the right vary slowest In what follows I assume that this has been done and that the final table was stored in “myflat_table.sav” This whole procedure is a bind for higher dimensional tables but remember Crosstabs and Hiloglin can only handle 10 dimensions as a max!! – these restrictions are examples of LEGACY CODE PROBLEMSin SPSS
Defining the Structural Zero Indicator Step3 Read in the whole flat-table and define the structural zero indicator called cwht GET FILE= ‘myflat_table.sav’ Comment freq C2C3C7C10C14C17C18C19 and c20 are in this file. IF (freq gt 0) cwht=1. IF (freq eq 0) cwht=0. Comment cwht = 0 implies a structural zero. SAVE OUTFILE=‘final_flat_table_.sav‘ /KEEP= Freq c2 c3 c7 c10 c14 c17 c18 c19c20 cwht /COMPRESSED. Comment the save step immediately above may be omitted and the code combined with the analysis on the next slide.
Analysing the Flat Table Step 4: Fit Final model. Method backward, maxorder = 3 and p =0.01 Get file = ‘final_flat_table.sav’ WEIGHT BY freq. HILOGLINEAR C2 C3 C7 C10 C14 C17 C18 C19c20 (0,1) /CWEIGHT = cwht /METHOD=backward /MAXORDER=3 /CRITERIA =iterate(30) maxsteps(10000) p (0.01) /PRINT=none /PLOT = default. Reconstruct the 9-D contingency table Includingc20 using weight command Specify the structural zeros to hiloglin usingcwht variable
Results with p = 8 and structural zeros K-way Interactions Tests that K-way effects are zero. K DF L.R. Prob 1 8 24821.005 .0000 2 28 6603.913 .0000 3 56 101.278 .0002 4 70 80.517 .1831 5 56 28.047 .9993 6 28 2.248 1.0000 7 8 .000 1.0000 8 1 .000 .9966 Max order = 3
This model is much worse suggesting that not all the observed zeros can be structural zeros
Model Wrapping - up • Generally • Examine model sets with and without patterns of structural zeros • 2. Refit models by Newton-Raphson to obtain MLE estimates, standard errors and variance-covariance matrix • Explore Fit issues • Check Biologic & Clinical plausibility • Decide on Final Model – Parsimony first
Summary of Technique • Advantages • Multivariate, p-dimensional method • Natural statistical parametrization for binary data • Dependence measured in terms of interactions • Fast IPF-based structured search procedure for MLEs Interaction test  Backward elimination  Final N-R fit • Can incorporate binary response variable Y in analysis • Can handle structural zeros when encountered.
Scope for Development • Restrictions • Basic model-building Philosophy is out-dated in SPSS • Extension of the maximum dimension - currently p=10 max • More general search procedure basically involves a re-programming of the structured search procedure • Handling Structural Zeros – new procedure required • Also Hiloglin does not tie-up with Genlog easily – Genlog does not accept Generating Sets as input • User-selected reference categories required
In Summary • Log-linear Modelling in SPSS is not “Joined Up” • Main restriction to usefulness is p=10 • Structural Zero handling is antiquated • Modern search strategy tools needed • IPF is useful, but some models produced are not estimable by ML. • Output needs modernisation • Nevertheless the Basic procedures are useful
Key References Birch (1963). Maximum Likelihood in three-way contingency tables. JRSS, Series B, 25, 220-233. Bishop (1969). Full contingency tables, logits and split contingency tables. Biometrics, 25, 383-400. Fienberg (1972). The analysis of incomplete multi-way contingency tables. Biometrics. 24, 177-202. Goodman (1971). The analysis of multi-dimensional contingency tables. Technometrics, 13, 1, 33-61. Haberman (1972). Log-linear fit for contingency tables. JRSS, Applied Stats., 21; 2, 218-225 MacKenzie & O’Flaherty (1982). Direct design matrix generation for balanced factorial experiments – algorithm AS173. JRSS, Applied Stats, 31,75-80. O’Flaherty & MacKenzie (1982) Direct simulation of nested Fortran Do-loops – Algorithm AS172, JRSS, 31,71-74.