490 likes | 663 Vues
Revealing the riddle of REML. Mick O’Neill Faculty of Agriculture, Food & Natural Resources, University of Sydney. Background. Biometry 1 and 2 are core units with an applied stats focus. Many students have only Maths in Society on entry to the Faculty Biometry 3 is a Third Year elective
E N D
Revealing the riddle of REML Mick O’Neill Faculty of Agriculture, Food & Natural Resources, University of Sydney
Background • Biometry 1 and 2 are core units with an applied stats focus. Many students have only Maths in Society on entry to the Faculty • Biometry 3 is a Third Year elective • Biometry 4 is (still) a possible major • All students are now expected to design and analyse their fourth year experiments with little or no help from the Biometry Unit
Third year Biometry students can: • Design and analyse multi-strata factorial experiments (split-plots, strip-plots) • Perform binomial & ordinal logistic regression, Poisson regression, … • Analyse repeated measures data using REML
Step 1. What is Maximum Likelihood? • The likelihood is the prior probability of obtaining the actual data in your sample • This requires you to assume that the data follow some distribution, typically: • Binomial or Poisson for count data • Normal or LogNormal for continuous data
Step 1. What is Maximum Likelihood? The likelihood is the prior probability of obtaining the actual data in your sample Each of these distributions involves at least one unknown parameter (probability, mean, standard deviation, …) which must be estimated from the data.
Step 1. What is Maximum Likelihood? The likelihood is the prior probability of obtaining the actual data in your sample Estimation is achieved by finding that parameter valuewhich maximises the likelihood (or equivalently the log-likelihood)
Example 1. Binomial data • Guess p = P(seed germinates) • Evaluate LogL • Maximise LogL by varying p
Example 2. Normal data • Guess m and s • Evaluate LogL • Maximise LogL by varying m and s
Step 2. What is REML? It is possible to partition the likelihood into two terms: • a likelihood that involves m (as well as s2) • and • a residual likelihood that involves only s2
Step 2. What is REML? It is possible to partition the likelihood into two terms, in such a way that: • the first likelihood can be maximised to estimate m (and its solution does not depend on the value of s2) • the residual likelihood can be maximised to estimates2 REML estimate
= How?
Solutions • ML estimate of variance is • REML estimate is • In each case the estimate of m isthe sample mean
sn-1 sn ML estimate REML estimate
= Extensions
ANOVA vs REML ANOVA: Source of variation d.f. s.s. m.s. v.r. F pr. Chick stratum Diet 3 0.01847 0.00616 0.10 0.958 Residual 12 0.73230 0.06103 Total 15 0.75078 REML Variance Components Analysis: Fixed model : Constant+Diet Random model : Chick Chick used as residual term *** Residual variance model *** Term Factor Model(order) Parameter Estimate S.e. Chick Identity Sigma20.0610 0.02491 *** Wald tests for fixed effects *** Fixed term Wald statistic d.f. Wald/d.f. Chi-sq prob Diet 0.30 3 0.100.960
ANOVA: • ***** Tables of means ***** Grand mean 3.129 Diet Diet_1 Diet_2 Diet_3 Diet_4 3.107 3.188 3.112 3.107 *** Standard errors of differences of means *** Table Diet rep. 4 d.f. 12 s.e.d. 0.1747 REML Variance Components Analysis: *** Table of predicted means for Diet *** Diet Diet_1 Diet_2 Diet_3 Diet_4 3.107 3.187 3.112 3.107 Standard error of differences: 0.1747
Example 4a. One-way (in randomised blocks) – fixed treatments ANOVA: Source of variation d.f. s.s. m.s. v.r. F pr. Block stratum 5 5.410 1.082 0.29 Block.*Units* stratum Spacing 4 125.661 31.415 8.50 <.001 Residual 20 73.919 3.696 REML Variance Components Analysis (a) With Block + Spacing both fixed effects: Term Factor Model(order) Parameter Estimate S.e. Residual Identity Sigma2 3.696 1.169 Fixed term Wald statistic d.f. Wald/d.f. Chi-sq prob Block 1.46 5 0.29 0.917 Spacing 34.00 4 8.50 <0.001
Random blocks, fixed treatments ANOVA: Source of variation d.f. s.s. m.s. v.r. F pr. Block stratum 5 5.410 1.082 0.29 Block.*Units* stratum Spacing 4 125.661 31.415 8.50 <.001 Residual 20 73.919 3.696 REML Variance Components Analysis (b) With Spacing fixed and Block random: *** Estimated Variance Components *** Random term Component S.e. Block 0.000 BOUND *** Residual variance model *** Term Factor Model(order) Parameter Estimate S.e. Residual Identity Sigma2 3.173 0.897 *** Wald tests for fixed effects *** Fixed term Wald statistic d.f. Wald/d.f. Chi-sq prob Spacing 39.60 4 9.90 <0.001 Estimated Source Value Block -0.5228 Error 3.6959
Model for RCBD Yield of soybean = Overall mean + Block effect + Spacing effect + Error • Overall mean and Spacing effects are fixedeffects • Block effect is a random term • Error is a random term
General Linear Mixed Model Yield of soybean = Overall mean + Block effect + Spacing effect + Error Y = Fixed effects + Random effects + Error term Y = Xt + Zu + e • The random effects can be correlated • The error term can be correlated • The random effects are uncorrelated with the error term
General Linear Mixed Model Y = Fixed effects + Random effects + Error term Y = Xt + Zu + e is a scaling factor, often set to 1
REML is used as the default to estimate to variance and covariance parameters of the model • The algorithm does not depend on the data being balanced
Furthermore, different choices for the variance matrices allow for : • an appropriate repeated measures analysis for normal data • an appropriate spatial analysis for field trials Nested models can be compared using the change in deviance which is approximately c2 with df = change in number of parameters
For students: For markers:
Example 6. Use of devianceWidths (in mm) of the dorsal shield of larvae of ticks found on 4 rabbits
Minitab’s analysis Source DF Adj MS F P Rabbit 3 602.6 5.26 0.004 Error 33 114.5 Estimated Source Term Source Value 1 Rabbit (2) + 9.0090 (1) Rabbit 54.18 2 Error (2) Error 114.48 Rabbit Mean 1 372.3 2 354.4 3 355.3 4 361.3 these are sample means
GenStat’s Linear Mixed Models analysis Random term Component S.e. Rabbit 55.0 55.8 *** Residual variance model *** Term Model(order) Parameter Estimate S.e. Residual Identity Sigma2 114.4 28.2 Table of predicted means for Rabbit (these are BLUPs) Rabbit 1 2 3 4 369.9 355.5 356.0 361.2 Standard error of differences: Average 4.613 Maximum 5.055 Minimum 4.133 Average variance of differences: 21.38 Deviance d.f. 215.22 34
Test H0: Method: drop Rabbit as a random term Deviance d.f. 221.21 35 for reduced model 215.22 34 Change in deviance = 6.0 with 1 df P-value = 0.014 The variation in the widths of the dorsal shield of larvae of ticks found among rabbits differs significantly across rabbits (P = 0.014) The variance among rabbits is estimated to be 55.0 ( 55.7) compared to the variance within rabbits, namely 114.4 ( 28.2)
Example 7 - Repeated Measures Growth of a fungus (in cm) over time
Split plot without Greenhouse-Geisser adjustmment (assumes equi-correlation structure among times) Source of variation d.f. m.s. v.r. F pr. Rep.Fungus stratum Fungus 2 8.104 97.30 <.001 Residual 9 0.083 3.37 Rep.Fungus.Time stratum Time 5 55.231 2233.21 <.001 Fungus.Time 10 0.933 37.71 <.001 Residual 45 0.025 Estimated stratum variances Stratum variance d.f. variance component Rep.Fungus 0.0833 9 0.0098 Rep.Fungus.Time 0.0247 45 0.0247
Split plot with Greenhouse-Geisser adjustmment (tests equi-correlation structure among times) Source of variation d.f. m.s. v.r. F pr. Rep.Fungus stratum Fungus 2 8.104 97.30 <.001 Residual 9 0.083 3.37 Rep.Fungus.Time stratum Time 5 55.231 2233.21 <.001 Fungus.Time 10 0.933 37.71 <.001 Residual 45 0.025 (d.f. are multiplied by the correction factors before calculating F probabilities) Box's tests for symmetry of the covariance matrix: Chi-square 57.47 on 19 df: probability 0.000 F-test 2.93 on 19, 859 df: probability 0.000 Greenhouse-Geisser epsilon 0.3206
Split plot via REML – ignoring changing variances Fixed model : Constant+Fungus+Time+Fungus.Time Random model : Rep.Fungus+Rep.Fungus.Time Estimated Variance Components Random term Component S.e. Rep.Fungus 0.00976 0.00660 Residual variance model Term Model(order) Parameter Estimate S.e. Rep.Fungus.Time Identity Sigma2 0.0247 0.0052 Deviance d.f. -109.90 52 Fixed term Wald statistic d.f. Wald/d.f. Chi-sq prob Fungus 194.60 2 97.30 <0.001 Time 11166.05 5 2233.21 <0.001 Fungus.Time 377.08 10 37.71 <0.001
Split plot via REML – accounting for changing variances (a) Fixed model : Constant+Fungus+Time+Fungus.Time Random model : Rep.Fungus+Rep.Fungus.Time Estimated Variance Components Random term Component S.e. Rep.Fungus 0.01053 0.00539 Residual variance model Term Model(order) Parameter Estimate S.e. Rep.Fungus.Time Identity Sigma2 0.0082 .00428 Rep Identity - - - Fungus Identity - - - Time Diagonal d_1 1.000 FIXED d_2 1.102 0.815 d_3 0.227 0.215 d_4 0.262 0.253 d_5 1.965 1.443 d_6 13.550 9.580
Split plot via REML – accounting for changing variances (b) Fixed model : Constant+Fungus+Time+Fungus.Time Random model : Rep.Fungus+Rep.Fungus.Time Estimated Variance Components Random term Component S.e. Rep.Fungus 0.01053 0.00539 Residual variance model Term Model(order) Parameter Estimate S.e. Rep.Fungus.Time Identity Sigma2 1.000 FIXED Rep Identity - - - Fungus Identity - - - Time Diagonal d_1 0.0082 0.0043 d_2 0.0091 0.0047 d_3 0.0019 0.0015 d_4 0.0022 0.0017 d_5 0.0162 0.0081 d_6 0.1116 0.0530
Split plot via REML – accounting for changing variances and an AR(1) correlation structure Fixed model : Constant+Fungus+Time+Fungus.Time Random model : Rep.Fungus+Rep.Fungus.Time Estimated Variance Components Random term Component S.e. Rep.Fungus 0.010616 0.005572 Residual variance model Term Model(order) Parameter Estimate S.e. Rep.Fungus.Time Identity Sigma2 0.0085 0.0045 Rep Identity - - - Fungus Identity - - - Time AR(1) hetphi_10.148 0.209 d_1 1.000 FIXED d_2 1.202 0.895 d_3 0.260 0.249 d_4 0.264 0.266 d_5 1.829 1.347 d_6 13.560 9.620
Split plot via REML – accounting for changing variances and an AR(1) correlation structure Deviance d.f. Change d.f. Same variance, uncorrelated -109.90 52 Different variances over time -151.03 47 41.13 5 + AR(1) correlation structure -151.59 46 0.56 1
Example 8 – Spatial analysis RCBD (fixed) fertilisers Potato yields (t/ha)
Source of variation d.f. m.s. v.r. F pr. Block stratum 3 2.929 1.43 Block.Treatment stratum Treatment 5 92.359 45.07 <.001 Residual 15 2.049 Treatment A B C D E F 17.55 25.60 27.67 25.94 30.51 30.63 *** Standard errors of differences of means *** Table Treatment rep. 4 d.f. 15 s.e.d. 1.012
REML Random term Component S.e. Block 0.395 0.500 Residual variance model Term Factor Model Parameter Estimate S.e. Y.X Sigma2 2.849 1.739 Y AR(1) phi_10.7054 0.2078 X AR(1) phi_1 -0.2508 0.3397 Deviance d.f. 36.54 14 Treatment A B C D E F 17.74 26.29 26.79 26.34 30.41 29.37 Standard error of differences: Average 0.7749 Maximum 0.8942 Minimum 0.6465 Average variance of differences: 0.6050