Lecture 4, part 1: Linear Regression Analysis: Two Advanced Topics. Karen Bandeen-Roche, PhD, Department of Biostatistics, Johns Hopkins University. July 14, 2011. Introduction to Statistical Measurement and Modeling.
Data examples • Boxing and neurological injury • Scientific question: Does amateur boxing lead to decline in neurological performance? • Some related statistical questions: • Is there a dose-response increase in the rate of cognitive decline with increased boxing exposure? • Is boxing-associated decline independent of initial cognition and age? • Is there a threshold of boxing that initiates harm?
Outline • Topic #1: Confounding • Handling this is crucial if we are to draw correct conclusions about risk factors • Topic #2: Signal / noise decomposition • Signal: Regression model predictions • Noise: Residual variation • Another way of approaching inference, precision of prediction
Topic #1: Confounding • To confound means to “confuse” • Arises when the groups being compared also differ in other ways that affect the outcome • Also known as lurking variables, …
Confounding Example: Drowning and Eating Ice Cream [Figure: scatterplot of drowning rate (vertical axis) against ice cream eaten (horizontal axis)]
Confounding • Epidemiology definition: A characteristic “C” is a confounder if it is associated (related) with both the outcome (Y: drowning) and the risk factor (X: ice cream) and is not causally in between [Diagram: ice cream consumption, drowning rate, and an unidentified third characteristic “??”] JHU Intro to Clinical Research
Confounding • Statistical definition: A characteristic “C” is a confounder if the strength of relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs with, versus without, adjustment for C [Diagram: ice cream eaten, drowning rate, outdoor temperature]
Confounding Example: Drowning and Eating Ice Cream [Figure: scatterplot of drowning rate (vertical axis) against ice cream eaten (horizontal axis), with points labeled by warm and cool temperature]
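As a minimal sketch of the statistical definition, the simulation below compares the ice cream slope with and without adjustment for temperature. All numbers (sample size, coefficients, noise levels) and the helper `ols_coefficients` are illustrative assumptions, not data from the lecture; by construction, ice cream has no direct effect on drowning.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 500

# Hypothetical data: temperature drives both ice cream eaten and drowning rate.
temperature = rng.uniform(10, 35, size=n)
ice_cream = 2.0 * temperature + rng.normal(0, 5, size=n)
drowning = 0.5 * temperature + rng.normal(0, 2, size=n)


def ols_coefficients(y, *predictors):
    """Least squares coefficients for y on an intercept plus the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Slope for ice cream without, then with, adjustment for temperature (C).
crude_slope = ols_coefficients(drowning, ice_cream)[1]
adjusted_slope = ols_coefficients(drowning, ice_cream, temperature)[1]

print(f"crude slope for ice cream:    {crude_slope:.3f}")
print(f"adjusted slope for ice cream: {adjusted_slope:.3f}")
```

In this setup the crude slope is clearly positive while the adjusted slope is close to zero, which is the "differs with, versus without, adjustment" pattern in the definition.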
Effect modification • A characteristic “E” is an effect modifier if the strength of relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs within levels of E [Diagram: ice cream consumption, drowning rate, outdoor temperature]
Effect Modification: Drowning and Eating Ice Cream [Figure: scatterplot of drowning rate (vertical axis) against ice cream eaten (horizontal axis), with points labeled by warm and cool temperature]
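One common way to examine effect modification in a regression is an interaction term, sketched below. The data are simulated under assumed values (a slope of 0.1 on cool days and 0.5 on warm days); the variable names and the choice of a product term are assumptions for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical data: the ice cream slope depends on the level of E (warm vs cool).
warm = rng.integers(0, 2, size=n)        # effect modifier E: 1 = warm, 0 = cool
ice_cream = rng.uniform(0, 10, size=n)   # risk factor X
drowning = (1.0 + 0.1 * ice_cream + 0.4 * warm * ice_cream
            + rng.normal(0, 0.5, size=n))

# Model with an interaction: Y = b0 + b1*X + b2*E + b3*(X*E) + error
X = np.column_stack([np.ones(n), ice_cream, warm, warm * ice_cream])
beta, *_ = np.linalg.lstsq(X, drowning, rcond=None)

slope_cool = beta[1]            # slope of X when E = 0
slope_warm = beta[1] + beta[3]  # slope of X when E = 1

print(f"slope within cool days: {slope_cool:.3f}")
print(f"slope within warm days: {slope_warm:.3f}")
```

A clearly nonzero interaction coefficient `beta[3]` corresponds to slopes that differ within levels of E.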
Topic #2: Signal/Noise Decomposition • Lovely due to geometry of least squares • Facilitates testing involving multiple parameters at once • Provides insight into R-squared
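Signal/Noise Decomposition • First step: decomposition of variance • “Regression” part: Variance of the fitted values $\hat{Y}_i$ • “Error” or “Residual” part: Variance of the residuals $e_i = Y_i - \hat{Y}_i$ • Together: These determine “total” variance of the $Y_i$ • “Sums of Squares” (SS) rather than variance per se • Regression SS (SSR): $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ • Error SS (SSE): $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$ • Total SS (SST): $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$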
Signal/Noise Decomposition • Properties • SST = SSR + SSE • SSR/SST = “proportion of variance explained” by regression = R-squared • Follows from geometry • SSR and SSE are independent (assuming A1-A5) and have easily characterized probability distributions • Provides convenient testing methods • Follows from geometry plus assumptions
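The identity SST = SSR + SSE and the R-squared ratio can be checked numerically. The sketch below uses simulated data with assumed coefficients (nothing here comes from the lecture's data sets) and an ordinary least squares fit with an intercept.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2

# Hypothetical predictors and outcome.
predictors = rng.normal(size=(n, p))
y = 1.0 + predictors @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Least squares fit with an intercept.
X = np.column_stack([np.ones(n), predictors])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((fitted - y.mean()) ** 2)  # regression sum of squares
sse = np.sum(residuals ** 2)            # error sum of squares
r_squared = ssr / sst

print(f"SST = {sst:.3f}, SSR + SSE = {ssr + sse:.3f}, R-squared = {r_squared:.3f}")
```

The decomposition holds because the residual vector is orthogonal to the fitted vector, which is the geometric fact the slide refers to.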
Signal/Noise Decomposition • SSR and SSE are independent • Define $M = \mathrm{span}(X)$ and take “Y” as centered at $\bar{Y}$ • It is possible to orthogonally rotate the coordinate axes so that the first $p$ axes $\in M$; the remaining $n-p-1$ axes $\in M^{\perp}$ • Gram-Schmidt orthogonalization • Doing this transforms $Y$ into $TY := Z$, for some orthonormal matrix $T$ formed from the new axes $\{e_1,\dots,e_{n-1}\}$ • Distribution of $Z$: $Z \sim N(T\,E[Y|X],\ \sigma^2 I)$
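Signal/Noise Decomposition • SSR and SSE are independent - continued • $TY = Z \Rightarrow Y = T'Z$ • SSE = squared length of $e = Y - \hat{Y}$ = $\sum_{j=p+1}^{n-1} Z_j^2$ • SSR = squared length of the centered fitted vector $\hat{Y} - \bar{Y}$ = $\sum_{j=1}^{p} Z_j^2$ • Claim now follows: SSR & SSE are independent because $(Z_1,\dots,Z_p)$ and $(Z_{p+1},\dots,Z_{n-1})$ are independent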
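The rotation argument can be illustrated numerically. The sketch below substitutes NumPy's QR factorization for an explicit Gram-Schmidt pass (both give an orthonormal basis adapted to the model space); the data and coefficients are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
predictors = rng.normal(size=(n, p))
y = 2.0 + predictors @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

# Full QR of [1, X]: column 0 is proportional to the vector of ones,
# columns 1..p complete an orthonormal basis of the model space, and the
# last n-p-1 columns span its orthogonal complement.
X = np.column_stack([np.ones(n), predictors])
Q, _ = np.linalg.qr(X, mode="complete")

T = Q[:, 1:].T  # (n-1) x n; here the rows hold the rotated axes e_1..e_{n-1}
z = T @ y       # equals T applied to centered y, since each row is orthogonal to 1

ssr_from_z = np.sum(z[:p] ** 2)  # first p rotated coordinates
sse_from_z = np.sum(z[p:] ** 2)  # remaining n-p-1 rotated coordinates

# Direct computation from the least squares fit, for comparison.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
ssr = np.sum((fitted - y.mean()) ** 2)
sse = np.sum((y - fitted) ** 2)

print(f"SSR: {ssr:.4f} vs {ssr_from_z:.4f}")
print(f"SSE: {sse:.4f} vs {sse_from_z:.4f}")
```

Because SSR depends only on the first p coordinates of Z and SSE only on the rest, independence of the two coordinate blocks carries over to SSR and SSE.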
Signal/Noise Decomposition • Under A1-A5 SSE, SSR and their scaled ratio have convenient distributions • Under A1-A2: $E[Y|X] \in M$, so $E[Z_j|X] = 0$ for all $j > p$ • Recall $\{Z_1,\dots,Z_{n-1}\}$ are mutually independent normal with variance $\sigma^2$ • Thus SSE = $\sum_{j=p+1}^{n-1} Z_j^2 = \sigma^2 \sum_{j=p+1}^{n-1} (Z_j/\sigma)^2 \sim \sigma^2\chi^2_{n-p-1}$ under A1-A5 (a sum of $k$ independent squared $N(0,1)$ variables is $\chi^2_k$)
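Signal/Noise Decomposition • Under A1-A5 SSE, SSR and their scaled ratio have convenient distributions • For $j \le p$, $E[Z_j|X] \ne 0$ in general • Exception: $H_0: \beta_1 = \dots = \beta_p = 0$ • Then SSR = $\sum_{j=1}^{p} Z_j^2 \sim \sigma^2\chi^2_p$ under A1-A5, and $\frac{SSR/p}{SSE/(n-p-1)} \sim F_{p,\,n-p-1}$, the distribution of $\frac{\chi^2_p/p}{\chi^2_{n-p-1}/(n-p-1)}$ with numerator and denominator independent.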
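These distributional claims can be checked by simulation. The sketch below generates many data sets under the null hypothesis with normal errors (the values of n, p, sigma and the number of replicates are arbitrary assumptions) and compares sample moments with those of the stated chi-square and F distributions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 30, 3, 2.0
n_sims = 5000

predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])
H = X @ np.linalg.pinv(X)  # hat matrix: projects Y onto the column space of X

# Under H0 all slopes are 0, so Y = intercept + normal noise.
Y = 5.0 + sigma * rng.normal(size=(n_sims, n))  # each row is one data set
fitted = Y @ H.T
ybar = Y.mean(axis=1, keepdims=True)

ssr = np.sum((fitted - ybar) ** 2, axis=1)
sse = np.sum((Y - fitted) ** 2, axis=1)
f_stat = (ssr / p) / (sse / (n - p - 1))

# A chi-square with k degrees of freedom has mean k;
# an F with (d1, d2) degrees of freedom has mean d2 / (d2 - 2).
print(f"mean SSR / sigma^2: {ssr.mean() / sigma**2:.2f} (theory {p})")
print(f"mean SSE / sigma^2: {sse.mean() / sigma**2:.2f} (theory {n - p - 1})")
print(f"corr(SSR, SSE):     {np.corrcoef(ssr, sse)[0, 1]:.3f} (theory 0)")
print(f"mean F:             {f_stat.mean():.3f} (theory {(n - p - 1) / (n - p - 3):.3f})")
```

Near-zero correlation is only a necessary consequence of independence, so this is a sanity check rather than a proof.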
Signal/Noise Decomposition • An organizational tool: The analysis of variance (ANOVA) table • F = MSR/MSE, where the mean squares are MSR = SSR/p and MSE = SSE/(n-p-1)
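A minimal sketch of how the entries of an ANOVA table might be assembled from a fit. The function name `anova_table`, its dictionary layout, and the simulated data are assumptions for illustration; the degrees of freedom follow the distributions given earlier (p for regression, n-p-1 for error).

```python
import numpy as np


def anova_table(y, predictors):
    """Sums of squares, degrees of freedom, mean squares and overall F."""
    n, p = predictors.shape
    X = np.column_stack([np.ones(n), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ssr = float(np.sum((fitted - y.mean()) ** 2))
    sse = float(np.sum((y - fitted) ** 2))
    msr, mse = ssr / p, sse / (n - p - 1)
    return {
        "Regression": {"SS": ssr, "df": p, "MS": msr},
        "Error": {"SS": sse, "df": n - p - 1, "MS": mse},
        "Total": {"SS": ssr + sse, "df": n - 1},
        "F": msr / mse,
    }


# Hypothetical data with two predictors.
rng = np.random.default_rng(5)
n = 40
predictors = rng.normal(size=(n, 2))
y = 1.0 + predictors @ np.array([1.5, -0.5]) + rng.normal(size=n)

table = anova_table(y, predictors)
for source in ("Regression", "Error", "Total"):
    print(source, table[source])
print("F =", round(table["F"], 2))
```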
“Global” hypothesis tests • These involve sets of parameters • Hypotheses of the form $H_0: \beta_j = 0$ for all $j$ in a defined subset of $\{1,\dots,p\}$ vs. $H_1: \beta_j \ne 0$ for at least one such $j$ • Example 1: $H_0: \beta_{\text{LATITUDE}} = 0$ and $\beta_{\text{LONGITUDE}} = 0$ • Example 2: $H_0$: all polynomial or spline coefficients involving a given variable = 0 • Example 3: $H_0$: all coefficients involving a variable = 0
“Global” hypothesis tests • Testing method: Sequential decomposition of sums of squares • Hypothesis to be tested is $H_0: \beta_{j_1} = \dots = \beta_{j_k} = 0$ in the full model • Fit the model excluding $x_{j_1},\dots,x_{j_k}$: save SSE = $SSE_S$ • Fit the “full” (or larger) model, adding $x_{j_1},\dots,x_{j_k}$ to the smaller model: save SSE = $SSE_L$, often the overall SSE • Test statistic $S = \frac{(SSE_S - SSE_L)/k}{SSE_L/(n-p-1)}$ • Distribution under the null: $F_{k,\,n-p-1}$ • Define the rejection region based on this distribution • Compute S • Reject or not according to whether S falls in the rejection region
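The nested-model comparison can be sketched as follows. The helper names `sse_of_fit` and `global_f_statistic`, the "age" adjustment variable, and the simulated latitude/longitude data are hypothetical; here k is the number of coefficients tested and p the number of predictors in the larger model.

```python
import numpy as np


def sse_of_fit(y, predictors):
    """Error sum of squares from least squares with an intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))


def global_f_statistic(y, small, extra):
    """S = [(SSE_S - SSE_L) / k] / [SSE_L / (n - p - 1)]."""
    n = len(y)
    large = np.column_stack([small, extra])
    p, k = large.shape[1], extra.shape[1]
    sse_s = sse_of_fit(y, small)
    sse_l = sse_of_fit(y, large)
    return ((sse_s - sse_l) / k) / (sse_l / (n - p - 1))


# Hypothetical data: test H0 that the latitude and longitude coefficients are 0.
rng = np.random.default_rng(6)
n = 100
age = rng.normal(size=(n, 1))
location = rng.normal(size=(n, 2))  # standardized latitude, longitude
noise = rng.normal(size=n)
y_effect = 1.0 + 0.5 * age[:, 0] + location @ np.array([1.0, 1.0]) + noise
y_null = 1.0 + 0.5 * age[:, 0] + noise

s_effect = global_f_statistic(y_effect, age, location)
s_null = global_f_statistic(y_null, age, location)
print(f"S when location matters: {s_effect:.2f}")
print(f"S when H0 is true:       {s_null:.2f}")
```

To finish the test, S would be compared with an upper quantile of the F distribution with (k, n-p-1) degrees of freedom, taken from tables or statistical software.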
Signal/Noise Decomposition • An augmented version for global testing F = MSR(2|1)/MSE
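R-squared – Another view • From last lecture: ECDF Corr(Y, $\hat{Y}$) squared • More conventional: $R^2$ = SSR/SST • Geometry justifies why they are the same • $\mathrm{Cov}(Y,\hat{Y}) = \mathrm{Cov}(Y - \hat{Y} + \hat{Y},\ \hat{Y}) = \mathrm{Cov}(e,\hat{Y}) + \mathrm{Var}(\hat{Y})$ • Covariance = inner product, so the first term = 0 (residuals are orthogonal to fitted values) • Hence $\mathrm{Corr}(Y,\hat{Y})^2 = \mathrm{Var}(\hat{Y})/\mathrm{Var}(Y)$ = SSR/SST • A measure of the precision with which the regression model describes individual responses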
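The equivalence of the two views of R-squared can be confirmed numerically. The sketch below uses simulated data with assumed coefficients and compares the squared sample correlation between Y and the fitted values with SSR/SST.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
predictors = rng.normal(size=(n, 2))
y = 0.5 + predictors @ np.array([1.0, 2.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), predictors])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# View 1: squared correlation between observed and fitted values.
corr_squared = np.corrcoef(y, fitted)[0, 1] ** 2

# View 2: proportion of the total sum of squares due to regression.
r_squared = np.sum((fitted - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# The step that links them: residuals are uncorrelated with fitted values.
cov_resid_fitted = np.cov(residuals, fitted)[0, 1]

print(f"Corr(Y, fitted)^2 = {corr_squared:.6f}")
print(f"SSR / SST         = {r_squared:.6f}")
print(f"Cov(e, fitted)    = {cov_resid_fitted:.2e}")
```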
Outline: A few more topics • Collinearity • Overfitting • Influence • Mediation • Multiple comparisons
Main points • Confounding occurs when an apparent association between a predictor and outcome reflects the association of each with a third variable • A primary goal of regression is to “adjust” for confounding • Least squares decomposition of Y into fit and residual provides an appealing statistical testing framework • An association of an outcome with predictors is evidenced if SS due to regression is large relative to SSE • Geometry: orthogonal decomposition provides convenient sampling distribution, view of R2 • ANOVA