Regression analysis • Relating two data matrices/tables (X-data and Y-data) to each other • Purpose: prediction and interpretation
Typical examples • Spectroscopy: Predict chemistry from spectral measurements • Product development: Relating sensory to chemistry data • Marketing: Relating sensory data to consumer preferences
Topics covered • Simple linear regression • The selectivity problem: a reason why multivariate methods are needed • The collinearity problem: a reason why data compression is needed • The outlier problem: why and how to detect
Simple linear regression • One y and one x. Use x to predict y. • Use a linear model/equation and fit it by least squares
Data structure • One X-variable column (2, 4, 1, ...) and one Y-variable column (7, 6, 8, ...) • Rows are objects; the x- and y-columns contain the same number of objects
Simple linear regression • Model: $y = b_0 + b_1 x + e$, with intercept $b_0$, slope $b_1$ and residual $e$ • Least squares (LS) is used to estimate the regression coefficients
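The least-squares fit of the line can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's code: the function name `fit_simple_linear` and the four data points are made up for the example.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + e."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Made-up illustration data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

b0, b1 = fit_simple_linear(x, y)
residuals = y - (b0 + b1 * x)   # the e in the model
```

With an intercept in the model, the least-squares residuals sum to zero, which is a quick sanity check on any implementation.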
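Regression analysis workflow (diagram) • Data (X, Y), after pre-processing and a check for outliers, go into the regression analysis • The regression analysis gives a model • The model is used for interpretation and, with future X, for prediction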
The selectivity problem A reason why multivariate methods are needed
Multiple linear regression • Provides: predicted values, regression coefficients, diagnostics • If there are many highly collinear variables: unstable regression equations; coefficients that are difficult to interpret (many and unstable)
Collinearity, the problem of correlated X-variables • Model: $y = b_0 + b_1 x_1 + b_2 x_2 + e$ • Regression in this case means fitting a plane to the data (open circles in the figure) • The two x's are highly correlated • This leads to an unstable equation/plane (in the direction with little variability)
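The instability can be seen numerically. The sketch below uses simulated data (all values are assumptions chosen for illustration): `x2` is almost a copy of `x1`, and the same plane is fitted twice, the second time after adding a little extra noise to y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Simulated, highly correlated x-variables (illustration only)
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # nearly identical to x1
y = 1.0 + x1 + x2 + 0.1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])  # column of ones for b0

corr = np.corrcoef(x1, x2)[0, 1]   # close to 1
cond = np.linalg.cond(X)           # large value signals near-collinearity

# Fit the plane y = b0 + b1*x1 + b2*x2 by least squares
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Refit after a small perturbation of y
y_perturbed = y + 0.1 * rng.normal(size=n)
b_perturbed, *_ = np.linalg.lstsq(X, y_perturbed, rcond=None)

print("correlation:", corr, "condition number:", cond)
print("b1, b2 first fit: ", b[1], b[2], "sum:", b[1] + b[2])
print("b1, b2 second fit:", b_perturbed[1], b_perturbed[2],
      "sum:", b_perturbed[1] + b_perturbed[2])
```

The sum `b1 + b2` is well determined because it lies along the direction where the data vary. The individual coefficients depend on the direction with little variability and can shift noticeably between the two fits.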
Possible solutions • Select the most important wavelengths/variables (stepwise methods) • Compress the variables to the most dominating dimensions (PCR, PLS) • We will concentrate on the latter (can be combined)
Data compression • We will first discuss the situation with one y-variable • Focus on ideas and principles • Provides regression equation (as above) and plots for interpretation
Model for data compression methods • Centred X and y • $X = TP^{T} + E$ • $y = Tq + f$ • T: scores, the carrier of information from X to y • P, q: loadings • E, f: residuals (noise)
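One way to obtain the quantities in this model is principal component analysis followed by regression on the scores. The sketch below assumes that route and uses the singular value decomposition to get the loadings; the function name `compress_and_regress` and the simulated data are illustrative only.

```python
import numpy as np

def compress_and_regress(X, y, n_components):
    """Return T, P, q, E, f such that Xc = T P^T + E and yc = T q + f."""
    Xc = X - X.mean(axis=0)            # centred X
    yc = y - y.mean()                  # centred y
    # Rows of Vt are the principal directions of the centred X
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T            # X-loadings, shape (variables, A)
    T = Xc @ P                         # scores, shape (objects, A)
    # Score columns are orthogonal, so each q_a is a separate 1-D regression
    q = (T.T @ yc) / np.sum(T ** 2, axis=0)
    E = Xc - T @ P.T                   # X-residuals
    f = yc - T @ q                     # y-residuals
    return T, P, q, E, f

# Simulated data for illustration (not from the slides)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = X @ np.array([1.0, 0.5, -0.5, 0.0]) + 0.1 * rng.normal(size=20)

T, P, q, E, f = compress_and_regress(X, y, n_components=2)
```

Using all components (here 4) makes E vanish, so the compression is then only a rotation of X.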
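Regression by data compression (figure) • PCA to compress data: objects in the (x1, x2, x3) space are projected onto PC1, giving scores $t_i$ • Regression on scores: y is plotted against the t-score, with slope q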
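MLR, PCR and PLS (diagram) • MLR: x1, x2, x3, x4 are connected directly to y • PCR: x1, x2, x3, x4 are compressed to t1, t2, which are connected to y • PLS: x1, x2, x3, x4 are compressed to t1, t2, which are connected to y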
PCR and PLS • For each factor/component: • PCR: maximize the variance of linear combinations of X • PLS: maximize the covariance between linear combinations of X and y • Each factor is subtracted before the next is computed
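For a single y, the PLS criterion and the subtraction of each factor can be sketched with the commonly used PLS1 (NIPALS-style) iteration. This is an illustrative sketch under that assumption, not necessarily the variant used in the presentation; the function name `pls1_fit` and the data are made up.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS regression for one y; returns intercept b0 and coefficients b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean                     # centred X, deflated in the loop
    f = y - y_mean                     # centred y, deflated in the loop
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)      # unit weight maximising cov(E w, f)
        t = E @ w                      # scores for this factor
        tt = t @ t
        p = E.T @ t / tt               # X-loading
        qa = (f @ t) / tt              # y-loading
        E = E - np.outer(t, p)         # subtract the factor from X ...
        f = f - qa * t                 # ... and from y before the next one
        W.append(w)
        P.append(p)
        q.append(qa)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)   # coefficients for the original X
    b0 = y_mean - x_mean @ b
    return b0, b

# Simulated data for illustration (not from the slides)
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

b0, b = pls1_fit(X, y, n_components=2)
y_hat = b0 + X @ b
```

When the number of components equals the number of (non-collinear) x-variables, this iteration reproduces the ordinary least-squares solution; fewer components give the compressed, more stable predictor.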
Principal component regression (PCR) • Uses principal components • Solves the collinearity problem, stable solutions • Provides plots for interpretation (scores and loadings) • Well understood • Outlier diagnostics • Easy to modify • But uses only X to determine components
PLS-regression • Easy to compute • Stable solutions • Provides scores and loadings • Often fewer components than PCR • Sometimes better predictions
PCR and PLS for several Y-variables • PCR: computed for each Y; each Y is regressed onto the principal components • PLS: the algorithm is easily modified; it maximises the covariance between linear combinations of X and Y • For both methods: regression equations and plots
Validation is important • Measure the quality of the predictor • Determine A, the number of components • Compare methods
Prediction testing • Calibration set: estimate the coefficients • Testing/validation set: predict y, using the coefficients
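Prediction testing can be sketched as a split into a calibration set and a test set. The example below uses simulated data and ordinary multiple linear regression as the predictor; the split sizes and all values are assumptions for illustration.

```python
import numpy as np

# Simulated data for illustration (not from the slides)
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=40)

# Split the objects: first 25 for calibration, last 15 for testing
X_cal, y_cal = X[:25], y[:25]
X_test, y_test = X[25:], y[25:]

# Calibration: estimate the coefficients (intercept + slopes) by least squares
A_cal = np.column_stack([np.ones(len(y_cal)), X_cal])
coef, *_ = np.linalg.lstsq(A_cal, y_cal, rcond=None)

# Testing: predict y for the test objects, using the coefficients
A_test = np.column_stack([np.ones(len(y_test)), X_test])
y_pred = A_test @ coef

# Summarise the prediction error on the test set
rmsep = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print("RMSEP on test set:", rmsep)
```

The test objects play no part in estimating the coefficients, so the error reflects the predictor at hand on new data.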
Cross-validation • Calibrate: find $y = f(x)$, estimate the coefficients • Predict y for the samples held out, using the coefficients
Validation • Compute RMSEP (root mean square error of prediction) • Plot RMSEP versus the number of components • Choose the number of components with the best RMSEP properties • Compare RMSEP for different methods
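RMSEP is usually defined as $\sqrt{\frac{1}{N}\sum_{i}(y_i - \hat{y}_i)^2}$. A minimal sketch of the validation steps, assuming that definition, leave-one-out cross-validation and PCR as the method; the helper names and the simulated data are illustrative only.

```python
import numpy as np

def rmsep(y_measured, y_predicted):
    """Root mean square error of prediction."""
    return float(np.sqrt(np.mean((y_measured - y_predicted) ** 2)))

def pcr_coefficients(X, y, n_components):
    """PCR via SVD; returns intercept b0 and coefficients b for raw X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    T = Xc @ P
    q, *_ = np.linalg.lstsq(T, yc, rcond=None)
    b = P @ q
    return y_mean - x_mean @ b, b

def loo_rmsep(X, y, n_components):
    """Leave each object out in turn, calibrate on the rest, predict it."""
    n = len(y)
    y_hat = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b0, b = pcr_coefficients(X[keep], y[keep], n_components)
        y_hat[i] = b0 + X[i] @ b
    return rmsep(y, y_hat)

# Simulated data for illustration (not from the slides)
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + 0.2 * rng.normal(size=30)

# RMSEP for A = 1..4 components; this is the curve one would plot
curve = {a: loo_rmsep(X, y, a) for a in range(1, 5)}
best_a = min(curve, key=curve.get)
for a, value in curve.items():
    print(a, round(value, 3))
```

Taking the minimum is the simplest choice rule; in practice a smaller number of components with nearly the same RMSEP is often preferred.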
Example: RMSEP plot, with MLR marked (figure) • NIR calibration of protein in wheat • 6 NIR wavelengths, 12 calibration samples, 26 test samples
Conceptual illustration of important phenomena (figure) • Estimation error • Model error
Prediction vs. cross-validation • Prediction testing: measures the prediction ability of the predictor at hand. Requires much data. • Cross-validation: measures a property of the method. Better suited to smaller data sets.
Validation • One should also plot measured versus predicted y-value • Correlation can be computed, but can sometimes be misleading
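One way the correlation can mislead is a systematic offset: predictions that are all too high by the same amount still correlate perfectly with the measurements. The numbers below are made up to show this.

```python
import numpy as np

# Made-up measured values and predictions with a constant bias of +2
y_measured = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
y_predicted = y_measured + 2.0

# Correlation ignores the offset and reports a perfect relationship
r = np.corrcoef(y_measured, y_predicted)[0, 1]

# The prediction error still shows that every prediction is off by 2
rmsep = float(np.sqrt(np.mean((y_measured - y_predicted) ** 2)))

print("correlation:", r)
print("RMSEP:", rmsep)
```

In a plot of measured versus predicted y, the points in this example would lie on a straight line parallel to, but shifted away from, the line y = predicted y, which is why the plot is worth inspecting alongside any summary number.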
Example: plot of y versus predicted y (figure) • Measured and predicted protein, NIR calibration
Outlier detection • Instrument error or noise • Drift of signal (over time) • Misprints • Samples outside normal range (different population)
Outlier detection • Outliers can be detected because we have: • a model for the spectral data ($X = TP^{T} + E$) • a model for the relationship between X and y ($y = Tq + f$)
Outlier detection tools • Residuals: X- and y-residuals • X-residuals as before; the y-residual is the difference between measured and predicted y • Leverage: $h_i$
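These tools can be computed from the scores and residuals of the compressed model. The sketch below assumes PCA scores and one commonly used leverage definition, $h_i = \frac{1}{N} + \sum_{a} t_{ia}^2 / (t_a^{T} t_a)$; the slides do not show the formula, so this choice, the function name and the simulated data are assumptions.

```python
import numpy as np

def outlier_diagnostics(X, y, n_components):
    """Per-object X-residual size, y-residual and leverage."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    T = Xc @ P                                   # scores
    q = (T.T @ yc) / np.sum(T ** 2, axis=0)      # y-loadings
    # Length of each object's row in E = Xc - T P^T
    x_resid = np.sqrt(np.sum((Xc - T @ P.T) ** 2, axis=1))
    # Measured minus predicted (centred) y
    y_resid = yc - T @ q
    # Assumed leverage definition: 1/N + sum_a t_ia^2 / (t_a' t_a)
    leverage = 1.0 / n + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)
    return x_resid, y_resid, leverage

# Simulated data with one planted extreme object (illustration only)
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 5))
X[3] += 8.0                                      # object 3 lies far from the rest
y = X[:, 0] + 0.1 * rng.normal(size=25)

x_resid, y_resid, leverage = outlier_diagnostics(X, y, n_components=2)
suspect = int(np.argmax(leverage))
print("highest leverage: object", suspect, "h =", leverage[suspect])
```

With this definition the leverages sum to 1 + A, so a single object with a value close to 1 dominates one of the components and deserves a closer look.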