Regression analysis

Regression analysis
Relating two data matrices/tables (X-data and Y-data) to each other
Purpose: prediction and interpretation

Typical examples
• Spectroscopy: predicting chemistry from spectral measurements
• Product development: relating sensory data to chemistry data
• Marketing: relating sensory data to consumer preferences

Topics covered
• Simple linear regression
• The selectivity problem: a reason why multivariate methods are needed
• The collinearity problem: a reason why data compression is needed
• The outlier problem: why and how to detect outliers

Simple linear regression
• One y and one x. Use x to predict y.
• Use a linear model/equation and fit it by least squares

Data structure
X-variable   Y-variable
2            7
4            6
1            8
...          ...
Each row is an object; the x and y columns hold the same number of objects.

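Simple linear regression
Least squares (LS) is used for estimation of the regression coefficients.
Model: y = b0 + b1x + e, where b0 is the intercept and b1 is the slope.
[Figure: y plotted against x with the fitted line, showing b0 and b1]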
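The least squares fit of y = b0 + b1x + e has a closed-form solution, which can be sketched in a few lines of plain Python. The function name `fit_line` and the data are illustrative assumptions, not taken from the slides.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + e."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must hold the same number of objects (at least 2)")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must vary to estimate a slope")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # slope
    b0 = y_mean - b1 * x_mean    # intercept
    return b0, b1


# Hypothetical data (not from the slides), roughly y = 1 + 2x plus noise
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]                  # fitted values
residuals = [yi - fi for yi, fi in zip(y, y_hat)]   # estimates of e
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

With an intercept in the model, the least squares residuals sum to zero, which is a quick sanity check on any implementation.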

Regression analysis
[Diagram: workflow of regression analysis. Labels: Data (X, Y); Pre-processing; Outliers?; Regression analysis; Model; Interpretation; Future X; Prediction]

The selectivity problem
A reason why multivariate methods are needed

Can be used for several Y’s also

Multiple linear regression
• Provides
  • predicted values
  • regression coefficients
  • diagnostics
• If there are many highly collinear variables
  • unstable regression equations
  • coefficients that are difficult to interpret: many and unstable

Collinearity, the problem of correlated X-variables
y = b0 + b1x1 + b2x2 + e
• Regression in this case is fitting a plane to the data (open circles)
• The two x's are highly correlated
• This leads to an unstable equation/plane (in the direction with little variability)
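The instability can be made concrete with a small numerical sketch. Two nearly identical x-variables are used to fit y = b0 + b1x1 + b2x2 + e twice: once on exact data and once after a tiny perturbation of y. The data, the perturbation and the helper name `fit_plane` are illustrative assumptions.

```python
def fit_plane(x1, x2, y):
    """Least-squares (b0, b1, b2) for y = b0 + b1*x1 + b2*x2 + e."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 ** 2   # close to zero when x1 and x2 are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Hypothetical, highly correlated x-variables
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.99, 3.02, 3.98, 5.01]

y_clean = [a + b for a, b in zip(x1, x2)]        # exactly y = x1 + x2
noise = [0.05, -0.05, 0.05, -0.05, 0.05]         # small assumed disturbance
y_noisy = [v + e for v, e in zip(y_clean, noise)]

_, b1_clean, b2_clean = fit_plane(x1, x2, y_clean)
_, b1_noisy, b2_noisy = fit_plane(x1, x2, y_noisy)

print("clean:", round(b1_clean, 3), round(b2_clean, 3))
print("noisy:", round(b1_noisy, 3), round(b2_noisy, 3))
print("sum of coefficients, noisy:", round(b1_noisy + b2_noisy, 3))
```

In this sketch the individual coefficients move by several units, while their sum stays close to 2. The plane is well determined along the direction in which the data vary and poorly determined across it.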

Possible solutions
• Select the most important wavelengths/variables (stepwise methods)
• Compress the variables to the most dominating dimensions (PCR, PLS)
• We will concentrate on the latter (the two can be combined)

Data compression
• We will first discuss the situation with one y-variable
• Focus on ideas and principles
• Provides a regression equation (as above) and plots for interpretation

Model for data compression methods
X = TP^T + E
y = Tq + f
• X and y are centred
• T: scores, the carrier of information from X to y
• P, q: loadings
• E, f: residuals (noise)
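As a rough illustration of the model X = TP^T + E, y = Tq + f, the sketch below computes a single principal component for centred data and regresses y on its scores (one-component PCR). It uses power iteration in plain Python to stay self-contained; a real analysis would normally use a linear algebra library. The data and the helper names `center_columns` and `first_loading` are assumptions made for this example.

```python
import math


def center_columns(X):
    """Subtract the column means from every row of X (a list of rows)."""
    n, k = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(k)]
    return [[row[j] - means[j] for j in range(k)] for row in X]


def first_loading(Xc, n_iter=200):
    """Unit-length first PCA loading p, by power iteration on Xc^T Xc."""
    n, k = len(Xc), len(Xc[0])
    p = [1.0 / math.sqrt(k)] * k
    for _ in range(n_iter):
        t = [sum(Xc[i][j] * p[j] for j in range(k)) for i in range(n)]   # t = Xc p
        p = [sum(Xc[i][j] * t[i] for i in range(n)) for j in range(k)]   # Xc^T t
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
    return p


# Hypothetical collinear data: x2 is about 2*x1 and x3 about 3*x1
X = [[1.0, 2.1, 2.9], [2.0, 3.9, 6.1], [3.0, 6.1, 9.0],
     [4.0, 8.0, 11.9], [5.0, 9.9, 15.1], [6.0, 12.1, 18.0]]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 5.9]

Xc = center_columns(X)
y_mean = sum(y) / len(y)
yc = [v - y_mean for v in y]

p = first_loading(Xc)                                          # loading P (one column)
t = [sum(row[j] * p[j] for j in range(len(p))) for row in Xc]  # scores T
q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)  # y-loading

E = [[Xc[i][j] - t[i] * p[j] for j in range(len(p))] for i in range(len(Xc))]
f = [yc[i] - q * t[i] for i in range(len(yc))]

x_explained = 1 - sum(e * e for row in E for e in row) / sum(v * v for row in Xc for v in row)
y_explained = 1 - sum(v * v for v in f) / sum(v * v for v in yc)
print(f"X variance explained: {x_explained:.4f}, y variance explained: {y_explained:.4f}")
```

Because the three x-variables here are almost perfectly collinear, one score vector carries nearly all the information in X, and the regression of y on that score is stable.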

Regression by data compression
[Figure, two panels. "PCA to compress data": axes x1, x2, x3 with PC1 and score ti. "Regression on scores": y versus t-score, slope q]

[Diagram comparing MLR, PCR and PLS. MLR: x1, x2, x3, x4 linked directly to y. PCR and PLS: x1, x2, x3, x4 linked to components t1, t2, which are linked to y]

PCR and PLS
For each factor/component:
• PCR
  • Maximize the variance of linear combinations of X
• PLS
  • Maximize the covariance between linear combinations of X and y
Each factor is subtracted before the next is computed.
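For a single y, the PLS steps described above can be sketched as a NIPALS-style loop: the weight vector w proportional to X^T y is the unit-length direction that maximizes the covariance between Xw and y, and each factor is subtracted (deflated) from X and y before the next is computed. This is a minimal plain-Python sketch under those assumptions; the function name `pls1` and the data are hypothetical.

```python
import math


def pls1(X, y, n_components):
    """PLS for one y-variable. Returns scores T, weights W, loadings P, q and the y-residual."""
    n, k = len(X), len(X[0])
    # Centre X and y, as assumed by the model
    x_means = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    X = [[row[j] - x_means[j] for j in range(k)] for row in X]
    y = [v - y_mean for v in y]

    T, W, P, q = [], [], [], []
    for _ in range(n_components):
        w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(k)]   # X^T y
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0:
            break                                   # nothing left in X that covaries with y
        w = [v / norm for v in w]
        t = [sum(X[i][j] * w[j] for j in range(k)) for i in range(n)]   # scores
        tt = sum(v * v for v in t)
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(k)]  # X-loadings
        qa = sum(t[i] * y[i] for i in range(n)) / tt                    # y-loading
        # Subtract this factor before computing the next one
        X = [[X[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        y = [y[i] - qa * t[i] for i in range(n)]
        T.append(t)
        W.append(w)
        P.append(p)
        q.append(qa)
    return T, W, P, q, y


# Hypothetical data: 6 objects, 3 x-variables
X = [[1.0, 2.0, 1.0], [2.0, 1.0, 3.0], [3.0, 4.0, 2.0],
     [4.0, 3.0, 5.0], [5.0, 6.0, 4.0], [6.0, 5.0, 7.0]]
y = [2.0, 3.1, 5.9, 7.2, 10.1, 10.8]

y_mean = sum(y) / len(y)
ss_total = sum((v - y_mean) ** 2 for v in y)
_, _, _, _, f1 = pls1(X, y, 1)
T, W, P, q, f2 = pls1(X, y, 2)
ss1 = sum(v * v for v in f1)
ss2 = sum(v * v for v in f2)
print(f"y residual sum of squares: total {ss_total:.3f}, A=1 {ss1:.3f}, A=2 {ss2:.3f}")
```

The deflation step makes successive score vectors orthogonal, so the y-loadings can be estimated one component at a time.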

Principal component regression (PCR)
• Uses principal components
• Solves the collinearity problem, giving stable solutions
• Provides plots for interpretation (scores and loadings)
• Well understood
• Outlier diagnostics
• Easy to modify
• But uses only X to determine the components

PLS-regression
• Easy to compute
• Stable solutions
• Provides scores and loadings
• Often needs fewer components than PCR
• Sometimes gives better predictions

PCR and PLS for several Y-variables
• PCR is computed for each Y: each Y is regressed onto the principal components
• PLS: the algorithm is easily modified. It maximises the covariance between linear combinations of X and linear combinations of Y.
• For both methods: regression equations and plots

Validation is important
• Measure the quality of the predictor
• Determine A, the number of components
• Compare methods

Prediction testing
• Calibration set: estimate the coefficients
• Testing/validation set: predict y using the coefficients

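Cross-validation
• Calibrate on the data with some samples left out: find y = f(x), estimate the coefficients
• Predict y for the left-out samples using the coefficients
• Repeat until every sample has been left out once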

Validation
• Compute RMSEP (root mean square error of prediction)
• Plot RMSEP versus the number of components
• Choose the number of components with the best RMSEP properties
• Compare RMSEP for different methods
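The formula on this slide did not survive extraction. The sketch below assumes the usual definition, RMSEP = sqrt((1/N) * sum of (y_i - yhat_i)^2) over the N predicted samples, and applies it in a leave-one-out cross-validation. Simple linear regression stands in for the predictor; with PCR or PLS the same loop would be repeated for each number of components. All names and data are illustrative.

```python
import math


def fit_line(x, y):
    """Least-squares (b0, b1) for y = b0 + b1*x + e."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def rmsep(y_measured, y_predicted):
    """Root mean square error of prediction (assumed standard definition)."""
    n = len(y_measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(y_measured, y_predicted)) / n)


def loo_cv_predictions(x, y):
    """Predict each sample from a calibration on all the other samples."""
    predictions = []
    for i in range(len(x)):
        x_cal = x[:i] + x[i + 1:]
        y_cal = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_cal, y_cal)
        predictions.append(b0 + b1 * x[i])
    return predictions


# Hypothetical data (not from the slides)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

b0, b1 = fit_line(x, y)
rmse_fit = rmsep(y, [b0 + b1 * xi for xi in x])   # error on the calibration data itself
rmse_cv = rmsep(y, loo_cv_predictions(x, y))      # cross-validated error
print(f"fit: {rmse_fit:.3f}, cross-validation: {rmse_cv:.3f}")
```

The error on the calibration data is optimistic because the same samples were used to estimate the coefficients. The cross-validated value is never smaller for a least squares fit.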

[Figure: RMSEP for MLR. NIR calibration of protein in wheat: 6 NIR wavelengths, 12 calibration samples, 26 test samples]

[Figure: conceptual illustration of important phenomena, showing estimation error and model error]

Prediction vs. cross-validation
• Prediction testing: measures the prediction ability of the predictor at hand. Requires much data.
• Cross-validation: measures a property of the method. Better suited to smaller data sets.

Validation
• One should also plot measured versus predicted y-values
• The correlation can be computed, but it can sometimes be misleading

Example: plot of y versus predicted y
[Figure: measured versus predicted protein, NIR calibration]

Outlier detection
• Instrument error or noise
• Drift of signal (over time)
• Misprints
• Samples outside the normal range (different population)

Outlier detection
• Outliers can be detected because we have
  • a model for the spectral data (X = TP^T + E)
  • a model for the relationship between X and y (y = Tq + f)

Outlier detection tools
• Residuals
  • X- and y-residuals
  • X-residuals as before; the y-residual is the difference between measured and predicted y
• Leverage
  • h_i
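The slide gives only the symbol h_i. One common definition for score-based models, assumed here, is h_i = 1/N + sum over components a of t_ia^2 / (t_a^T t_a), for centred and mutually orthogonal score vectors. The sketch below computes it in plain Python; the function name `leverages` and the scores are hypothetical.

```python
def leverages(scores):
    """Leverage h_i for each object, from a list of centred, orthogonal score vectors."""
    n = len(scores[0])
    h = [1.0 / n] * n                  # contribution from the mean
    for t in scores:
        tt = sum(v * v for v in t)     # t_a' t_a
        h = [hi + v * v / tt for hi, v in zip(h, t)]
    return h


# Hypothetical centred scores for one component; the last object lies far from the rest
t1 = [-1.0, -0.5, 0.0, 0.5, -0.5, -0.5, 2.0]

h = leverages([t1])
most_extreme = h.index(max(h))
print([round(v, 3) for v in h], "-> highest leverage: object", most_extreme)
```

Under this definition the leverages sum to 1 + A, where A is the number of components. An object whose h_i is far above the average (1 + A)/N has an unusual position in X and deserves a closer look, together with its X- and y-residuals.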
