The H-method of Mathematical modelling



  1. The H-method of Mathematical modelling Agnar Höskuldsson, IPL, DTU, Bldg 403, 2800 Kgs Lyngby, Denmark Email: ah@ipl.dtu.dk Website: www.predict.ws

  2. Content • Stability of solutions, example • An empirical approach to modelling • The H-principle of mathematical modelling • Optimization associated with the H-principle • Expansions of matrices • Problems in PLS regression • Subset selection by the H-method • Case study. Foss, Denmark • General linear models • Graphs in linear models • Case study. Process data. • Graphic illustrations of latent structures • Non-linear estimation in Gauss-Newton procedures • Polynomial surfaces in low dimensions • Classification procedures. Weights. Graphics. • Multiple weighing procedures. • Multi-way data. • Path modelling. Application to batch data. • Multi-block data analysis • Vision: Object oriented modelling

  3. Research and industry: • Many variables. For example, a NIR (Near Infra-Red) instrument may automatically give 1056 data values. • Important to find appropriate variables. • Reduced rank for inference from data. • Prediction is the objective of primary interest. • Difficult to specify a detailed model. • Important to test the solution found. • Graphic illustrations of variation in data. • Presentation of results in simple terms.

  4. 61 NIR-measurements of beer at Carlsberg. Only 926 values are used for each spectrum. The Y-variable is the 'quality' of beer.

  5. Quality of beer at Carlsberg: Finding score vectors • Task: Find a score vector ("a profile vector") t=Xw that 1. is as reliable as possible, and 2. gives as good a description of the quality parameter as possible. A closer analysis shows that only 81 values for each spectrum should be used. The X-matrix is then 61×81. Y is 61×1.

  6. Quality of beer at Carlsberg: Results. Measured y-values against the computed y-values. Deviations: all deviations are between -0.3 and 0.3. Result: 4 score vectors were found. Explanation of Y: 99.5%. Explanation of X: 98.5%.

  7. Quality of beer at Carlsberg: Changes in the solution vector • Three critical dimensions: • Dimension 4 gives optimal predictions. • Dimension 9 gives significant score vectors measured by the correlation coefficient. • Dimension 14 corresponds to the measurement uncertainty in the instrument. The individual coordinates of the solution vector are shown as curves. The x-axis is the dimension of the solution vector.

  8. Methodology: Find a weight vector w such that the score vector t = Xw = w1x1 + … + wKxK is a good representative of X and is good for the task in question. When t has been found, X is adjusted for t: X ← X − d t pT, where d = 1/(tTt) and p = XTt. The analysis starts over, now using the revised X.
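A minimal numpy sketch of this adjustment step, assuming X is a centered data matrix and w a given weight vector (both names hypothetical, not tied to a particular data set):

    import numpy as np

    def deflate(X, w):
        """Compute the score vector t = Xw and the loading p = X^T t,
        then remove the corresponding rank-one component from X."""
        t = X @ w                          # score vector
        p = X.T @ t                        # loading vector
        d = 1.0 / (t @ t)                  # d = 1/(t^T t)
        X_new = X - d * np.outer(t, p)     # X <- X - d t p^T
        return t, p, d, X_new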

  9. The H-principle, formulation. It is a recommendation of how we should carry out the modelling procedure for any mathematical model: • 1. Carry out the modelling in steps. You specify how you want to look at the data at this step by formulating how the weights are computed. • 2. At each step compute expressions for a) the improvement in fit, ΔFit, and b) the associated precision, ΔPrecision. • 3. Compute the solution that maximizes the product ΔFit × ΔPrecision. • 4. If the computed solution improves the prediction abilities of the model, the solution is accepted. If the solution does not provide this improvement, we stop modelling. • 5. The data are adjusted for what has been selected and we start again at 1). Ref: Prediction Methods in Science and Technology, Vol 1. Thor Publishing, 1996, pp 405+disk. ISBN 87-985941-0-9

  10. The task of modelling data • In industry the primary focus is on the prediction associated with new samples. Under the standard assumptions for linear regression, the variance of the estimated response value, y(x0), given a new sample x0, is • Var(y(x0)) = s2 x0T(XTX)-1x0, • where s2 = yT(I − X(XTX)-1XT)y/(N−A). • The task of modelling is • good fit: a small value of s2, or of yT(I − X(XTX)-1XT)y • a low value of the model variation, x0T(XTX)-1x0.
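A small numpy sketch of these two quantities, assuming X (N×K), y and a new sample x0 are given and A is the number of fitted parameters (all names hypothetical):

    import numpy as np

    def prediction_variance(X, y, x0, A):
        """s^2 and the variance of the predicted response at x0 under the
        standard linear-regression assumptions."""
        N = X.shape[0]
        XtX_inv = np.linalg.inv(X.T @ X)
        H = X @ XtX_inv @ X.T                    # hat matrix
        s2 = y @ (np.eye(N) - H) @ y / (N - A)   # residual variance
        var_y0 = s2 * (x0 @ XtX_inv @ x0)        # model variation at x0
        return s2, var_y0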

  11. Background for the H-principle. 1. The measure of error of fit, YTY − YTX(XTX)-1XTY = YT(I − X(XTX)-1XT)Y, and the measure of precision, (XTX)-1, are stochastically independent. It means that knowledge of the measure of error of fit does not provide any information on the precision. (Similarly, the estimate of the mean value of the normal distribution does not provide any information on the variance.)

  12. Background for the H-principle. 2. Significance testing: The F-test (t-test) of significance of a regression coefficient is equivalent to testing the significance of the correlation coefficient between the present residual response values and the score vector of the coefficient. The correlation coefficient is invariant to the size of the score vector, tA+1, in question. Significance testing therefore gives no information on the precision of the obtained model. The predictions derived from a 'significant model' can be good or bad. (Statistical significance testing in industrial data with many variables tends in general to lead to severe overfitting.)

  13. Application to linear regression. Multivariate regression, Y ~ N(XB, Σ). In step 1) we are looking for a score vector t, computed as t=Xw, that gives us a good description of the data for the response variables, Y. There are two aspects of this modelling task, step 2): a) improvement in fit: |YTt|2/(tTt); b) the associated prediction variance: |Σ|/(tTt). Step 3) suggests maximizing the product {|YTt|2/(tTt)} × {1/[|Σ|/(tTt)]} = |YTt|2/|Σ|. Assuming |Σ| constant, we maximize |YTt|2 = wT(XTYYTX)w, for |w|=1. Solution: Find w as the eigenvector of the leading eigenvalue of XTYYTX, i.e., (XTYYTX)w = λw. (PLS regression)
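A minimal numpy sketch of this weight computation, assuming centered data matrices X (N×K) and Y (N×M) with hypothetical names; the leading eigenvector of XTYYTX is the leading left singular vector of XTY:

    import numpy as np

    def pls_weight(X, Y):
        """Weight vector w maximizing |Y^T X w|^2 subject to |w| = 1,
        i.e. the leading eigenvector of X^T Y Y^T X."""
        C = X.T @ Y                                        # K x M cross-product matrix
        U, svals, Vt = np.linalg.svd(C, full_matrices=False)
        return U[:, 0]                                     # leading left singular vector

The deflation sketch after slide 8 can then be applied with this w to obtain the score vector and the adjusted X for the next step.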

  14. Schematic interpretation of the optimization task. One case: find a weight vector w; with t=Xw and q=YTt, maximize |q|2. The other case: find weight vectors w and q; with t=Xw and v=Yq, maximize (tTv). Common sizes for optical instruments: X with 1056 columns and Y with 5 columns for about 50 samples. Ref.: PLS Regression Methods, Journal of Chemometrics, 2 (1988) 211-228.

  15. Mathematical expansions Data. Simultaneous decomposition of X and X+ X = d1t1p1T + d2t2p2T + … + dAtApAT + … + dKtKpKT X+ = d1r1s1T + d2r2s2T + … + dArAsAT + … + dKrKsKT Covariance matrix. Simultaneous decomposition of S and S+ S = d1p1p1T + d2p2p2T + … + dApApAT+ … + dKpKpKT S+ = d1r1r1T + d2r2r2T + … + dArArAT+ … + dKrKrKT Ref: Data analysis, matrix decompositions and generalized inverse, SIAM J. Sci. Comput, 15 (1994) 239-262.

  16. Problems in PLS regression. When the weight vector w has been found, the score vector t=Xw is computed. The measure of fit for a score vector t: |YTt|2/(tTt). Suppose that X consists of two parts, X=(X1, X2), w=(w1, w2) and t=Xw. Then (YTt) = (YTX1w1) + (YTX2w2), and (tTt) = w1TX1TX1w1 + 2w2TX2TX1w1 + w2TX2TX2w2. It may be that X1 provides a good fit, but X2 shows low covariance with Y. It means that (YTX1w1) is large but (YTX2w2) is small. But the value of w2TX2TX2w2 can be so large that it spoils the fit. In general, if a large percentage of the weight values in w are small, it is important to reduce X by removing columns (variables) that have small weights (covariances).

  17. The procedure used by the H-method: Scale the original data to unit variance. Use x-variable no. i (1 ≤ i ≤ N) if sij is numerically larger than c·si·sj/N for at least one y-variable, yj, j=1,…,M. Find the value of c that gives the best cross-validation. Here si2=(xiTxi)/N and sj2=(yjTyj)/N. In some cases, especially when there is not very high correlation between variables, it may be advantageous to select variables at each step. Only weights, wi, that are significantly different from zero are selected.
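A hedged numpy sketch of this screening rule, assuming centered matrices X and Y and a given threshold constant c (all names hypothetical); in practice c would be chosen by cross-validation as described above:

    import numpy as np

    def select_variables(X, Y, c):
        """Keep x-variable i if |s_ij| > c * s_i * s_j / N for at least one
        y-variable j, where s_ij is the covariance between x_i and y_j."""
        N = X.shape[0]
        s_i = np.sqrt(np.sum(X**2, axis=0) / N)   # std. dev. of each x-variable
        s_j = np.sqrt(np.sum(Y**2, axis=0) / N)   # std. dev. of each y-variable
        S = X.T @ Y / N                            # covariances s_ij (K x M)
        keep = (np.abs(S) > c * np.outer(s_i, s_j) / N).any(axis=1)
        return np.where(keep)[0]                   # indices of selected variables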

  18. Cross-validation. The cross-validation procedure used here is to divide the samples into 10 segments of almost equal size. One segment is left out and the regression parameters are found using data from the other 9 segments. Then the response values of the left-out segment are estimated using the regression parameters found. This is repeated for each segment. Thus at the end of the cross-validation procedure we have estimates of the response values, yc, such that each value in yc is estimated using 90% of the data. If there are 45 samples, 4 or 5 samples would be left out at each regression. The remaining 40 or 41 samples are used to estimate the regression parameters, which are then used to compute the response values.
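A minimal sketch of this segmented cross-validation with hypothetical names; ordinary least squares stands in here for whatever regression method is being cross-validated:

    import numpy as np

    def cross_validate(X, y, n_segments=10):
        """Estimate each response value from a model fitted on the remaining
        ~90% of the samples; returns the vector yc."""
        N = X.shape[0]
        segments = np.array_split(np.arange(N), n_segments)
        yc = np.empty(N)
        for seg in segments:
            train = np.setdiff1d(np.arange(N), seg)
            b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # fit on 9 segments
            yc[seg] = X[seg] @ b                                     # predict left-out segment
        return yc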

  19. Measure of fit and cross-validation. The R2 value in linear regression is defined as R2 = 1 − |y-ŷ|2/|y|2. It has the important interpretation that it is the squared value of the simple correlation coefficient between the observed response value and the estimated response value according to the linear model. The simple correlation coefficient for two sets of variables, x and y, is defined as rxy = (xTy)/(|x||y|). If values are not centered, centering is done before computing the correlation coefficient. Thus R2 can be computed as R2 = (ryŷ)2. In cross-validation there is a similar situation with the response values, y, and the estimated response values from cross-validation, yc. For comparison it is natural to define Q2 similarly as the squared simple correlation coefficient between the response values and the estimated response values from cross-validation, Q2 = (ryyc)2.
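Following this definition, both R2 and Q2 reduce to one squared correlation; a short sketch with hypothetical vectors y, y_hat and yc:

    import numpy as np

    def squared_correlation(y, y_est):
        """Squared simple correlation between observed and estimated responses,
        computed after centering both vectors."""
        a, b = y - y.mean(), y_est - y_est.mean()
        r = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return r**2

    # R2 = squared_correlation(y, y_hat);  Q2 = squared_correlation(y, yc)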

  20. Foss, Hillerød, Denmark. Products: analytical laboratories, beer, chemicals, confectionery, dressings and condiments, edible oils and fats, feed and forage, flour, grain, meat, microbiology, milk and dairy, pet food, pharmaceuticals, sugar, water and soil, wine. Foss has used NIR technology to develop measurement instruments for industry. Yearly sales of instruments are around 200 million euros. From the NIR spectra a regression model is developed, which is used to estimate chemical magnitudes.

  21. Foss, Example. 45 spectra of milk measured by their Milk-Scan instrument. Each spectrum consists of 1056 numbers. They are shown as 45 curves.

  22. Foss. Measurements of fatness in milk. Y-values: fatness in milk. PLS regression using 5 components. The R2 value is 0.9788 and Q2 = 0.8828.

  23. Plot of the values of (xiTy3)2. i=1,…,1056. y3=fatness.

  24. Observed versus computed response values. Upper: the model; lower: cross-validation. 105 x-variables. The R2 value is 0.994. Now we see that the cross-validated values also fit well to the line, Q2=0.989.

  25. Plot of response values, y3, residuals from the model, e=y3-ye, and residuals from cross-validation, ec=y3-yc. We see that the residuals from the model vary within 0.03 and the cross-validated residuals within 0.05. This is satisfactory for Foss. It can say to its customers that the precision of the instruments is of the order 0.05. This precision is also satisfactory for the customers.

  26. General linear models. The results are expansions of the matrices as follows: S = d1p1p1T + … + dApApAT + … + dKpKpKT = PDPT; S-1 = d1v1v1T + … + dAvAvAT + … + dKvKvKT = VDVT; X = d1t1p1T + … + dAtApAT + … + dKtKpKT = TDPT; XTY = d1p1q1T + … + dApAqAT + … + dKpKqKT = PDQT; B = S-1XTY = d1v1q1T + … + dAvAqAT + … + dKvKqKT = VDQT; Ŷ = XB = d1t1q1T + … + dAtAqAT + … + dKtKqKT = TDQT. Here S = XTX + U, where U is a positive (semi-)definite matrix. Examples: Kalman filtering methods, variance components, mixed models (fixed and random factors in experimental design).

  27. There are four sets of vectors, wa, pa, va, and ta, at each step. • wa, the weight vector. It reflects the emphasis of the analysis. Different weights give different regression analyses. In the plots of the vectors we look at whether one or more variables get small weights for all weight vectors. If one or more of them get generally small weights, it is investigated whether they should be removed from the analysis. • ta, the score vector. It is computed as ta = Xa-1wa or ta = Xva. The score vectors define the latent structure. They show what has been used of X and how Y can be described. Pair-wise plots of the score vectors show the variation in the part of the data that is being used. • pa, the loading vector. It is computed as pa = Sa-1wa. If S = XTX, then pa = XTta. If the X matrix has been auto-scaled, and ta scaled to unit length, the loading vector pa can be viewed as the correlation coefficients between the original variables and the a-th score variable. In the general case, where S is any positive definite matrix, a similar interpretation is used. Pair-wise plots of the loading vectors show the correlation structure in the data. • va, the loading weight vector. It is given by pa = Sva. The loading weight vectors show how pa is derived from the correlations of the original X-variables. Since S0 = S, it follows that v1 = w1. The loading weight vectors are studied in order to know how the original variables generate the latent structure.

  28. Further plots to study the results of the analysis: • Observed versus computed Y-values. The columns of Y are drawn against the corresponding columns of Ŷ = XBA. The graphs are supplied with different measures of how good a fit has been obtained. • Y-values against the score vectors. These graphs show the quality of the fit at each step of the computations. In the case of Stepwise Regression analysis the weight vectors are given as wa = (0,0,…,0,1,0,…), where 1 corresponds to the variable selected. In this case the graphs are called 'added variable plots'. • Y-values against the residuals. The residuals are given as E = Y − XBA. If the plots of the columns of Y against the corresponding columns of E show systematic variations, it indicates that the modelling task has not been successful. • The Y-residuals. The columns of the residual matrix E should exhibit random behavior. Therefore plots where the y-axis is a column of E and the x-axis is, e.g., the sample number or a score vector should show a random scatter of points. Note that all the graphical analysis above can be done for any choice of the weight vectors wa and any positive definite matrix S.

  29. Case study. Process data. The data considered here are process data: hourly measurements of 12 x-variables and a quality variable y. The process was measured over a period of 289 hours. Thus X is a 289×12 matrix and y a 289×1 vector. Before analysis the data are auto-scaled (centered and scaled to unit variance). Principal Component Analysis, PCA. It is often useful to study data by PCA. The weight vectors are the eigenvectors of the eigenvalue system XTXw = λw. They can also be computed from the Singular Value Decomposition of X, which gives X = AFWT, where F is a diagonal matrix and A and W are orthonormal. In this case the score matrix can be computed as T = AF (= XW).
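A small numpy sketch of this computation, assuming X has already been auto-scaled (the function name is hypothetical):

    import numpy as np

    def pca_scores(X):
        """PCA via the SVD X = A F W^T; returns the score matrix T = A F (= X W)
        and the weight matrix W."""
        A, f, Wt = np.linalg.svd(X, full_matrices=False)
        T = A * f              # same as A @ np.diag(f)
        W = Wt.T
        return T, W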

  30. Plots of pairwise score vectors. Upper left no 2 versus no 1.

  31. Plot of coordinates of t1 and t3 versus time (coordinate number).

  32. Pairwise plots of the first four score vectors in PLS regression.

  33. Plots of the response variable versus the first score vectors.

  34. Plot of observed versus computed response variable. Data auto-scaled (mean zero and variance 1). The first four score vectors explain R2=97.8% of the variation of the response variable. The score vectors account for 57.8% of the variation of X.

  35. Ridge Regression. It can be shown theoretically that the mean squared error of prediction can be reduced by allowing a small bias in the estimation. In practical terms it means that instead of working with the covariance matrix S = XTX, it can be advantageous to work with S = XTX + kI, where I is the K×K identity matrix and k is a (small) positive constant. There exist a number of theoretically motivated formulas for computing k, but in most cases they are not appropriate. Instead k is normally found by cross-validation, for instance leave-one-out cross-validation. That is how k is found here; it gives k=0.0019. The value of k is so small that the graphs are identical to the ones shown above. Traditionally, the full-rank solution is used in Ridge Regression. This means that all 13 score vectors would be used. This would cause severe overfitting, because there are only 4 significant score vectors.
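A hedged sketch of this procedure: ridge coefficients computed from S = XTX + kI, with k chosen by leave-one-out cross-validation over a grid of candidate values (the names and the grid are hypothetical):

    import numpy as np

    def ridge_coefficients(X, y, k):
        """Solve (X^T X + k I) b = X^T y."""
        K = X.shape[1]
        return np.linalg.solve(X.T @ X + k * np.eye(K), X.T @ y)

    def choose_k_loo(X, y, k_grid):
        """Pick the k with the smallest leave-one-out prediction error."""
        N = X.shape[0]
        best_k, best_err = None, np.inf
        for k in k_grid:
            err = 0.0
            for i in range(N):
                idx = np.delete(np.arange(N), i)       # leave sample i out
                b = ridge_coefficients(X[idx], y[idx], k)
                err += (y[i] - X[i] @ b) ** 2
            if err < best_err:
                best_k, best_err = k, err
        return best_k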

  36. Gauss-Newton procedure for estimation in nonlinear models
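The slide itself carries only the heading, so as a neutral reminder, here is the standard Gauss-Newton update for a residual function r(β) with Jacobian J; this is a generic sketch, not the H-method variant discussed on the following slides:

    import numpy as np

    def gauss_newton(residual, jacobian, beta0, n_iter=20):
        """Standard Gauss-Newton iteration: beta <- beta - (J^T J)^{-1} J^T r."""
        beta = np.asarray(beta0, dtype=float)
        for _ in range(n_iter):
            r = residual(beta)
            J = jacobian(beta)
            step = np.linalg.solve(J.T @ J, J.T @ r)
            beta = beta - step
            if np.linalg.norm(step) < 1e-10:   # simple convergence check
                break
        return beta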

  37. Standard solution

  38. H-method

  39. NIR-data. Y-variable quality of fish. Plot the response variables versus the first PLS score vector. (Thormod Næs, Chem. Intell. Lab. Syst. 5 (1989) 155-168)

  40. Results from finding multivariate polynomial in score variables Only linear terms involving t1 and t2, and squared t1, t12, are significant.

  41. Chem. Intell. Lab. Syst., 44 (1998) 15-30

  42. Classification procedures. For industry the keywords are: Simple, Visual, Stable, Robust and Reliable. The methods used must be simple and applicable by different people who may not have specialized experience. The classical statistical procedures have the possibility of visualising the results, but they are not as stable and robust as required by industry. By insisting that the methods be reliable, in the sense that the procedures must detect outliers, trends, error rates and other developments in data, the industry accepts that some methods may be good in 'a laboratory' but may not satisfy the quality and security requirements needed. The industry requires methods similar to the ones used in regression analysis: weight vectors, score vectors, loading vectors, loading weight vectors.

  43. Plot of Treated (above the line) and Not treated data (below). Data: Treated: 107 109 112 114 119 121 128 139. Not treated: 98 102 103 104 106 109 110 112 113. The value of the weight (the wi value in w) should reflect the degree of non-overlap between these two sets of data values.

  44. The 'Mushroom data'. The mushrooms have been measured on 16 variables. There are three groups of mushrooms, with 7, 8 and 8 samples, respectively. The third group is clearly different from the first two. There are practical problems in separating groups one and two. The question is whether the graphic procedure above can help in visualising the difference between the groups. In order to show the overall variation in the data, a PCA analysis is carried out for all the data. The results are shown in the figure, a scatter plot of the first two score vectors from the PCA of all data. We can confirm that group number 3 is well separated from the others, while groups 1 and 2 are close to each other.

  45. Figure 1.5. Transformation relative to one group (the basis group). Figure 1.6. Schematic illustration of the projection.

  46. Data seen from group 1. Samples are projected onto the score space of group 1. The score space of group 1 is rotated to give best possible separation.

  47. Traditional data are repeated measurements on samples: t = Xw. Values of x1 are around 3.0, values of x2 are around 12.0, values of x3 are around 21.0. The vector t does the task needed.

  48. Double (multiple) weighing schemes: find weight vectors w and v such that vTXw is maximum. 10 judges (persons) are asked to rate 6 different types of food. The values in the table are average values of ratings of 5 different quality measures of taste.
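For unit-length w and v, the maximizer of vTXw is the leading singular vector pair of X; a small numpy sketch with hypothetical names:

    import numpy as np

    def double_weights(X):
        """Unit vectors v and w maximizing v^T X w: the leading
        left and right singular vectors of X."""
        U, svals, Wt = np.linalg.svd(X, full_matrices=False)
        v, w = U[:, 0], Wt[0]
        return v, w, svals[0]    # svals[0] is the maximal value of v^T X w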
