
Scientific Methods 1


Presentation Transcript


1. Scientific Methods 1
‘Scientific evaluation, experimental design & statistical methods’
COMP80131 Lecture 10: Statistical Methods - Intro to PCA & Monte Carlo Methods
Barry & Goran
www.cs.man.ac.uk/~barry/mydocs/MyCOMP80131

2. Correlation & covariance
Pearson correlation coefficient for two samples of M variables (sums run over i = 1 … M):

   r = Σ (x_i - mean_x)(y_i - mean_y) / sqrt( Σ (x_i - mean_x)^2 · Σ (y_i - mean_y)^2 )

Lies between -1 and 1.
Covariance between two samples of M variables:

   cov(x,y) = (1/(M-1)) Σ (x_i - mean_x)(y_i - mean_y)

Use the factor 1/(M-1) if mean_x is the sample mean; 1/M if mean_x is the population mean.

3. In vector notation
When the means have been subtracted, and x and y are column vectors of length M:

   cov(x,y) = (1/(M-1)) x^T y      (or (1/M) x^T y with the population-mean convention)
   r = x^T y / sqrt( (x^T x)(y^T y) )

Both measure the similarity between two columns of numbers (vectors).
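A minimal MATLAB sketch of both formulas (the variable names are mine, not from the slides; the data re-uses two of the variables from the example on slide 6), checked against the built-in corrcoef and cov:

   x = [1 2 3 1 4]';  y = [2 1 3 -1 2]';       % two samples of M = 5 values
   M = length(x);
   xz = x - mean(x);  yz = y - mean(y);        % subtract the sample means
   r   = (xz'*yz) / sqrt((xz'*xz)*(yz'*yz));   % Pearson correlation, lies in [-1, 1]
   cxy = (xz'*yz) / (M-1);                     % sample covariance
   R = corrcoef(x, y);  Cb = cov(x, y);        % built-ins, for comparison
   disp([r R(1,2)]);  disp([cxy Cb(1,2)])      % each pair should agree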

4. Principal Components Analysis
• PCA converts samples of M variables into samples of a smaller number of variables called principal components.
• Produces shorter columns.
• Exploits interdependency or correlation among the M variables in each column.
• The evidence for this is the similarity between columns as seen in many examples.
• If there is none, PCA cannot reduce the number of variables.
• The first principal component has the highest variance: it accounts for as much of the variability as possible.
• Each succeeding principal component has the highest variance possible while being orthogonal to (uncorrelated with) the previous ones.

5. PCA
• Reduces the number of variables (dimensionality) without significant loss of information.
• Also named: ‘discrete Karhunen–Loève transform (KLT)’, ‘Hotelling transform’, ‘proper orthogonal decomposition (POD)’.
• Related to (but not the same as) ‘factor analysis’.

6. Example
Assume 5 observations of the behaviour of 3 variables x1, x2, x3:

   x1:  1   2   3   1   4     sample mean = 11/5 = 2.2
   x2:  2   1   3  -1   2     sample mean = 1.4
   x3:  3   4   7   1   8     sample mean = 4.6

Subtract the means (call the result matrix X):

   x1': -1.2  -0.2   0.8  -1.2   1.8
   x2':  0.6  -0.4   1.6  -2.4   0.6
   x3': -1.6  -0.6   2.4  -3.6   3.4

Calculate the ‘covariance matrix’ C:

           x1'    x2'    x3'
   x1':    1.7    1.15   3.6
   x2':    1.15   2.3    3.45
   x3':    3.6    3.45   8.3

e.g. C(1,2) = average value of x1'·x2' = (-1.2×0.6 + (-0.2)×(-0.4) + 0.8×1.6 + (-1.2)×(-2.4) + 1.8×0.6)/4 = 4.6/4 = 1.15
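This hand calculation can be checked with MATLAB's built-in cov, which expects observations in rows (hence the transpose); a quick sketch:

   X = [1 2 3 1 4 ; 2 1 3 -1 2 ; 3 4 7 1 8];   % 3 variables x 5 observations
   C = cov(X')                                  % uses the 1/(M-1) convention
   % returns [1.7 1.15 3.6 ; 1.15 2.3 3.45 ; 3.6 3.45 8.3], as above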

7. Eigenvalues & eigenvectors
[U, diagV] = eig(C) gives the eigenvectors (columns of U) and eigenvalues (diagonal of D):

         0.811   0.458   0.364               0     0       0
   U =   0.324  -0.870   0.372        D =    0   0.964     0
        -0.487   0.184   0.854               0     0     11.34

         (u3)    (u2)    (u1)              (λ3)  (λ2)    (λ1)

MATLAB's eig returns the eigenvalues in ascending order, so the columns of U are u3, u2, u1 with eigenvalues λ3 = 0, λ2 = 0.964, λ1 = 11.34.

8. Transforming the measurements
• For each column x' of matrix X, multiply by U^T to transform it to a different set of numbers: x' becomes U^T x'.
• Or do it all at once by calculating Y = U^T * X. We get:

          0       0       0       0       0
   Y =   -1.37    0.146  -0.583   0.874   0.929
         -1.58   -0.734   2.936  -4.404   3.781

• The first column of X is now expressed as 0·u3 - 1.37·u2 - 1.58·u1 (with the column labels from the previous slide), and similarly for the other four columns of X.
• Each column is now expressed in terms of the eigenvectors of C.
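A quick check of this transform (my sketch, not from the slides), reproducing Y from the zero-mean data and the eigenvectors; eig's ascending eigenvalue order puts the zero-eigenvalue direction first, hence the zero first row:

   X = [-1.2 -0.2 0.8 -1.2 1.8 ; 0.6 -0.4 1.6 -2.4 0.6 ; -1.6 -0.6 2.4 -3.6 3.4];
   C = cov(X');        % the covariance matrix from slide 6
   [U, D] = eig(C);
   Y = U' * X          % row 1 ~ zeros; rows 2-3 match the matrix above (possibly up to sign)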

9. Reducing dimensionality
• U^T C U = D, therefore C = U D U^T (since U^T is the inverse of the orthogonal matrix U).
• So we can express: C = λ1 (u1 u1^T) + λ2 (u2 u2^T) + λ3 (u3 u3^T).
• Now look at the eigenvalues λ1, λ2, λ3.
• Strike out the zero-valued one (λ3) together with its eigenvector (u3).
• That leaves u1 & u2 as principal components: all the data can be represented, without loss, with just these two.
• Smaller eigenvalues (such as λ2) can also be removed, each with its eigenvector: if they do not affect C much, they should not affect X much.
• The whole data set can then be represented by just u1 without serious loss of accuracy.
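Continuing the previous sketch, a short check (mine, not from the slides) that the rank-one expansion really reproduces C:

   lam = diag(D);
   Csum = zeros(3);
   for k = 1:3
     Csum = Csum + lam(k) * U(:,k) * U(:,k)';   % lambda_k * u_k * u_k'
   end
   norm(C - Csum)      % ~0: the expansion reproduces C exactly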

10. Reconstructing original data from principal components
• Because Y = U^T * X, we have X = U * Y.
• If we don't strike out any eigenvalues & eigenvectors, this gets us back to the original data exactly.
• If we strike out row 1 of Y and u3 (the first column of U, whose eigenvalue is 0), we still get back the original data exactly.
• If we also strike out row 2 of Y and u2, we get back something close to the original data.
• So we do not lose much information by keeping just one principal component.
• Dimensionality reduces from 3 to 2 or 1.
• (Normally the eigenvalues are reordered in decreasing magnitude, but I have not done that here.)

11. In MATLAB

   clear all;
   origData = [1 2 3 1 4 ; 2 1 3 -1 2 ; 3 4 7 1 8]
   [M,N] = size(origData);
   meanofCols = mean(origData,2);
   % subtract off the mean for EACH dimension
   zmData = origData - repmat(meanofCols,1,N)
   covarMat = 1/(N-1) * (zmData * zmData')
   % find the eigenvectors and eigenvalues of covarMat
   [eigVecs, diagV] = eig(covarMat)
   eigVals = diag(diagV)
   [reigVals, Ind] = sort(eigVals,'descend');  % sort the variances in decreasing order
   reigVecs = eigVecs(:,Ind);                  % reorder eigenvectors accordingly
   proj_zmData = reigVecs' * zmData
   disp('Approx to original data taking just a few principal components');
   nPC = input('How many PCs do you need (look at eigenvals to decide):');
   PCproj_zmData = proj_zmData(1:nPC,:)
   PCVecs = reigVecs(:,1:nPC)                  % only keep the first few reordered eigVecs
   RecApprox_zmData = PCVecs * PCproj_zmData
   RecApprox_origData = RecApprox_zmData + repmat(meanofCols,1,N)
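If the Statistics and Machine Learning Toolbox is available, the script can be cross-checked against the built-in pca, which expects observations in rows and returns the components already sorted by decreasing variance (a sketch, not part of the original script; eigenvector signs may differ):

   [coeff, score, latent] = pca(origData');   % coeff ~ reigVecs (up to sign)
   latent                                     % ~ [11.34; 0.964; 0], as on slide 7
   score'                                     % ~ proj_zmData (up to sign)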

12. Monte Carlo Methods
• Monte Carlo methods use repeated random sampling of the behaviour of highly complex, multidimensional mathematical models of real or simulated systems to determine their properties.
• The repeated random sampling produces observations to which statistical inference can be applied.

13. Pseudo-random processes
• The name Monte Carlo refers to the famous casino.
• Gambling requires a random process such as the spinning of a roulette wheel.
• Monte Carlo methods use pseudo-random processes implemented in software.
• ‘Pseudo-random’ processes are not truly random: the variables produced can be predicted by a person who knows the algorithm being used.
• However, they can be used to simulate the effects of true randomness.
• Simulations are not required to be numerically identical to real processes; the aim is to produce statistical results such as averages & distributions.
• This requires a ‘sampling’ of the population of all possible modes of behaviour of the system.
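In MATLAB the pseudo-random generator is controlled by rng; fixing the seed makes a ‘random’ simulation exactly repeatable, which illustrates the predictability mentioned above (a minimal sketch; the seed value is arbitrary):

   rng(12345);       % fix the seed of the pseudo-random generator
   a = rand(1,3);
   rng(12345);       % reset to the same seed ...
   b = rand(1,3);
   isequal(a, b)     % true: the 'random' sequence is fully predictable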

14. Illustration
• Monte Carlo methods may be used to evaluate multi-dimensional integrals.
• Consider the problem of calculating the area of an ellipse by generating a set of N pseudo-random number pairs (xi, yi) uniformly covering the square -1 < x < 1, -1 < y < 1, as illustrated next:

15. Area of an ellipse
[Figure: an ellipse inscribed in the square -1 ≤ x ≤ 1, -1 ≤ y ≤ 1]
The area of the square is 2 × 2 = 4. If there are N points and M of them fall inside the ellipse, then

   area of ellipse ≈ 4 × M / N   as N → ∞

(frequentist approach).
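A sketch of this estimate for a concrete ellipse, x^2 + (y/0.5)^2 <= 1 (the semi-axes 1 and 0.5 are my choice; the true area is pi × 1 × 0.5 ≈ 1.5708):

   N = 1e5;
   x = 2*rand(N,1) - 1;                % uniform on -1 < x < 1
   y = 2*rand(N,1) - 1;                % uniform on -1 < y < 1
   M = sum(x.^2 + (y/0.5).^2 <= 1);    % points falling inside the ellipse
   areaEst = 4*M/N                     % ~ pi/2, improving as N grows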

16. Simplicity of MC methods
• This example illustrates the simplicity of MC techniques, but not their computational advantages.
• We could have used a regularly placed grid of points rather than randomly placed points in the rectangle, as on the next slide.

17. Regular grid
[Figure: a regularly spaced grid of points covering the same region, adjacent points a fixed distance apart]
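For comparison, a sketch of the regular-grid version of the same area estimate (the grid size is my choice):

   n = 100;                                    % points per dimension, so N = n^2 in 2-D
   [xg, yg] = meshgrid(linspace(-1,1,n));
   M = sum(xg(:).^2 + (yg(:)/0.5).^2 <= 1);    % grid points inside the same ellipse
   areaEst = 4*M/n^2                           % similar accuracy in 2-D ...
   % ... but N = n^L points are needed in L dimensions: the 'curse of dimensionality'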

18. Advantages of Monte Carlo
• In fact there are no advantages for such a 2-dimensional problem.
• Consider instead a multi-dimensional integration (the integrand f is the one referred to on the next slide):

   I = ∫ … ∫ f(x1, x2, …, xL) dx1 dx2 … dxL
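As a concrete instance (the dimension, sample size and integrand are my choices, not the lecturer's), a Monte Carlo estimate of such an L-dimensional integral: the volume of the unit ball in L = 6 dimensions, whose true value is pi^3/6 ≈ 5.168.

   L = 6;  N = 1e6;
   P = 2*rand(N, L) - 1;           % N random points in the hypercube [-1, 1]^L
   M = sum(sum(P.^2, 2) <= 1);     % how many fall inside the unit L-ball
   volEst = 2^L * M / N            % ~ pi^3/6 = 5.1677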

19. Disadvantage of regular grid
• f(x1, x2, …, xL) may be evaluated at regularly spaced points as a means of evaluating the integral.
• The number of regularly spaced points, N, must increase exponentially with the dimension L if the error is not to increase with L.
• If N = 100 when L = 2, adjacent points will be 0.2 cm apart (on a 2 cm × 2 cm square).
• If L increases to 3, we need N = 10^3 points to maintain the same separation between them; when L = 4 we need N = 10^4, etc. This is the ‘curse of dimensionality’.
• Look at this another way: assume N remains fixed with regular sampling, and L increases.
• Each dimension must then be sampled more & more sparsely, and less and less efficiently: more & more points share the same value in each dimension.
• The error is proportional to N^(-2/L), which grows as L increases for fixed N.

20. Advantage of random sampling
• Use uniformly distributed random points in L-dimensional space.
• This avoids the inefficiency of the rectangular grids created by regular sampling, by using a purely random sample of N points uniformly distributed over the region.
• For high dimension L, the error is proportional to 1/√N, independently of L.
• To reduce the error by a factor of 2, the sample size N must be increased by a factor of 4, regardless of the dimensionality.
• There are ways of decreasing the Monte Carlo error to make the technique still more efficient.
• One approach is to use ‘quasi-random’ or ‘low-discrepancy’ sampling; the use of such quasi-random sampling for numerical integration is referred to as ‘quasi-Monte Carlo’ integration.
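A sketch (sample sizes and repetition count are my choices) illustrating the 1/√N behaviour empirically on the semicircle example of the following slides: the product error × sqrt(N) stays roughly constant as N grows.

   trueArea = pi/2;                             % area of the unit semicircle
   for N = [1e2 1e4 1e6]
     err = zeros(1,50);
     for rep = 1:50                             % average the error over repeated runs
       x = 2*rand(N,1) - 1;  y = rand(N,1);
       err(rep) = abs(2*sum(y <= sqrt(1 - x.^2))/N - trueArea);
     end
     fprintf('N = %7d   mean error * sqrt(N) = %.3f\n', N, mean(err)*sqrt(N));
   end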

21. MATLAB: Area of Semicircle

   for N = 1:200
     M = 0;
     for i = 1:N
       x = 2*rand - 1;
       y = rand*1.0;
       I = sqrt(1 - x*x);
       if y <= I, M = M+1; end   % if y <= I, point (x,y) is below the curve
     end
     Int(N) = M*2/N;             % enclosing rectangle has area 2 x 1 = 2
   end
   figure(6); plot(Int);
   title('Area of semicircle');
   xlabel('Number of points');

22. Convergence as N → ∞
[Figure: the running estimate Int plotted against the number of points N, converging towards the true value π/2 ≈ 1.571]

23. MATLAB code for scatter plot

   clear; N = 6000; M = 0;
   for i = 1:N
     x(i) = rand*2 - 1;
     y(i) = rand*1.0;
     I = sqrt(1 - x(i)*x(i));
     if y(i) <= I, M = M+1; C(i) = 2; else C(i) = 1; end   % colour 2 below the curve, 1 above
   end
   scatter(x, y, 6, C, 'filled');
   Int = M*2/N
   title(sprintf('Scatter of MC area method: N=%d, Int=%g', N, Int));   % %g, not %d: Int is not an integer
   disp('Points with y <= I in one colour, y > I in the other');
   xlabel('x'); ylabel('y');

24. Integration of circle - scatter plot
[Figure: scatter of the N random points, coloured according to whether they fall below or above the semicircle]
