NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA

NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 12 Overview John Birks

OVERVIEW
• Some applications
• Volcanic tephras
• Scotland’s most famous product
• Integrated analyses
• Problems of percentage compositional data
• Log-ratios
• Chameleons of CA and CCA
• Software availability
• Web sites
• Final comments
• Topics covered
• Exploratory data analysis
• Clustering
• Gradient analysis
• Hypothesis testing
• Principle of parsimony in data analysis
• Possible future developments
• Conventional
• Less conventional

EXPLORATORY DATA ANALYSIS
Essential first step.
Feel for the data – ranges, need for transformations, rogue or outlying observations.
NEVER FORGET THE GRAPH

CLUSTERING
Can be useful for some purposes – basic description, summarisation of large data sets.
Fraught with problems and difficulties – choice of DC, choice of clustering method, difficulties of validation and evaluation.
Good general purpose: TWINSPAN – ORBACLAN – COINSPAN

GRADIENT ANALYSIS
Regression, calibration, ordination, constrained ordination, discriminant analysis and canonical variates analysis, analysis of stratigraphical and spatial data.

HYPOTHESIS TESTING
Randomisation tests, Monte Carlo permutation tests.

Cajo ter Braak 1987 Wageningen

Classification of gradient analysis techniques by type of problem, response model and method of estimation.
a  Constrained multivariate regression
b  Ordination after regression on covariables
c  Constrained ordination after regression on covariables = constrained partial multivariate regression
d  “Reduced-rank regression” = “PCA of y with respect to x”

A straight line displays the linear relation between the abundance value (y) of a species and an environmental variable (x), fitted to artificial data (a = intercept; b = slope or regression coefficient).

A unimodal relation between the abundance value (y) of a species and an environmental variable (x). (u = optimum or mode; t = tolerance; c = maximum).
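The two response models in these captions can be sketched in a few lines. The straight line uses the intercept a and slope b named above. For the unimodal curve the sketch assumes the usual Gaussian form, y = c exp(-(x - u)^2 / (2 t^2)), which the slide does not write out. The function names and the numbers in the example are illustrative only.

```python
import math


def linear_response(x, a, b):
    """Straight-line model: y = a + b * x (a = intercept, b = slope)."""
    return a + b * x


def gaussian_response(x, u, t, c):
    """Unimodal model, assumed Gaussian: y = c * exp(-(x - u)^2 / (2 * t^2)).

    u = optimum (mode), t = tolerance, c = maximum abundance.
    """
    return c * math.exp(-((x - u) ** 2) / (2.0 * t ** 2))


# Hypothetical species with optimum 5, tolerance 2 and maximum 10.
for x in (1.0, 3.0, 5.0, 7.0, 9.0):
    print(x, round(gaussian_response(x, u=5.0, t=2.0, c=10.0), 3))
```

The curve reaches c at x = u and falls to about 61% of c one tolerance away from the optimum, which is why t acts as a measure of niche width.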

GRADIENT ANALYSIS
Linear-based methods or unimodal-based methods?
Critical question, not a matter of personal preference.
If gradients are short, there are sound statistical reasons to use linear methods – Gaussian-based methods break down, edge effects in CA and related techniques become serious, biplot interpretations easy.
If gradients are long, linear methods become ineffective (‘horseshoe’ effect).

How to estimate gradient length?
Regression                        Hierarchical series of response models, GLM and HOF
Calibration                       GLM, DCCA (single x variable)
Ordination                        DCA (detrending by segments, non-linear rescaling)
Constrained ordination            DCCA (detrending by segments, non-linear rescaling)
Partial ordination                Partial DCA (detrending by segments, non-linear rescaling)
Partial constrained ordination    Partial DCCA (detrending by segments, non-linear rescaling)

HYPOTHESIS TESTING
Monte Carlo permutation tests and randomisation tests
Distribution free, do not require normality of error distribution.
Do require INDEPENDENCE or EXCHANGEABILITY.
Validity of permutation test results depends on the validity of the type of permutation for the data set at hand.
Completely randomised observations – completely random permutation is appropriate = randomisation test.
Randomised block design – permutation must be conditioned on blocks, e.g. type of farm declared as covariable; if randomisation is conditioned on these, permutations are restricted to within farm.
Time series or line transect – restricted permutations and data kept in order.
Spatial data on grid – restricted permutations and data kept in position.
Repeated measurements – BACI
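The difference between a completely random permutation and a permutation conditioned on blocks can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme of any particular package: the test statistic (absolute covariance), the function names, and the example data are all hypothetical. The observed arrangement is counted as one of the permutations, a common convention.

```python
import random


def abs_cov(x, y):
    """Illustrative test statistic: absolute covariance between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)) / n)


def permute_within_blocks(values, blocks, rng):
    """Shuffle values only among observations sharing a block label."""
    permuted = list(values)
    by_block = {}
    for idx, label in enumerate(blocks):
        by_block.setdefault(label, []).append(idx)
    for idxs in by_block.values():
        shuffled = [values[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, v in zip(idxs, shuffled):
            permuted[i] = v
    return permuted


def permutation_p_value(stat, x, y, blocks=None, n_perm=999, seed=0):
    """Monte Carlo p-value; permutations are restricted if blocks are given."""
    rng = random.Random(seed)
    observed = stat(x, y)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        if blocks is None:
            y_perm = list(y)
            rng.shuffle(y_perm)  # completely random permutation
        else:
            y_perm = permute_within_blocks(y, blocks, rng)
        if stat(x, y_perm) >= observed:
            hits += 1
    return hits / (n_perm + 1)


# Hypothetical data: two farms (blocks) with four plots each.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.2, 7.9]
farms = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(permutation_p_value(abs_cov, x, y))
print(permutation_p_value(abs_cov, x, y, blocks=farms))
```

Time-series and grid data need other restrictions (for example cyclic shifts that keep the order or position of observations), which this sketch does not cover.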

PRINCIPLE OF PARSIMONY IN DATA ANALYSIS
William of Occam (Ockham), 14th century English nominalist philosopher.
Insisted that given a set of equally good explanations for a given phenomenon, the explanation to be favoured is the SIMPLEST EXPLANATION.
Strong appeal to common sense.
Entities should not be multiplied without necessity.
It is vain to do with more what can be done with less.
An explanation of the facts should be no more complicated than necessary.
Among competing hypotheses or models, favour the simplest one that is consistent with the data.
‘Shaved’ explanations to the minimum.
In data analysis:
1) Models should have as few parameters as possible.
2) Linear models should be preferred to non-linear models.
3) Models relying on few assumptions should be preferred to those relying on many.
4) Models should be simplified/pared down until they are MINIMAL ADEQUATE.
5) Simple explanations should be preferred to complex explanations.

RELEVANCE OF PRINCIPLE OF PARSIMONY TO DATA ANALYSIS
MINIMAL ADEQUATE MODEL (MAM)
- as statistically acceptable as the most complex model
- only contains significant parameters
- high explanatory power
- large number of degrees of freedom
- there may not be a single MAM
CLUSTERING
- prefer simple cluster analysis methods (few assumptions, simple parameter values)
- intuitively sensible
REGRESSION
- GAM – GLM
- In GAM, simplest smoothers to be used
- In GLM, model simplification to find MAM (e.g. AIC)
CALIBRATION
- minimum number of components for lowest RMSEP in PLS or WA-PLS
ORDINATION
- retain smallest number of statistically significant axes (broken stick test)
- retain ‘signal’ at expense of noise
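The broken stick test mentioned under ORDINATION can be sketched as below. The sketch assumes the standard broken-stick expectation, in which the k-th of p axes is expected to carry (1/p) * sum(1/i for i = k..p) of the total variance, and it retains the leading axes whose observed share exceeds that expectation. The eigenvalues are hypothetical and are assumed to be sorted in decreasing order.

```python
def broken_stick(p):
    """Expected proportion of total variance for each of p axes."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]


def axes_to_retain(eigenvalues):
    """Count leading axes whose variance share exceeds the broken-stick value.

    Assumes eigenvalues are sorted in decreasing order.
    """
    total = sum(eigenvalues)
    retained = 0
    for value, expected in zip(eigenvalues, broken_stick(len(eigenvalues))):
        if value / total > expected:
            retained += 1
        else:
            break  # stop at the first axis that fails the comparison
    return retained


# Hypothetical eigenvalues from a four-axis ordination.
print(broken_stick(4))
print(axes_to_retain([5.0, 1.0, 0.5, 0.5]))
```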

PARTIAL ORDINATION
- remove effects of ‘nuisance variables’ (covariables or concomitant variables) by partialling out their effects
- ordination of residuals
- retain smallest number of statistically significant axes (broken stick test)
- ‘signal’ at expense of ‘noise’ and ‘nuisance variables’
CONSTRAINED ORDINATION
- most powerful if the number of predictor variables is small compared to number of samples
- constraints are strong, arch effects avoided, no need for detrending, outlier effects minimised
- minimal adequate model (forward selection, VIF, variable selection, AIC)
- only retain statistically significant axes
PARTIAL CONSTRAINED ORDINATION
- as above + partial ordination
STRATIGRAPHICAL DATA ANALYSIS
- only retain statistically significant zones
- simplify data to major axes or gradients of variation

CHOICE BETWEEN INDIRECT & DIRECT GRADIENT ANALYSIS
Indirect gradient analysis – two steps
Direct gradient analysis – one combined step
If relevant environmental data are to hand, direct approach is likely to be more effective and simpler than indirect approach.
Generally achieve a simpler model from direct gradient analysis.

CHOICE BETWEEN REGRESSION & CONSTRAINED ORDINATION
Both regression procedures! One Y or many Y.
Depends on purpose – is it an advantage to analyse all species simultaneously or individually?

Community assemblage or individual taxa?

CONSTRAINED ORDINATION    REGRESSION
HOLISTIC                  INDIVIDUALISTIC
COMMON GRADIENTS          SEPARATE GRADIENTS
QUICK, SIMPLE             SLOW, COMPLEX, DEMANDING
LITTLE THEORY             MUCH THEORY (GLM)
EXPLORATORY               MORE CONFIRMATORY, IN DEPTH

LIMITING FACTORS
Research questions
Hypotheses to be tested and evaluated
Data quality

TYPES OF GRADIENT ANALYSIS METHODS BASED ON WEIGHTED AVERAGING
Community data – incidences (1/0) or abundances (≥ 0) of species at sites.
Environmental data – quantitative and/or qualitative (1/0) variables at same sites.
Use weighted averages of species scores (appropriate for unimodal biological data) and linear combinations (weighted sums) of environmental variables (appropriate for linear environmental data).
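The weighted-averaging step that these methods share can be illustrated with a single environmental variable. In this minimal sketch a species score is the abundance-weighted average of the environmental values at the sites where the species occurs, and a site score is the abundance-weighted average of the species scores. The function names and data are hypothetical, and the sketch omits the iteration, scaling, and deshrinking used in full CA, CCA or WA calibration.

```python
def species_optima(abundance, env):
    """Weighted-average score of each species.

    abundance[i][k] = abundance of species k at site i (>= 0)
    env[i]          = environmental value at site i
    Assumes every species occurs at least once.
    """
    n_species = len(abundance[0])
    optima = []
    for k in range(n_species):
        total = sum(row[k] for row in abundance)
        weighted = sum(row[k] * x for row, x in zip(abundance, env))
        optima.append(weighted / total)
    return optima


def site_scores(abundance, optima):
    """Weighted average of species scores at each site (non-empty sites)."""
    scores = []
    for row in abundance:
        total = sum(row)
        scores.append(sum(y * u for y, u in zip(row, optima)) / total)
    return scores


# Hypothetical data: three sites, two species, one variable (e.g. pH).
abundance = [[10, 0], [5, 5], [0, 10]]
ph = [4.0, 5.0, 6.0]
optima = species_optima(abundance, ph)
print(optima)
print(site_scores(abundance, optima))
```

Note that the site scores are pulled towards the middle of the gradient relative to the observed values, which is the shrinkage that deshrinking regressions correct for in calibration.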

POSSIBLE FUTURE DEVELOPMENTS - CONVENTIONAL
Lecture  Topic
2        Exploratory data analysis: model-specific ‘outlier’ detection; interactive graphics
3        Clustering: COINSPAN; better randomisation tests; CART; latent class analysis
4, 5     Regression analysis: GLM and GAM framework; evaluation by cross-validation. Give up SS, deviance, t, etc!
6        Indirect gradient analysis: ? quest for the ‘ideal’ ordination method; 2-matrix CA and PCA
7        Direct gradient analysis: 3-matrix CCA and RDA (biology, environment, species attributes); multi-component variance partitioning; vector-based reduced rank models with GAMs
8        Calibration and reconstruction: WA-PLS; non-linear deshrinking; ? ML; mixed response models; chemometrics; Bayesian framework; more consideration of spatial autocorrelation
9        Classification: ? give up classical methods; use permutation tests; classification and regression trees and random forests
10       Stratigraphical and spatial data: ? more consideration of temporal and spatial autocorrelation
11       Hypothesis testing: more realistic permutation tests (restrictions); better p estimation

NEURAL NETWORKS – THE LESS CONVENTIONAL DATA ANALYSIS APPROACH IN THE FUTURE?
• Back propagation neural network – layers containing neurons
• input vector
• input layer
• hidden layer
• output layer
• output vector
• Clearly can have different types of input and output vectors, e.g.

INPUT VECTORS       OUTPUT VECTORS
> 1 Predictor       1 or more Responses       Regression
> 1 ‘Responses’     1 or more ‘Predictors’    Inverse regression or calibration
> 1 Variables       2 or more Classes         Discriminant analysis
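The layered structure listed above (input vector, hidden layer, output vector) can be made concrete with a forward pass through a three-layer network. This is only a sketch: the sigmoid activation in the hidden layer, the linear output layer, and all weights are assumptions for illustration, and the back-propagation training step itself is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation, assumed here for the hidden neurons."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: input vector -> hidden layer -> output vector.

    w_hidden[j] holds the weights from all inputs to hidden neuron j;
    w_out[m] holds the weights from all hidden neurons to output m.
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return [
        sum(w * h for w, h in zip(weights, hidden)) + bias
        for weights, bias in zip(w_out, b_out)
    ]


# Hypothetical network: 2 inputs, 3 hidden neurons, 1 output.
w_hidden = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [[1.0, -1.0, 0.5]]
b_out = [0.2]
print(forward([0.6, 0.4], w_hidden, b_hidden, w_out, b_out))
```

In calibration the input vector would be the taxon abundances of a sample and the output vector the environmental variable(s) to be reconstructed.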

CALIBRATION (INVERSE REGRESSION) AND ENVIRONMENTAL RECONSTRUCTIONS
Malmgren & Nordlund (1997) Palaeo-3 136, 359–373
Planktonic foraminifera, 54 core-top samples
Summer water and winter water temperatures
Core E48–22: extends to oxygen stage 9, 320,000 years

Compared neural network as a calibration tool with:
  Imbrie & Kipp principal component regression
  2-block PLS (SIMCA)
  Modern analog technique (MAT)
  WA-PLS

CRITERION FOR NETWORK SUCCESS
Cross-validation, leave-one-out
Estimate RMSE (average error rate in training set)
         RMSEP (predictions based on leave-one-out cross-validation)
3 neurons, 600–700 cycles

                  RMSEP (°C)          r
                  Summer   Winter     Summer   Winter
Neural network    0.71     0.76       0.99     0.98
PLS               1.01     1.05       0.98     0.97
MAT               1.26     1.14       0.97     0.96
Imbrie & Kipp     1.22     1.05       0.97     0.96
WA-PLS            1.04     0.86       0.97     0.96

Changes in root-mean-square errors (RMSE) for S in relation to number of training epochs for 3-layer BP neural networks with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. The networks were trained over 50 intervals of 100 epochs each (a total of 5,000 epochs). As expected, the RMSEs decrease as training proceeds. The minimum RMSE, 0.3539, was obtained after training a network with 10 neurons in the hidden layer over 5,000 epochs. Similar results were also obtained for W (not shown in diagram).

Changes in root-mean-square errors of prediction (RMSEP) for S with increasing number of training epochs in a 3-layer back propagation neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. These error rates were determined using the Leave-One-Out technique, implying training of the networks over 54 sets consisting of 53 observations each, with one observation left out for later testing. The lowest RMSEPs for both S and W, 0.7176 and 0.7636, respectively, were obtained for a configuration with 3 neurons (only the results for S are shown in the diagram). Note that set-ups with 1, 2, and 3 neurons gave lower RMSEPs than for 4, 5, and 10 neurons.

Relationships between observed and predicted S (summer) and W (winter) using a 3-layer BP neural network with 3 neurons in the hidden layer. Lines are linear regression lines. The product-moment correlation coefficients (r) are shown in the lower right hand corners.

Prediction errors for different network configurations: root-mean-square errors for the differences between observed and predicted S and W using a 3-layer BP neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. Root-mean-square errors of prediction (RMSEP) are based on the Leave-One-Out technique in which each of the 54 observations in the data set is left out one at a time and the network is trained on the remaining observations. The trained network is then used to predict the excluded observation. The network was run over 50 intervals of 100 epochs each, and the error rates were recorded after each interval.
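The leave-one-out procedure described in this caption, and the contrast between apparent RMSE and RMSEP, can be sketched with any model that can be refitted quickly. The sketch below swaps the neural network for a one-predictor least-squares line purely to keep the example short and self-contained; the function names and data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def rmse(observed, predicted):
    """Root-mean-square difference between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def apparent_rmse(xs, ys):
    """Error when the model predicts the same samples it was fitted to."""
    a, b = fit_line(xs, ys)
    return rmse(ys, [a + b * x for x in xs])


def loo_rmsep(xs, ys):
    """Leave each sample out in turn, refit, and predict the excluded one."""
    predictions = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(a + b * xs[i])
    return rmse(ys, predictions)


# Hypothetical training set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3]
print(apparent_rmse(xs, ys), loo_rmsep(xs, ys))
```

For a least-squares line the leave-one-out error can never be smaller than the apparent error; for flexible models such as neural networks the gap is typically much larger.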

Prediction error for different methods: root-mean-square errors of prediction (RMSEP) for S and W obtained from a 3-layer BP network, Imbrie-Kipp Transfer Functions (IKTF), the Modern Analog Technique (MAT), and Soft Independent Modelling of Class Analogy (SIMCA). Predictions were made using the Leave-One-Out technique.

Predictions of S and W in core E48-22 from southern Indian Ocean based on a BP network, compared to the oxygen isotope (δ18O of Globorotalia truncatulinoides) curve presented by Williams (1976) for the uppermost 440 cm of the core. The cross-correlation coefficients for the relationships between δ18O and the predicted S and W are –0.68 and –0.71, respectively, for zero lags (p < 0.001). Interglacial isotope stages 1, 5, 7, and 9, as interpreted here, are indicated in the diagram.

Problems with ANN implementation and cross-validation Easy to over-fit the model. Leave-one-out cross-validation is not a stringent test as ANN will continue to train and optimise its network to the one sample left out. Need a training set (ca. 80%) and an optimisation (or selection set) (ca. 10%) to select the ANN model with the lowest prediction error AND an independent test set (ca. 10%) whose prediction error is calculated using the model selected by the optimisation set. Telford et al. (2004) Palaeoceanography 19

947 Atlantic foraminifera data. Split randomly 100 times into training set (747 samples), optimisation set (100 samples), and test set (100 samples). No advantage in the hours of ANN computing when cross-validated rigorously. ANN appears to be a very complicated (and slow) way of doing a MAT! May not be so good after all!
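The three-way split described above (training, optimisation, and independent test sets, repeated 100 times) can be sketched as follows. Only the index bookkeeping is shown; the function name and seed are hypothetical, and fitting and selecting the ANN on each split is left out.

```python
import random


def three_way_split(n_samples, n_optimise, n_test, rng):
    """Randomly partition sample indices into train / optimise / test sets."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    test = indices[:n_test]
    optimise = indices[n_test:n_test + n_optimise]
    train = indices[n_test + n_optimise:]
    return train, optimise, test


# 947 samples split 747 / 100 / 100, repeated 100 times as on the slide.
rng = random.Random(1)
splits = [three_way_split(947, 100, 100, rng) for _ in range(100)]
train, optimise, test = splits[0]
print(len(train), len(optimise), len(test))
```

The model with the lowest error on the optimisation set is the one evaluated on the test set, so the test samples play no part in either training or model selection.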

DIATOMS AND NEURAL NETWORKS
Descriptive statistics for the SWAP diatom-pH data set
No. of samples                   167
No. of taxa                      267
% no. of +ve values in data      18.47
Total inertia                    3.39

SWAP data-set: 167 lakes convergence Artificial Neural Network Yves Prairie & Julien Racca (2002)

SWAP data-set: 167 lakes jack-knife predicted pH against observed pH Yves Prairie & Julien Racca (2002)

pH reconstruction by ANN and WA-PLS: (RLGH core) Yves Prairie & Julien Racca (2002)

SKELETONISATION ALGORITHM
Pruning algorithm comparable to BACKWARD ELIMINATION in regression models
1. Measure relevance Pi for each taxon i: Pi = E(without i) – E(with i), where E = RMSE
2. Train network with all taxa using back-propagation
3. Compute relevance Pi based on error propagation and weights
4. Delete the taxon with smallest estimated relevance Pi [did this in 5% classes of importance]
5. Re-train the network to a minimum again [after deleting a taxon, the values of the remaining taxa are not re-calculated, so the input data are always the same original relative abundance values]
Racca et al. (2003)
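The pruning loop above can be sketched in the style of backward elimination. This sketch is not the skeletonisation procedure itself: it computes each relevance Pi directly from its definition by re-evaluating a user-supplied error function without taxon i, whereas the slide estimates Pi from error propagation and the network weights. The error function, taxon names, and usefulness values are hypothetical stand-ins for a trained network.

```python
def relevance(taxa, error_fn):
    """P_i = E(without taxon i) - E(with all current taxa)."""
    base = error_fn(taxa)
    return {t: error_fn([u for u in taxa if u != t]) - base for t in taxa}


def prune(taxa, error_fn, n_keep):
    """Repeatedly delete the least relevant taxon until n_keep remain."""
    kept = list(taxa)
    history = [(list(kept), error_fn(kept))]
    while len(kept) > n_keep:
        scores = relevance(kept, error_fn)
        least = min(kept, key=lambda t: scores[t])
        kept.remove(least)
        # With a real network, re-training to a minimum would happen here.
        history.append((list(kept), error_fn(kept)))
    return kept, history


# Hypothetical stand-in for model error: each taxon lowers RMSE by a fixed amount.
usefulness = {"taxon_a": 0.5, "taxon_b": 0.05, "taxon_c": 0.2}


def toy_error(taxa):
    return 1.0 - sum(usefulness[t] for t in taxa)


kept, history = prune(list(usefulness), toy_error, n_keep=2)
print(kept)
for subset, err in history:
    print(len(subset), round(err, 3))
```

As on the slide, the input values of the remaining taxa are left untouched after each deletion; only the set of taxa offered to the model changes.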

N2 ANN functionality

Leave-one-predicted pH ANN

ROUND LOCH OF GLENHEAD
Panels: 0% pruned ANN; 30% pruned ANN; 60% pruned ANN; 85% pruned ANN; All taxa WA; All taxa ML

General characteristics of the 37 most functional taxa for calibration based on ANN modelling approach.

Summary statistics of the SWAP diatom pH inference models according to the classes of taxa included based on the Skeletonisation procedure (apparent and cross-validation).

Ideally, apparent RMSE should be a reliable measure of the actual predictive ability of a model, and the difference between apparent and cross-validated RMSE indicates the extent to which the model has overfitted the data.

Examples of the recently published diatom-based inference models in palaeolimnology used.

CURSE OF DIMENSIONALITY
Related to ratio of number of taxa to number of lakes, as this ratio determines the ratio of the dimensional space in which the function is determined to the number of observations for which the function is determined.
MAXIMUM ROBUSTNESS – ratio of taxa : lakes as small as possible
(1) increase the number of lakes
(2) decrease the number of taxa

“Neural networks have the potential for data analysis and represent a viable alternative to more conventional data-analytical methods”. Malmgren & Nordlund (1997)
• Advantages:
  1) Mixed linear and non-linear responses.
  2) Good empirical performance.
  3) Wide applicability.
  • Many predictors and many ‘responses’.
• Disadvantages:
  1) Very much a black box.
  2) Conceptually complex.
  • Little underlying theory.
  • Easy to misuse and report erroneous model performance statistics.

PATTERN RECOGNITION
• Unsupervised (cluster analysis, indirect gradient analysis) or supervised (discriminant analysis, direct gradient analysis)
Diagram labels: Statistical theory; Linear methods; Discriminants & Decision Theory; Non-parametric methods; CART trees; Nearest-neighbour K-NN; Neural network; LDA; BELIEF NETWORKS

VOLCANIC TEPHRAS IN N.W. EUROPE OF LATE-GLACIAL AND EARLY HOLOCENE AGE
• Vedde Ash (rhyolitic type): mid Younger Dryas, ca 10600 14C yrs BP
  • Kråkenes, Norway
  • Several other sites in W Norway
  • Borrobol, Scotland
  • Tynaspirit, Scotland
  • Whitrig, Scotland
• Vedde (basaltic type)
  • Kråkenes, W Norway
• Borrobol: lower LG Interstadial, ca 12500 14C yrs BP
  • Borrobol, Scotland
  • Tynaspirit, Scotland
  • Whitrig, Scotland
• Saksunarvatn: early Holocene, ca 9000 14C yrs BP = 9930 – 10010 cal yr
  • Faeroes
  • Kråkenes, W Norway
  • Dallican Water, Shetland
Variables: SiO2, TiO2, Al2O3, FeO, MnO, MgO, CaO, Na2O, K2O
“The way in which correlation by tephrochronology may revolutionise approaches to reconstructing the sequence of events in the N.E. Atlantic region...” Lowe & Turney (1997)

Figure panels: SiO2, Al2O3, TiO2, FeO, MgO, CaO, K2O, Na2O; groups in each panel: VB, B, S, V

2 = 0.841 28% 1 = 0.988 32.9% CANONICAL VARIATES ANALYSIS (= multiple discriminant analysis) Group means

CVA – individual samples
Groups: Saksunarvatn; Vedde; Borrobol; Vedde basaltic

CVA

CVA – biplot of variables

Minimum-variance cluster analysis, √% data = chord distance, cophenetic correlation 0.955
Legend: Borrobol • Saksunarvatn • Vedde basaltic • Vedde Norway • Vedde Scotland
Clusters: Borrobol; Saksunarvatn; Vedde basaltic; Vedde Scotland + a few Vedde Norway; Vedde Norway
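One reading of “√% data = chord distance” is that the Euclidean distance between square-root-transformed proportions is used as the dissimilarity between two samples. The sketch below implements that reading as an assumption; the function name and the oxide percentages are hypothetical, not values from the tephra data.

```python
import math


def chord_distance(sample_a, sample_b):
    """Euclidean distance between square-root-transformed proportions.

    Each sample is a list of percentages (or counts) for the same
    variables; values are first rescaled to proportions summing to 1.
    """
    total_a, total_b = sum(sample_a), sum(sample_b)
    return math.sqrt(
        sum(
            (math.sqrt(a / total_a) - math.sqrt(b / total_b)) ** 2
            for a, b in zip(sample_a, sample_b)
        )
    )


# Hypothetical compositions (percentages of three oxides).
shard_1 = [70.0, 20.0, 10.0]
shard_2 = [50.0, 30.0, 20.0]
print(chord_distance(shard_1, shard_2))
```

The distance ranges from 0 for identical compositions to √2 for samples with no variables in common, and the square-root transformation damps the influence of the dominant component.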

2 = 0.016 1.6% PCA √% data 97.4% 1 = 0.96 95.9%

Legend: Vedde Norway; Saksunarvatn; Borrobol; Vedde Scotland; Vedde basaltic
