Analyzing Polynomials and Model Fits in Data Handling

Data Handling & Analysis: Polynomials and model fit • Andrew Jackson • a.jackson@tcd.ie

Linear type data • How are two measures related?

What do we do about curvature? • Data are the number of species recorded (Y) against the time spent looking for them (X) • Specifically, these come from fisheries data • Good proxy for species diversity in the marine habitat

Clearly a straight line won’t do

… the residuals are horrible

Polynomials • Polynomials are linear equations (linear in their coefficients) that show curvature • Quadratics • $Y = b_0 + b_1 X + b_2 X^2$ • Cubics • $Y = b_0 + b_1 X + b_2 X^2 + b_3 X^3$ • 5th, 6th order polynomials etc…
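
As a rough illustration of how the coefficients b0, b1, b2, … might be estimated, the sketch below fits a polynomial of a chosen order by ordinary least squares. It uses only the Python standard library; the function names `fit_polynomial` and `predict` and the example data are invented for illustration and are not part of the original slides.

```python
def fit_polynomial(x, y, order):
    """Least-squares coefficients [b0, b1, ..., b_order] for
    Y = b0 + b1*X + b2*X^2 + ... (hypothetical helper)."""
    p = order + 1
    # Normal equations (X'X) b = X'y, built from sums of powers of x.
    a = [[sum(xi ** (i + j) for xi in x) for j in range(p)] for i in range(p)]
    rhs = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, p):
            f = a[r][col] / a[col][col]
            for c in range(col, p):
                a[r][c] -= f * a[col][c]
            rhs[r] -= f * rhs[col]
    # Back substitution.
    b = [0.0] * p
    for i in reversed(range(p)):
        s = sum(a[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (rhs[i] - s) / a[i][i]
    return b


def predict(b, x):
    """Evaluate the fitted polynomial at each value of x."""
    return [sum(bk * xi ** k for k, bk in enumerate(b)) for xi in x]


# Invented example data: an exact quadratic, Y = 1 + 2X - 0.5X^2.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1 + 2 * xi - 0.5 * xi ** 2 for xi in x]

quadratic = fit_polynomial(x, y, 2)  # three coefficients
cubic = fit_polynomial(x, y, 3)      # four coefficients
```

With these noise-free data the cubic fit returns a b3 that is essentially zero, which is one way to see that the extra parameter adds nothing here.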

Quadratic model

Quadratic residuals • Better… • But not so good at lower values of x • Try a more complicated model like a cubic

Cubic model • Note the double curvature • Model appears to explain the lower values better • But how sure are we of the increase at higher values?

Cubic residuals • Better than the quadratic • But still over-estimating the lowest values of x

Log transform the X variable • Model is • Y~log(X) • Appears to explain the data very well across the full range • Check the residuals…

Y~log(X) residuals • Now these look pretty near perfect
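
The contrast between a straight line on X and a straight line on log(X) can be sketched as follows. The data are invented (generated exactly from Y = 3 + 5 log(X)) purely to show the residual patterns; they are not the fisheries data, and `fit_line` is a hypothetical helper.

```python
import math


def fit_line(x, y):
    """Intercept and slope of the least-squares straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Invented data: search time (X) and species count (Y) on a log curve.
effort = [1, 2, 4, 8, 16, 32]
species = [3.0 + 5.0 * math.log(t) for t in effort]

# Model 1: Y ~ X (straight line on the raw scale).
a_raw, b_raw = fit_line(effort, species)
res_raw = residuals(effort, species, a_raw, b_raw)

# Model 2: Y ~ log(X) (straight line after transforming X).
log_effort = [math.log(t) for t in effort]
a_log, b_log = fit_line(log_effort, species)
res_log = residuals(log_effort, species, a_log, b_log)
```

Here `res_raw` changes sign systematically along X (negative at both ends, positive in the middle), the signature of unmodelled curvature, while `res_log` is zero to rounding error because the invented data follow the log model exactly.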

The null model • Consists of a mean and a variance only • It gives us a benchmark against which we can test our models that include more information • If we can’t do better than the null model then we don’t understand our data or system!
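
A minimal sketch of the null model, assuming it is summarised by the sample mean and the sample variance (n - 1 denominator). The function name `null_model` and the numbers are invented for illustration.

```python
def null_model(y):
    """Return the mean, the sample variance and the residuals
    of an intercept-only model (hypothetical helper)."""
    n = len(y)
    mean = sum(y) / n
    variance = sum((yi - mean) ** 2 for yi in y) / (n - 1)
    resid = [yi - mean for yi in y]  # every fitted value is the mean
    return mean, variance, resid


# Invented response values.
y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance, resid = null_model(y)
```

Any model that uses X should leave less residual spread than `resid` does; if it cannot, X is telling us nothing about Y.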

Residuals of the null model

Choosing between alternative models • We now have a choice between 5 models • Null model (zero order polynomial, which includes an intercept only – i.e. just a mean and variance model) • Straight line (first order polynomial) • Quadratic (second order polynomial) • Cubic (third order polynomial) • First order polynomial with log(X) • How do we select which one to use? • Higher order polynomials require more parameters

Parsimony as a central tenet • Parsimony is the application of the simplest explanation for a phenomenon and underpins all of science • So we need to pick the model that • Fits the data the best, and … • Uses the fewest parameters

Likelihood of data

AIC for model selection • We will use Akaike’s Information Criterion (AIC) to select the most suitable model • $\mathrm{AIC} = -2\log(\text{likelihood}) + 2k$ • Log-likelihood gets bigger the better the fit, so the first term gets smaller • k is the number of parameters in the model, so the second term penalises complexity • Lower AIC = more suitable model
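
The formula can be sketched in code under two labelled assumptions: the residuals are normally distributed with constant variance, and k counts the residual variance as a parameter alongside the regression coefficients. Statistical software may use a different convention for k, so treat the function names and this counting rule as illustrative only.

```python
import math


def gaussian_log_likelihood(resid):
    """Maximised log-likelihood of a model with normal errors.
    Assumes at least one residual is non-zero (log(0) is undefined)."""
    n = len(resid)
    sigma2 = sum(r * r for r in resid) / n  # maximum-likelihood variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(resid, n_coefficients):
    """AIC = -2 log(likelihood) + 2k, with k = coefficients + 1
    (the +1 is the residual variance; an assumption of this sketch)."""
    k = n_coefficients + 1
    return -2 * gaussian_log_likelihood(resid) + 2 * k


# Invented residuals from two hypothetical models of the same response.
loose_fit = [1.0, -1.0, 1.0, -1.0]   # e.g. intercept only
tight_fit = [0.1, -0.1, 0.1, -0.1]   # e.g. intercept and slope

aic_loose = aic(loose_fit, n_coefficients=1)
aic_tight = aic(tight_fit, n_coefficients=2)
```

In this invented case the tighter fit has the lower AIC even after paying for its extra coefficient, so it would be preferred.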

AIC of our models • Null model: 248.2 • Straight line: 184.1 • Quadratic: 142.5 • Cubic: 124.9 • 4th order: 83.5 • 5th order: 77.6 • 6th order: 77.7 • log(X): 68.4 • So the log(X) model is the best in this case • Note that adding more orders to the polynomials ceases to confer any benefit after 5th order. Also, these get increasingly difficult to explain and relate to biological phenomena
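
The ranking can be reproduced from the AIC values listed above. The sketch below also computes each model's difference from the best AIC, which is a convenient way to read the table but is an addition of this sketch rather than something shown on the slide.

```python
# AIC values as reported on the slide.
aic_values = {
    "null": 248.2,
    "straight line": 184.1,
    "quadratic": 142.5,
    "cubic": 124.9,
    "4th order": 83.5,
    "5th order": 77.6,
    "6th order": 77.7,
    "log(X)": 68.4,
}

# The preferred model is the one with the lowest AIC.
best = min(aic_values, key=aic_values.get)

# Difference from the best model, rounded to one decimal place.
delta = {name: round(value - aic_values[best], 1)
         for name, value in aic_values.items()}

# Models ordered from most to least suitable.
ranking = sorted(aic_values, key=aic_values.get)

for name in ranking:
    print(f"{name:>14}  AIC = {aic_values[name]:6.1f}  delta = {delta[name]:6.1f}")
```

The 6th order polynomial scores slightly worse than the 5th (77.7 against 77.6), which is the point at which the extra parameter stops paying for itself.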

Conclusions • AIC provides an objective way to compare alternative models • Lower AIC indicates a more parsimonious model • AIC must only be compared between models fitted to exactly the same response variable • It provides only a relative, not an absolute, indication of model fit • Still need to check that the model is any good • Residuals etc…
