Joint work between Eva Küster, Ingolf Kühn ~ UFZ

Analysing the link between traits & invasive spread in German flora: accounting for residence time Joint work between Eva Küster, Ingolf Kühn ~ UFZ Adam Butler, Stijn Bierman, Glenn Marion ~ BioSS Athens ALARM meeting, January 2007

Introduction • Direct data on the arrival, establishment & spread of invasive species are typically not available at the national or pan-European levels • Indirect data about the traits & current spatial distribution of species that invaded in the past can be used to identify correlative relationships between traits and invasive success, accounting for phylogeny • Data on traits are often missing or ambiguous, however, creating serious problems for the analysis – we look at how to address these using Bayesian methods

Data • We analyse data on German vascular plants • Biolflor (www.ufz.de/biolflor): database with information on traits & phylogeny of 3660 species • Florkart (www.floraweb.de): database with information on presence/absence of 4000+ species for 2995 grid cells within Germany • We look at neophyte species (arrivals since 1490), excluding ephemerophytes: there are 388 such species • We use the # of grid cells occupied as a measure of invasive success

Morphology Life form Growth form Life span Generative reproductive cycles Propagation & dispersal Types of storage organs Existence of storage organs Types of shoot metamorphoses Types of root metamorphoses Leaf traits Leaf persistence Leaf anatomy Leaf form Flowering phenology Beginning of flowering season Length of flowering season End of flowering season Genetics Ploidy DNA content Niche breadth in Germany # hemerobic levels Urbanity # of habitat types # of vegetation formations # phytosociological classes Diaspores & germinules Types of diaspores Weights of diaspores Weights of germinules Invasive history Mode of introduction Residence time Life strategy Ecological strategy Ruderal life strategy Native global distribution Floristic zones of native area # floristic zones in native area Continent of native area # continents in native area Native in old or new world? Oceanity of native area Amplitude of oceanity Floral & reproductive biology Strategy types of reproduction Mating strategy Pollen vector Flower colour Floral UV pattern Floral UV reflection Blossom type

Current analysis by UFZ: Küster, Kühn and Klotz (in prep.) • Regress log(# grid cells occupied) onto each of the ~40 individual traits in turn, in the presence of phylogenetic variables • Retain only traits that are significant at the 95% level, exclude non-predictive traits, & then use cluster analysis to further reduce the set of traits • Use AIC to select the best model from within this set of traits, including interactions • At all stages, use only those species that have complete data for all traits currently in the model

Phylogenetic correction: Küster, Kühn and Klotz (in prep.) • Compute the patristic distance matrix based on the phylogenetic codes given in Biolflor • For the current set of species – • apply a principal coordinate analysis to the relevant part of the distance matrix • retain only axes associated with positive eigenvalues • then retain the axes that account for the first 80% of variation • then regress log(# grid cells occupied) onto the remaining axes and retain only those that are significant at the 95% level • The phylogenetic variables need to be recomputed whenever the set of species is changed
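The principal coordinate step above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the UFZ implementation: the function name `phylo_axes`, the eigenvalue tolerance and the toy distance matrix are assumptions made for the example, and the final step (regressing log occupancy onto the axes and keeping the significant ones) is left out.

```python
import numpy as np

def phylo_axes(dist, var_fraction=0.8):
    """Classical principal coordinate analysis of a symmetric distance matrix.

    Keeps only axes with positive eigenvalues, then the leading axes that
    together account for `var_fraction` of the positive-eigenvalue variation.
    Returns (coordinates, eigenvalues) for the retained axes.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centre = np.eye(n) - np.ones((n, n)) / n      # double-centring matrix
    gower = -0.5 * centre @ (d ** 2) @ centre     # Gower-centred matrix
    evals, evecs = np.linalg.eigh(gower)
    order = np.argsort(evals)[::-1]               # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > 1e-10 * evals.max()            # assumed tolerance for "positive"
    evals, evecs = evals[keep], evecs[:, keep]
    cum = np.cumsum(evals) / evals.sum()
    k = int(np.searchsorted(cum, var_fraction)) + 1
    return evecs[:, :k] * np.sqrt(evals[:k]), evals[:k]

# Toy distances between four hypothetical species lying on one line.
pts = np.array([0.0, 1.0, 3.0, 6.0])
dist = np.abs(pts[:, None] - pts[None, :])
axes, evals = phylo_axes(dist)
```

Because the retained axes depend on which rows of the distance matrix are included, the function has to be called again whenever the set of species changes.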

Missing data • A large number of species are currently excluded from the final analysis as data are missing on some of their traits • This is inefficient, & could potentially lead to bias if the data are missing not at random • The missing data arise from different sources – • there being no record in the Biolflor database • the qualifier in Biolflor suggesting that data quality is poor • multiple states being recorded for a particular trait • a very rare state being recorded

Residence times • Residence time is a particularly important variable because • it has good explanatory power to describe occupancy • it partly accounts for the dynamic nature of invasive processes • it allows us to make time-specific predictions about occupancy • However, data on German residence times are only available for 171 species, & for 35 of these only to the nearest century • Some auxiliary data are available for neighbouring countries • How can we properly include residence time in the analysis, given the large proportion of missing data?

Work at BioSS • The aims of our research on this at BioSS – • to explore how sensitive the results of inferences are to the assumptions that we make about missing data • to analyse the data in such a way that species with missing data for some traits do not need to be excluded • to relate the outputs from the analysis to invasive risk • We work with the Biolflor-Florkart data, and focus upon missing data for residence times; however, the methodological ideas are widely applicable

Application to toolkit • Application to the prediction of invasive risk • e.g. Use traits & phylogeny to infer the number of cells that a recently arrived species is likely to occupy after N years of residence • This number is uncertain, so it will be a probability distribution rather than a single number

Bayesian methods • An alternative approach to statistical modelling and inference, in which data are regarded as fixed and parameters are regarded as random • Increasingly widely used: due to improvements in computational power it is now often possible to fit more advanced models using Bayesian inference than using classical statistical methods • Particularly suitable for problems that involve missing data • Implemented using free software called WinBUGS: extremely powerful but not particularly user-friendly…

Bayesian modelling • Notation, for species i: $y_i$ = # of grid cells occupied, $r_i$ = residence time, $x_i$ = other trait data, $z_i$ = phylogenetic variables • Basic model: $\log y_i \sim N(\alpha + \beta x_i + \gamma z_i + \delta r_i, \sigma^2)$ …just the same as a GLM • Prior distributions: we use uninformative priors, $\alpha, \beta, \gamma, \delta \sim N(0, 1000)$ and $\sigma^2 \sim \mathrm{Gamma}(1/1000, 1/1000)$ • MCMC details: burn-in = 5000, sample = 2000, thinning ratio = 1:50 • Recast the UFZ methodology in a Bayesian context, and implement this in WinBUGS • Use this to explore potential refinements or extensions to the current analysis • Assess sensitivity to the assumptions about missing data, phylogenetic dependence and distribution of the response variable (log-normal or Binomial) • Develop ways of dealing more efficiently with missing data • Bayesian LPJ code: Ben Smith, Stephen Sitch, Sybil Schapoff; CRU data: David Viner; GCM data: PCMDI; Statistical methods: Jonathan Rougier, Chris Glasbey; Uncertainty analysis: Bjoern Reineking, Stijn Bierman
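To make the basic model concrete, here is a rough Gibbs-sampler sketch in Python for a log-normal regression with an intercept and coefficients for a trait, a phylogenetic axis and residence time. It is not the WinBUGS code used in this work. The assumptions are: the N(0, 1000) prior is read as variance 1000, the Gamma(1/1000, 1/1000) prior is placed on the precision 1/σ² (the usual WinBUGS convention) so that both updates are conjugate, the data are simulated, and the chain is much shorter than the burn-in and thinning quoted on the slide.

```python
import numpy as np

def gibbs_lognormal_regression(X, log_y, n_iter=2000, burn_in=500,
                               prior_var=1000.0, a=0.001, b=0.001, seed=0):
    """Gibbs sampler for log_y ~ N(X @ coef, 1/tau).

    Priors (assumed): coef_j ~ N(0, prior_var), tau ~ Gamma(a, rate=b).
    Returns an (n_iter, p + 1) array: coefficient draws, then sigma^2 draws.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    xtx, xty = X.T @ X, X.T @ log_y
    tau = 1.0
    draws = np.empty((n_iter, p + 1))
    for it in range(burn_in + n_iter):
        # Coefficients given the precision: multivariate normal.
        cov = np.linalg.inv(tau * xtx + np.eye(p) / prior_var)
        cov = 0.5 * (cov + cov.T)                 # enforce exact symmetry
        coef = rng.multivariate_normal(cov @ (tau * xty), cov)
        # Precision given the coefficients: gamma (NumPy takes scale = 1/rate).
        resid = log_y - X @ coef
        tau = rng.gamma(a + 0.5 * n, 1.0 / (b + 0.5 * resid @ resid))
        if it >= burn_in:
            draws[it - burn_in] = np.append(coef, 1.0 / tau)
    return draws

# Simulated stand-in data: columns are intercept, trait, phylogeny, residence.
rng = np.random.default_rng(42)
n = 200
x, z, r = rng.normal(size=(3, n))
X = np.column_stack([np.ones(n), x, z, r])
true_coef = np.array([1.0, 0.5, -0.3, 0.8])
log_y = X @ true_coef + rng.normal(scale=0.5, size=n)

draws = gibbs_lognormal_regression(X, log_y)
post_mean = draws.mean(axis=0)
prob_positive = (draws[:, 3] > 0).mean()   # P(residence-time coefficient > 0)
```

Quantities such as `prob_positive` are simple summaries of the posterior draws, which is one reason the Bayesian formulation is convenient for reporting the direction of an effect.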

Imputation • When data on residence times are missing, we can treat them as random variables • We can use data on the other traits, phylogeny & number of grid cells occupied to infer the distribution of the residence time for a particular species i, e.g. $\log r_i \sim N(\exp\{a + b x_i + c z_i + d y_i\}, s^2)$ • Use of the cut function ensures this does not bias inferences about $\alpha$, $\beta$, $\gamma$, $\delta$ and $\sigma$

Results: Ploidy (polyploid vs diploid) Pink result based on 124 species Other results based on 345 species 42 species excluded

Results: Ploidy (imputed values)

Results: Ploidy (predictions)

Results: Duration of flowering Pink result based on 135 species Other results based on 379 species 8 species excluded

Results: End of flowering Pink result based on 135 species Other results based on 379 species 8 species excluded

Results: End of flowering

Results: Pollen vector Pink result: 108 species Other results: 329 species 58 species excluded

Results: Shoot metamorphoses

(Note: posterior probability that  > 0 is always >0.99)

Further work 1: Data Not Missing at Random • Our model assumes that the data on residence times are missing at random, as does the approach of excluding missing data • We can also consider possible mechanisms by which the missing data might be related to the variables of interest • Let $o_i = 1$ if residence time is observed for species i, 0 otherwise • We could assume that $o_i \sim \mathrm{Binomial}(1, \mathrm{logit}^{-1}\{A + B x_i + C z_i + D y_i + E r_i\})$ • The parameter E cannot be estimated, but we can assess sensitivity to the value of it; we assume here that E is negative
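A small simulation can illustrate why the sign and size of E matter. The sketch below is a simplified, hypothetical version of the missingness mechanism above: it keeps only the intercept A and the residence-time term (B = C = D = 0), works with centred log residence times, and uses made-up values throughout. It compares the mean of the observed residence times with the mean over all species for several fixed values of E.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
log_r = rng.normal(4.0, 1.0, size=n)   # hypothetical log residence times

def inv_logit(u):
    return 1.0 / (1.0 + np.exp(-u))

def observed_mean(E, A=0.0):
    """Mean log residence time among species whose value is observed,
    when P(observed) = logit^-1(A + E * centred log residence time)."""
    p_obs = inv_logit(A + E * (log_r - log_r.mean()))
    observed = rng.random(n) < p_obs
    return log_r[observed].mean()

# Bias of the complete-case mean for a range of assumed E values.
bias = {E: observed_mean(E) - log_r.mean() for E in (0.0, -0.5, -1.0, -2.0)}
```

With E = 0 the data are missing completely at random and the bias is close to zero; as E becomes more negative, long-resident species are observed less often and the complete-case mean is pulled downwards. Since E cannot be estimated from the data, repeating an analysis over a grid of E values like this is one way to assess sensitivity.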

Results: End of flowering

Further work 2: Multiple traits • Relatively low proportions of missing data for the other key traits: can just exclude these when we look at traits individually, but more problematic when we look at effects of multiple traits • Most “missing data” for the other key traits arise because rare or duplicate trait states are recorded in Biolflor • We would like to incorporate this information directly into the analysis, rather than attempting to impute the missing values • We can deal with duplicate states either by assuming: • that the parameter for species that have both states is the average of the parameters for the two states; or • by including a separate parameter for species that have duplicate traits
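The two options for duplicate states correspond to two different codings of the design matrix. The sketch below shows one possible coding of each in Python; the function names and the three example records are hypothetical, with ploidy states used purely for illustration.

```python
import numpy as np

states = ["diploid", "polyploid"]
# Hypothetical records: the third species has both states recorded.
records = [{"diploid"}, {"polyploid"}, {"diploid", "polyploid"}]

def averaged_design(records, states):
    """Option 1: a species with several states gets equal weight on each,
    so its effect is the average of the state parameters."""
    X = np.zeros((len(records), len(states)))
    for i, rec in enumerate(records):
        for s in rec:
            X[i, states.index(s)] = 1.0 / len(rec)
    return X

def separate_design(records, states):
    """Option 2: an extra final column gives species with more than one
    recorded state their own parameter."""
    X = np.zeros((len(records), len(states) + 1))
    for i, rec in enumerate(records):
        if len(rec) > 1:
            X[i, -1] = 1.0
        else:
            X[i, states.index(next(iter(rec)))] = 1.0
    return X

X_avg = averaged_design(records, states)
X_sep = separate_design(records, states)
```

Either matrix can replace the usual indicator columns for the trait in the regression, so no species has to be dropped because of a duplicate record.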

Missing data in current analysis

Classical analysis, model = Traits + Phylogeny

Further work 3: Auxiliary residence time data • The imputation model allows us to draw inferences about residence times for species where the arrival date is unknown • The performance of the imputation model depends upon it containing regressors that are strongly correlated with residence time in Germany • Possibility of using data on residence in a neighbouring country, $n_i$, as an explanatory variable: $\log r_i \sim N(\exp\{a + b x_i + c z_i + d y_i + e n_i\}, s^2)$

Further work 4: Climate change • UFZ are using the species-level model to identify key traits for invasive success, & then a spatial approach to estimate impact of environmental change on these • A non-spatial approach might involve grouping cells according to environmental characteristics, & fitting the species-level model separately for each group of cells • We are interested in comparing these approaches
