How to deal with input data characteristics in data-driven aquatic species distribution modelling: in search of an optimal cross-validation categorisation scheme
Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton
Shinji Fukuda, Assistant Professor, Faculty of Agriculture, Kyushu University
shinji-fkd@agr.kyushu-u.ac.jp
Outline: How to split our data in cross-validation?
• Introduction
• Method
  • Species distribution data
  • Modelling approach
  • How to split data in cross-validation
  • Performance measures
  • Habitat information
• Results and Discussion
• Summary
Shinji Fukuda <shinji-fkd@agr.kyushu-u.ac.jp>
Introduction: Why is cross-validation necessary?
• To ensure that your model generalises well
  • Avoid over-fitting and improve generalisation ability
  • Obtain better model parameters
  • K-fold cross-validation
  • Leave-one-out cross-validation
  • Repeated random sub-sampling validation
• To compare your target models fairly
  • Nested cross-validation
[Figure: 10-fold nested CV; data sets labelled Calibration, Validation, Training and Testing]
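A minimal sketch of how index sets for nested cross-validation could be generated. It assumes that the outer loop separates training from testing data and that the inner loop separates calibration from validation data; the figure labels on the slide do not confirm this mapping. The function names and the toy sizes are hypothetical.

```python
import random


def kfold_indices(indices, k, seed):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n, k_outer=10, k_inner=10, seed=0):
    """Yield (calibration, validation, testing) index lists.

    Assumption: outer folds hold out the testing data; the remaining
    training data are split again into calibration and validation sets.
    """
    outer_folds = kfold_indices(range(n), k_outer, seed)
    for i, testing in enumerate(outer_folds):
        training = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        inner_folds = kfold_indices(training, k_inner, seed + 1 + i)
        for m, validation in enumerate(inner_folds):
            calibration = [
                idx for j, fold in enumerate(inner_folds) if j != m for idx in fold
            ]
            yield calibration, validation, testing


# Toy example: 50 data points, 5 outer folds, 4 inner folds.
splits = list(nested_cv_splits(n=50, k_outer=5, k_inner=4))
```

Each element of `splits` partitions the data into three disjoint sets, so the testing data never take part in parameter selection.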
Objective: How to split our data when applying CV?
• Compare three potential methods
  • Random splitting
  • Stratified splitting
    • Prevalence-based splitting (output variable)
    • Habitat space-based splitting (input variables)
• Criteria
  • Model performance
  • Habitat information
[Figure: axis label "depth"; legend: Absence, Presence]
Methods: Species distribution data (3 species)
• Anguilla anguilla (European eel): 46%
• Luciobarbus guiraonis (Barbo Mediterraneo): 69%
Methods: Species distribution data (statistics)
1. River
2. Number of exotic fish species (%)
3. Channel length without artificial barriers (km)
4. Number of tributaries between artificial barriers
5. Altitude (m a.s.l.)
6. Drainage area (km²)
7. Drainage area between artificial barriers (km²)
8. Distance from headwater source (km)
9. Natural slope of the channel (%)
10. Solar radiation (Wh/m²)
11. Water temperature (°C)
12. Mean annual flow rate (m³/s)
13. Mean monthly flow (two years before sampling) (m³/s)
14. Inter-annual mean flow (calculated for 5 years) (m³/s)
15. Inter-monthly flow variation of the mean monthly flows (5 years before sampling)
16. Coefficient of variation of mean annual flows (calculated for 5 years)
Methods: How to split data (random or stratified)
• Random splitting
  #0 Read data
  #1 Set the number of CV folds
  #2 Set the number of calibration and validation data points
  #3 Set a random seed
  #4 Split the data set
[Figure legend: Absence, Presence]
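The random-splitting steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data set is represented only by its row indices, and the function names, the toy size of 23 points and the seed are hypothetical.

```python
import random


def random_kfold(n_points, n_folds, seed):
    """Randomly assign row indices to n_folds folds of near-equal size."""
    indices = list(range(n_points))   # Step 0: stand-in for the rows read from file
    rng = random.Random(seed)         # Step 3: fixed seed makes the split reproducible
    rng.shuffle(indices)
    # Steps 1, 2 and 4: dealing the shuffled indices round-robin fixes the
    # number of folds and gives fold sizes that differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]


def calibration_validation(folds, fold_id):
    """Use one fold for validation and the remaining folds for calibration."""
    validation = sorted(folds[fold_id])
    calibration = sorted(
        idx for j, fold in enumerate(folds) if j != fold_id for idx in fold
    )
    return calibration, validation


folds = random_kfold(n_points=23, n_folds=5, seed=42)
calibration, validation = calibration_validation(folds, fold_id=0)
```

Because presence and absence are ignored here, the prevalence of each fold can drift away from that of the full data set, which is what the stratified schemes address.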
Methods: How to split data (random or stratified)
• Stratified splitting (prevalence)
  #0 Read data
  #1 Set the number of CV folds
  #2 Set the number of calibration and validation data points
  #3 Set a random seed
  #4 Split the data set according to prevalence or abundance
    #4-1 Separate presence and absence data
    #4-2 Split each of presence and absence data according to the number of CV folds
    #4-3 Merge presence and absence data
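A minimal sketch of the prevalence-based steps, assuming presence is coded as 1 and absence as 0. The labels below are toy values, not the study data, and the function name is hypothetical.

```python
import random


def prevalence_stratified_kfold(labels, n_folds, seed):
    """Split row indices so that each fold has roughly the overall prevalence."""
    rng = random.Random(seed)
    # Step 4-1: separate presence and absence data.
    presence = [i for i, y in enumerate(labels) if y == 1]
    absence = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(presence)
    rng.shuffle(absence)
    # Step 4-2: split each class into n_folds parts.
    # Step 4-3: merge the k-th presence part with the k-th absence part.
    return [presence[k::n_folds] + absence[k::n_folds] for k in range(n_folds)]


# Toy labels (hypothetical): 12 presences and 18 absences, i.e. 40% prevalence.
labels = [1] * 12 + [0] * 18
folds = prevalence_stratified_kfold(labels, n_folds=5, seed=42)
presence_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

With 12 presences and 5 folds, every fold receives 2 or 3 presences, so no fold can end up without presence records.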
Methods: How to split data (random or stratified)
• Stratified splitting (habitat)
  #0 Read data
  #1 Set the number of CV folds
  #2 Set the number of calibration and validation data points
  #3 Set a random seed
  #4 Split the data set according to Euclidean distance of habitat variables
    #4-1 Calculate the Euclidean distance (D) of each data point from the centre of the habitat variable space
    #4-2 Split the data according to the Euclidean distance of habitat variables
    #4-3 Merge data sets to generate calibration and validation data sets
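One possible reading of the habitat space-based steps is sketched below. Several details are assumptions that the slide does not state: the variables are standardised before distances are computed, the centre is the mean of each variable, and "split according to distance" is implemented by ranking the points by distance and dealing each block of neighbouring ranks across the folds. The habitat values are randomly generated toy data.

```python
import math
import random
import statistics


def habitat_distances(rows):
    """Step 4-1: Euclidean distance of each point from the centre of habitat space.

    Assumption: each variable is scaled by its standard deviation and the
    centre is the vector of variable means.
    """
    columns = list(zip(*rows))
    centre = [statistics.fmean(c) for c in columns]
    spread = [statistics.pstdev(c) or 1.0 for c in columns]
    return [
        math.sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(row, centre, spread)))
        for row in rows
    ]


def habitat_stratified_kfold(rows, n_folds, seed):
    """Steps 4-2 and 4-3: spread near-centre and outlying points over all folds."""
    distance = habitat_distances(rows)
    order = sorted(range(len(rows)), key=lambda i: distance[i])
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for start in range(0, len(order), n_folds):
        block = order[start:start + n_folds]   # points with similar distance
        rng.shuffle(block)                     # random assignment within the block
        for fold, index in zip(folds, block):
            fold.append(index)
    return folds, distance


# Toy habitat data (hypothetical): two variables for 20 sites.
toy = random.Random(1)
habitat = [[toy.uniform(0.1, 2.0), toy.uniform(5.0, 25.0)] for _ in range(20)]
folds, distance = habitat_stratified_kfold(habitat, n_folds=5, seed=42)
```

Each fold then contains one point from every distance block, so calibration and validation sets cover a similar range of habitat conditions.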
Methods: Random Forests (Breiman, 2001), no CV within training data
[Flowchart: Data → bootstrap sampling (Sample 1, …, Sample k, …, Sample 500) → development of a CART model for each sample → voting for classification → output (CCI, AUC, etc.)]
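The flow of bootstrap sampling, one tree per sample and majority voting can be illustrated with a deliberately simplified sketch. It is not Breiman's full algorithm: a one-split stump on a single randomly chosen variable stands in for each CART model, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """One-split classifier on one variable: a stand-in for a full CART tree."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(v != left_label for v in left) + sum(v != right_label for v in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    if best is None:  # variable is constant in this sample: predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return feature, float("inf"), majority, majority
    return feature, best[1], best[2], best[3]


def fit_forest(X, y, n_trees=500, seed=0):
    """Fit one stump per bootstrap sample, each on a randomly chosen variable."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]   # bootstrap: draw n rows with replacement
        feature = rng.randrange(n_features)             # random variable selection
        forest.append(fit_stump([X[i] for i in sample], [y[i] for i in sample], feature))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): presence (1) at sites with high values of variable 0.
X = [[0.1 * i, 20.0 - i] for i in range(20)]
y = [0] * 10 + [1] * 10
forest = fit_forest(X, y, n_trees=500, seed=0)
```

The predicted classes from `predict` are what the performance measures (CCI, AUC, etc.) would then be computed on.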
Methods: Performance measures
• CCI: Correctly Classified Instances
  CCI = (TP + TN) / n, where TP and TN are the numbers of true positives and true negatives and n is the size of the data set
• AUC: Area under the ROC curve
  • Ranges between 0 and 1; 0.5 corresponds to random prediction and 1 is perfect
• NSE: Nash-Sutcliffe Efficiency
  • Ranges between -∞ and 1, with 1 being perfect
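The three measures can be written out in a few lines. This sketch uses the standard definitions: AUC is computed from pairwise rank comparisons (the Mann-Whitney form), and NSE is one minus the ratio of the squared prediction error to the variance of the observations around their mean. Applying NSE to predicted probabilities of presence is an assumption here, and the observations and scores are toy values.

```python
def cci(observed, predicted):
    """Correctly classified instances: (TP + TN) / n."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)


def auc(observed, scores):
    """Probability that a presence scores higher than an absence (ties count 0.5)."""
    pos = [s for o, s in zip(observed, scores) if o == 1]
    neg = [s for o, s in zip(observed, scores) if o == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - sum((o - s)^2) / sum((o - mean(o))^2)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Toy example (hypothetical): observed presence/absence and predicted probabilities.
observed = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(round(cci(observed, predicted), 3))  # 0.667
print(round(auc(observed, scores), 3))     # 0.889
print(round(nse(observed, scores), 3))     # 0.42
```

CCI depends on the 0.5 threshold chosen here, whereas AUC and NSE are computed directly from the scores.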
Methods: Habitat information
• Variable importance
  • Which habitat variables are important?
  • Computed internally in the RF algorithm (varImpPlot)
• Habitat suitability curve
  • Which habitat conditions are important?
  • Computed internally in the RF algorithm (partialPlot)
[Figure: suitability axis from 0 (unsuitable) to 1 (suitable)]
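The ideas behind the two outputs can be sketched in a model-agnostic way. This is not the internal computation behind varImpPlot or partialPlot: importance is approximated here by the increase in squared error after shuffling one variable, and the suitability curve by the mean prediction with one variable held at each grid value. The stand-in model, the data and all names are hypothetical.

```python
import random


def toy_model(row):
    """Hypothetical fitted model: suitability rises with variable 0 only."""
    return min(1.0, max(0.0, row[0] / 2.0))


def permutation_importance(predict, X, y, feature, seed=0):
    """Increase in mean squared error after shuffling one variable."""
    def mse(rows):
        return sum((predict(r) - yi) ** 2 for r, yi in zip(rows, y)) / len(y)

    column = [row[feature] for row in X]
    random.Random(seed).shuffle(column)
    shuffled = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
    return mse(shuffled) - mse(X)


def partial_dependence(predict, X, feature, grid):
    """Mean prediction with one variable fixed at each grid value."""
    curve = []
    for value in grid:
        rows = [row[:feature] + [value] + row[feature + 1:] for row in X]
        curve.append(sum(predict(r) for r in rows) / len(rows))
    return curve


# Toy data (hypothetical): variable 0 drives suitability, variable 1 is ignored.
X = [[0.2 * i, float(i % 4)] for i in range(10)]
y = [toy_model(row) for row in X]

importance = [permutation_importance(toy_model, X, y, f) for f in (0, 1)]
curve_0 = partial_dependence(toy_model, X, 0, grid=[0.0, 1.0, 2.0])
curve_1 = partial_dependence(toy_model, X, 1, grid=[0.0, 1.0, 2.0, 3.0])
```

A variable that the model ignores gets zero importance and a flat curve, while an influential variable produces a curve whose shape shows which conditions are suitable.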
Results & Discussion: Performance (A. anguilla), random splitting
Results & Discussion: Performance (A. anguilla), prevalence-based splitting
Results & Discussion: Performance (A. anguilla), habitat space-based splitting
Results & Discussion Variable importance (A. anguilla)
Results & Discussion: Habitat suitability curves
[Figure panels: Altitude (high), Slope (mid), Water Temp. (low)]
Summary: How to split the data when applying CV?
• Three methods were compared
  • Random splitting
  • Prevalence-based splitting
  • Habitat space-based splitting
• Random Forests was used as the species distribution model (SDM)
• Model performance (AUC, NSE)
  • Very similar across the three methods (almost no influence on RF modelling)
• Habitat information
  • Variable importance was similar (accuracy was similar…)
  • Habitat suitability curves (HSCs) varied in shape and range (to some extent, in relation to the degree of variable importance)
• A much clearer influence may be observed for data with low prevalence
Thank you very much!!!
• Short summary
  • Cross-validation schemes: Random & Stratified (output & input)
  • Data-driven method: Random Forests
  • Model performance & habitat information
    • Influence was not clear for accuracy and variable importance
    • HSC shapes were variable
[Photo: A beautiful day in Ghent, Belgium]