Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton

How to deal with input data characteristics in data-driven aquatic species distribution modelling: in search of an optimal cross-validation categorisation scheme. Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton. Shinji Fukuda, Assistant Professor


Presentation Transcript


  1. How to deal with input data characteristics in data-driven aquatic species distribution modelling: in search of an optimal cross-validation categorisation scheme
     Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton
     Shinji Fukuda, Assistant Professor, Faculty of Agriculture, Kyushu University
     shinji-fkd@agr.kyushu-u.ac.jp

  2. Outline: How to split our data in cross-validation?
     • Introduction
     • Method
       • Species distribution data
       • Modelling approach
       • How to split data in cross-validation
       • Performance measures
       • Habitat information
     • Results and Discussion
     • Summary
     Shinji Fukuda <shinji-fkd@agr.kyushu-u.ac.jp>

  3. Introduction: Why is cross-validation necessary?
     • To ensure that your model is generally good
       • Avoid over-fitting / improve generalisation ability
       • Obtain better model parameters
       • K-fold cross-validation
       • Leave-one-out cross-validation
       • Repeated random sub-sampling validation
     • To compare your target models fairly
       • Nested cross-validation
     [Figure: 10-fold nested CV; labels: Calibration, Validation, Training, Testing]
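The nested scheme named on this slide can be sketched as two k-fold loops, one inside the other. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the pairing of the figure labels (outer loop gives training/testing, inner loop splits the training part into calibration/validation) is an assumption.

```python
import random


def kfold_indices(indices, k, seed=0):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(n, outer_k=10, inner_k=10, seed=0):
    """Yield (calibration, validation, testing) index lists.

    Assumed labelling: the outer loop holds out a testing fold, and the
    inner loop splits the remaining training data into calibration and
    validation parts (e.g. for tuning model parameters).
    """
    outer_folds = kfold_indices(range(n), outer_k, seed)
    for o, testing in enumerate(outer_folds):
        training = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
        inner_folds = kfold_indices(training, inner_k, seed + 1 + o)
        for v, validation in enumerate(inner_folds):
            calibration = [
                i for f, fold in enumerate(inner_folds) if f != v for i in fold
            ]
            yield calibration, validation, testing
```

With `outer_k=10` and `inner_k=10` this yields 100 calibration/validation/testing triples; the testing fold is never seen while parameters are tuned in the inner loop.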

  4. Objective: How to split our data when applying CV?
     • Compare three potential methods
       • Random splitting
       • Stratified splitting
         • Prevalence-based splitting (output variable)
         • Habitat space-based splitting (input variables)
     • Criteria
       • Model performance
       • Habitat information
     [Figure: axis label: depth; legend: Absence, Presence]

  5. Methods: Species distribution data (3 species)
     • Anguilla anguilla (European eel): 46%
     • Luciobarbus guiraonis (Barbo Mediterraneo): 69%

  6. Methods: Species distribution data (statistics)
     1. River
     2. Number of exotic fish species (%)
     3. Channel length without artificial barriers (km)
     4. Number of tributaries between artificial barriers
     5. Altitude (m a.s.l.)
     6. Drainage area (km²)
     7. Drainage area between artificial barriers (km²)
     8. Distance from headwater source (km)
     9. Natural slope of the channel (%)
     10. Solar radiation (Wh/m²)
     11. Water temperature (°C)
     12. Mean annual flow rate (m³/s)
     13. Mean monthly flow (two years before sampling) (m³/s)
     14. Inter-annual mean flow (calculated for 5 years) (m³/s)
     15. Inter-monthly flow variation of the mean monthly flows (5 years before sampling)
     16. Coefficient of variation of mean annual flows (calculated for 5 years)

  7. Methods: How to split data (random or stratified)
     • Random splitting
       #0 Read data
       #1 Set the number of CV folds
       #2 Set the number of calibration and validation data points
       #3 Set a random seed
       #4 Split the data set
     [Figure legend: Absence, Presence]
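Steps #1 to #4 of random splitting can be sketched as follows. This is an illustrative Python version under stated assumptions, not the code behind the slides: `random_kfold` is a hypothetical name, data reading (#0) is left out, and the calibration/validation sizes (#2) simply follow from the number of folds.

```python
import random


def random_kfold(n_points, n_folds=10, seed=1):
    """Return a list of (calibration, validation) index lists, one per fold."""
    rng = random.Random(seed)              # 3: fix the random seed
    indices = list(range(n_points))
    rng.shuffle(indices)                   # 4: random order, ignoring the labels
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        validation = sorted(folds[k])
        calibration = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        splits.append((calibration, validation))
    return splits
```

Because presence and absence are ignored, the share of presences can differ from fold to fold, which is the effect the stratified schemes try to control.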

  8. Methods: How to split data (random or stratified)
     • Stratified splitting (prevalence)
       #0 Read data
       #1 Set the number of CV folds
       #2 Set the number of calibration and validation data points
       #3 Set a random seed
       #4 Split the data set according to prevalence or abundance
          #4-1 Separate presence and absence data
          #4-2 Split each of presence and absence data according to the number of CV folds
          #4-3 Merge presence and absence data
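A minimal sketch of steps #4-1 to #4-3 for presence/absence data, assuming labels coded as 1 (presence) and 0 (absence). The function name is hypothetical and the abundance variant mentioned in #4 is not covered.

```python
import random


def prevalence_stratified_kfold(labels, n_folds=10, seed=1):
    """Return (calibration, validation) index lists with similar prevalence."""
    rng = random.Random(seed)
    # 4-1: separate presence and absence records
    presence = [i for i, y in enumerate(labels) if y == 1]
    absence = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(presence)
    rng.shuffle(absence)
    # 4-2 and 4-3: deal each class into the folds, then merge per fold
    folds = [presence[k::n_folds] + absence[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        validation = sorted(folds[k])
        calibration = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        splits.append((calibration, validation))
    return splits
```

Each validation fold then holds either the floor or the ceiling of (number of presences / number of folds) presence records, so prevalence stays close to that of the full data set.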

  9. Methods: How to split data (random or stratified)
     • Stratified splitting (habitat)
       #0 Read data
       #1 Set the number of CV folds
       #2 Set the number of calibration and validation data points
       #3 Set a random seed
       #4 Split the data set according to Euclidean distance of habitat variables
          #4-1 Calculate Euclidean distance of each data point from the centre of habitat variable space
          #4-2 Split the data according to the Euclidean distance of habitat variables
          #4-3 Merge data sets to generate calibration and validation data sets
     [Figure label: D]
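One possible reading of steps #4-1 to #4-3 is sketched below. Several details are assumptions that the slide does not state: the centre is taken as the per-variable mean, variables are z-scored before computing distances (they have very different units), and points are ranked by distance and dealt out block by block so that every fold spans the full distance range. Function names are hypothetical.

```python
import math
import random


def distance_from_centre(X):
    """4-1: Euclidean distance of each row from the centre (assumed: mean)
    of the habitat variable space, after z-scoring each variable."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
        for j in range(p)
    ]
    return [
        math.sqrt(sum(((row[j] - means[j]) / sds[j]) ** 2 for j in range(p)))
        for row in X
    ]


def habitat_stratified_kfold(X, n_folds=10, seed=1):
    """Return (calibration, validation) index lists stratified by distance."""
    rng = random.Random(seed)
    dist = distance_from_centre(X)
    order = sorted(range(len(X)), key=lambda i: dist[i])
    folds = [[] for _ in range(n_folds)]
    # 4-2: take blocks of neighbouring distances, one point per fold
    for start in range(0, len(order), n_folds):
        block = order[start:start + n_folds]
        fold_ids = list(range(n_folds))
        rng.shuffle(fold_ids)
        for i, f in zip(block, fold_ids):
            folds[f].append(i)
    # 4-3: merge the folds into calibration and validation sets
    splits = []
    for k in range(n_folds):
        validation = sorted(folds[k])
        calibration = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        splits.append((calibration, validation))
    return splits
```

Other readings are possible, for example binning the distances into a few classes and sampling within each class; the slide does not say which was used.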

  10. Methods: Random Forests (Breiman, 2001); no CV within training data
      [Flow diagram: Data → Bootstrap sampling (Sample 1, …, Sample k, …, Sample 500) → Development of CART model → Voting for classification → Output (CCI, AUC, etc.)]
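The bootstrap-and-vote structure in this diagram can be illustrated with a deliberately simplified sketch. It is not Random Forests proper: each "tree" here is a one-split decision stump on a single randomly chosen variable, used as a stand-in for a full CART tree, and all function names are hypothetical. Only the two ideas from the diagram are shown: one bootstrap sample per model and majority voting over 500 models.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Pick the threshold and orientation on one feature with fewest errors."""
    best = None
    for t in sorted({row[feature] for row in X}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if row[feature] <= t else right) != label
                for row, label in zip(X, y)
            )
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return feature, t, left, right


def predict_stump(stump, row):
    feature, t, left, right = stump
    return left if row[feature] <= t else right


def fit_forest(X, y, n_trees=500, seed=0):
    """Fit n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feature = rng.randrange(n_features)          # random variable choice
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]
```

A real analysis would use an established implementation; the slides refer to `varImpPlot` and `partialPlot`, which are functions of the R `randomForest` package.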

  11. Methods: Performance measures
      • CCI: Correctly Classified Instances
        • CCI = (TP + TN) / n, where TP and TN are the numbers of true positives and true negatives and n is the size of the data set
      • AUC: Area under the ROC curve
        • For a model at least as good as random guessing, ranges between 0.5 and 1, with 1 being perfect
      • NSE: Nash-Sutcliffe Efficiency
        • Ranges between -∞ and 1, with 1 being perfect
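The three measures can be written out directly from their standard definitions. The sketch below assumes observations coded 0/1, uses the rank-based (pairwise) form of AUC with ties counted as one half, and uses the usual NSE formula 1 − Σ(obs − sim)² / Σ(obs − mean(obs))². Function names are hypothetical.

```python
def cci(observed, predicted):
    """Correctly Classified Instances: (TP + TN) / n."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)


def auc(observed, scores):
    """Probability that a random presence scores higher than a random absence."""
    pos = [s for o, s in zip(observed, scores) if o == 1]
    neg = [s for o, s in zip(observed, scores) if o == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 minus squared error over observed variance."""
    mean_obs = sum(observed) / len(observed)
    numerator = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    denominator = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - numerator / denominator
```

For example, with `observed = [1, 1, 0, 0]` and `scores = [0.9, 0.4, 0.6, 0.1]`, thresholding the scores at 0.5 gives a CCI of 0.5, while `auc(observed, scores)` is 0.75 because three of the four presence/absence pairs are ranked correctly.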

  12. Methods: Habitat information
      • Variable importance
        • Which habitat variables are important?
        • Computed internally in the RF algorithm (varImpPlot)
      • Habitat suitability curve
        • Which habitat conditions are important?
        • Computed internally in the RF algorithm (partialPlot)
      [Figure: habitat suitability curve; axis: Suitability, from 0 (unsuitable) to 1 (suitable)]

  13. Results & Discussion: Performance (A. anguilla), random splitting

  14. Results & Discussion: Performance (A. anguilla), prevalence-based splitting

  15. Results & Discussion: Performance (A. anguilla), habitat space-based splitting

  16. Results & Discussion: Variable importance (A. anguilla)

  17. Results & Discussion: Habitat suitability curves
      [Figure panels: Altitude (high), Slope (mid), Water Temp. (low)]

  18. Summary: How to split the data when applying CV?
      • Three methods were compared
        • Random splitting
        • Prevalence-based splitting
        • Habitat space-based splitting
      • Random Forests was used as the species distribution model (SDM)
      • Model performance (AUC, NSE)
        • Very similar (almost no influence on RF modelling)
      • Habitat information
        • Variable importance was similar (accuracy was similar…)
        • Habitat suitability curves (HSCs) show variability in shape and range (to some extent, related to the degree of variable importance)
      • A much clearer influence may be observed for data with low prevalence

  19. Thank you very much!!!
      • Short summary
        • Cross-validation schemes: Random & Stratified (output & input)
        • Data-driven method: Random Forests
        • Model performance & habitat information
          • Influence was not clear for accuracy and variable importance
          • HSC shapes were variable
      [Photo: A beautiful day in Ghent, Belgium]
