This guide focuses on essential error detection and data cleaning methods for biodiversity data, highlighting various error types such as taxonomic and spatial errors. It proposes practical strategies like expert checks, consulting authority lists, and employing automated tools for scientific name extraction. It emphasizes the importance of geographic references for data analysis, addresses common georeferencing mistakes, and outlines data cleaning procedures through simulated error testing. The ultimate goal is to improve data quality while acknowledging the limits of complete data cleaning.
Nothing Is Perfect: Error Detection and Data Cleaning
A. Townsend Peterson
STOLEN SHAMELESSLY FROM Arthur Chapman …
Types of Errors in Biodiversity Data
• Taxonomic data
Detection of Taxonomic Errors
• Sine qua non: an expert checks specimens and associated data
• Check names against authority lists
• Check names and authorities against authority lists
• N.B.: Check out new capabilities for automated detection and extraction of scientific names … http://jbi.nhm.ku.edu
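The authority-list check above can be sketched in a few lines. This is a minimal illustration, not the tool referenced on the slide: the function name `check_names`, the two-entry authority list, and the 0.85 similarity cutoff are all assumptions chosen for the example. It only compares name strings, so it cannot replace the expert check.

```python
import difflib

def check_names(names, authority, cutoff=0.85):
    """Classify each name as 'exact', 'suspect' (near match), or 'unmatched'."""
    # Case-insensitive lookup that maps back to the authority spelling.
    lookup = {a.lower(): a for a in authority}
    report = {}
    for name in names:
        # Collapse repeated whitespace before comparing.
        key = " ".join(name.split()).lower()
        if key in lookup:
            report[name] = ("exact", lookup[key])
            continue
        close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        if close:
            # Probably a misspelling: suggest the authority name for review.
            report[name] = ("suspect", lookup[close[0]])
        else:
            report[name] = ("unmatched", None)
    return report

# Hypothetical authority list, for illustration only.
authority = ["Crax rubra", "Rauvolfia paraensis"]
report = check_names(
    ["Crax rubra", "Crax rubrra", "Rauvolfia  paraensis", "Homo sapiens"],
    authority,
)
for name, (status, suggestion) in report.items():
    print(f"{name!r}: {status} {suggestion or ''}")
```

Names classified as 'suspect' or 'unmatched' are candidates for a specialist to review, not automatic corrections.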
Spatial Error
• Geographic references are invaluable in enabling analysis of biodiversity data, but are also extremely prone to problems
Data Cleaning Procedures
• Assemble occurrence points for each species
• Eliminate occurrence points one at a time (jackknife), and build a model without each point in turn
• Identify points that are
  • included in models only when they are part of the input data set
  • not included in models even when they are part of the input data set
• Flag these points as suspect for further checking
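The jackknife procedure above can be sketched as follows. The slides do not say which modelling algorithm was used, so this example swaps in a deliberately simple stand-in: a rectangular envelope of mean ± k standard deviations on each variable. The function names, the value of `k`, and the two-variable data set are hypothetical and only illustrate the leave-one-out logic.

```python
from statistics import mean, pstdev

def fit_envelope(points, k=2.0):
    """Stand-in 'model': a mean ± k·sd interval for each variable."""
    return [(mean(d) - k * pstdev(d), mean(d) + k * pstdev(d))
            for d in zip(*points)]

def predicts(model, point):
    """True if the point falls inside the envelope on every variable."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, model))

def jackknife_flags(points, k=2.0):
    """Flag points the model predicts only when, or not even when, they are in the input."""
    full = fit_envelope(points, k)
    flags = {}
    for i, p in enumerate(points):
        reduced = fit_envelope(points[:i] + points[i + 1:], k)  # leave point i out
        if not predicts(full, p):
            flags[i] = "not predicted even when included"
        elif not predicts(reduced, p):
            flags[i] = "predicted only when included"
    return flags

# Hypothetical values of two environmental variables at 11 occurrence points:
# nine mutually consistent points plus two deliberately odd ones at the end.
core = [(x, y) for x in (10, 11, 12) for y in (10, 11, 12)]
points = core + [(13, 11), (11, 20)]
flags = jackknife_flags(points)
for i, reason in flags.items():
    print(points[i], "->", reason)
```

With a real niche-modelling algorithm the two categories are computed the same way; only `fit_envelope` and `predicts` would change.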
Data Cleaning Test
• Distributional data from the Atlas of Mexican Bird Distributions for various species
• Select 18 points at random from those available
• Add two random points
• Simulates a 10% error rate (2 of 20 points)
• Use the data-cleaning procedure to see whether the random points can be identified as ‘erroneous’
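The test design above (18 real points plus 2 random ones, i.e. 2/20 = 10% seeded error) can be set up roughly as below. This is a sketch under stated assumptions: the occurrence pool is made up, the extent is a rough illustrative bounding box rather than an exact national boundary, and `make_test_set` and `recovery` are hypothetical helper names.

```python
import random

def make_test_set(occurrences, extent, n_real=18, n_random=2, seed=0):
    """Return a shuffled list of (point, is_seeded_error) pairs."""
    rng = random.Random(seed)
    real = rng.sample(occurrences, n_real)
    (lon_min, lon_max), (lat_min, lat_max) = extent
    # Seeded 'errors': points drawn uniformly from the whole extent.
    seeded = [(rng.uniform(lon_min, lon_max), rng.uniform(lat_min, lat_max))
              for _ in range(n_random)]
    labelled = [(p, False) for p in real] + [(p, True) for p in seeded]
    rng.shuffle(labelled)
    return labelled

def recovery(labelled, flagged_indices):
    """Fraction of seeded errors that a cleaning procedure flagged."""
    seeded = {i for i, (_, is_error) in enumerate(labelled) if is_error}
    return len(seeded & set(flagged_indices)) / len(seeded)

# Hypothetical occurrence pool; real use would load one species' records.
pool = [(-99.0 + 0.1 * i, 19.0 + 0.05 * i) for i in range(40)]
extent = ((-118.0, -86.0), (14.0, 33.0))  # rough illustrative lon/lat box
test_set = make_test_set(pool, extent)
error_rate = sum(is_error for _, is_error in test_set) / len(test_set)
print(f"{len(test_set)} points, seeded error rate {error_rate:.0%}")
```

The indices flagged by a cleaning procedure would then be passed to `recovery` to score it. A random point can land somewhere plausible for the species, so a recovery below 100% does not by itself mean the procedure failed.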
Example – Crax rubra
• Successfully identified the 2 random points included in the model
Example – Rauvolfia paraensis
• Identified one point as an outlier
• Proved to be an undescribed species
Error Flagging
• Never possible to clean completely; what matters is the signal-to-noise ratio
• No substitute for inspection and detailed study by specialists
• HOWEVER, we can
  • Detect records with internal inconsistencies that clearly represent error in some field
  • Detect records with a high probability of including errors owing to unusual characteristics
  • Flag those records for later checking and correction
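The two kinds of detection listed above can be kept apart in code: hard errors for internal inconsistencies, and warnings for records that are merely unusual. The sketch below assumes a hypothetical record layout (`id`, `country`, `lat`, `lon`) and a rough, illustrative country bounding box. Neither comes from the slides, and a real check would use proper boundary data.

```python
def flag_record(rec, country_boxes):
    """Return (errors, warnings) for one occurrence record (a dict)."""
    errors, warnings = [], []
    lat, lon = rec.get("lat"), rec.get("lon")
    if lat is None or lon is None:
        warnings.append("no coordinates")
        return errors, warnings
    # Values that cannot be valid coordinates are errors in some field.
    if not -90 <= lat <= 90:
        errors.append("latitude out of range")
    if not -180 <= lon <= 180:
        errors.append("longitude out of range")
    if errors:
        return errors, warnings
    # Unusual but not impossible: often a placeholder for missing data.
    if lat == 0 and lon == 0:
        warnings.append("coordinates are exactly 0, 0")
    # Internal inconsistency: coordinates disagree with the country field.
    box = country_boxes.get(rec.get("country"))
    if box is not None:
        (lat_min, lat_max), (lon_min, lon_max) = box
        if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
            errors.append("coordinates fall outside stated country")
    return errors, warnings

# Rough illustrative box (lat range, lon range); an assumption, not a boundary.
boxes = {"Mexico": ((14.0, 33.0), (-119.0, -86.0))}
records = [
    {"id": 1, "country": "Mexico", "lat": 19.4, "lon": -99.1},
    {"id": 2, "country": "Mexico", "lat": 19.4, "lon": 99.1},    # sign dropped
    {"id": 3, "country": "Mexico", "lat": 119.4, "lon": -99.1},  # impossible latitude
    {"id": 4, "country": None, "lat": 0.0, "lon": 0.0},
]
flagged = {rec["id"]: flag_record(rec, boxes) for rec in records}
for rec_id, (errs, warns) in flagged.items():
    print(rec_id, errs, warns)
```

Flagged records are queued for checking and correction rather than deleted, in keeping with the last bullet above.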