Presentation Transcript

  1. Geographic data validation

  2. Index • Basic concepts • Why do we need validation? • How to assess geographic data • Initial checks • Intermediate checks • Advanced checks • Some final considerations

  4. Basic concepts • Quality • Faithful representation of a feature • The quality of the data determines the quality of the output • GIGO (garbage in, garbage out) principle • Data have the potential to be used in ways unforeseen when collected • The value of the data is directly related to their fitness for a variety of uses

  5. Basic concepts • Fitness-for-use • The suitability of a set of data for a specific purpose • Also known as usability • Should not be confused with quality • Quality: abstract • Usability: specific • A low-quality dataset may still have high usability

  6. Basic concepts • Precision • Closeness of repeated measurements to one another, whether or not they are close to the true value • Accuracy • Closeness of a measurement to the true value

  7. Precision vs Accuracy

  8. Basic concepts • Precision • Closeness of repeated measurements to one another, whether or not they are close to the true value • Accuracy • Closeness of a measurement to the true value • Precision is an intrinsic property of the measurements • Accuracy depends on knowing the true value of the variable • Data validation: assessing the accuracy • Compare against a reference value
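
The distinction on this slide can be illustrated with a minimal sketch. It assumes precision is summarised by the standard deviation of repeated measurements and accuracy by the distance between their mean and a reference value; the function name and the sample readings are hypothetical.

```python
from statistics import mean, pstdev


def precision_and_accuracy(measurements, reference):
    """Return (spread, bias) for repeated measurements of one quantity.

    spread: population standard deviation (smaller means more precise)
    bias:   absolute difference between the mean and the reference value
            (smaller means more accurate)
    """
    spread = pstdev(measurements)
    bias = abs(mean(measurements) - reference)
    return spread, bias


# Hypothetical repeated latitude readings for a site whose reference latitude is 40.0
tight_but_off = [41.01, 41.00, 40.99]    # precise, but not accurate
loose_but_centred = [39.5, 40.6, 39.9]   # accurate on average, but not precise

print(precision_and_accuracy(tight_but_off, 40.0))
print(precision_and_accuracy(loose_but_centred, 40.0))
```

Note that the spread can be computed from the measurements alone, while the bias cannot be computed without the reference value.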

  9. Index • Basic concepts • Why do we need validation? • How to assess geographic data • Initial checks • Intermediate checks • Advanced checks • Some final considerations

  10. Why do we need validation?

  11. Why do we need validation?

  12. Why do we need validation? • This was a striking example, but more subtle issues can (and actually do) happen • We need to develop techniques and methodologies to explore the data • In other words, we need to validate the data • Validating gives a sense of the reliability of the records, and clues on how to improve it

  13. Index • Basic concepts • Why do we need validation? • How to assess geographic data • Initial checks • Intermediate checks • Advanced checks • Some final considerations

  14. How to assess? • Different techniques apply depending on the aim of the assessment • Remember that high-quality datasets are more likely to show high fitness-for-use • Ideally, check for quality • If we know the purpose, check for fitness for that purpose

  15. How to assess? • Work with geographic information à la Darwin Core • Work with individual records as well as collections of data • Start with the most basic pieces of information • Look for coherence with other pieces of information • If they are not coherent, ask why • Modify the information to see whether it then fits • At more advanced levels, make use of available taxonomic or temporal information
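
As a rough illustration of starting with the most basic pieces of information, the sketch below reads a small CSV that uses the Darwin Core terms `decimalLatitude` and `decimalLongitude` and reports whether each value is present, numeric and within range. The sample rows and the function name are hypothetical.

```python
import csv
import io


def basic_coordinate_issues(row):
    """List the most basic problems with one record's coordinates."""
    issues = []
    for term, limit in (("decimalLatitude", 90), ("decimalLongitude", 180)):
        raw = (row.get(term) or "").strip()
        if not raw:
            issues.append(f"{term} is empty")
            continue
        try:
            value = float(raw)
        except ValueError:
            issues.append(f"{term} is not numeric")
            continue
        if abs(value) > limit:
            issues.append(f"{term} is out of range")
    return issues


# Hypothetical sample data using Darwin Core column names
SAMPLE = """occurrenceID,countryCode,decimalLatitude,decimalLongitude
a1,SE,55.932576,13.132359
a2,SE,abc,13.1
a3,SE,95.0,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = {row["occurrenceID"]: basic_coordinate_issues(row) for row in rows}
for occurrence_id, issues in report.items():
    print(occurrence_id, issues)
```

The same function works on a single record or on a whole collection, which matches the advice to look at both.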

  16. How to assess? • Tools • Spreadsheets: Microsoft Excel, LibreOffice Calc… • Well-known environment • Visually easy • OpenRefine • Spreadsheet-like, but with some enhanced features • Scripts • Database scripts: work directly at the source • Other programming languages: enhanced capabilities • GIS software • Often linked with other tools, such as spreadsheets or scripts

  17. Visualizations • Visual exploration of the record set • Useful for a first-level assessment • Primary visualization for geographic data: maps • The next picture shows several issues that can be detected using a map…

  18. Coordinate transposition • This happens when latitude is stored in the longitude field and vice versa • Usually difficult to detect on a one-by-one basis • But when looking at the whole picture…
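
A minimal sketch of the one case that can be caught record by record: a "latitude" outside ±90 that becomes valid once the two values are swapped. The function names are hypothetical. Most transposed pairs still fall within valid ranges, so this does not replace looking at the map.

```python
def is_valid_pair(lat, lon):
    """True if latitude and longitude are within their legal ranges."""
    return -90 <= lat <= 90 and -180 <= lon <= 180


def transposition_status(lat, lon):
    """Classify a coordinate pair using range checks only."""
    if is_valid_pair(lat, lon):
        return "plausible"             # may still be transposed
    if is_valid_pair(lon, lat):
        return "possibly transposed"   # valid only after swapping
    return "invalid"


print(transposition_status(14.6, 120.5))
print(transposition_status(120.5, 14.6))
print(transposition_status(200.0, 300.0))
```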

  19. Zero vs Null • One of the most common issues • Storing 0 (zero) instead of leaving the field empty • Some data management systems do this • Latitude 0 and longitude 0 are stored to mean “unknown coordinates” • But users of the data cannot know that, and it is not what the standard says
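
One possible way to separate the two cases is sketched below. It assumes records are dictionaries keyed by the Darwin Core terms `decimalLatitude` and `decimalLongitude`; the function name and status labels are hypothetical. A pair of exact zeros is only flagged as suspicious, since the check cannot tell a placeholder from a genuine location.

```python
def zero_null_status(record):
    """Classify a record's coordinates as 'missing', 'zero_zero' or 'present'."""
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat in (None, "") or lon in (None, ""):
        return "missing"        # field left empty, as it should be when unknown
    if float(lat) == 0.0 and float(lon) == 0.0:
        return "zero_zero"      # suspicious: may be a placeholder for "unknown"
    return "present"


print(zero_null_status({"decimalLatitude": "0", "decimalLongitude": "0.000"}))
print(zero_null_status({"decimalLatitude": "", "decimalLongitude": None}))
print(zero_null_status({"decimalLatitude": 57.3, "decimalLongitude": 11.9}))
```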

  20. Negation • Forgetting or altering the positive/negative sign of the coordinates • Usually forgetting the minus sign • The most common source: converting from degrees, minutes, seconds (DMS) to decimal degrees (DD) without taking “W” or “S” into account
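
A minimal sketch of a DMS to DD conversion that takes the hemisphere letter into account, so that “S” and “W” values come out negative. The function name and signature are hypothetical.

```python
def dms_to_dd(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds plus a hemisphere letter to decimal degrees."""
    hemisphere = hemisphere.strip().upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # South and west are negative in decimal degrees
    return -value if hemisphere in {"S", "W"} else value


print(dms_to_dd(78, 47, 52, "S"))
print(dms_to_dd(35, 50, 31, "E"))
```

Dropping the last line of the function (always returning `value`) reproduces exactly the error described on this slide.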

  21. Check against country • The easiest way of checking these issues is to check whether the coordinates fall inside the specified country… • Provided, of course, that we have a country value to check against • Two ways • Use GIS software • Use web services like GeoNames (we will see this in the OpenRefine session)
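
The idea can be sketched without GIS software by testing the coordinates, and a few modified versions of them, against a bounding box. This is a simplification: a bounding box is a crude stand-in for a real country polygon, and the box used here is a made-up example, not the extent of any actual country. All names are hypothetical.

```python
def in_box(lat, lon, box):
    """box is (min_lat, max_lat, min_lon, max_lon) in decimal degrees."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# Modifications to try when the recorded coordinates do not fit
CANDIDATE_FIXES = {
    "as recorded": lambda lat, lon: (lat, lon),
    "negate latitude": lambda lat, lon: (-lat, lon),
    "negate longitude": lambda lat, lon: (lat, -lon),
    "negate both": lambda lat, lon: (-lat, -lon),
    "transpose": lambda lat, lon: (lon, lat),
}


def diagnose(lat, lon, box):
    """Return the names of the modifications that place the point inside the box."""
    return [name for name, fix in CANDIDATE_FIXES.items()
            if in_box(*fix(lat, lon), box)]


# Hypothetical bounding box for an imaginary country
EXAMPLE_BOX = (36.0, 44.0, -10.0, -1.0)

print(diagnose(40.4, -3.7, EXAMPLE_BOX))
print(diagnose(40.4, 3.7, EXAMPLE_BOX))
print(diagnose(-3.7, 40.4, EXAMPLE_BOX))
```

An empty result means none of the simple modifications explains the mismatch. More than one match means the record needs a closer look before any correction is proposed.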

  22. Georeferencing • Intermediate check • If we have locality information and coordinates, we can check whether they match • Georeferencing is a tough task, and prone to uncertainties, so some level of imprecision is to be expected • Make good use of the “uncertainty” fields in Darwin Core! • But still…
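
A rough sketch of such a match check: compute the great-circle distance between the recorded coordinates and the coordinates obtained by georeferencing the locality text, and compare it with the stated uncertainty (for instance the Darwin Core term `coordinateUncertaintyInMeters`). The haversine formula on a spherical Earth is an approximation, and the georeferenced point below is a hypothetical value.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def matches_locality(record_point, locality_point, uncertainty_m):
    """True if the recorded point lies within the uncertainty radius of the locality."""
    return haversine_m(*record_point, *locality_point) <= uncertainty_m


recorded = (55.932576, 13.132359)
georeferenced = (55.93, 13.13)   # hypothetical result of georeferencing the locality text

print(round(haversine_m(*recorded, *georeferenced)))
print(matches_locality(recorded, georeferenced, uncertainty_m=1000))
print(matches_locality(recorded, georeferenced, uncertainty_m=100))
```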

  23. 55.932576, 13.132359 • Anahuac NWR (UTC 049) • Grandville • POINT(-1.3223333 53.44958) • Marine Nature Study Area • 78° 47’ 52” S; 35° 50’ 31” E • Stewart Park • POINT(-1.1735004 53.358746) • Backyard My Habitat • 55.932576, 13.132359 • Wilderness Park, north of 14th St. • 28054 • Delaney Conservation Area • 57.3, 11.9

  24. Multi-domain checks • Using information from different sources to check quality • In particular, use taxonomic information to improve geospatial data • Most basic example: check data against a range map • If the point falls inside the range map of the specified species, it passes the check • Sometimes, temporal information is useful
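
The range map check boils down to a point-in-polygon test. Below is a minimal sketch using the ray casting technique on a simple polygon given as (longitude, latitude) vertices. It treats coordinates as planar and ignores polygons with holes or ranges that cross the antimeridian; the square "range" is a made-up example.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting test: polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical species range: a simple square
example_range = [(0, 0), (10, 0), (10, 10), (0, 10)]

print(point_in_polygon(5, 5, example_range))
print(point_in_polygon(15, 5, example_range))
```

A point outside the range is not necessarily wrong; it is a record worth reviewing.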

  25. Index • Basic concepts • Why do we need validation? • How to assess geographic data • Initial checks • Intermediate checks • Advanced checks • Some final considerations

  26. Considerations • NEVER modify the original data • Data cleaning is a human task, and thus, it is not error-free • Information we believe is wrong may be right • Make an “improved copy” of the data • Or “flag” the records as inaccurate • Re-share the improvements • With the community: so that others don’t have to re-invent the wheel • With the original owners of the data: so that they can correct the errors at the source
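
A minimal sketch of the "improved copy" and "flag" ideas: the checks run on a deep copy and only add a list of flags, so the original records are never changed. The field name `validationFlags`, the flag names and the sample records are hypothetical.

```python
import copy


def flag_records(records, checks):
    """Return a flagged copy of the records; the originals stay untouched.

    checks maps a flag name to a function that returns True when the record passes.
    """
    flagged = copy.deepcopy(records)
    for record in flagged:
        record["validationFlags"] = [
            name for name, passes in checks.items() if not passes(record)
        ]
    return flagged


original = [
    {"occurrenceID": "a1", "decimalLatitude": 0.0, "decimalLongitude": 0.0},
    {"occurrenceID": "a2", "decimalLatitude": 57.3, "decimalLongitude": 11.9},
]

checks = {
    "ZERO_COORDINATES": lambda r: not (r["decimalLatitude"] == 0 and r["decimalLongitude"] == 0),
    "LATITUDE_OUT_OF_RANGE": lambda r: abs(r["decimalLatitude"]) <= 90,
}

improved = flag_records(original, checks)
for record in improved:
    print(record["occurrenceID"], record["validationFlags"])
```

Because the coordinates themselves are left as recorded, a flag that later turns out to be mistaken can simply be dropped.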