Data Quality Issues

Data Quality Issues • Data quality: • a proper understanding is crucial to the success of any project involving geographic data • no geographic data set can be said to be error-free • “garbage in, garbage out”

Data quality issues for geographic data • Error: • Difference between the real world and the geographic data representation of it. • Accuracy: (another way of describing error) • Extent to which map data values match true values • Example: Imagine a point is at 219 meters elevation above sea level, but a map represents it as 210 meters above sea level. • Error: This data point is represented with 9 meters of error. • Accuracy: This data point is accurate to within 9 meters.

Data error in spatial data • Location errors • Example: a schoolhouse is located 30 feet away from its marked location on a map • A 300 meter contour line is offset 5 meters to the northwest • A satellite image pixel is located 2.4 meters away from its actual location on the ground • Attribute errors • A schoolhouse is incorrectly labeled as a church • A 300 meter contour line is actually supposed to be a 310 meter contour line • A 300 meter contour line actually represents an elevation of 302 meters • A classified satellite image pixel is labeled forest when it is actually a field

Accuracy and error regarding data sets/maps • One data point – error/accuracy can be easily defined. • Data sets/maps – error/accuracy must be summarized. • How is accuracy determined and summarized? • Very accurate data must be collected (sampled) about a subset of the full dataset/map. • This accurate sample is then compared with the original data • A summary is created that compares these 2 datasets (the sample with the same measurements from the original data)

Error with nominal data • Nominal data is right or wrong. Period. • Examples: • Landcover type: a pixel is classified as forest or field. • A building is classified as a school or a church • A county is named Orange County or Durham County

The Nominal Data Case • An example is when you determine the accuracy of a landcover classification. • We can build something called a confusion matrix: • This compares your classification with your ground-truth sample (the very accurate sample data, as mentioned)

Confusion matrix (axis labels on the slide: Classification and Reference)

            forest  fields  urban  water  wetlands  Total
forest          80       4      0     15         7    106
fields           2      17      0      9         2     30
urban           12       5      9      4         8     38
water            7       8      0     65         0     80
wetlands         3       2      1      6        38     50
Total          104      36     10     99        55    304

Confusion Matrix Statistics • Summarizing a confusion matrix: • Row and column summaries are made. • The most basic overall summary statistic is the percent correctly classified • This is calculated by taking the total of the diagonal entries, dividing by the grand total, and multiplying by 100 to produce a percentage • From our example: 209 / 304 * 100% = 68.8% • BUT chance alone (random assignment of classes) would give a score better than 0, so this figure overstates the skill of the classification • A Kappa index: • Determined through a “semi-complex” computation that corrects the percent correctly classified for the agreement expected by chance. • It is another measure describing overall accuracy of a classification, typically ranging between 0 and 100% (negative values are possible when agreement is worse than chance). • A Kappa index can be used to test if a classification is statistically significantly better than a random classification. • The Kappa index for our example evaluates to 58.3%
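The two summary statistics can be reproduced with a short script. This is a minimal sketch, assuming the Kappa index on the slide is Cohen's kappa (observed agreement corrected for chance agreement derived from the row and column totals); the function and variable names are illustrative, not from the original slides.

```python
# Confusion matrix from the landcover example (5 classes, 304 sampled pixels).
# Both statistics below give the same result whichever axis is the reference.
CLASSES = ["forest", "fields", "urban", "water", "wetlands"]
MATRIX = [
    [80, 4, 0, 15, 7],
    [2, 17, 0, 9, 2],
    [12, 5, 9, 4, 8],
    [7, 8, 0, 65, 0],
    [3, 2, 1, 6, 38],
]


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def kappa(matrix):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    observed = overall_accuracy(matrix)
    # Chance agreement: sum over classes of (row total * column total) / N^2.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1 - expected)


print(f"Percent correctly classified: {overall_accuracy(MATRIX) * 100:.1f}%")
print(f"Kappa index: {kappa(MATRIX) * 100:.1f}%")
```

Under that assumption the script reproduces the 68.8% and 58.3% figures quoted for the example.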

What level of accuracy is ‘good’? • The Overall accuracy (and row and column accuracies) are generally considered good/acceptable if they are above 85%. The USGS uses this as a guideline. • The Kappa statistic describes agreement between the classified data and the reference data (it represents the increased accuracy of the performed classification over that of a random classification). A Kappa statistic of: • Above 80% is considered to have strong agreement. • Between 40% and 80% is considered to have moderate agreement. • Below 40% is considered to have poor agreement.
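To see how the landcover example fares against these guidelines, the row and column accuracies can be computed class by class and the Kappa value mapped to its agreement category. This is a rough sketch; the helper names are hypothetical, and treating exactly 40% and 80% as "moderate" is an assumption, since the guideline does not say how boundary values are handled.

```python
CLASSES = ["forest", "fields", "urban", "water", "wetlands"]
MATRIX = [
    [80, 4, 0, 15, 7],
    [2, 17, 0, 9, 2],
    [12, 5, 9, 4, 8],
    [7, 8, 0, 65, 0],
    [3, 2, 1, 6, 38],
]
ACCEPTABLE = 0.85  # the 85% guideline quoted in the text


def row_and_column_accuracies(matrix):
    """Diagonal entry divided by its row total and by its column total."""
    n = len(matrix)
    row_acc = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    col_acc = [
        matrix[j][j] / sum(matrix[i][j] for i in range(n)) for j in range(n)
    ]
    return row_acc, col_acc


def kappa_agreement(kappa_percent):
    """Map a Kappa value (in percent) to the agreement categories above."""
    if kappa_percent > 80:
        return "strong"
    if kappa_percent >= 40:  # boundary handling is an assumption
        return "moderate"
    return "poor"


row_acc, col_acc = row_and_column_accuracies(MATRIX)
for name, r, c in zip(CLASSES, row_acc, col_acc):
    verdict = "meets" if min(r, c) > ACCEPTABLE else "below"
    print(f"{name:9s} row {r:6.1%}  column {c:6.1%}  {verdict} 85% guideline")

print("Kappa of 58.3% indicates", kappa_agreement(58.3), "agreement")
```

In this example no class clears 85% on both its row and its column, even though the urban column alone reaches 90%, which illustrates why the per-class figures are worth checking alongside the overall score.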

Ratio data – error summaries • The overall magnitude of errors in ratio measurements can be summarized using the root mean square error (RMSE). • Calculated by taking the square root of the average squared error • This is a kind of average error • This is the primary measure of accuracy used in map accuracy standards and GIS databases • e.g. we might state that the elevations in a certain digital elevation model have an RMSE of 2 meters. • 2 meters is a sort of “average error” for a data point. • However, data error will range above and below this number. Question: is this an example of locational error or attribute error?
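The RMSE calculation is short enough to write out directly. The sketch below uses made-up check-point elevations (hypothetical values chosen so the result comes out at 2 meters, as in the DEM example); `rmse` is an illustrative helper, not a function from any particular GIS package.

```python
import math


def rmse(measured, reference):
    """Square root of the average squared difference between two series."""
    if not measured or len(measured) != len(reference):
        raise ValueError("need two non-empty series of equal length")
    squared_errors = [(m - r) ** 2 for m, r in zip(measured, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical check points: DEM values versus accurately surveyed values (m).
dem_elevations = [213.0, 297.0, 151.0, 99.0, 120.0]
surveyed_elevations = [210.0, 300.0, 150.0, 100.0, 120.0]

errors = [d - s for d, s in zip(dem_elevations, surveyed_elevations)]
print("Individual errors (m):", errors)
print("RMSE (m):", rmse(dem_elevations, surveyed_elevations))
```

The individual errors here run from 0 to 3 meters in magnitude, which shows how errors at single points fall above and below the RMSE.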

Spatial data accuracy • Locational data accuracy can also be summarized with RMSE. • Here it is a kind of average of the distance between where points/pixels are represented and their actual location on the ground. • Locational data can also be summarized in other ways: • For horizontal data, the USGS uses the US National Map Accuracy Standards: • 90% of all tested points must be within 1/50 of an inch (measured on the map) for maps at scales of 1:20,000 or smaller, and within 1/30 of an inch for maps at scales larger than 1:20,000.
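The map-scale tolerances translate into ground distances once the scale is known. The following sketch converts the 1/50 and 1/30 inch tolerances to meters and applies the 90% rule to a set of horizontal errors; the function names and the check-point errors are hypothetical.

```python
import math

INCH_IN_METERS = 0.0254


def horizontal_errors(mapped, truth):
    """Straight-line distance between each mapped point and its true position."""
    return [
        math.hypot(mx - tx, my - ty)
        for (mx, my), (tx, ty) in zip(mapped, truth)
    ]


def nmas_ground_tolerance_m(scale_denominator):
    """Ground distance (m) matching the map tolerance at a 1:N scale."""
    # A scale larger than 1:20,000 has a denominator smaller than 20,000.
    divisor = 30 if scale_denominator < 20000 else 50
    return scale_denominator / divisor * INCH_IN_METERS


def passes_nmas(errors_m, scale_denominator):
    """True when at least 90% of the tested points are within tolerance."""
    tolerance = nmas_ground_tolerance_m(scale_denominator)
    within = sum(1 for e in errors_m if e <= tolerance)
    return within * 10 >= len(errors_m) * 9  # integer form of within/n >= 0.9


# Hypothetical horizontal errors (m) at ten tested points.
check_errors_m = [3.0, 5.0, 2.0, 8.0, 11.0, 4.0, 6.0, 1.0, 9.0, 20.0]

for denominator in (24000, 10000):
    print(
        f"1:{denominator}: tolerance {nmas_ground_tolerance_m(denominator):.2f} m,"
        f" passes: {passes_nmas(check_errors_m, denominator)}"
    )
```

At 1:24,000 the tolerance works out to about 12.19 m (40 feet) on the ground, so the same set of errors can pass at a small scale and fail at a larger one.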

Data Quality Issues • Precision: • Level of detail at which data values are recorded. • Often referred to as ‘significant digits’. • Example: • A cell in a raster DEM recorded as 219 meters is less precise than a cell recorded at 219.05 meters.

Data quality issues (cont’d.) • Error is unbiased when the error is in ‘random’ directions. • GPS data • Human error in surveying points • Error is biased when there is systematic variation in accuracy within a geographic data set • Example: a GIS tech mistypes coordinate values when entering control points to register a map to a digitizing tablet • all coordinate data from this map are systematically offset (biased) • Example: the wrong datum is being used • Error can propagate… • e.g., what happens if a layer digitized with a spatial bias problem is used as the spatial reference to create another, new layer? • Propagation can be additive
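One simple way to tell biased from unbiased error is to average the error vectors: random-direction errors tend to cancel, while a systematic offset survives the averaging. The sketch below uses hypothetical coordinates and offsets, and models additive propagation as a second layer inheriting the first layer's offset plus its own.

```python
def mean_offset(mapped, truth):
    """Average (dx, dy) between mapped points and their true positions."""
    n = len(mapped)
    dx = sum(m[0] - t[0] for m, t in zip(mapped, truth)) / n
    dy = sum(m[1] - t[1] for m, t in zip(mapped, truth)) / n
    return dx, dy


def shift(points, offset):
    """Apply the same (dx, dy) offset to every point."""
    return [(x + offset[0], y + offset[1]) for x, y in points]


# Hypothetical true positions and small random-direction errors (map units).
truth = [(100.0, 200.0), (150.0, 250.0), (300.0, 120.0), (420.0, 380.0)]
noise = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

unbiased = [(x + nx, y + ny) for (x, y), (nx, ny) in zip(truth, noise)]
biased = shift(unbiased, (5.0, -3.0))  # e.g. mistyped control points
derived = shift(biased, (2.0, 1.0))    # new layer registered to the biased one

print("unbiased layer mean offset:", mean_offset(unbiased, truth))
print("biased layer mean offset:  ", mean_offset(biased, truth))
print("derived layer mean offset: ", mean_offset(derived, truth))
```

The derived layer's mean offset is the sum of the two systematic shifts, which is the additive propagation the slide warns about.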

Some non-error data quality issues • Compatibility: can two or more geographic data sets be used together properly? • e.g. is it meaningful to overlay roads data digitized at 1:10,000 scale with road hazard sites digitized at 1:250,000? • Completeness: does a given data set adequately cover a study area? Are there gaps in space or time? • Example: a city’s municipal cadastral database: do all parcel polygons have attribute information? Are any parcels missing? • Consistency: are geographic data sets consistent in terms of content, format, etc.? • Example: a landcover data layer for a study area with different sub-areas produced from two satellite scenes: • one Landsat TM scene classified into 10 classes, versus • one Landsat MSS scene classified into 5 classes

In the end… • Your responsibility: • assessing the applicability of a data set for your needs. • given the resolution, accuracy, precision, bias, compatibility, completeness, & consistency of a data set or analysis result, is it appropriate or suitable for the intended use? • Make use of lineage information/metadata at a minimum: • description of source data • how was the data transformed in preparation or analysis?

Fuzzy classification • Not “fuzzy logic”. • Becoming more common in academic settings. • Used with nominal data. • Useful for landcover classifications.

Fuzzy Approaches to Uncertainty • Consider a landcover classification with these classes: • Forest • Field • Urban • Water • We don’t assign a single class to each landcover pixel. • Instead, we assign each pixel a probability of membership in each class. • We create 4 layers: • Layer 1: • The attribute data for each pixel is the probability that pixel is in forest. • Layer 2: • The attribute data for each pixel is the probability that pixel is a field. • Layer 3: • The attribute data for each pixel is the probability that pixel is urban. • Layer 4: • The attribute data for each pixel is the probability that pixel is water.
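The four-layer structure can be sketched with one small grid per class. The membership values below are invented for a 2 x 2 raster, and requiring each pixel's memberships to sum to 1 is an assumption that follows from treating them as probabilities; `harden` is a hypothetical helper that collapses the layers back to a single-class map by taking the most likely class.

```python
CLASSES = ["forest", "field", "urban", "water"]

# One membership layer per class; each layer is a 2 x 2 raster (made-up data).
layers = {
    "forest": [[0.70, 0.20], [0.10, 0.05]],
    "field": [[0.20, 0.60], [0.10, 0.05]],
    "urban": [[0.05, 0.10], [0.70, 0.10]],
    "water": [[0.05, 0.10], [0.10, 0.80]],
}


def pixel_memberships(layers, row, col):
    """Membership probability of one pixel in every class."""
    return {name: grid[row][col] for name, grid in layers.items()}


def harden(layers):
    """Single-class map: each pixel takes the class with highest membership."""
    first = next(iter(layers.values()))
    rows, cols = len(first), len(first[0])
    result = []
    for r in range(rows):
        result.append([])
        for c in range(cols):
            memberships = pixel_memberships(layers, r, c)
            result[r].append(max(memberships, key=memberships.get))
    return result


# Sanity check for the probability assumption: memberships sum to 1 per pixel.
for r in range(2):
    for c in range(2):
        assert abs(sum(pixel_memberships(layers, r, c).values()) - 1.0) < 1e-9

print(pixel_memberships(layers, 0, 1))
print(harden(layers))
```

Keeping the layers separate preserves the uncertainty that a single-class map throws away: the pixel at row 0, column 1 is most likely field, but it still carries a 20% chance of being forest.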

Fuzzy Soils Mapping Example • [Figure panels: membership map for bare soils; membership map for alpine meadows; membership map for forests; spatial distribution of the three types obtained by combining the fuzzy maps]
