A statistical perspective on quality in GEOSS: challenges and opportunities
Dan Cornford
d.cornford@aston.ac.uk
Aston University, Birmingham, UK
What is quality?
• Quality has several facets, but the key one in my view is a quantitative statement of the relation of a value to reality
• Quality is challenging to define:
• ISO 9000: "degree to which a set of inherent characteristics fulfils requirements"
• International Association for Information and Data Quality: "Information Quality is not just 'fitness for purpose'; it must be fit for all purposes"
• Oxford English Dictionary: "the standard of something as measured against other things of a similar kind; the degree of excellence of something"
The key aspects of data quality
• An exhaustive list might include: Accuracy, Integrity, Precision, Objectivity, Completeness, Conciseness, Redundancy, Validity, Consistency, Timeliness, Accessibility, Utility, Usability, Flexibility, Traceability
• There is no universal agreement, but we propose:
• accuracy: the value correctly represents the real world
• completeness: the degree of data coverage for a given region and time
• consistency: are the rules to which the data should conform met?
• usability: how easy is it to access and use the data?
• traceability: can one see how the results have arisen?
• utility: what is the user's view of the data's value to their use-case?
Accuracy and what is reality for GEOSS
• Accuracy is the most important quality aspect
• This is not a talk about philosophy ...
• However, we must define objects in the real world using mental concepts
• I view reality as a set of continuous space-time fields of discrete or continuous valued variables
• The variables represent different properties of the system, e.g. temperature, land cover
• A big challenge is that reality varies over almost all space and time scales, so we need to be precise about these scales when defining reality
Relating observations to reality - accuracy
• Assume we can define reality precisely
• I argue the most useful information I can have about an observation is the relation of this observation to reality
• Express this relation mathematically as y = h(x), where:
• y is my observation
• x is reality (not known)
• h() is my sensor / forward / observation model that maps reality to what I can observe
• I can almost never write this exactly, due to various sources of uncertainty in my observation and in h, and maybe variations in x, so I must write y = h(x) + ε(x)
• ε(x) is the (irreducible) observation uncertainty
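To make the notation concrete, here is a minimal sketch (my own illustration, not part of the talk) of an observation model y = h(x) + ε(x); the linear sensor response and the noise level are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def h(x, gain=1.02, bias=-0.3):
        # Hypothetical sensor / forward model: maps the true state x
        # to what the instrument would report in a noise-free world.
        return gain * x + bias

    def observe(x, noise_sd=0.5):
        # y = h(x) + eps(x): the forward model plus irreducible
        # observation uncertainty (here zero-mean Gaussian noise).
        return h(x) + rng.normal(0.0, noise_sd)

    x_true = 15.0            # reality (unknown in practice)
    y = observe(x_true)      # what we actually get to see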
Dealing with the unknown
• Uncertainty is a fundamental part of science
• Managing uncertainty is at the heart of data accuracy
• There are several frameworks for handling uncertainty:
• frequentist probability (requires repeatability)
• subjective Bayesian probability (personal belief)
• fuzzy methods (more relevant to semantics)
• imprecise probabilities (you don't have full distributions)
• belief theory and other multi-valued representations
• Choosing one is a challenge
• I believe that subjective Bayesian approaches are a good starting point
What I would really, really want
• You supply an observation y
• I would want to know:
• h(x), the observation function
• ε(x), the uncertainty about y, defining p(y|x)
• because then I can work out: p(x|y) ∝ p(y|x) p(x)
• This is the essence of Bayesian (probabilistic) logic – updating my beliefs
• I also need to know how you define x – at what spatial and temporal scales
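As a worked illustration of the update p(x|y) ∝ p(y|x) p(x), the sketch below assumes a Gaussian prior on reality x and a Gaussian observation likelihood with h(x) = x; all the numbers are invented.

    # Gaussian prior p(x) and Gaussian likelihood p(y|x), with h(x) = x.
    prior_mean, prior_var = 14.0, 4.0    # belief about reality before seeing y
    obs, obs_var = 15.3, 0.25            # observation y and its error variance

    # Conjugate update: the posterior p(x|y) is again Gaussian.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)

    print(post_mean, post_var)   # belief about x after seeing y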
Why does p(y|x) matter?
• Imagine I have other observations of x (reality)
• How do I combine these observations rationally and optimally?
• The solution is to use Bayesian updating – this needs the joint structure of all the observation errors!
• This is optimistic (unknowable?); however, given reality, it is likely that for many observations the errors will be uncorrelated
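Under the assumption that the errors really are uncorrelated given reality (and Gaussian), combining several observations of the same quantity reduces to a precision-weighted average; a minimal sketch with invented numbers:

    import numpy as np

    # Several observations of the same quantity x, each with its own
    # stated error variance, i.e. its own p(y_i|x).
    ys       = np.array([15.3, 14.8, 16.1])
    obs_vars = np.array([0.25, 1.0,  4.0])

    # With uncorrelated Gaussian errors, precisions (1/variance) add.
    precisions = 1.0 / obs_vars
    combined_var  = 1.0 / precisions.sum()
    combined_mean = combined_var * (precisions * ys).sum()

    # Accurate observations dominate, but noisy ones still contribute.
    print(combined_mean, combined_var)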
What about a practical example?
• Consider weather forecasting:
• essentially an initial value problem – we need to know about reality x at an initial time
• thus if we can define p(x|y) at the start of our forecast we are good to go
• this is what is called data assimilation; combining different observations requires good uncertainty characterisation for each observation, p(y|x)
• In data assimilation we also need to know about the observation model y = h(x) + ε(x) and its uncertainty
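Data assimilation makes this concrete. The analysis step below is a standard Kalman update written for a toy two-component state; the background and observation error covariances are invented for illustration and do not describe any operational system.

    import numpy as np

    # Background (prior) state and its error covariance: p(x).
    x_b = np.array([15.0, 0.6])
    B   = np.array([[4.0, 0.5],
                    [0.5, 1.0]])

    # One observation of the first state component: y = H x + eps.
    H = np.array([[1.0, 0.0]])      # linear observation operator h()
    R = np.array([[0.25]])          # observation error covariance, from p(y|x)
    y = np.array([15.3])

    # Kalman gain and analysis (posterior) state: p(x|y).
    K   = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
    x_a = x_b + K @ (y - H @ x_b)
    A   = (np.eye(2) - K @ H) @ B

    print(x_a, A)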
How do we get p(y|x)?
• This is where QA4EO comes in – it has a strong metrology emphasis, where all sources of uncertainty are identified
• this is very challenging – defining a complete probability distribution requires many assumptions
• So we had better try and check our model p(y|x) against reference validation data, and assess the reliability of the density
• reliability is used in a technical sense – it measures whether the estimated probabilities are actually observed in practice
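One simple reliability check, sketched here with synthetic stand-in validation data and a Gaussian error model, is to ask whether nominal 95% intervals derived from the stated uncertainty actually cover the reference values about 95% of the time.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic stand-in for reference validation data: true values,
    # the corresponding observations, and the stated observation sd.
    truth  = rng.normal(15.0, 2.0, size=1000)
    obs_sd = 0.5
    obs    = truth + rng.normal(0.0, obs_sd, size=truth.size)

    # If p(y|x) = N(x, obs_sd^2) is reliable, roughly 95% of nominal
    # 95% intervals around y should contain the reference truth.
    half_width = 1.96 * obs_sd
    coverage = np.mean(np.abs(obs - truth) <= half_width)
    print(coverage)   # close to 0.95 if the stated uncertainty is honest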
Practicalities of obtaining p(y|x)
• This is not easy – I imagine a two-pronged approach:
• lab-based "forward" assessment of instrument characteristics to build an initial uncertainty model
• field-based validation campaigns to update our beliefs about the uncertainty, using a data assimilation like approach
• continual refinement based on ongoing validation
• This requires new statistical methods, new systems and new software, and is a challenge!
• ideally this should integrate into data assimilation ...
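As a sketch of the "update our beliefs about the uncertainty" step, the example below treats a lab-derived error variance as a prior and refines it with field validation residuals using a standard inverse-gamma conjugate update; this is my illustration, not a QA4EO or GeoViQua prescription, and the residuals are synthetic.

    import numpy as np

    # Prior belief about the observation error variance from lab work,
    # expressed as an inverse-gamma(alpha, beta) distribution.
    alpha0, beta0 = 5.0, 5.0 * 0.25     # prior mean variance ~ beta0 / (alpha0 - 1)

    # Residuals y - h(x_ref) from a field validation campaign (synthetic here).
    rng = np.random.default_rng(2)
    residuals = rng.normal(0.0, 0.7, size=50)

    # Conjugate update for a zero-mean Gaussian with unknown variance.
    n = residuals.size
    alpha_n = alpha0 + n / 2.0
    beta_n  = beta0 + 0.5 * np.sum(residuals**2)

    posterior_mean_var = beta_n / (alpha_n - 1.0)
    print(posterior_mean_var)   # refined estimate of the error variance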
How does this relate to current approaches?
• Current approaches, e.g. ISO 19115 / 19115-2 (metadata), recognise quality as something important, but:
• they do not give strong enough guidance on using useful quality indicators (QA4EO addresses this to a greater degree)
• many of the quality indicators are very esoteric and not statistically well motivated (still true in ISO 19157)
• We have built UncertML specifically to describe uncertainty information flexibly but precisely, in a usable form that allows probabilistic solutions
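The sketch below is not the UncertML encoding itself (which is XML-based); it is a hypothetical Python structure that illustrates the underlying idea of attaching a precise, machine-readable distributional description to an observed value, rather than a free-text quality flag.

    from dataclasses import dataclass

    @dataclass
    class GaussianUncertainty:
        # A parametric description of the observation error: a normal
        # distribution with the given mean and variance.
        mean: float
        variance: float

    @dataclass
    class Observation:
        # An observed value plus an explicit, quantitative uncertainty
        # statement that downstream probabilistic methods can consume.
        variable: str
        value: float
        uncertainty: GaussianUncertainty

    obs = Observation(
        variable="air_temperature",
        value=15.3,
        uncertainty=GaussianUncertainty(mean=0.0, variance=0.25),
    )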
Why probabilistic quality modelling?
• We need a precise, context independent definition of the observation accuracy as a key part of quality
• Probabilistic approaches:
• work for all uses of the observation – not context specific
• provide a coherent, principled framework for using observations
• allow integration of data (information interoperability) and data reuse
• can extract information from even noisy data – assuming reliable probabilities, data of any 'quality' can be used
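To illustrate the last point, this short sketch adds a very noisy observation to the posterior from the earlier Gaussian example: with a reliably stated error variance it still tightens the posterior, just only slightly (numbers are invented).

    # Posterior after the accurate observation (from the earlier sketch).
    post_mean, post_var = 15.22, 0.235   # approximate values

    # A much noisier observation of the same quantity.
    noisy_obs, noisy_var = 13.0, 25.0

    # Treat the current posterior as the prior and update again.
    new_var  = 1.0 / (1.0 / post_var + 1.0 / noisy_var)
    new_mean = new_var * (post_mean / post_var + noisy_obs / noisy_var)

    # The variance shrinks a little: even low-'quality' data carries
    # information, provided its stated uncertainty is reliable.
    print(new_mean, new_var)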
Quality, metadata and the GEO label
• Quantitative probabilistic accuracy information is the single most useful aspect
• Other aspects remain relevant:
• traceability: provenance / lineage
• usability: ease / cost of access
• completeness: coverage
• validity: conformance to internal and external rules
• utility: user rating
• I think the GEO label concept must put a well defined probabilistic notion of accuracy at its heart, but also consider these other quality aspects
Summary
• Quality has many facets – accuracy is key
• Accuracy should be well defined, requiring a rigorous statistical framework and a definition of reality
• Quality should be at the heart of a GEO label
• QA4EO is starting to show the way
• We also need to show how to implement this
• GeoViQua will develop some of the necessary tools, but this is a long road ...