
Information integration, life-cycle and visualization



  1. Information integration, life-cycle and visualization Peter Fox Xinformatics – ITEC 6961/CSCI 6960/ERTH-6963-01 Week 11, April 26, 2011

  2. Contents • Review of last class, reading • Information quality, uncertainty (again) and bias (again) in a project setting. • Your projects? • Last few classes

  3. The semantics of data and information quality, uncertainty and bias and applications in atmospheric science. Peter Fox, and … Stephan Zednik (1), Gregory Leptoukh (2), Chris Lynnes (2), Jianfu Pan (3). (1) Tetherless World Constellation, Rensselaer Polytechnic Inst.; (2) NASA Goddard Space Flight Center, Greenbelt, MD, United States; (3) Adnet Systems, Inc.

  4. Definitions (ATM) • Quality • Is in the eye of the beholder – the worst case scenario… or a good challenge • Uncertainty • Has aspects of accuracy (how accurately the real-world situation is assessed; this also includes bias) and precision (to how many digits) • Bias has two aspects: • Systematic error resulting in the distortion of measurement data, caused by prejudice or a faulty measurement technique • A vested interest, or a strongly held paradigm or condition, that may skew the results of sampling, measuring, or reporting the findings of a quality assessment: • Psychological: for example, when data providers audit their own data, they usually have a bias toward overstating its quality. • Sampling: sampling procedures that result in a sample that is not truly representative of the population sampled. (Larry English)
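
To make the accuracy / precision / bias distinction concrete, the following minimal Python sketch (an illustration added here, not part of the original slides; the numbers are invented) computes a systematic bias and a precision estimate from retrievals paired with reference values:

    import numpy as np

    # Hypothetical paired samples: satellite retrievals vs. ground-truth reference
    retrieved = np.array([0.21, 0.25, 0.19, 0.30, 0.27])
    reference = np.array([0.20, 0.22, 0.18, 0.26, 0.24])

    errors = retrieved - reference
    bias = errors.mean()                  # systematic error (accuracy aspect)
    precision = errors.std(ddof=1)        # scatter around the bias (precision aspect)
    rmse = np.sqrt((errors ** 2).mean())  # combines both aspects

    print(f"bias={bias:+.3f}  precision={precision:.3f}  RMSE={rmse:.3f}")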

  5. Acronyms ACCESS Advancing Collaborative Connections for Earth System Science ACE Aerosol-Cloud-Ecosystems AGU American Geophysical Union AIST Advanced Information Systems Technology AOD Aerosol Optical Depth AVHRR Advanced Very High Resolution Radiometer GACM Global Atmospheric Composition Mission GeoCAPE Geostationary Coastal and Air Pollution Events GEWEX Global Energy and Water Cycle Experiment GOES Geostationary Operational Environmental Satellite GOME-2 Global Ozone Monitoring Experiment-2 JPSS Joint Polar Satellite System LST Local Solar Time MDSA Multi-sensor Data Synergy Advisor MISR Multiangle Imaging SpectroRadiometer MODIS Moderate Resolution Imaging Spectroradiometer NPP National Polar-Orbiting Operational Environmental Satellite System Preparatory Project

  6. Acronyms (cont.) OMI Ozone Monitoring Instrument OWL Web Ontology Language PML Proof Markup Language QA4EO QA for Earth Observations REST Representational State Transfer TRL Technology Readiness Level UTC Coordinated Universal Time WADL Web Application Description Language XML eXtensible Markup Language XSL eXtensible Stylesheet Language XSLT XSL Transformation

  7. Data quality needs: fitness for purpose • Measuring Climate Change: • Model validation: gridded contiguous data with uncertainties • Long-term time series: bias assessment is a must, especially for sensor degradation and orbit and spatial sampling changes • Studying phenomena using multi-sensor data: • Cross-sensor bias is needed • Realizing Societal Benefits through Applications: • Near-real-time transport/event monitoring – in some cases, coverage and timeliness may be more important than accuracy • Pollution monitoring (e.g., air quality exceedance levels) – accuracy • Educational (users are generally not well-versed in the intricacies of quality; just taking all the data as usable can impair educational lessons) – only the best products

  8. Where are we with respect to this data challenge? “The user cannot find the data; if he can find it, he cannot access it; if he can access it, he doesn't know how good they are; if he finds them good, he cannot merge them with other data” – The Users View of IT, NAS 1989

  9. Level 2 data

  10. Level 2 data • Swath for MISR, orbit 192 (2001)

  11. Level 3 data

  12. Same parameter, same space & time: MODIS vs. MERIS. Different results – why? A threshold used in MERIS processing effectively excludes high aerosol values. Note: MERIS was designed primarily as an ocean-color instrument, so aerosols are “obstacles”, not signal.
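
How a processing threshold can masquerade as a data difference is easy to demonstrate with a purely synthetic Python sketch (an illustration only; this is not the MERIS algorithm, and the cut-off value is invented):

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic aerosol optical depth field with occasional high-aerosol events
    aod = rng.lognormal(mean=-1.8, sigma=0.7, size=10_000)

    threshold = 0.4                   # hypothetical upper cut-off in a screening step
    screened = aod[aod <= threshold]  # high-AOD retrievals are silently discarded

    print(f"mean AOD, all retrievals:      {aod.mean():.3f}")
    print(f"mean AOD, after threshold cut: {screened.mean():.3f}")
    # The screened mean is systematically lower: the threshold itself introduces a bias.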

  13. Why so difficult? • Quality is perceived differently by data providers and data recipients. • There are many different qualitative and quantitative aspects of quality. • Methodologies for dealing with data quality are just emerging • Almost nothing exists for remote sensing data quality • Even the most comprehensive review (Batini’s book) demonstrates that there are no preferred methodologies for solving many data quality issues • Little funding was allocated in the past to data quality, as the priority was to build an instrument, launch a rocket, collect and process data, and publish a paper using just one set of data. • Each science team handled quality differently.

  14. Spatial and temporal sampling – how to quantify it to make it useful for modelers? MODIS Aqua AOD July 2009; MISR Terra AOD July 2009 • Completeness: the MODIS dark-target algorithm does not work over deserts • Representativeness: monthly aggregation is not enough for MISR and even for MODIS • Spatial sampling patterns are different for MODIS Aqua and MISR Terra: “pulsating” areas over the ocean are oriented differently due to the different orbital directions during day-time measurement → cognitive bias

  15. Quality Control vs. Quality Assessment Quality Control (QC) flags in the data (assigned by the algorithm) reflect the “happiness” of the retrieval algorithm, e.g., all the necessary channels indeed had data, there were not too many clouds, the algorithm converged to a solution, etc. Quality assessment is done by analyzing the data “after the fact” through validation, intercomparison with other measurements, self-consistency, etc. It is presented as bias and uncertainty; it is rather inconsistent and is scattered across papers and validation reports.
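
A small Python sketch of the distinction (array names and flag conventions are assumptions for illustration): quality control applies the algorithm's own per-pixel flags, while quality assessment characterizes bias and uncertainty against independent reference data after the fact.

    import numpy as np

    # Hypothetical L2 pixels: retrieved values plus per-pixel QC flags (0 = best, 3 = bad)
    values    = np.array([0.15, 0.22, 0.95, 0.18, 0.40])
    qc_flag   = np.array([0,    1,    3,    0,    2])
    reference = np.array([0.14, 0.20, 0.35, 0.17, 0.33])  # e.g., ground-based truth

    # Quality control: trust the retrieval algorithm's own "happiness" flags
    good = qc_flag <= 1
    qc_filtered = values[good]

    # Quality assessment: bias and uncertainty versus the reference, "after the fact"
    errors = values[good] - reference[good]
    print(f"kept {good.sum()} of {values.size} pixels; "
          f"bias={errors.mean():+.3f}, uncertainty={errors.std(ddof=1):.3f}")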

  16. Challenges in dealing with Data Quality • Q: Why now? What has changed? • A: With the recent revolutionary progress in data systems, dealing with data from many different sensors has finally become a reality. • Only now is a systematic approach to remote sensing quality on the table. • NASA is beefing up efforts on data quality. • ESA is seriously addressing these issues. • QA4EO: an international effort to bring communities together on data quality. • Many information and computer science research questions remain.

  17. Intercomparison of data from multiple sensors Data from multiple sources to be used together: • Current sensors/missions: MODIS, MISR, GOES, OMI • Future decadal missions: ACE, NPP, JPSS, GeoCAPE • European and other countries’ satellites • Models Harmonization needs: • It is not sufficient just to have the data from different sensors and their provenance in one place • Before comparing and fusing data, things need to be harmonized: • Metadata: terminology, standard fields, units, scale • Data: format, grid, spatial and temporal resolution, wavelength, etc. • Provenance: source, assumptions, algorithm, processing steps • Quality: bias, uncertainty, fitness-for-purpose, validation Dangers of easy data access without proper assessment of joint data usage: it is easy to use data incorrectly.

  18. Anomaly Example: South Pacific Anomaly – the MODIS Level 3 data-day definition leads to an artifact in correlation

  19. …is caused by an Overpass Time Difference

  20. Different kinds of reported data quality • Pixel-level Quality: algorithmic guess at the usability of a data point • Granule-level Quality: statistical roll-up of pixel-level quality • Product-level Quality: how closely the data represent the actual geophysical state • Record-level Quality: how consistent and reliable the data record is across generations of measurements. Different quality types are often erroneously assumed to have the same meaning. Ensuring data quality at these different levels requires a different focus and action.
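
One way to keep the four levels from being conflated is to represent them as distinct structures; the Python sketch below is a hypothetical illustration, not the MDSA data model.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class PixelQuality:        # algorithmic guess at the usability of one data point
        flag: int              # e.g., 0 = best ... 3 = do not use

    @dataclass
    class GranuleQuality:      # statistical roll-up of pixel-level quality
        percent_good: float
        percent_cloud_contaminated: float

    @dataclass
    class ProductQuality:      # closeness of the data to the actual geophysical state
        bias: float
        uncertainty: float
        validation_reference: str

    @dataclass
    class RecordQuality:       # consistency/reliability across generations of measurements
        known_discontinuities: List[str] = field(default_factory=list)
        cross_sensor_bias_assessed: bool = False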

  21. Sensitivity of Aerosol and Chl Relationship to Data-Day Definition. Figure: difference between the correlation of (A) MODIS AOT on the LST data day with SeaWiFS Chl and (B) MODIS AOT on the UTC data day with SeaWiFS Chl; correlation coefficients are for MODIS AOT at 550 nm and SeaWiFS Chl. Artifact: the difference between using the LST-based data day and the calendar UTC-based data day.
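
The LST-versus-UTC artifact comes down to how an observation's timestamp is binned into a "data day". A rough Python sketch (a simplification that derives local solar time from longitude alone, ignoring the equation of time; not the actual production code):

    from datetime import datetime, timedelta

    def dataday_utc(t_utc: datetime) -> str:
        """Calendar data day: bin by the UTC date of the observation."""
        return t_utc.strftime("%Y-%m-%d")

    def dataday_lst(t_utc: datetime, lon_deg: float) -> str:
        """Local-solar-time data day: shift by ~4 minutes per degree of longitude."""
        return (t_utc + timedelta(hours=lon_deg / 15.0)).strftime("%Y-%m-%d")

    # A pixel observed at 23:30 UTC at 170 deg E falls on the next local solar day,
    # so the two definitions place it in different daily aggregates.
    t = datetime(2009, 7, 1, 23, 30)
    print(dataday_utc(t), dataday_lst(t, 170.0))   # 2009-07-01 vs. 2009-07-02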

  22. Sensitivity Study: Effect of the Data Day definition on Ocean Color data correlation with Aerosol data. Only half of the Data Day artifact is present because the Ocean group uses the better Data Day definition! Figure: correlation between MODIS Aqua AOD (Ocean group product) and MODIS Aqua AOD (Atmosphere group product), with the pixel count distribution.

  23. Factors contributing to uncertainty and bias in Level 2 • Physical: instrument, retrieval algorithm, aerosol spatial and temporal variability… • Input: ancillary data used by the retrieval algorithm • Classification: erroneous flagging of the data • Simulation: the geophysical model used for the retrieval • Sampling: the averaging within the retrieval footprint Borrowed from the SST study on error budget

  24. What is Level 3 quality? It is not defined in Earth Science…. • If Level 2 errors are known, the corresponding Level 3 error can, in principle, be computed • Processing from L2 → L3 daily → L3 monthly may reduce random noise but can also exacerbate systematic bias and introduce additional sampling bias • However, at best, standard deviations (mostly reflecting variability within a grid box), and sometimes pixel counts and quality histograms, are provided • Convolution of natural variability with sensor/retrieval uncertainty and bias – we need to understand their relative contributions to differences between data • This does not address sampling bias
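
A minimal sketch of how the L3 summary statistics mentioned above (grid-cell mean, standard deviation and pixel count) can be produced from L2 pixels; the 1-degree grid and the function name are assumptions for illustration.

    import numpy as np

    def aggregate_l2_to_l3(lat, lon, value, res_deg=1.0):
        """Bin L2 pixels (1-D NumPy arrays of equal length) onto a regular
        lat/lon grid; return per-cell mean, standard deviation and pixel count."""
        nlat, nlon = int(180 / res_deg), int(360 / res_deg)
        i = np.clip(((lat + 90.0) / res_deg).astype(int), 0, nlat - 1)
        j = np.clip(((lon + 180.0) / res_deg).astype(int), 0, nlon - 1)
        cell = i * nlon + j

        count = np.bincount(cell, minlength=nlat * nlon)
        total = np.bincount(cell, weights=value, minlength=nlat * nlon)
        total_sq = np.bincount(cell, weights=value ** 2, minlength=nlat * nlon)

        with np.errstate(invalid="ignore", divide="ignore"):
            mean = total / count                                   # NaN where count == 0
            std = np.sqrt(np.maximum(total_sq / count - mean ** 2, 0.0))
        shape = (nlat, nlon)
        return mean.reshape(shape), std.reshape(shape), count.reshape(shape)

    # Note: this std mostly reflects variability within a grid box, and the count
    # says nothing about when, or under which conditions, the cell was sampled.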

  25. Why can’t we just apply L2 quality to L3? Aggregation to L3 introduces new issues where aerosols co-vary with some observing or environmental conditions – sampling bias: • Spatial: sampling polar areas more than equatorial areas • Temporal: sampling only one time of day (not obvious when looking at L3 maps) • Vertical: not sensitive to a certain part of the atmosphere, thus emphasizing other parts • Contextual: bright-surface or clear-sky bias • Pixel Quality: filtering or weighting by quality may mask out areas with specific features

  26. Addressing Level 3 data “quality” • Terminology: Quality, Uncertainty, Bias, Error budget, etc. • Quality aspects (examples): • Completeness: • Spatial (MODIS covers more than MISR) • Temporal (the Terra mission has been in space longer than Aqua) • Observing Condition (MODIS cannot measure over sun glint while MISR can) • Consistency: • Spatial (e.g., not changing over the sea-land boundary) • Temporal (e.g., trends, discontinuities and anomalies) • Observing Condition (e.g., variations in retrieved measurements due to viewing conditions, such as viewing geometry or cloud fraction) • Representativeness: • Neither pixel count nor standard deviation fully expresses the representativeness of the grid-cell value • Data Quality Glossary development: http://tw.rpi.edu/web/project/MDSA/Glossary
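
As a complement to pixel count and standard deviation, one crude representativeness metric (an illustrative assumption, not a standard product field) is the fraction of days in the month on which each grid cell was actually sampled:

    import numpy as np

    def sampled_day_fraction(daily_counts):
        """Fraction of days with at least one retrieval, per grid cell.

        daily_counts: NumPy array of shape (n_days, n_lat, n_lon) holding the
        per-day pixel counts. A cell can show a plausible monthly mean and std
        yet be sampled on only a handful of days.
        """
        daily_counts = np.asarray(daily_counts)
        return (daily_counts > 0).mean(axis=0)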

  27. More terminology • ‘Even a slight difference in terminology can lead to significant differences between data from different sensors. It gives an IMPRESSION of data being of bad quality while in fact they measure different things. For example, MODIS and MISR definitions of the aerosol "fine mode" is different, so the direct comparison of fine modes from MODIS and MISR does not always give good correlation.’ • Ralph Kahn, MISR Aerosol Lead.

  28. Three projects with data quality flavor • Multi-sensor Data Synergy Advisor (**) • Product level • Data Quality Screening Service • Pixel level • Aerosol Status • Record level

  29. Multi-Sensor Data Synergy Advisor (MDSA) • Goal: Provide science users with clear, cogent information on salient differences between data candidates for fusion, merging and intercomparison • Enable scientifically and statistically valid conclusions • Develop MDSA on current missions: • NASA - Terra, Aqua, (maybe Aura) • Define implications for future missions

  30. How does MDSA work? MDSA is a service designed to characterize the differences between two datasets and advise a user (human or machine) on the advisability of combining them. • Provides the Giovanni online analysis tool • Describes parameters and products • Documents steps leading to the final data product • Enables better interpretation and utilization of parameter difference and correlation visualizations • Provides clear and cogent information on salient differences between data candidates for intercomparison and fusion • Provides information on data quality • Provides advice on available options for further data processing and analysis.

  31. Giovanni Earth Science Data Visualization & Analysis Tool • Developed and hosted by NASA/ Goddard Space Flight Center (GSFC) • Multi-sensor and model data analysis and visualization online tool • Supports dozens of visualization types • Generate dataset comparisons • ~1500 Parameters • Used by modelers, researchers, policy makers, students, teachers, etc.

  32. Giovanni Allows Scientists to Concentrate on the Science. The Old Way (spread over many months, Jan–Oct in the slide’s timeline): pre-science work – find data, retrieve high-volume data, learn formats and develop readers, extract parameters, perform spatial and other subsetting, identify quality and other flags and constraints, perform filtering/masking, develop analysis and visualization, accept/discard/get more data (satellite, model, ground-based) – and only then DO SCIENCE: use the best data for the final analysis, derive conclusions, write the paper, submit the paper. The Giovanni Way: web-based services (Mirador, Giovanni) handle exploration and initial analysis – find data, read data, extract parameter, filter quality, subset spatially, reformat, reproject, visualize, explore, analyze – in minutes to days, then DO SCIENCE: use the best data for the final analysis, derive conclusions, write the paper, submit the paper. Web-based tools like Giovanni allow scientists to compress the time needed for pre-science preliminary tasks: data discovery, access, manipulation, visualization, and basic statistical analysis. Scientists have more time to do science!

  33. Data Usage Workflow

  34. Data Usage Workflow: Subset / Constrain, Reformat, Filtering, Re-project, Integration (*Giovanni helps streamline / automate these steps)

  35. Data Usage Workflow (with planning): Intended Use, Precision Requirements and Quality Assessment Requirements feed Integration Planning, ahead of the processing steps Subset / Constrain, Reformat, Filtering, Re-project, Integration (*Giovanni helps streamline / automate these steps)

  36. Challenge • Giovanni streamlines data processing, performing required actions on behalf of the user • but automation amplifies the potential for users to generate and use results they do not fully understand • The assessment stage is integral for the user to understand fitness-for-use of the result • but Giovanni does not assist in assessment • We are challenged to instrument the system to help users understand results

  37. Advising users on product quality We need to be able to advise MDSA users on: • Which product is better? Or, a better-formulated question: which product has better quality over certain areas? • Address harmonization of quality across products • How does sampling bias affect product quality? • Spatial: sampling polar areas more than equatorial areas • Temporal: sampling only one time of day • Vertical: not sensitive to a certain part of the atmosphere, thus emphasizing other parts • Pixel Quality: filtering by quality may mask out areas with specific features • Clear sky: e.g., measuring humidity only where there are no clouds may lead to a dry bias • Surface-type related

  38. Research approach • Systematizing quality aspects • Working through the literature • Identifying aspects of quality and their dependence on measurement and environmental conditions • Developing data quality ontologies • Understanding and collecting internal and external provenance • Developing rulesets that allow pieces of knowledge to be inferred, extracted and assembled • Presenting the data quality knowledge with good visuals, statements and references

  39. Data Quality Ontology Development (Quality flag) Working together with Chris Lynnes’s DQSS project, we started from the pixel-level quality view.

  40. Data Quality Ontology Development (Bias) http://cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1286316097170_183793435_22228&partName=htmltext

  41. Modeling quality (Uncertainty) Link to other cmap presentations of quality ontology: http://cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1299017667444_1897825847_19570&partName=htmltext

  42. MDSA Aerosol Data Ontology Example Ontology of Aerosol Data made with cmap ontology editor

  43. RuleSet Development
  [DiffNEQCT:
    (?s rdf:type gio:RequestedService),
    (?s gio:input ?a), (?a rdf:type gio:DataSelection),
    (?s gio:input ?b), (?b rdf:type gio:DataSelection),
    (?a gio:sourceDataset ?a.ds), (?b gio:sourceDataset ?b.ds),
    (?a.ds gio:fromDeployment ?a.dply), (?b.ds gio:fromDeployment ?b.dply),
    (?a.dply rdf:type gio:SunSynchronousOrbitalDeployment),
    (?b.dply rdf:type gio:SunSynchronousOrbitalDeployment),
    (?a.dply gio:hasNominalEquatorialCrossingTime ?a.neqct),
    (?b.dply gio:hasNominalEquatorialCrossingTime ?b.neqct),
    notEqual(?a.neqct, ?b.neqct)
    ->
    (?s gio:issueAdvisory giodata:DifferentNEQCTAdvisory)]
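
The rule above is written in Apache Jena's rule syntax. For illustration, roughly the same check can be expressed in Python with rdflib and a SPARQL query; this is an analogue sketched under assumptions (the namespace URI and input file name are placeholders), not the project's actual implementation.

    from rdflib import Graph, Namespace

    GIO = Namespace("http://example.org/gio#")   # placeholder URI for the gio: vocabulary

    g = Graph()
    g.parse("service_request.ttl", format="turtle")   # hypothetical knowledge-base export

    # Flag requested services whose two input datasets come from sun-synchronous
    # deployments with different nominal equatorial crossing times (cf. DiffNEQCT).
    query = """
    SELECT DISTINCT ?s WHERE {
      ?s a gio:RequestedService ;
         gio:input ?a , ?b .
      ?a a gio:DataSelection ; gio:sourceDataset/gio:fromDeployment ?da .
      ?b a gio:DataSelection ; gio:sourceDataset/gio:fromDeployment ?db .
      ?da a gio:SunSynchronousOrbitalDeployment ;
          gio:hasNominalEquatorialCrossingTime ?ta .
      ?db a gio:SunSynchronousOrbitalDeployment ;
          gio:hasNominalEquatorialCrossingTime ?tb .
      FILTER (?ta != ?tb)
    }
    """

    for row in g.query(query, initNs={"gio": GIO}):
        print(f"issue DifferentNEQCTAdvisory for {row.s}")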

  44. Thus - Multi-Sensor Data Synergy Advisor • Assemble semantic knowledge base • Giovanni Service Selections • Data Source Provenance (external provenance - low detail) • Giovanni Planned Operations (what the service intends to do) • Analyze service plan • Are we integrating/comparing/synthesizing? • Are similar dimensions in data sources semantically comparable? (semantic diff) • How comparable? (semantic distance) • What data usage caveats exist for data sources? • Advise regarding general fitness-for-use and data-usage caveats

  45. Provenance Distance Computation • Provenance: • Origin or source from which something comes, intention for use, who/what it was generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility. • Based on provenance “distance”, we tell users how different data products are. • Issues: • Computing the similarity of two provenance traces is non-trivial • Factors in provenance have varied weight on how comparable the results of processing are • Factors in provenance are interdependent in how they affect the final results of processing • Need to characterize the similarity of external (pre-Giovanni) provenance • The number of dimensions/factors that affect comparability is quickly overwhelming • Not all of these dimensions are independent - most of them are correlated with each other • Numerical studies comparing datasets can be used, when available and where applicable to Giovanni analysis
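
One deliberately simplistic way to turn a provenance comparison into a number is a weighted per-factor distance; the factors, weights and normalization in this Python sketch are illustrative assumptions only, and the interdependence of factors noted above is ignored.

    # Hypothetical provenance summaries for two data products
    prov_a = {"platform": "Terra", "algorithm_version": "5.1",
              "overpass_lst_hours": 10.5, "grid_resolution_deg": 1.0}
    prov_b = {"platform": "Aqua", "algorithm_version": "5.1",
              "overpass_lst_hours": 13.5, "grid_resolution_deg": 1.0}

    # Weights encode how strongly each factor is believed to affect comparability
    WEIGHTS = {"platform": 0.2, "algorithm_version": 0.3,
               "overpass_lst_hours": 0.4, "grid_resolution_deg": 0.1}

    def factor_distance(key, a, b):
        """Per-factor distance in [0, 1]; overpass time uses a normalized gap."""
        if key == "overpass_lst_hours":
            return min(abs(a - b) / 12.0, 1.0)
        return 0.0 if a == b else 1.0

    def provenance_distance(pa, pb):
        total_weight = sum(WEIGHTS.values())
        return sum(w * factor_distance(k, pa[k], pb[k])
                   for k, w in WEIGHTS.items()) / total_weight

    print(f"provenance distance: {provenance_distance(prov_a, prov_b):.2f}")   # 0.30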

  46. Assisting in Assessment: the MDSA Advisory Report and Provenance & Lineage Visualization support Integration Planning (Intended Use, Precision Requirements, Quality Assessment Requirements), alongside the processing steps Subset / Constrain, Reformat, Filtering, Re-project, Integration

  47. Multi-Domain Knowledgebase

  48. Advisor Knowledge Base Advisor rules test for potential anomalies and create associations between service metadata and anomaly metadata in the Advisor KB

  49. Semantic Advisor Architecture (RPI)

  50. …. complexity
