Emerging Trends in Provenance. Deborah L. McGuinness Tetherless World Constellation Chair Rensselaer Polytechnic Institute SWPM Workshop at ISWC November 7, 2010 Shanghai, China. Outline. Some historical explanation & provenance settings Selected current provenance settings
 
                
Outline • Some historical explanation & provenance settings • Selected current provenance settings • Virtual Observatory • Open Data • Discussion topics
Selected Background • Bell Labs: designing description logics & environments aimed at supporting applications such as configuration. • led to research on making DL-based systems useful – with focus on explanation • Stanford: focus on ontology-enabled xx, large hybrid systems, later x informatics • led to ontology evolution and diagnostic environments, renewed explanation, now from a broader perspective expanding beyond FOL and adding emphasis on provenance
Background cont. • Rensselaer Polytechnic Institute/ TWC: next generation web, web science research center, open data, next generation semantic eScience • Led to more connections with social platforms, empowering collections (of users, data, etc.)
Inference Web (IW) (Diagram labels: End Users; End-User Interaction Services: Explanation via Graph, Explanation via Customized Summary, Explanation via Annotation; Data Access & Data Analysis Services: Validate PML data, Access published PML data; Distributed PML data) • Inference Web is a semantic web-based knowledge provenance management infrastructure: • Uses a provenance interlingua (PML) for encoding and interchange of provenance metadata in distributed environments • Provides interactive explanation services for end-users • Provides data access and analysis services for enriching the value of knowledge provenance • It has been used in a wide range of applications
Proof/Provenance Markup Language (PML) (Diagram: documents (D) and PML data distributed across the World Wide Web and enterprise webs) • A kind of linked data on the Web • Modularized & extensible • Provenance: annotate provenance properties • Justification: encodes provenance relations (including support for multiple justifications) • Trust: add trust annotation • Semantic Web based
Making Systems Actionable using Knowledge Provenance • Mobile Wine Agent • CALO • Combining Proofs in TPTP • Intelligence Analyst Tools • Knowledge Provenance in Virtual Observatories • GILA • NOW including Data-gov
Users Require Provenance! Users demand it! If users (humans and agents) are to use, reuse, and integrate system answers, they must trust them. • Intelligence analysts (from DTO/IARPA’s NIMD): Andrew Cowell, Deborah McGuinness, Carrie Varley, and David A. Thurman. Knowledge-Worker Requirements for Next Generation Query Answering and Explanation Systems. Proc. of Intelligent User Interfaces for Intelligence Analysis Workshop, Intl Conf. on Intelligent User Interfaces (IUI 2006), Sydney, Australia. • Intelligent assistant users (from DARPA’s PAL/CALO): Alyssa Glass, Deborah L. McGuinness, Paulo Pinheiro da Silva, and Michael Wolverton. Trustable Task Processing Systems. In Roth-Berghofer, T., and Richter, M.M., editors, KI Journal, Special Issue on Explanation, Künstliche Intelligenz, 2008. • Virtual observatory users (from NSF’s VSTO): Deborah McGuinness, Peter Fox, Luca Cinquini, Patrick West, Jose Garcia, James L. Benedict, and Don Middleton. The Virtual Solar-Terrestrial Observatory: A Deployed Semantic Web Application Case Study for Scientific Research. Proc. of the Nineteenth Conference on Innovative Applications of Artificial Intelligence (IAAI-07), Vancouver, British Columbia, Canada. • And… as systems become more diverse, distributed, and embedded, and depend on more varied data and communities, more provenance and more types of provenance are needed.
Two Application Scenarios: • Interdisciplinary next generation virtual observatories • Open Linked Data
CHIP Pipeline (Chromospheric Helium Image Photometer) • Sites: Mauna Loa Solar Observatory (MLSO), Hawaii; National Center for Atmospheric Research (NCAR) Data Center, Boulder, CO • Raw Data Capture: raw image data captured by CHIP (Chromospheric Helium-I Image Photometer) • Follow-up Processing on Raw Data (e.g., Flat Field Calibration) • Quality Checking (Images Graded: GOOD, BAD, UGLY) • Publishes: Intensity Images (GIF), Velocity Images (GIF)
Semantic Provenance Capture for Data Ingest Systems (SPCDIS) • Fact: Scientific data services are increasing in usage and scope, and with these increases comes a growing need for access to provenance information. • Provenance Project Goal: design a reusable, interoperable provenance infrastructure. • Science Project Goal: design and implement an extensible provenance solution that is deployed at science data ingest/product generation time. • Outcome: an implemented provenance solution in one science setting AND an operational specification for other scientific data applications. • Extends vsto.org
ACOS Data Ingest • Typical science data processing pipelines • Distributed • Some metadata in silos • Much metadata lost • Many human-in-loop decisions, events • No metadata infrastructure for any user • Community is broadening • (Figure: Chromospheric Helium Imaging Photometer (CHIP) Data Ingest) • ACOS: Advanced Coronal Observing System
The Advanced Coronal Observing System case for Provenance • Provenance metadata currently not propagated with or linked to the data products • Processing metadata • Origin (observation) metadata • Data products are the result of “black box” systems • Most users do not know what calibrations, transformations, and QA processing have been applied to the data product • (Diagram labels: Source, Processing (???), Product)
Advanced Coronal Observing System (ACOS) Provenance Use Cases • What were the cloud cover and seeing conditions during the observation period of this image? • What calibrations have been applied to this image? • Why does this image look bad?
PML Usage in SPCDIS • Justification: Explanation, Causality graph • Provenance: Conclusion, Source, Engine, Rule • Trust: Trust/Belief metrics • (Diagram: each NodeSet carries a Conclusion and a Justification; a Justification points to a Rule via hasInferenceRule, to an Engine via hasInferenceEngine, to a SourceUsage (Source, DateTime) via hasSourceUsage, and to further NodeSets via hasAntecedentList)
PML in Action • This is the PML provenance encoding for a “quick look” gif file, which is generated from two image datasets • Node set for the quicklook gif file • hasConclusion: a reference to the gif file itself • InferenceStep: how the gif file was derived • hasInferenceRule • hasInferenceEngine • hasAntecedents: the “antecedents” of the quicklook gif file are other node sets
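The node set structure described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PML schema or the SPCDIS implementation: the class and field names are simplified stand-ins for the PML terms on the slide, and the rule name, engine name, and dataset names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InferenceStep:
    """How a conclusion was derived (simplified stand-in for a PML inference step)."""
    rule: str                      # stands in for hasInferenceRule
    engine: str                    # stands in for hasInferenceEngine
    antecedents: List["NodeSet"] = field(default_factory=list)  # hasAntecedents


@dataclass
class NodeSet:
    """A conclusion plus, optionally, the step that produced it."""
    conclusion: str                # stands in for hasConclusion (reference to the artifact)
    step: Optional[InferenceStep] = None   # None for data with no recorded derivation


def lineage(node: NodeSet) -> List[str]:
    """Return the conclusion of this node set followed by those of its antecedents, depth-first."""
    found = [node.conclusion]
    if node.step is not None:
        for antecedent in node.step.antecedents:
            found.extend(lineage(antecedent))
    return found


# Hypothetical names: the slide only says the gif comes from two image datasets.
intensity = NodeSet("intensity_data")
velocity = NodeSet("velocity_data")
quicklook = NodeSet(
    "quicklook.gif",
    InferenceStep(rule="MakeQuickLook", engine="chip_pipeline",
                  antecedents=[intensity, velocity]),
)

print(lineage(quicklook))
```

Walking the antecedents this way is what lets a viewer answer questions such as "what was this image derived from?" without opening the pipeline itself.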
A PML-Enhanced Image (Figure: a CHIP Quick-Look image next to the PML-enhanced CHIP Quick-Look with its provenance)
Integrated View • Observer log’s information added into quicklook image’s provenance
Provenance aware faceted search Tetherless World Constellation
Current Issues • Successful interdisciplinary VO; needed provenance • Successful provenance integration for experts; needs to support a more diverse audience • As the user base diversifies, what updates are needed? • Will a domain ontology for MLSO/NCAR-affiliated staff be understandable by citizen scientists? No. • How can our representational infrastructure be extended with contextual information relevant to user needs? E.g., linking data products from one part of the CHIP pipeline to specific solar events or events at MLSO (such as reports of bad weather) • Should provenance ontologies provide extension capabilities to include domain-informed extensions? Yes. • [1] Stephan Zednik, Peter Fox and Deborah L. McGuinness, “System Transparency, or How I Learned to Worry about Meaning and Love Provenance!” Proceedings of IPAW 2010 • [2] James R. Michaelis, Li Ding, Zhenning Shangguan, Stephan Zednik, Rui Huang, Paulo Pinheiro da Silva, Nicholas Del Rio and Deborah L. McGuinness, “Towards Usable and Interoperable Workflow Provenance: Empirical Case Studies Using PML” Proceedings of SWPM 2009 • [3] AGU 2010, with papers by Fox et al., McGuinness et al., Zednik et al., West et al., Michaelis et al., …
User Annotations (James Michaelis) • Allowing users to annotate provenance elements is a potential solution • Allow a user community to make replies to questions from individuals • E.g., citizen scientists can get information extensions through help of project staff • Additionally, allow user community to assert information on provenance elements • Vision: to incrementally aggregate information attached to provenance traces, through these annotations.
User Annotations • Can expand information attached to provenance records in two ways: • Clarification: Providing an answer to a question about a provenance element (such as an expanded definition of its purpose). • Context Extension: Provide supplemental information outside the scope of a provenance record, which may aid in provenance understanding.
User Annotations • Types of annotations • Assertion: A user directly asserts a clarification or context extension • Clarification Request: A user makes a request for a clarification on a provenance element. • Context Extension Request: A user makes a request for a context extension. • Reply: A user replies to a clarification request or context extension request. • Discussions may feature participants with different backgrounds. At a high level, such users can be distinguished by Roles • (e.g., Staff, Citizen Scientist)
Use Case 1A • Alice → Web Service: Request processing details for intensity image 20101007.232213.chp.hsh.gif • Server response • Alice → Web Service: Request definition for function Flatten • Server response: Flatten: Apply flat field calibration to an image, using averaged bias and flat files for the corresponding processing day. • Annotation submission (Alice): Type: Clarification Request; Topic: Flatten (Function Definition); Text: Could someone provide a definition of “Flat Field Calibration”?
Use Case 1B • Bob → Web Service: Request details for annotation Annotation_1 • Server response: Type: Clarification Request; Topic: Flatten; Text: Could someone provide a definition of “Flat Field Calibration”? • Annotation submission (Bob): Type: Reply; Reply To: Annotation_1; Clarification On: Flatten; Author: Bob; Role: Staff; Reply: A definition of Flat Field Calibration is given at the provided link. Link: http://www.phys.vt.edu/~jhs/SIP/processing.html
Annotation Structure – Use Cases 1A, 1B • Annotation_1: Type: Clarification Request; Topic: Flatten; Has Text: Could someone provide a definition of “Flat Field Calibration”?; Has Author: Alice • Annotation_2: Type: Reply; Reply To: Annotation_1; Clarification For: Flatten; Has Text: A definition of Flat Field Calibration is given at the provided link.; Has Link: http://www.phys.vt.edu/~jhs/SIP/processing.html; Has Author: Bob (Role: Staff)
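The annotation types, roles, and the request/reply link from Use Cases 1A and 1B could be modeled roughly as follows. This is a minimal sketch and not the draft PMLA module: the class names, field names, and the validation rule are assumptions made for illustration, and Alice's role is not stated on the slides (Citizen Scientist is assumed here).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AnnotationType(Enum):
    ASSERTION = "Assertion"
    CLARIFICATION_REQUEST = "Clarification Request"
    CONTEXT_EXTENSION_REQUEST = "Context Extension Request"
    REPLY = "Reply"


class Role(Enum):
    STAFF = "Staff"
    CITIZEN_SCIENTIST = "Citizen Scientist"


REQUEST_TYPES = {
    AnnotationType.CLARIFICATION_REQUEST,
    AnnotationType.CONTEXT_EXTENSION_REQUEST,
}


@dataclass
class Annotation:
    ident: str
    kind: AnnotationType
    author: str
    role: Role
    topic: str                     # the provenance element being annotated
    text: str
    reply_to: Optional["Annotation"] = None
    link: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed rule: a Reply must answer a request; nothing else may set reply_to.
        if self.kind is AnnotationType.REPLY:
            if self.reply_to is None or self.reply_to.kind not in REQUEST_TYPES:
                raise ValueError("a Reply must point at a request annotation")
        elif self.reply_to is not None:
            raise ValueError("only a Reply may set reply_to")


annotation_1 = Annotation(
    "Annotation_1", AnnotationType.CLARIFICATION_REQUEST,
    author="Alice", role=Role.CITIZEN_SCIENTIST,  # Alice's role is an assumption
    topic="Flatten",
    text="Could someone provide a definition of “Flat Field Calibration”?",
)
annotation_2 = Annotation(
    "Annotation_2", AnnotationType.REPLY,
    author="Bob", role=Role.STAFF, topic="Flatten",
    text="A definition of Flat Field Calibration is given at the provided link.",
    reply_to=annotation_1,
    link="http://www.phys.vt.edu/~jhs/SIP/processing.html",
)
```

Keeping the topic on every annotation is what allows replies and assertions to accumulate on the same provenance element over time.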
Use Case 2 • Initial server response (Web Service → Bob): List of intensity images for 2010-08-01 – 2010-08-04 • For each listed image i = {0 … n}: Bob requests a visualization of listed image i; server response: visualization of image i, ID: image_i • Bob inspects each image to see if it has visual evidence of Coronal Mass Ejection related activity • Annotation submission: Type: Assertion; Author: Bob; Topic: (all applicable images viewed); Text: CME Event observed in referenced images.
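One way to picture the aggregation that Use Case 2 relies on is an index from each annotated element to the annotation texts attached to it. The sketch below assumes a plain dictionary representation and hypothetical image identifiers; it is not taken from the SPCDIS implementation.

```python
from collections import defaultdict
from typing import Dict, List


def index_by_element(annotations: List[dict]) -> Dict[str, List[str]]:
    """Group annotation texts under every provenance element they refer to."""
    index: Dict[str, List[str]] = defaultdict(list)
    for annotation in annotations:
        for element in annotation["topics"]:
            index[element].append(annotation["text"])
    return dict(index)


# Hypothetical image IDs; the slide only refers to "all applicable images viewed".
annotations = [
    {"type": "Assertion", "author": "Bob",
     "topics": ["image_0", "image_2"],
     "text": "CME Event observed in referenced images."},
    {"type": "Assertion", "author": "Bob",
     "topics": ["image_2"],
     "text": "Second observation on the same image."},
]

by_element = index_by_element(annotations)
print(by_element["image_2"])
```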
Related Work & Status • myExperiment [1] • Social networking site for exchanging workflow-centric materials • Support primarily for annotation on workflow scripts, as opposed to provenance-based information • Tupelo [2] • Semantic content repository, designed to facilitate provenance storage/querying • Uses Open Provenance Model (OPM) • User annotations/discussions supported for URI-based content, but no specific focus on aggregating content directly on provenance elements • Status: draft PMLA module; implementation and evaluation with SPCDIS • [1] http://www.myexperiment.org/ [2] http://tupeloproject.ncsa.uiuc.edu/
Example Population Science Issues (with NIH) • Do policies (taxation, smoking bans, etc.) impact health and health care costs? • What data should we display to help scientists and lay people evaluate related questions? • What data might be presented so that people choose to make (positive) behavior changes? • What does the following data show? • What are appropriate follow-ups?
Drill Down Questions • Should we focus on prevalence? • What is prevalence (definition)? • How is it measured (overall / in this data set)? • Conditions under which the data was obtained (date, sample set, extenuating conditions, …) • Do we need more data, more inference, more xxx…
Our Position: System transparency supports user understanding and trust. Our Research Goal: Provide interoperable infrastructure that supports explanations of sources, assumptions, and answers as an enabler for trust.
Mashup Provenance from data-gov • Critical for making demos useful, understandable, and actionable Agency Dataset Demo
Provenance Events (Diagram labels: CSV2RDF, Archive, SemDiff, Enhance; events: create, revision, derive, visualize)
Sample Application Domain (with Xian Li) • Study of Supreme Court Justices needs data from different sources • Court cases, votes (Segal and Spaeth, 1993; Schubert, 1965; Pritchett, 1948; Rohde and Spaeth, 1976): Judicial Databases, e.g. SCDB (Spaeth, 1999) • Public opinions (Tate and Handberg, 1991): Newspaper Comments, e.g. The New York Times • Personal attributes: education, nominator, … (Segal and Spaeth, 1993, 2002): Biographical Directories, e.g. Who's Who in America
Sample Use Case (with Li and Lebo): Surprise • Application reports that Robert H. Jackson was nominated by a Green Party President • There hasn't been a Green Party President
Use Case: Green Party President? • User believes that the system is incorrect • Look for the provenance of the information to identify whether the source is incorrect or the application interpreted the source incorrectly.
Provenance Encoding • Distrust event: ns:subject http://dbpedia.org/resources/Robert_H._Jackson; ns:query_template http://dbpedia.org/sparql?query=select...%JUSTICE%... • Query Creation (pmlj:InferenceStep, linked by pmlj:isConsequenceOf): ns:query_uri • Query Execution (pmlj:InferenceStep, linked by pmlj:isConsequenceOf): ns:query_result, ns:output_format, …, ns:service_uri • Attribution located! “Green”, “DBpedia”
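Locating the attribution amounts to walking the chain of steps behind the suspicious value until a step names the service that supplied it. The sketch below is an assumption-laden simplification: the nested dictionary and its keys only mirror the labels on this slide and are not the actual PML encoding, and the nesting of Query Execution over Query Creation is assumed.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical, simplified trace for the suspicious "Green" value.
trace = {
    "value": "Green",
    "is_consequence_of": {
        "step": "Query Execution",
        "service_uri": "http://dbpedia.org/sparql",
        "is_consequence_of": {
            "step": "Query Creation",
            "subject": "http://dbpedia.org/resources/Robert_H._Jackson",
        },
    },
}


def locate_attribution(node: dict) -> Tuple[Optional[str], List[str]]:
    """Walk the chain of steps and return (host of the first service found, steps visited)."""
    steps: List[str] = []
    source: Optional[str] = None
    current = node.get("is_consequence_of")
    while current is not None:
        steps.append(current["step"])
        if source is None and "service_uri" in current:
            source = urlparse(current["service_uri"]).netloc
        current = current.get("is_consequence_of")
    return source, steps


source, steps = locate_attribution(trace)
print(source, steps)
```

With the source identified, the user can check whether the value was wrong at the source or misread by the application.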
Challenges for Data Aggregators (with Tim Lebo, Greg Williams)
Assumptions and Objectives • Most data are from third-party sources • Data are updated regularly and irregularly • Complete interpretation is not immediately possible • Subsequent interpretations should be backward-compatible • Distinguishing among sources • Minimizing manual modifications • Tracing to source data • Attributing data authors and curators
Approach • Capturing conversion provenance, exposed as linked data: 1 – Following redirects; 2 – Retrieving data file; 3 – Unzipping; 4 – Manual tweaks; 5 – Converter invocation; 6 – Predicate lineage; 7 – Tracing triple to table cell; 8 – Populating endpoint • Parameterized interpretation parameters
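Step 7, tracing a triple back to a table cell, can be illustrated with a toy converter that records the row and column each triple came from. This is a sketch under stated assumptions, not the converter used for the data-gov work: the function name, the example.org URIs, and the sample CSV are all hypothetical.

```python
import csv
import io
from typing import List


def convert(csv_text: str, source_url: str,
            base: str = "http://example.org/") -> List[dict]:
    """Turn each CSV cell into a triple and remember which cell produced it."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    records = []
    for row_number, row in enumerate(body, start=2):   # row 1 is the header
        subject = f"{base}row/{row_number}"
        for column_number, (name, value) in enumerate(zip(header, row), start=1):
            records.append({
                "triple": (subject, f"{base}vocab/{name}", value),
                "source": source_url,      # where the table was retrieved from
                "row": row_number,         # cell coordinates in the source table
                "column": column_number,
            })
    return records


# Hypothetical sample data and source URL.
sample = "agency,amount\nUSAID,100\n"
records = convert(sample, "http://example.org/data.csv")
for record in records:
    print(record["triple"], "from row", record["row"], "column", record["column"])
```

Carrying the cell coordinates alongside each triple is one simple way to keep "tracing to source data" possible after the data has been republished.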
Future Directions • Presenting provenance information in LOGD dataset description pages • Extending visualization APIs to incorporate provenance within the interface • Leveraging provenance connectivity to investigate latent associations among datasets and presentations • (Screenshot: US-UK Foreign Aid Comparison, queried as RDF, providing a direct link to the original data)
Discussion • Provenance is growing in acceptance, need, and type • Some interlinguas have emerged that have significant usage and have shown significant value • Interdisciplinary eScience and open data are increasing the need and potentially the pace. • A few trends we have observed: • Domain-specific extensions can be of value • Techniques for supporting interaction with large diverse communities are needed (we believe user annotation is one such critical technique) • Data aggregators face additional challenges if provenance is not available… and may accelerate the demand for provenance and provenance standards • Getting back to the portion of the source used is critical for some • Tracking manipulations is critical for some • Providing and creating provenance as part of a larger ecosystem is key • Open (govt, science, etc.) data (along with semantic web applications with embedded information about knowledge provenance and term meaning) is providing many new opportunities and will continue to change our lives. • Questions? dlm <at> cs <dot> rpi <dot> edu