A New Enterprise Data Management Strategy for the US Government: Support for the Semantic Web. Brand Niemann, Senior Enterprise Architect, US EPA; Co-chair, Federal Semantic Interoperability Community of Practice (SICoP). Position Paper for the W3C Workshop on RDF Access to Relational Databases, hosted by Novartis, Cambridge, MA, October 25-26, 2007.
Overview • 1. U.S. Government Data • 2. Federal Enterprise Architecture Data Reference Model • 3. SICoP White Papers for the Federal CIO Council • 4. SICoP White Paper Updates for the Federal Community • 5. Federal Statistical Data System • 6. U.S. EPA Report on the Environment 2007 • 7. DRM 3.0 and the Semantic Web • 8. RDF from Data Tables and Relational Databases • 9. Recommendations • 10. Post Script
1. U.S. Government Data • Not readily accessible to search engines and reuse projects: • See June 2007 W3C / WSRI Workshop. • Major data projects need an enterprise architecture for funding: • See Federal Enterprise Architecture. • Working on both of these problems: • Federal Sitemaps Initiative. • Position Paper for This Workshop.
2. Federal Enterprise Architecture Data Reference Model [Figure: diagram with the author's annotations "DRM 1.0", "SICoP", "All Three Unify", "DRM 3.0", and "Ontologies".] Source: Expanding E-Government, Improved Service Delivery for the American People Using Information Technology, December 2005, pp. 2-3. http://www.whitehouse.gov/omb/budintegration/expanding_egov_2005.pdf with annotations by the author.
3. SICoP White Papers for the Federal CIO Council • SICoP White Paper Series Module 1 (February 16, 2005): Introducing Semantic Technologies and the Vision of the Semantic Web ("DRM of the Future"): • W3C Semantic Web and DARPA DAML Program/SICoP Semantic Web Applications for National Security (SWANS) Conference April 7-8, 2005 (40 exhibits). • DRM 2.0 Implementation Guide Version 1.0 (October 15, 2005) and DRM 2.0 Education Pilot. • SICoP White Paper Series Module 2 (January 6, 2006): Semantic Wave 2006 - Executive Guide to the Business Value of Semantic Technologies: • Semantic Wave 2007 Update at the 2007 Semantic Technology Conference. • Also see Four SICoP Contributions to the 2007 Semantic Technology Conference. • SICoP White Paper Series Module 3 (June 18, 2007): Operationalizing the Semantic Web/Semantic Technologies: Advanced Intelligence Community R&D Meets the Semantic Web! (ARDA AQUAINT Program): • A roadmap for agencies on how they can take advantage of semantic technologies and begin to develop Semantic Web implementations. Semantic Interoperability – Yes!
4. SICoP White Paper Updates for the Federal Community • SICoP is working on updates to each of its three White Papers as follows: • 1. Semantic Interoperability Data Management Strategy: Net-Centric Operations Industry Consortium (NCOIC) and Others, Brand Niemann, US EPA (September 2007 Draft): • Semantic Interoperability: The What, Why, Who, and How. • Semantic Interoperability: My NCOIC Roadmap. • 2. Semantic Wave 2008: Industry Roadmap to Web 3.0, Mills Davis, Project 10X (October 2007 Draft): • Semantic Social Computing, Web 2.0 Summit Brief, and Semantic Desktop Pilot (TWINE from Radar Networks). • 3. Semantic Interoperability with Relational Databases (e.g. Data marts and Data warehouses): Solving the Schema Mismatch Problem with Ontology, Lucian Russell, Private Consultant (December 2007 Draft). • In conjunction with the Interoperable Knowledge Representation for Intelligence Support (IKRIS) Program.
5. Federal Statistical Data System • About 200 programs in 70 agencies: • Decennial Census - moving towards more frequent and detailed surveys (e.g. American Community Survey). • Annual Statistical Abstract - most popular government data publication (about 40 chapters in PDF and 1500 data tables in Excel). • FedStats pilot of federation of databases (distributed content network using XML for the data and XML for distributed queries). • Repurposed documents and databases, and recombined data and metadata, to support information sharing and reuse: • 2003 Annual Statistical Abstract. • Data Table Example (presentation) and (XML database with metadata).
5. Federal Statistical Data System • What we would like to do for government data tables in Excel: • MindSwap Utility: The ConvertToRDF tool is designed to take plain-text delimited files, like .csv files dumped from Microsoft Excel, and convert them to RDF. • What we would like to do for a few selected government relational databases: • Digital Harbor Composite Applications Pilots with Business Ontology (Voting and Census Data). • What we would like to do for many selected government relational databases: • Tried in 1999 without the benefit of RDF/OWL and newer technologies.
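The delimited-file-to-RDF idea can be sketched in a few lines. This is not the ConvertToRDF tool itself, only a minimal illustration under stated assumptions: the `http://example.org/stat/` namespace, the column names, and the sample rows are all hypothetical, and every cell is emitted as a plain string literal in N-Triples syntax.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/stat/"  # hypothetical namespace, not a real government URI


def escape_literal(value: str) -> str:
    """Escape a string for use inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(text: str, key_column: str) -> list[str]:
    """Map each CSV row to one subject and each other non-empty cell to one triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}row/{quote(row[key_column], safe='')}>"
        for column, cell in row.items():
            # Skip overflow fields (column is None), the key column, and empty cells.
            if column is None or column == key_column or not cell:
                continue
            predicate = f"<{BASE}column/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(cell)}" .')
    return triples


# Illustrative rows only; these are not values from any government table.
sample = "name,year,value\nSample A,2003,10\nSample B,2003,20\n"
ntriples = csv_to_ntriples(sample, key_column="name")
print("\n".join(ntriples))
```

A real conversion would also need to decide on datatypes, units, and shared vocabularies for the column predicates, which is where the table metadata becomes important.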
6. U.S. EPA Report on the Environment 2007 • Spent lots of time and money on peer review, production of comprehensive metadata, and electronic publication. • Specifically, EPA's 2007 Report on the Environment contains thorough documentation and standard metadata templates for the 86 indicators selected using six criteria based on EPA’s Information Quality Guidelines and a Peer Review Process described in Appendix B of the report. • Basis for showing a New Enterprise Data Management Strategy for the US EPA. • Want to use RDF and reason over this data and metadata.
6. U.S. EPA Report on the Environment 2007 [Table: The Summary Statistics of the Data Asset Database. Footnote: * One question without an indicator.]
6. U.S. EPA Report on the Environment 2007 • The individual data tables with their elements and attributes were compiled into 5 multi-sheet spreadsheets, one for each of the 5 topics in the 2007 EPA Report on the Environment. • The multi-sheet spreadsheet for “water” is shown for the index (table of contents) and the Exhibit 5-2 indicator data tables. • Question: Is this the right thing to tell people to do to get ready for RDF/OWL?
7. DRM 3.0 and the Semantic Web • Knowledgebases are defined as: • A semantic model = ontology (or ontologies) + the database of instances, built as a social contract between those who know how to build them and those who need them (business partners). • An ontology is a formal description of the meaning of the information used by software systems. • Just as relational databases use SQL as a query language, ontologies developed using Semantic Web standards are queried with a query language called SPARQL. • SPARQL is a simple yet powerful language: a single SPARQL query can combine selection criteria based on the data values as well as on their meaning. • Unlike relational databases and SQL, which are tightly bound to a specific data model, ontologies are highly flexible, making it possible to: • (1) easily accommodate changes in the data model, and • (2) create generic queries that work in multiple situations and don't need changing when the data model must change.
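The two flexibility claims can be illustrated with a toy triple store. This is a sketch under stated assumptions and not SPARQL: the `match` helper stands in for a triple pattern with wildcards, and all `ex:` names are hypothetical. In practice a SPARQL engine would play this role. The query selects by meaning (anything that is a kind of `ex:DataAsset`) and by value (topic is "water"), and it keeps working after a new property is added to the data.

```python
# Triples are (subject, predicate, object); every name below is illustrative.
triples = {
    ("ex:Indicator", "rdfs:subClassOf", "ex:DataAsset"),
    ("ex:indicatorA", "rdf:type", "ex:Indicator"),
    ("ex:indicatorA", "ex:topic", "air"),
    ("ex:indicatorB", "rdf:type", "ex:Indicator"),
    ("ex:indicatorB", "ex:topic", "water"),
}


def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def instances_of(store, cls):
    """Find instances of cls or of any class reachable via rdfs:subClassOf."""
    classes, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for sub, _, _ in match(store, p="rdfs:subClassOf", o=current):
            if sub not in classes:
                classes.add(sub)
                frontier.append(sub)
    return sorted({s for c in classes for s, _, _ in match(store, p="rdf:type", o=c)})


def water_assets(store):
    """Select by meaning (is a DataAsset) and by value (topic is water)."""
    return [s for s in instances_of(store, "ex:DataAsset")
            if match(store, s=s, p="ex:topic", o="water")]


before = water_assets(triples)
# The data model changes: a new property appears, with no schema migration.
triples.add(("ex:indicatorB", "ex:peerReviewed", "true"))
after = water_assets(triples)
print(before, after)
```

The query function is unchanged across the model change because it names only the properties it cares about, which is the sense of "generic queries" used on this slide.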
7. DRM 3.0 and the Semantic Web • Building DRM 3.0 Knowledgebases: Where Do the Semantics Come From?: • Interoperable Knowledge Representation for Intelligence Support (IKRIS) has now produced the ISO Common Logic Standard (ISO/IEC 24707:2007). • Building DRM 3.0 for the Federal Community, February 6, 2007: • Free Text (unstructured) - Language Computer Corporation - extract about 40 semantic relationships and build an ontology. • Databases (structured) - Princeton WordNet. • Knowledgebase (reasoning) - OpenCyc. • Data Modeling and OWL: Two Ways to Structure Data, David Hay, Essential Strategies, Inc. • See next slides.
7. DRM 3.0 and the Semantic Web • Data Modeling and OWL: Two Ways to Structure Data, David Hay, Essential Strategies, Inc.: • Objectives of a Data Model: • Capture the semantics of an organization. • Communicate these to the business without requiring technical skills. • Provide an architecture to use as the basis for database design and system design. • Now: Provides the basis for designing Service Oriented Architectures. See http://www.semantic-conference.com/2007/handouts/2-UpBW/Hay_David_2_2UpBW.pdf
7. DRM 3.0 and the Semantic Web • Data Modeling and OWL: Two Ways to Structure Data, David Hay, Essential Strategies, Inc. (continued): • Synopsis: • Both data modeling and ontology languages represent the structure of business data (ontologies). • Data modeling represents data being collected, and filters it according to the rules. • Ontology languages represent data being used, with the ability to have a computer make inferences. • Comment from Lucian Russell (SICoP White Paper 3 Author): • So ontology can improve data quality in legacy systems (David Hay agreed) and solve the Schema Mismatch Problem (recall slide 6).
8. RDF from Data Tables and Relational Databases • At the June 2007 W3C/WSRI Workshop entitled “Toward More Transparent Government on eGovernment and the Web”, SICoP suggested that a clear message about the role of RDF in data exchange and a series of pilots using government data sources would help educate and demonstrate the value of the Semantic Web (aka the Data Web) to the Federal Government. • Consumers and potential consumers of RDF data will provide use cases and goals (e.g. SICoP). • The W3C has a new Semantic Web Layer Cake (see next slide) in which RDF has moved into the XML space and has been expanded with query and rules!
9. Recommendations • A New Enterprise Data Management Strategy for the US Government Based on: • The premise of reusing the data and information rather than changing the data systems themselves: • Putting the business and technical rules, logic, etc. into the data itself using markup languages. • The concepts and standards of the Semantic Web: • Also called the Data Web or Web 3.0. • The most important tenets of the reuse are: • Bring the data and the metadata back together. • Bring the structured and unstructured data and information back together. • Bring the data and information description and context back together. • Looking for partners to work with Federal Government and US EPA data and metadata sources.
10. Post Script • Google 2.0 Embraces Semantic Web: • Google's new Programmable Search Engine might require more work from agency Webmasters, though increased site visibility may result from the effort. Government Computer News, May 18, 2007. • Equity research firm Bear, Stearns & Co. report concludes that Google can become the Semantic Web because of: • Recent patents filed by Ra Guha, co-creator of RDF. • Supporting infrastructure – 400,000 servers in 100 data centers. • Lack of interest/focus by Microsoft. • Ra Guha denied the accuracy of this story and report at the 2007 Semantic Technology Conference, May 22, 2007, SDForum Meeting.
10. Post Script • Common Barriers to Web Search Engine Crawling • EPA Web Sites with Uncrawlable Databases • Strategies for Access to EPA Databases Now Closed to Search Engine Crawlers
Common Barriers to Web Search Engine Crawling • What can make a site effectively invisible to search engine users: • Content “hidden” behind search forms • Non-HTML links • Outdated robots.txt crawling restrictions • Server errors (crawler times out when fetching content) • Orphaned URLs • Rich media: audio, video • Premium content. Source: J.L. Needham (Google), Ensuring government is only one search away: Implementing the Sitemap protocol
EPA Web Sites with Uncrawlable Databases Total: 27 Sample list of EPA Web sites with uncrawlable databases: http://spreadsheets.google.com/pub?key=pUb62ZKHnzgqEoGF4LFf3Gw
Strategies for Access to EPA Databases Now Closed to Search Engine Crawlers • Get Web database vendors (Oracle, IBM, etc.) to support automatic generation of the Sitemaps Protocol Files. • Convert to HTML. • Repurpose to XML. • Repurpose to Semantic Knowledgebases (RDF/OWL).
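The first strategy, generating Sitemaps Protocol files from a database, might look roughly like the sketch below. It is an illustration only: the record identifiers and the `http://example.org/records?id=` URL pattern are hypothetical stand-ins for rows in an agency database, and only the required `<loc>` element of each `<url>` entry is emitted.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org), version 0.9.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a Sitemaps-protocol <urlset> document from an iterable of URLs."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default namespace
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record identifiers and URL pattern standing in for database rows.
record_ids = [101, 102, 103]
urls = [f"http://example.org/records?id={rid}" for rid in record_ids]
sitemap_xml = build_sitemap(urls)
print(sitemap_xml)
```

Listing one crawlable URL per record is what lets a search engine reach content that otherwise sits behind a search form.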