Case Study: Integrating K-12 Education into the National Information Exchange Model. Dan McCreary Dan McCreary & Associates. Background. Dan McCreary - Dan McCreary & Associates President of consulting firm that focuses on metadata-driven IT strategy development infrastructures for:
Case Study: Integrating K-12 Education into the National Information Exchange Model Dan McCreary Dan McCreary & Associates
Background • Dan McCreary - Dan McCreary & Associates • President of consulting firm that focuses on metadata-driven IT strategy development infrastructures for: • Service Oriented Architectures (SOA) • Model Driven Architecture and Development (MDA, MDD) • Data warehousing and Business Intelligence (BI) • Metadata management training • Hired in January of 2005 to build and populate an enterprise-wide metadata registry for the Minnesota Department of Education in partnership with Wisconsin Department of Public Instruction and Michigan Department of Education • Presentation Web site: • http://www.danmccreary.com/presentations/semweb2006
Agenda • Case study of building a “semantic garden” for K-12 metadata with a modest budget for a state agency (~$150K) • A place where your metadata can take root, grow and bloom • Target a broad audience with goal of concept retention – use of images and metaphors
1970 Sci-Fi Classic: “The Forbin Project” A New Intersystem Language! Lesson: Before you take over the world you must exchange semantically precise metadata!
Big Hairy Audacious Goals: Search Agents • Legislator: What statewide programs increase test scores? • District Superintendent: What “subgroups” in my district need the most help in math to meet NCLB guidelines? • School Principal: What areas do new teachers need help in? • Teacher: What areas do my students need the most help to pass statewide assessments?
“Shopping” for Metadata Your “shopping cart” is full of Data Elements
Key Business Drivers • Emphasis on “data driven decision making” • Need for longitudinal data analysis (i.e. a data warehouse) driven by the No Child Left Behind (NCLB) act • Required Consistency across: • Time • School districts • Grade-levels (K-12) • Assessment-subjects (reading, writing, math) • Need for cost-effective application interoperability and the desire to “break down application silos”
Technology Drivers • Desire to promote Service Oriented Architectures (SOA) • Web services • Build a library of exchange documents • Consistent web-form definitions • Desire to promote Model-driven Architecture (MDA) • Model driven development (MDD) • Model driven testing (MDT) • Migration from “procedural” to “declarative” programming • Procedural programming is over-emphasized and makes business logic maintainable only by programmers • Declarative programming and transformation is much more appropriate when large metadata databases are available • Metadata driven systems allow more non-programmers to maintain business logic • Avoid invention of new standards • Desire to “build upon” other machine-readable standards • ISO metadata registries do exist
Promotion of Loosely Coupled Systems • Tightly Coupled: like a wine glass • Fragile • Breaks easily when there are changes in either the source or destination system • Loosely Coupled: like a rubber ball • Resilient • Allows change and interoperability regression testing without breaking interfaces • Example: the addition of new data elements
NCLB • US Department of Education effort to measure student “proficiency” deltas for nine subgroup populations (Asian, Black, Hispanic, Native American, Special Ed etc.) within each state over time and measure incremental gains in achievement levels • Introduced concept of Adequate Yearly Progress (AYP) for a School and School District – (if any sub-group fails your school and district fail) • Each state defines “proficiency” independently so state-to-state comparisons are not practical at this time • Multiple political interpretations of NCLB not discussed here: • Republican vs. Democratic • Rural vs. Suburban vs. Inner City • Public vs. Private Educational Funding • US Dept of Ed. releasing $53 million in grants for longitudinal data systems to individual states • Message from the Department of Education: “Build your statewide assessment metadata garden”
US Department of Justice/Department of Homeland Security initiative to build a federal metadata registry based on Global Justice XML Data Model (GJXDM) project • Complies with federal ISO/IEC 11179 metadata registry guidelines (with a few exceptions) • Introduced very successful tools for subschema generation in conjunction with large ontologies in building XML exchange documents • Introduced concepts of “Universal” and “Core” classification schemes • Available today in an XML Schema and an Excel spreadsheet • Subschema generation tools may be available in 3Q of 2006
You Are Here NIEM Scope Source: http://www.niem.gov/implementation.php
NIEM Type “Classification Scheme” [Diagram: layers labeled Universal, Common and Domain Specific; type labels shown: Student, Teacher, Aircraft, Activity, Address, Street, Contact, Document, Boat, Location, Person, Assessment, Organization, Case, Image, Long/Lat, Event, Residence, Clothing, Vehicle]
High Level Structure of the NIEM • The NIEM loosely follows ISO-11179 metadata registry guidelines • The structure is a subclass hierarchy of “Concepts” • Start with an abstract Thing • Start with shared upper-ontology “Concepts” (blue) • Add properties that each have Representation Terms (orange) • Add subclasses and then subclass properties (yellow) [Diagram labels: Thing; ConceptType; PropertyType; Activity; Document; Organization; Person; PersonBirthDate; ActivityStartDate; ActivityEndDate; Student; Teacher; Enrollment; EnrollmentStateDate; StudentStateAssignedID; Education Extensions]
Reuse and Extension Strategy • Match: If a NIEM data element met our needs, we used the NIEM data element and created an OWL sameAs statement with a high-precision match (Note: The definitions must match exactly) • Trim: If a NIEM data element has more detail than we needed, we created a local definition but created a sameAs link with a lower precision match level. • Extend: If the NIEM doesn’t have everything we need, create a local definition, add to the definition and create a sameAs link with a medium match level. • New: If there is no data element that matches what we need, we create a new one and put it in our local namespace. • Submit: If this is not a state-specific data element and we think other states may use it, we can submit it to the NIEM for inclusion.
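A Semantic Equivalence Registry • Goal: create semantic maps to a single federal metadata standard, not many standards • Mapping from Minnesota's metadata registry to N other metadata registries, with every registry mapping directly to every other: the O(N²) problem • Mapping from Minnesota's metadata registry to the NIEM, with every registry mapping only to the NIEM: the O(N) problem [Diagram: on the left, registries R2 through RN each linked directly to the others; on the right, registries R2 through RN each linked only to the NIEM]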
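The difference between the two mapping approaches on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes each direction of a registry-to-registry mapping is maintained separately (counting each pair once would halve the first figure but keep the quadratic growth).

```python
def point_to_point_mappings(n: int) -> int:
    """Mappings needed when each of n registries maps directly to every other one.

    Assumption: A-to-B and B-to-A are counted as two separate mappings.
    """
    return n * (n - 1)


def hub_mappings(n: int) -> int:
    """Mappings needed when each of n registries maps only to one shared hub
    (for example, the NIEM)."""
    return n


if __name__ == "__main__":
    for n in (3, 10, 50):
        print(n, point_to_point_mappings(n), hub_mappings(n))
```

With 50 registries, the point-to-point approach needs 2,450 mappings under this assumption, while the hub approach needs 50.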
ISO/IEC 11179 XML Tag Name • A standard naming convention, used by most state and federal agencies that follow the ISO guidelines, for all XML data elements that “cross the wire” • Frequently called the “Data Element ISO name” • Example: niem:PersonBirthDate • Namespace (domain): niem • Object Class Term (leftmost): Person • Property Term (follows object class term): Birth • Representation Term (rightmost): Date
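The naming convention on this slide can be sketched as a small parser. This is a minimal illustration, not part of the original toolset: the `IsoName` type and `parse_iso_name` function are hypothetical, and the sketch assumes the object class term is a single leading word, the representation term is a single trailing word, and everything in between is the property term.

```python
import re
from typing import NamedTuple


class IsoName(NamedTuple):
    namespace: str
    object_class_term: str
    property_term: str
    representation_term: str


# Split an UpperCamelCase name into words, keeping acronyms such as "ID" whole.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+")


def parse_iso_name(tag: str) -> IsoName:
    """Split a tag like 'niem:PersonBirthDate' into its ISO name parts.

    Assumption: one-word object class (leftmost) and one-word
    representation term (rightmost); the middle words form the property term.
    """
    namespace, sep, local = tag.partition(":")
    if not sep or not local:
        raise ValueError(f"expected 'namespace:Name', got {tag!r}")
    words = _WORD.findall(local)
    if len(words) < 3:
        raise ValueError(f"expected at least three words in {local!r}")
    return IsoName(namespace, words[0], "".join(words[1:-1]), words[-1])


if __name__ == "__main__":
    print(parse_iso_name("niem:PersonBirthDate"))
    print(parse_iso_name("niem:StudentStateAssignedID"))
```

Names with a multi-word object class would need a lookup against the registry's list of object classes rather than this positional rule.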
The Data Mapping: The “Frontline” of Semantics • Left: A sample School District “flat file dump” from the Learning Management System (e.g. Moodle) of one school district (many data elements omitted for clarity) • Right: A mapping to an ISO-named and defined Statewide XML schema standard for online learning classes. Note how much easier the names and definitions make it to tell the semantics of each data element quickly. Screen shot from Altova MapForce™
Need a “Semantically Aware” Mapper • Goal: Add menu item for “Consult Semantic Broker” • Mapping tools have “auto connect matching children” but they require that the data element names be identical • They do not yet have the ability to “look up synonyms” in a metadata registry to establish the equivalence of two data elements • We need semantic-aware tools!
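A semantically aware “auto connect” could be sketched as follows. This is a hedged illustration only: `auto_connect`, the synonym table, and the source column name `stu_dob` are hypothetical, and a plain dictionary stands in for the metadata registry lookup that current mapping tools lack.

```python
from typing import Dict, Iterable, Set


def auto_connect(
    source_names: Iterable[str],
    target_names: Iterable[str],
    synonyms: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Connect source elements to target elements.

    First try an identical-name match (what mapping tools do today), then
    fall back to a synonym lookup (the stand-in for a semantic broker).
    """
    targets = set(target_names)
    mapping: Dict[str, str] = {}
    for name in source_names:
        if name in targets:
            mapping[name] = name
            continue
        # Sort so the choice is deterministic if several synonyms match.
        candidates = sorted(synonyms.get(name, set()) & targets)
        if candidates:
            mapping[name] = candidates[0]
    return mapping


if __name__ == "__main__":
    synonym_registry = {"stu_dob": {"PersonBirthDate"}}  # hypothetical entry
    result = auto_connect(
        ["stu_dob", "PersonGivenName", "lunch_code"],
        ["PersonBirthDate", "PersonGivenName"],
        synonym_registry,
    )
    print(result)
```

Source elements with neither an identical name nor a registered synonym (here `lunch_code`) are left unmapped for a person to resolve.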
Constrain Exchange Document Data Element Selection Schema creation using Altova XMLSpy™ and importing a GJXDM subschema • When creating an exchange document, we can now quickly select data elements from a list derived from a metadata registry that has semantically-precise definitions and namespaces • This can be done by business analysts (B.A.s) with under a week of training and does not require programmers • Constraints can be added to this document or a second constraint schema
Hypertext Links and Data Element Links • The Hypertext Web: The current web is focused on linking published documents with HTML • The Semantic Web: The semantic web is about linking data elements in published metadata registries [Diagram: data elements in Metadata Registry A linked to data elements in Metadata Registry B]
Challenges: Education Standards • Lack of machine-readable metadata registries for K-12 metadata with synonyms • Many standards • Minnesota historical 80-column fixed-width punch-card driven file format standards • US Dept of Ed. National Center for Education Statistics (NCES) • Common Core of Data (CCD) • Education Data Exchange Network (EDEN) • SchoolMatters • Schools Interoperability Framework (SIF) • eXtensible Business Reporting Language (XBRL) • No published synonyms in any of the above standards • As of December 2005, no K-12 education-specific data elements in the NIEM metadata registry • Lack of useful data element definitions: • Document: “Details about inherent and frequently used characteristics of a document.”
Metadata Publishing Standards • Lack of a single standard to publish metadata elements (XML Schema, Topic Maps, ISO/IEC-11179, OWL, XMDR) that includes metadata registry concepts • OWL is one of the few standards with “synonym” statements, but few tools currently support OWL and inter-metadata registry synonym statements • OWL appears to be the best candidate for “over the wire” representations and the most easily extensible, but it is not a metadata registry standard
Challenge: We Need Semantic Aware Tools • Lack of semantically-precise production tools • Altova XMLSpy™ – excellent graphical schema design and management but no semantics in the XML schema standards • Stanford Medical Informatics Protégé (Open-Source) • Altova SemanticWorks™ (1st release in October of 2005) • ISO/IEC 11179 metadata registry tools are expensive • Frequently above $100K before customization • Some lack workflow and public/private publishing • Several excellent solutions if you have >$1M budget and consulting dollars • Ideal: A zero-footprint, AJAX-based, drag-and-drop, semantically-aware Open-Source schema design and data mapping tool that consults one or more synonym registries • Predict this is 3-4 years away (unless I get a grant)
Tools Used • Built initial version using a collection of Open-Source tools and inexpensive Altova tools (XMLSpy™, MapForce™ and SemanticWorks™) • Model-driven development using XML Schemas for the model of the registry • Define XML Schemas for all metadata registry structures (meta-metadata) • XSL transforms of the data dictionary schema • XSL transforms of the XSL transforms for impact analysis • XML Transforms for metadata publishing and visualizations • Apache Ant build scripts to publish to public web site and private intranet site • Eclipse 3.1 IDE to build and maintain Ant scripts • Saxon 8 XSLT Java libraries • Extensive use of XSLT 2.0 and XPath 2.0 • FreeMind open source mind mapping tool with excellent XML interfaces • Various data element editing forms • (Castor, Struts, JSP, ASP, MS-Access)
Diagram From ISO-11179 Specification [Diagram boxes: Data Element Concept, Data Element, Object Class, Property, Representation, joined by relationships marked (1:1) and (1:N)] Taken from Figure 1 "Fundamental Model for Data Elements" ISO/IEC 11179-1:2004(E) page 11 (non-normative)
UML Model for RDF [Diagram boxes: RDF Statement with Subject, Predicate and Object; ResourceValuedStatement (Object is a Resource); LiteralValuedStatement (Object is a Literal); Property; TypedLiteral] See Lee W. Lacy: OWL: Representing Information Using the Web Ontology Language p 82
UML Model of Metadata Registry • A Data Dictionary is composed of many Data Elements • All Data Elements must have required names and ISO definitions • Each Data Element must be either a Concept or a Property of a Data Element Concept • Each property is associated with a single concept and has a Property Name and a Representation Term • Some properties (where the representation term is of type Code) have one or more Enumerated Values (simplified for clarity) [Diagram boxes: Data Dictionary; Data Element (Name, ISO-Definition); Data Element Concept (subClassOf); Property (Property Name, Representation Term); Enumerated Value (Code, Definition)]
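The registry model described on this slide can be sketched with a few data classes. This is an assumption-laden illustration rather than the registry's actual schema (which the deck says was defined in XML Schema): the class names, field names and sample definitions are hypothetical, and the rule that only Code properties carry enumerated values is my reading of the last bullet.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    name: str
    definition: str
    sub_class_of: Optional["Concept"] = None

    def __post_init__(self) -> None:
        if not self.name or not self.definition:
            raise ValueError("a data element needs a name and a definition")


@dataclass
class EnumeratedValue:
    code: str
    definition: str


@dataclass
class Property:
    concept: Concept            # each property belongs to a single concept
    property_name: str
    representation_term: str    # e.g. "Date", "Code", "Text"
    definition: str
    enumerated_values: List[EnumeratedValue] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.property_name or not self.definition:
            raise ValueError("a data element needs a name and a definition")
        if self.enumerated_values and self.representation_term != "Code":
            raise ValueError("only Code properties carry enumerated values")

    @property
    def iso_name(self) -> str:
        """Object class term + property term + representation term."""
        return f"{self.concept.name}{self.property_name}{self.representation_term}"


if __name__ == "__main__":
    person = Concept("Person", "A human being.")  # illustrative definition
    student = Concept("Student", "A person enrolled in a school.", person)
    birth_date = Property(person, "Birth", "Date", "The date a person was born.")
    print(birth_date.iso_name, "| Student is a subclass of", student.sub_class_of.name)
```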
Representation Terms (ebXML Core Component Tech Spec v1.9) • Amount – Monetary value with units of currency. • BinaryObject – Set of finite-length sequences of binary octets. (secondary: Graphic, Picture, Sound, Video) • Code – Character string that for brevity represents a specific meaning where the values are enumerated and each value has a clear definition. • DateAndTime – Date + time; a point in time where both date and time are known. (secondary: Date, Time) • Identifier – Character string used to establish identity of, and uniquely distinguish one instance of an object within an ID scheme. (authorized abbreviation: ID) • Indicator – Boolean (exactly two mutually exclusive values). • Measure – Numeric value determined by measurement with units. • Number – Assigned or determined by calculation. (secondary: Value, Rate, Percent) • Quantity – Non-monetary numeric value or count with units. • Text – Character string generally in the form of words. (secondary: Name)
Publishing Metaphor • Publishing implies high-quality information is shared with a large audience • Emphasis on multi-state reviews and clarity to a diverse base of consumers • Commitment to accuracy and change control
The Psychology of Sharing and Trust • Research done in mid-1990s by Adele Goldberg and others • Groups only tend to share objects with other people or systems they trust • We need to create systems for building trust • Define a peer review process (see 11179 standards) • Have experts with credibility play a role in approval • Publish list of users of metadata • Publish test cases • Publish change control process • Publish success stories
Initial Draft Metadata Publishing Workflow Funnel • Develop a simple workflow system for publishing data elements • Include harvesting areas of simple glossary-of-terms found in documentation, web sites and by using metadata “scrapers” to inventory all columns in relational database systems • Get stakeholder teams to “accept” data elements, review them and take on the data stewardship role for these data elements • Commit to change-control only after data elements are marked “approved for publication” by over 50% of the stewardship team • Exclude sensitive information from public web sites (data sources) [Funnel diagram labels: Metadata Harvesters; Glossary Of Terms; Under Review; Approved for Publication]
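The “over 50% of the stewardship team” rule on this slide can be written down precisely. The function below is a hypothetical sketch, not the workflow system that was built; it assumes each steward casts at most one approval and that exactly half is not enough.

```python
def is_approved_for_publication(approvals: int, team_size: int) -> bool:
    """Return True when strictly more than half of the stewardship team approved."""
    if team_size <= 0:
        raise ValueError("the stewardship team must have at least one member")
    if not 0 <= approvals <= team_size:
        raise ValueError("approvals must be between 0 and team_size")
    # Integer comparison avoids floating-point rounding at exactly 50%.
    return approvals * 2 > team_size


if __name__ == "__main__":
    print(is_approved_for_publication(3, 6))  # exactly half
    print(is_approved_for_publication(4, 6))  # more than half
```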
Model-Driven Development [Diagram labels: XML Form Editors; RDBMS; Subversion; Data Elements (500 Small XML Files); Data Dictionary (Single, Large XML File); Transforms (Saxon 8); Apache Ant; outputs: HTML, PDF, FreeMind, MindManager, SQL, Excel, OWL, OLAP Cubes; targets: Intranet, Public Web Server, Protégé, SemanticWorks]
Visualization • People will not trust what they don’t understand • They tend to understand concepts if you make them clear • Visualizations are the best way to promote clarity to a subgroup • Focus attention and remove “chart junk” • Quickly display a subgroup’s data elements under review • Let them pick the colors! • 50 line XSLT Sample from FreeMind: Open Source mind mapping tool
Results http://education.state.mn.us/datadictionary
Store Semantic Mappings to Foreign Data Elements Directly in the Metadata Registry • Current metadata registry standards do not clearly specify where and how semantic equivalence and precision are stored.
owl:sameAs and owl:equivalentClass • OWL is different from XML Schema because it addresses data element semantics • XML Schema has no way of declaring two data types as “equivalent” • XML Schema was designed to create a way to validate a data set used in messaging systems • OWL was designed to manage metadata • Example: • OWL class equivalence operator: “equivalentClass” • OWL “sameAs” operator for instance equivalence • NIEM:Person = SUMO:Human = CYC:Individual [Diagram: Metadata Equivalence Mappings between Metadata Registry A and Metadata Registry B]
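The chain of equivalences in the example above can be resolved with a small union-find structure. This is a simplified sketch, not an OWL reasoner: `equivalence_classes` is a hypothetical helper that treats every sameAs or equivalentClass statement as a plain symmetric, transitive link and ignores any match-precision level attached to it.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple


def equivalence_classes(
    statements: Iterable[Tuple[str, str]],
) -> List[FrozenSet[str]]:
    """Group terms connected by pairwise equivalence statements."""
    parent: Dict[str, str] = {}

    def find(term: str) -> str:
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for left, right in statements:
        parent[find(left)] = find(right)

    groups = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)
    return [frozenset(group) for group in groups.values()]


if __name__ == "__main__":
    classes = equivalence_classes([
        ("NIEM:Person", "SUMO:Human"),
        ("SUMO:Human", "CYC:Individual"),
    ])
    print(classes)
```

Because the two statements share `SUMO:Human`, all three terms end up in one class even though no statement links `NIEM:Person` to `CYC:Individual` directly.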
Future: Semantic Mappers and Semantic Brokers [Diagram labels: Report Request in Model A; Metadata Translation Service; Metadata Registry; RDF Queries; Metadata Mappings; XML Results; Model A; Model B; SQL or XMLA Queries in Model B; Data Warehouse (RDBMS); TDS in Model B; XML Response in Model A] • Gartner: Vocabulary-based transformation • XMLA: XML for Analysis
What Data Elements Are Important? • It costs time and money for each data element you add to your metadata registry (over $1,000 per data element) • The more unimportant data elements are in your metadata registry, the harder it becomes to detect duplicates • Prioritization criteria should be developed to determine what Data Elements should have priority • Metadata “scraping tools” developed to pull candidate Data Elements from databases, spreadsheets and documents • We developed six-step criteria for determining the value of a data element in the data dictionary • Anything can be in a Glossary but only about 10% of Glossary data items are promoted to a data element [Diagram labels: Low Value Data Elements; High Value Data Elements]
Wikipedia Rocks! • It is currently burdensome to add new metadata to the registry • Would like to add “Edit this data element” (à la wikis) • Ideally a “Semantic Wiki” See: Wikipedia: “Semantic Wiki”
Wantlist Standards
<?xml version="1.0" encoding="UTF-8"?>
<w:wantList w:release="3.0.3" xmlns:w="http://gjxdmtools.gtri.gatech.edu/wantList/1">
  <w:element w:prefix="j" w:name="ContactEmailID" w:isReference="false"/>
  <w:element w:prefix="j" w:name="ContactTelephoneNumber" w:isReference="false"/>
  <w:element w:prefix="j" w:name="Person" w:isReference="false"/>
  <w:element w:prefix="j" w:name="PersonBirthDate" w:isReference="false"/>
  <w:element w:prefix="j" w:name="PersonGivenName" w:isReference="false"/>
  <w:element w:prefix="j" w:name="PersonSurName" w:isReference="false"/>
</w:wantList>
• Metadata management tools could share data element wantlists with other tools. • If you don’t have an appropriate data element, you should be able to look it up in a clearinghouse of metadata with precise ISO definitions (e.g. Swoogle) • Web service queries and metadata translation services could be used
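As an illustration of how another tool might consume a shared wantlist, the sketch below reads the wantlist shown on this slide with Python's standard `xml.etree.ElementTree` module. The `wanted_elements` function is hypothetical; only the XML itself comes from the slide.

```python
import xml.etree.ElementTree as ET
from typing import List

WANTLIST_NS = "http://gjxdmtools.gtri.gatech.edu/wantList/1"

WANTLIST_XML = """<?xml version="1.0" encoding="UTF-8"?>
<w:wantList w:release="3.0.3" xmlns:w="http://gjxdmtools.gtri.gatech.edu/wantList/1">
  <w:element w:prefix="j" w:name="ContactEmailID" w:isReference="false"/>
  <w:element w:prefix="j" w:name="ContactTelephoneNumber" w:isReference="false"/>
  <w:element w:prefix="j" w:name="Person" w:isReference="false"/>
  <w:element w:prefix="j" w:name="PersonBirthDate" w:isReference="false"/>
  <w:element w:prefix="j" w:name="PersonGivenName" w:isReference="false"/>
  <w:element w:prefix="j" w:name="PersonSurName" w:isReference="false"/>
</w:wantList>"""


def qualified(local_name: str) -> str:
    """Build the '{namespace}name' form ElementTree uses for namespaced names."""
    return f"{{{WANTLIST_NS}}}{local_name}"


def wanted_elements(xml_text: str) -> List[str]:
    """Return the requested data elements as 'prefix:Name' strings."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        f"{element.get(qualified('prefix'))}:{element.get(qualified('name'))}"
        for element in root.findall(qualified("element"))
    ]


if __name__ == "__main__":
    for name in wanted_elements(WANTLIST_XML):
        print(name)
```

Note that both the elements and their attributes carry the `w:` prefix, so each attribute has to be looked up by its namespace-qualified name.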
McCreary’s Top 10 Recommendations • Organizations and applications that exchange data should be encouraged to publish their metadata in a machine-readable format to facilitate agent interoperability • Published data dictionaries should drive exchange document creation standards and published web services, and metadata registry “shopping cart” tools should be accessible to non-programmers • Data warehouse initiatives should attempt to reuse and integrate existing federal metadata standards • Federal and state agencies should follow ISO/IEC 11179 and Data Reference Model (DRM) guidelines and use formal representation terms for all data element properties • Fundamentals of metadata publishing and transformation training should be encouraged by data architects and integration managers • Metadata standards should continue to be developed with the goal of building semantic integration brokers and agents • Producers of data mapping software should integrate semantic equivalency statements into automated mapping systems • XML integration appliance vendors should include semantic integration services to make integration easier • Organizations should perform ROI analysis on semantic integration • Awards should be given to organizations that publish useful and high-quality metadata
Things to Ponder… • Just like the ARPANET and DAML, some worthy standards come from US federally funded efforts. But they will need to “evolve” before they are widely adopted outside government projects. • Before you “take over the world”, you need to publish your metadata with your stakeholders • Metadata publishing is 80% social engineering and 20% technical engineering and is achieved through building shared meaning via trust building systems • Standards are complex. Sometimes the more general they are, the more widely adopted they are but the more abstract they become. Some standards frequently need an expert interpreter to adjust for local business needs • People need to understand something before they trust it. One of the best ways is to build tools to allow users to visualize their data elements • When planting a metadata garden, start small and keep weeding out the unimportant and redundant data elements
Metadata Publishing Opens the Door to the Semantic Web! • Metadata publishing is hard • It is a foundation upon which the Semantic Web will be built • The benefits are indirect and need strong executive sponsorship • Metadata publishing is no “silver bullet” • I believe it is the most direct way to get to the Semantic Web • This will be the most practical way to build intelligent agents [Diagram label: Agents]
References • Web site for paper: • www.danmccreary.com/presentations/semweb2006 • Data dictionary for Minnesota Department of Education • education.state.mn.us/datadictionary • ISO-11179 metadata registry standards • National Information Exchange Model (NIEM.gov) • Wikipedia Articles • Metadata registry • ISO/IEC 11179 • Representation term • Metadata publishing • Semantic broker
Questions & Answers If software is ever going to be able to effectively inter-operate (in ways that were not explicitly preconceived and engineered), it will be because applications share enough of the semantics of their data elements. Doug Lenat, Cycorp Semantic Technology Conference 2005
Contact Information Dan McCreary, President Dan McCreary & Associates Dan <at> danmccreary.com http://www.danmccreary.com also: http://www.LinkedIn.com