Naming in XML Documents

Dr. Ramon Lawrence, IDEA Lab, University of Iowa, ramon-lawrence@uiowa.edu

Outline • Motivation • Overall Goals • Background • Naming and Ontologies • Semantic Naming of XML Elements • Semantic Querying of Named XML Documents • Support for Document Evolution and Linking • Future Work and Conclusions

Motivation • Motivation #1 - Naming is important despite limited research focus. • Names are a gateway to structure, but can also be used to avoid structure. • Users understand names better than structure, but naming is not considered in many models. • Motivation #2 - XML querying can be improved by minimizing use of path expressions. • XML query languages are complex and highly structure-based (even more so than SQL). • Path expressions resemble navigation in hierarchical models, which proved undesirable. • Queries cannot adapt to document changes.

Are Names Really That Important?
DTD with Decent Naming
<!ELEMENT list-manufacturer (manufacturer+)>
<!ELEMENT manufacturer (mn-name, model+)>
<!ELEMENT mn-name (#PCDATA)>
<!ELEMENT model (mo-name, year, front-rating, side-rating, rank, vehicle+)>
<!ELEMENT mo-name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT front-rating (#PCDATA)>
<!ELEMENT side-rating (#PCDATA)>
<!ELEMENT rank (#PCDATA)>
<!ELEMENT vehicle (color, price, vendorName, option+)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT vendorName (#PCDATA)>
<!ELEMENT option (#PCDATA)>
Poorly Named DTD
<!ELEMENT LM (M+)>
<!ELEMENT M (MN, MO+)>
<!ELEMENT MN (#PCDATA)>
<!ELEMENT MO (N, Y, F, S, R, V+)>
<!ELEMENT N (#PCDATA)>
<!ELEMENT Y (#PCDATA)>
<!ELEMENT F (#PCDATA)>
<!ELEMENT S (#PCDATA)>
<!ELEMENT R (#PCDATA)>
<!ELEMENT V (C, P, VN, O+)>
<!ELEMENT C (#PCDATA)>
<!ELEMENT P (#PCDATA)>
<!ELEMENT VN (#PCDATA)>
<!ELEMENT O (#PCDATA)>

Overall Goal • The overall goal is to develop a naming methodology for XML tags that has two desirable properties: • 1) Provides more semantics and context information to users. • 2) Allows semantic querying of XML documents to simplify query formulation and handle document evolution. • The naming methodology must NOT enforce a strict standard on naming, but encourage better naming by providing a useful technique.

Background: XML Tag Names and Standards • The development of standard tag sets for given problem domains has been the focus of many organizations. • ebXML, RosettaNet, CML, XFRML, MathML • Our goal is not to define THE tag set for all XML, but rather to suggest a methodology for constructing tag sets. • Applicable to the Semantic Web effort.

Background: XML Querying • Many XML query languages have been proposed: • LOREL, XML-QL, XML-GL, XSL, XQL • Even the graphical XML query language, XML-GL, only supports querying with path expressions. • Why would we go back in time and make querying harder for the user? • The relational model replaced the hierarchical model because of its declarative query syntax.

Running Example

Converting the ER Model to XML • Modeling in XML requires a decision on how to hierarchically organize the information in the XML document. • Once selected, the hierarchical organization becomes the only view of the data and requires the user to formulate queries based on the hierarchy chosen. • Nesting of elements in XML has ambiguous semantics, as the nesting may represent: • specialization/generalization (IS-A), Part-Of/HAS-A, ordering, grouping, general relationship (join) • Without tag names, it is impossible to determine the relationship between nested elements.

Two XML DTDs for ER Diagram (1)
DTD1
<!ELEMENT list-manufacturer (manufacturer+)>
<!ELEMENT manufacturer (mn-name, model+)>
<!ELEMENT mn-name (#PCDATA)>
<!ELEMENT model (mo-name, year, front-rating, side-rating, rank, vehicle+)>
<!ELEMENT mo-name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT front-rating (#PCDATA)>
<!ELEMENT side-rating (#PCDATA)>
<!ELEMENT rank (#PCDATA)>
<!ELEMENT vehicle (color, price, vendorName, option+)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT vendorName (#PCDATA)>
<!ELEMENT option (#PCDATA)>

Two XML DTDs for ER Diagram (2)
DTD2
<!ELEMENT list-vendor (vendor+)>
<!ELEMENT vendor (vendorName, vehicle+)>
<!ELEMENT vendorName (#PCDATA)>
<!ELEMENT vehicle (color, price, op-name+, mn-name, model)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT op-name (#PCDATA)>
<!ELEMENT model (mo-name, year, front-rating, side-rating, rank)>
<!ELEMENT mo-name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT front-rating (#PCDATA)>
<!ELEMENT side-rating (#PCDATA)>
<!ELEMENT rank (#PCDATA)>
Differences from DTD1: 1) Different hierarchical organization 2) Different modeling of manufacturer (mn-name) 3) Naming differences (op-name)

A Simple Query on Both DTDs • Query: Return the manufacturer name and vehicle price for all vehicles with price < $30,000 whose model is in the top 10 for safety tests.
DTD1:
select M.mn-name, M.model.vehicle.price
from list-manufacturer.manufacturer M
where M.model.rank <= 10 and M.model.vehicle.price < 30000
DTD2:
select V.mn-name, V.price
from list-vendor.vendor.vehicle V
where V.model.rank <= 10 and V.price < 30000
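To make the structure dependence concrete, here is a minimal Python sketch that runs the same query against one document per DTD using the standard library's ElementTree. The sample documents (Acme, Roadster, City Cars and all values) are invented for illustration and do not come from the slides; the slides express the query in a LOREL-style syntax, not Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like DTD1 (manufacturer > model > vehicle).
DOC1 = """<list-manufacturer><manufacturer><mn-name>Acme</mn-name>
<model><mo-name>Roadster</mo-name><year>2001</year>
<front-rating>5</front-rating><side-rating>4</side-rating><rank>3</rank>
<vehicle><color>red</color><price>25000</price><vendorName>City Cars</vendorName><option>sunroof</option></vehicle>
<vehicle><color>blue</color><price>35000</price><vendorName>City Cars</vendorName><option>sunroof</option></vehicle>
</model></manufacturer></list-manufacturer>"""

# The same hypothetical data shaped like DTD2 (vendor > vehicle > model).
MODEL = ("<model><mo-name>Roadster</mo-name><year>2001</year>"
         "<front-rating>5</front-rating><side-rating>4</side-rating><rank>3</rank></model>")
DOC2 = ("<list-vendor><vendor><vendorName>City Cars</vendorName>"
        "<vehicle><color>red</color><price>25000</price><op-name>sunroof</op-name>"
        "<mn-name>Acme</mn-name>" + MODEL + "</vehicle>"
        "<vehicle><color>blue</color><price>35000</price><op-name>sunroof</op-name>"
        "<mn-name>Acme</mn-name>" + MODEL + "</vehicle>"
        "</vendor></list-vendor>")


def query_dtd1(root):
    """Manufacturer name and price: price < 30000 and model rank <= 10 (DTD1 paths)."""
    results = []
    for manufacturer in root.findall("manufacturer"):
        name = manufacturer.findtext("mn-name")
        for model in manufacturer.findall("model"):
            if int(model.findtext("rank")) <= 10:
                for vehicle in model.findall("vehicle"):
                    price = int(vehicle.findtext("price"))
                    if price < 30000:
                        results.append((name, price))
    return results


def query_dtd2(root):
    """The same query, rewritten because DTD2 nests the data differently."""
    results = []
    for vehicle in root.findall("vendor/vehicle"):
        price = int(vehicle.findtext("price"))
        if int(vehicle.findtext("model/rank")) <= 10 and price < 30000:
            results.append((vehicle.findtext("mn-name"), price))
    return results


answer1 = query_dtd1(ET.fromstring(DOC1))
answer2 = query_dtd2(ET.fromstring(DOC2))
```

Both functions answer the same question, yet neither works on the other document: every path has to be rewritten when the hierarchy changes.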

Ontologies and Naming • Assume the existence of some ontology from which to extract terms with definitions. • May use WordNet or a problem-specific ontology. • Assumption: Human users have a “built-in” ontology, or view of the world, based on their experience and knowledge of the language. • By selecting common terms from a shared dictionary (language), both the producer (XML document source) and the consumer (XML document user) will understand the semantics of a data element from the terms used to define the name. • Caveat: Understanding is to some degree of accuracy. (Hopefully >= 90%).

Ontologies and Naming (2) • Assumption #2: As more context information is provided by the producer (in the form of additional terms), the consumer is more confident that their world view is consistent with that of the producer. • The consumer understands the producer's view even if they originally do not share the same view. • Important: At no time is any intelligence demonstrated by the software. The intelligence is embedded into the names assigned by the producer, and extracted by the consumer. • The system never needs to build its own world view to aid the users in reconciling theirs.

Structure of a Semantic Name • A semantic name is a tag name for an XML element of the following form: • semantic_name ::= [CT_Term] | [CT_Term].PN • CT_Term ::= CT | CT ; CT_Term | CT , CT_Term • CT ::= <dictionary term> • PN ::= <dictionary term> • A semantic name is intended to capture structure-independent semantics by combining multiple dictionary terms.
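The grammar above is small enough to check mechanically. The following Python sketch parses a semantic name into its context terms and optional property name. It assumes a dictionary term is a run of letters, digits, underscores, or spaces, which the slides do not specify. The slides also do not say how the `;` and `,` separators differ in meaning, so the parser only records which one was used. The names `SemanticName` and `parse_semantic_name` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: lexical form of a <dictionary term>; not defined on the slides.
_TERM = r"[A-Za-z][A-Za-z0-9_ ]*"
# semantic_name ::= [CT_Term] | [CT_Term].PN
_NAME = re.compile(
    rf"^\[\s*({_TERM}(?:\s*[;,]\s*{_TERM})*)\s*\](?:\.({_TERM}))?$"
)


@dataclass
class SemanticName:
    context_terms: List[str]       # the CT terms, in order
    separators: List[str]          # ';' or ',' between consecutive CT terms
    property_name: Optional[str]   # PN, or None for a bare [CT_Term]


def parse_semantic_name(text: str) -> SemanticName:
    """Parse a name such as '[Manufacturer;Model].Name' per the grammar."""
    match = _NAME.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    body, prop = match.group(1), match.group(2)
    # Splitting on a captured separator keeps the separators in the result:
    # even positions hold terms, odd positions hold ';' or ','.
    parts = re.split(r"\s*([;,])\s*", body)
    terms = [part.strip() for part in parts[0::2]]
    return SemanticName(terms, parts[1::2], prop.strip() if prop else None)


rank = parse_semantic_name("[Manufacturer;Model;NHSCTest].Rank")
vehicle = parse_semantic_name("[Vehicle]")
```

Here `rank.context_terms` holds the three context terms and `rank.property_name` holds `Rank`, while `vehicle` is a pure concept with no property name.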

DTD1 with Semantic Naming
<!ELEMENT V (Manufacturer+)>
<!ELEMENT Manufacturer (Manufacturer--Name, Manufacturer-Model+)>
<!ELEMENT Manufacturer--Name (#PCDATA)>
<!ELEMENT Manufacturer-Model (Manufacturer-Model--Name, Manufacturer-Model--Year, Manufacturer-Model-NHSCTest--FrontRating, Manufacturer-Model-NHSCTest--SideRating, Manufacturer-Model-NHSCTest--Rank, Vehicle+)>
<!ELEMENT Manufacturer-Model--Name (#PCDATA)>
<!ELEMENT Manufacturer-Model--Year (#PCDATA)>
<!ELEMENT Manufacturer-Model-NHSCTest--FrontRating (#PCDATA)>
<!ELEMENT Manufacturer-Model-NHSCTest--SideRating (#PCDATA)>
<!ELEMENT Manufacturer-Model-NHSCTest--Rank (#PCDATA)>
<!ELEMENT Vehicle (Vehicle--Color, Vehicle--Price, Vendor--Name, Vehicle-Option--Name+)>
<!ELEMENT Vehicle--Color (#PCDATA)>
<!ELEMENT Vehicle--Price (#PCDATA)>
<!ELEMENT Vehicle-Option--Name (#PCDATA)>
<!ELEMENT Vendor--Name (#PCDATA)>
Name is context-independent: <!ELEMENT Vehicle--Price (#PCDATA)>
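The tag names in the semantically named DTD1 appear to encode the bracket notation: a single hyphen stands for `;` between context terms and a double hyphen stands for the `.` before the property name, presumably because `[`, `]`, and `;` are not legal in XML names. The Python sketch below converts between the two forms under that reading. The slides do not state the encoding rule or show how a `,` separator would be written, so both are assumptions here, and the function names are hypothetical.

```python
def tag_to_semantic_name(tag: str) -> str:
    """'Manufacturer-Model--Name' -> '[Manufacturer;Model].Name' (assumed encoding)."""
    context, separator, prop = tag.partition("--")
    terms = context.split("-")
    # Reject empty terms, a dangling '--', and hyphens inside the property name.
    if not all(terms) or (separator and not prop) or "-" in prop:
        raise ValueError(f"not a semantically named tag: {tag!r}")
    name = "[" + ";".join(terms) + "]"
    return name + "." + prop if separator else name


def semantic_name_to_tag(name: str) -> str:
    """'[Vehicle;Option].Name' -> 'Vehicle-Option--Name' (assumed encoding)."""
    if not name.startswith("[") or "]" not in name:
        raise ValueError(f"not a semantic name: {name!r}")
    context, _, rest = name[1:].partition("]")
    tag = "-".join(term.strip() for term in context.split(";"))
    if rest:
        if not rest.startswith(".") or len(rest) == 1:
            raise ValueError(f"not a semantic name: {name!r}")
        tag += "--" + rest[1:]
    return tag


rank_name = tag_to_semantic_name("Manufacturer-Model-NHSCTest--Rank")
option_tag = semantic_name_to_tag("[Vehicle;Option].Name")
```

Because the encoding is reversible, a tool can show users the bracket notation while the document itself only contains legal XML names.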

Semantic Querying • Using semantic tag names introduces a tradeoff between increased semantic description and longer tag names. • Path expressions are difficult to formulate and complicate XML querying. • Since semantic names are structure independent, queries can be posed without using path expressions.

A Context View • A context view is a structure-independent hierarchy of concepts in the XML document. • The hierarchy is constructed automatically from the tag names in the XML document/DTD. • Users query the context view, and their queries are mapped to LOREL queries on the XML documents.

Building the Context View [Figure: context view built from semantic names such as [Vehicle], [Manufacturer], [Manufacturer].Name, [Manufacturer;Model]. Hierarchy shown: Vehicle (Color, Price, Option (Name)); Manufacturer (Name, Model (Name, Year, NHSC Test (Front Rating, Side Rating, Rank))); Vendor (Name)]
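One plausible way to build such a view is to insert each semantic name into a nested dictionary keyed by its context terms, with the property name as a leaf. The slides do not give the construction algorithm, so the Python sketch below is an assumption: it handles only the `;` separator and, for brevity, keeps concepts and properties in one namespace. The names in `NAMES` are the semantic names shown on the DTD slides; `build_context_view` and `render` are hypothetical helpers.

```python
from typing import Dict, Iterable, List

# Semantic names taken from the semantically named DTDs in this deck.
NAMES = [
    "[Manufacturer]", "[Manufacturer].Name", "[Manufacturer;Model]",
    "[Manufacturer;Model].Name", "[Manufacturer;Model].Year",
    "[Manufacturer;Model;NHSCTest].FrontRating",
    "[Manufacturer;Model;NHSCTest].SideRating",
    "[Manufacturer;Model;NHSCTest].Rank",
    "[Vehicle]", "[Vehicle].Color", "[Vehicle].Price",
    "[Vehicle;Option].Name", "[Vendor].Name",
]


def build_context_view(semantic_names: Iterable[str]) -> Dict[str, dict]:
    """Merge semantic names into one hierarchy, ignoring document structure."""
    root: Dict[str, dict] = {}
    for name in semantic_names:
        context, _, prop = name.partition("]")
        terms = [term.strip() for term in context.lstrip("[").split(";")]
        node = root
        for term in terms:
            # Walk or create one level per context term.
            node = node.setdefault(term, {})
        if prop.startswith("."):
            # The property name becomes a leaf under the last context term.
            node.setdefault(prop[1:], {})
    return root


def render(view: Dict[str, dict], indent: int = 0) -> List[str]:
    """Return the view as indented text lines, one concept per line."""
    lines: List[str] = []
    for term, children in view.items():
        lines.append("  " * indent + term)
        lines.extend(render(children, indent + 1))
    return lines


view = build_context_view(NAMES)
outline = "\n".join(render(view))
```

The view depends only on the set of names, so DTD1 and DTD2 would yield the same hierarchy once both use the same semantic names.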

Querying the Context View • Return the manufacturer name and vehicle price for vehicles with price < $30,000 whose model is in the top 10 for safety tests. [Figure: the context view with the query marked on it: Manufacturer Name (return); Vehicle Price < 30000 (return); NHSC Test Rank <= 10]

Mapping to DTD1 [Figure: DTD1 element tree: list-manufacturer > manufacturer > (mn-name, model > (mo-name, year, front-rating, side-rating, rank, vehicle > (color, price, vendorName, option)))]

Semantic Naming in DTD1 [Figure: the DTD1 element tree labeled with semantic names: V, [Manufacturer], [Manufacturer].Name, [Manufacturer;Model], [Manufacturer;Model].Name, [Manufacturer;Model].Year, *(FR), [Manufacturer;Model;NHSCTest].SideRating, [Manufacturer;Model;NHSCTest].Rank, [Vehicle], [Vehicle].Color, [Vehicle].Price, [Vendor].Name, [Vehicle;Option].Name]

Query Mapping to DTD1 [Figure: the semantically named DTD1 tree with the query marked on it: [Manufacturer].Name (return); [Vehicle].Price < 30000 (return); [Manufacturer;Model;NHSCTest].Rank]

Mapping to DTD2 [Figure: DTD2 element tree: list-vendor > vendor > (vendorName, vehicle > (color, price, op-name, mn-name, model > (mo-name, year, front-rating, side-rating, rank)))]

Mapping to DTD2 [Figure: the DTD2 element tree labeled with semantic names: V, [Vendor], [Vendor].Name, [Vehicle], [Vehicle;Option].Name, [Manufacturer].Name, [Vehicle].Color, [Vehicle].Price, [Manufacturer;Model], [Manufacturer;Model].Name, [Manufacturer;Model].Year, [Manufacturer;Model;NHSCTest].FrontRating, [Manufacturer;Model;NHSCTest].SideRating, [Manufacturer;Model;NHSCTest].Rank. Query marked on it: [Manufacturer].Name (return); [Vehicle].Price < 30000 (return); [Manufacturer;Model;NHSCTest].Rank <= 10]

Mapping Algorithm • Perform a breadth-first traversal of DTD x to build a mapping table T. • Each entry in T contains a tag name tn, and a set of path expressions P. Each p in P provides a path in DTD x to element named tn. • If DTD x is a tree, each tn has a unique path. • If DTD x is a graph, there may be multiple possible paths. Can return path union or get user input. • After all path mappings have been determined, build a spanning tree connecting paths. • Unique spanning tree for tree DTDs, may have multiple spanning trees for graph DTDs.
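The traversal step can be sketched in a few lines of Python. The DTD is modeled as a dictionary from each element to its child elements, which is an assumed representation, and the example uses DTD2 with the semantic names from the slides. Skipping a child that already occurs on the current path is an added safeguard so that recursive content models terminate; the slides do not mention it. For a tree-shaped DTD the union of the root-to-element paths already forms the connecting tree, which is what the hypothetical `connecting_edges` helper returns.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Path = Tuple[str, ...]

# DTD2 as parent -> children, labeled with the semantic names from the slides.
DTD2: Dict[str, List[str]] = {
    "V": ["[Vendor]"],
    "[Vendor]": ["[Vendor].Name", "[Vehicle]"],
    "[Vehicle]": ["[Vehicle].Color", "[Vehicle].Price", "[Vehicle;Option].Name",
                  "[Manufacturer].Name", "[Manufacturer;Model]"],
    "[Manufacturer;Model]": ["[Manufacturer;Model].Name", "[Manufacturer;Model].Year",
                             "[Manufacturer;Model;NHSCTest].FrontRating",
                             "[Manufacturer;Model;NHSCTest].SideRating",
                             "[Manufacturer;Model;NHSCTest].Rank"],
}


def build_mapping_table(dtd: Dict[str, List[str]], root: str) -> Dict[str, Set[Path]]:
    """Breadth-first traversal: map each tag name tn to its set of paths P."""
    table: Dict[str, Set[Path]] = {}
    queue = deque([(root,)])
    while queue:
        path = queue.popleft()
        tag = path[-1]
        table.setdefault(tag, set()).add(path)
        for child in dtd.get(tag, []):
            if child not in path:  # assumption: cut cycles in recursive DTDs
                queue.append(path + (child,))
    return table


def connecting_edges(paths: Iterable[Path]) -> Set[Tuple[str, str]]:
    """Union of parent-child edges along the chosen paths."""
    edges: Set[Tuple[str, str]] = set()
    for path in paths:
        edges.update(zip(path, path[1:]))
    return edges


table = build_mapping_table(DTD2, "V")
# Tags touched by the running query; each has exactly one path in a tree DTD.
query_tags = ["[Manufacturer].Name", "[Vehicle].Price", "[Manufacturer;Model;NHSCTest].Rank"]
query_paths = [next(iter(table[tag])) for tag in query_tags]
query_tree = connecting_edges(query_paths)
```

In a graph-shaped DTD an element can be reached along several paths, so `table[tag]` would hold more than one entry and the caller would have to return the union of the paths or ask the user, as the slide notes.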

Conclusions • Naming is important because names for structures are a user’s first contact with structural data representations. • Naming can be exploited to hide the structure by embedding more information into names. • The names assigned to XML elements have been standardized within organizations, but no work has been done on examining what constitutes good names. • By using names that are structure-independent, semantic querying is possible. • Semantic querying does not use path expressions. • Semantic queries support document evolution.

Future Work • Test performance and cost of renaming on real-world XML document sets. • Does the increased XML document size affect query performance? • Develop a formal query algebra for semantic queries.

Further Information • http://www.cs.uiowa.edu/~rlawrenc/ • http://idealab.cs.uiowa.edu
