530 likes | 830 Vues
XML Publishing. XML publishing: overview SilkRoute (AT&T) XPERANTO (IBM Research) Clio (IBM Research and U. Toronto) Schema-directed XML Publishing: ATG (Bell Labs and U. Edinburgh) PRATA: implementation of ATG Extensions of ATG. Magical Publishing Box!. XML!.
E N D
XML Publishing • XML publishing: overview • SilkRoute (AT&T) • XPERANTO (IBM Research) • Clio (IBM Research and U. Toronto) • Schema-directed XML Publishing: • ATG (Bell Labs and U. Edinburgh) • PRATA: implementation of ATG • Extensions of ATG QSX (LN 5)
Magical Publishing Box! XML! XML -- it’s an exchange format Data base Data base HTML Data base! • Most data is stored in pre-existing databases • Need to provide XML wrappers to export data QSX (LN 5)
Web XML XML DB DB XML publishing • An XML view definition language: specifying desired mapping • Efficient implementation of the view language Q: XML view QSX (LN 5)
From relations to XML Views <Actor> <Lname>Viterelli</Lname> <Fname>Joe</Fname> <Movie year=“1999”> Analyze This </Movie> <Movie year=“2001”> See Spot Run </Movie> </Actor> <Actor> <Lname>Winter</Lname> <Fname>Alex</Fname> <Movie year=“1988”> Bill and Ted’s Excellent Adventure </Movie> Actor(aid, lname, fname) : { <001, “Viterelli”, “Joe”>, …} Movie(mid, title, year) : { <011, “Analyze This”, 1999>, <032, “See Spot Run”, 2001> } Appearance(mid, aid) : { <001, 011>, <001, 032> } QSX (LN 5)
Commercial systems -- canonical publishing Canonical publishing: the universal-relation approach • Embedding single SQL query in XSL stylesheet • Result: canonical XML representation of relations Systems: • Oracle 10g XML SQL facilities: SQL/XML, XMLGen • IBM DB2 XML Extender: SQL/XML, DAD • Microsoft SQL Server 2005: FOR-XML, XSD incapable of expressing practical XML publishing: default fixed XML document template QSX (LN 5)
Canonical publishing A.lnameA.fnameM.yearM.title Viterelli Joe 1999 See Spot Run Viterelli Joe 2001 Analyze This IBM DB2 SQL Statement giving big relation <SQL_stmt> SELECT A.lname,A.fname, M.year, M.title FROM Movie M, Actor A, Appearance Ap WHERE M.mid=Ap.mid AND M.aid=A.Aid ORDER BY aid </SQL_stmt> <Actor> <Lname>Viterelli</Lname> <Fname>Joe</Fname> <Movie year=“1999”> Analyze This </Movie> <Movie year=“2001”> See Spot Run </Movie> </Actor> <Actor> <Lname>Winter</Lname> <Fname>Alex</Fname> … +Formatting Template annotated with columns of the universal relation <element_node Actor> <element_node Lname> <text_node> <Column name=“A.lname”/> </text_node> … QSX (LN 5)
? ? ? Middleware Approach View Definition Query-cost Estimates Source Capabilities Query Generator Request SQL Queries Tagger QSX (LN 5)
XPERANTO • Commercial system: IBM DB2 XML extender, SQL/XML • Middleware (vendor-independent): XPERANTO Extending SQL with XML constructors: selectXML-aggregation from R1, . . ., Rn where conditions XML constructors (XML-aggregation): functions • Input: tables and XML trees (forest) • Output: XML tree QSX (LN 5)
Actor (aid, lname, fname) Appearance (mid, aid) Movie (mid, title, year) XML publishing with XPERANTO (SQL/XML) <Actor> <Lname>Viterelli</Lname> <Fname>Joe</Fname> <Movie year=“1999”> . . . </Movie> <Movie year=“2001”> . . . </Movie> </Actor> <Actor> <Lname>Winter</Lname> <Fname>Alex</Fname> … Relational schema: Extended SQL: selectXMLAGG( ACT(lname, fname, select XMLAGG ( MOV(title, year) from Appearance Ap, Movie M where Ap.aid = A.aid and Ap.mid = M.mid group order by A.lname, A. fname )) from Actor A QSX (LN 5)
XML constructors • Actor constructor: create function ACT(lname: str, fname: str, mlist: XML) <Actor> <Lname> {lname} </Lname> <Fname> {fname} </Fname> mlist </Actor> • Movie constructor (mlist) create function Mov(title: str, year: int) <Movie year=“{year}”> {title} </Movie> Verbose and cumbersome • small document: tedious • large documents: unthinkable QSX (LN 5)
root Q3 Q1 Q2 tag1 tag3 tag2 Q4 Q5 tag5 tag6 SilkRoute Annotated template: embedding SQL in a fixed XML tree • Middleware: SilkRoute • Commercial: SQL Server 2005 XSD, IBM DB2 DAD Advantages: • More `modular’ comparing to the universal relation approach • Limited schema-driven: conforming to a fixed doc template QSX (LN 5)
Clio Schema mapping based on inter-schema constraints Given two schemas S1 and S2, we want to translate instances of S1 to instances of S2. Constraint-driven approach: • Start with a set of referential integrity constraints (foreign keys) from S1 to S2 (schema matching) • Derive schema mapping from S1 to S2, by reasoning about the given integrity constraints (chasing technique) Pros and cons: • Generic system, and can be used for XML publishing • Semi-automated • Chasing may not terminates for XML DTD/Schema • Recursive target schema (S2)? unclear QSX (LN 5)
Getting real: data exchange on the Web All members of a community (industry) agree on a DTD and then exchange data w.r.t. it:e-commerce, health-care, ... XML Publishing: • mapping relational data to XML • conforming to the predefined DTD Web DTD XML XML Q: XML view DB1 DB2 QSX (LN 5)
Data exchange: insurance company and hospital Daily report: Relational database R at the hospital: Patient (SSN, name, tname, policy#, date) inTreatment (tname, cost) outTreatment (tname, referral#) Procedure (tname1, tname2) treatment • in hospital: composition hierarchy in Procedure • outside of the hospital: referral# hospital insurance company XML view R XML QSX (LN 5)
Example: insurance company and hospital DTDDpredefined by the insurance company: report patient* patient SSN, pname, treatment, policy# treatment tname, (inTreatment + outTreatment) inTreatment treatment* outTreatment referral# How to define a mapping such that for any instance DB of R, • (DB)isan XML document containingall the patients and their treatments (hierarchy, referral#) from DB, and • (DB) conforms to D? QSX (LN 5)
report ... patient patient patient patient treatment name Policy# inTreatment tname ... “Joe” treatment <-- unbounded treatment ... ... Challenge: recursive type XML data: unbounded depth -- cannot be decided statically treatment tname, (inTreatment + outTreatment) inTreatment treatment* --- recursive SSN “123” QSX (LN 5)
... patient patient patient patient treatment <-- data-dependent tname outTreatment inTreatment ... treatment treatment referral# Challenge: non-determinism The choice of a production (element type definition) treatment tname, (inTreatment + outTreatment) -- depends on the underlining relational data report QSX (LN 5)
Existing systems • fixed XML tree template or ignoring DTD-conformance • middleware: SilkRoute (AT&T), XPERANTO (IBM), … • systems: SQL Server 2005, IBM DB2 XML extender, … incapable of coping with a predefined DTD (e.g. recursion) • type checking: define a view and then check its conformance • undecidable in general, co-NEXPTIME for extremely restricted view definitions • no guidance on how to define XML views that typecheck one gets an XML view that typechecks only after repeated failures and with luck QSX (LN 5)
XML Publishing • XML publishing: overview • SilkRoute (AT&T) • XPERANTO (IBM Research) • Clio (IBM Research and U. Toronto) • Schema-directed XML Publishing: • ATG (Bell Labs and U. Edinburgh) • PRATA: implementation of ATG • Extensions of ATG QSX (LN 5)
ATG = DTD + Semantic rules attributes Attribute Translation Grammar (ATG) • DTD: normalized; element type definitions e ::= PCDATA | | e1, …, en | e1 + … + en | e* • Attributes: $e associated with each element type e $e: tuple-valued, to pass data value as well as control • Rules: associated with each e ; for e’ in , $e’ = Q($e) • SQL query Q extracts data from DB • parent attribute $e as a constant parameter in Q QSX (LN 5)
report ... patient patient patient patient Semantics: conceptual evaluation • Top-down report patient* $patientselect SSN, name, tname, policy from Patient --- SQL query recall Patient (SSN, name, tname, policy#) • Data-driven: a patient element for each tuple in Patient relation $patient $patient $patient $patient QSX (LN 5)
Inherited attributes Inherited: $child is computed using $parent • patient SSN, name, treatment, policy# $SSN = $patient.SSN, $name = $patient.name $treatment = $patient.tname $policy# = $patient.policy recall $patient = (SSN, name, tname, policy) • SSN PCDATA $PCDATA = $SSN report ... patient patient patient patient $patient name treatment policy# $SSN SSN “123” “Joe” “LU23” QSX (LN 5) $PCDATA
Qc($treatment)=1 Qc($treatment)=null Coping with non-determinism treatment tname, (inTreatment + outTreatment) $tname = $treatment ($inTreatment, $outTreatment)= case Qc($treatment).tag --- conditional query 1: ($treatment, null) else: (null, $treatment) Qc: select 1 as tag frominTreatmentwhere tname= $treatment • conditional query: the choice of production • $parent as constant parameter in SQL query ... treatment treatment tname tname QSX (LN 5) inTreatment outTreatment
treatment inTreatment tname ... treatment treatment Coping with recursion inTreatment treatment* $treatment select tname2 from Procedure where $inTreatment = tname1 recall Procedure (tname1, tname2) • $parent as constant parameter in SQL query Q • inTreatment is further expanded as long as Q(DB) is nonempty QSX (LN 5) $treatment $treatment
policy# data-dependent <-- unbounded DTD-directed publishing with ATGs • DTD-directed: the XML tree is constructed strictly following the productions of a DTD --- DTD conformance • Data-driven: the choice of productions and expansion of the XML tree (recursion) depends on relational data report ... patient patient patient patient treatment name SSN inTreatment tname ... “123” “Joe” treatment treatment ... ... QSX (LN 5)
ATGs vs. existing systems • DTD-conformance: • ATGs: provide guidance for how to define DTD-directed publishing • Other systems: based on a fixed tree template • Expressive power: strictly more expressive than others • ATGs: capable of expressing XML views supported by other systems • Other systems: cannot handle recursion/nondeterminism QSX (LN 5)
ATGs vs. Attribute Grammars (AGs) • AGs: • Definition: w.r.t. a CFG • Evaluation: parsea string with the CFG, then evaluate attributes given the parse tree • ATGs: combining DTD and database operations • Definition: w.r.t. an ECFG (DTD) and SQL queries • Evaluation: given DB, extract relevant data from DB with SQL queries to build an XML tree of the DTD It does not make sense to “parse” a database w.r.t. a DTD ATGs are not a mild variation of AGs QSX (LN 5)
XML Publishing • XML publishing: overview • SilkRoute (AT&T) • XPERANTO (IBM Research) • Clio (IBM Research and U. Toronto) • Schema-directed XML Publishing: • ATG (Bell Labs and U. Edinburgh) • PRATA: implementation of ATG • Possible extensions of ATG QSX (LN 5)
PRATA: middleware based on ATGs R ATG graph query plan generation, evaluation relations XML ATG tagging parsing cost estimate query statistics query results • ATG graph: representing the ATG • relations: representing root-leaf paths ---> • tagging: one pass QSX (LN 5)
XML Publishing • XML publishing: overview • SilkRoute (AT&T) • XPERANTO (IBM Research) • Clio (IBM Research and U. Toronto) • Schema-directed XML Publishing: • ATG (Bell Labs and U. Edinburgh) • PRATA: implementation of ATG • Extensions of ATG QSX (LN 5)
constraints Extension: Capturing integrity constraints • XML schema: • Type (DTD) • Integrity constraints: keys, foreign keys • Schema-directed XML publishing: automatically guarantee that the target document both conforms to the type and satisfies the constraints -- in a single framework DTD XML: ATG+ DB QSX (LN 5)
Data integration in XML • multiple, heterogeneous data sources – multi-source queries, query decomposition, object fusion, … • distributed sources: scheduling of query execution • schema-conformance, … query answer schema (D, ) XML query translation integration Integration middleware updates DB4 DB1 DB2 DB3 QSX (LN 5)
DB1 DB2 DB3 DB4 Example: hospital and insurance company patient (SSN, name, policy) visitInfo (SSN, trId, date) billing (trId, price) cover (policy, trId) daily XML report treatment (trId, name) procedure (trId1, trId2) Given a date, for each patient of that day, report: • SSN, name (DB1) • treatments (hierarchy) covered by insurance (DB1, DB3, DB4) • cost of all and onlythose treatments received (DB2) QSX (LN 5)
Predefined schema (D, ): DTD report patient* patient SSN, name, treatments, bill treatments treatment* treatment trId, tname, treatments bill item* item trId, price report date patient patient patient treatments bill SSN name treatment item treatment item . . . trId tname treatments trId price QSX (LN 5)
trId tname Predefined schema (D, ): XML constraints constraints: relative to each patient, • key: each treatment is charged only once patient( item.trId item ) • foreign key: every treatment has a billing record patient ( treatment.trId item.trId ) report date patient patient patient SSN name treatments bill treatment treatment item item . . . QSX (LN 5) treatments trId price
db ... country country ... ... name name province province province province “Holland” “Belgium” name capital name capital “Limburg” inProvince “Limburg” inProvince “Hasselt” “Maastricht” “Limburg” “Limburg” More on XML integrity constraints • absolute: on the entire document key: country.name country • relative: on a subdocument rooted at a country key: country (province.name province) foreign key: country (capital.inProvince province.name) (value inclusion) QSX (LN 5)
Challenge: context dependency bill subtree: all and only the trIds in the treatments subtree • controlled derivation: thebill subtree cannot be started before the treatments subtree is completed. • information passing: downward, upward, sideways report date date patient SSN name treatments SSN bill trIdS trId trId unbounded QSX (LN 5)
Challenges • DTD-conformance: recursive, nondeterministic • integrity constraints: validation during document generation • multi-source queries: a single one involves several databases • context-dependency: not strictly top-down or bottom-up Previous work: • XML publishing: single data source, no constraints, top-down. • XML integration: little support for XML schema-conformance. • XML query languages + type checking: provide no guidance for how to ensure schema-conformance; optimization hard. QSX (LN 5)
(D, ) middleware DB3 DB1 DB2 AIG Middleware for schema-directed integration view definition language optimization techniques • A lightweight language: Attribute Integration Grammar (AIG) • Cost-based optimization in light of context-dependency DTD + constraints semantic attributes = semantic rules QSX (LN 5)
Attribute Integration Grammar (AIG) • DTD: element type definitions e ::= PCDATA | | e1, …, en | e1 + … + en | e* • Attributes: associated with each element type e • Inh(e): inherited from parent/siblings (top-down/sideways) • Syn(e): synthesized from children (bottom-up) Syn(e), Inh(e): tuple or set/bag-valued • Rules: associated with each production e , for e’ in • Inh(child): Inh(e’) = Q(Inh(parent),Syn(sibling) ) • Syn(parent):Syn(e) = U(Syn(children)) -- union Q: multi-source SQL query with parameters • Dependency: e2 must be evaluated before e1 if Inh(e1) = Q( … Syn(e2) … ) -- acyclic graph (DAG) QSX (LN 5)
date AIG semantics: conceptual evaluation • following the dependency ordering; starting from the root report patient* Inh(patient) Q1 (Inh(report)) select Inh(report) as date, p.SSN, p.name, p.policy from DB1: patient p, DB1: visitInfo v where p.SSN = v.SSN and v.date = Inh(report) Recall DB1: patient (SSN, name, policy), visitInfo (SSN, trId, date) • Parameter in a query: Inh(report) as a constant • Data driven: the number of patients depends on Q1 report Inh patient patient patient QSX (LN 5)
Multi-source query patient SSN, name, treatments, bill Inh(SSN) = Inh(patient).SSN, . . . , Inh(treatments) = Q2(v) -- v = Inh(patient) select t.trId, t.tname from DB1: visitInfo i, DB3: cover c, DB4: treatment t where i.SSN = v.SSN and i.date = v.date and t.trId = i.trId and c.trId = i.trId and c.policy = v.policy • a single query uses DB1, DB3 and DB4 • tuple- and set-valued attributes (Inh(SSN), Inh(treatments)) Recall DB1: patient(SSN, name, policy),visitInfo(SSN, trId, date) DB3: cover (policy, trId) DB4: treatment (trId, name),procedure (trId1, trId2) QSX (LN 5)
Inh report patient patient Inh patient SSN treatments Inh Inh bill name Inh Syn Initial top-down pass: context-dependent patient SSN, name, treatments, bill Inh(SSN) = Inh(patient).SSN, Inh(name) = Inh(patient).name, Inh(treatments) = Q2(Inh(patient)) Inh(bill) = Syn(treatments)<-- halt • DTD-directed: generate children following the production • Inh(bill):defined with sibling -- Syn(treatments), dependency ordering: evaluate billafter treatments <- - halt QSX (LN 5)
patient Inh(treatments) treatments . . . treatment treatment treatments Initial top-down pass: recursion treatments treatment* Inh(treatment) Inh(treatments) -- set of (trId, tname) Data driven:treatments expansion depends on Inh(treatments) • empty: expansion terminates, Syn(treatments)is empty; • nonempty: expands. QSX (LN 5)
Syn Leaf step • treatment trId, tname, treatments Inh(trId) = Inh(treatment).trId, . . ., • trId PCDATA Syn(trId) =Inh(trId) treatments Inh: (trId, tname) treatment treatment tname treatments Inh trId QSX (LN 5)
treatments treatment treatment tname treatments trId Syn Syn Syn Syn Bottom-up step: synthesize attributes • treatment trId, tname, treatments • treatments treatment* Syn(treatments) =U Syn(treatment) Processing of an element e: Inh(e) subtree(e) Syn(e) Syn(treatment) = Syn(trId)Syn(treatments) QSX (LN 5)
patient treatments Inh bill trId item item trId Syn Sideways step: controlled derivation • patient SSN, name, treatments, bill Inh(bill) = Syn(treatments) • bill item* Inh(item) Q’(Inh(bill) ) select trId, price from DB2: billing where trId in Inh(bill) -- set membership test DTD-directed: each step of construction follows a production Recall DB2: billing (trId, price) QSX (LN 5)
Constraint compilation Captured with rules on synthesized attributes of patient: • trIdB: bag-valued, collecting trId’s under item • trIdS1: set-valued, collecting trId’s under treatment • trIdS2: set-valued, collecting trId’s under item • key: patient( item.trId item ) unique (Syn(patient).trIdB)-- no duplicates in the bag • foreign key: patient ( treatment.trId item.trId ) subset (Syn(patient).trIdS1, Syn(patient).trIdS2) • compilation: semantic rules and attributes for constraints are automatically generated and evaluated QSX (LN 5)
Advantages of AIG • DTD-directed view definition: automatically ensures conformance to DTD -- recursive, nondeterministic • Constraint compilation: automatically captures integrity constraints in a uniform framework • performance: avoid post-materialization checking • optimization: jointly with query evaluation • exception handling: actions when constraints are violated • Controlled derivation: supports context-dependent generation. Information passing: top-down, bottom-up, sideways • Multi-source queries: optimizer-based decomposition • One sweep: each node is visited at most twice – evaluates its inherited attribute, subtree, then its synthesized attribute QSX (LN 5)
More on Commercial System: MS SQL Server 2005 • Annotated schema (XSD): fixed tree templates • nonrecursive schema • associate elements and attributes with table and column names Given a relational database, XSD populates an XML elements/attributes with corresponding tuples/columns • FOR-XML • An extension of SQL with an FOR-XML construct • Nested FOR-XML to construct XML documents Summary: • incapable of supporting schema-directed publishing • can’t define recursive XML views (bounded recursion depth) QSX (LN 5)