The Mediation of Information using XML (MIX) project

By: Amir Atauna & Michael Brautbar

What is a Mediator and Why is it Needed? There is a huge quantity of information on the web, and users want to find the information that is related to their problem. Problem: the information is distributed across many sources, and each source provides a different interface and exports its data in a different format.

Mediator systems assist users by providing integrated views of the data they are interested in. Example: a Web-shopping mediator gives the value-shopper a view that lists the lowest price for each product. The goal of MIX is to facilitate the development of such mediators.

Is the mediator concept new? No. The TSIMMIS mediator uses the semistructured Object Exchange Model (OEM). Wrappers export the source data translated to OEM. The mediator exports an integrated view of the wrapper data, based on a view definition provided by the administrator.

The view definition is expressed in the Mediator Specification Language (MSL). At runtime the mediator receives queries, which refer to the view objects and are also expressed in MSL. First, the incoming query is combined with the view definition into a query that refers directly to source data. Then the optimizer finds a plan to execute this query by sending queries to the wrappers and combining their results in the mediator.

The wrappers translate the queries they receive into queries understood by the sources. MSL specifications can be very “loose” about how much information they give on the structure of the data they export. This is a valuable feature when working with dynamic semistructured sources. There are two weak points: - First, the user does not know the structure of the underlying data, and this impedes efforts to formulate reasonable queries.

- Second, the mediator may have incomplete or no information about the metadata and structure of each source, and this leads to a heavy loss of performance. MIX solves these problems with DTDs.

The Philosophy of MIX: The Web as a Distributed Database. The developers of this system strongly believe that the Web will emerge as a distributed database and that XML (or some extension/modification of XML) will be the data model of this huge database. The MIX mediator views XML as a database model and uses the mediator concept as known in the DB area.

Sources will be exporting an XML view of their data along with semantic descriptions of the content (Source DTDs) and descriptions of the interfaces (XML queries) that may be used for accessing the data. Users and applications will then be able to query these view documents using some XML query language. The MIX mediator uses the source DTDs to assist the user in query formulation and the query processors in running queries more efficiently.

MIX evaluates queries lazily (on demand), i.e. XML queries (expressed in XMAS) are unfolded and rewritten at runtime. In the alternative, eager (warehousing) approach, data integration occurs in a separate materialization step, before the actual user queries arrive.
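The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not MIX code: the helper names (make_source, eager_view, lazy_view) are hypothetical, each "source" is a plain function standing in for a wrapper, and the second record is made up.

```python
def make_source(rows, log, name):
    """Hypothetical wrapper stand-in: records every access in `log`."""
    def fetch():
        log.append(name)
        return list(rows)
    return fetch

def eager_view(sources):
    """Warehousing: contact every source once, before any query is asked."""
    warehouse = [rec for fetch in sources for rec in fetch()]
    def query(pred):
        return [rec for rec in warehouse if pred(rec)]
    return query

def lazy_view(sources):
    """On demand: sources are contacted only when a query is consumed."""
    def query(pred):
        for fetch in sources:
            for rec in fetch():
                if pred(rec):
                    yield rec
    return query

log = []
sources = [
    make_source([{"name": "alpine", "population": "13238"}], log, "s1"),
    # Made-up record, present only so that the query has a result.
    make_source([{"name": "example_city", "population": "50000"}], log, "s2"),
]

lazy = lazy_view(sources)
calls_after_lazy_setup = len(log)    # no source has been touched yet
eager = eager_view(sources)
calls_after_eager_setup = len(log)   # both sources were read up front

def is_big(rec):
    return int(rec["population"]) > 30000

lazy_result = list(lazy(is_big))
eager_result = eager(is_big)
```

Both views return the same answer; what changes is when the sources are contacted, which is the trade-off between fresh data and query-time cost.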

Conventional data repositories are not expected to be converted to XML. Instead, wrapper technologies allow us to logically view an information source (which may be a relational database, a collection of HTML pages, or even a legacy information system) as a large XML source. The wrappers are able to translate XMAS queries into queries or commands that the underlying source understands. They are also able to translate the result of the source into XML.
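The "translate the result of the source into XML" half of a wrapper can be sketched as follows, assuming a relational source. This is a minimal illustration using Python's standard sqlite3 and xml.etree.ElementTree modules; the function export_as_xml and the table layout are hypothetical, and the single row reuses values from the neighborhoods example in these slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_as_xml(conn, table, root_tag, row_tag):
    """Hypothetical wrapper step: expose one relational table as an XML tree.

    `table` is assumed to be a trusted identifier (it is interpolated into SQL).
    """
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [desc[0] for desc in cur.description]
    root = ET.Element(root_tag)
    for row in cur:
        elem = ET.SubElement(root, row_tag)
        for col, value in zip(columns, row):
            # One subelement per column, holding the value as text.
            ET.SubElement(elem, col).text = str(value)
    return root

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE neighborhood (zip TEXT, name TEXT, population INTEGER)")
conn.execute("INSERT INTO neighborhood VALUES ('91901', 'alpine', 13238)")

xml_view = export_as_xml(conn, "neighborhood", "neighborhoods", "neighborhood")
xml_text = ET.tostring(xml_view, encoding="unicode")
```

The other half, translating an incoming XMAS query into SQL, is not shown here.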

Creating Mediated Views Using the MIX Mediator and Querying Them with BBQ. The XML documents have to be integrated. One goal of MIX is to make the development of integrated views fast. For this, the developers use XMAS as the view definition language.

The BBQ (Blended Browsing and Querying) user interface lets users formulate XMAS queries using a GUI reminiscent of query-by-example interfaces in relational databases.

The MIX Architecture

The graphical user interface BBQ allows the construction of queries. In order to accomplish the integration, the MIX mediator comprises several modules.
- Its main inputs are XMAS queries generated by BBQ, and the mediator view definition (also in XMAS) for the integrated view.
- The resolution module resolves the user query with the mediator view definition, resulting in a set of unfolded XML queries that refer to the wrapper views.
- The simplification module is used to further simplify the XML queries based on the underlying XML DTDs.
- The DTD inference module can be used to automatically derive view DTDs from source DTDs and queries, supporting the integration task of the mediation engineer (this is done off-line).
- The translation module maps the simplified queries into the XMAS algebra.
- The optimization module can be used to further optimize the XMAS queries.
- The execution engine issues XMAS queries against the wrappers, and returns the requested XML data to the user after integrating the retrieved data according to the mediator view.
The wrappers are used to export data in a uniform format to the mediator.

The XMAS Language
• The data sources of the MIX mediator are valid XML docs
• We need a way to formulate queries that can relate to data in multiple XML docs
• An XML document may be tightly structured, as in a relational database, or have no structure at all

The XMAS Language Cont
• So we need a query language that is as strong as relational algebra
• Preferable features of the language:
• Simple formulation of queries
• Logically describes what we want to say

Solution: XMAS
• XMAS stands for XML Matching And Structuring language
• Declarative, high-level language
• Builds upon ideas of languages like XML-QL and MSL

• Body (the “where” clause): specifies the data which is to be extracted from the XML sources
• Head (the “construct” clause): describes how the extracted data is arranged into a new answer XML document. In this part we may use the “collection” operator and the “ordering” operator (explained later on)
• (The head and body roughly resemble the SELECT and WHERE clauses in SQL)

• Predicate: defines conditions on the variables occurring in the sources
• Let's look at an example
<!ELEMENT neighborhoods (neighborhood)*>
<!ELEMENT neighborhood (zip, name, type, population)>
<!ELEMENT zip (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT population (#PCDATA)>

For Example We Can Have The Following XML Doc For That DTD
<neighborhoods>
  <neighborhood>
    <zip>91901</zip>
    <name>alpine</name>
    <type>rural/town</type>
    <population>13238</population>
  </neighborhood>
  <neighborhood>
    <zip>91903</zip>
    <name>alpine</name>
    <type>rural/town</type>
    <population>4783</population>
  </neighborhood>
  …

Query Example
• Suppose we want to retrieve the names of all “big” neighborhoods, say those whose population is greater than 30000
• In XMAS we can write the following query:

CONSTRUCT
  <big_neighborhoods>
    <big_neighborhood>
      <name>$N</>
    </> {$N}
  </>
WHERE
  <neighborhoods>
    <neighborhood>
      <name>$N</>
      <population>$P</>
    </>
  </>
  IN "http://www.npaci.edu/DICE/MIX/tutorial/neighborhoods.xml"
  AND $P > 30000

How Does It Work
• Let's look at the body of the query above. Its tree pattern mimics the tree structure of the input XML document
• The variables $N and $P are used to “get a hold” of the data at the corresponding locations in the tree structure representing the input XML doc. In other words, the tree pattern specifies that the root element of the XML doc is of type neighborhoods

• Within neighborhoods there must be some neighborhood subelement, which itself contains name and population subelements
• In this way, the tree pattern specifies a list of pairs of variable bindings for $N and $P
• From this list we want to select only those which satisfy the condition $P > 30000
• To summarize, the body defines a list [(n1; p1); ...; (nk; pk)] of all variable bindings for ($N,$P) which match (or satisfy) the body

• The “head” consists of an XML tree pattern which contains some or all of the variables of the body
• In the example above, the head defines a root element big_neighborhoods with a big_neighborhood subelement, which in turn has a name subelement. The latter is used to hold the bindings for $N which have been obtained through the body
• Using {$N} expresses that we want to have only one big_neighborhoods element that has a number of big_neighborhood subelements (one for each name $N obtained from the body)
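The body/head split described above can be imitated in Python with the standard xml.etree.ElementTree module. This is a rough sketch of the semantics only, not how the MIX engine evaluates XMAS. The function names body and head are hypothetical, and the third neighborhood (metropolis, zip 00000, population 45000) is made up so that the predicate selects something.

```python
import xml.etree.ElementTree as ET

SOURCE = """<neighborhoods>
  <neighborhood><zip>91901</zip><name>alpine</name>
    <type>rural/town</type><population>13238</population></neighborhood>
  <neighborhood><zip>91903</zip><name>alpine</name>
    <type>rural/town</type><population>4783</population></neighborhood>
  <!-- made-up entry, not from the slides -->
  <neighborhood><zip>00000</zip><name>metropolis</name>
    <type>urban</type><population>45000</population></neighborhood>
</neighborhoods>"""

def body(root):
    """WHERE part: match the tree pattern, then keep bindings with $P > 30000."""
    bindings = []
    if root.tag != "neighborhoods":      # the pattern fixes the root element type
        return bindings
    for nb in root.findall("neighborhood"):
        name = nb.findtext("name")
        population = nb.findtext("population")
        if name is None or population is None:
            continue                     # pattern does not match this element
        bindings.append((name, int(population)))   # one ($N, $P) pair
    return [(n, p) for n, p in bindings if p > 30000]

def head(bindings):
    """CONSTRUCT part: one root, one big_neighborhood per binding of $N."""
    root = ET.Element("big_neighborhoods")
    for n, _p in bindings:
        big = ET.SubElement(root, "big_neighborhood")
        ET.SubElement(big, "name").text = n
    return root

bindings = body(ET.fromstring(SOURCE))
answer_text = ET.tostring(head(bindings), encoding="unicode")
```

Note that $P is used in the body but dropped in the head, which is how projection shows up in this style of query.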

The Collection Operator
• Is used to collect all bindings of a subelement and put them under the parent element
• Has two kinds: implicit and explicit
• The usage for the explicit version is {$N}, where $N is a free variable at that level
• For an example of the explicit usage, consider the previous query

The Collection Operator Cont
• We create exactly one big_neighborhood element for each binding n1; ...; nk of $N (thereby binding the value of $N within the big_neighborhood element to one ni), and all these elements are collected as subelements of the parent element

The Collection Operator Cont
• For elements in the head which do not have an explicit collection label, an implicit collection label may be used
• The implicit collection variables of an element E are those which are free in E
• The usage for the implicit version is [ ... ], where ‘[’ comes before the beginning of the section and ‘]’ is at its end

The Collection Operator Cont
• For example, consider the following code
<answer>
  [<a> $A
    [<b> $B
      [<c> $C </c>]
    </b>]
  </a>]
</answer>
• The above corresponds to a nested loop structure
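One way to read the nested brackets as loops is sketched below in Python. The assumption made here (not stated in the slides) is that each level groups by the distinct values of its variable, in order of first appearance; the function names and the ($A, $B, $C) bindings are invented for illustration.

```python
import xml.etree.ElementTree as ET

def distinct(values):
    """De-duplicate while preserving first-seen order."""
    return list(dict.fromkeys(values))

def construct(bindings):
    """Build <answer> from ($A, $B, $C) bindings with three nested loops."""
    answer = ET.Element("answer")
    for a in distinct(t[0] for t in bindings):
        a_el = ET.SubElement(answer, "a")
        a_el.text = a
        for b in distinct(t[1] for t in bindings if t[0] == a):
            b_el = ET.SubElement(a_el, "b")
            b_el.text = b
            for c in distinct(t[2] for t in bindings if t[:2] == (a, b)):
                ET.SubElement(b_el, "c").text = c
    return answer

# Invented bindings for ($A, $B, $C).
bindings = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a1", "b2", "c3"),
    ("a2", "b3", "c4"),
]
answer = construct(bindings)
```

Each inner loop only ranges over bindings compatible with the outer variables, which is what makes the result nested rather than flat.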

The Ordering Operator
• The bindings of all subelements may be sorted by a given order
• If no order is specified, a default order is used (based on the order in which the data was found)
• Example: consider the next DTD and the query that follows it

<!ELEMENT home EMPTY>
<!ATTLIST home zip CDATA #REQUIRED
               price CDATA #REQUIRED>
• And the query is:
CONSTRUCT
  <answer>
    <homes> {$H} order by $H.price </homes>
  </answer>
WHERE
  <home> $H </> IN "http://www.Mine.Xml"
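The effect of the ordering operator can be imitated in Python by sorting the bindings before constructing the answer. The sketch assumes that each home element carries a numeric price attribute, as the "order by" expression in the query suggests; the wrapper element homes_source and the three homes are made up.

```python
import xml.etree.ElementTree as ET

# Made-up source document; only the zip and price attributes matter here.
SOURCE = """<homes_source>
  <home zip="91901" price="310000"/>
  <home zip="91903" price="255000"/>
  <home zip="91901" price="349000"/>
</homes_source>"""

def construct_ordered(root):
    """Collect every home binding, ordered by its price attribute."""
    answer = ET.Element("answer")
    homes = ET.SubElement(answer, "homes")
    bound = root.findall("home")                       # bindings of $H
    for h in sorted(bound, key=lambda e: int(e.get("price"))):
        ET.SubElement(homes, "home", attrib=dict(h.attrib))
    return answer

answer = construct_ordered(ET.fromstring(SOURCE))
prices = [h.get("price") for h in answer.find("homes")]
```

Without the sorted call, the homes would come out in document order, which corresponds to the default order mentioned above.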

So, Mmm, Is XMAS So Powerful?
• Home buyer's scenario: a user wants to buy a home and wants to make use of information available on the web to guide this decision. A possible query that the user may issue is: "Find all houses with 3 bedrooms, 2 baths, interior area at least 1600 sq. ft., priced between $250k and $350k, in regions where the school rating is at least 70 (out of 100) and the crime rate is no more than 15 incidents per year. Group the answers by region and order them by price. For each home also show the nearby schools."

Strong As Relational Algebra
• As mentioned before, one of the features of XMAS is that it is as expressive as relational algebra. Some examples of this:
• Selection: selection on a variable is made in the ‘predicate’ part of the query
• Projection: write in the head just those variables that you want to project

• A natural join can be obtained by equating variables in the body
• Cartesian product may also be expressed easily

CONSTRUCT
  <neighborhoods_med>
    <neighborhood_med> $N $S </> {$N, $S}
  </>
WHERE
  <neighborhoods>
    $N: <neighborhood> <zip>$Z</> </>
  </>
  IN "http://www.npaci.edu/DICE/MIX/tutorial/neighborhoods.xml"
  AND
  <schools>
    $S: <school> <zip>$Z1</> </>
  </>
  IN "http://www.npaci.edu/DICE/MIX/tutorial/schools.xml"
  AND $Z=$Z1

Cartesian product is easily expressed by removing the condition $Z=$Z1
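The join and its Cartesian-product variant can be mirrored in Python as a pair of nested loops over the two sources. This is an illustrative sketch only: the function combine and its require_equal_zip flag are hypothetical, the two neighborhoods reuse zip codes from the earlier example, and the three schools are made up.

```python
import copy
import xml.etree.ElementTree as ET

NEIGHBORHOODS = ET.fromstring(
    "<neighborhoods>"
    "<neighborhood><zip>91901</zip><name>alpine</name></neighborhood>"
    "<neighborhood><zip>91903</zip><name>alpine</name></neighborhood>"
    "</neighborhoods>"
)
# Made-up schools document.
SCHOOLS = ET.fromstring(
    "<schools>"
    "<school><zip>91901</zip><name>school_a</name></school>"
    "<school><zip>91901</zip><name>school_b</name></school>"
    "<school><zip>99999</zip><name>school_c</name></school>"
    "</schools>"
)

def combine(neighborhoods, schools, require_equal_zip=True):
    """Pair every $N with every $S; optionally keep only pairs with $Z = $Z1."""
    result = ET.Element("neighborhoods_med")
    for n in neighborhoods.findall("neighborhood"):
        for s in schools.findall("school"):
            z, z1 = n.findtext("zip"), s.findtext("zip")
            if require_equal_zip and (z is None or z != z1):
                continue
            pair = ET.SubElement(result, "neighborhood_med")
            pair.append(copy.deepcopy(n))
            pair.append(copy.deepcopy(s))
    return result

joined = combine(NEIGHBORHOODS, SCHOOLS)                             # join on zip
product = combine(NEIGHBORHOODS, SCHOOLS, require_equal_zip=False)   # no condition
```

Dropping the zip condition turns the two matching pairs into all six combinations, which is exactly the change from join to Cartesian product.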

Merry XMAS

DTD Inference

The MIX mediator and the advantages of living with DTD-provided structure. The MIX mediator employs DTDs to assist the user in information discovery and query formulation, and to allow the query processor to derive more efficient plans. The view DTD inference module derives the view DTD, given the source DTDs and the view.

The view DTD is passed to the DTD-based query interface to enable query formulation. A DTD inference algorithm was developed for a limited class of XMAS queries/views: pick-element XMAS queries, i.e., queries whose SELECT clause has a single variable, called the pick-variable, that binds to elements, and whose WHERE clause consists of a single condition that is applied to only one source.

It is easy to compute a loose DTD for a view, but it is critical for the query interface and the query processor to get one that describes the view as precisely as possible.

Also, “precise” view DTDs may have applications other than ours; for example, they may be used in a toolkit for generating XSL style sheets for presentation of the view. A criterion for judging the precision of a view DTD is tightness. A DTD d1 is tighter than a DTD d2 if every document described by d1 is also described by d2. The tightness criterion can be a benchmark for other powerful view definition languages and view inference algorithms.
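The tightness definition is a containment test, which can be sketched for a single content model in Python. This is a deliberate simplification and not the MIX inference algorithm: a content model is written as a regular expression over one-letter child tags (z for zip, n for name), and containment is only checked for child sequences up to a bounded length. All names below are hypothetical.

```python
import re
from itertools import product

def sequences(alphabet, max_len):
    """Every child-tag sequence over `alphabet` with length 0..max_len."""
    for length in range(max_len + 1):
        for seq in product(alphabet, repeat=length):
            yield "".join(seq)

def tighter_up_to(model1, model2, alphabet, max_len):
    """Bounded check: everything model1 accepts is also accepted by model2."""
    return all(
        re.fullmatch(model2, seq) is not None
        for seq in sequences(alphabet, max_len)
        if re.fullmatch(model1, seq) is not None
    )

# (zip, name) versus (zip | name)*, encoded with one letter per tag.
TIGHT = "zn"
LOOSE = "(z|n)*"

tight_in_loose = tighter_up_to(TIGHT, LOOSE, "zn", 4)
loose_in_tight = tighter_up_to(LOOSE, TIGHT, "zn", 4)
```

The loose model accepts the empty sequence, which the tight one rejects, so containment holds in one direction only. A real comparison of whole DTDs would have to consider every element declaration and unbounded sequences.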

So the view DTD inference algorithm attempts to derive the tightest DTD that contains all the possible documents that may appear as the content of the view. Unfortunately, even the tightest view DTD can describe structures that never appear as the view's content. For this reason the view DTD inference algorithm derives an extended form of DTDs, known as specialized DTDs, that typically does not have such non-tightness problems.

Model and Query Language Framework. The focus is on XML documents that meet the following requirements:
- The XML is always valid, i.e. it has a DTD.
- There are no attributes other than the ID attribute, and all elements have an ID attribute.
- There are no empty elements, but elements with empty content are allowed.
- Mixed-content elements, i.e. elements whose content mixes strings with elements, are not allowed.
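Two of these requirements can be checked mechanically on a parsed document, as in the Python sketch below. It assumes the ID attribute is literally named id, which the slides do not state, and it does not validate against a DTD or test the empty-element rule. The function name violations and both sample documents are made up.

```python
import xml.etree.ElementTree as ET

def violations(root):
    """Report elements that break the attribute rule or the mixed-content rule."""
    problems = []
    for el in root.iter():
        # Assumption: the single allowed attribute is called "id".
        if set(el.attrib) != {"id"}:
            problems.append(f"{el.tag}: attributes must be exactly 'id'")
        has_children = len(el) > 0
        has_text = bool((el.text or "").strip()) or any(
            (child.tail or "").strip() for child in el
        )
        if has_children and has_text:
            problems.append(f"{el.tag}: mixed content")
    return problems

GOOD = '<a id="1"><b id="2">text</b></a>'
BAD = '<a id="1">loose text<b id="2" x="y">text</b></a>'

good_problems = violations(ET.fromstring(GOOD))
bad_problems = violations(ET.fromstring(BAD))
```

In the second document the a element mixes text with a child element, and b carries an extra attribute, so both rules fire once.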
