1 / 43

Chapter 11

How to integrate information, which is usually scattered physically ... Aardvark wants to create an integrated database. 1000 dealers do not all use the same ...

ivanbritt
Télécharger la présentation

Chapter 11

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    1. Chapter 11 Information Integration

    Spring 2001 Prof. Sang Ho Lee School of Computing, Soongsil Univ. shlee@computing.soongsil.ac.kr

    2. How to integrate information, which is usually scattered physically

    This is an unavoidable question to all of us. Approaches (homogenous) Distributed DBMS (80s) Federated databases, Multidatabases, remote data access (90s) Data warehouse, mediator (late 90s)

    3. Why Information Integration is Difficult (1)

    Heterogeneous sources Examples (Aardvark Automobile Co.) 1000 dealers Each dealer maintains a database of their cars in stock Aardvark wants to create an integrated database 1000 dealers do not all use the same database schema Dealer 1: Cars(serialNo, model, color, autoTrans, cdPlayer, ) Dealer 2 Autos(serial, model, color) Options(serial, option)

    4. Why Information Integration is Difficult (2)

    Furthermore Data type differences: Serial numbers might be represented by character strings or integers Value differences: The color black might be represented by an integer code, the string BLACK, or the code BL Semantic differences: One dealer distinguish station wagon from minivans, while another doesnt Missing values: A source does not record information that all or most of the other sources provide

    5. Modes of Information Integration

    Federated databases The sources are independent, but one source can call on others to supply information Warehousing Copies of data from several sources are stored in a single database, called a (data) warehouse Mediation A mediator is a software component that supports a virtual database, which the user may query as if it were materialized The mediator stores no data of its own

    6. Federated Database Systems

    A federated database system is a federation of existing databases systems (called local database systems, LDBS) and provides applications with a uniform means of access to data that are managed by more than one of these database systems In theory, local databases should preserve local autonomy

    7. Local Autonomy (1)

    Design autonomy Ability of an LDBS to choose its own design decisions wrt any matter, including data model, query language, constraints, system functions, semantic interpretation of data, Execution autonomy Ability of an LDBS to execute local operations without interference from external operations and to decide the order in which to schedule external operations

    8. Local Autonomy (2)

    Communication autonomy Ability of an LDBS to decide whether and when to communicate with other database systems Association autonomy Ability of an LDBS to decide whether and how much to share its functionality and resources with others. For example, an LDBS may export only part of its database to external users or even disassociate itself from an LDBS for some reasons.

    9. Federated Database Example

    A federated collection of four local databases DB1 DB2 DB2 DB2

    10. Federated Database

    If n databases each need to talk to the n 1 other databases, then we should write n(n 1) pieces of code to support queries between systems This approach is easy to implement in some circumstances !!!

    11. Query Translation Example

    Dealer 1: NeededCars(model, color, autoTrans) Dealer 2: Autos(serial, model, color), Options(serial, option) /* Dealer 1 queries Dealer 2 for needed car For (each tuple (:m, :c, :a) in NeededCars) { if ( :a = TRUE) { /* automatic transmission wanted */ SELECT serial FROM Autos, Options WHERE Autos.serial = Options.serial AND Options.option = autoTrans AND Autos.model = :m AND Autos.color = :c; } else { /* automatic transmission not wanted */ SELECT serial FROM Autos WHERE Autos.model = :m AND Autos.color = :c AND NOT EXISTS ( SELECT * FROM Options WHERE serial = Autos.serial AND option = autoTrans ); } }

    12. Mediators

    A mediator supports a virtual view or collection of view Dont store any data of its own Source 1 Source 2 Wrapper Wrapper Mediator query result result query query query result result query result

    13. Mediator Example (1)

    A view that is a single relation AutosMed(serialNo, model, color, autoTrans, dealer) A query to the mediator SELECT serialNo, model FROM AutoMed WHERE color = red The mediator can forward the same query to each of the two wrappers The translation work can be done by the wrappers alone

    14. Mediator Example (2)

    A suitable translation for Dealer 1 Cars(serialNo, model, color, autoTrans, cdPlayer, ) SELECT serialNo, model FROM Cars WHERE color = red; A suitable translation for Dealer 2 Autos(serial, model, color), Options(serial, option) SELECT serial, model FROM Autos WHERE color = red; Each wrapper returns to the mediator a serialNo-model pairs and serial-model pairs, respectively The mediator can take the union of these sets and return the result to the user

    15. Wrappers in Mediator-Based Systems

    Sources could be DBMSs (in various models), file systems, Web servers, Handles all connection/query-translation problems peculiar to sources Mediator systems require more complex wrappers than do most warehouse systems Techniques Wrapper generator Template-based Filter techniques Etc.

    16. Templates for Query Patterns

    Templates are queries with parameters that represent constants Example SELECT * SELECT serialNo, model, color FROM AutosMed => autoTrans, dealder1 WHERE color = $c FROM Cars WHERE color = $c; In general there would be 2n templates if we have the option of specifying n attributes The number of templates could grow unreasonably large

    17. Wrapper Generators

    Wrapper generator The software that creates the wrapper A table that holds the various query patterns contained in the templates

    18. Filters

    It is not always realistic to write a template for every possible from of query Another approach to supporting more queries is to have the wrapper filter the results of queries

    19. Filters Example

    Suppose the only template we have is the one that finds cars given a color The mediator needs to find blue Gobi model cars SELECT * FROM autosMed WHERE color = blue and model = Gobi A possible way to answer the query Use the template (with $c = blue) Store the result in a temporary relation Select from TempAutos the Gobis

    20. Data Warehousing

    Growing industry since mid 90s Ranges from desktop to huge Lots of buzzwords, hype Slice & dice, rollup, MOLAP, pivot,

    21. Information as a Competitive Weapon

    Organizations have collected large amounts of data. Now it is time to use it to their advantage.

    22. Can You Easily Answer These Questions?

    What are Personnel Services costs across all departments for all funding sources? What are the effects of outsourcing specific services? What is the correlation between expenditures and collection of delinquent taxes? What is the impact on revenues and expenditures of changing the operating hours of the Dept. of Motor Vehicles? What is the economic impact of the small business initiative in our district?

    23. What is a Warehouse (1)

    Collection of diverse data Subject oriented Aimed at executive, decision maker Often a copy of operational data With value-added data (e.g., summaries, history) Integrated Time-varying Non-volatile AND

    24. What is a Warehouse (2)

    Collection of tools Gathering data Cleansing, integrating Querying, reporting, analysis Data mining Monitoring, administering warehouse

    25. Warehouse Architecture

    Metadata

    26. Motivating Examples

    Forecasting Comparing performance of units Monitoring, detecting fraud Visualization

    27. Why a Warehouse

    Two approaches: Query-driven (lazy) Warehouse (eager)

    28. Query-driven approach

    29. Advantages of Query-driven

    No need to copy data Less storage No need to purchase data More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources

    30. Advantages of Warehousing

    High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse Modify, summarizes (store aggregates) Add historical information

    31. OLTP vs. OLAP

    OLTP (On-Line Transaction Processing) Describes processing at operational sites OLAP (On-Line Analytical Processing) Describes processing at warehouse

    32. OLTP vs. OLAP

    OLTP Mostly updates Many small transactions Mb-Tb of data Current snapshot Raw data Clerical users Consistency, recoverability critical Warehouse Mostly reads Queries are long and complex Gb-Tb of data History Summarized, consolidated data Decision-makers, analysts as users

    33. OLAP Example

    The schema for the warehouse Sales(serialNo, date, dealer, price) Autos(serialNo, model, color) Dealers(name, city,state,phone) A typical decision-support query SELECT state, AVG(price) FROM Sales, Dealers WHERE Sales.dealer = Dealers.name AND date >= 1999-01-04 GROUP BY state; Common OLTP query Find the price at which the auto with serial number 123 was sold

    34. Warehouse Models and Operations

    Data models Relations Stars and snowflakes Cubes Operations Slice and dice Roll-up, drill-down Pivoting other

    35. Star Schemas

    Star schema = fact table + dimension tables

    36. Example-1 (1)

    Sales(serialNo, date, dealer, price) Autos(serialNo, model, color) Dealers(name, city, state, phone) Sales is a fact table serialNo, date, dealer are dimensions The one dependent attribute is price, which is what OLAP queries will typically request in an aggregation Autos relation and Dealer relation are dimension tables Attribute serialNo in the fact table is a foreign key, referencing serialNo of dimension table Autos Join between fact table and dimension tables, is frequently done

    37. Example-1 (2)

    A time dimension table Days (day, week, month, year) Since grouping by various time units is frequently desired by analysts It helps to build into the database a notion of time, as if there were a time dimension table such as above

    38. Example-2 (1)

    39. Example-2 (2)

    40. Slicing and Dicing

    Dicing For example, in the time dimension, we might partition (group by clause) according to days, weeks, months, years, or not partition at all Partitioning is also possible for cars and dealers Slicing Through the where clause, a query focuses on partitions along one or more dimensions

    41. Example 1

    A query in which we ask for a slice in one dimension (the date), and dice in two other dimensions (car and dealer) The date is divided into four groups,

    42. More Examples

    SELECT color, SUM(price) FROM Sales NATURAL JOIN Autos WHERE model = Gobi GROUP BY color; This query dices by color and then slices by model SELECT dealer, month, SUM(price) FROM (Sales NATURAL JOIN Autos) JOIN Days on date = day WHERE model = Gobi and color = red GROUP BY color;

    43. How to support cube-structured data for OLAP

    ROLAP, or Relational OLAP Data may be stored in relations with a specialize structure called a star schema MOLAP, or Multidimensional OLAP A specialized structure, the data cube, is used to hold the data

More Related