1 / 14

Data Integration

Data Integration. Helena Galhardas DEI IST (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives). Agenda. Overview of Data Integration. A Problem. Often people build databases in isolation, then want to share their data

dayo
Télécharger la présentation

Data Integration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Integration Helena Galhardas DEI IST (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

  2. Agenda • Overview of Data Integration

  3. A Problem • Often people build databases in isolation, then want to share their data • Different systems within an enterprise • Different information brokers on the Web • Scientific collaborators • Researchers who want to publish their data for others to use • Even with normalization and the same needs, different people will arrive at different schemas • Goal of data integration: tie together different sources, controlled by many people, under a common schema

  4. Motivating example (1) • FullServe: company that provides internet access to homes, but also sells products to support the home computing infrastructure (ex: modems, wireless routers, etc) • FullServe is a predominantly American company and decided to acquire EuroCard, an European company that is mainly a credit card provider, but has recently started leveraging its customer base to enter the internet market

  5. Motivating Example (2) FullServe databases: Employee Database FullTimeEmp(ssn, empId, firstName, middleName, lastName) Hire(empId, hireDate, recruiter) TempEmployees(ssn, hireStart, hireEnd, name, hourlyRate) Training Database Courses(courseID, name, instructor) Enrollments(courseID, empID, date) Services Database Services(packName, textDescription) Customers(name, id, zipCode, streetAdr, phone) Contracts(custID, packName, startDate) Sales Database Products(prodName, prodId) Sales(prodName, customerName, address) Resume Database Interview(interviewDate, name, recruiter, hireDecision, hireDate) CV(name, resume) HelpLine Database Calls(date, agent, custId, text, action)

  6. Motivating Example (3) EuroCard databases: Employee Database Emp(ID, firstNameMiddleInitial, lastName) Hire(ID, hireDate, recruiter) CreditCard Database Customer(CustID, cardNum, expiration, currentBalance) CustDetail(CustID, name, address) Resume Database Interview(ID, date, location, recruiter) CV(name, resume) HelpLine Database Calls(date, agent, custId, description, followup)

  7. Observations • Why data resides in multiple DBs in a company, rather than in a single well-organized DB? • When companies go hrough internal restructuring, they do not always align their DBs • Most DBs are created by a group within the company with a specific need • Not all the information needs in the future are anticipated

  8. Motivating Example (4) Some queries employees or managers in FullServe may want to pose: • The Human Resources Department may want to be able to query for all of its employees whether in the US or in Europe • Require access to 2 databases in the American side and 1 in the European side • There is a single customer support hot-line, where customers can call about any service or product they obtain from the company.When a representative is on the phone with a customer, it´s important to see the entire set of services the customer is getting from FullServe (internet service, credit card or products purchased). Furthermore, it is useful to know that the customer is a big spender on its credit card. • Require access to 2 databases in the US side and 1 in the European side.

  9. Another example: searching for a new job (1)

  10. Another example: searching for a new job (2) • Each form (site) asks for a slighly different set of attributes (ex: keywords describing job, location and job category or employer and job type) • Ideally, would like to have a single web site to pose our queries and have that site integrating data from all relevant sites in the Web,

  11. Goal of data integration • Offer uniform access to a set of data autonomous and heterogeneous data sources: • Querying disparate sources • Large number of sources • Heterogeneous data sources (different systems, diff. Schemas, some structured others unstructured) • Autonomous data sources: we may not have full access to the data or source may not be available all the time.

  12. Why is it hard? • Systems reasons: even with the same HW and all relational sources, the SQL supported is not always the same • Logical reasons: different schemas (e.g. FullServe and EuroCard temporary employees), diff attributes (e.g., ID), diff attribute names for the same knowledge (e.g., text and action), diff. Representations of data (e.g. First-name, last name) also known as semantic heterogeneity • Social and administrative reasons

  13. Practical goals of data integration • Ideally: data integration system access a set of data sources and automatically configures itself to correctly and efficiently answer queries over multiple sources • Actually: • Build tools to reduce the effort required to integrate a set of data sources • Improve the ability of the system to answer queries in uncertain environments • Trade-off: • user effort vs accuracy

  14. Referências • Draft of the book on “Principles of Data Integration” by AnHai Doan, Alon Halevy, Zachary Ives (in preparation). • Slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives) • T. Landers and R. Rosenberg. An overview of multibase. In Proceedings of the Second International Symoposium on Distributed Databases, pages 153–183. North Holland, Amsterdam, 1982. • Gio Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, pages 38–49, March 1992. • Daniela Florescu, Alon Levy, and Alberto Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59–74,September 1998. • Alon Y. Halevy, Naveen Ashish, Dina Bitton, Michael J. Carey, Denise Draper, Jeff Pollock, Arnon Rosenthal, and Vishal Sikka. Enterprise information integration: successes, challenges and controversies. In SIGMOD Conference, pages 778–787, 2005.

More Related