Crossing the Structure Chasm

Alon Halevy, University of Washington, Seattle. UBC, January 15, 2004

The Structure Chasm [table comparing unstructured and structured data] • Authoring: writing text vs. creating a schema • Querying: keywords vs. using someone else’s schema • Data sharing: easy vs. committees, standards • But we can pose complex queries

Why is This a Problem? • Databases used to be isolated and administered only by experts. • Today’s applications call for large-scale data sharing: • Big science (bio-medicine, astrophysics, …) • Government agencies • Large corporations • The web (over 100,000 searchable data sources) • The vision: • Content authoring by anyone, anywhere • Powerful database-style querying • Use relevant data from anywhere to answer the query • The Semantic Web • Fundamental problem: reconciling different models of the world.

Outline • Other benefits of structure: • (Semantic) email • Personal data management • A tour of recent data sharing architectures • Data integration systems • Peer-data management systems • The algorithmic problems: • Query reformulation • Reconciling semantic heterogeneity • What can we do with a large corpus of schemas?

Adding Structure to Email • Email is often used for lightweight data management tasks: • Organizing a PC meeting + dinner. • Arranging a ‘balanced’ potluck • Giving away opera tickets • Announcing an event and associated reminders. • Some specialized tools/services: • Outlook scheduling, evite.com • Can we delegate some email tasks easily?

Semantic Email Processes [diagram: Originator, Process, Database, Recipients] • Originator: “Start a potluck process” • Question sent to recipients: “What will you bring?” • Replies: “I’ll bring a dessert”, “I’ll bring an appetizer”, “I’ll bring an entree” • Database (email, bringing): john@cs Dessert; mary@ee Appetizer; jayant@u Dessert; jane@cs Entree • Constraints check: OK, or STOP with “Too many desserts. Appetizer or entrée?” • Final message: “Here is what everyone is bringing…”
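The potluck process above can be sketched in a few lines. This is only an illustration of the constraint check, not the Semantic Email implementation: the function name `handle_reply`, the dish list, and the limit of two guests per dish type are all assumptions made for the example.

```python
from collections import Counter

# Assumed for illustration: the dish types and the balance limit are not from the talk.
DISHES = ("Appetizer", "Entree", "Dessert")
MAX_PER_DISH = 2


def handle_reply(db, sender, dish, max_per_dish=MAX_PER_DISH):
    """Record a reply in the process database if the potluck stays balanced.

    Returns "OK" when the reply is accepted, otherwise a message asking the
    sender to pick one of the other dish types.
    """
    counts = Counter(db.values())
    if counts[dish] >= max_per_dish:
        others = [d for d in DISHES if d != dish]
        return f"Too many {dish.lower()}s. {' or '.join(others)}?"
    db[sender] = dish
    return "OK"


db = {}
for sender, dish in [("john@cs", "Dessert"), ("mary@ee", "Appetizer"),
                     ("jayant@u", "Dessert"), ("jane@cs", "Dessert"),
                     ("jane@cs", "Entree")]:
    print(sender, dish, "->", handle_reply(db, sender, dish))
print("Here is what everyone is bringing:", db)
```

The third dessert offer is bounced back to the sender, and the database only ever holds replies that satisfy the constraint.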

Semantic Email [Etzioni, McDowell, (Ha)Levy] • Creating the structure? • We’ll help with template interfaces • Incorporating additional knowledge? • I always bring desserts • I don’t schedule morning meetings • Another data sharing challenge. • But it’s free (and cross platform): www.cs.washington.edu/research/semweb

Personal Data Management [Semex: Sigurdsson, Nemes, H.] • Data is organized by application: Mail & calendar, HTML, Files, Presentations, Papers • [diagram labels: Person, Message, Document, Paper, Presentation, Event, Web Page, Homepage, Cached, Softcopy, Author, Sender, Recipients, Organizer, Participants, Cites]

Finding Publications Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa Person: A. Halevy Person: Dan Suciu Person: Maya Rodrig Person: Steven Gribble Person: Zachary Ives

Following Associations (1) [screenshot: Publication, Bernstein]

Following Associations (2) [screenshot: Publication, Bernstein] • “A survey of approaches to automatic schema matching” • “Corpus-based schema matching” • “Database management for peer-to-peer computing: A vision” • “Matching schemas by learning from others”

Following Associations (3) [screenshot: Publication, Bernstein; Cited by, Citations]

Following Associations (4) [screenshot: Publication, Bernstein; Cited Authors]

Structure for Personal Data • High-level concepts are given, but later we can: • extend and personalize the concept hierarchy, • share (parts of) our data with others, • incorporate external data into our view. • Concepts are populated automatically with instances • Need instance-level reconciliation: • Alon Halevy, A. Halevy, Alon Y. Levy (same guy!)
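As a rough illustration of instance-level reconciliation, the sketch below compares names by first initial and surname. It is not how Semex reconciles references; the helper names and the surname alias table (standing in for external knowledge that "Levy" and "Halevy" refer to the same person) are assumptions for the example.

```python
def name_key(name):
    """Reduce a name to (first initial, surname), ignoring case and periods."""
    parts = name.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())


def same_person(a, b, surname_aliases=frozenset()):
    """Guess whether two name strings refer to the same person.

    surname_aliases is a hypothetical set of frozenset pairs of surnames
    known (from some other evidence) to be interchangeable.
    """
    (initial_a, last_a), (initial_b, last_b) = name_key(a), name_key(b)
    same_surname = last_a == last_b or frozenset((last_a, last_b)) in surname_aliases
    return initial_a == initial_b and same_surname


ALIASES = {frozenset(("halevy", "levy"))}  # assumed external knowledge

print(same_person("Alon Halevy", "A. Halevy"))              # names alone suffice
print(same_person("Alon Halevy", "Alon Y. Levy"))           # names alone fail
print(same_person("Alon Halevy", "Alon Y. Levy", ALIASES))  # extra evidence needed
```

The third name on the slide shows why string similarity alone is not enough: some additional evidence has to link the two surnames.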

Data Integration • Goal: provide a uniform interface to a set of autonomous data sources. • First step towards data sharing. • Many research projects (DB & AI) • Mine: Information Manifold, Tukwila, LSD • Recent industry: • Startups: Nimble, Enosys, Composite, MetaMatrix • Products from big players: BEA, IBM

Relational DBMS Refresher • Schema: the template for data. [example tables: Students, Takes, Courses] • Queries: SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
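The query on this slide can be tried with Python's built-in `sqlite3` module. The column names come from the query itself; the rows are made up for the example, since the slide's table contents did not survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (ssn TEXT PRIMARY KEY, name TEXT);
CREATE TABLE Courses  (cid TEXT PRIMARY KEY, name TEXT);
CREATE TABLE Takes    (ssn TEXT, cid TEXT);
-- Sample rows are assumptions for illustration only.
INSERT INTO Students VALUES ('111', 'Mary'), ('222', 'John');
INSERT INTO Courses  VALUES ('CS101', 'Databases'), ('CS102', 'AI');
INSERT INTO Takes    VALUES ('111', 'CS101'), ('222', 'CS102');
""")

# The query from the slide: names of the courses Mary takes.
rows = conn.execute("""
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
""").fetchall()
print(rows)
```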

Data Integration: Higher-level Abstraction [diagram: query Q posed on the Mediated Schema; Semantic mappings to the sources; source queries Q1, Q2, Q3]

Mediated Schema [www.biomediator.org; Tarczy-Hornoch, Mork] • Mediated schema elements: Entity, Sequenceable Entity, Structured Vocabulary, Experiment, Phenotype, Gene, Nucleotide Sequence, Microarray Experiment, Protein • Sources: OMIM, HUGO, Swiss-Prot, GO, GeneClinics, LocusLink, Entrez, GEO • Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?

Semantic Mappings • Inventory Database A: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords) • Inventory Database B: Books(Title, ISBN, Price, DiscountPrice, Edition); Authors(ISBN, FirstName, LastName); BookCategories(ISBN, Category); CDs(Album, ASIN, Price, DiscountPrice, Studio); CDCategories(ASIN, Category); Artists(ASIN, ArtistName, GroupName) • Differences in: • Names in schema • Attribute grouping • Coverage of databases • Granularity and format of attributes

Issues for Semantic Mappings • Formalism for mappings • Reformulation algorithms • How will we create them? [diagram: query Q on the Mediated Schema reformulated via Semantic mappings into source queries Q’]

Beyond Data Integration • Mediated schema is a bottleneck for large-scale data sharing • It’s hard to create, maintain, and agree upon.

Peer Data Management Systems • Mappings specified locally • Map to most convenient nodes • Queries answered by traversing semantic paths. • Piazza: [Tatarinov, H., Ives, Suciu, Mork] [diagram: peers CiteSeer, Stanford, UW, DBLP, Waterloo, UBC, Toronto; queries Q, Q1 to Q6]

PDMS-Related Projects • Hyperion (Toronto) • PeerDB (Singapore) • Local relational models (Trento) • Edutella (Hannover, Germany) • Semantic Gossiping (EPFL, Lausanne) • Raccoon (UC Irvine) • Orchestra (Ives, U. Penn)

A Few Comments about Commerce • Until 5 years ago: • Data integration = Data warehousing. • Since then: • A wave of startups: • Nimble, MetaMatrix, Calixa, Composite, Enosys • Big guys made announcements (IBM, BEA). • [Delay] Big guys released products. • Success: analysts have new buzzword – EII • New addition to acronym soup (with EAI). • Lessons: • Performance was fine. Need management tools.

Data Integration: Before [diagram: query Q on the Mediated Schema; reformulated queries Q’ sent to five Sources]

Data Integration: After [Nimble architecture diagram] • Front-End: Lens Builder™, User Applications, Lens™, File InfoBrowser™, Software Developers Kit, NIMBLE™ APIs • Integration Layer: Nimble Integration Engine™ (Compiler, Executor), Metadata Server, Cache, Security Tools, Management Tools, Common XML View • Integration Builder, Concordance Developer, Data Administrator • XML Query over sources: XML, Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages

Sound Business Models • Explosion of intranet and extranet information • 80% of corporate information is unmanaged • By 2004 30X more enterprise data than 1999 • The average company: • maintains 49 distinct enterprise applications • spends 35% of total IT budget on integration-related efforts Source: Gartner, 1999

Languages for Schema Mapping • GAV • LAV • GLAV [diagram: query Q on the Mediated Schema; reformulated queries Q’ sent to five Sources]

Local-as-View (LAV) • Mediated schema: Book: ISBN, Title, Genre, Year; Author: ISBN, Name • Sources R1, R2, R3, R4, R5, each described as a view over the mediated schema: • Books before 1970: R1(x,y,n) :- Book(x,y,z,t), Author(x,n), t < 1970 • Humor books: R5(x,y) :- Book(x,y,"Humor",t)
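To make the LAV descriptions concrete, the sketch below evaluates the two views over a small, made-up instance of the mediated schema. In a real system the mediated schema holds no data and a source may contain only part of what its view describes; the rows and function names here are assumptions for illustration.

```python
# Hypothetical instance of the mediated schema (illustration only).
BOOK = [  # (ISBN, Title, Genre, Year)
    ("b1", "Old Jokes", "Humor", 1955),
    ("b2", "New Jokes", "Humor", 1985),
    ("b3", "Old Tales", "Drama", 1965),
]
AUTHOR = [("b1", "Ann"), ("b2", "Bob"), ("b3", "Cy")]  # (ISBN, Name)


def view_r1(book, author):
    """R1: ISBN, title and author name of books published before 1970."""
    return {(x, y, n)
            for (x, y, _z, t) in book
            for (x2, n) in author
            if x == x2 and t < 1970}


def view_r5(book):
    """R5: ISBN and title of humor books (genre and year are projected away)."""
    return {(x, y) for (x, y, z, _t) in book if z == "Humor"}


print(sorted(view_r1(BOOK, AUTHOR)))
print(sorted(view_r5(BOOK)))
```

Note what each source hides: R1 drops Genre and Year, and R5 drops Genre, Year and the author, which matters once queries are reformulated over these sources.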

Query Reformulation • Mediated schema: Book: ISBN, Title, Genre, Year; Author: ISBN, Name • Sources: R1 (books before 1970), R2, R3, R4, R5 (humor books) • Query: Find authors of humor books • Plan: R1 Join R5

Query Reformulation • Mediated schema: Book: ISBN, Title, Genre, Year; Author: ISBN, Name • Sources: R1 (ISBN, Title, Name), R2, R3, R4, R5 (ISBN, Title) • Query: Find authors of humor books before 1960 • Plan: Can’t do it! (subtle reasons)
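The two reformulation examples can be replayed on toy data. The sketch below runs the plan "R1 Join R5" and then shows one way to see the difficulty with the 1960 query, assuming the sources expose only the attributes listed: two different mediated instances produce identical source contents but different true answers. The data and function names are hypothetical, and the talk's "subtle reasons" may cover more than this.

```python
def r1(book, author):
    """Source R1: (ISBN, Title, Name) for books before 1970."""
    return {(x, y, n) for (x, y, _z, t) in book
            for (x2, n) in author if x == x2 and t < 1970}


def r5(book):
    """Source R5: (ISBN, Title) for humor books."""
    return {(x, y) for (x, y, z, _t) in book if z == "Humor"}


def plan_r1_join_r5(r1_rows, r5_rows):
    """The plan from the slide: join the two sources on ISBN, return author names."""
    return {n for (x, _y, n) in r1_rows for (x5, _y5) in r5_rows if x == x5}


def true_answer(book, author, before):
    """Ground truth over the mediated schema (not computable in practice)."""
    return {n for (x, _y, z, t) in book
            for (x2, n) in author if x == x2 and z == "Humor" and t < before}


AUTHOR = [("b1", "Ann")]
world_a = [("b1", "Old Jokes", "Humor", 1955)]  # published before 1960
world_b = [("b1", "Old Jokes", "Humor", 1965)]  # published after 1960

plan_result = plan_r1_join_r5(r1(world_a, AUTHOR), r5(world_a))
same_sources = (r1(world_a, AUTHOR) == r1(world_b, AUTHOR)
                and r5(world_a) == r5(world_b))
answer_a = true_answer(world_a, AUTHOR, 1960)
answer_b = true_answer(world_b, AUTHOR, 1960)

print(plan_result)   # authors of humor books that R1 also covers
print(same_sources)  # the sources cannot tell the two worlds apart
print(answer_a, answer_b)
```

Because neither source exports Year, no plan over R1 and R5 can distinguish the two worlds, so it cannot return Ann in one and nothing in the other.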

Query Reformulation • Query is posed on mediated schema that contains no data. • Sources are answers to queries (views). • Problem: answering queries using views • (Conceptually) Need to invert query expression. • Traditional databases also use this: • Can you reuse previously cached results?

Answering Queries Using Views • NP-Complete for basic queries [LMSS, PODS 95]. • Results depend on: • Query language used for sources and queries, • Open-world vs. Closed-world assumption • Allowable access patterns to the sources • A lot of beautiful theory!

Theory? • A lot of beautiful theory. “There is in these words the beautiful maneuverability of the abstract, rushing in to replace the intractability of the concrete.” Milan Kundera The Book of Laughter and Forgetting

Practical Query Reformulation • A lot of nice theory. • But also very practical algorithms: • MiniCon [Pottinger and H., 2001]: scales to thousands of sources. • Every commercial DBMS implements some version of answering queries using views. • See [Halevy, 2001] for survey.

Reformulation in PDMS • Can’t follow all paths naively • Pruning techniques [Tatarinov, H.] • Can we pre-compute some paths? • Need to compose mappings • [Madhavan, H., VLDB-2003] [diagram: peers CiteSeer, Stanford, UW, DBLP, Waterloo, UBC, Toronto]
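A heavily simplified sketch of path traversal with mapping composition follows. Real PDMS mappings are query expressions, not attribute renamings, and this is not the Piazza algorithm; the mapping table, attribute names, and the `reformulate` function are assumptions for illustration. Only the peer names are taken from the slide.

```python
from collections import deque

# Hypothetical local mappings: (source peer, target peer) -> attribute renaming.
MAPPINGS = {
    ("UBC", "UW"): {"title": "paper_title", "author": "writer"},
    ("UW", "DBLP"): {"paper_title": "name", "writer": "creator"},
    ("UW", "Stanford"): {"paper_title": "ptitle"},
}


def compose(m1, m2):
    """Compose two renamings; attributes with no image in m2 are dropped."""
    return {a: m2[b] for a, b in m1.items() if b in m2}


def reformulate(start, goal, mappings):
    """Breadth-first search over peers, composing mappings along the path.

    Returns the composed renaming from start to goal, or None if the goal
    is unreachable (or is the start peer itself). The visited set is the
    only pruning: each peer is expanded at most once.
    """
    queue = deque([(start, None)])
    visited = {start}
    while queue:
        peer, mapping = queue.popleft()
        if peer == goal:
            return mapping
        for (src, dst), m in mappings.items():
            if src == peer and dst not in visited:
                visited.add(dst)
                queue.append((dst, m if mapping is None else compose(mapping, m)))
    return None


print(reformulate("UBC", "DBLP", MAPPINGS))
print(reformulate("UBC", "Stanford", MAPPINGS))
print(reformulate("DBLP", "UBC", MAPPINGS))
```

Pre-computing a path, in this toy setting, amounts to storing the result of `compose` for a pair of peers so later queries skip the traversal.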

Open PDMS Research Issues • Managing large networks of mappings: • Consistency • Trust • Improving networks: finding additional mappings • Indexing: • Heterogeneous data across the network • Caching: • Where? What? [diagram: peers CiteSeer, Stanford, UW, DBLP, Waterloo, UBC, Toronto]

Semantic Mappings • Need mappings in every data sharing architecture • “Standards are great, but there are too many.” • Inventory Database A: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords) • Inventory Database B: Books(Title, ISBN, Price, DiscountPrice, Edition); Authors(ISBN, FirstName, LastName); BookCategories(ISBN, Category); CDs(Album, ASIN, Price, DiscountPrice, Studio); CDCategories(ASIN, Category); Artists(ASIN, ArtistName, GroupName)

Why is it so Hard? • Schemas never fully capture their intended meaning: • Schema elements are just symbols. • We need to leverage any additional information we may have. • ‘Theorem’: Schema matching is AI-Complete. • Hence, a human will always be in the loop. • Goal is to improve designer’s productivity. • Solution must be extensible.

Matching Heuristics • Multiple sources of evidence in the schemas: • Schema element names • BooksAndMusic/Categories ~ BookCategories/Category • Descriptions and documentation • ItemID: unique identifier for a book or a CD • ISBN: unique identifier for any book • Data types, data instances • DateTime ≠ Integer • addresses have similar formats • Schema structure • All books have similar attributes • Use domain knowledge • In isolation, techniques are incomplete or brittle: need principled combination. • All these techniques consider only the two schemas.
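The first heuristic, matching on element names, can be sketched with the standard library's `difflib.SequenceMatcher`. This is only one possible string-similarity measure, chosen for the example; the helper names and the candidate list (a few attributes from Inventory Database B) are assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(element, candidates):
    """Return the candidate whose name is most similar to the element's name."""
    return max(candidates, key=lambda c: name_similarity(element, c))


candidates = ["ISBN", "Category", "Title", "Price"]
for element in ["Categories", "Title", "ItemID"]:
    match = best_match(element, candidates)
    print(element, "->", match, round(name_similarity(element, match), 2))
```

Names work well for Categories and Title, but ItemID and ISBN share almost no characters even though the descriptions say both are unique identifiers for books. That is the brittleness the slide refers to, and the reason several kinds of evidence need to be combined.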

Using Past Experience • Matching tasks are often repetitive • Humans improve over time at matching. • A matching system should improve too! • LSD: • Learns to recognize elements of mediated schema. • [Doan, Domingos, H., SIGMOD-01, MLJ-03] • Doan: 2003 ACM Distinguished Dissertation Award. [diagram: mediated schema mapped to data sources]

Example: Matching Real-Estate Sources • Mediated schema: address, price, agent-phone, description • Schema of realestate.com: location, listed-price, phone, comments • realestate.com data: location: Miami, FL; Boston, MA; ... • listed-price: $250,000; $110,000; ... • phone: (305) 729 0831; (617) 253 1429; ... • comments: Fantastic house; Great location; ... • homes.com data: price: $550,000; $320,000; ... • contact-phone: (278) 345 7215; (617) 335 2315; ... • extra-info: Beautiful yard; Great beach; ... • Learned hypotheses: • If “phone” occurs in the name => agent-phone • If “fantastic” & “great” occur frequently in data values => description

Learning Source Descriptions • We learn a classifier for each element of the mediated schema. • Training examples are provided by the given mappings. • Multi-strategy learning: • Base learners: name, instance, description • Combine using stacking. • Accuracy of 70-90% in experiments.
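A toy version of multi-strategy learning, using the real-estate example, is sketched below. It is not LSD: the two base learners are deliberately crude (string similarity on names, and a "shape" signature of the data values), and a fixed equal weighting stands in for stacking, which would learn the weights from training data. All function names are assumptions.

```python
import re
from difflib import SequenceMatcher

# Training source with given mappings: column -> (mediated element, sample values).
TRAINING = {
    "location": ("address", ["Miami, FL", "Boston, MA"]),
    "listed-price": ("price", ["$250,000", "$110,000"]),
    "phone": ("agent-phone", ["(305) 729 0831", "(617) 253 1429"]),
    "comments": ("description", ["Fantastic house", "Great location"]),
}
LABELS = sorted({label for label, _ in TRAINING.values()})


def shape(value):
    """Collapse letter runs to 'a' and digit runs to '9', keeping punctuation."""
    return re.sub(r"\d+", "9", re.sub(r"[A-Za-z]+", "a", value))


def name_learner(column, label):
    """Best name similarity to any training column mapped to this label."""
    names = [c for c, (l, _) in TRAINING.items() if l == label]
    return max(SequenceMatcher(None, column, n).ratio() for n in names)


def instance_learner(values, label):
    """Fraction of values whose shape was seen in training data for this label."""
    known = {shape(v) for _, (l, vals) in TRAINING.items() if l == label for v in vals}
    return sum(shape(v) in known for v in values) / len(values)


def predict(column, values, weights=(0.5, 0.5)):
    """Combine the base learners with fixed weights (a stand-in for stacking)."""
    def score(label):
        return (weights[0] * name_learner(column, label)
                + weights[1] * instance_learner(values, label))
    return max(LABELS, key=score)


# Columns of the new source homes.com, with values from the slide.
print(predict("price", ["$550,000", "$320,000"]))
print(predict("contact-phone", ["(278) 345 7215", "(617) 335 2315"]))
print(predict("extra-info", ["Beautiful yard", "Great beach"]))
```

The extra-info column shows why the combination helps: its name resembles none of the training names, but the instance learner still recognizes its values as free text like the description column.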

Corpus-Based Schema Matching • Can we use previous experience to match two new schemas? • Can a corpus of schemas and matches be a general purpose resource? • Information Retrieval and NLP progressed by using corpora: • Can the same be done for structured data?

Corpus-Based Schema Matching • Can we use previous experience to match two new schemas? • Corpus of Schemas and Matches [diagram labels: Music, Books, Authors, Items, Artists, Publisher Information, Literature, CDs, Categories] • Learn general purpose knowledge: classifier for every corpus element • Multi-strategy learning: Data Instances Learner, Structure Learner, Name Learner, Data Type Learner, Description Learner, Meta Learner • Reuse extracted knowledge to match new schemas

The Corpus vs. Other Matchers

Exploiting Previous Experience

Crossing the Structure Chasm