Peer Data-Management Systems: Plumbing for the Semantic Web. Alon Halevy University of Washington Joint work with Anhai Doan, Jayant Madhavan, Phil Bernstein, and Pedro Domingos. Agenda. Elements of the Semantic Web Piazza: a peer data-management system
Peer Data-Management Systems: Plumbing for the Semantic Web. Alon Halevy, University of Washington. Joint work with Anhai Doan, Jayant Madhavan, Phil Bernstein, and Pedro Domingos.
Agenda • Elements of the Semantic Web • Piazza: a peer data-management system • A database guy’s contribution to the semantic web • The key issue: mapping between different models: • Some recent progress and current directions. • The critical issue: crossing the structure chasm. • The talk I’m not giving today: • A critique of the Semantic Web. • Work and thoughts are in progress
The Semantic Web (my view) • Web sites include structural annotations • You can pose meaningful queries on them. • Ontologies provide the semantic glue. • Internal implementation of web sites left open. • Agents perform tasks: • Query one or more web sites • Perform updates (e.g., set schedules) • Coordinate actions • Trust each other (or not). • I.e., agents operating on a gigantic heterogeneous distributed database.
Getting there • Robust infrastructure for querying • Peer data management systems. • Facilitate mapping between different structures. Need tools for: • Locating relevant structures • Easily joining the semantic web. • Get data into structured form • Should we worry about the legacy web?
Agenda • Elements of the Semantic Web (personal view) • Piazza: a peer data-management system • A database guy’s contribution to the semantic web • The key issue: mapping between different models: • Some recent progress and current directions. • The critical issue: crossing the structure chasm.
Piazza: Peer Data-Management Goal: To enable users to share data across local or wide area networks in an ad-hoc, highly dynamic distributed architecture. • Peers can: • Export base data • Provide views on base data • Serve as logical mediators for other peers • Every peer can be both a server and a client. • Peers join and leave the PDMS at will.
Relationship of PDMS to… • P2P overlay networks (the “S” word) • Data integration systems (no central logical mediated schema) • Federated databases (scale, ad-hoc nature) • Distributed databases (no central administration)
Representing Data • A spectrum of possibilities: • Relational tables, some integrity constraints • XML: can encode relational, hierarchical, OO • XQuery – emerging standard query language (SQL for XML) • RDF: “XML on drugs”. • Sees only the logic; ignores other aspects. • DAML+OIL • Full-blown knowledge representation language. • They all have semantics; just different expressive powers. • We keep the data simple. Mappings between data at different peers are more complex.
Piazza Querying • Semantic mappings between peers provide glue:
LH:CritBed(bed, hosp, room, PID, status) H:CritBed(bed, hosp, room) & H:Patient(PID, bed, status)
9DC:SkilledPerson(PID, "Doctor") :- H:Doctor(SID, h, l, s, e)
9DC:SkilledPerson(PID, "EMT") :- H:EMT(SID, h, vid, s, e)
• Query processing phases: • Reformulate a query into queries over stored data. • MiniCon algorithm (++) for answering queries using views. • Extensions in Piazza enable chaining multiple peer mappings. • Find best plan for the query and execute it: • Tukwila data integration engine – an efficient processor for network-bound XML/relational data.
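The reformulation phase above can be pictured as unfolding a query through chained peer mappings until only stored relations remain. The sketch below is a deliberately simplified illustration, not the MiniCon algorithm: it tracks relation names only and ignores variables, joins, and containment. The `H:Doctor` to `LH:Doctor` mapping and the `STORED` set are assumptions added to show chaining; only the two `9DC:SkilledPerson` rules come from the slide.

```python
from typing import Dict, FrozenSet, List, Set

# Each entry reads "head :- body": the head relation can be answered
# from any of the listed body relations (a union of rules).
MAPPINGS: Dict[str, List[str]] = {
    "9DC:SkilledPerson": ["H:Doctor", "H:EMT"],  # from the slide
    "H:Doctor": ["LH:Doctor"],                   # assumed, to show chaining
}

# Relations that some peer actually stores (assumed for illustration).
STORED: Set[str] = {"LH:Doctor", "H:EMT"}


def reformulate(relation: str, seen: FrozenSet[str] = frozenset()) -> List[str]:
    """Unfold a relation through peer mappings down to stored relations."""
    if relation in seen:      # guard against cyclic mappings
        return []
    if relation in STORED:    # stored data needs no further rewriting
        return [relation]
    result: List[str] = []
    for body in MAPPINGS.get(relation, []):
        for stored in reformulate(body, seen | {relation}):
            if stored not in result:
                result.append(stored)
    return result


print(reformulate("9DC:SkilledPerson"))  # ['LH:Doctor', 'H:EMT']
```

A relation with no mapping and no stored data unfolds to an empty list, which corresponds to a query that cannot be answered from the peers currently known.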
Efficiency Issues in Piazza • Intelligent data placement: • We may want to place views over data at key points in the PDMS: • Save work for frequently asked queries. • Increase availability in cases of failures. • Akamai for structured data • A form of automated reformulation. • Large search space of possibilities • Surprising lower bounds on very simple cases [Chirkova et al, VLDB 2001]. • Efficient propagation of updates: • Approach: publish updategrams as first-class citizens.
Additional Piazza Issues • The catalog of data sources • What does a catalog of structured data sources look like? • How can it be browsed by humans? • How do we facilitate joining a PDMS? • How can the catalog be distributed physically? • Systems issues: • Architecture of a Piazza node: what are the components? • Naming issues • Security • Piazza collaborators: Etzioni, Gribble, Ives, Levy, Suciu, Mork, Rodrig, Tatarinov.
Agenda • Elements of the Semantic Web • Piazza: a peer data-management system • A database guy’s contribution to the semantic web • The key issue: mapping between different models: • Some recent progress and current directions. • The critical issue: crossing the structure chasm.
It’s All About the Mappings • It’s not about understanding the data: it’s about understanding each other. • Whenever you see a model for some domain, there is another one hiding around the corner. • Mappings provide semantic relationships between different peers. • Specifying mappings: inherently a human-assisted task. • Goal: make it easy, fast, incremental. • Not a new problem!
Example Semantic Mapping • Mapping between XML DTDs • DTD 1: house (address, contact-info (agent-name, agent-phone), num-baths) • DTD 2: house (location, contact (name, phone), full-baths, half-baths) • Figure legend: 1-1 mapping, non 1-1 mapping
Desiderata from Proposed Solutions • Accuracy, efficiency, ease of use. • Extensible: accommodate in a principled fashion: • User feedback • Domain constraints • General heuristics • “Memory”, knowledge reuse: • System should exploit knowledge from previous matching tasks [LSD]. • Some underlying semantics.
Why Matching is Difficult • Structures represent same entity differently • different names => same entity: • area & address => location • same names => different entities: • area => location or square-feet • Intended semantics is typically subjective! • IBM Almaden Lab = IBM? • Schema, data and rules never fully capture semantics! • not adequately documented, certainly not for machine consumption. • Often hard for humans (committees are formed!)
Learning for Mapping • We started simple: generating semantic mappings between a mediated schema and a large set of data source schemas. • Key idea: generate the first mappings manually, and learn from them to generate the rest. • Technique: multi-strategy learning (extensible!) • L(earning) S(ource) D(escriptions) [SIGMOD 2001]. • Recent and current work: • (simple) Ontology mapping [WWW-02] • Complex mappings [COMAP] • Semantics [Madhavan et al., AAAI-02]
Data Integration (a simple PDMS) • Example query: find houses with four bathrooms priced under $500,000 • The query is posed over the mediated schema, followed by query reformulation and optimization. • Source schemas 1, 2, 3: realestate.com, homeseekers.com, homes.com • Applications: WWW, enterprises, science projects • Techniques: virtual data integration, warehousing, custom code.
Learning from the Manual Mappings
• Mediated schema: price, agent-name, agent-phone, office-phone, description
• Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
• realestate.com data:
listed-price | contact-name | contact-phone | office | comments
$250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
$320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
• homes.com data:
sold-at | contact-agent | extra-info
$350K | (206) 634 9435 | Beautiful yard
$230K | (617) 335 4243 | Close to Seattle
$190K | (512) 342 1263 | Great lot
• Learned rules: • If “office” occurs in the name => office-phone • If “fantastic” & “great” occur frequently in data instances => description
Multi-Strategy Learning • Use a set of base learners: • Name learner, Naïve Bayes, Whirl, XML learner • And a set of recognizers: • County name, zip code, phone numbers. • Each base learner produces a prediction weighted by confidence score. • Combine base learners with a meta-learner, using stacking.
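The recognizers listed above can be pictured as narrow pattern matchers that vote on a whole column of values. The sketch below is an illustration under stated assumptions, not LSD's implementation: the regular expressions, the label names, and the choice of "fraction of matching values" as the confidence score are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical patterns; real recognizers would accept more formats.
PATTERNS = {
    "phone": re.compile(r"^\(\d{3}\) \d{3} \d{4}$"),
    "zip-code": re.compile(r"^\d{5}(-\d{4})?$"),
}


def recognize(values: List[str]) -> Dict[str, float]:
    """Score each label by the fraction of values that match its pattern."""
    if not values:
        return {label: 0.0 for label in PATTERNS}
    return {
        label: sum(1 for v in values if pattern.match(v.strip())) / len(values)
        for label, pattern in PATTERNS.items()
    }


# The contact-agent column of homes.com from the earlier slide.
column = ["(206) 634 9435", "(617) 335 4243", "(512) 342 1263"]
print(recognize(column))  # {'phone': 1.0, 'zip-code': 0.0}
```

Because a recognizer returns the same kind of output as a base learner (labels with confidence scores), it can be fed to the meta-learner without special handling.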
Base Learners • Training: from training examples (X1,C1), (X2,C2), ..., (Xm,Cm), where each Xi is an object and Ci its observed label, learn a classification model (hypothesis). • Matching: given an object X, the model returns labels weighted by confidence score. • Name Learner • training: (“location”, address) (“contactname”, name) • matching: agent-name => (name,0.7),(phone,0.3) • Naive Bayes Learner • training: (“Seattle, WA”,address) (“250K”,price) • matching: “Kent, WA” => (address,0.8),(name,0.2)
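A base learner of the Naive Bayes kind can be sketched in a few lines. This is a minimal sketch assuming word-level tokens and add-one smoothing, neither of which the slides specify; the class and function names are hypothetical. It is trained on only the two examples shown on the slide, so its confidence scores will not reproduce the slide's 0.8 and 0.2.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesLearner:
    def __init__(self) -> None:
        self.label_counts: Counter = Counter()
        self.token_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()

    def train(self, examples: List[Tuple[str, str]]) -> None:
        """Count labels and per-label tokens from (instance, label) pairs."""
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text: str) -> Dict[str, float]:
        """Return labels weighted by normalized confidence scores."""
        total = sum(self.label_counts.values())
        log_scores: Dict[str, float] = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        norm = sum(weights.values())
        return {lab: w / norm for lab, w in weights.items()}


learner = NaiveBayesLearner()
learner.train([("Seattle, WA", "address"), ("250K", "price")])
prediction = learner.predict("Kent, WA")
print(max(prediction, key=prediction.get))  # address
```

The shared token "wa" is what tips the prediction toward address, which is the same intuition as the slide's example.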
Meta-Learner: Stacking [Wolpert 92, Ting & Witten 99] • Training • uses training data to learn weights • one for each (base-learner, mediated-schema element) pair • weight (Name-Learner,address) = 0.2 • weight (Naive-Bayes,address) = 0.8 • Matching: combine predictions of base learners • computes weighted average of base-learner confidence scores • Example: for the source element area with instances “Seattle, WA”, “Kent, WA”, “Bend, OR”, the Name Learner predicts (address,0.4) and Naive Bayes predicts (address,0.9); the Meta-Learner outputs (address, 0.4*0.2 + 0.9*0.8 = 0.8)
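The matching step of the meta-learner reduces to a weighted sum per label. The sketch below reproduces the slide's arithmetic with the weights taken as given; how stacking learns those weights is not shown, and the function name `combine` and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, Tuple


def combine(
    predictions: Dict[str, Dict[str, float]],
    weights: Dict[Tuple[str, str], float],
) -> Dict[str, float]:
    """Sum base-learner confidences, each scaled by its (learner, label) weight."""
    combined: Dict[str, float] = {}
    for learner_name, scores in predictions.items():
        for label, confidence in scores.items():
            # A missing (learner, label) weight contributes nothing.
            w = weights.get((learner_name, label), 0.0)
            combined[label] = combined.get(label, 0.0) + w * confidence
    return combined


# Weights and predictions for the source element "area", as on the slide.
weights = {("Name-Learner", "address"): 0.2, ("Naive-Bayes", "address"): 0.8}
predictions = {
    "Name-Learner": {"address": 0.4},
    "Naive-Bayes": {"address": 0.9},
}
result = combine(predictions, weights)
print(round(result["address"], 2))  # 0.8
```

Keeping one weight per (base-learner, mediated-schema element) pair lets the system trust the Name Learner for some elements and Naive Bayes for others, rather than ranking the learners globally.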
The LSD Architecture • Training phase: mediated schema and source schemas yield training data for base learners; Base-Learner 1 ... Base-Learner k produce Hypothesis 1 ... Hypothesis k; the Meta-Learner produces weights for base learners. • Matching phase: Base-Learner 1 ... Base-Learner k produce predictions for instances; the Prediction Combiner and Meta-Learner produce predictions for elements; the Constraint Handler applies domain constraints and outputs mappings.
Domain Constraints • Encode user knowledge about the domain • Specified by examining mediated schema • Examples • at most one source-schema element can match address • if a source-schema element matches house-id then it is a key • avg-value(price) > avg-value(num-baths) • Given a mapping combination • can verify if it satisfies a given constraint • Example combination: area: address, sold-at: price, contact-agent: agent-phone, extra-info: address
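Checking a mapping combination against a constraint can be sketched as a set of predicates over the candidate mapping. This is an illustrative sketch only: the predicate-factory design and all names (`at_most_one`, `violated`) are assumptions, and only the frequency constraint on address is modeled. The candidate is the example combination from the slide.

```python
from collections import Counter
from typing import Callable, Dict, List

Mapping = Dict[str, str]              # source-schema element -> mediated element
Constraint = Callable[[Mapping], bool]


def at_most_one(label: str) -> Constraint:
    """Build a constraint: at most one source element may match `label`."""
    def check(mapping: Mapping) -> bool:
        return Counter(mapping.values())[label] <= 1
    return check


def violated(mapping: Mapping, constraints: Dict[str, Constraint]) -> List[str]:
    """Return the names of all constraints the mapping fails."""
    return [name for name, check in constraints.items() if not check(mapping)]


constraints = {"at most one address": at_most_one("address")}

candidate = {
    "area": "address",
    "sold-at": "price",
    "contact-agent": "agent-phone",
    "extra-info": "address",
}
print(violated(candidate, constraints))  # ['at most one address']
```

Here both area and extra-info map to address, so the combination is rejected; remapping extra-info to description would satisfy the constraint.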
Empirical Evaluation • Four domains • Real Estate I & II, Course Offerings, Faculty Listings • For each domain • create mediated DTD & domain constraints • choose five sources • extract & convert data listings into XML (faithful to schema!) • mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48 • Ten runs for each experiment - in each run: • manually provide 1-1 mappings for 3 sources • ask LSD to propose mappings for remaining 2 sources • accuracy = % of 1-1 mappings correctly identified
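The accuracy measure defined above is simple enough to state as code. The sketch below assumes mappings are represented as dictionaries from source-schema element to mediated-schema element; the gold and proposed mappings are made-up illustrations, not data from the experiments.

```python
from typing import Dict


def matching_accuracy(proposed: Dict[str, str], gold: Dict[str, str]) -> float:
    """Percentage of gold 1-1 mappings that the proposal identifies correctly."""
    if not gold:
        raise ValueError("gold mapping must not be empty")
    correct = sum(1 for element, label in gold.items()
                  if proposed.get(element) == label)
    return 100.0 * correct / len(gold)


# Hypothetical gold and proposed mappings for one source.
gold = {"sold-at": "price", "contact-agent": "agent-phone",
        "extra-info": "description"}
proposed = {"sold-at": "price", "contact-agent": "agent-phone",
            "extra-info": "address"}
accuracy = matching_accuracy(proposed, gold)
print(f"{accuracy:.1f}%")  # 66.7%
```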
Matching Accuracy • [Chart: average matching accuracy (%)] • LSD’s accuracy: 71 - 92% • Best single base learner: 42 - 72% • + Meta-learner: + 5 - 22% • + Constraint handler: + 7 - 13% • + XML learner: + 0.8 - 6%
Sensitivity to Amount of Available Data • [Plot: average matching accuracy (%) vs. number of data listings per source (Real Estate I)]
Contribution of Schema vs. Data • [Chart: average matching accuracy (%) for LSD with only schema info, LSD with only data info, and complete LSD] • More experiments in the paper [Doan et al. 01]
Contribution of Each Component • [Chart: average matching accuracy (%) without Name Learner, without Naive Bayes, without Whirl Learner, without Constraint Handler, and for the complete LSD system]
The Next Steps • Learning is a useful component. But it needs to be combined with: • User feedback • Domain constraints • General heuristics • Need a representation of mappings: • First step – see [Madhavan et al., AAAI-02] • Also defines key inference problems for such a representation. • Provides answers for the mapping language used in Piazza. • Ultimately, some first-order probabilistic representation. • Need benchmarks to measure progress.
Agenda • Elements of the Semantic Web • Piazza: a peer data-management system • A database guy’s contribution to the semantic web • The key issue: mapping between different models: • Some recent progress and current directions. • The critical issue: crossing the structure chasm.
Can We Cross the Structure Chasm? • There are two worlds: • U-world: the current web, keyword search, Google • S-world: databases, knowledge bases, structured queries • The web succeeded because it’s in the U-world. • For the semantic web to succeed, we need to make it dead simple for people to: • Structure data, locate relevant data and data sets, query. • However: • People have a hard time structuring their data • It’s harder to query structured data: need to know a terminology. • It’s harder to understand each other in the S-world. • DB and KR people have no clue how to deal with this. • More expressive power in the languages won’t help.