Data Management for Peer-to-Peer Computing: A Vision

Data Management for Peer-to-Peer Computing: A Vision Philip A. Bernstein Microsoft Research Fausto Giunchiglia Univ. of Trento Anastasios Kementsietsidis Univ. of Toronto John Mylopoulos Univ. of Toronto Luciano Serafini Univ. of Trento Ilya Zaihrayeu Univ. of Trento

A Peer to Peer (P2P) Research Project • Database and AI researchers intermittently connect to exchange research ideas … about P2P DBs Toronto Seattle Trento (Italy)

Is There a Role for P2P DBs ? • Peers come and go, but must still be able to interoperate. • To us, the big question is how to cope with DBs that • are incomplete, overlapping, and mutually inconsistent • dynamically appear and disappear • have limited connectivity. • Scenario • Databases of medical patients • Complete integration is likely to be infeasible • But dynamic integration of DBs relevant to one patient could have high value.

Contributions • Why P2P databases are different • A P2P database scenario • A logic for P2P databases • Architecture and implementation issues

A Model for P2P Databases • Each peer is a node with a database. It exchanges data and services with acquaintances (i.e. other peers). • The set of acquaintances changes often, due to • site availability • changing usage patterns • Peers are fully autonomous. • No global control or central server. • Hence no global schema • It would be impractical to build one for each peer • And it might be impossible because of inconsistencies

D: Doctor P: Pharmacist H: Hospital A Motivating Scenario • A patient may be described in several DBs, which use different patient id formats, disease descriptions, etc. • When a patient is admitted to the hospital, H becomes acquainted with D • The acquaintance is dropped when treatment is over • When the doctor prescribes a drug, D becomes acquainted with P • A patient is injured skiing, so more DBs get involved Ski Clinic

Proposal: Local Relational Model (LRM) • A logic for P2P data integration • Instead of a global schema, each peer has • coordination formulas – each specifies semantic interdependencies between two acquaintances • binarydomain relations – each specifies how symbols in one database translate to symbols in an acquaintance’s database. • Each expression in a coordination formula is relative to just one participating database • Use coordination formulas and domain relations for query and update processing.

A Coordination Formula • p: pharmacist DB medication(PrescriptionID, Pid, Prod) • d: doctor DB treatment(Tid, Pid, Description, Type) where type {“hospital”, “home”}. • (i:x).A(x) means for all vin the domain of database i, A(v) is true. • A coordination formula (p:y).(p:z).(p: (x).medication(x, y, z)  d: (w).treatment(w, y, z, “home”) ) “There’s a row in treatment in the doctor DB for each row in medication in the pharmacist DB”

Domain Relation • A row <d1,d2> in domain relation rikspecifies that valued1 in DBicorresponds to value d2 in DBk • rikmay be partial • rik,rki need not be symmetric • Example - DBicontains lengths in meters and DBk in kilometers (total but not symmetric) • rik(x) = roundToClosestK(x) rik(653)=1, rik(453)=0 • rki(x) = x*1000 rki(1)=1000

Queries • A query is a coordination formula of the form A(x)  i: q(x), where • A(x) is a coordination formula • x has n variables • i is the database against which the query is posed • q is a new n-ary predicate symbol • A relational space is a pair <db,r> where db is a set of DBs and r associates an rik with each pair of DBs • <db,r> ⊨ f A relational space <db,r> satisfies a coordination formula f • The answer to a query: {ddomi| <db,r> ⊨ ((i:x).A(x)  i:x=d)} n

Interpreting a Query • A query ((i:P(x) j:R(y)) k:S(x,y) )  h: q(x,y) • Evaluate P,R,S in i,j,k (respectively) • Map these results via rih,rjh,rkhto sets si,sj,sk • And then compute ((si sj)  sk)

View-Based Data Integration • The standard model for data integration is based on • Global as view • Local as view • How does this relate to LRM? • Could use LRM to express view definitions • Either map “standard” view defns to LRM, or • Specify a restricted LRM for view definitions • Could customize LAV/GAV query interpretation for such LRM views and queries • This is work in progress.

User Interface Update Mgr Query Mgr P2P Network Wrapper LRM layer Local Information Source A Node Implementation Architecture • A classic multi-database system, with • A protocol for adding/dropping acquaintances • LRM query processing (domain mapping logic) that can cope with chains of acquaintances • Dynamic approach to materialized view creation • Tools to help a user establish an acquaintance (e.g. yellow pages, defining domain relations & coord formulas, ...)

Summary • Why P2P databases are different • A P2P database scenario • A logic for P2P databases (LRM) • Coordination formulas and domain relations • Query semantics • Architecture and implementation issues

Theoretical Results • Provide inference rules for coordination formulas • Prove that the rules are sound and complete. • Define a generalized relational theory as a theory with domain closure, distinct domain values, and a finite number of possible relation extensions (CWA). • Define relational multi-context system <T,R> as a family of relational languages (one per database) with a generalized relational theory (in T) and a set of coordination formulas (in R) for each one. • Prove that for any relational multi-context system, there’s a unique maximal relational space that satisfies it. (Generalizes Reiter’s result on CWA.) • Other results on recursive queries and query evaluation.

Data Management for Peer-to-Peer Computing: A Vision