Download
information management in p2p serge abiteboul inria futurs and univ paris 11 n.
Skip this Video
Loading SlideShow in 5 Seconds..
Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11 PowerPoint Presentation
Download Presentation
Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

350 Vues Download Presentation
Télécharger la présentation

Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Information Management in P2PSerge AbiteboulINRIA-Futurs and Univ. Paris 11 P2P Data Management, 2006, S. Abiteboul

  2. Introduction

  3. Success stories at the time of the Internet bubble • Google: management of Web pages • Mapquest: management of maps • Amazone: book catalogue • eBay: product catalogue • Napster (emule, bearshare, etc.): music database • Flickr: picture database • Wikipedia: dictionary • del.icio.us: annotations • InFrance: Meetic: dating database Kelkoo: comparative shopping They are all about publishing some database P2P Data Management, 2006, S. Abiteboul

  4. The trend is towards peer-to-peerand interactivity • P2P: A large and varying number of computers cooperate to solve some particular task without any centralized authority • Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network • seti@home; kazaa; cabal • Switch from centralized servers to communities and syndication • Interaction and Web 2.0 • Motivations: Social, organizational P2P Data Management, 2006, S. Abiteboul

  5. Information management in a P2P network • Private terminology: data ring • Information is heterogeneous, distributed, replicated, dynamic • Which info: Data + meta-data + knowledge + services • Peers are heterogeneous, autonomous and possibly mobile • From sensors to PDA to mainframe • Typically very large number of peers • Variety of requirements: QoS, performance, security, etc. P2P Data Management, 2006, S. Abiteboul

  6. Acknowledgement • Xyleme:Scalable XML warehousing • Sophie Cluet, Guy Ferran (Xyleme) & many others • ActiveXML:Language for P2P data management • Omar Benjelloun (Google), Ioana Manolescu, Tova Milo (Tel Aviv) & many others • KadoP: P2P scalable XML indexing • Ioana Manolescu, Nicoleta Preda & others • Data Ring: Infrastructure for P2P data management • Alkis Polyzotis (UC Santa Cruz) P2P Data Management, 2006, S. Abiteboul

  7. Outline • Introduction – the data ring • Calculus for P2P data management (ActiveXML) • Algebra for P2P data management (ActiveXML algebra) • Indexing in P2P (KadoP) • Conclusion • Goal of the tutorial: present issues and technology on p2p information management • Warning: it is very biased – it is not a survey P2P Data Management, 2006, S. Abiteboul

  8. Outline • Introduction – the data ring • Calculus for P2P data management (ActiveXML) • Algebra for P2P data management (ActiveXML algebra) • Indexing in P2P (KadoP) • Conclusion P2P Data Management, 2006, S. Abiteboul

  9. 1. Introduction – the data ring

  10. The information in a data ring • Data: tuples, collections, documents, relations… • Services: data sources, possibly some processing • Meta-data about resources: attribute/values pairs, annotations • Ontologies to explain data and metadata • View definitions • Data integration information, e.g., mappings between ontologies • Physical data: Indices and materialized views P2P Data Management, 2006, S. Abiteboul

  11. Functionalities of the data ring • Storage, persistence, replication • Indexing, caching, querying, updating, optimization • Schema management, access control • Fault tolerance, self tuning, monitoring • Resource discovery, history, provenance, annotations, multi-linguism, • Semantic enrichment, uncertain data • Each functionality may be achieved by a peer or by the network P2P Data Management, 2006, S. Abiteboul

  12. A mainframe database A file system Web server A PC A PDA A telephone A sensor A home appliance A car A manufacturing tool A telecom equipment A toy Another data ring And now, what is a peer? Any connected device or software with some information to share A net address and some names of resources (e.g. document or service) P2P Data Management, 2006, S. Abiteboul

  13. Advantages and disadvantages of P2P • Scaling • Performance • Optimization of parallelism • Avoid bottleneck • Replication • Availability • Replication • Cost • Avoid the cost of server • Share operational cost • Dynamicity • add/remove new data sources • Complexity • Performance • Cost for complex queries • Communication cost • Availability • Peers can leave • Consistency maintenance • Difficult to support transaction • Quality • Difficult to guarantee quality P2P Data Management, 2006, S. Abiteboul

  14. Crash course on Web standards Owl RDFS XML • Data exchange format = XML • Labeled ordered trees • Its main asset = XML schema • There is much more • Distributed computing protocol = Web services • SOAP Simple Object Access Protocols • WSDL Web Service Definition Language • UDDIUniversal Description, Discovery and Integration • BPEL Business Process Execution Language • Query languages = XPath and XQuery • Declarative query language for XML + full-text + update language • Knowledge representation = Owl or RDF/S Xquery Xpath SOAP WSDL P2P Data Management, 2006, S. Abiteboul

  15. Information used to live in islands but with the Web, this is changing:uniform access to information… …the dream for distributed data management

  16. Do you like the standards? • It is the wrong question! • Correct questions: What can you do with it? What is missing? • Is Xquery the ultimate query language for the Web? No • It is a language for querying centralized XML • We will see what it is missing • We will not talk much about semantics P2P Data Management, 2006, S. Abiteboul

  17. Automatic and distributed management of the data ring • No centralized server • No information administrator (no info manager) • Most users are non-experts • E.g., scientists • Requirements • Ease of deployment (zero-effort) • Ease of administration (zero-effort) • Ease of publication (epsilon-effort) • Ease of exploitation (epsilon-effort) • Participation in community building notably via annotations Happy database admin P2P Data Management, 2006, S. Abiteboul

  18. What should be made automatic • Self-statistics from the monitoring of the data ring • In particular, define the statistics that are needed* • Self-tuning based on the self-statistics • Choose the most appropriate organization • Decide to install access structures: indexes, views, etc. • Control replication of data and services • Self-healing • Recovery from errors • E.g., replacement of a failing Web service • And automatic file management P2P Data Management, 2006, S. Abiteboul

  19. Any hope? • Technology exists (database self-tuning, machine learning, etc.) • But self-tuning for databases has advanced very slowly • Why can this work? • There is no alternative (for db, this was just a cool gadget) • KISS (keep it simple stupid!) • The power of parallelism • This is assuming lots of machine have free cycles (true) and bandwidth is generous (not always true) P2P Data Management, 2006, S. Abiteboul

  20. Distributed access control • Goal: Control access to ring resources • Access to resources is based on access rights (ACL) • Who is controlling ACLs? • A node manages ACLs for a collection of distributed resources • Easy but against the spirit and possible bottleneck • The network manages access control • Anybody can get the data • The data is published with encryption and signatures; only nodes with proper access rights can perform reads/writes • Some techniques exist P2P Data Management, 2006, S. Abiteboul

  21. Monitoring • What is monitored? • Web service calls and database updates • The Web • Web pages • RSS feed • What is produced? • A stream of events • As a continuous service • As a RSS feed • As a Web site/page • Info-surveillance • Self-statistics and tracing • Basis for error diagnosis P2P Data Management, 2006, S. Abiteboul

  22. Streams are everywhere • In query processing • In indexing (KadoP) • In recursive queries (AXML-QSQ) • In messaging, monitoring and pub/sub • That is why we will use an algebra over streams of trees and not simply an algebra over trees P2P Data Management, 2006, S. Abiteboul

  23. Example: Edos distribution system • A system for the management of Linux distribution • Joint work with Mandriva Software and U. Tel Aviv • Community of open-source developers: thousands • System releases: about 10 000 software packages + metadata • Functionalities • Query the metadata • Query subscription • Retrieve packages • Publish a new release or update an existing one P2P Data Management, 2006, S. Abiteboul

  24. Exemple: WebContent • WebContent: an ANR platform for the management of web content • Web surveillance • Business, technical, web watching… • Participation of Gemo • WP3: knowledge • WP5: P2P content management • Partners: CEA, EADS, Thales, Bongrain, Xyleme, Exalead, many research groups (UVSQ, Grenoble, Paris 6, etc.) P2P Data Management, 2006, S. Abiteboul

  25. Taxonomy of such applications • Parameters • Number of peers and quantity of data • How volatile the peers are • The query/update workload • The functionalities that are desired • Edos: peers and documents in thousands, mostly append for updates, peers not too volatile • An extreme: Google search engine in P2P for billions of documents using millions of hyper volatile peers • Mostly interested in the first case P2P Data Management, 2006, S. Abiteboul

  26. Thesis • XQuery is fine for local XML processing and publishing • Not sufficient for distributed data management • The success of the relational model, i.e., of tables on a server : • A logic for defining tables • An algebra for describing query plans over tables • By analogy, we need for trees in a P2P system • A logic for defining distributed tree data and data services • An algebra for describing query plans over these • Proposal: ActiveXML logic and algebra P2P Data Management, 2006, S. Abiteboul

  27. Outline • Introduction – the data ring • Calculus for P2P data management (ActiveXML) • Algebra for P2P data management (ActiveXML algebra) • Indexing in P2P (KadoP) • Conclusion P2P Data Management, 2006, S. Abiteboul

  28. 2. Active XML:a logic for distributed data management

  29. The basis • AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework • Simple idea: XML documents with embedded service calls • Intensional data • Some of the data is given explicitly whereas for some, its definition (i.e. the means to acquire it when needed) is given • Dynamic data • If the data sources change, the same document will provide different information P2P Data Management, 2006, S. Abiteboul

  30. Example(omitting syntactic details) <resorts state=‘Colorado’> <resort> <name> Aspen </name> <sc> Unisys.com/snow(“Aspen”) </sc> <depth unit=“meter”>1</depth> <hotels ID=AspHotels > …. Yahoo.com/GetHotels(<city name=“Aspen”/>) </hotels> </resort> … </resorts> • May contain calls • to any SOAP web service : • e-bay.net, google.com… • to any AXML web services • to be defined P2P Data Management, 2006, S. Abiteboul

  31. Marketing  Philosophy Active answer = intensional and dynamic and flexible Embedding calls in data is an old idea in database Manon: What’s the capital of Brazil? Dad: Let’s ask Wikipedia.com! Manon: How do I get a cheap ticket to Galapagos? Dad: Let’s place a subscription on LastMinute.com! Manon: What are the countries in the EC? Dad: France, Germany, Holland, Belgium, and hum… Let’s ask YouLists.com for more! P2P Data Management, 2006, S. Abiteboul

  32. Active XML peer AXML peer soap • Peer-to-peer architecture • Each Active XML peer • Repository: manages Active XML data • Web client: calls the services inside a document • Web server: provides (parameterized) queries/updates over the repository as web services • Exchange of AXML instead of XML P2P Data Management, 2006, S. Abiteboul

  33. What is an AXML peer? • PC • Now open source ObjectWeb – queries in OQL • Peer on a mass storage system • eXist (open source XML database) queries in XQuery • Xyleme queries in XyQL • PDA or cell phone • Persistence in file system and XPATH • On going: the entire network • Data is stored in a P2P network - KadoP • More: java card, a relational database… P2P Data Management, 2006, S. Abiteboul

  34. When to activate the call? Explicit pull mode: active databases Implicit pull mode: deductive databases Push mode: query subscription What to do with its result? How long is the returned data valid? Mediation and caching Where to find the arguments? Under the service call: XML,XPATH or a service call A key issue: call activation P2P Data Management, 2006, S. Abiteboul

  35. Another key issue: what to send? • Send some AXML tree t • As result of a query or as parameter of a call • The tree t contains calls, do we have to evaluate them? • If I do, I may introduce service calls, do we have to evaluate all these calls before transmitting the data? • Hi John, what is the phone number of the Prime Minister of France? • Find his name at whoswho.com then look in the phone dir • Look in the yellow pages for deVillepin’s in phone dir of www.gov.fr • (33) 01 56 00 01 P2P Data Management, 2006, S. Abiteboul

  36. Blasphemous claim: Active XML is the proper paradigm for data exchange! Not XML + not XQuery Brings to a unique setting distributed db, deductive db, active db, stream data warehousing, mediation This is unreasonable? Yes! Plenty of works ahead… to make it work But first, the algebra Active XMLcool idea – complex problems P2P Data Management, 2006, S. Abiteboul

  37. Outline • Introduction – the data ring • Calculus for P2P data management (ActiveXML) • Algebra for P2P data management (ActiveXML algebra) • Query processing • Query optimization • Indexing in P2P (KadoP) • Conclusion P2P Data Management, 2006, S. Abiteboul

  38. 3. Active XML algebra

  39. Motivation • Relational model: centralized tables • optimization: algebraic expression and rewriting • Active XML model: distributed trees • optimization: algebraic expression and rewriting • Distributed query optimization based on algebraic rewriting of Active XML trees • Based on experiences with AXML optimization P2P Data Management, 2006, S. Abiteboul

  40. We focus on positive AXML Set-oriented data Positive/monotone services Services = tree-pattern-query-with-join queries Services produce streams Optimized by a local query optimizer Evaluated by a local query processor Out of our scope Active XML peers output stream π Local query processing join π  input stream input stream P2P Data Management, 2006, S. Abiteboul

  41. The problem • An AXML system • A set of peers • For each peer a set of documents and services • Extensional data is distributed • Intensional data (knowledge) is distributed • Defined using query services (TPQJ queries) • These services are generic: any peer can evaluate a query • A query q to some peer • Evaluate the answer to q with optimal response time P2P Data Management, 2006, S. Abiteboul

  42. AXML algebra l En … E1 E2 s@p En … E1 E2 send@p #n@p’ E1 eval@p receive@p E1 E1 • (AXML) algebraic expressions: AXML logic d@p Each such expression lives at some peer Includes the AXML trees P2P Data Management, 2006, S. Abiteboul

  43. Algebraic expressions annotations • Executing service call: • Terminated service call:  • Subtlety q@p(5): definition of intensional data eval(q@p(5)): request to evaluate it; during query optimization  q@p(5): query is being evaluated; during query processing  q@p(5): query evaluation is complete P2P Data Management, 2006, S. Abiteboul

  44. Evaluation rules: local rules l l eval@p eval@p eval@p eval@p eval@p eval@p eval@p eval@p eval@p eval@p eval@p → … tn … t1 t2 → t1 t2 tn t2 t1 tn E1 ●s@p E1 → s@p … tn … t1 t2 for l ≠ sc, s ≠ send, receive P2P Data Management, 2006, S. Abiteboul

  45. Evaluation rules: transfer rules #x@p #x@p newRoot()@p’ eval@p  receive@p eval@p’ send@p’ s@p’ s@p’ s@p’ … … … t1 t1 t1 t2 t2 t2 → & #x@p • Site p asks p’ to do the work and send the result to p  P2P Data Management, 2006, S. Abiteboul

  46. Evaluation rules: more transfer rules eval@p’ eval@p’ send@p’ send@p’ ●s@p’ ●s@p’ #x@p #x@p … … … … Z Z • When a query is evaluated, results appear • They are sent to the place that requested them • Also some rules for eof #x@p → & … Z P2P Data Management, 2006, S. Abiteboul

  47. Evaluation • Reminder: setting • An AXML system • A request to evaluate query q at peer p – eval@p( q ) • Rewrite the trees in peer workspaces until termination of the process • Results • For positive XML, this process converges… to a possibly infinite state • This process computes the answer to q • May be fairly inefficient: need for optimization! P2P Data Management, 2006, S. Abiteboul

  48. More rewrite rules to evaluate a query more efficiently Optimization

  49. Query optimization • Well-known optimization techniques for distributed data management • Pushing selections • Semijoin reducers • Horizontal, vertical, hybrid decomposition • Recursive query processing and query-subquery • Some “specific” AXML optimizations • Pushing queries over service calls • Lazy service call evaluation … • Optimizing subscription management • All are captured by the algebraic framework P2P Data Management, 2006, S. Abiteboul

  50. Example: pushing selections eval@p eval@p eval@p q q1 q1@p eval@p (q2(d@p2)) d@p2 q1@p → → → eval@p2 @p2(q2@p2(d@p2)) (q2(d)) • Same rule applies if d@p2 is replaced by a continuous query Suppose q ≡ q1((q2)): P2P Data Management, 2006, S. Abiteboul