Unit 12 Advanced Concepts: Distribution and NoSQL DBMSs



  1. Unit 12 Advanced Concepts: Distribution and NoSQL DBMSs

  2. Processing Of Distributed Data In Context: Overlaying Our Standard Levels

  3. Lower Levels Needed by All DBMSs: Not Only For Relational DBMSs

  4. New Level: Distribution Level; Lower Levels Are Enhanced

  5. An Important Note: Wider Applicability • The material on distribution in this unit is applicable to Database Management Systems beyond just Relational Database Management Systems

  6. Introduction

  7. Characteristics of Some Applications: Standard SQL Databases May Not Be Suitable • An interesting example of a high-stress application: a security trading system • Fast response • Fault tolerance: continue operating in the presence of faults • Fast application development • Correctness less important for decision making (and maybe also for execution): note conceptual similarity to OLAP but sometimes with different goals • May run on clusters of machines, so really a distributed database + user application • Minimize the use of relational databases: too heavyweight • Instead, may use NoSQL (Not SQL or Not Only SQL) systems • We will look at some concepts of distributed computing systems • But before that, let’s look at some NoSQL examples

  8. NoSQL Example: Document-oriented DBMS

  9. Document-oriented Example: MongoDB • All SQL DBMSs are alike; each NoSQL DBMS is NoSQL in its own way • We will start by briefly looking at MongoDB, which is a document-oriented (with key-value pairs) NoSQL DBMS • A MongoDB database consists of collections • A collection consists of a set of documents • A document consists of a set of fields • A field is a (key: value) pair • MongoDB uses JSON (JavaScript Object Notation)

  10. Oracle vs. MongoDB • MongoDB generalizes relations, as we see next

  11. Example of a MongoDB Document • { "email": "zk1@nyu.edu", "first_name": "Zvi", "last_name": "Kedem", "phone_numbers": [ { "voice": "+1 212 998 3101" }, { "fax": "+1 212 995 4123" } ] } • We have a document, which • contains four fields • one of the fields is an array of two documents • Note: not even in 1NF
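A minimal sketch of how this document could be stored and queried from Python with the pymongo driver; the connection URI and the database/collection names ("unit12", "people") are illustrative assumptions, and a local MongoDB server is assumed to be running.

```python
# Sketch only: insert the slide's document and query it back with pymongo.
# The URI and the "unit12"/"people" names are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
people = client["unit12"]["people"]   # no schema needs to be declared first

people.insert_one({
    "email": "zk1@nyu.edu",
    "first_name": "Zvi",
    "last_name": "Kedem",
    "phone_numbers": [
        {"voice": "+1 212 998 3101"},
        {"fax": "+1 212 995 4123"},
    ],
})

doc = people.find_one({"email": "zk1@nyu.edu"})   # look up by any field
print(doc["phone_numbers"])                       # the nested array, not in 1NF
```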

  12. NoSQL Example: Graph-oriented DBMS

  13. Relation Implementing a Binary Relationship (We Have Seen This Much Earlier) • Implement as

  14. Implementing General Relationship • Binary many-to-one mappings replaced by foreign key constraints “between tables” • Adding more tables to implement general relationships

  15. Before Relational DBMSs: A Collection of Two-level Trees (parent-child diagrams) • We want to go from parent to child and from child to parent • Where were the animals Alice likes discovered? • From Alice to 2 children in the “intermediate” structure • From the 2 children, to their parents, Horse and Cat, and see Asia and Africa • We follow edges and do not compute joins • How to implement?

  16. Replacing a Tree by a Linked List

  17. Replacing a Table by Two Linked Lists: Replacing Foreign Keys by Pointers • Can say: Replacing symbolic addresses by physical addresses • Historically, replacing physical addresses by symbolic addresses

  18. Replacing a Table by Two Linked Lists: Replacing Foreign Keys by Pointers • A network database (nothing to do with distribution) • Network = Graph • Can say: Replacing symbolic addresses by physical addresses • Historically, replacing physical addresses by symbolic addresses • That’s how we got relational databases
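As a rough illustration of following pointers instead of computing joins (not the actual pre-relational network DBMS), here is a sketch of the Alice example; the class and attribute names are invented for illustration.

```python
# Sketch only: navigate parent -> child -> other parent by following pointers,
# as a network/hierarchical system would, instead of joining tables.
class Record:
    def __init__(self, label):
        self.label = label
        self.children = []    # pointers from a parent to its child records
        self.animal = None    # pointer from a "likes" child to its animal parent

horse = Record("Horse, discovered in Asia")
cat = Record("Cat, discovered in Africa")
alice = Record("Alice")

for animal in (horse, cat):
    likes = Record("likes")        # the "intermediate" child record
    likes.animal = animal          # the child also belongs to the animal's tree
    alice.children.append(likes)

# Where were the animals Alice likes discovered? Follow edges, no join computed.
for likes in alice.children:
    print(likes.animal.label)
```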

  19. Neo4j: A Graph Database • Consists of • Vertices • Directed Edges (Arcs) • Example • Vertices are instructors and courses • Edges relate instructors to courses • Example query: MATCH (instructor { id: 'zk1' })--(course) RETURN course.title will give us the titles of all the courses that zk1 taught • Execution by searching through indexes (transparently) and graph traversal
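A minimal sketch of running this Cypher query from Python with the official neo4j driver; the bolt URI and the credentials are placeholders, and a populated Neo4j instance is assumed.

```python
# Sketch only: execute the slide's Cypher query through the neo4j Python driver.
# The URI and credentials are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

query = """
MATCH (instructor { id: 'zk1' })--(course)
RETURN course.title
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["course.title"])   # one row per course that zk1 taught

driver.close()
```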

  20. NoSQL Example: Column-oriented DBMS

  21. Columnar DBMS • It is not really very NoSQL • The main difference from the “standard” SQL model is storage • SQL databases typically store a table in a file as a sequence of records, one per tuple • Columnar databases typically store a table as a collection of files • Typically, each file stores one column of the table • Columnar databases are likely more efficient in answering OLAP queries dealing with one column • But they may use standard SQL syntax, as they essentially deal with tables

  22. Example: Row-oriented Storage (Typical SQL DBMS Implementation) • 1 file with 5 records (drawn in two lines here)

  23. Example: Column-oriented Storage (Typical Columnar DBMS Implementation) • 3 files with 5 records each
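To make the difference concrete, here is a small sketch that writes the same toy 3-column, 5-row table both ways; the file names and the data are invented for illustration.

```python
# Sketch only: the same 3-column, 5-row table stored row-wise (one file,
# one record per tuple) and column-wise (one file per column).
import csv

rows = [
    ("Alice", "MATH", 90),
    ("Bob",   "MATH", 85),
    ("Carol", "CS",   95),
    ("Dan",   "CS",   70),
    ("Eve",   "BIO",  88),
]

# Row-oriented storage: 1 file with 5 records.
with open("grades_rows.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Column-oriented storage: 3 files with 5 values each.
for i, column in enumerate(["student", "course", "grade"]):
    with open(f"grades_{column}.col", "w") as f:
        f.write("\n".join(str(r[i]) for r in rows))

# An OLAP-style aggregate over one column only has to read that column's file.
with open("grades_grade.col") as f:
    grades = [int(line) for line in f]
print(sum(grades) / len(grades))   # average grade, reading 1 of the 3 files
```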

  24. Distributing the Data

  25. Distributing The Data: Vertical and Horizontal Partitioning

  26. Distributing The Data: Very Special Case of Horizontal Partitioning

  27. Collection of Machines Each Running a DBMS • Each machine runs some DBMS, not necessarily a relational database system • But each has some version of • Physical Implementation: file system, indexes, … • Query Processor • Recovery Mechanism • Concurrency Mechanism • The new issue: coordinate the concurrent execution of several machines

  28. Issues to Revisit • ACID properties • Query execution planning • We will talk very briefly about • Recovery • Concurrency • Query execution planning • But first, to state/summarize concisely: • In practice, systems give up or relax some ACID conditions/guarantees • In practice, they hope for better performance (especially in a distributed environment and to support scalability) and flexibility • We start by examining how to guarantee ACID in a general distributed environment

  29. Recovery

  30. Global Recovery • We have a local recovery manager on each machine • It is able to guarantee • A: Atomicity • C: Consistency • D: Durability for transactions executing on its own machine • We need to guarantee ACD for transactions that run on more than one machine • So, for example, such a transaction must be either committed or aborted globally, that is, the work on each machine must be either committed or aborted (rolled back)

  31. Our Old Example: Money Transfer • Items a and b are stored on a disk attached to some machine running a DBMS • Transfer $5 from account a to b: 1. transaction starts 2. read a into xa (local variable in RAM) 3. xa := xa − 5 4. write xa onto a 5. read b into xb (local variable in RAM) 6. xb := xb + 5 7. write xb onto b 8. transaction ends • If initial values are a = 8 and b = 1, then after the execution a = 3 and b = 6
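A minimal runnable sketch of the same transfer as one atomic transaction; sqlite3 stands in here for "some machine running a DBMS", and the account table layout is an illustrative assumption.

```python
# Sketch only: the $5 transfer as one atomic transaction (sqlite3 stands in
# for the DBMS on the slide; the account table is an illustrative assumption).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?)", [("a", 8), ("b", 1)])
con.commit()

try:
    # steps 2-4: read a, subtract 5, write a
    con.execute("UPDATE account SET balance = balance - 5 WHERE name = 'a'")
    # steps 5-7: read b, add 5, write b
    con.execute("UPDATE account SET balance = balance + 5 WHERE name = 'b'")
    con.commit()       # step 8: both writes become durable together
except sqlite3.Error:
    con.rollback()     # atomicity: on failure neither write survives

print(dict(con.execute("SELECT name, balance FROM account")))   # {'a': 3, 'b': 6}
```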

  32. Old Example: New Scenario • There are 3 DBMS machines: nodes in a cluster • There is M1 that is the coordinator • There is M2 that is a participant • There is M3 that is a participant • User interacts with M1 • M2 stores a on its local disk • M3 stores b on its local disk

  33. Our New Example: Money Transfer • User asks to transfer $5 from account a to b • M1 will be the coordinator • M2 + M3 will be the participants • Very rough sketch of execution: 1. M1 starts a global transaction 2. M1 tells M2 to subtract 5 from a 3. M1 tells M3 to add 5 to b 4. M2 starts a local transaction to subtract 5 from a 5. M3 starts a local transaction to add 5 to b 6. M1 + M2 + M3 cooperate so “everything” is atomically committed or aborted: all transactions commit or abort

  34. Two-Phase Commit Protocol: General Flowchart (Simplified)

  35. Two-Phase Commit Protocol: All Commit

  36. Two-Phase Commit Protocol: A Participant Aborts ⇒ All Abort

  37. Two-Phase Commit Protocol: A Participant Not Ready ⇒ All Abort

  38. Two-Phase Commit Protocol: Some (Other) Participant Not Ready ⇒ All Abort

  39. Two-Phase Commit Protocol: Coordinator Decides: Global Commit

  40. Two-Phase Commit Protocol: A Participant Is Uncertain ⇒ It Must Wait

  41. Two-Phase Commit: Many Optimizations Possible • A participant can report it is ready on its own initiative • A participant can report that it must abort on its own initiative • If a participant crashes while uncertain, it can ask other participants whether they know what the decision was • …
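To tie the flowchart slides back to the money-transfer example, here is a minimal sketch of the two-phase commit decision rule only, with no messages, logging, or timeouts; the participant names and their vote functions are illustrative.

```python
# Sketch only: the core decision rule of two-phase commit, without networking,
# logging, or timeout handling. Participants and their votes are illustrative.

def two_phase_commit(participants):
    # Phase 1: the coordinator asks each participant to prepare and vote.
    votes = {name: prepare() for name, prepare in participants.items()}

    # Phase 2: global commit only if every vote is "ready"; any abort vote
    # (or, in a real system, a missing vote) forces a global abort.
    decision = "commit" if all(votes.values()) else "abort"
    for name in participants:
        print(f"coordinator -> {name}: {decision}")   # the decision message
    return decision

# M2 and M3 from the money-transfer example; True means "ready to commit".
two_phase_commit({"M2": lambda: True, "M3": lambda: True})    # all commit
two_phase_commit({"M2": lambda: True, "M3": lambda: False})   # all abort
```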

  42. Another Issue: Global Deadlock Handling • Assume a system with strict two-phase locking (locks held until after commit) • The system uses two-phase commit • T1 “spawned” two transactions • T[1,1] executing at site S1, which contains items a and b • T[1,2] executing at site S2, which contains items c and d • Only after global commit of T1, T[1,1], T[1,2] can their locks be released • T2 “spawned” two transactions • T[2,1] executing at site S1 • T[2,2] executing at site S2 • Only after global commit of T2, T[2,1], T[2,2] can their locks be released

  43. Another Issue: Global Deadlock Handling • At S1: T[1,1] locks a; T[2,1] locks b; T[1,1] waits to lock b • At S2: T[1,2] locks c; T[2,2] locks d; T[2,2] waits to lock c • For T[1,1] to continue, T[2,1] has to release a lock • Can only happen after T2, T[2,1], T[2,2] committed • For T[2,2] to continue, T[1,2] has to release a lock • Can only happen after T1, T[1,1], T[1,2] committed

  44. Another Issue: Global Deadlock Handling • We have a global deadlock • There is no local deadlock anywhere • Difficult to detect
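A minimal sketch of why the deadlock is only visible globally: take the union of the local waits-for edges from S1 and S2 and check for a cycle; the edge representation and the cycle check are illustrative, not a real detector.

```python
# Sketch only: neither site's local waits-for graph has a cycle, but their
# union does, which is the global deadlock from the slides.

local_waits_S1 = {("T1", "T2")}   # at S1, T[1,1] waits for the lock T[2,1] holds on b
local_waits_S2 = {("T2", "T1")}   # at S2, T[2,2] waits for the lock T[1,2] holds on c

def has_cycle(edges):
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)

    def reaches(start, target, seen):
        return any(nxt == target or
                   (nxt not in seen and reaches(nxt, target, seen | {nxt}))
                   for nxt in graph.get(start, ()))

    return any(reaches(node, node, set()) for node in graph)

print(has_cycle(local_waits_S1))                    # False: no local deadlock at S1
print(has_cycle(local_waits_S2))                    # False: no local deadlock at S2
print(has_cycle(local_waits_S1 | local_waits_S2))   # True: the global deadlock
```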

  45. Concurrency

  46. Global Concurrency Management • We assume that we know how to manage recovery, that is, a distributed transaction either commits or aborts at all sites on which it executes • ACD is guaranteed • We need to guarantee I (Isolation) also for transactions that run on more than one machine • Each machine is running a local concurrency manager, which we assume operates using rigorous locking • All locks are held until after local commit or abort on every machine • In case of global commit, all the locks are held until after the global commit decision: the coordinator writes the commit record on its log • This guarantees global serializability

  47. Extension to Multiple Copies (Replication): One Machine vs. Multiple Machines

  48. Advantages of Data Replication • It may be useful to replicate some data • To improve fault tolerance: If Machine 1 crashes, we can still access “the blue data” on Machine 2 • To improve efficiency: Both Machine 1 and Machine 2 can access “the blue data” locally, so they do not have to use the network to access that data and can access it fast

  49. Problems With Data Replication • We need to keep the replicated data consistent • “The blue data” has to be the same on Machine 1 and on Machine 2 • So, if some transaction running on Machine 1 modifies “the blue data”, we must make sure that the same modification is made (preferably transparently by the system) to “the blue data” on Machine 2 • So perhaps we could use the following protocol: If a transaction wants to modify “the blue data” on one machine, we must make sure, transparently, that the data is modified in the same way on both machines; if a transaction wants to read “the blue data”, it can read the data from any machine
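A minimal sketch of the "write all copies, read any copy" idea described in the last bullet; the dictionaries stand in for the two machines' local copies, and all names and values are illustrative.

```python
# Sketch only: "write all copies, read any copy" over two in-memory stand-ins
# for Machine 1 and Machine 2. All names and values are illustrative.

replicas = {"Machine 1": {"blue": 10}, "Machine 2": {"blue": 10}}

def write(key, value):
    # A modification must be applied to every copy to keep them consistent.
    for store in replicas.values():
        store[key] = value

def read(key, machine):
    # Any single copy can answer a read locally, without using the network.
    return replicas[machine][key]

write("blue", 42)
print(read("blue", "Machine 1"), read("blue", "Machine 2"))   # 42 42
```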

  50. A Nightmare Scenario: Network Partition • The network partitions into two sets that cannot communicate with each other: one containing Machine 1 and one containing Machine 2 • No transaction can modify “the blue data” • Because if this is possible, it can only do it on one of the machines • Then “the blue data” is not consistent • A transaction that reads “the blue data” on Machine 1 will get a different result than a transaction that reads “the blue data” on Machine 2
