1 / 23

CPS 216: Advanced Database Systems

CPS 216: Advanced Database Systems. Shivnath Babu. Outline for Today. What this class is about: Data management What we will cover in this class Logistics. What does a Database System mean to you? (Hint: What are they used for? Give examples). User/Application. Data. Data Management.

kpowers
Télécharger la présentation

CPS 216: Advanced Database Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CPS 216: Advanced Database Systems Shivnath Babu

  2. Outline for Today • What this class is about: Data management • What we will cover in this class • Logistics What does a Database System mean to you? (Hint: What are they used for? Give examples)

  3. User/Application Data Data Management Query Query Query DataBase Management System (DBMS)

  4. Example: At a Company Query 1: Is there an employee named “Nemo”? Query 2: What is “Nemo’s” salary? Query 3: How many departments are there in the company? Query 4: What is the name of “Nemo’s” department? Query 5: How many employees are there in the “Accounts” department? Employee Department

  5. High-level Query Q Answer Translates Q into best execution plan for current conditions, runs plan DataBase Management System (DBMS) DBMS Data

  6. Join (Cars.OwnerID = Owners.ID) Filter (Make = Honda and Model = Accord) Filter (Age <= 23) Example: Store that Sells Cars Owners of Honda Accords who are <= 23 years old Cars Owners

  7. High-level Query Q Answer Translates Q into best execution plan for current conditions, runs plan Keeps data safe and correct despite failures, concurrent updates, online processing, etc. DataBase Management System (DBMS) DBMS Data

  8. DBMS is multi-user • ExampleGet account balance from database;If balance > amount of withdrawal then balance = balance - amount of withdrawal; dispense cash; store new balance into database; • Homer at ATM1 withdraws $100 • Marge at ATM2 withdraws $50 • Initial balance = $400, final balance = ? • Should be $250 no matter who goes first

  9. Final balance = $250 Homer withdraws $100: read balance; $400if balance > amount then balance = balance - amount; $300 write balance; $300 Marge withdraws $50: read balance; $300if balance > amount then balance = balance - amount; $250 write balance; $250

  10. Final balance = $300 Homer withdraws $100: Marge withdraws $50: read balance; $400If balance > amount then balance = balance - amount; $350 write balance; $350 read balance; $400 if balance > amount then balance = balance - amount; $300 write balance; $300

  11. Final balance = $350 Homer withdraws $100: Marge withdraws $50: read balance; $400 if balance > amount then balance = balance - amount; $350 write balance; $350 read balance; $400 if balance > amount then balance = balance - amount; $300 write balance; $300

  12. Concurrency control in DBMS • Similar to concurrent programming problems • But data is not all in main-memory • Appears similar to file system concurrent access? • Approach taken by MySQL initially; now MySQL offers better alternatives • But want to control at much finer granularity • Or else one withdrawal would lock up all accounts!

  13. Recovery in DBMS • Example: balance transferdecrement the balance of account X by $100;increment the balance of account Y by $100; • Scenario 1: Power goes out after the first instruction • Scenario 2: DBMS buffers and updates data in memory (for efficiency); before they are written back to disk, power goes out • Log updates; undo/redo during recovery

  14. High-level Query Q Answer Translates Q into best execution plan for current conditions, runs plan Keeps data safe and correct despite failures, concurrent updates, online processing, etc. DataBase Management System (DBMS) DBMS Data

  15. Summary of modern DBMS features • Persistent storage of data • Logical data model; declarative queries and updates ! physical data independence • Multi-user concurrent access • Safety from system failures • Performance, performance, performance • Massive amounts of data (terabytes ~ petabytes) • High throughput (thousands ~ millions transactions per minute) • High availability (¸ 99.999% uptime)

  16. Parser Logical query plan Query Optimizer Physical query plan Query Executor Access method API calls Storage Manager Disk(s) Storage system API calls Modern DBMS Architecture Applications SQL DBMS File system API calls OS

  17. World of “Big Data” • Numbers reported by Google from 2007: • Data processed per month is 400 PB (PetaBytes) • Average job size is 180 GB • For 180 GB of data, it takes: • 30 minutes to read from disk (@100 MB/s) • 600 minutes to download at 5 MB/s • Big data is hard to move (but easy to store – few cents per GB) • Can throw parallelism at the problem

  18. Word Count over a Given Set of Web Pages see 1 bob 1 throw 1 see 1 spot 1 run 1 bob 1 run 1 see 2 spot 1 throw 1 see bob throw see spot run Can we do word count in parallel?

  19. The MapReduce Framework (pioneered by Google)

  20. Automatic Parallel Execution in MapReduce Handles failures automatically, e.g., restarts tasks if a node fails; runs multiples copies of the same task to avoid a slow task slowing down the whole job

  21. Course Outline • Principles of query processing (25%) • Indexes • Query execution plans and operators • Query optimization • Data storage (15%) • Databases Vs. filesystems (Google FileSystem, Hadoop Distributed FileSystem) • Row-oriented Vs. column-oriented storage • Flash memory and Solid State Drives • Scalable data processing (30%) • Parallel query plans and operators • Systems based on MapReduce • Scalable key-value stores • Concurrency control and recovery (15%) • Consistency models for data (ACID, BASE, Serializability) • Write-ahead logging • Information retrieval and Data mining (15%) • Web search (Google PageRank, inverted indexes) • Association rules and clustering

  22. Course Logistics • Useful reference: Database Systems: The Complete Book, by H. Garcia-Molina, J. D. Ullman, and J. Widom • Web site: http://www.cs.duke.edu/courses/fall09/cps216 • Grading: • Project 40% • Homework Assignments 15% • Midterm 20% • Final 25%

  23. Summary: Data Management is Important • Core aspect of most sciences and engineering today • Core need in industry (esp., “big data”) • Cool mix of theory and systems • Chances are you will find something interesting even if you primary interest is elsewhere

More Related