14-848 Cloud Infrastructure
310 likes | 329 Vues
Learn about relational databases and the SQL language used to query and manipulate data. Understand how relational databases are organized on disk and their role in cloud infrastructure.
14-848 Cloud Infrastructure
E N D
Presentation Transcript
14-848Cloud Infrastructure Lecture 10 * Fall 2018 * Kesden
Socrative • https://api.socrative.com/rc/Nfu6Lp
Today’s Goals • A brief introduction to relational databases • A brief introduction to the SQL language used by relational databases • A brief look at how relational databases are organized on disk • Reflect on how they can be used within clouds • Reflect on why there are limits to their ability to scale
Not Among Today’s GoalS • A deep study of relational databases • Or, even minimal functional competency with SQL databases • If interested, take 15-615 or 15-645
Database “A database is an organized collection of data, stored and accessed electronically.” -- https://en.wikipedia.org/wiki/Database
Relational Database • “A relational database is a digital database based on the relational model of data, as proposed by E. F. Codd in 1970. “ • “The relational model (RM) for database management is an approach to managing data using a structure and language consistent with first-order predicate logic…where all data is represented in terms of tuples, grouped into relations. • “Virtually all relational database systems use SQL (Structured Query Language) for querying and maintaining the database.” -- https://en.wikipedia.org/wiki/Relational_database -- https://en.wikipedia.org/wiki/Relational_model
Relational DatabasesRole In Clouds • Most widely used type of database since the 1980s • Unlikely to lose that title, or see reduced growth in use, any time soon • Good for up to millions of records • Beyond that, for “Big Data” applications, we’ll talk about various NoSQL databases • Everything from managing system configuration data to customer accounts to inventories to order and shipping records, to HR data, etc.
Relational Database, Informally • A database wherein the data is structured in tables known as relations. • Rows represent entities and are unordered • A relational database, although drawn as a table, functions more like an unordered set of rows. • Columns represent attributes • Constraints can be enforced w.r.t. types and values • Schema for each table describes attributes and associated types. • Operations can be performed across tables.
Keys:What Are They? • Used to name rows within tables • Used to associate rows across tables • Can be tuple of attributes
KeYS:Example Role • Unique within relation • Used to associate relations
SQL (“Sequel”)Structured Query Language • Used by nearly all relational databases • Both a Data Definition Language (DDL) and a Data Manipulation Language (DML) • DDL: Create, define, modify, and administer (access restrictions, etc) relations • Structure, types, constraints, etc – not values therein • DML: Query and modify values: Tuples and attributes thereof • Hides organization of data, e.g. whether and how indexed, how stored, location, etc. • Big difference from older database languages that described tree paths, etc. • Name derived from earlier IBM predecessor, “Standard English QUeryLanguage”
Keys:Types • Candidate – Unique to row • Primary – Used as name for row • Foreign – Key used in another table • Think of a foreign key as a pointer
SQL:CREATE CREATE TABLE StudentBios ( StudentID CHAR(9) NOT NULL, FName VARCHAR(200)NOT NULL, LName VARCHAR(200)NOT NULL, YearEnrolled YEAR )
SQL:CREATE, REVISITED CREATE TABLE StudentCourses ( StudentID CHAR(9) NOT NULL, Course CHAR(6) NOT NULL, Semester CHAR(3) NOT NULL, Grade CHAR(2) PRIMARY KEY (StudentID, Course, Semester), FOREIGN KEY (StudentID) REFERENCES StudentBios ) Why is the primary key (StudentID, Course, Semester)? Not just StudentID?
SQL:INSERT INSERT INTO StudentBios (CHAR(9), VARCHAR(200),VARCHAR(200),YEAR) VALUES (123456789, ‘lkennedy’, ‘Laurie’, ‘Kennedy’, 2014)
SQL:SELECT SELECTFname, Lname from StudentBios WHERE YearEnrolled >= 2015
SQL:SELECT, Continued SELECTStudentID, Fname, Lname from StudentBios WHERE YearEnrolled <= 2015 ORDER BY FName,LName
SQL:DELETE DELETE FROM StudentBios WHERE StudentBios.Fname = ‘Laurie’
SQL:JOIN • Combines columns from one or more relations into a new relation • Rows are aligned based upon a matching key • Columns are selected from logically combined rows • Yes, a relation can be joined with itself, hence one or more • Complexity • Some keys may be present in some relations but not others • Depending upon which table is being iterated, some rows may be missed or some columns may be empty • This leads to different flavors of JOIN
SQL:JOIN SELECT AndrewID, Course, Semester, Grade FROM StudentBiosINNER JOIN StudentCourses WHEREStudentBIOS.StudentID = StudentCourses.StudentID • This effectively forms a new relation which is the union of the StudentBios and StudentCourses tables • The rows of the two are associated up based upon the StudentID of the the rows matching. • INNER JOIN requires matches in both tables • If there isn’t a matching key in both tables it isn’t included • The result has (AndrewID, Course, Semester, Grade) columns
SQL:Views CREATEVIEWStudentClassList AS SELECT AndrewID, Course, Semester FROM StudentBios INNER JOIN StudentCourses WHERE StudentBIOS.StudentID = StudentCourses.StudentID • This forms what is essentially a virtual table • Abstracts away complexity of underlying tables • Allows view to be created from tables that may be from different domains or systems • Allows protections to be applied to table, e.g. can access course list for student, but not course grades • May enable better system performance by aiding caching, etc.
Relational Databases:Memory Hierarchy: A Quick Review • Really approximate access times (latency): • Registers: Speed of processor • SRAM (Cache): 2-3nS = ~2.5x10-6 mS = ~2.5/1,000,000 mS • DRAM (Main Memory): 20-30nS = ~2.5x10-5 mS = 2.5/100,000 mS • FlashRAM (SSD): 0.035 mS • Disk Drive: 1 mS, sequential • Disk Drive: 10 mS, random • The name of the game is… • To build a system that performs nearly as well as if all of its memory is of the fastest type • At a cost nearly as low as if all of its memory is of the slowest type • Caching is the magic sauce
Relational Databases:Additional Considerations • Memory is limited • High throughput is key • ACID properties • Atomicity – A transaction is all done, or note done at all, but not half done (Succeed or fail) • Consistency – Find things consistent, leave things consistent. Don’t break things with side-effects. • Isolation – Even if there is concurrency, effects should be the same as if sequential • Durability – Once it is done, it is done (.) • Minimize overhead, while maximizing performance
Relational Databases:Implications OF ENVIRONMENT, GOALS • Caching is critical • Durability will require consideration of hardware • SSD or other FlashRAM layer? • Battery powered backup? • Etc • Indexes and data stored separately • Pack as many indexes into RAM as possible. Think like Haystack, but 40 years earlier. • Multiple indexes, separate for each use • Indexes and data different caching domains
Relational Databases:Typical Organization • Block/Page based storage of data pages • Often LRU cached • All same size for interchangeable allocation and caching, mapping to disk blocks, etc • Indexes constructed to speed queries • B-Tree based – sorted, range-based, etc • Caching may be page-based LRU. • Hash based – fast lookup without rich features • Trades disk space for lookup speed • Exact lookup only, not ranges, etc. • May be dynamic • Lookup results can be cached. • Indexes point into data pages • Commonly created by administrator, not based upon any intrinsic understanding by the database, itself
Relational Databases:Obtaining Robustness • Backups are hard because they may require a freeze • Blurry backups may not be useful because of ACID • Some algorithms can make blurry backup and fix-up • Replicated databases are a common option • A few replicas provide robustness without blowing scale • Need policies for working through failure and obtaining recovery • Need atomic commit protocol or master-slave, etc, to maintain consistency
Relational Databases:Scaling Up • Scaling to a point is easy enough • Buy bigger disk • Solution can be replicated • Scaling even bigger is easy enough • Buy an array • Solution can be replicated • Scaling even bigger gets complicated • Distributing whole tables makes joins more expensive • Sharding tables by row makes summarizing across attributes slow • Sharding tables by column makes lookup by entity slow • At a certain point, it may be better to focus on precise needs and structure a solution to meet them • NoSQL Databases: Row-Oriented, Column-Oriented, Key-Value Stores, etc.
Relational Databases:Scaling Up, Simple Ideas • Distill data as much as possible. • Make representation of data as dense as possible. • Keep data redundantly, if (absolutely needed) to allow distribution with performance
SQL:The STORY CONTINUES • There is a ton of theory behind relational databases • There is a ton more to the SQL language • There is a ton more to how SQL databases are organized, indexed, etc. • Take 15-615 or 15-645 if interested.
Moving Forward:Data (Not File) Storage at Scale • Key-Value Stores • Column, Row, and Document-Oriented Databases • RAM layer storage