Relational Database Internals Alex Scotti Bloomberg LP
Outline of talk • History - origins and background • Internals - theory and practice • Internals - brief discussion of real systems • Future - observations, trends, predictions
History • The early database systems differed from the relational ones in 2 main regards • Data model • Transactional semantics • We’ll be more heavily focused on the transactional issues than the data modeling issues in this talk
Pre Relational Systems • Hierarchical Data Models • IMS • Network Data Models • CODASYL
IMS • Each record was typed by a “record type” (think “table”) • Relationships between records are represented as trees (hierarchies) between records linked by their “keys” • Writing a “query” consisted of writing a program to navigate through these links, traversing records until the right one was found.
Data types available were SEQUENTIAL, HASH, TREE • Each acted differently - a program written to use a tree could not have the data structure changed out from under it • A lack of PHYSICAL INDEPENDENCE
CODASYL • A more complex “evolution” of the IMS idea, standardized by ANSI, implemented by several vendors • Honeywell, DEC, Univac • The idea: instead of pointers forming a strict hierarchy, they now form an arbitrarily complex “network.” • Able to represent graphs • Even HARDER to program with than IMS
In 1970, Ted Codd wrote the foundational paper “A Relational Model of Data for Large Shared Data Banks” • Codd was primarily a mathematician, not particularly concerned with transaction processing • However, the two problems were incredibly tightly coupled • Work at IBM began on “relational databases” • Including locking, logging, all sorts of things that became the core of an RDBMS
Codd’s insight • A database is nothing more than a “fact store” from which it should be possible to logically infer “new facts.” • Simple but amazingly powerful. • If the goal is to store facts, then there is no benefit from storing the same fact multiple times or in multiple forms. A fact does not become MORE TRUE by repeating it. The basis of “normalization.”
Codd’s insight • If the system knows enough about the data it is storing, you can ASK IT QUESTIONS rather than TELL IT WHAT TO DO. • Declarative vs Procedural programming model • AKA, The nail in the coffin for all the Pre Relational systems • Just a matter of time - If a system is easier to use, and performs fine, why wouldn’t you use it?
Codd’s terminology has (for the most part) become replaced by the SQL terminology, which we’ll be using throughout the rest of this talk • Generalized simplification • “Logically organize your data into tables”
Going further, Codd defined 12 “rules” that he hoped would define what it meant to be “relational” • Key points are • All information is represented in tables • Nulls must be uniformly handled by all datatypes • Physical representation must be abstracted from logical representation • Physical location and distribution of data must be invisible to users
Key points are • “Set based” operations for insert / update / delete • Integrity constraints must be enforceable by the database system • There must be no way AROUND the set of enforced constraints • There must be support for at least 1 “relational language”
Codd’s work became the basis for a “next generation” database product at IBM called System R. • System R was treated as a “production proof of concept.” At the end of the project there were several commercial customers. • Around the same time, work was going on at UC Berkeley on the “Ingres” system, also based on Codd’s idea.
Neither system was successful at commercializing a general purpose database. • The award goes to Oracle. • Oracle shipped a working commercial RDBMS to anyone who would pay before IBM. • Based also on Codd’s work. • No common code between System R, Ingres, and Oracle - 3 unique lineages all based on the same idea
IBM evolved the System R “prototype” into their second system : DB2. • Ingres went on to be the basis of numerous successful commercial products • Sybase was based on Ingres code • Informix contains Ingres code (through Illustra) • MSSQL contained Ingres code (through Sybase)
Newer systems - all inspired by the same ideas and following the same principles, but without direct code sharing • Mysql • SQLite • PostgreSQL
Internals • Buffer Pool • Log • Concurrency Control • Btrees • Relational Layer
Buffer Pool • Often known as “the cache” • A page/block oriented data structure • A page in the pool conceptually “maps” to a block on a disk. (not really always true) • Needs to interface with the systems BELOW and the systems ABOVE. • Below - Disks, File systems • Above - Btrees
Both page/block oriented interfaces above and below. • Conceptually, very similar to the VM subsystem of any modern UNIX • “demand paging” • Eviction policy based on LRU approximations, often with more “smarts” than VM. • Higher levels of the system often can pass down “hints” about intended access patterns all the way to the buffer pool.
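The demand-paging and LRU-eviction behavior described above can be sketched in a few lines. This is a minimal illustration, not any real system's buffer pool; the class and method names are invented for the example:

```python
from collections import OrderedDict

class BufferPool:
    """Minimal demand-paging buffer pool with LRU eviction.
    Pages are fetched from a backing 'disk' dict on miss; the
    least recently used page is evicted when the pool is full."""

    def __init__(self, disk, capacity):
        self.disk = disk            # page_no -> data, stands in for the block device
        self.capacity = capacity
        self.pages = OrderedDict()  # page_no -> data, order tracks recency of use

    def get_page(self, page_no):
        if page_no in self.pages:                 # hit: promote to most recently used
            self.pages.move_to_end(page_no)
        else:                                     # miss: demand-page it in
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)    # evict the LRU victim
            self.pages[page_no] = self.disk[page_no]
        return self.pages[page_no]

disk = {n: f"block-{n}" for n in range(10)}
pool = BufferPool(disk, capacity=2)
pool.get_page(0); pool.get_page(1); pool.get_page(2)   # page 0 gets evicted
```

Real pools layer pinning, dirty tracking, and access-pattern hints on top of this basic scheme, and often use LRU approximations (clock algorithms) rather than a strict LRU list.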
Buffer Pool - Why? • What’s the story? It’s a cache, we get it. • Much more than that going on here!
Basics of transaction management begins with the buffer pool and the policies and protocols enforced there • Terminology • “pinned” - a page that cannot be evicted • “dirty” - a page that contains data that DOES NOT match the data on the disk • “clean” - the opposite
A dirty page BECOMES a clean page when the data in that page is DURABLY written to the disk • Can we really just write a page to the disk? Not really, it usually involves logging protocols - wait for the next section!
More terminology • “forcing” - when a transaction commits, its dirty pages are FORCED to durable storage before considering the commit complete • “stealing” - a dirty page which is part of an UNCOMMITTED transaction can be written to the disk in an effort to produce usable space in the buffer pool
What is the simplest? • FORCE / NOSTEAL • What is the highest performing and most powerful? • NOFORCE / STEAL • Not surprisingly, most real world systems today implement a NOFORCE / STEAL buffer pool policy • Support for this policy requires logging
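The two policy axes can be made concrete with a toy sketch. This is purely illustrative (the names are invented, and real systems interleave these decisions with logging, which is exactly why NOFORCE/STEAL requires a log):

```python
class Pool:
    """Toy buffer pool illustrating the FORCE and STEAL policy axes."""

    def __init__(self, force, steal):
        self.force, self.steal = force, steal
        self.disk = {}
        self.dirty = {}   # page_no -> (txn, data), uncommitted dirty pages

    def write(self, txn, page_no, data):
        self.dirty[page_no] = (txn, data)

    def evict(self, page_no):
        txn, data = self.dirty[page_no]
        if not self.steal:
            raise RuntimeError("NOSTEAL: cannot flush an uncommitted page")
        self.disk[page_no] = data     # STEAL: uncommitted data reaches disk,
        del self.dirty[page_no]       # so UNDO logging must be able to revert it

    def commit(self, txn):
        mine = [p for p, (t, _) in self.dirty.items() if t == txn]
        if self.force:                # FORCE: flush before declaring commit
            for p in mine:
                self.disk[p] = self.dirty.pop(p)[1]
        # NOFORCE: pages stay dirty in memory; REDO logging makes this safe
```

FORCE/NOSTEAL needs no log at all (nothing uncommitted ever hits disk, everything committed always does), which is why it is the simplest; NOFORCE/STEAL permits both, and pays for it with UNDO and REDO logging.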
More terminology • OVERWRITE / NO OVERWRITE • Whether or not the buffer pool will write changes to a page ON TOP of an existing page, or leave the existing page alone and write to a NEW page. • OVERWRITE systems are higher performing • most real world systems implement an OVERWRITE buffer pool. • NO OVERWRITE example: System R, shadow paging
How does data actually get written to the disk? • The “clients” of the buffer pool (the layers above) never concern themselves with writing data. They work at a layer of abstraction where they “get” buffers and “dirty” them. • Pages get written out (cleaned!) as part of a background process. • Goal is to keep some portion of the buffer pool clean.
Why are we trying to keep writing out these pages to disk in the background? • To make the system more reliable? • NO! Completely unrelated. Reliability is ensured through other means • To make sure that a READ doesn’t become a WRITE! • Need a page? Can’t get one, all dirty. • You get to “clean one” (write it) now!
Logging • Basic Idea behind logging • Before you do something, write down what it is you intend to do. • Sounds slow. Why bother with this, just DO IT! • Nope - The opposite is true. Logging can make things quicker
The highest performing buffer pool policy of NOFORCE/STEAL actually REQUIRES logging • Without logging you would compromise with a lower performing policy • Logging has the capacity to perform “magic” • Converts RANDOM (slow) I/O into SEQUENTIAL (fast) I/O! • We’ll come back to this idea
Expanding on the basic idea of logging • There are really two distinct things that you are “writing down” here • Write down what it is you are about to do: REDO logging - can “do it over” • Write down the procedure to follow to make it as if what you did NEVER HAPPENED: UNDO logging • Many times both of these pieces of information are embedded into a single “log record.” Or not. Conceptually, 2 things.
Mechanics of logging - What’s the data structure? • In its basic form, a log is a simple sequential file. Conceptually it’s not unlike a tape drive. • Each “record” in the log is identified by a unique identifier, which is typically just the physical location of the record in the file. • Call this the Log Sequence Number (LSN)
“Log Buffer” - exactly what it sounds like - a buffer of memory in front of the log. • An obvious and common optimization to make it less expensive to “write to the log” • Recoverability is endangered unless the log exposes an interface to FLUSH THE BUFFER. (and it gets called at the right places) • All real systems work this way
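A sketch of the two slides above: an append-only log where the LSN is just the byte offset of the record, fronted by an in-memory log buffer with an explicit flush. The class and method names are invented for illustration:

```python
import os, tempfile

class Log:
    """Append-only log sketch: LSN = byte offset of the record, with an
    in-memory log buffer and a flush() that makes buffered records durable."""

    def __init__(self, path):
        self.f = open(path, "ab")
        self.buffer = []                  # records appended but not yet on disk
        self.next_lsn = self.f.tell()

    def append(self, record: bytes) -> int:
        lsn = self.next_lsn               # physical position doubles as the LSN
        self.buffer.append(record)
        self.next_lsn += len(record)
        return lsn

    def flush(self):                      # called at commit / by the WAL protocol
        for rec in self.buffer:
            self.f.write(rec)
        self.f.flush()
        os.fsync(self.f.fileno())         # durability requires reaching the media
        self.buffer.clear()

path = os.path.join(tempfile.mkdtemp(), "log")
log = Log(path)
lsn1 = log.append(b"begin t1;")
lsn2 = log.append(b"insert ...;")
log.flush()
```

Note that `append` is cheap (memory only); the expensive fsync is amortized across many records per flush, which is the point of the log buffer - and why recoverability depends on flush being called at the right places.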
Subsystems are said to “generate log records” (calling APIs provided by log subsystem) • Buffer pool may need to log the allocation of a new page • Btree may need to log a page split • Relational layer may log an INSERT statement • Customers of this subsystem all over the database
2 approaches to logging • “Physical logging” • “Logical logging”
Physical logging • Log entire page images • “redo record” : “log what the page is GOING TO look like” • “undo record” : “log what the page LOOKS LIKE NOW” • Problems? • Inefficient, expensive • Poor concurrency
Problems • Inefficiency mess • Why log 2 copies of a page when I only changed a few bytes? • Concurrency mess • Systems with concurrency control at a finer granularity than the page cannot log this way. We’ll come back to that.
On the other hand • Physical logging is appealing because it is simple, and it works because of a nice property of being “testable” • We can look at a log record ABOUT a page, then look at the page, and determine which state it’s in because we RECORDED the two possible states • This turns out to be an essential property of recovery
Logical logging • Log the high level operations only • SQL • INSERT INTO TBL(A) VALUES(1) • REDO • “INSERT INTO TBL(A) VALUES(1)” • UNDO • “DELETE FROM TBL WHERE A = 1”
Elegant! • Simple! • Compact! • but it doesn’t WORK! • That SQL INSERT could decompose into dozens of page writes. • Some may have been done, then crash. You can’t look at the pages and tell which ones were done (UNDO THEM) and which ones weren’t
NO FORCE allows us to mark a transaction “committed” WITHOUT writing all of the pages. • Some may have been written, then crash • We can’t tell which ones WERE NOT written (REDO them) and which ones WERE (leave them alone) • It’s often UNSAFE to perform actions multiple times
Making logical logging work - “Physiological Logging” • “Physical ABOUT pages, logical ABOUT the contents INSIDE the page” • The idea is to keep the logging centered on the idea of pages, which works well • But log less information than a physical scheme would require
Example Physiological Operation • “Add item X to page N” • Push down the logical concept into the page level - logical INSIDE the page • SQL INSERT statement will decompose into several independent physiological operations • Each one is INDEPENDENTLY TESTABLE / UNDOABLE / REDOABLE • AKA, “it works”
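A physiological record like “add item X to page N” can be sketched as an object that is physical about which page it names and logical about the change inside it. All names here are illustrative, and the “pages” are simplified to Python lists:

```python
class AddItem:
    """One physiological log record: physical about WHICH page it touches,
    logical about the change INSIDE it ('add item x to page n')."""

    def __init__(self, page_no, item):
        self.page_no, self.item = page_no, item

    def applied(self, pages):             # independently TESTABLE
        return self.item in pages[self.page_no]

    def redo(self, pages):                # independently REDOABLE (idempotent)
        if not self.applied(pages):
            pages[self.page_no].append(self.item)

    def undo(self, pages):                # independently UNDOABLE
        if self.applied(pages):
            pages[self.page_no].remove(self.item)

# A single SQL INSERT might decompose into several such records,
# e.g. one for the heap page and one per index page (hypothetical layout):
pages = {0: [], 1: []}
ops = [AddItem(0, "row(a=1)"), AddItem(1, "idx(a=1 -> page 0)")]
for op in ops:
    op.redo(pages)
```

Because each record can be tested, redone, or undone in isolation, recovery can look at each page independently and do exactly the right thing - the property that pure logical logging lacked.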
Logging for purposes of recovery • Key technique is based on something called the “pagelsn” • Intertwining of the buffer pool and the logger • Each time you modify a page, store the LSN of the log record describing that modification ON THE PAGE ITSELF • Testability • Look at the pagelsn to determine state
Write Ahead Logging (WAL) Protocol • Tightly integrated with buffer pool • Before a dirty page is written to disk, the UNDO information for that page must be durable • Before a transaction is considered committed, the REDO information for that transaction’s pages must be durable • And that’s how a NO FORCE / STEAL system can convert random I/O into sequential
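The write-ahead rule - dirty page to disk only after the log covering it is durable - can be shown with a toy pool stamping a pagelsn on every modification. This is a sketch under simplified assumptions (in-memory “durable” log, invented names), not a real WAL implementation:

```python
class WALPool:
    """Sketch of the write-ahead rule: a dirty page may reach disk only
    after the log is durable up through that page's last LSN (pagelsn)."""

    def __init__(self):
        self.log = []             # durable log records
        self.log_buffer = []      # (lsn, record) not yet durable
        self.next_lsn = 0
        self.disk = {}
        self.pages = {}           # page_no -> (data, pagelsn)

    def modify(self, page_no, data, record):
        lsn = self.next_lsn
        self.log_buffer.append((lsn, record))   # log record generated first
        self.next_lsn += 1
        self.pages[page_no] = (data, lsn)       # stamp the pagelsn on the page

    def flush_log(self, up_to_lsn):
        self.log += [r for l, r in self.log_buffer if l <= up_to_lsn]
        self.log_buffer = [(l, r) for l, r in self.log_buffer if l > up_to_lsn]

    def write_page(self, page_no):
        data, pagelsn = self.pages[page_no]
        self.flush_log(pagelsn)                 # WAL: log first, page second
        self.disk[page_no] = (data, pagelsn)
```

Commit follows the same shape: flush the log through the transaction's last LSN, and the commit is durable with no page writes at all - the random page I/O is deferred while the sequential log I/O does the durability work.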
Basic idea behind recovery after crash • REDO all COMMITTED transactions • Some pages MAY NOT be written • as allowed by NO FORCE • UNDO all UNCOMMITTED transactions • Some pages MAY HAVE BEEN written • as allowed by STEAL
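The two recovery passes, using the pagelsn test to decide per page whether a change made it to disk, can be sketched as follows. All record fields and names are illustrative, and the undo pass is simplified (real systems log compensation records rather than rolling the pagelsn back):

```python
def recover(log, pages, committed):
    """Restart recovery sketch: REDO committed work the crash may have
    lost (allowed by NO FORCE), then UNDO uncommitted work the crash
    may have left on disk (allowed by STEAL). Each record carries
    lsn, txn, page, old, new; pages map page_no -> (value, pagelsn)."""
    # REDO pass: forward through the log
    for rec in log:
        if rec["txn"] in committed:
            _, pagelsn = pages.get(rec["page"], (None, -1))
            if pagelsn < rec["lsn"]:           # pagelsn test: change not on disk
                pages[rec["page"]] = (rec["new"], rec["lsn"])
    # UNDO pass: backward through the log
    for rec in reversed(log):
        if rec["txn"] not in committed:
            _, pagelsn = pages.get(rec["page"], (None, -1))
            if pagelsn >= rec["lsn"]:          # change made it to disk: revert
                pages[rec["page"]] = (rec["old"], rec["lsn"] - 1)
    return pages

log = [
    {"lsn": 1, "txn": "t1", "page": 0, "old": "a0", "new": "a1"},  # committed
    {"lsn": 2, "txn": "t2", "page": 1, "old": "b0", "new": "b1"},  # uncommitted
]
# Crash state: t1's page was never written (NO FORCE), t2's page was stolen
pages = {0: ("a0", 0), 1: ("b1", 2)}
recover(log, pages, committed={"t1"})
```

After recovery, page 0 carries t1's committed change and page 1 is back to its pre-t2 value - exactly the REDO-committed / UNDO-uncommitted contract above.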
Concurrency Control • Let’s talk about ACID now (finally?) • We’ll use Chris Date’s definition