320 likes | 324 Vues
For 2004 ACM Honored Class Feng Qian April 3, 2007. Course Design in Database Lecture 1. Outline. Introduction & Implementation in C++ ( 钱风 ) Implementation in Java ( 周牧星 ) Schedule and grading policy ( 马融 ). Introduction. Objective: Build an Database Management System (DBMS) in C++/Java
E N D
For 2004 ACM Honored ClassFeng Qian April 3, 2007 Course Design in DatabaseLecture 1
Outline • Introduction & Implementation in C++ (钱风) • Implementation in Java (周牧星) • Schedule and grading policy (马融)
Introduction • Objective: Build an Database Management System (DBMS) in C++/Java • A team project, hard but fun engineering work • No official base code is given, you are encouraged to work from zero • No strict technical restrictions, much more flexible than previous projects (Compiler / OS) • TAs: 马融、周牧星、曲文涛、钱风
What you will learn fromDBMS project • A Profound understanding of modern DBMS • A Proficient programming ability for large engineering projects • A Perfect training for debugging skill for thousands lines of codes • A Practical simulation of teamwork environment in industrial zone No pains, No gains
Components of DBMS • Storage • Buffer • Heap File • Index • SQL Parser • SQL Engine • Concurrency Control • Database Manager • Interface Bottom Layer Top Layer
Storage (Easy, 200 lines) • Arrange data on disk, in the form of Page • Similar to Virtual Memory in OS • High level layers only knows a Page ID as an entry point for access of the data
Storage • My Implementation • Virtual address space: 4GB • Self-grow page files: 4MB each • An advanced bitmap: segment tree • Main interfaces(All deal with file operations on disk) • CreateDB, OpenDB, CloseDB, DestroyDB • AllocatePage, DeallocatePage • WritePage, ReadPage
Buffer (Easy, 300 lines) • A traditional cache strategy, store frequently accessed pages in memory (instead of in the disk storage system)
Buffer • My implementation • Store 2048 frequently used pages • 2nd-chance replacement algorithm • Main interfaces: • AllocatePage, DeallocatePage • GetPage (Handle page missing) • ReleasePage • PinPage, UnpinPage (Reference count)
Heap File (medium, 1k lines) • provides the ability of sequential access of the records, as well as insertion, deletion, updating and etc basic features. • Manage table structures • Manipulate tables in unit of tuple • Generate iterators for upper layers • The heap file defines the format of table structures and tuples in pages.
Heap File • Use single page to store a table structure CREATE TABLE student (studentid INTEGER, name VARCHAR(10), age INTEGER, schoolid INTEGER); Table Info: Table ID: 2B, Num of columns: 2B, Num of tuples: 4B, First page ID: 4B Num of pages: 4B, Table name: 10B Each Field: Column Name: 2B, Column type: 2B, Column size: 2B, Not null: 1B Key: 1B Offset: 2B, Index info: 4B In the table structure page:
Heap File • Store the table content • Assume a tuple (123,“Zhang”,20,45) is stored in Page 5 • The unique identification (RID) of this tuple is defined by its page (5) and its offset in the page (0x010h) • Use linked multiple pages to store all tuples in a table Structure Page Global Bitmap Content Page 1 Content Page 2 Content Page 3 Content Page n …
Heap File • Main Interface of my implementation • CreateTable • DropTable • InsertTuple • DeleteTuple • UpdateTuple • SearchTuple
Index (difficult, 1000 lines) • To accelerate the access of tuples • The most simple type of index: Always keep tuples in order Screenshot of Microsoft Access 2007
Index • Typical index technique: B+ Tree BT Header Page some important information about the B+ tree is recorded in the BT Header Page, such as the root page id of the tree and the key type BT Index Page BTIndexPage is the internal node and only contains index to the child B+ tree page. BT Leaf Page BTLeafPage is the leaf node and contain index entries with RIDs
Index • Common B+ Tree operations • Search, Insert, Delete • BoundaryQuery (return an array of RIDs) • Maintain B+ Tree at the same time of manipulating the heap file. • Keep the balance of the tree • To be simple, you can build a B+ Tree for only the column with the “primary key” flag. • Please refer to algorithm books and papers for more information of B+ Tree
SQL Parser (100 to 3k lines) • Parse SQL statement into the grammar tree • Two ways • Write a naïve parser yourself • Use tools (YACC for C++ users, JCup for Java users)
SQL Parser • Your DBMS should support st the following SQL statements • SELECT fields FROM tables [WHERE Wclause] [GROUP BY fields] [HAVING fexpression] [ORDER BY fields] • CREATE TABLE tablename (ctclauses) • DROP TABLE tablename • INSERT INTO tablename (fields) VALUE (values) • DELETE FROM tablename [WHERE Wclause] • UPDATE tablename SET uclauses [WHERE Wclause] • Group functions: AVG | COUNT | MIN | MAX | SUM • Relation Op.: AND | OR • Data Types: CHAR | VARCHAR | DATE | TIME | DOUBLE | INTEGER | LONG
SQL Parser • Examples • CREATE TABLE student (studentid INTEGER, name VARCHAR(30) DEFAULT '(NONAME)', age INTEGER DEFAULT -1, dat DATE DEFAULT DATE '2000-01-01' NOT NULL); • SELECT star1.name, star2.name FROM MovieStar star1, MovieStar star2 WHERE Star1.address=star2.address AND Star1.name<Star2.name; • SELECT name FROM MovieExec WHERE (cert#) IN (SELECT producerC# FROM Movie WHERE ( title , year ) IN (SELECT movieTitle , movieYear FROM StarsIn WHERE starName = 1)); • SELECT COUNT(birthdate) FROM MovieStarGROUP BY gender HAVING gender='F';
SQL Engine (Medium, 500~3k lines) • Execute the parsed SQL statement • A possible implementation
SQL Engine • Relational Algebra Query Tree
SQL Engine • Optimized Execution Tree Refer http://www.ittc.ku.edu/~sgauch/647/s99/notes/11b/sld001.htm for details
Concurrency Control (easy, 300 lines) • Different users can open the same database in the same time • They also can read from the same table concurrently • They cannot access the same table when someone do writing operations on it • Thus, a concurrency control is needed
Concurrency Control • A possible implementation • Set up a table lock • Two kinds of locks: Read lock and Write lock When a table has a write lock, no operations are allowed on it. When a table has a read lock, only read requirements are allowed. • If a user want to do some operations on it, he must wait until the previous lock is released. • Avoid dead locking!
Database Manager (medium, 500~1k lines) • The bridge between access interface and the core system • Since we allow open more than one database simultaneously, the DB manager should maintain all opened databases in current system. • Maintain system tables for each database • package the result from the low-level system, and send it to the right session • Logger
Database Manager • Main functions in my implementation • Hold the connected sessions (multi-threaded) • Manage every opened database in current system • Package the result and do response • Logger
User Interface (Easy, 500~1k lines) • Two types of user interfaces • Graphics User Interface • XDBC (JDBC, ODBC or you defined) programming interface • Remote user interfaces should be implemented • C++ users: recommend raw socket • Java users: recommend RMI
User Interface - API • I “created” a simple API. Sample program: #include “FatwormAPI.h” CFatWormAPI::Connect(“192.168.1.12”,1202); CFatWormAPI::ExecuteCommand("LOGIN AAA AAA;"); CFatWormAPI::ExecuteCommand("OPEN DATABASE STU;"); CFatWormAPI::ExecuteCommand(“SELECT * FROM student2 WHERE studentid<1000;"); for (int i=0;i<CFatWormAPI::GetSelectCount();i++) { printf("%d %s %d %d\n", CFatWormAPI::GetInteger(i,0), CFatWormAPI::GetChar(i,1), CFatWormAPI::GetInteger(i,2), CFatWormAPI::GetInteger(i,3) ); } CFatWormAPI::DisConnect();
Components of DBMS - Review Local Access Client X D B C Socket/RMI Remote Server Database Manager SQL parsing & SQL Engine Concurrency Control Heap File B+ TreeIndexing Storage & Buffer
Implementation DBMS in C++ • Advantage • Very Fast (Most real DBMS are written in C++) • Direct memory access inline static int ReadInt(const BYTE * pData,int ofs) { return *((int *)(pData+ofs)); } • Powerful Marco #define MAKE_RID(PID,OFS) ((((DWORD)(PID))<<20) | (DWORD)(OFS)) • Disadvantage • “Access violation” program– Be careful when programming!
Thank you • Contact me via e-mail • shinyflute@gmail.com • shinyflute@sjtu.org