V Storage Manager

V Storage Manager Shahram Ghandeharizadeh Computer Science Department University of Southern California

Two New Methods • void PrintStats() • Prints the cache hit rate and byte hit rate observed by your implementation of the V storage manager. • void DumpCache() • For each datazone name, show its object-id and size occupying the cache. Conclude with a summary table showing: • Data zone name, # of objects, average object size, min object size, max object size

Minor Modifications • Create method should check to see if its input dzname already exists. If so, it should not create it a second time.

Trace Driven Evaluation • A trace consisting of the following information: Trs,02/26/2009 23:07:48.4856586,Get,24,113392298,2797,0 Trs,02/26/2009 23:07:48.4856586,Get,17,120188330,64,0 Trs,02/26/2009 23:07:48.6262836,Save,24,89768490,1404,0 Trs,02/26/2009 23:07:48.7200336,Get,11,12671850,11155,0 Trs,02/26/2009 23:07:48.7200336,Get,5,449482986,229,0 Trs,02/26/2009 23:07:49.2200336,Delete,11,49174954,0,0 Trs,02/26/2009 23:07:49.2200336,Get,2,14161514,117,0 Trs,02/26/2009 23:07:49.2200336,Delete,11,444312042,0,0 • Time stamp, • Possible commands are “Get”, “Delete” and “Save”, • Followed by data-zone name, key, and size of the value.

Trace Driven Evaluation • Gathered from a powerful production server at myspace. • Idea: Submit the requests as fast as possible. We probably cannot submit requests at the rate that this server is processing them. • Objective: A multi-threaded workload generator that issues the requests as fast as possible.

Workload Generator • Consists of a main thread and N worker threads. • The main thread is in charge of reading a trace file and populating main memory data structures for the worker threads. • The worker thread reads its corresponding main memory data structure element and invokes the corresponding Get, Insert, and Delete methods of V.

Main Thread • DZ-array: an array of 25 data-zones, each represented as a Vdt element. The data field is 1 character long and its size is one. The data field corresponds to the values 0 to 24. • Invokes the Create method of V storage manager for each element of the array. If the data-zone already exists, it returns with an error message. • Populates a linked list of P EachSecond elements, each consisting of m trace elements: struct EachSecond { char Assigned[MAXTHREADS]; char Complete[MAXTHREADS]; int NumTraceElts[MAXTHREADS]; int *key[MAXTHREADS]; int *zoneid[MAXTHREADS]; int *size[MAXTHREADS]; cmndType *cmnd[MAXTHREADS]; struct EachSecond *next; } OneSecond; • Starts up N worker threads, each with a unique-id 0 to N-1.

A worker thread with id j corresponds to element j of the array in EachSecond. Iterates from element 0 to NumTraceElts[j] of cmnd, zoneid, key, size. cmndType is {Get, Save, Delete}, invoking your implementation of Get, Insert, and Delete, respectively. Zoneid is an index into the DZ-Array of the main thread. Key is a 4 byte char: Vdt data is a char, its length is 4 bytes. Worker Thread struct EachSecond { char Assigned[MAXTHREADS]; char Complete[MAXTHREADS]; int NumTraceElts[MAXTHREADS]; int *key[MAXTHREADS]; int *zoneid[MAXTHREADS]; int *size[MAXTHREADS]; cmndType *cmnd[MAXTHREADS]; struct EachSecond *next; } OneSecond;

Synchronization • The main thread and worker threads are synchronized using handles (HANDLE hEvent). • Main thread creates an event and assigns it to hEvent. • After populating the P elements of EachSecond linked list and activating N threads to process the elements, the main thread does: For (int k = 0; k < N; k++) WaitForSingleObject(hEvent, INFINITE) • When a worker thread is done with one EachSecond element, it sets the Complete[thread-id] to 1 and does a SetEvent(hEvent). • Once the main-thread falls out of the for loop, it checks to see if the first N elements of Complete array in the current EachSecond are set. If so, this means all the N threads are done with this EachSecond element. • Main thread populates this EachSecond element with additional trace elements and links it to the end of the P list.

Termination Condition • Once the main thread hits the end of the trace file, it will wait for the N threads to complete. • Once a worker thread process all elements of the P linked list (encounters a “Null” for the next element), it terminates by returning.

An Overview of Query Optimization in Relational Systems (by S. Chaudhuri) Shahram Ghandeharizadeh Computer Science Department University of Southern California

Terminology • A SQL relational database management system consists of: • Query optimizer • Query execution engine: implements a set of physical operators. • An operator consumes one or more data streams and produces an output data stream, e.g., sort, sequential scan, index scan, nested loop join, etc. • Physical operator tree or execution plan ties different operators with one another.

Terminology (Cont…) • Physical property is any characteristics of a plan that is not shared by all plans for the same logical expression, but can impact the cost of subsequent operations. Concept of interesting order in System R.

Query Execution Engine • RSS of system R is a query execution engine. • Is Berkeley-DB a query execution engine?

Query Optimizer • Generates the most efficient execution plan for the execution engine. • Non-trivial because there can be a large number of possible trees. • The optimization criteria might be different: throughput versus response time. • Optimization as a search problem, providing: • A space of plans, the search space. • Cost estimation technique • Enumeration algorithm to search through the search space. • Ideal optimizer: • The search space includes low cost plans, • Cost estimation techniques are accurate, • Enumeration algorithm is efficient.

Basic Estimation Framework • Collect statistical summaries of data that has been stored. • Given an operator and the statistical summary for each of its input data streams, determine the: • Statistical summary of the output data stream, • Estimated cost of executing the operation.

EQUALITY JOIN (Cont…) • Example: Assume the following statistics on the Employee and Department relations: t(Dept)=1000 tuples, P(Dept)=100 disk pages, ν(Dept,dno)=1000, ν(Dept,dname)=500. t(Employee)=100,000 tuples and P(Employee)=10,000 pages. Note that 10 tuples of each relation fit on a disk page. Assume that a concatenation of one Employee and one Dept record is wide enough to enable a disk page to hold five such records. Lets compare the cost of two alternative algebraic expressions for processing a query that retrieves those employees that work for the toy department: • б dname=Toy(Emp Dept) • Emp б dname=Toy(Dept)

EQUALITY JOIN (Cont…) • б dname=Toy(Emp Dept) t(Tmp1) = t(Emp) × t(Dept) / ν(Dept, dno) = 100,000 ×1000 / 1000 = 100,000 P(Tmp1) = 100,000 / 5 = 20,000 C( ) = P(Dept) + P(Emp) ×P(Dept) = 100 +10,000 ×100 = 1,000,100 (page nested loop) Cw(Tmp1) = 20,000 Cw(б) = 20,000 t(Tmp2) = t(Tmp1) / ν(Dept, dname) = 100,000 / 500 = 200 P(Tmp2) = 200 / 5 = 40 Cw(Tmp2) = 40 Cost = C( ) + Cw(Tmp1) + C(б) + Cw(Tmp2) = 1,000,100 + 20,000 + 20,000 + 40 = 1,040,140 I/O Tmp2 б dname=Toy Tmp1 Emp Dept

EQUALITY JOIN (Cont…) • Emp б dname=Toy(Dept) Cw(б) = 100 t(Tmp1) = t(Dept) / ν(Dept, dno) = 1000 / 500 = 2 P(Tmp1) = 1 Cw(Tmp1) = 1 C( ) = P(Tmp1) + P(Tmp1) ×P(Emp) = 1 +1×10,000 = 10,001 (page nested loop) t(Tmp2) = t(Emp) × t(Tmp1) / ν(Emp,dno) = 100,000 ×2 / 1000 = 200 P(Tmp2) = 200 / 5 = 40 Cw(Tmp2) = 40 Cost = C(б) + Cw(Tmp1) + C( ) + Cw(Tmp2) = 100 + 1 + 10,001 + 40 = 10,142 I/O Tmp2 Emp Tmp1 б dname=Toy Dept

Statistical Summaries of Data • Equi-height histograms

Review of System R • Use of dynamic programming and interesting order as a heuristic. • Dynamic programming when optimizing a query with multiple join predicates • Assumption: To obtain an optimal plan for a query Q consisting of k joins, it suffices to consider only the optimal plans for subexpressions of Q that consist of (k-1) joins and extend those plans with an additional join. • Prune suboptimal plans for subexpressions of Q consisting of (k-1) joins. • For example, the optimal plan for {R1, R2, R3, R4} is obtained by picking the plan with the cheapest cost from among the optimal plans for: Join({R1, R2, R3}, R4) Join({R1, R2, R4}, R3) Join({R1, R3, R4}, R2) Join({R2, R3, R4}, R1) Instead of analyzing O(n!) plans, it considers O(n 2n-1) plans where n is the number of relations.

Left Outer Join • R Left Outer Join S

Merging Views • If one or more relations in a query are views, and each is defined using a conjunctive predicate, then the view definition can simply be “unfolded” to obtain a single block SQL query. • May not work when the views are more complex using aggregates or eliminate duplicates using “Select distinct”.

Nested Subqueries • “Flatten” nested queries whenever possible ?

Nested Subqueries • “Flatten” nested queries whenever possible

“Flattening” Queries with Aggregates • Aggregates are more tricky. Consider:

“Flattening” Queries with Aggregates • What is wrong with this? SELECT Dept.name FROM Dept, Emp WHERE Dept.name = Emp.dept_name GROUP BY Dept.name HAVING Dept.num-of-machines < Count(Emp.*)

“Flattening” Queries with Aggregates • Correct flattening will use outerjoin:

V Storage Manager