
Topic for Thursday?






Presentation Transcript


  1. Topic for Thursday?

  2. Miscellaneous Topics in Databases

  3. Parallel DBMS

  4. Why Parallel Access To Data? Scanning 1 terabyte at 10 MB/s of bandwidth takes about 1.2 days; with 1,000-way parallelism the same scan takes about 1.5 minutes. Parallelism: divide a big problem into many smaller ones to be solved in parallel.
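A quick sanity check of the slide's numbers, assuming decimal units (1 TB = 10**12 bytes, 10 MB/s = 10**7 bytes/s):

```python
# Back-of-the-envelope check of the scan times quoted on the slide.
TERABYTE = 10**12          # bytes (decimal TB assumed)
RATE = 10 * 10**6          # 10 MB/s of bandwidth per disk

serial_seconds = TERABYTE / RATE           # 100,000 s
serial_days = serial_seconds / 86_400      # ~1.16 days

parallel_seconds = serial_seconds / 1_000  # 1,000-way parallelism
parallel_minutes = parallel_seconds / 60   # ~1.7 minutes

print(round(serial_days, 2), round(parallel_minutes, 2))
```

The slide's "1.2 days" and "1.5 minutes" are round figures; the exact values are about 1.16 days and 1.67 minutes.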

  5. Parallel DBMS: Intro • Parallelism is natural to DBMS processing • Pipeline parallelism: many machines each doing one step in a multi-step process. • Partition parallelism: many machines doing the same thing to different pieces of data (outputs split N ways, inputs merged). • Both are natural in DBMS! (Figure: any sequential program can feed another in a pipeline, or run as partitioned copies over split inputs.)
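A toy illustration of the two forms of parallelism (not a DBMS; the squaring "operator" and 4-way split are made up for the sketch):

```python
# Partition parallelism: split the input N ways, run the SAME operation on
# each piece, then merge. Pipeline parallelism: chain steps so each stage
# consumes the previous stage's output.
from concurrent.futures import ThreadPoolExecutor

data = list(range(100))

# Partition parallelism: 4-way round-robin split, same scan on each piece.
def scan(partition):
    return sum(x * x for x in partition)

parts = [data[i::4] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    merged = sum(pool.map(scan, parts))       # merge partial results

# Pipeline parallelism: step 2 consumes step 1's output as it streams.
squared = (x * x for x in data)               # step 1: transform
total = sum(squared)                          # step 2: aggregate

print(merged, total)
```

Both strategies compute the same answer; they differ only in how the work is divided among machines.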

  6. Some || Terminology • Speed-Up: more resources means proportionally less time for a given amount of data. • Scale-Up: if resources are increased in proportion to the increase in data size, time is constant. • Why Realistic <> Ideal? (Charts: throughput in Xact/sec. and response time in sec./Xact versus degree of ||-ism, each showing an Ideal line and a Realistic curve that falls short of it.)
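The two metrics can be written as simple ratios (the timings below are illustrative numbers, not measurements):

```python
# Speed-up: same problem, more resources -> how much faster?
def speed_up(t_small_system, t_big_system):
    return t_small_system / t_big_system

# Scale-up: problem grows with the resources -> does time stay constant?
# Ideal scale-up is 1.0 (big problem on big system takes as long as
# small problem on small system).
def scale_up(t_small_on_small, t_big_on_big):
    return t_small_on_small / t_big_on_big

# Ideal: 10x resources gives speed-up 10 and scale-up 1.
print(speed_up(100.0, 10.0), scale_up(10.0, 10.0))

# Realistic: startup costs, interference, and skew keep us below ideal.
print(speed_up(100.0, 14.0))
```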

  7. Introduction • Parallel machines are becoming quite common and affordable • Prices of microprocessors, memory and disks have dropped sharply • Recent desktop computers feature multiple processors and this trend is projected to accelerate • Databases are growing increasingly large • large volumes of transaction data are collected and stored for later analysis. • multimedia objects like images are increasingly stored in databases • Large-scale parallel database systems increasingly used for: • storing large volumes of data • processing time-consuming decision-support queries • providing high throughput for transaction processing

  8. Google data centers around the world, as of 2008

  9. Parallelism in Databases • Data can be partitioned across multiple disks for parallel I/O. • Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel • data can be partitioned and each processor can work independently on its own partition • Results merged when done • Different queries can be run in parallel with each other. • Concurrency control takes care of conflicts. • Thus, databases naturally lend themselves to parallelism.

  10. Partitioning • Horizontal partitioning (sharding) • involves putting different rows into different tables • Ex: customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest • Vertical partitioning • involves creating tables with fewer columns and using additional tables to store the remaining columns • partitions columns even when they are already normalized • called "row splitting" (the row is split by its columns) • Ex: split (slow-to-find) dynamic data from (fast-to-find) static data in a table where the dynamic data is not used as often as the static
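A sketch of the two styles on in-memory rows. The ZIP-code split is the slide's example; the customer rows and column names are hypothetical:

```python
# Hypothetical Customers rows (id, name, zip are "static"; notes is a
# large, rarely-read "dynamic" column).
customers = [
    {"id": 1, "name": "Ann", "zip": 10001, "notes": "long dynamic text..."},
    {"id": 2, "name": "Bob", "zip": 90210, "notes": "another blob..."},
]

# Horizontal partitioning (sharding): different ROWS go to different tables.
customers_east = [r for r in customers if r["zip"] < 50000]
customers_west = [r for r in customers if r["zip"] >= 50000]

# Vertical partitioning ("row splitting"): different COLUMNS go to
# different tables, linked by the key.
static_part = [{"id": r["id"], "name": r["name"], "zip": r["zip"]}
               for r in customers]
dynamic_part = [{"id": r["id"], "notes": r["notes"]} for r in customers]

print(len(customers_east), len(customers_west))
```

Joining `static_part` and `dynamic_part` on `id` reconstructs the original rows, which is exactly the cost vertical partitioning trades for narrower, faster-to-scan tables.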

  11. Comparison of Partitioning Techniques • Evaluate how well partitioning techniques support the following types of data access: 1. Scanning the entire relation. 2. Locating a tuple associatively – point queries. • E.g., r.A = 25. 3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. • E.g., 10 ≤ r.A < 25.
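The trade-off can be seen on toy in-memory "disks". The 4-disk setup and the partitioning vector [25, 50, 75] are made up for the sketch; the queries are the slide's:

```python
# Round-robin vs. range partitioning for point and range queries.
N_DISKS = 4
values = list(range(100))                       # tuples, keyed by attribute A

round_robin = [values[i::N_DISKS] for i in range(N_DISKS)]

vector = [25, 50, 75]                           # range-partitioning vector
def range_disk(a):
    for i, v in enumerate(vector):
        if a < v:
            return i
    return len(vector)

range_parts = [[] for _ in range(N_DISKS)]
for a in values:
    range_parts[range_disk(a)].append(a)

# Point query r.A = 25: range partitioning computes the one disk to read;
# round-robin gives no clue, so all N disks must be searched.
point_disk = range_disk(25)
disks_holding_25 = sum(1 for d in round_robin if 25 in d)

# Range query 10 <= r.A < 25: contiguous values land on few range partitions.
needed = {range_disk(a) for a in range(10, 25)}
print(point_disk, disks_holding_25, needed)
```

Round-robin excels at full scans (perfect balance) but is the worst choice for point and range queries, which is exactly the comparison the slide sets up.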

  12. Handling Skew using Histograms • Balanced partitioning vector can be constructed from histogram in a relatively straightforward fashion • Assume uniform distribution within each range of the histogram • Histogram can be constructed by scanning relation, or sampling (blocks containing) tuples of the relation
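A sketch of that construction. The bucket boundaries and counts are hypothetical; the uniform-within-bucket assumption is the one stated on the slide:

```python
# Build a balanced range-partitioning vector from a histogram of
# (low, high, count) buckets, interpolating inside a bucket under the
# uniform-distribution assumption.
def partition_vector(buckets, n_parts):
    total = sum(c for _, _, c in buckets)
    per_part = total / n_parts          # target tuples per partition
    vector, acc, need = [], 0.0, per_part
    for low, high, count in buckets:
        # Place as many cut points as fall inside this bucket.
        while acc + count >= need and len(vector) < n_parts - 1:
            frac = (need - acc) / count             # uniform within bucket
            vector.append(low + frac * (high - low))
            need += per_part
        acc += count
    return vector

# Skewed data: 900 of 1,000 tuples fall in [0, 10).
buckets = [(0, 10, 900), (10, 20, 50), (20, 30, 50)]
print(partition_vector(buckets, 2))
```

Without the histogram, a naive cut at 15 would put ~950 tuples on one partition; the histogram-derived cut near 5.6 balances them 500/500.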

  13. Interquery Parallelism • Queries/transactions execute in parallel with one another • concurrent processing • Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second. • Easiest form of parallelism to support

  14. Intraquery Parallelism • Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries • Two complementary forms of intraquery parallelism : • Intraoperation Parallelism – parallelize the execution of each individual operation in the query (each CPU runs on a subset of tuples) • Interoperation Parallelism – execute the different operations in a query expression in parallel. (each CPU runs a subset of operations on the data)

  15. Parallel Join • The join operation requires pairs of tuples to be tested to see if they satisfy the join condition, and if they do, the pair is added to the join output. • Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. • In a final step, the results from each processor can be collected together to produce the final result.
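A minimal sketch of one common realization, a partitioned hash join. The relations, attribute names, and 3-"processor" setup are invented for illustration:

```python
# Partitioned parallel hash join: split both relations on the join
# attribute with the SAME hash function, join each pair of partitions
# locally, then collect the partial results.
N_PROC = 3

def partition(relation, key, n):
    parts = [[] for _ in range(n)]
    for tup in relation:
        parts[hash(tup[key]) % n].append(tup)
    return parts

def local_hash_join(r_part, s_part, key):
    table = {}
    for r in r_part:                              # build phase
        table.setdefault(r[key], []).append(r)
    out = []
    for s in s_part:                              # probe phase
        for r in table.get(s[key], []):
            out.append({**r, **s})
    return out

R = [{"dept": d, "dname": f"D{d}"} for d in range(5)]
S = [{"dept": e % 5, "emp": e} for e in range(12)]

Rp = partition(R, "dept", N_PROC)
Sp = partition(S, "dept", N_PROC)
result = []
for i in range(N_PROC):                           # conceptually in parallel
    result.extend(local_hash_join(Rp[i], Sp[i], "dept"))

print(len(result))
```

Because matching tuples hash to the same partition, no processor ever needs another processor's data during the local join; only the final collection step communicates.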

  16. Query Optimization • Query optimization in parallel databases is more complex than in sequential databases • Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention • When scheduling execution tree in parallel system, must decide: • How to parallelize each operation • how many processors to use for it • What operations to pipeline • what operations to execute independently in parallel • what operations to execute sequentially • Determining the amount of resources to allocate for each operation is a problem • E.g., allocating more processors than optimal can result in high communication overhead

  17. Deductive Databases

  18. Overview of Deductive Databases • Declarative Language • Language to specify rules • Inference Engine (Deduction Machine) • Can deduce new facts by interpreting the rules • Related to logic programming • Prolog language (Prolog => Programming in logic) • Uses backward chaining to evaluate • Top-down application of the rules • Consists of: • Facts • Similar to relation specification without the necessity of including attribute names • Rules • Similar to relational views (virtual relations that are not stored)

  19. Prolog/Datalog Notation • Facts are provided as predicates • A predicate has • a name • a fixed number of arguments • Convention: • Constants are numeric or character strings • Variables start with upper-case letters • E.g., SUPERVISE(Supervisor, Supervisee) • States that Supervisor SUPERVISE(s) Supervisee

  20. Prolog/Datalog Notation • Rule • Is of the form head :- body • where :- is read as "if" (the head holds if the body holds) • E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y) • E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)

  21. Prolog/Datalog Notation • Query • Involves a predicate symbol followed by some variable arguments to answer the question • E.g., SUPERIOR(james,Y)? • E.g., SUBORDINATE(james,X)?
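A minimal sketch of evaluating these rules and queries in Python. The SUPERVISE facts (franklin, john, jennifer) are hypothetical; the rules and the query SUPERIOR(james,Y)? are the slides':

```python
# Facts: SUPERVISE(Supervisor, Supervisee) -- hypothetical tuples.
supervise = {("james", "franklin"), ("franklin", "john"),
             ("james", "jennifer")}

# Rules applied as set comprehensions:
#   SUPERIOR(X,Y)    :- SUPERVISE(X,Y)
#   SUBORDINATE(Y,X) :- SUPERVISE(X,Y)
superior = {(x, y) for (x, y) in supervise}
subordinate = {(y, x) for (x, y) in supervise}

# Query SUPERIOR(james, Y)? -- bind X to the constant james, enumerate Y.
answers = sorted(y for (x, y) in superior if x == "james")
print(answers)
```

A real deductive database would also handle recursive rules (e.g. a transitive SUPERIOR) via fixpoint iteration or backward chaining; this sketch only covers the single-step rules shown above.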

  22. Supervisory tree Prolog notation

  23. Proving a new fact

  24. Data Mining

  25. Definition Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Example pattern (Census Bureau data): If (relationship = husband), then (gender = male). Confidence: 99.6%.

  26. Definition (Cont.) Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns.

  27. Why Use Data Mining Today? Human analysis skills are inadequate: • Volume and dimensionality of the data • High data growth rate Availability of: • Data • Storage • Computational power • Off-the-shelf software • Expertise

  28. The Knowledge Discovery Process Steps: • Identify business problem • Data mining • Action • Evaluation and measurement • Deployment and integration into business processes

  29. Preprocessing and Mining (Figure: Original Data → Data Integration and Selection → Target Data → Preprocessing → Preprocessed Data → Model Construction → Patterns → Interpretation → Knowledge)

  30. Data Mining Techniques • Supervised learning • Classification and regression • Unsupervised learning • Clustering • Dependency modeling • Associations, summarization, causality • Outlier and deviation detection • Trend analysis and change detection

  31. Example Application: Sky Survey • Input data: 3 TB of image data with 2 billion sky objects, took more than six years to complete • Goal: Generate a catalog with all objects and their type • Method: Use decision trees as data mining model • Results: • 94% accuracy in predicting sky object classes • Increased number of faint objects classified by 300% • Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time

  32. Classification Example • Example training database • Two predictor attributes: Age and Car-type (Sport, Minivan and Truck) • Age is ordered, Car-type is a categorical attribute • Class label indicates whether the person bought the product • Dependent attribute is categorical

  33. Goals and Requirements • Goals: • To produce an accurate classifier/regression function • To understand the structure of the problem • Requirements on the model: • High accuracy • Understandable by humans, interpretable • Fast construction for very large training databases

  34. What are Decision Trees? (Figure: a decision tree over the training database above, shown next to the matching partition of the Age × Car-type plane. Root tests Age: if Age >= 30, predict YES; if Age < 30, test Car Type: Minivan → YES, Sports or Truck → NO.)
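Reading the slide's figure as "Age >= 30 → YES; otherwise Minivan → YES, Sports/Truck → NO" (my reconstruction of the garbled layout), the tree is just a pair of nested tests:

```python
# The slide's decision tree as a plain function. The split at Age 30 and
# the car-type leaves follow the figure; the training data is hypothetical.
def buys_product(age, car_type):
    if age >= 30:
        return "YES"
    # Age < 30: the decision depends on the car type.
    return "YES" if car_type == "Minivan" else "NO"

print(buys_product(45, "Sports"),
      buys_product(25, "Minivan"),
      buys_product(25, "Truck"))
```

This is why decision trees meet the "understandable by humans" requirement from the previous slide: the learned model is literally a readable sequence of if-tests.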

  35. Density-Based Clustering • A cluster is defined as a connected dense component. • Density is defined in terms of number of neighbors of a point. • We can find clusters of arbitrary shape
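A very reduced one-dimensional sketch of the idea (connected dense components), with made-up points and thresholds; real density-based algorithms such as DBSCAN work in any dimension and distinguish core, border, and noise points:

```python
# Points belong to the same cluster if they form a chain where consecutive
# points are within eps of each other; components with fewer than min_pts
# points are discarded as noise.
def density_clusters(points, eps=1.5, min_pts=2):
    points = sorted(points)
    clusters, current = [], [points[0]]
    for p in points[1:]:
        if p - current[-1] <= eps:      # still in the connected component
            current.append(p)
        else:                           # gap: start a new component
            clusters.append(current)
            current = [p]
    clusters.append(current)
    return [c for c in clusters if len(c) >= min_pts]   # keep dense ones

print(density_clusters([0.0, 1.0, 1.5, 10.0, 10.5, 30.0]))
```

Note that nothing here assumes the clusters are round or convex, which is the point of the slide's "arbitrary shape" claim.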

  36. Market Basket Analysis • Consider shopping cart filled with several items • Market basket analysis tries to answer the following questions: • Who makes purchases? • What do customers buy together? • In what order do customers purchase items?

  37. Market Basket Analysis (Contd.) • Co-occurrences • 80% of all customers purchase items X, Y and Z together. • Association rules • 60% of all customers who purchase X and Y also buy Z. • Sequential patterns • 60% of customers who first buy X also purchase Y within three weeks.
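The "60% of customers who purchase X and Y also buy Z" statement is the confidence of the association rule {X, Y} → {Z}. A sketch on hypothetical baskets:

```python
# Support = fraction of baskets containing an itemset;
# confidence of lhs -> rhs = support(lhs ∪ rhs) / support(lhs).
baskets = [
    {"X", "Y", "Z"}, {"X", "Y", "Z"}, {"X", "Y", "Z"},
    {"X", "Y"}, {"X", "Y"}, {"Z"},
]

def support(itemset):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

# 3 of the 5 baskets containing both X and Y also contain Z -> 0.6.
print(confidence({"X", "Y"}, {"Z"}))
```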

  38. Spatial Data

  39. What is a Spatial Database? • Database that: • Stores spatial objects • Manipulates spatial objects just like other objects in the database

  40. What is Spatial data? • Data which describes either location or shape, e.g. House or Fire Hydrant location; Roads, Rivers, Pipelines, Power lines; Forests, Parks, Municipalities, Lakes • In the abstract, reductionist view of the computer, these entities are represented as Points, Lines, and Polygons.

  41. Roads are represented as Lines; Mail Boxes are represented as Points

  42. Land Use Classifications are represented as Polygons

  43. Combination of all the previous data

  44. Spatial Relationships • Not just interested in location, also interested in “Relationships” between objects that are very hard to model outside the spatial domain. • The most common relationships are • Proximity : distance • Adjacency : “touching” and “connectivity” • Containment : inside/overlapping
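The three relationships on toy geometries (pure Python; a real spatial database evaluates these with geometry types and spatial indexes, and the lot/edge encoding below is invented for the sketch):

```python
from math import hypot

# Proximity: Euclidean distance between two points.
def proximity(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])

# Adjacency ("touching"): here, lots are modeled as sets of boundary-edge
# labels, and two lots are adjacent if they share an edge.
def adjacent(lot_a, lot_b):
    return bool(lot_a & lot_b)

# Containment: is a point inside an axis-aligned rectangle?
def contains(rect, p):
    (x1, y1), (x2, y2) = rect
    return x1 <= p[0] <= x2 and y1 <= p[1] <= y2

lot1 = {("A", "B"), ("B", "C")}
lot2 = {("B", "C"), ("C", "D")}        # shares edge B-C with lot1

print(proximity((0, 0), (3, 4)),
      adjacent(lot1, lot2),
      contains(((0, 0), (10, 10)), (3, 4)))
```

These are exactly the queries in the following slides: distance to a toxic waste dump or a pub (proximity), lots sharing an edge (adjacency), and overlap/containment tests.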

  45. Spatial Relationships Distance between a toxic waste dump and a piece of property you were considering buying.

  46. Spatial Relationships Distance to various pubs

  47. Spatial Relationships Adjacency: All the lots which share an edge

  48. Connectivity: Tributary relationships in river networks
