MOCHA is an innovative middleware solution designed to address challenges in integrating distributed and heterogeneous data sources. It promotes automatic code deployment and leverages data processing at the source site, utilizing powerful machines for on-site data distillation. The architecture ensures efficient query processing by optimizing operator placement and minimizing data movement across networks. With features like user-defined types, XML metadata exchange, and filter-based data processing, MOCHA enhances the scalability and performance of client-server connectivity in complex data environments.
MOCHA: A Self-Extensible Database Middleware System for Distributed Data Sources
Manuel Rodriguez-Martinez, Nick Roussopoulos
Motivation
Data sources are distributed and heterogeneous: Fact of Life ...
[Diagram: two clients reaching Oracle 8i, Informix, XML data, and text data sources over the Internet]
Client-Server Connectivity
Not a good idea: a 2-tier architecture means FAT clients.
[Diagram: two clients connected directly over the Internet to Oracle 8i, Informix, XML data, and text data sources]
Middleware Integration Service
Middleware is a 3-tier connectivity solution: thin clients.
[Diagram: two clients, an Integration Server with a Catalog, and four Translators, one each for Oracle 8i, Informix, XML data, and text data, connected over the Internet]
Problem 1: Code Deployment
• User-defined types and functions
  • Polygon
  • Composite(): image aggregation
• Porting and manual installation of code
  • Operating system
  • Hardware platform
• Expensive software maintenance
  • Updates
  • Version management
• Security
  • Software certification
Problem 1: Code Deployment
Not scalable: expensive system growth.
[Diagram: two clients, an Integration Server with a Catalog, and four Translators, one each for Oracle 8i, Informix, XML data, and text data, connected over the Internet]
Problem 2: Query Processing
• Operator placement options
  • Limited by site-dependent software
  • Composite(): got to have it before using it!
• Most processing happens at the Integration Server
  • Powerful data servers are under-utilized
  • They act only as I/O nodes
• Excessive data movement over the network
  • Network bottleneck
  • Infeasible in WANs and on the Internet
Problem 2: Query Processing
Not scalable: inefficient evaluation of queries.
[Diagram: two clients, an Integration Server with a Catalog, and four Translators over Oracle 8i, Informix, XML data, and text data; three transfers labeled 100MB cross the Internet]
MOCHA Solution: Ship Code!
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
[Diagram: a client, the QPC with its Code Repository and Catalog, and two DAPs over Informix and Oracle; sites labeled Maryland, Texas, and Virginia, connected over the Internet]
MOCHA Solution: Filter Data!
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
[Diagram: a client, the QPC with its Code Repository and Catalog, and two DAPs over Informix and Oracle; sites labeled Maryland, Texas, and Virginia, connected over the Internet. Labels: 100MB and 200MB of tuples at the data sources; results of 150KB, 200KB, and 350KB on the network paths]
MOCHA Goals
• Automatic deployment of code (self-extensible)
  • QPC ships compiled Java classes
  • User-defined types and functions
  • XML for their metadata (easy exchange)
• Data processing at data source sites
  • Utilize powerful machines
  • On-site data distillation
• Processing based on data movement reduction
  • “Filter” data at the data sources
  • “Expand” data near the clients
The MOCHA Architecture
• Multi-threaded
• Distributed Objects
[Diagram: two clients, the QPC with its Code Repository and Catalog, and two DAPs over Informix and Oracle; labels show one Coordination Thread and two Execution Threads]
QPC: The Integration Server
QPC controls and coordinates query execution.
[Diagram: QPC components: Client API, Query Parser, Query Optimizer, Catalog Manager, Execution Engine, SQL & XML Proc. Interface, Code Loader, and DAP Access API; shown with the Code Repository, the XML Catalog, and a DAP]
DAP: The Facilitator of Data
DAP provides QPC with remote access to the data.
[Diagram: DAP components: DAP Access API, Control Module, Execution Engine, SQL & XML Proc. Interface, Code Loader, and Data Source Access Layer (JDBC, I/O API, DOM, JNI) over the Data Source. Labels: 100MB of tuples read from the data source, 150KB of results sent out]
Road Map
• Introduction
• Problem Definition
• MOCHA Architecture
• Query Processing
• Experiments
• Summary
Processing The Queries
• Issue 1: Placement and deployment of operators
  • Which operators go to the QPC, and which go to the DAPs?
• Issue 2: How to determine this placement?
  • Dynamic programming [SAC+79], [ML86]
  • But the search space is enormous
    • Placement of UDFs, joins, execution sites …
    • Plenty of “bad” plans
• In MOCHA: query optimization based on heuristics
  • The network is usually the critical factor: optimize for it first
  • CPU and I/O are cheaper: optimize for them later
  • Quickly converge to a “good” plan
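One possible reading of "optimize for the network first, CPU and I/O later" is a lexicographic comparison of candidate plans. The sketch below is only an illustration of that reading, not the MOCHA optimizer: the `PlanCost` type, the plan names, and all cost figures are hypothetical.

```python
from typing import NamedTuple


class PlanCost(NamedTuple):
    """Hypothetical cost summary for one candidate plan."""
    name: str
    network_bytes: int   # data moved between DAPs, QPC, and client
    cpu_io_cost: float   # abstract local processing cost


def choose_plan(plans):
    # Tuples compare left to right, so network volume dominates and
    # the CPU/I/O cost only breaks ties between equal network volumes.
    return min(plans, key=lambda p: (p.network_bytes, p.cpu_io_cost))


# Made-up candidates for illustration only.
candidates = [
    PlanCost("all-at-QPC", network_bytes=300_000_000, cpu_io_cost=10.0),
    PlanCost("filter-at-DAP-slow", network_bytes=350_000, cpu_io_cost=40.0),
    PlanCost("filter-at-DAP-fast", network_bytes=350_000, cpu_io_cost=25.0),
]
best = choose_plan(candidates)
print(best.name)
```

A plan that ships less data wins even if it costs more CPU; local cost matters only among plans with the same network volume.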
Operator Placement
• Data-Reducing Operators
  • “Filter” the data
  • Aggregates, predicates, projections, semi-joins
  • Composite(), Overlaps(), AvgEnergy()
• Push to the DAPs
  • Code Shipping policy (unique to MOCHA)
  • Only send back distilled results
  • Less data movement
• Cost:
  • Computation cost
  • Transfer of filtered results
Operator Placement
• Data-Inflating Operators
  • “Expand” the data
  • Projections, image processing, some joins …
  • DoubleResolution(), RotateSolid()
• Pull to the QPC
  • Data Shipping policy [FJK96]
  • Only send back raw arguments
  • Less data movement
• Cost:
  • Computation cost
  • Transfer of raw argument values
Placement Metric: VRF
Volume Reduction Factor: given an operator $\omega$ and a relation R,
$VRF(\omega) = \frac{VDT}{VDA}$
• VDT: volume of data transmitted after applying $\omega$ to R
• VDA: volume of data originally present in R
• $\omega$ is data-reducing if $VRF < 1$, e.g. Composite()
• $\omega$ is data-inflating if $VRF \geq 1$, e.g. DoubleRes()
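The VRF definition can be turned into a small calculation. This is a minimal sketch with illustrative numbers only: the function names are hypothetical, the byte counts are invented, and treating VRF equal to 1 as data-inflating is an assumption about the boundary case.

```python
def vrf(vdt_bytes: float, vda_bytes: float) -> float:
    """Volume Reduction Factor = VDT / VDA.

    vdt_bytes: volume transmitted after applying the operator to R.
    vda_bytes: volume originally present in R.
    """
    if vda_bytes <= 0:
        raise ValueError("VDA must be positive")
    return vdt_bytes / vda_bytes


def placement(vrf_value: float) -> str:
    # VRF < 1: data-reducing, so push the operator to the DAP.
    # Otherwise: data-inflating, so evaluate it at the QPC.
    # Assumption: VRF == 1 is grouped with the data-inflating case.
    return "DAP" if vrf_value < 1 else "QPC"


# Illustrative (made-up) volumes: an aggregate such as Composite() that
# turns 100 MB of images into 150 KB, and an operator such as DoubleRes()
# that quadruples the volume of its 100 MB input.
composite_vrf = vrf(150_000, 100_000_000)
double_res_vrf = vrf(400_000_000, 100_000_000)

print(placement(composite_vrf), placement(double_res_vrf))
```

With these numbers the aggregate is placed at the DAP and the resolution-doubling operator at the QPC, matching the push/pull policies on the two Operator Placement slides.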
Goal: Plans with small CVRF
Cumulative Volume Reduction Factor: given a plan P to solve query Q over relations R1, …, Rn,
$CVRF(P) = \frac{CVDT}{CVDA}$
• CVDT: volume of data transmitted by applying all operators in P to R1, …, Rn
• CVDA: volume of data originally present in R1, …, Rn
• Search space: plans with CVRF(Plan) in [0, 1]
• The optimizer searches for plans that move a minimal amount of data
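As a rough illustration of "plans with small CVRF", the sketch below computes CVRF for a few candidate plans and keeps the smallest. It is not the MOCHA optimizer: the `Plan` type, the plan names, and the volumes are hypothetical, and the plans are simply enumerated by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical candidate plan."""
    name: str
    cvdt_bytes: int  # volume transmitted by applying all operators in the plan


def cvrf(plan: Plan, cvda_bytes: int) -> float:
    """Cumulative Volume Reduction Factor = CVDT / CVDA."""
    if cvda_bytes <= 0:
        raise ValueError("CVDA must be positive")
    return plan.cvdt_bytes / cvda_bytes


def pick_plan(plans, cvda_bytes: int) -> Plan:
    # Prefer the plan that moves the least data relative to its inputs.
    return min(plans, key=lambda p: cvrf(p, cvda_bytes))


# Made-up example: two relations holding 300 MB in total.
CVDA = 300_000_000
plans = [
    Plan("ship-raw-data-to-QPC", cvdt_bytes=300_000_000),
    Plan("push-aggregate-to-DAPs", cvdt_bytes=350_000),
]
chosen = pick_plan(plans, CVDA)
print(chosen.name, cvrf(chosen, CVDA))
```

Because CVDA is the same for every plan answering the same query, minimizing CVRF here is equivalent to minimizing the bytes moved; the ratio mainly makes plans comparable on a common scale.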
Performance Evaluation
• Goals of this study:
  • Measure how good code shipping can be
  • Validate the heuristics being proposed
    • VRF
    • CVRF
  • Guide implementation of the optimizer
• Configured MOCHA with plans that place operators based on heuristics.
Experimental Environment
• Sequoia 2000 Benchmark
  • Scientific data: points, polygons, satellite images
• Distributed applications
• Software and hardware:
  • JDK 1.2
  • QPC: Sun Ultra 60, Solaris 2.6
  • DAPs: Sun Ultra 1, Sun Ultra 5, Solaris 2.6
  • Data sources
    • 2 Informix IUS 9.12 servers
  • 10 Mbps Ethernet
Reducing vs. Inflating
• Query classes
  • Composite of all images
  • Clipping and sub-setting
  • Double resolution of images
• Performance gains
  • Composites
    • 99% data reduction
    • 4-1 better performance
  • Clipping and expansion
    • 80% data reduction
    • 3-1 better performance
• Validates heuristics
[Figure: running time (secs) by query class Q1, Q2, Q3, with operators placed at the DAP versus the QPC]
VRF vs Selectivity
• Select graph identifiers based on number of vertices and arc length
• Selectivity [HS93] and cardinality [HKWY97] are not enough for distributed predicate placement
  • Need to also consider the size of the arguments for predicates!
• Consider 50% selectivity
  • DAP CVRF = 0.01
  • QPC CVRF = 1
• VRF is a better metric
[Figure: running time (secs) versus selectivity (0, .25, .50, .75, 1), with the predicate placed at the DAP versus the QPC]
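The arithmetic behind the 50% selectivity case can be sketched as follows. The tuple count and byte sizes are assumptions chosen only so that the result reproduces the slide's CVRF values of 0.01 and 1; the source does not give the actual sizes. The sketch also assumes that only the identifier and the predicate arguments count toward the original data volume.

```python
def cvrf_for_site(site: str, n_tuples: int, id_bytes: int,
                  arg_bytes: int, selectivity: float) -> float:
    """CVRF of a 'select identifiers where predicate(args)' query.

    Hypothetical model: each tuple carries a small identifier and
    large predicate arguments (e.g. the graph geometry).
    """
    vda = n_tuples * (id_bytes + arg_bytes)  # volume originally present
    if site == "DAP":
        # Predicate runs at the source: only qualifying identifiers move.
        vdt = selectivity * n_tuples * id_bytes
    elif site == "QPC":
        # Predicate runs at the QPC: every tuple's identifier and
        # arguments must be shipped before any filtering happens.
        vdt = n_tuples * (id_bytes + arg_bytes)
    else:
        raise ValueError(f"unknown site: {site}")
    return vdt / vda


# Assumed sizes: 20-byte identifiers, 980-byte arguments, 10,000 tuples.
dap = cvrf_for_site("DAP", 10_000, 20, 980, selectivity=0.5)
qpc = cvrf_for_site("QPC", 10_000, 20, 980, selectivity=0.5)
print(dap, qpc)
```

The selectivity is 50% in both cases, yet the data moved differs by a factor of 100 because the arguments are far larger than the returned identifiers. That is the sense in which selectivity alone does not determine predicate placement.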
Implementation Status
• Operational system
  • SIGMOD 2000 Demo
• Experimental deployment of MOCHA
  • NASA Earth Scientists (ESIP Federation)
  • Goddard Space Flight Center
  • NCSA
  • Land Cover Visualization Tool
Summary and Conclusions
• Proposed a new middleware architecture: MOCHA
  • Automatic code deployment (self-extensible)
    • Shipping Java classes
  • Query processing based on data movement reduction
• Proposed the VRF metric for placement of functions
  • Better than selectivity and result cardinality
• Future work
  • Deployment of MOCHA for the NASA ESIP Federation
  • Full implementation of the MOCHA optimizer
• More info:
  • http://mocha.umiacs.umd.edu/
Problem 2: Query Processing
Not scalable: inefficient evaluation of queries.
[Diagram: two clients, an Integration Server with a Catalog, and four Translators over Oracle 8i, Informix, XML data, and text data; three transfers labeled 100MB and three labeled 200MB cross the Internet]