This document outlines the Critical Design Review held on November 5-6, 2007, focusing on the PS1 project, which involves a distributed database architecture for managing astronomical data. Key topics include ingest workflows, spatial partitioning, indexing algorithms, and efficient SQL implementations to facilitate cross-matching and nearby searches. The design emphasizes partitioning for improved data volume management and scalability, with distributed partitioned views and custom partitioning strategies for optimal performance. Best practices for handling large datasets and enhancements in SQL Server 2005 are also discussed.
PS1 PSPS Object Data Manager Design PSPS Critical Design Review November 5-6, 2007 IfA
Detail Design • General Concepts • Distributed Database architecture • Ingest Workflow • Prototype
Zones (spatial partitioning and indexing algorithm) [Figure: the sky divided into zones of constant Declination (Dec), ordered by Right Ascension (RA) within each zone] • Partition and bin the data into declination zones • ZoneID = floor((dec + 90.0) / zoneHeight) • A few tricks are required to handle spherical geometry • Place the data close on disk • Clustered index on ZoneID and RA • Fully implemented in SQL • Efficient • Nearby searches • Cross-Match (especially) • Fundamental role in addressing the critical requirements • Data volume management • Association speed • Spatial capabilities
Zoned Table ZoneID = floor((dec + 90.0) / zoneHeight)* (*ZoneHeight = 8 arcsec in this example)
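As a concrete illustration of the zoned layout, here is a minimal T-SQL sketch of a zoned table with the clustered index on (ZoneID, RA); the table and column names (zonedObj, etc.) are illustrative only, not the actual ODM schema.

-- Illustrative zoned table; names are hypothetical, not the ODM schema.
-- zoneHeight = 8 arcsec = 8.0/3600.0 degrees in this example.
CREATE TABLE zonedObj (
    objID   bigint NOT NULL,
    ra      float  NOT NULL,  -- degrees
    dec     float  NOT NULL,  -- degrees
    cx      float  NOT NULL,  -- unit vector on the sphere, for exact distance tests
    cy      float  NOT NULL,
    cz      float  NOT NULL,
    zoneID  int    NOT NULL   -- floor((dec + 90.0) / zoneHeight)
);

-- Cluster on (zoneID, ra) so that objects close on the sky are also close on disk.
CREATE CLUSTERED INDEX cx_zonedObj_zone_ra ON zonedObj (zoneID, ra);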
SQL CrossNeighbors
SELECT *
FROM prObj1 z1
JOIN zoneZone ZZ ON ZZ.zoneID1 = z1.zoneID
JOIN prObj2 z2 ON ZZ.zoneID2 = z2.zoneID
WHERE z2.ra BETWEEN z1.ra - ZZ.alpha AND z1.ra + ZZ.alpha
AND z2.dec BETWEEN z1.dec - @r AND z1.dec + @r
AND (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))
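The query above assumes a zoneZone helper table that pairs each zone with its neighboring zones and carries a per-pair RA margin alpha (the search radius inflated by 1/cos(dec), since RA degrees shrink toward the poles). The following is a minimal sketch of how such a table could be built; @r, @zoneHeight, the zonedObj table from the previous sketch, and the #zone work table are all illustrative assumptions, not the ODM's actual code.

DECLARE @r float, @zoneHeight float;
SET @zoneHeight = 8.0 / 3600.0;   -- 8 arcsec zones, as in the example above
SET @r = 2.0 / 3600.0;            -- 2 arcsec search radius (illustrative), in degrees

-- Per-zone declination range and a conservative RA margin alpha for radius @r
-- (alpha grows as 1/cos(dec); capped near the poles to avoid division by zero).
SELECT b.zoneID, b.decMin, b.decMax,
       @r / COS(RADIANS(CASE WHEN m.maxAbsDec > 89.9
                             THEN 89.9 ELSE m.maxAbsDec END)) AS alpha
INTO   #zone
FROM  (SELECT zoneID,
              zoneID * @zoneHeight - 90.0       AS decMin,
              (zoneID + 1) * @zoneHeight - 90.0 AS decMax
       FROM (SELECT DISTINCT zoneID FROM zonedObj) AS z) AS b
CROSS APPLY (SELECT CASE WHEN ABS(b.decMin) > ABS(b.decMax)
                         THEN ABS(b.decMin) ELSE ABS(b.decMax) END AS maxAbsDec) AS m;

-- Zone pairs whose declination ranges lie within @r of each other,
-- keeping the larger of the two per-zone margins as the pair's alpha.
SELECT z1.zoneID AS zoneID1,
       z2.zoneID AS zoneID2,
       CASE WHEN z1.alpha > z2.alpha THEN z1.alpha ELSE z2.alpha END AS alpha
INTO   zoneZone
FROM   #zone AS z1
JOIN   #zone AS z2
  ON   z2.decMax >= z1.decMin - @r
 AND   z2.decMin <= z1.decMax + @r;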
Partitions • SQL Server 2005 introduces technology to handle tables that are partitioned across different disk volumes and managed by a single server. • Partitioning makes management and access of large tables and indexes more efficient • Enables parallel I/O • Reduces the amount of data that needs to be accessed • Related tables can be aligned and collocated in the same place, speeding up JOINs
Partitions • 2 key elements (sketched below) • Partition function • Specifies how the table or index is partitioned • Partition scheme • Using a partition function, the scheme specifies the placement of the partitions on file groups • Data can be managed very efficiently using partition switching • Add a table as a partition to an existing table • Switch a partition from one partitioned table to another • Reassign a partition to form a single table • Main requirement • The table must be constrained on the partitioning column
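A minimal T-SQL sketch of these two elements plus a partition switch. All names (pfObjID, psObjID, the file groups fg1..fg4, the ObjectID boundaries, Objects_p, Objects_staging) are hypothetical stand-ins, not the actual PS1 layout described later.

-- Hypothetical partition function: ranges of ObjectID (i.e. declination ranges).
CREATE PARTITION FUNCTION pfObjID (bigint)
AS RANGE LEFT FOR VALUES (10000000000000000, 20000000000000000, 30000000000000000);

-- Partition scheme: places each of the four ranges on a file group (fg1..fg4 must exist).
CREATE PARTITION SCHEME psObjID
AS PARTITION pfObjID TO (fg1, fg2, fg3, fg4);

-- A table partitioned on the scheme; the partitioning column is constrained
-- through the primary key, which is required for partition switching.
CREATE TABLE Objects_p (
    objID bigint NOT NULL PRIMARY KEY,
    ra    float  NOT NULL,
    dec   float  NOT NULL
) ON psObjID (objID);

-- Partition switching is a metadata-only operation, e.g. to swap a freshly
-- loaded staging table (same structure, same file group, matching CHECK
-- constraint on objID) into partition 2:
-- ALTER TABLE Objects_staging SWITCH TO Objects_p PARTITION 2;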
Partitions • For the PS1 design, • Partitions mean File Group Partitions • Tables are partitioned into ranges of ObjectID, which correspond to declination ranges. • ObjectID boundaries are selected so that each partition has a similar number of objects.
Distributed Partitioned Views • Tables participating in a Distributed Partitioned View (DPV) reside in different databases, which in turn reside on different instances or different (linked) servers (see the sketch below)
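A minimal sketch of such a view; the linked-server and database names (Slice01, PS1_p1, etc.) are hypothetical. Each member table must carry a CHECK constraint on its ObjectID range so the optimizer can route a query to the right slice.

-- Distributed partitioned view gluing per-slice detection tables together.
CREATE VIEW Detections
AS
SELECT * FROM Slice01.PS1_p1.dbo.Detections_p1
UNION ALL
SELECT * FROM Slice02.PS1_p2.dbo.Detections_p2
UNION ALL
SELECT * FROM Slice03.PS1_pm.dbo.Detections_pm;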
Concept: Slices • In the PS1 design, the bigger tables will be partitioned across servers • To avoid confusion with the File Group Partitioning, we call them “Slices” • Data is glued together using Distributed Partitioned Views • The ODM will manage slices. Using slices improves system scalability. • For PS1 design, tables are sliced into ranges of ObjectID, which correspond to broad declination ranges. Each slice is subdivided into partitions that correspond to narrower declination ranges. • ObjectID boundaries are selected so that each slice has a similar number of objects.
Detail Design Outline • General Concepts • Distributed Database architecture • Ingest Workflow • Prototype
PS1 Distributed DB system [Architecture diagram] Load servers Load Support1 … Load Supportn hold the staging tables (objZoneIndx, Detections_l, LnkToObj_l, Orphans_l) plus the incoming detections; a LoadAdmin database holds the PartitionsMap. Linked slice servers P1 … Pm each hold [Objects_p], [LnkToObj_p], [Detections_p], and Meta. The PS1 main database holds PartitionsMap, Objects, LnkToObj, Meta, and the Detections partitioned view, and is accessed through the Query Manager (QM) and the Web Based Interface (WBI). Legend: database, full table, [partitioned table], output table, partitioned view.
Design Decisions: ObjID • Objects have their positional information encoded in their objID • fGetPanObjID (ra, dec, zoneH) • ZoneID is the most significant part of the ID • It gives scalability, performance, and spatial functionality • Object tables are range partitioned according to their object ID
ObjectID Clusters Data Spatially Dec = –16.71611583 ZH = 0.008333 ZID = (Dec+90) / ZH = 08794.0661 ObjectID = 087941012871550661 RA = 101.287155 ObjectID is unique when objects are separated by >0.0043 arcsec
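A sketch of a function that reproduces the encoding in this example (zoneID in the leading digits, then RA in microdegrees, then the four-digit offset inside the zone). The name fGetPanObjID_sketch and the digit layout are inferred from the example above; the actual ODM function may differ.

CREATE FUNCTION dbo.fGetPanObjID_sketch (@ra float, @dec float, @zoneH float)
RETURNS bigint
AS
BEGIN
    DECLARE @zid float, @zone bigint, @frac bigint;
    SET @zid  = (@dec + 90.0) / @zoneH;               -- e.g. 8794.0661
    SET @zone = FLOOR(@zid);                          -- most significant digits
    SET @frac = ROUND((@zid - @zone) * 10000.0, 0);   -- 4-digit offset within the zone
    -- layout: zone (5 digits) | RA in microdegrees (9 digits) | offset (4 digits)
    RETURN @zone * CAST(10000000000000 AS bigint)     -- 10^13
         + CAST(ROUND(@ra * 1000000.0, 0) AS bigint) * 10000
         + @frac;
END;

Called with (101.287155, -16.71611583, 30.0/3600.0), this returns 87941012871550661, matching the ObjectID shown above up to the leading zero.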
Design Decisions: DetectID • Detections have their positional information encoded in the detection identifier • fGetDetectID (dec, observationID, runningID, zoneH) • Primary key (objID, detectionID), to align detections with objects within partitions • Provides efficient access to all detections associated to one object • Provides efficient access to all detections of nearby objects
DetectionID Clusters Data in Zones Dec = –16.71611583 ZH = 0.008333 ZID = (Dec+90) / ZH = 08794.0661 DetectID = 0879410500001234567 ObservationID = 1050000 Running ID = 1234567
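Similarly, a sketch consistent with this DetectID example (zoneID, then a 7-digit observationID, then a 7-digit running number). fGetDetectID_sketch and the digit layout are inferred from the example; the actual ODM function may differ.

CREATE FUNCTION dbo.fGetDetectID_sketch
    (@dec float, @observationID bigint, @runningID bigint, @zoneH float)
RETURNS bigint
AS
BEGIN
    DECLARE @zone bigint;
    SET @zone = FLOOR((@dec + 90.0) / @zoneH);          -- e.g. 8794
    -- layout: zone (5 digits) | observationID (7 digits) | runningID (7 digits)
    RETURN @zone * CAST(100000000000000 AS bigint)      -- 10^14
         + @observationID * 10000000                    -- 10^7
         + @runningID;
END;

With Dec = -16.71611583, observationID = 1050000, runningID = 1234567, and zoneH = 30.0/3600.0, this returns 879410500001234567, matching the example.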
ODM Capacity 5.3.1.3 The PS1 ODM shall be able to ingest into the ODM a total of • 1.5 × 10^11 P2 detections • 8.3 × 10^10 cumulative sky (stack) detections • 5.5 × 10^9 celestial objects together with their linkages.
PS1 Table Sizes - Monolithic Sizes are in TB
What goes into the main Server [Diagram] The PS1 main database holds the full tables PartitionsMap, Objects, LnkToObj, and Meta; the slice servers P1 … Pm are reached through linked servers. Legend: database, full table, [partitioned table], output table, distributed partitioned view.
What goes into slices [Diagram] Each slice server P1 … Pm holds the partitioned tables [Objects_p], [LnkToObj_p], and [Detections_p] together with local copies of PartitionsMap and Meta; the PS1 main database keeps PartitionsMap, Objects, LnkToObj, and Meta. Legend: database, full table, [partitioned table], output table, distributed partitioned view.
Duplication of Objects & LnkToObj • Objects are distributed across slices • Objects, P2ToObj, and StackToObj are duplicated in the slices to parallelize “inserts” & “updates” • Detections belong in their object’s slice • Orphans belong to the slice where their position would place them • Orphans near slice boundaries will need special treatment • Objects keep their original object identifier • Even though positional refinement might change their zoneID, and therefore the most significant part of their identifier
Glue = Distributed Views [Diagram] The Detections distributed partitioned view in the PS1 main database glues together the [Detections_p] tables held on the slice servers P1 … Pm; each slice also holds [Objects_p], [LnkToObj_p], PartitionsMap, and Meta. Legend: database, full table, [partitioned table], output table, distributed partitioned view.
Partitioning in Main Server • Main server is partitioned (Objects) and collocated (LnkToObj) by objID • Slices are partitioned (Objects) and collocated (LnkToObj) by objID [Diagram: linked slice servers P1 … Pm behind the PS1 database, Query Manager (QM), and Web Based Interface (WBI)]
PS1 Table Sizes - Main Server Sizes are in TB
PS1 Table Sizes - Each Slice Sizes are in TB
PS1 Table Sizes - All Servers Sizes are in TB
Detail Design Outline • General Concepts • Distributed Database architecture • Ingest Workflow • Prototype
PS1 Distributed DB system [Architecture diagram, as before] Load servers Load Support1 … Load Supportn (objZoneIndx, Detections_l, LnkToObj_l, Orphans_l, incoming detections) and LoadAdmin (PartitionsMap), the linked slice servers P1 … Pm ([Objects_p], [LnkToObj_p], [Detections_p], PartitionsMap, Meta), and the PS1 main database (PartitionsMap, Objects, LnkToObj, Meta, Detections partitioned view), accessed through the Query Manager (QM) and Web Based Interface (WBI). Legend: database, full table, [partitioned table], output table, partitioned view.
“Insert” & “Update” • SQL Insert and Update are expensive operations due to logging and re-indexing • In the PS1 design, Insert and Update have been re-factored into sequences of: Merge + Constrain + Switch Partition (see the sketch below) • Frequency • f1: daily • f2: at least monthly • f3: TBD (likely to be every 6 months)
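A minimal T-SQL sketch of one Merge + Constrain + Switch cycle for a single partition of a detections table, using the hypothetical pfObjID partition function from the earlier sketch; the table names, boundary values, and primary-key layout are illustrative, not the ODM's actual ingest code. The batch-update diagram a few slides below shows the same idea applied to several partitions at once.

-- 1. Merge: combine the rows currently in partition 2 with the newly loaded batch.
SELECT *
INTO   Detections_B2
FROM  (SELECT * FROM Detections_p
        WHERE $PARTITION.pfObjID(objID) = 2
       UNION ALL
       SELECT * FROM Detections_newBatch
        WHERE $PARTITION.pfObjID(objID) = 2) AS merged;

-- 2. Constrain: add the primary key and the CHECK constraint on the partitioning
--    column so Detections_B2 matches partition 2 exactly (it must also live on
--    the same file group as partition 2 for the switch to succeed).
ALTER TABLE Detections_B2
  ADD CONSTRAINT pk_Detections_B2 PRIMARY KEY (objID, detectionID);
ALTER TABLE Detections_B2
  ADD CONSTRAINT ck_Detections_B2_range
  CHECK (objID > 10000000000000000 AND objID <= 20000000000000000);

-- 3. Switch: metadata-only swaps; no row-by-row inserts, updates, or logging.
ALTER TABLE Detections_p  SWITCH PARTITION 2 TO Detections_old;  -- empty table, same structure
ALTER TABLE Detections_B2 SWITCH TO Detections_p PARTITION 2;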
Ingest Workflow [Workflow diagram with stages: CSV, DZone, cross-match X(1”), DXO_1a, NoMatch, X(2”), DXO_2a, Resolve, Detect, and the outputs ObjectsZ, P2PsfFits, P2ToObj, Orphans]
Ingest @ frequency = f1 [Data-flow diagram] Numbered steps connect the LOADER tables (ObjectsZ, P2ToObj, P2PsfFits, Orphans, Metadata+) to SLICE_1 (Objects_1, P2ToObj_1, P2ToPsfFits_1, Stack*_1, Orphans_1) and to MAIN (Objects, P2ToObj, StackToObj).
Updates @ frequency = f2 [Data-flow diagram, two steps] Numbered steps propagate the updated Objects between the LOADER, SLICE_1 (Objects_1, P2ToObj_1, P2ToPsfFits_1, Stack*_1, Orphans_1), and MAIN (Objects, P2ToObj, StackToObj, Metadata+).
Snapshots @ frequency = f3 [Diagram] Objects, P2ToObj, StackToObj, and Metadata+ on MAIN are copied (numbered steps 1-3) into a Snapshot.
Batch Update of a Partition [Diagram] Existing partitions A1, A2, A3, … are merged with the new batch via SELECT INTO; SELECT INTO … WHERE then splits the merged data into B1, B2, B3, each built with its primary-key index; finally each Bi is switched in to replace the corresponding partition.
Scaling-out • Apply the Ping-Pong strategy to preserve query performance during ingest: 2 x (1 main + m slices) [Diagram: two copies of the PS1 main database (PartitionsMap, Objects, LnkToObj, Meta, Detections partitioned view) and of the slice servers P1 … Pm ([Objects_p], [LnkToObj_p], [Detections_p], Meta), behind the Query Manager (QM); legend marks duplicate full tables and duplicate partitioned views]
Scaling-out • More robustness, fault tolerance, and reliability call for 3 x (1 main + m slices) [Diagram: as above, with a third copy of the main database and slice servers]
Adding New slices SQL Server range-partitioning capabilities make it easy • Recalculate partitioning limits • Transfer data to new slices • Remove data from old slices • Define and apply a new partition scheme • Add new partitions to the main server • Apply the new partition scheme to the main server
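For the main server, the last two steps map onto standard range-partitioning DDL. A sketch using the hypothetical pfObjID / psObjID names from the earlier partitioning sketch; the boundary values and file group are illustrative.

-- Tell the partition scheme which file group the next new partition will use,
-- then split an existing ObjectID range at the new boundary.
ALTER PARTITION SCHEME psObjID NEXT USED fg5;
ALTER PARTITION FUNCTION pfObjID() SPLIT RANGE (25000000000000000);

-- A range whose data has moved to another slice can later be merged away:
-- ALTER PARTITION FUNCTION pfObjID() MERGE RANGE (20000000000000000);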
Detail Design Outline • General Concepts • Distributed Database architecture • Ingest Workflow • Prototype
ODM Ingest Performance 5.3.1.6 The PS1 ODM shall be able to ingest the data from the IPP at two times the nominal daily arrival rate* * The nominal daily data rate from the IPP is defined as the total data volume to be ingested annually by the ODM divided by 365. • Nominal daily data rate: • 1.5 × 10^11 / 3.5 / 365 = 1.2 × 10^8 P2 detections / day • 8.3 × 10^10 / 3.5 / 365 = 6.5 × 10^7 stack detections / day
Number of Objects * “SDSS” includes a mirror of 11.3 < δ < 30 objects to δ < 0 • Total CSV data loaded: 300 GB • CSV bulk insert load: 8 MB/s • Binary bulk insert: 18-20 MB/s • Creation started: October 15th 2007 • Finished: October 29th 2007 (??) • Includes • 10 epochs of P2PsfFits detections • 1 epoch of Stack detections
Size of Prototype Database Table sizes are in billions of rows
Size of Prototype Database Table sizes are in GB; 9.6 TB of data in the distributed database
Ingest and Association Times [Table of estimated times not reproduced; columns labeled Educated Guess and Wild Guess]