The Hadoop RDBMS Replace Oracle with Hadoop John Leach CTO and Co-Founder J


Presentation Transcript


  1. The Hadoop RDBMS Replace Oracle with Hadoop John Leach CTO and Co-Founder J

  2. Who we are: The Hadoop RDBMS
     • Standard ANSI SQL
     • Horizontal Scale-Out
     • Real-Time Updates
     • ACID Transactions
     • Powers OLAP and OLTP
     • Seamless BI Integration
     Splice Machine Proprietary and Confidential

  3. Serialization and write pipelining
     • Serialization Goals
       • Disk Usage Parity with Data Supplied
       • Predicate evaluation uses byte[] comparisons (sorted)
       • Memory and CPU efficient (fast)
       • Lazy Serialization and Deserialization
     • Write Pipelining Goals
       • Non-blocking Writes
       • Transactional Awareness
       • Small Network Footprint
       • Handle Failure, Location, and Retry Semantics
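
The "byte[] comparisons (sorted)" goal means that comparing encoded bytes should give the same answer as comparing the original values. The slides do not give the actual scalar format (it is variable length, 1-9 bytes), so the following is only a minimal fixed-width sketch of the general idea for signed 64-bit integers. The function name `encode_sortable_int64` is hypothetical.

```python
import struct


def encode_sortable_int64(n: int) -> bytes:
    """Encode a signed 64-bit integer so that bytewise order equals numeric order.

    Assumption: this is a generic order-preserving trick (shift the value by
    2**63, then write it big-endian), not the encoding used in the slides.
    """
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit integer")
    # Shifting by 2**63 maps the most negative value to 0 and the most
    # positive value to 2**64 - 1, so unsigned big-endian bytes sort correctly.
    return struct.pack(">Q", n + (1 << 63))


def matches_less_than(encoded_value: bytes, encoded_bound: bytes) -> bool:
    """Evaluate `value < bound` without decoding either side."""
    return encoded_value < encoded_bound
```

With an encoding like this, a predicate such as `a < 10` can be evaluated directly on the stored bytes, which is what makes lazy deserialization worthwhile.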

  4. Single Column Encoding
     • All Columns encoded in a single cell
       • separated by 0x00 byte
     • Nulls are encoded either as “explicit null” or as an absent field
     • Cell value prefixed by an Index containing
       • which fields are present in cell
       • whether the field is
         • Scalar (1-9 Bytes)
         • Float (4 Bytes)
         • Double (8 Bytes)
         • Other (1 – N Bytes)

  5. Example Insert
     • Table Schema: (a int, b string)
     • Insert row (1,‘bob’):
       • All columns packed together
         • 1 0x00 ‘bob’
       • Index prepended
         • {1(s),2(o)} 0x00 1 0x00 ‘bob’

  6. Example Insert w/ nulls
     • Row (1,null)
     • nulls left absent
       • 1
     • Index prepended (field B is not present)
       • {1(s)} 0x00 1

  7. Example: Update
     • Row already present: {1(s),2(o)}
     • set a = 2
     • Pack entry
       • 2
     • prepend index (field B is not present)
       • {1(s)} 0x00 2
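
The three examples above (insert, insert with a null, update) can be reproduced with a small sketch. It assumes a readable text form of the index (`{1(s),2(o)}`) and ASCII/UTF-8 values purely for illustration; the real index is bit-packed and the real value encodings are not given in the slides. The names `encode_row`, `encode_value`, and `type_tag` are hypothetical.

```python
SEP = b"\x00"  # field separator, as stated on the slide


def type_tag(value) -> str:
    """Return 's' (scalar) for ints and 'o' (other) for everything else."""
    return "s" if isinstance(value, int) else "o"


def encode_value(value) -> bytes:
    """Toy value encoding (assumption): decimal ASCII for ints, UTF-8 for text.

    A real encoding must also guarantee that values never contain 0x00.
    """
    if isinstance(value, int):
        return str(value).encode("ascii")
    if isinstance(value, bytes):
        return value
    return str(value).encode("utf-8")


def encode_row(fields: dict) -> bytes:
    """Pack the non-null fields of a row into one cell, index first.

    `fields` maps 1-based column positions to values. Columns that are
    missing or None are left absent from both the index and the payload.
    """
    present = {pos: v for pos, v in sorted(fields.items()) if v is not None}
    index = "{" + ",".join(f"{pos}({type_tag(v)})" for pos, v in present.items()) + "}"
    parts = [index.encode("ascii")]
    parts.extend(encode_value(v) for v in present.values())
    return SEP.join(parts)
```

Under these assumptions, `encode_row({1: 1, 2: "bob"})` yields `{1(s),2(o)}`, `1`, and `bob` separated by 0x00 bytes, and an update that sets only column a is simply `encode_row({1: 2})`.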

  8. Decoding
     • Indexes are cached
       • Most data looks like its predecessor
     • Values are read in reverse timestamp order
       • Updates before inserts
     • Seek through bytes for fields of interest
     • Once a field is populated, ignore all other values for that field.

  9. Example Decoding
     • Start with (NULL,NULL)
     • 2 KeyValues present:
       • {1(s)} 0x00 2
       • {1(s),2(o)} 0x00 1 0x00 ‘bob’
     • Read first KeyValue, fill field 1
       • Row: (2,NULL)
     • Read second KeyValue, skip field 1 (already filled), fill field 2:
       • Row: (2,‘bob’)
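
The decoding walk-through above can be sketched as a newest-first merge. As an assumption for readability, cells use a text index such as `{1(s),2(o)}` followed by 0x00-separated values, with ints stored as decimal ASCII; the real format is binary. The helper names `parse_cell`, `decode_value`, and `decode_row` are hypothetical, and explicit nulls are not modeled.

```python
import re

SEP = b"\x00"
INDEX_ENTRY = re.compile(r"(\d+)\((\w)\)")  # matches entries like 1(s) or 2(o)


def parse_cell(cell: bytes):
    """Split a cell into a list of (position, type_tag, raw_value) triples."""
    index, *values = cell.split(SEP)
    entries = INDEX_ENTRY.findall(index.decode("ascii"))
    return [(int(pos), tag, raw) for (pos, tag), raw in zip(entries, values)]


def decode_value(tag: str, raw: bytes):
    """Toy decoding (assumption): 's' is a decimal int, anything else is text."""
    return int(raw) if tag == "s" else raw.decode("utf-8")


def decode_row(cells_newest_first, num_columns: int) -> tuple:
    """Merge cells in reverse timestamp order; the first value seen per field wins."""
    row = [None] * num_columns
    filled = [False] * num_columns
    for cell in cells_newest_first:
        for pos, tag, raw in parse_cell(cell):
            if not filled[pos - 1]:  # older values for a filled field are ignored
                row[pos - 1] = decode_value(tag, raw)
                filled[pos - 1] = True
        if all(filled):
            break  # every field is populated, no need to read older cells
    return tuple(row)
```

Calling `decode_row` with the update cell first and the insert cell second reproduces the slide's result: field 1 comes from the update, field 2 from the original insert.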

  10. Index Decoding
      • Index encoded differently depending on number of columns present and type
        • Uncompressed: 1 bit for present, 2 bits for type
        • Compressed: Run-length encoded (field 1-3, scalar, 5-8 double…)
        • Sparse: Delta encoded (index,type) pairs
        • Sparse compressed: Run-length encoded (index,type) pairs
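
The uncompressed variant (1 bit for presence, 2 bits for type) can be sketched as follows. The slides do not specify the bit layout, so this assumes a fixed 3 bits per column, packed most significant first, with type codes chosen arbitrarily. `pack_index`, `unpack_index`, and the code table are hypothetical.

```python
# Assumed 2-bit type codes; the four types come from the Single Column Encoding slide.
TYPE_CODES = {"scalar": 0, "float": 1, "double": 2, "other": 3}
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}


def pack_index(columns) -> bytes:
    """Pack one entry per column: a type name if present, None if absent."""
    bits = 0
    for col_type in columns:
        present = col_type is not None
        code = TYPE_CODES[col_type] if present else 0
        # 3 bits per column: [present][type high][type low]
        bits = (bits << 3) | (int(present) << 2) | code
    num_bytes = (3 * len(columns) + 7) // 8
    return bits.to_bytes(num_bytes, "big")


def unpack_index(data: bytes, num_columns: int) -> list:
    """Inverse of pack_index; the column count must be known from the schema."""
    bits = int.from_bytes(data, "big")
    columns = []
    for i in range(num_columns):
        shift = 3 * (num_columns - 1 - i)
        group = (bits >> shift) & 0b111
        columns.append(TYPE_NAMES[group & 0b11] if group & 0b100 else None)
    return columns
```

For a two-column row such as (a int, b string) the whole index fits in one byte, which suggests why the compressed and sparse variants only pay off for wide or mostly empty rows.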

  11. Write Pipeline
      • Asynchronous but guaranteed delivery
      • Operate in Bulk
        • Row or Size bounded
      • Highly Configurable
      • Utilizes Cached Region Locations
      • Server component modeled after Java’s NIO
        • Attach Handlers for different RDBMS features
        • Handle retries, failure, and SQL semantics
          • Wrong Region, Region Too Busy, Primary Key Violation, Unique Constraint Violation

  12. Write Pipeline Base Element
      • Rows are encoded into custom KVPairs
        • all rows for a family and column are grouped together
        • <byte[],byte[]>
      • Exploded into Put only to write to HBase
      • Timestamps added on server side
      • Supports Snappy compression

  13. Write Pipeline Client
      • Tree Based Buffer
        • Table -> Region -> N Buffers
        • Rows are buffered on client side in memory
        • N is configurable
      • When buffer fills
        • asynchronously write batch to Region
      • Handles HBase “difficulties” gracefully
        • Wrong Region
          • Re-bucket
        • Too Busy
          • Add delay and possibly back-off
        • etc.
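
A minimal sketch of the client-side buffering described above, under several assumptions: one buffer per (table, region) rather than N, a row-count bound only, and no retry handling for Wrong Region or Too Busy. The class `WriteBuffer` and the callbacks `locate_region` and `send_batch` are hypothetical stand-ins for the cached region lookup and the batch RPC.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class WriteBuffer:
    """Buffers rows per (table, region) and ships full buffers asynchronously."""

    def __init__(self, locate_region, send_batch, max_rows=1000, workers=2):
        self._locate_region = locate_region  # (table, key) -> region name
        self._send_batch = send_batch        # (table, region, rows) -> None
        self._max_rows = max_rows
        self._buffers = defaultdict(list)    # (table, region) -> pending rows
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._in_flight = []

    def write(self, table, key, value):
        """Buffer one row; never blocks on the network."""
        region = self._locate_region(table, key)
        buffer = self._buffers[(table, region)]
        buffer.append((key, value))
        if len(buffer) >= self._max_rows:
            self._flush(table, region)

    def _flush(self, table, region):
        rows = self._buffers.pop((table, region), [])
        if rows:
            future = self._pool.submit(self._send_batch, table, region, rows)
            self._in_flight.append(future)

    def close(self):
        """Flush partial buffers and wait, so delivery is confirmed or an error is raised."""
        for table, region in list(self._buffers):
            self._flush(table, region)
        for future in self._in_flight:
            future.result()
        self._pool.shutdown()
```

Waiting on every future in `close` is how this sketch approximates "asynchronous but guaranteed delivery": callers never block per row, yet a failed batch cannot go unnoticed.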

  14. Write Pipeline Server Side
      • Coprocessor based
      • Limited number of concurrent writes to a server
        • excess write requests are rejected
        • prevents IPC thread starvation
      • SQL Based Handlers for parallel writes
        • Indexes, Primary Key Constraints, Unique Constraints
      • Writes occur in a single WALEdit on each region
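
The "limited number of concurrent writes, excess rejected" policy can be illustrated with a non-blocking semaphore. This is a generic admission-control sketch, not the coprocessor code; `WriteGate` and `RegionTooBusy` are hypothetical names.

```python
import threading


class RegionTooBusy(Exception):
    """Signals the client to delay, back off, and retry."""


class WriteGate:
    """Admits at most `max_concurrent_writes` writes at a time and rejects the rest."""

    def __init__(self, max_concurrent_writes: int):
        self._permits = threading.BoundedSemaphore(max_concurrent_writes)

    def submit(self, write_fn, *args):
        # Non-blocking acquire: a rejected request returns immediately instead
        # of parking a server thread, so handler threads cannot all be starved.
        if not self._permits.acquire(blocking=False):
            raise RegionTooBusy("concurrent write limit reached")
        try:
            return write_fn(*args)
        finally:
            self._permits.release()
```

Rejecting rather than queueing pushes the waiting onto the client, which pairs with the "Too Busy: add delay and possibly back-off" handling on the client slide.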

  15. Interests
      • Other items we have done or are interested in…
        • Burstable Tries Implementation of Memstore
        • Pluggable Cost Based Genetic Algorithm for Assignment Manager
        • Columnar Representations and in-memory processing
        • Concurrent Bloom Filter (i.e. Thread Safe BitSet)
      • We are hiring
        • Just Completed $15M Series B Raise
        • careers@splicemachine.com
