Why we chose MongoDB To Put big-data ‘on the map’

March 2011 @nknize +Nicholas Knize

TST Products
“The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one location…this capability allows for unprecedented situational awareness and information sharing” - Gen. Doug Frasier
ACCOMPLISHING THE IMPOSSIBLE

iSpatial Overview
- Expose enterprise data in a geo-temporal, user-defined environment
- Provide a flexible and scalable spatial indexing framework for heterogeneous data
- Visualize spatially referenced data on a 3D globe and 2D maps
- Manage real-time data feeds and mobile messaging
- View data over geo-rectified imagery with 3D terrain
- Support mission planning and simulation
- Provide real-time collaboration and sharing

Why NoSQL?!? (CAVEATS)
- Engineering with constraints
- Use the right tool for the job
- Understand your needs!
- Unbounded engineering
- Relational is not always bad

Desired Data Store Characteristics for ‘Big Data’
- Horizontally scalable: large volume / elastic
- Vertically scalable: heterogeneous data types (“Data Stack”)
- Widely distributed: reduce the distance bits must travel
- Fault tolerant: replication strategy and consistency model
- High availability: node recovery
- Fast: reads or writes (can’t always have both)

RDBMS Strengths
- Battle tested, battle proven: the relational model dates back to 1969
- Plethora of relational experience: full-time DBAs, training and certs
- Company backed: safe choice for business / mission critical systems
- Fewer alternatives: non-relational is a 5-year-old know-it-all
- Mostly standardized: SQL (ISO/IEC 9075) is an accepted standard
- Theoretically sound: based on 100 years of first-order logic theory

Relational on ACID
- Atomicity: if one fails, we all fail!
- Consistency: all data constraints (normalized schema, cascades, triggers, etc.) must be met before a transaction succeeds. (LATENCY)
- Isolation: synchronization; no operation can see a transaction that hasn’t yet completed.
- Durability: once a transaction is committed it will remain committed, even through power loss, crashes, or other hardware errors.
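The atomicity and consistency bullets can be seen in a small sketch. It uses Python's built-in sqlite3 module as a stand-in for any relational engine; the accounts table, the CHECK constraint, and the transfer helper are made up for illustration and are not from the talk.

```python
import sqlite3

# In-memory database with a constraint that balances may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds in one transaction; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the earlier credit is undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts").fetchall())
```

An overdraft attempt such as transfer(conn, "alice", "bob", 500) credits bob, then fails on the debit, and the rollback removes the credit as well. That is the "if one fails, we all fail" behavior, paid for with the constraint check on every write.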

RDBMS Weaknesses
- Writes are accomplished using in-place updates on disk (crazy disk swapping rate)
- Table joins, updates, and large queries quickly outgrow the disk cache, requiring many random disk seeks (performance bottleneck!!)
- Strict consistency requirements impact scalability (e.g. Postgres uses Multiversion Concurrency Control, commonly resulting in stale data)
- As data centers grow, the probability of node failure (due to disk writes, consistency, and atomic operations) increases

Subset of Evaluated NoSQL Options
- Cassandra
  - Nice Bring Your Own Index (BYOI) design
  - … but Java, Java, Java… Memory management can be an issue
  - Adding new nodes can be a pain (token changes, nodetool)
  - Key-value store … good for simple data models
- HBase
  - Nice BigTable model
  - Theory grounded heavily in C.A.P., inflexible trade-offs
  - Complicated setup and maintenance
- CouchDB
  - Provides some geospatial functionality
  - HEAVILY dependent on the Map-Reduce model (complicated design)
  - Erlang based: poor multi-threaded heap management

Why MongoDB for Thermopylae?
- Documents based on JavaScript Object Notation (JSON): a GeoJSON match made in heaven!
- C++: no garbage collection overhead!
- Efficient memory management design reduces disk swapping and paging
- Disk storage is memory mapped, enabling fast swapping when necessary
- Built-in auto-failover with replica sets and fast recovery with journaling
- Tunable consistency: consistency is defined at the application layer
- Schema flexible: retains friendly properties of SQL while enabling ad-hoc queries
- Provided initial spatial indexing support: point based only!
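To illustrate why a JSON document store and GeoJSON fit together, here is a minimal sketch of a GeoJSON Feature built as a plain Python dict and round-tripped through the standard json module. The coordinates and property names are invented for illustration; no MongoDB driver calls are shown.

```python
import json

# A GeoJSON Feature is already a JSON document, so it can be stored as-is.
# All values below are hypothetical sample data.
checkin = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-77.0365, 38.8977],  # GeoJSON order is [lon, lat]
    },
    "properties": {
        "source": "example-feed",
        "name": "sample check-in",
        "timestamp": "2011-03-01T12:00:00Z",
    },
}

encoded = json.dumps(checkin)   # what would go over the wire
decoded = json.loads(encoded)   # what the application reads back
lon, lat = decoded["geometry"]["coordinates"]
```

Because the geometry and the attribute data live in one document, heterogeneous feeds can add or omit properties without a schema migration.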

... The Spatial Indexer wasn’t quite right
- MongoDB (like nearly all relational DBs) uses a B-tree data structure, which keeps data sorted and supports lookups in logarithmic time
- Great for indexing numerical and text documents (attribute data)
- Cannot store multi-dimensional data: NOT GEOMETRY FRIENDLY

How does MongoDB solve the dimensionality problem?
- Space-filling curve: a continuous line that passes through every point in a two-dimensional plane
- Use Geohash to represent lat/lon values
  - Interleave the bits of a lat/lon pair
  - Base32 encode the result
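The two Geohash steps (interleave the bits, then Base32 encode) can be sketched in a few lines. This is the public Geohash scheme written out for illustration, not MongoDB's internal index code; the function name and the default precision are choices made here.

```python
# Geohash Base32 alphabet (digits plus lowercase letters without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a lat/lon pair as a Geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon

    # Pack every 5 interleaved bits into one Base32 character.
    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)
```

Each extra character refines the cell, so a shorter hash is always a prefix of a longer one for the same point. That prefix property is what lets a one-dimensional B-tree answer some two-dimensional range questions.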

Issues with the Geohash over B-tree approach
- Neighbors aren’t so close!
  - Neighboring points on the geoid may end up on opposite ends of the plane
  - Impacts search efficiency
- What about geometry?
  - Doesn’t support > 2D
  - Mongo uses multi-location documents, which really just index multiple points that link back to a single document
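The "neighbors aren't so close" problem shows up even on a tiny grid. The sketch below interleaves the bits of integer cell coordinates (a Z-order code, the same idea Geohash applies to lat/lon); the 8x8 grid and the morton helper are assumptions chosen to keep the numbers small.

```python
def morton(x, y, bits):
    """Interleave the bits of x and y (x first) into a single Z-order code."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((x >> i) & 1)
        code = (code << 1) | ((y >> i) & 1)
    return code


# On an 8x8 grid (3 bits per axis), cells (3, 3) and (4, 3) touch each other
# but sit on either side of the grid's vertical midline.
left = morton(3, 3, 3)    # 0b001111
right = morton(4, 3, 3)   # 0b100101
gap = right - left        # index distance between two adjacent cells
```

Two adjacent cells end up 22 positions apart on a 64-cell curve, so a range scan around one of them in the B-tree will not find the other without extra probes.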

Potential Solutions
- Constrain the system to single point searches
  - Multi-dimension support will be exponentially complex (won’t scale)
- Interpolate points along the edge of the shape
  - Multi-dimension support will be exponentially complex (won’t scale)
- Customize the spatial indexer
  - Selected approach

Mongo Multi-location Document Clipping Issues ($within search doesn’t always work with multi-location)
[Figure: four cases (Case 1 through Case 4) of a search polygon overlaid on a multi-location document (a.k.a. polygon); two cases are marked "Success!" and two are marked "Fail!"]
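One way such a miss can happen is sketched below, assuming the multi-location document is matched only when at least one of its stored points falls inside the search polygon. The geometry is invented for illustration and does not reproduce the four cases in the original figure; point_in_polygon is a standard ray-casting test.

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: True if pt lies strictly inside the polygon."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def any_vertex_within(doc_points, search_poly):
    """Point-only matching: hit if any stored point is inside the search area."""
    return any(point_in_polygon(p, search_poly) for p in doc_points)


search = [(0, 0), (4, 0), (4, 4), (0, 4)]

# Document A has a corner inside the search square, so it is found.
doc_a = [(1, 1), (6, 1), (6, 6), (1, 6)]

# Document B is a band that crosses the search square, but every one of its
# corners lies outside, so point-only matching misses it.
doc_b = [(-1, 1), (5, 1), (5, 3), (-1, 3)]

found_a = any_vertex_within(doc_a, search)
found_b = any_vertex_within(doc_b, search)
```

Document B clearly overlaps the search area, yet it is never returned. An index that stores the shape's bounding rectangle instead of its vertices avoids this class of miss.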

Thermopylae Custom Tuned MongoDB for Geo
- TST leverages Guttman’s 1984 research in R/R* trees
- R-trees organize data of any dimension by representing it as a minimum bounding box
- Each node bounds its children; a node can have many objects in it (max: m, min: ceil(m/2))
- Inserts and merges are optimized by minimizing overlaps
- The leaves point to the actual objects (stored on disk, probably)
- Height balanced: search is always O(log n)
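The "each node bounds its children" idea, and the search that follows from it, can be sketched with a hand-built two-level tree. This is a generic R-tree illustration, not TST's indexer: the Node class and helper names are assumptions, and insertion and node splitting are left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def union(boxes: List[Box]) -> Box:
    """Smallest box that contains every box in the list."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    mbr: Box
    children: List["Node"] = field(default_factory=list)          # inner node
    objects: List[Tuple[str, Box]] = field(default_factory=list)  # leaf node


def make_leaf(objects):
    return Node(mbr=union([box for _, box in objects]), objects=list(objects))


def make_parent(children):
    return Node(mbr=union([c.mbr for c in children]), children=list(children))


def search(node: Node, query: Box) -> List[str]:
    """Descend only into subtrees whose MBR intersects the query box."""
    if not intersects(node.mbr, query):
        return []
    hits = [oid for oid, box in node.objects if intersects(box, query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits


left = make_leaf([("a", (0, 0, 1, 1)), ("b", (2, 2, 3, 3))])
right = make_leaf([("c", (10, 10, 11, 11)), ("d", (12, 12, 13, 13))])
root = make_parent([left, right])
```

With this tree, search(root, (2.5, 2.5, 10.5, 10.5)) returns "b" and "c". Whole subtrees are skipped whenever their bounding box misses the query, which is why keeping sibling boxes from overlapping matters so much for search cost.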

Spatial Indexing at Scale with R-Trees
- Spatial data is represented as minimum bounding rectangles (MBRs) in n dimensions
- Each index entry is represented as <I, tuple>, where:
  - I = (I0, I1, … In), with n = number of dimensions
  - Each I is a set of the form [min, max] describing the MBR range along one dimension
  - The tuple identifier includes a key that contains a data-center and server identifier
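A minimal sketch of the <I, tuple> entry as a data structure follows. The class and field names (TupleId, data_center, server, doc_key) are hypothetical; the talk only says the identifier carries a data-center and server identifier.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[float, float]  # [min, max] along one dimension


@dataclass(frozen=True)
class TupleId:
    """Hypothetical locator for the indexed document."""
    data_center: str
    server: str
    doc_key: str


@dataclass(frozen=True)
class IndexEntry:
    intervals: Tuple[Interval, ...]  # I: one [min, max] per dimension
    tuple_id: TupleId

    def overlaps(self, query: Tuple[Interval, ...]) -> bool:
        """True if this MBR overlaps the query box in every dimension."""
        return all(lo <= q_hi and q_lo <= hi
                   for (lo, hi), (q_lo, q_hi) in zip(self.intervals, query))


# A three-dimensional entry: longitude, latitude, and a time range (sample values).
entry = IndexEntry(
    intervals=((-77.1, -77.0), (38.8, 38.9), (1298937600.0, 1299024000.0)),
    tuple_id=TupleId(data_center="dc-east", server="shard-02", doc_key="doc-123"),
)
```

Because the overlap test just loops over intervals, the same entry type works for 2D footprints, 3D volumes, or space plus time, and the tuple identifier tells the caller where to fetch the document.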

Spatial Index Example
[Figure: sample insertion result for a 4th degree tree, with nodes labeled a through p]
Objective: Minimize overlaps

T-Sciences Custom Tuned Spatial Indexer
- Optimized spatial search: finds intersecting MBRs and recurses into those nodes
- Optimized spatial inserts: uses the Hilbert value of the MBR centroid to guide the search
  - 28% reduction in number of nodes touched
- Optimized deletes: leverages the R* split approach for rebalancing the tree when nodes become underfull
- Low maintenance: leverages MongoDB’s automatic data compaction and partitioning
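Computing a Hilbert value for an MBR centroid can be sketched as follows. This uses the standard iterative Hilbert-curve mapping on a square grid; the grid resolution, the world extent, and the function names are assumptions made here, and the talk does not describe how TST's indexer quantizes coordinates.

```python
def hilbert_index(order, x, y):
    """Position of grid cell (x, y) along a Hilbert curve on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def centroid_hilbert(mbr, world=(-180.0, -90.0, 180.0, 90.0), order=16):
    """Hilbert value of an MBR's centroid (assumes the centroid is inside `world`)."""
    min_x, min_y, max_x, max_y = mbr
    cx = (min_x + max_x) / 2
    cy = (min_y + max_y) / 2
    n = 1 << order
    # Quantize the centroid onto the grid, clamping the upper edge.
    gx = min(n - 1, int((cx - world[0]) / (world[2] - world[0]) * n))
    gy = min(n - 1, int((cy - world[1]) / (world[3] - world[1]) * n))
    return hilbert_index(order, gx, gy)
```

Unlike the bit-interleaved ordering, consecutive Hilbert positions are always adjacent grid cells, so ordering inserts by this value tends to place nearby rectangles in the same or neighboring nodes.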

Example Use Case: OSINT (Foursquare Data)
- Sample Foursquare data set mashed with government intel data
- 1 million geo document test (points and polys)
- 4 server replica set
- ~350 ms query response
- ~300% improvement over PostGIS

Community Support
- Thermopylae contributes fixes to the codebase: http://github.com/mongodb
- TST will work with 10gen to fold them into the baseline
- Active developer collaboration
- IRC: #mongodb on freenode.net

THANK YOU
Questions?
Nicholas Knize, nknize@t-sciences.com

Backup

Thermopylae Sciences & Technology: Who are we?
- Advanced technology with 160+ employees
- Core customers in national security, venues and events, military and police, and city planning
- Partnered with Google and imagery providers
- Long-term relationship focused; TS/SCI staff
- TST + 10gen + Google = game-changing approach

Key Customers - Government
- US Dept of State, Bureau of Diplomatic Security
  - Build and support a 30 TB Google Earth globe, with multiple terabytes of individual globes sent to embassies throughout the world
  - Integrated Google Earth and iSpatial framework
- US Army Intelligence Security Command
  - Provide expertise in managing technology integration; prime contractor providing operations, intelligence, and IT support worldwide
  - Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon
  - Integrated Google Earth and iSpatial framework
- US Southern Command
  - Coordinate intelligence management systems: spatial data collection, indexing, and distribution
  - Integrated Google Earth, iSpatial, and iHarvest
  - Index large-volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)

Key Customers - Commercial
iSpatial framework serves thousands of mobile devices
- Baltimore Grand Prix
- Cleveland Cavaliers
- USGIF
- Las Vegas Motor Speedway

MongoDB: The Best of Both Worlds!

Big Data Scaling - Terminology
- Shard: stores a single partition (subset) of the big dataset.
- Replica: a copy of a partition following a consistency model (delta, eventual, causal, etc.)
- Slice: a single operating system in a large pool of heterogeneous operating systems (virtualization).
