
Project "Velocity" Under the hood

Project "Velocity" Under the hood.  Anil Nori Distinguished Engineer Microsoft Corporation. Outline. Recap velocity first look Velocity vision Velocity in action Velocity architecture Futures Q & A?. Velocity Recap.


Project "Velocity" Under the hood



Presentation Transcript


  1. Project "Velocity" Under the Hood • Anil Nori, Distinguished Engineer, Microsoft Corporation

  2. Outline • Recap: Velocity first look • Velocity vision • Velocity in action • Velocity architecture • Futures • Q & A

  3. Velocity Recap • An explicit, distributed, in-memory application cache for all kinds of data (CLR objects, rows, XML, binary data, etc.) • Fuses "memory" across machines into a unified cache • Clients can be spread across machines or processes and access the cache as if it were a single large cache • The cache layer distributes data across the cache nodes in a cluster

  4. Enable Next-Generation Applications • Data-centric, event-driven • Data from multiple sources • Multi-tiered, decentralized data and logic • Distributed, loosely coupled, anywhere execution • High scale: large numbers of apps, users, and transactions • Software + Services oriented: REST-based access, high scale at low CAPEX and OPEX

  5. Velocity Vision

  6. Velocity In Action

  7. Deployment • Users hit the application/web tier, where each application server runs a Velocity client • The cache tier consists of Velocity services running on the cache servers (Server 1, 2, 3), each hosting the data/object manager on top of the Common Availability Substrate • One of the Velocity services hosts the Configuration Manager • The clustering substrate handles cluster management • The configuration store (a database, file share, etc.) stores the global cache policies and the current partitioning information

  8. Scale: Replicated Cache • Using the routing table, the client routes the PUT to the Cache2 (primary) node • The primary queues the PUT operation, PUTs locally, returns control, and then propagates the operation to Cache1 and Cache3 via the replication agent • Because the cache is replicated, every node ends up with a copy of each item (K1,V1), (K2,V2), (K3,V3), so a Get(K2) can be served by any node

  9. Scale: Partitioned Cache • Using the routing table, the client routes the PUT to the Cache2 (primary) node • Each key lives on exactly one node: Cache1 is primary for (K1,V1), Cache2 for (K2,V2), and Cache3 for (K3,V3) • Both clients and cache nodes maintain routing tables • An operations queue is kept for notifications, bringing up a new secondary, etc.

  10. Routing Table • The routing table is a copy of a subset of the Global Partition Map • Maps partitions (key-hash ID ranges) to nodes • Incrementally built on demand and maintained incrementally, similar to DNS tables • Both Velocity data nodes and client nodes may keep a routing table • A client-side routing table allows direct dispatch of cache operations (e.g., GET, PUT) to the appropriate nodes

  11. Key Mapping • Keys are bucketized into regions • A region name is hashed into a region ID • Region-ID ranges are mapped to the Velocity service nodes
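The key-mapping steps above can be sketched in a few lines of Python. This is a conceptual illustration only, not the actual Velocity API: the node names, the ID-space size, and the range split are all made up, and a short SHA-1 prefix stands in for whatever hash the real service uses.

```python
# Sketch: region name -> hashed region ID -> owning node via ID ranges.
import hashlib

NODES = ["Server1", "Server2", "Server3"]    # hypothetical node names
ID_SPACE = 2 ** 16                           # hypothetical hash-ID space

# Each node owns a contiguous range of region IDs: [lo, hi).
RANGES = [(0, ID_SPACE // 3, "Server1"),
          (ID_SPACE // 3, 2 * ID_SPACE // 3, "Server2"),
          (2 * ID_SPACE // 3, ID_SPACE, "Server3")]

def region_id(region_name: str) -> int:
    """Hash a region name into the ID space."""
    digest = hashlib.sha1(region_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % ID_SPACE

def node_for_region(region_name: str) -> str:
    """Walk the ID ranges to find the node that owns this region."""
    rid = region_id(region_name)
    for lo, hi, node in RANGES:
        if lo <= rid < hi:
            return node
    raise RuntimeError("ID space not fully covered")
```

A client-side routing table is essentially a cached copy of RANGES, which is why lookups can be dispatched directly to the right node without a directory hop.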

  12. Local Cache • A local cache can speed up access on clients • Uses the notification mechanism to refresh the local cache when cache items change • A Get(K2) is served from the client's local cache when possible; otherwise the routing table directs it to the primary node (Cache2)
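The local-cache behavior can be sketched as a client-side dictionary in front of the cluster, invalidated by change notifications. The class names (ClusterCache, LocalCachingClient) and the callback shape are hypothetical; the real Velocity client wires this up through its notification mechanism.

```python
# Sketch: a client-side local cache invalidated by change notifications.
class ClusterCache:
    """Stand-in for the distributed cache tier."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def put(self, key, value):
        self._store[key] = value

class LocalCachingClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.local = {}              # local cache: avoids a network hop
    def get(self, key):
        if key in self.local:        # fast path: serve from local memory
            return self.local[key]
        value = self.cluster.get(key)
        if value is not None:
            self.local[key] = value
        return value
    def on_item_changed(self, key):
        # Invoked by the notification mechanism when the item changes.
        self.local.pop(key, None)    # drop stale copy; refetch on next get
```

Note the tradeoff this makes visible: between the change and the notification, the client may serve a stale value from its local cache.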

  13. Availability • Using the routing table, the client routes the PUT to the Cache2 (primary) node • The primary queues the PUT operation, PUTs locally, propagates the operation to the secondaries (Cache1 and Cache3), waits for a quorum of acks, and then returns control • Each partition has a primary and secondaries on other nodes: e.g., Cache2 is primary for (K2,V2) and secondary for (K1,V1) and (K3,V3)

  14. Failover • The cluster detects the Cache2 failure and notifies the Partition Manager (PM), running on Cache4 • The PM analyzes the secondaries of all partitions for which Cache2 was primary to elect new primaries • It picks Cache1 as the new primary for (K2,V2), sends messages to the secondary caches (Cache1 and Cache3), and updates the Global Partition Map (GPM) • Cache1 initiates reconfiguration: it polls the secondaries to ensure it has the latest data; otherwise, it gives up primary ownership • After reconfiguration, Cache1 is primary for both (K1,V1) and (K2,V2)

  15. Embedded Cache • Velocity client and server components run as part of the application process • Avoids serialization and network costs • Provides high-performance, low-latency access • Guaranteeing locality and load balancing is tricky • Better suited for replicated caches, where each application process holds a full copy of the data

  16. Cache Event Notifications • The application registers a notification for keys "a" and "b" • The Velocity client maps the keys to partition ranges • Using its routing table, the client polls the required nodes • The nodes return the list of changes

  17. Velocity Architecture

  18. Design Areas • Performance • Memory management • Scale • Availability • Consistency • Manageability

  19. Velocity Components • Client layer: cache API, federated query processor, dispatch manager, local cache, routing table • Cache API & service layer: cache service, distributed object manager, distributed manager • Common Availability Substrate (CAS): replication agent, reconfiguration agent, local partition map, reliable messaging • Local store components: in-memory data manager (DM API, hash and B-tree structures), object manager, region management, query processor, notification management, policy management • Cluster substrate (Fabric): failure detection, raw transport, reliable messaging • Tools integration: administration and monitoring, cache monitors

  20. DM: Optimistic Version-Based Updates • GetCacheItem returns a version object • Every update to an object internally increments its version • Supply the version obtained along with the Put/Remove • The Put/Remove succeeds only if the passed-in version matches the version in the cache • Example: two clients read the same item and both update it; the second client gets in first, and its Put succeeds because the item version matches, atomically incrementing the version; the first client's Put then fails because the versions no longer match
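The version-check protocol can be sketched as follows. This is a single-process illustration with hypothetical names (VersionedCache, VersionMismatch), not the actual Velocity API; the real store performs the compare-and-increment atomically on the primary node.

```python
# Sketch: optimistic concurrency via per-item version numbers.
class VersionMismatch(Exception):
    pass

class VersionedCache:
    def __init__(self):
        self._store = {}                     # key -> (value, version)
    def get_cache_item(self, key):
        return self._store[key]              # returns (value, version)
    def put(self, key, value, version):
        _, current = self._store.get(key, (None, 0))
        if version != current:
            raise VersionMismatch(f"expected v{current}, got v{version}")
        # Atomic compare-and-increment in the real store.
        self._store[key] = (value, current + 1)

cache = VersionedCache()
cache._store["k1"] = ("old", 1)
_, v = cache.get_cache_item("k1")        # both clients read version 1
cache.put("k1", "second-wins", v)        # second client gets in first
try:
    cache.put("k1", "first-loses", v)    # stale version -> rejected
except VersionMismatch:
    pass                                 # first client must re-read and retry
```

The losing client's recovery path is to call GetCacheItem again, reapply its change, and retry the Put with the fresh version.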

  21. DM: Pessimistic Locking • GetAndLock gets the object and takes a lock, returning a lock handle • Other clients calling GetAndLock on the same item fail (they do not block) • A lock expiry time expires the lock • A regular Get does not block and succeeds even while the item is locked • A regular Put will override the lock • PutAndUnlock puts the new object and releases the existing lock • Unlock explicitly unlocks the object given the lock handle
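A minimal sketch of these semantics, with hypothetical names (LockingCache, LockedError) and an in-process lock table rather than the real distributed implementation:

```python
# Sketch: GetAndLock / PutAndUnlock with lock expiry; Get never blocks.
import time
import uuid

class LockedError(Exception):
    pass

class LockingCache:
    def __init__(self):
        self._store = {}
        self._locks = {}     # key -> (handle, expiry deadline)
    def _lock_live(self, key):
        lock = self._locks.get(key)
        return lock is not None and lock[1] > time.monotonic()
    def get(self, key):
        return self._store.get(key)              # regular Get never blocks
    def get_and_lock(self, key, timeout_s):
        if self._lock_live(key):
            raise LockedError(key)               # fail fast, don't block
        handle = uuid.uuid4().hex
        self._locks[key] = (handle, time.monotonic() + timeout_s)
        return self._store.get(key), handle      # value plus lock handle
    def put_and_unlock(self, key, value, handle):
        if self._lock_live(key) and self._locks[key][0] != handle:
            raise LockedError(key)
        self._store[key] = value
        self._locks.pop(key, None)
    def put(self, key, value):
        self._store[key] = value                 # regular Put overrides the lock
        self._locks.pop(key, None)
```

The expiry deadline is what prevents a crashed client from holding a lock forever: once it passes, _lock_live returns False and another GetAndLock succeeds.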

  22. OM: Eviction • Expiry-only eviction: evicts expired items alone (LowWaterMark < data < HighWaterMark); runs periodically, per partition • Hard eviction: evicts expired items and some non-expired items in LRU order (data > HighWaterMark); runs per request; can be turned off • Memory-pressure-based eviction: a thread polls for memory pressure once per second to avoid paging; at 85% system memory usage it triggers hard eviction and asks for 5% of system memory to be released • Callbacks fire when eviction is triggered
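The two eviction passes can be sketched with an LRU-ordered store. Watermarks are expressed here as item counts for simplicity (the real thresholds are memory-based), and all names are illustrative, not the Velocity implementation.

```python
# Sketch: expiry-only eviction vs. LRU-order hard eviction with watermarks.
import time
from collections import OrderedDict

class EvictingStore:
    def __init__(self, low_watermark, high_watermark):
        self.low, self.high = low_watermark, high_watermark
        self.items = OrderedDict()   # key -> (value, expiry); kept in LRU order
    def put(self, key, value, ttl_s):
        self.items[key] = (value, time.monotonic() + ttl_s)
        self.items.move_to_end(key)  # most recently used at the end
    def evict_expired(self):
        # Expiry-only pass: removes expired items alone.
        now = time.monotonic()
        for key in [k for k, (_, exp) in self.items.items() if exp <= now]:
            del self.items[key]
    def hard_evict(self):
        # Hard eviction (size above the high watermark): drop expired
        # items first, then least-recently-used live items, until the
        # store is back down to the low watermark.
        self.evict_expired()
        while len(self.items) > self.low:
            self.items.popitem(last=False)   # LRU victim
```

The memory-pressure monitor described above would simply call hard_evict when its polling thread sees system memory cross the trigger threshold.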

  23. OM: Cache Event Notifications • Each operation gets an LSN (Logical Sequence Number); LSNs are also used for synchronizing secondaries • The client caches the last known LSN and includes it in each request to the server • The server returns only the changes since that LSN • If the LSN is too old, the server rejects the request and the client raises a callback reporting event loss • The server returns coarse-grained events: all events related to a partition are returned, and the client filters out the ones it did not register for
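The LSN-based delta protocol can be sketched with a bounded per-partition log. The class and method names are hypothetical; the point is the contract: newer events are returned as deltas, and a client whose LSN has aged out of the log gets an explicit event-loss signal instead of silent gaps.

```python
# Sketch: bounded event log keyed by LSN; clients poll with their last LSN.
from collections import deque

class PartitionLog:
    def __init__(self, capacity):
        self.log = deque(maxlen=capacity)   # oldest entries fall off
        self.lsn = 0
    def append(self, event):
        self.lsn += 1
        self.log.append((self.lsn, event))
    def changes_since(self, client_lsn):
        """Return events newer than client_lsn, or None on event loss."""
        oldest = self.log[0][0] if self.log else self.lsn + 1
        if client_lsn + 1 < oldest:
            return None                     # LSN too old: client lost events
        return [(n, e) for n, e in self.log if n > client_lsn]

log = PartitionLog(capacity=3)
for event in ["put k1", "put k2", "del k1", "put k3"]:
    log.append(event)
```

On a None result the real client raises its event-loss callback and the application must resynchronize (e.g., re-read the keys it cares about) before resuming incremental polling.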

  24. OM: Persistence • Callbacks for read-through and write-behind, specified at the named-cache level • Read-through: called when an item is not present in the cache; the callback returns the object/serialized bytes • Write-behind: writes to the cache are queued and the callback is called asynchronously in batches, with retries upon failure

  25. Distributed Components • Velocity data & master nodes (nodes 106 and 107 in the figure) run the Partition Manager, partition management, the GPM store, and the load balancer/placement advisor • Every data node (nodes 100 to 105) holds the Global Partition Map plus its primary (P) and secondary (S) partitions • CAS components on each data node: reconfiguration agent, replication agent, local partition map • The Velocity application node runs the Velocity components and a Velocity-to-CAS client holding the routing table • The Fabric provides the ring topology, cluster leader election, and failure detection

  26. Fabric: Cluster Management • Nodes are arranged in a ring topology (IDs 2, 17, 30, 40, 46, 50, 64, 76, 83, 90, 98, 103, 120, 135, 151, 174, 180, 200, 210, 218, 225, 250 in the figure) • Routing table at node 64: successor = 76; predecessor = 50; neighborhood = (83, 76, 50, 46); routing nodes = (200, 2, 30, 46, 50, 64, 64, 64, 64, 64, 83, 98, 135, 200)
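The basic ring relations can be sketched over the example node IDs. This simplifies the Fabric's routing table (which also keeps a neighborhood set and routing nodes at increasing distances) down to plain successor/predecessor lookup with wraparound; the function names are illustrative.

```python
# Sketch: successor/predecessor lookup on the example Fabric ring.
RING = sorted([2, 17, 30, 40, 46, 50, 64, 76, 83, 90, 98, 103,
               120, 135, 151, 174, 180, 200, 210, 218, 225, 250])

def successor(node_id: int) -> int:
    """First ring node clockwise from node_id (wrapping around)."""
    for n in RING:
        if n > node_id:
            return n
    return RING[0]          # wrap past the largest ID back to the smallest

def predecessor(node_id: int) -> int:
    """First ring node counterclockwise from node_id (wrapping around)."""
    for n in reversed(RING):
        if n < node_id:
            return n
    return RING[-1]
```

The neighborhood and long-range routing entries exist so a message can reach any node in a logarithmic number of hops instead of walking successors one at a time.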

  27. CAS: Replication Agent • Replication for high availability • Single-master based: primaries and secondaries • Write operations are always sent to the primary • The primary propagates operations to a write quorum of secondaries • An "operations" queue is kept for replication, reconfiguration, and notifications
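The write path can be sketched as below: apply locally on the primary, propagate to the secondaries, and acknowledge the client only once a quorum of acks arrives. The class names and the synchronous loop are stand-ins; the real agent queues operations and handles failures and reconfiguration, all elided here.

```python
# Sketch: single-master replication with a write quorum of secondaries.
class Secondary:
    def __init__(self, up=True):
        self.store, self.up = {}, up
    def replicate(self, key, value):
        if not self.up:
            return False              # a failed node sends no ack
        self.store[key] = value
        return True

class Primary:
    def __init__(self, secondaries, quorum):
        self.store = {}
        self.secondaries = secondaries
        self.quorum = quorum          # acks required before returning
    def put(self, key, value):
        self.store[key] = value       # 1. apply the write locally
        acks = sum(s.replicate(key, value) for s in self.secondaries)
        if acks < self.quorum:        # 2. wait for a quorum of acks
            raise RuntimeError("write quorum not reached")
        return acks                   # 3. return control to the client

primary = Primary([Secondary(), Secondary(up=False)], quorum=1)
```

A quorum smaller than the full secondary set is what lets a write succeed while some replicas are down, at the cost of those replicas having to catch up later.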

  28. CAS: Partition Manager • Partition map: a partition is a range of hash values (hashes of region/item keys); the partition map maps partitions to nodes • The Global Partition Map (GPM) covers all partitions in all named caches; the local partition map on a Velocity node lists the partitions local to that node • The Partition Manager (PM) maintains and manages the GPM, is notified of any node failure in the cluster, and initiates reconfiguration • Any node can act as PM, but only one PM is active in the cluster at any time • The GPM can be replicated or rebuilt for availability

  29. CAS: Reconfiguration Agent • Four types of reconfiguration: primary failover, switching to a new primary, removing a failed secondary, adding a new secondary • The Partition Manager elects the new primary and initiates reconfiguration • The new primary collects the copies of items from the secondaries, reconciles them to generate the latest copies, and sends the latest copies to all the secondaries

  30. Futures

  31. Executing a LINQ Query • The application issues: from toy in catalog<Toy>() where toy.ToyPrice > 300 select toy; • The Velocity client's federated query processor dispatches the query, via the dispatch manager, to the query processor on each cache node (Cache1, Cache2, Cache3) • Each node's query processor evaluates the predicate against its in-memory data manager over its primary regions • In ToyRegion, Toy1 (500), Toy2 (350), and Toy3 (400) satisfy ToyPrice > 300; Toy4 (100) does not

  32. Executing a LINQ Query • The application issues: from toy in catalog.GetRegion<Toy>("ToyRegion") where toy.ToyPrice > 300 select toy; • GetRegion("ToyRegion") scopes the query to a single region, so the federated query processor can dispatch it only to the node whose primary regions include ToyRegion • That node's query processor evaluates the predicate and returns the matching toys

  33. Co-locating Computation: Velocity-HPC • The HPC job framework pulls market data from the central market data store (~1 TB of tick data) into the Velocity data cache • A split method partitions the job input; calculation operations read the cached market data and write scratch keys to a Velocity intermediate store • A rollup operation combines the scratch results and writes the final results to the final results store

  34. Scaling Computation: Velocity-HPC • The same pipeline, with the calculation operations running on the Velocity nodes themselves • Co-locating the calculations with the cached data lets the computation scale out with the cache tier

  35. Velocity in the Cloud • Integration with Windows Azure/SSDS • Application hosted in the cloud: Velocity as a cache layer within Windows Azure/SSDS • On-premises application and cache with a Windows Azure/SSDS backend service • Velocity as a cache service

  36. Application Hosted in the Cloud • Windows Azure service roles each run a Velocity client • The Velocity cache sits between the service roles and Storage/SSDS

  37. On-Premises Applications • Applications (including ASP.NET applications) run on-premises, each with a Velocity client • The on-premises Velocity cache fronts a Storage/SSDS backend in the cloud

  38. Velocity as a Service • Applications (including ASP.NET applications) each run a Velocity client • The clients talk to a hosted Velocity caching service backed by Storage/SSDS

  39. demo

  40. Q & A

  41. Evals & Recordings Please fill out your evaluation for this session at: This session will be available as a recording at: www.microsoftpdc.com

  42. © 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
