Reconfigurable Technologies for Next Generation Internet and Cluster Computing


  1. Reconfigurable Technologies for Next Generation Internet and Cluster Computing Ph.D. Dissertation Defense Deepak Unnikrishnan Chair: Prof. Russell Tessier Funded by NSF Grant CNS-0831940 Department of Electrical and Computer Engineering

  2. Outline • Motivation • Research goals and preview of results • Contributions • Network virtualization • Distributed cluster computing • Conclusions • Directions for future work

  3. Motivation • Modern web and mobile applications are marked by distinct networking and computing characteristics • Streaming multimedia - Network latency and bandwidth • E-commerce and social networks - Network security • Recommendation systems - Response time versus accuracy • There exists a gap between what applications demand and what the infrastructure can offer

  4. Limitations of Infrastructure • Networking • A small set of protocols, with well-known vulnerabilities, in the middle and lower layers • Limited programmability in existing networking devices • Computing • Backend formed from clusters of commodity machines (e.g. datacenters) • Fairly homogeneous, general-purpose solutions for all application workloads • Growing energy consumption

  5. Technology Choices • Existing technology choices • Microprocessors – fairly general, low-cost, and programmable solutions • ASICs – high cost, limited programmability, tuned for performance • Future Internet systems will require specialized platforms that offer programmability and performance at low cost • Opportunities for Field Programmable Gate Arrays (FPGAs) • Programmable hardware • Data-parallel architecture • Fast design turnaround • Built-in specialized blocks – DSPs, serial protocol IPs, and soft microprocessors

  6. FPGA Architecture • 2D array of logic blocks (Lookup Tables) • Programmable logic and routing circuitry

  7. Research Statement Architect scalable systems that integrate FPGAs with commodity microprocessors to address the diversity issues in next-generation networking and cluster computing. • Research Challenges • Architecture • The need for scalable techniques in hardware to realize next-generation networking and data processing applications • Programming Models • Narrowing the design gap between software and hardware developers for rapid application development on programmable hardware • Heterogeneous computing frameworks • Software frameworks that can utilize the capabilities of specialized hardware (e.g. FPGAs) in distributed computing environments

  8. Research Contributions • An architecture to deploy novel networking protocols on programmable hardware in a shared fashion that supports: • Scalability using virtual network migration • Isolation with partial reconfiguration • 10-100X better than state-of-the-art techniques • A programming model to describe shared networking features in programmable hardware • Reusable interfaces with no loss in packet forwarding performance • A heterogeneous cluster computing framework that uses asynchronous accumulative updates to accelerate computation • Up to 150X faster than existing cluster computing frameworks (e.g. MapReduce)

  9. Part I – Scalable Network Virtualization using FPGAs

  10. Network Virtualization • Motivation • Test and deploy new networking protocols in real networks without traffic disruptions • Toward a network-as-a-service model • Idea – share the physical network between virtual network slices

  11. Node Virtualization • Key issues to consider • Scalability (share the resource between many users and networks) • Flexibility (to support custom data plane and control plane policies) • Isolation (to prevent traffic disruptions and avoid security threats) • Performance (to support realistic traffic capacity) • Existing techniques • Host/container virtualization on commodity processors (e.g. VINI, PlanetLab) • ASICs [e.g. Supercharging PlanetLab, Cavium Octeon, Cisco Nexus 7000] • Limitations • Limited packet forwarding performance in host virtualization [10-100 Mbps] • Limited customizability in virtualized ASICs

  12. Dataplane virtualization using FPGAs [FPGA’10] • Share FPGA between virtual dataplanes • Packet forwarding handled in hardware • Control planes execute in virtualized OpenVZ containers • [Diagram: virtual routers (Vrouters) on the NetFPGA connect over PCI and 1G Ethernet interfaces to the host OS kernel driver, a software bridge, and OpenVZ containers]

  13. Scalability • Dataplane scaling • Since the FPGA only supports 4-5 IP dataplanes, spawn additional virtual networks in a virtualized server • Map virtual dataplanes with high bandwidth to the FPGA; use virtualized server instances (OpenVZ) to run low-throughput dataplanes • Forwarding table scaling • Share off-chip memory (SRAM) to store virtual forwarding tables • Prefix conflicts avoided through prefix relocation [Tcomputer’12] • [Diagram: IPv4 and Routing-on-Flat-Labels dataplanes with MAC queues, design select, and output queues on the FPGA; IPv4 and IPv6 dataplanes in OpenVZ containers behind software bridges on the CPU, connected through a transceiver]
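
To make the mapping policy concrete, here is a minimal Python sketch that places the highest-bandwidth virtual dataplanes into a limited number of FPGA slots and spills the rest into OpenVZ containers; the slot count, throughput numbers, and names are illustrative assumptions, not measured values.

```python
# Illustrative sketch (not the actual control software): place the
# highest-bandwidth virtual dataplanes into the limited FPGA slots and
# spill the rest into OpenVZ containers. FPGA_SLOTS = 4 reflects the
# 4-5 dataplane capacity mentioned above; names are hypothetical.
FPGA_SLOTS = 4

def place_dataplanes(dataplanes):
    """dataplanes: list of (name, demanded_mbps) tuples."""
    ranked = sorted(dataplanes, key=lambda d: d[1], reverse=True)
    placement = {}
    for i, (name, mbps) in enumerate(ranked):
        placement[name] = "FPGA" if i < FPGA_SLOTS else "OpenVZ"
    return placement

if __name__ == "__main__":
    vnets = [("ipv4-a", 950), ("ipv6-b", 40), ("rofl-c", 800),
             ("ipv4-d", 10), ("ipv6-e", 700), ("ipv4-f", 900)]
    for name, target in place_dataplanes(vnets).items():
        print(f"{name:8s} -> {target}")
```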

  14. Virtual Network Migration • Swap virtual networks between the FPGA and software when throughput requirements change • 12 seconds to fully reconfigure the FPGA • [Diagram: source and sink traffic through IPv4 and Routing-on-Flat-Labels dataplanes on the FPGA, and IPv6 and Linux/OpenVZ dataplanes behind software bridges; a virtual network is swapped by fully reconfiguring the FPGA]
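
The swap decision itself can be sketched in the same spirit; the trigger condition and bookkeeping below are assumptions for illustration, not the dissertation's actual migration controller.

```python
# Hypothetical migration trigger (not the real controller): swap a
# software-hosted virtual network into the FPGA when its demand exceeds
# that of the least-demanding FPGA resident. The 12-second full
# reconfiguration from the slide is carried along only as an annotation.
FULL_RECONFIG_SECONDS = 12

def plan_migrations(fpga_residents, sw_residents):
    """Both arguments: dict of name -> demanded Mbps. Returns (out, in) swaps."""
    swaps = []
    for sw_name, sw_mbps in sorted(sw_residents.items(),
                                   key=lambda kv: kv[1], reverse=True):
        victim = min(fpga_residents, key=fpga_residents.get)
        victim_mbps = fpga_residents[victim]
        if sw_mbps > victim_mbps:
            swaps.append((victim, sw_name))
            del fpga_residents[victim]
            del sw_residents[sw_name]
            fpga_residents[sw_name] = sw_mbps
            sw_residents[victim] = victim_mbps
    return swaps

if __name__ == "__main__":
    fpga = {"ipv4-a": 900, "ipv6-b": 50}
    software = {"rofl-c": 600, "ipv4-d": 10}
    for out_net, in_net in plan_migrations(fpga, software):
        print(f"swap {out_net} -> OpenVZ, {in_net} -> FPGA "
              f"(~{FULL_RECONFIG_SECONDS}s full reconfiguration)")
```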

  15. Evaluation • Throughput and latency • 10-100X better throughput and latency versus Click/OpenVZ • Up to 15 virtual networks (4 in the FPGA, 11 in Click/OpenVZ)

  16. Isolation of Virtual Networks in FPGA [SIGCOMM CCR] • Static reconfiguration provides poor traffic isolation in shared virtual networks • Requires reconfiguring the entire chip • All shared virtual networks must either shut down or be migrated to software containers • Partial reconfiguration – reconfigure part of the chip • [Figure: Virtex II with two virtual dataplanes placed on either side of the chip]

  17. Virtual Network Reconfiguration • Static reconfiguration: 12 seconds • Partial reconfiguration: 20x reduction in reconfiguration time • Partial reconfiguration benefits when virtual networks are often reconfigured

  18. ReClick – Programming Models for FPGA Dataplanes [ANCS’11] • Motivation • New networking techniques require unique dataplane processing capabilities • Onion routing (encrypted packet headers) • Path splicing (additional header bits between the transport and IP layers) • Describing new dataplane features on FPGAs is difficult • Requires specification in Hardware Description Languages (HDLs) • Limited opportunities for design reuse • Dataplanes that share the FPGA often perform common operations • Examples: CRC, checksum calculation, TTL updates

  19. Goals • Specify common networking features with operations on packet fields • E.g. set a field in a packet word, insert a field in a packet word • Share and reuse common networking features between dataplanes in the FPGA • A library of common FPGA packet processing components • [Toolflow: a ReClick description and library components are combined into a Click dataplane, translated to Verilog HDL, and processed by FPGA CAD tools]
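
The field-level operations listed above can be modeled with a small Python sketch; the class and method names are hypothetical and are not ReClick's actual syntax, and the byte offsets are placeholders.

```python
# Illustrative model of field-level dataplane operations (set a field,
# insert a field) composed into a reusable component. This is not the
# ReClick language itself; names and the byte-level layout are assumptions.
class Packet:
    def __init__(self, data: bytes):
        self.data = bytearray(data)

    def set_field(self, offset: int, value: bytes):
        """Overwrite len(value) bytes starting at offset."""
        self.data[offset:offset + len(value)] = value

    def insert_field(self, offset: int, value: bytes):
        """Insert new bytes at offset, shifting the rest of the packet."""
        self.data[offset:offset] = value

def decrement_ttl(pkt: Packet, ttl_offset: int = 22):
    """Reusable 'common operation' example: TTL update at an assumed offset."""
    ttl = pkt.data[ttl_offset]
    pkt.set_field(ttl_offset, bytes([max(ttl - 1, 0)]))

if __name__ == "__main__":
    pkt = Packet(bytes(range(64)))
    decrement_ttl(pkt)
    pkt.insert_field(34, b"\x00\x01")   # e.g. extra header bits (path splicing)
    print(len(pkt.data), pkt.data[22])
```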

  20. ReClick Library

  21. Part II – Maestro: Accelerating Iterative Algorithms using Asynchronous Accumulative Updates on FPGAs

  22. Motivation • Iterative algorithms – arrive at a result by repetitively performing the same set of operations on a large dataset • Form the basis for many machine learning and data mining algorithms in web applications • PageRank (Google search) • Adsorption (YouTube recommendation systems) • Katz metric (social network analysis) • Existing approach – MapReduce on commodity hardware clusters (e.g. datacenters) • Limitations • Synchronous barriers, repeated disk accesses between iterations • Sequential execution on general-purpose processors • All data treated equally – lack of prioritization

  23. Maestro Contributions • We present Asynchronous Accumulative Updates (AAU) as an approach to eliminate the need for strict synchronization barriers in heterogeneous clusters (e.g. FPGAs + CPUs) • Demonstrate a scalable hardware architecture that utilizes the data-parallel nature of FPGAs to implement the AAU model • Evaluate the model in a laboratory cluster of four FPGAs

  24. Background on Iterative Algorithms • Perform the same set of operations repetitively on a dataset • Algorithm converges when the difference between two iterations is sufficiently small • Example: PageRank • [Figure: example web graph with nodes A, B, C, and D]
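
A minimal Python sketch of the synchronous iterative pattern, using PageRank on a toy graph; the damping factor and tolerance are conventional values rather than parameters from the dissertation.

```python
# Synchronous PageRank: every iteration recomputes all ranks from the
# previous iteration's values and stops when the change is small.
# Damping factor 0.85 and the tolerance are conventional choices.
def pagerank(graph, damping=0.85, tol=1e-6, max_iters=100):
    """graph: dict node -> list of out-neighbors."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(max_iters):
        new_rank = {v: (1.0 - damping) / n for v in graph}
        for v, out in graph.items():
            if out:
                share = damping * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share
        if sum(abs(new_rank[v] - rank[v]) for v in graph) < tol:
            return new_rank
        rank = new_rank
    return rank

if __name__ == "__main__":
    g = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(g))
```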

  25. Synchronous Implementation – MapReduce • Limitations – sync barriers, repeated disk access between iterations, need to store shuffled values at each node • 1.3 million node web graph ~ 45 minutes • [Figure: per-iteration dataflow – Map reads A and B from the distributed file system and emits PR(A) and PR(B), a shuffle and sync barrier follows, and Reduce computes PR(C)=PR(A)+PR(B) before writing back to the distributed file system] • Improvements – perform in-memory computation [1.2-5X speedup]* • *iMapReduce, Zhang et al., IEEE Intl. Symposium on Parallel and Distributed Processing Workshops
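
To make the per-iteration barrier concrete, the sketch below expresses one PageRank iteration as map, shuffle, and reduce steps; it is a single-process illustration, not Hadoop code.

```python
# One PageRank iteration expressed as map -> shuffle (group by key) ->
# reduce. Each iteration must wait for all map outputs to be shuffled
# before any reduce runs -- the synchronization barrier noted above.
from collections import defaultdict

def map_phase(graph, rank, damping=0.85):
    for v, out in graph.items():
        for u in out:
            yield u, damping * rank[v] / len(out)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, graph, damping=0.85):
    base = (1.0 - damping) / len(graph)
    return {v: base + sum(grouped.get(v, [])) for v in graph}

if __name__ == "__main__":
    g = {"A": ["C"], "B": ["C"], "C": ["A", "B"]}
    rank = {v: 1.0 / len(g) for v in g}
    for _ in range(10):                    # fixed iteration count for brevity
        rank = reduce_phase(shuffle(map_phase(g, rank)), g)
    print(rank)
```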

  26. Accumulative Updates • Key idea • Accumulate changes from other nodes • Update the value with the accumulated changes • Propagate the “change” in value • Accumulative updates can be proven to converge to the same solution • Example (PageRank): Accumulate ΔPR(C) = ΔPR(A) + ΔPR(B); Update PR(C) = PR(C) + ΔPR(C), send g(ΔPR(C)) to the neighbors of C, then reset ΔPR(C) = 0
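
The accumulate/update/propagate steps can be written compactly; this sketch applies them to PageRank deltas and illustrates the model only, not the Maestro hardware.

```python
# Accumulative update of a single key: accumulate incoming deltas, fold
# the accumulated delta into the value, propagate g(delta) to neighbors,
# then reset the delta. g() here is the PageRank spread function; the
# damping factor and initial deltas are conventional assumptions.
DAMPING = 0.85

def accumulate(state, key, incoming_delta):
    state[key]["delta"] += incoming_delta

def update_and_propagate(state, graph, key):
    delta = state[key]["delta"]
    state[key]["value"] += delta              # PR(C) = PR(C) + dPR(C)
    state[key]["delta"] = 0.0                 # dPR(C) = 0
    out = graph[key]
    if out and delta:
        share = DAMPING * delta / len(out)    # g(dPR(C))
        for neighbor in out:
            accumulate(state, neighbor, share)

if __name__ == "__main__":
    graph = {"A": ["C"], "B": ["C"], "C": ["A", "B"]}
    state = {v: {"value": 0.0, "delta": 0.15} for v in graph}
    for _ in range(50):
        for v in graph:
            update_and_propagate(state, graph, v)
    print({v: round(s["value"], 4) for v, s in state.items()})
```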

  27. Asynchronous Accumulative Updates • Updates and accumulates may be performed asynchronously • Allows streaming communication without synchronization barriers • Useful in heterogeneous systems • Faster nodes can make independent progress • [Figure: update timelines at nodes A, B, and C – synchronous computation stalls at sync barriers, while asynchronous computation proceeds independently]

  28. Implementation of the AAU Model in a CPU Network • Key selection policies • Round-robin • Priority-based selection • Update keys with the highest Δv • In PageRank, the pages with the highest “changes” in rank values • The update operator must be commutative, associative, and distributive • [Figure: incoming messages are accumulated into a state table in RAM and then applied by the update operator]
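
A sketch of the two selection policies: keys are picked either round-robin or by the largest pending delta, and updates are applied with no global barrier. Function names and the damping factor are illustrative.

```python
# Asynchronous update loop with two key-selection policies. Priority
# selection picks the key with the largest pending delta (the pages whose
# rank changed most); round-robin simply cycles through the keys.
import itertools

DAMPING = 0.85

def apply_update(state, graph, key):
    """Fold the accumulated delta into the value and push g(delta) out."""
    delta, state[key]["delta"] = state[key]["delta"], 0.0
    state[key]["value"] += delta
    out = graph[key]
    if out and delta:
        for neighbor in out:
            state[neighbor]["delta"] += DAMPING * delta / len(out)

def run_async(state, graph, steps, policy="priority"):
    rr = itertools.cycle(list(state))
    for _ in range(steps):
        if policy == "priority":               # largest pending change first
            key = max(state, key=lambda k: abs(state[k]["delta"]))
        else:                                  # round-robin
            key = next(rr)
        apply_update(state, graph, key)

if __name__ == "__main__":
    graph = {"A": ["C"], "B": ["C"], "C": ["A", "B"]}
    state = {v: {"value": 0.0, "delta": 0.15} for v in graph}
    run_async(state, graph, steps=300, policy="priority")
    print({v: round(s["value"], 4) for v, s in state.items()})
```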

  29. Maestro – Cluster • FPGAs provide fine-grained parallelism suited to iterative computation • FPGA assistants – interface the FPGAs with the distributed file system (Intel quad-core CPUs) • NetFPGA 1G router • Goal: a scalable architecture to parallelize the AAU model on FPGAs

  30. FPGA Architecture • Approach: partition the state table across compute units • [Diagram: packet communication feeds update/accumulate units with prioritized key selection and data-consistency logic around the partitioned state table]
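
The state-table partitioning can be illustrated with a simple MOD-style partition function (matching the MOD partitioning mentioned later in the evaluation); the exact function used in Maestro may differ.

```python
# Simple MOD partitioning of keys across FPGA boards and, within a board,
# across compute units. Numeric node IDs are assumed; the real system's
# partition function may differ.
def partition(key_id: int, num_boards: int, units_per_board: int):
    board = key_id % num_boards
    unit = (key_id // num_boards) % units_per_board
    return board, unit

if __name__ == "__main__":
    for node in range(10):
        print(node, partition(node, num_boards=4, units_per_board=8))
```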

  31. Design Challenge – Data Consistency • Compute units must have exclusive access to KV pairs • Coarse-grained locking – global lock on the state table • Limits memory utilization • Fine-grained locking – e.g. provide a lock bit per key per compute unit • Not scalable • [Diagram: compute units P1-P4 sharing the FPGA state table]

  32. Consistency • Our solution: based on principles of cache coherence in symmetric multiprocessor systems • Compute units gain exclusive access only after checking the state of the KV pair in other compute units – snoopy bus protocol • Simultaneous KV pair accesses are serialized, enforcing strict consistency • Implementation – a coherence controller module within each compute unit • [Diagram: compute units P1-P4 connected to the FPGA state table through a snoopy bus and bus arbiter]
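
As a software analogue of the serialization the snoopy protocol provides (not the hardware implementation), the sketch below arbitrates exclusive ownership of a key among compute units so that simultaneous accesses to the same KV pair are serialized.

```python
# Software analogue of bus-arbitrated exclusive access: each compute unit
# must acquire ownership of a KV pair from the arbiter before updating it,
# so concurrent accesses to the same key are serialized. This models the
# effect of the snoopy protocol, not its hardware implementation.
import threading

class Arbiter:
    """Grants a compute unit exclusive ownership of a key before it updates."""
    def __init__(self):
        self._cv = threading.Condition()
        self._owner = {}                      # key -> owning compute unit id

    def acquire(self, unit_id, key):
        with self._cv:
            while self._owner.get(key) not in (None, unit_id):
                self._cv.wait()               # another unit owns the key: stall
            self._owner[key] = unit_id

    def release(self, unit_id, key):
        with self._cv:
            if self._owner.get(key) == unit_id:
                del self._owner[key]
                self._cv.notify_all()

if __name__ == "__main__":
    arb, table = Arbiter(), {"k": 0}

    def unit(unit_id):
        for _ in range(1000):
            arb.acquire(unit_id, "k")
            table["k"] += 1                   # exclusive, serialized update
            arb.release(unit_id, "k")

    threads = [threading.Thread(target=unit, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(table["k"])                         # 4000: no lost updates
```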

  33. Prioritizing KV Pair Updates in FPGA • Some updates to KV pairs may be more relevant than others • E.g. pages with higher rank values in PageRank • Identify the K highest-priority keys from N state table entries • Approach 1: sort all N keys and select the top K entries • Complexity = O(N log N) • Approach 2: • Assumption – the distribution of KV pairs in a sample represents a statistical sampling of the distribution of KV pairs in the state table • Sample S KV pairs and sort the samples; the Kth-highest sample approximates the Kth highest-priority key in the state table • Select all entries above the threshold for update (see the sketch below)
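
Approach 2 can be sketched in a few lines: sample S entries, sort only the sample, use the Kth-highest sampled delta as a threshold, and select every entry above it. The sample size is an assumption, and how closely the selected set matches the true top K depends on it.

```python
# Sampling-based approximation of the top-K priority keys: sort only a
# small sample instead of the full table, take the K-th highest sampled
# delta as a threshold, then select every entry at or above the threshold.
import random

def select_for_update(deltas, k, sample_size=64, seed=0):
    """deltas: dict key -> |pending delta|; returns keys above the threshold."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(list(deltas.values()),
                               min(sample_size, len(deltas))), reverse=True)
    threshold = sample[min(k, len(sample)) - 1]   # K-th highest sampled delta
    return [key for key, d in deltas.items() if d >= threshold]

if __name__ == "__main__":
    rng = random.Random(1)
    table = {f"k{i}": rng.random() for i in range(10000)}
    # Roughly the highest-priority fraction of the table; accuracy depends
    # on the sample size relative to the table.
    print(len(select_for_update(table, k=8)))
```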

  34. Prioritizing KV Pair Updates in FPGA • A parallel hardware selection circuit performs key value pair selection in O(S) time complexity and O(K) space complexity

  35. Maestro Lab Prototype • Cluster scaling • Add FPGA boards • Scaling capacity within an FPGA • Add more TX/RX processors (requires reconfiguration) • Interchange TX and RX processors (via software registers)

  36. Evaluation • Evaluation systems • Hadoop – open-source implementation of MapReduce • Maiter – asynchronous implementation of AAU on 4 CPUs • Maestro – asynchronous implementation of AAU on 4 Altera DE-4 FPGAs • Dataset • Graph size selected to fill the capacity of the DRAM on each Altera DE-4 board (~1.3 million nodes per board) • Node and edge degrees follow a log-normal distribution (σ=2.5, µ=0.5) • MOD partition function

  37. Single-node Configuration • Graph size = 1.3M nodes (900MB) • Close to linear speedup for PageRank and Katz • Up to 7X speedup (154X versus Hadoop) for PageRank • With 8 processors, approximately 40% of the FPGA is used

  38. Multi-node Configuration • Increasing transmitters improves network utilization • Balanced transmitters and receivers yield the highest speedup • Two nodes: graph size = 2.6M nodes (1.8GB); four nodes: graph size = 5.2M nodes (2.6GB)

  39. Network Trace • Maestro better utilizes network bandwidth as computation is scaled within the FPGA • [Figure: network traces for Maestro and Maiter]

  40. Scaling Cluster and Problem Sizes • For the 1-node configuration, Ptx = 8 • For the 2- and 4-node configurations, Ptx:Prx = 4:4

  41. Scalability for Fixed Problem Size • Experimental data limited to 4 nodes • Ideal execution time on n FPGA nodes = T/n • Factors that can limit scalability • Link capacity • Receiver processor (Prx) capacity • [Figure: network trace when a 1.2M-node problem is parallelized on 2 and 4 nodes]

  42. Summary of Contributions • Architecture • Heterogeneous dataplanes to scale virtual networks (10-100X speedup) • Virtual network migration and partial reconfiguration for customization • Programming Models • Faster design cycles using hierarchical composition of reusable networking components • Heterogeneous Computing Frameworks • Asynchronous accumulative updates can provide significant speedup in heterogeneous clusters (154X versus MapReduce)

  43. Directions for Future Work • FPGA-based Network virtualization • Evaluation of FPGA-dataplanes in existing virtual network testbeds (e.g. GENI, PlanetLab) • Virtual network hypervisor to manage dataplane migration • Heterogeneous Distributed Clusters • Evaluating Maestro with better clustering algorithms • Evaluate the suitability of AAU model in other data-parallel architectures e.g. multicores/GPGPUs

  44. Publications

  45. Backup slides

  46. Maiter scaling on Amazon EC2

  47. Execution time

  48. Factors that can limit scalability • Link capacity • Receiver processor (Prx) capacity • Max FPGAs = min(n_max,link, n_max,rx)
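
A worked version of this bound with placeholder numbers; the way the per-node limits are derived from an assumed update rate is an illustration, not a measurement from the dissertation.

```python
# Illustrative evaluation of Max FPGAs = min(n_max_link, n_max_rx).
# The traffic and capacity numbers below are placeholders, not measurements.
def max_fpgas(update_rate_per_node, link_capacity, rx_capacity_per_node):
    """All rates in updates/s; each node absorbs traffic from the other n-1."""
    # (n - 1) * update_rate must fit within the link and receiver capacities.
    n_max_link = link_capacity // update_rate_per_node + 1
    n_max_rx = rx_capacity_per_node // update_rate_per_node + 1
    return min(n_max_link, n_max_rx)

if __name__ == "__main__":
    print(max_fpgas(update_rate_per_node=2_000_000,
                    link_capacity=12_000_000,
                    rx_capacity_per_node=8_000_000))   # -> 5, receiver-limited
```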

  49. Maestro Energy/Cost Analysis

  50. Backup
