
Extensible Message Layers for Multimedia Cluster Computers



  1. Extensible Message Layers for Multimedia Cluster Computers. Dr. Craig Ulmer, Center for Experimental Research in Computer Systems

  2. Outline • Background • Evolution of cluster computers • Emergence of “resource-rich” cluster computers • Design of extensible message layers • GRIM: General-purpose Reliable In-order Messages • Extensions • Integrating peripheral devices • Streaming computations • Host-to-host performance • Concluding remarks

  3. Background An Evolution of Cluster Computers

  4. Cluster Computers • [Diagram: four hosts, each with CPU, memory, I/O bus, and network interface, joined by a system area network] • Cost-effective alternative to supercomputers • A number of commodity workstations • Specialized network hardware and software • Result: Large pool of host processors

  5. Improving Cluster Computers • Adding more host CPUs • Adding intelligent peripheral devices

  6. Peripheral Device Trends • [Diagram: host with CPU and SAN NI plus Ethernet, media capture, and storage peripherals] • Increasingly independent, intelligent peripheral devices • Feature on-card processing and memory facilities • Migration of computing power and bandwidth requirements to peripherals

  7. Resource-Rich Cluster Computers • [Diagram: cluster of hosts on a system area network; individual hosts carry Ethernet, video capture, storage, and FPGA cards alongside their CPUs and SAN NIs] • Inclusion of diverse peripheral devices • Ethernet server cards, multimedia capture devices, embedded storage, computational accelerators • Processing takes place in host CPUs and peripherals

  8. Benefits of Resource-Rich Clusters • Employ cluster computing in new applications • Real-time constraints • I/O intensive • Network intensive • Example: Digital libraries • Enormous amounts of data • Large number of network users • Example: Multimedia • Capture and process large streams of multimedia data • CAVE or visualization clusters

  9. Extensible Message Layers Supporting Resource-Rich Cluster Computers

  10. Problem: Utilizing distributed cluster resources • [Diagram: host CPUs on one side; FPGA, RAID, video capture, and Ethernet devices on the other, with unresolved connections between them] • How is efficient intra-cluster communication provided? • How can applications make use of resources?

  11. Answer: Flexible “Message Layer” Communication Software • Message layers are an enabling technology for clusters • Enable the cluster to function as a single-image multiprocessor system • Current message layers • Optimized for transmissions between host CPUs • Peripheral devices only available in the context of the local host • What is needed • Support efficient communication with host CPUs and peripherals • Ability to harness peripheral devices as a pool of resources

  12. GRIM: An Implementation A message layer for resource-rich clusters

  13. General-purpose Reliable In-order Message Layer (GRIM) • Message layer for resource-rich clusters • Myrinet SAN backbone • Both host CPUs and peripheral devices are endpoints • Communication core implemented in the NI • [Diagram: GRIM core on the network interface card, linking the CPU, FPGA card, and storage card to the system area network]

  14. Per-hop Flow Control • [Diagram: send, ACK, and reply traffic crossing the PCI buses and SAN between sending and receiving endpoints via their network interfaces] • End-to-end flow control necessary for reliable delivery • Prevents buffer overflows in the communication path • Endpoint-managed schemes • Impractical for peripheral devices • Per-hop flow control scheme • Transfer data as soon as the next stage can accept it • Optimistic approach
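A minimal sketch of the per-hop idea in C, with hypothetical names (GRIM's real structures are not shown in the slides): each stage in the path tracks buffer credits for the stage after it, forwards a fragment as soon as a slot is free, and recovers the credit when the downstream stage drains it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed transport primitive (PCI or SAN transfer to the next stage). */
extern void transfer_to_next_stage(const void *frag, size_t len);

typedef struct {
    int credits;    /* free buffer slots at the next stage in the path */
} hop_t;

/* Forward one fragment if the next hop can accept it; otherwise stall
 * here rather than overflow the downstream buffer. */
static bool hop_try_send(hop_t *next, const void *frag, size_t len)
{
    if (next->credits == 0)
        return false;
    next->credits--;
    transfer_to_next_stage(frag, len);
    return true;
}

/* Invoked when the downstream stage acknowledges a drained slot. */
static void hop_return_credit(hop_t *next)
{
    next->credits++;
}
```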

  15. Logical Channels • [Diagram: endpoints 1..n in a host, each bound to a logical channel in the network interface; a scheduler multiplexes the channels onto the network] • Multiple endpoints in a host share the NI • Employ multiple logical channels in the NI • Each endpoint owns one or more logical channels • Logical channel provides a virtual interface to the network
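A sketch of the data structure this implies, with invented field names: the NI holds an array of per-channel queue pairs, each owned by an endpoint, and a scheduler walks them (a round-robin position is assumed here; the slides do not specify the policy).

```c
#define MAX_CHANNELS 16

/* Simplified message queue; the NI-resident layout is not shown in the
 * slides, so this is purely illustrative. */
typedef struct {
    unsigned char buf[4096];
    int head, tail;
} msg_q_t;

typedef struct {
    int     owner_endpoint;   /* endpoint holding this virtual interface */
    msg_q_t send_q, recv_q;   /* per-channel queues into the network */
} logical_channel_t;

typedef struct {
    logical_channel_t channel[MAX_CHANNELS];
    int               next;   /* round-robin scan position (assumed) */
} ni_state_t;
```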

  16. Programming Interfaces: Active Messages • Message specifies a function to be executed at the receiver • Similar to remote procedure calls, but lightweight • Invoke operations at remote resources • Useful for constructing device-specific APIs • Example: Interactions with a remote storage controller via AM_fetch_file() and AM_return_file() • [Diagram: CPU sends AM_fetch_file() across the SAN to a storage controller, which answers with AM_return_file()]
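A minimal sketch of the active-message pattern; everything here (the handler table, am_register, am_dispatch) is illustrative rather than GRIM's actual API. A message carries a handler ID plus payload, and the receiving endpoint runs the named handler, like a lightweight RPC.

```c
#include <stdio.h>
#include <stddef.h>

typedef void (*am_handler_t)(const void *payload, size_t len, int src);

static am_handler_t handler_table[256];     /* handler ID -> function */

static void am_register(int id, am_handler_t fn) { handler_table[id] = fn; }

/* Receiver side: run the handler named in an arriving message. */
static void am_dispatch(int id, const void *payload, size_t len, int src)
{
    if (handler_table[id])
        handler_table[id](payload, len, src);
}

/* Example handler for a remote storage endpoint (hypothetical). */
static void fetch_file_handler(const void *payload, size_t len, int src)
{
    printf("fetch request from endpoint %d: %.*s\n", src, (int)len,
           (const char *)payload);
    /* ...read the file here, then send an AM_return_file()-style reply... */
}

int main(void)
{
    am_register(1, fetch_file_handler);
    am_dispatch(1, "/data/frame42", 13, /*src=*/7);  /* simulated arrival */
    return 0;
}
```

Device-specific APIs such as the storage controller's fetch/return pair fall out of registering a small set of handlers in this style.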

  17. Programming Interfaces: Remote Memory • [Diagram: NI-to-NI transfer moving blocks between the memories of two hosts across the SAN] • Transfer blocks of data from one host to another • Receiving NI executes the transfer directly • Read and write operations • NI interacts with a kernel driver to translate virtual addresses • Optional notification mechanisms
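A sketch of what a remote-memory write request might look like, with invented names; the NI-side implementation is hardware-specific and only declared here. The source posts a descriptor naming a remote virtual address; the receiving NI (with the kernel driver translating addresses) performs the copy without involving the destination CPU.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*notify_fn_t)(int src_endpoint);

/* Hypothetical descriptor a host would post to the NI. */
typedef struct {
    int         dst_endpoint;
    uint64_t    dst_vaddr;     /* translated by the kernel driver */
    const void *src_buf;
    size_t      len;
    notify_fn_t on_complete;   /* optional notification; may be NULL */
} rm_write_t;

/* Hand the descriptor to the NI, which executes the transfer directly. */
extern int rm_post_write(const rm_write_t *req);
```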

  18. Integrating Peripheral Devices Hardware Extensibility

  19. Peripheral Device Overview • In GRIM, peripherals are endpoints • Intelligent peripherals • Operate autonomously • On-card message queues • Process incoming active messages • Eject outgoing active messages • Legacy peripherals • Managed by a host application or by remote memory operations • [Diagram: intelligent and legacy peripheral devices alongside the host CPU and NI]

  20. Peripheral Device Examples • Server adaptor card • Networked host on a PCI card • AM handlers for LAN-SAN bridge • Video capture card • Specialized DMA engine • AM handlers capture data • Video display card • Manipulate frame buffer • Remote memory writes • [Diagram labels: i960, Ethernet, SCSI; A/D converter, DMA, frame buffer; AGP, D/A converter]

  21. Celoxica RC-1000 FPGA Card • [Diagram: PCI card with control and switching logic, the FPGA, and SRAM banks 0-3] • FPGAs provide acceleration • Load with application-specific circuits • Celoxica RC-1000 FPGA card • Xilinx Virtex-1000 FPGA • 8 MB SRAM • Hardware implementation • Endpoint as state machines • AM handlers are circuits

  22. FPGA Endpoint Organization • [Diagram: FPGA card memory holds input queues, output queues, and application data; on the FPGA's circuit canvas, a communication library API, memory API, and user circuit API frame user circuits 1..n]

  23. Example FPGA Configuration • Cryptography configuration • DES, RC6, MD5, and ALU • Occupies 70% of FPGA • Newer FPGAs 8x in size • Operates with 20 MHz clock • Newer FPGAs 6x faster • 4 KB payload => 55 μs (73 MB/s)
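As a sanity check on the quoted throughput (assuming decimal megabytes):

```latex
\frac{4096~\text{bytes}}{55~\mu\text{s}} \approx 74~\text{MB/s},
```

which is in line with the slide's ~73 MB/s figure after rounding.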

  24. Expansion: Sharing the FPGA • FPGA has limited space for hardware circuits • Host reconfigures the FPGA on demand (“function fault”, ~150 ms) • [Diagram: a message requesting Circuit F arrives while Configuration A (Circuits X and Y) is loaded; the host CPU saves circuit state to SRAM 0 and loads Configuration C (Circuits E, F, and G)]
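A minimal sketch of the function-fault idea, with hypothetical helper names: if an arriving message names a circuit absent from the loaded configuration, the host saves circuit state to card SRAM and reloads the FPGA with a configuration that contains it (~150 ms on the hardware described).

```c
#include <stddef.h>

typedef struct {
    int id;
    int circuits[8];
    int ncircuits;
} fpga_config_t;

/* Assumed helpers: saving live circuit state to card SRAM, loading a
 * bitstream that contains a given circuit, and on-card queue delivery. */
extern void save_circuit_state_to_sram(void);
extern const fpga_config_t *load_config_containing(int circuit);
extern void deliver_to_circuit(int circuit, const void *msg, size_t len);

static const fpga_config_t *current;    /* configuration now loaded */

static int config_has(const fpga_config_t *c, int circuit)
{
    for (int i = 0; i < c->ncircuits; i++)
        if (c->circuits[i] == circuit)
            return 1;
    return 0;
}

/* Dispatch a message to a circuit, reconfiguring on a "function fault". */
static void run_on_circuit(int circuit, const void *msg, size_t len)
{
    if (current == NULL || !config_has(current, circuit)) {
        save_circuit_state_to_sram();               /* preserve state */
        current = load_config_containing(circuit);  /* ~150 ms reload */
    }
    deliver_to_circuit(circuit, msg, len);
}
```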

  25. Extension: Streaming Computations Software extensibility

  26. Streaming Computation Overview • Programming method for distributed resources • Establish a pipeline for streaming operations • Example: Multimedia processing • Celoxica RC-1000 FPGA endpoint • [Diagram: video capture feeding media processors on successive hosts across the system area network]

  27. Streaming Fundamentals • Computation: How is a computation performed? • Active message approach • Forwarding: Where are results transmitted? • Programmable forwarding directory • [Diagram: an incoming message (destination: FPGA, forward entry X, AM “Perform FFT”) passes through computational circuits (Circuit 1: FFT … Circuit N: Encrypt); the forwarding directory emits the outgoing message (destination: host, forward entry X, AM “Receive FFT”)]
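A sketch of a programmable forwarding directory, with names that are illustrative rather than GRIM's actual layout: when a circuit finishes, the endpoint looks up the message's forward entry to learn where results go next and which active-message handler runs there, forming one pipeline stage.

```c
#include <stddef.h>

/* Assumed send primitive, as in the active-message sketch above. */
extern void am_send(int endpoint, int handler, const void *buf, size_t len);

/* One entry per stream: the next destination and its handler. */
typedef struct {
    int next_endpoint;
    int next_handler;
} fwd_entry_t;

static fwd_entry_t fwd_dir[64];          /* indexed by forward entry ID */

/* Push a finished result to the next pipeline stage. */
static void emit_result(int entry, const void *result, size_t len)
{
    const fwd_entry_t *f = &fwd_dir[entry];
    am_send(f->next_endpoint, f->next_handler, result, len);
}
```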

  28. Host-to-Host Performance Transferring data between two host-level endpoints

  29. Host-to-Host Communication Performance • Host-to-host transfers are the standard benchmark • Measured for both active messages and remote memory operations • Three phases of data transfer: (1) source host to NI, (2) NI to NI across the SAN, (3) NI to destination host • Injection is the most challenging phase • Overall communication path

  30. Host-NI: Data Injections • Host-NI transfers challenging • Host lacks DMA engine • Multiple transfer methods • Programmed I/O • DMA • Automatically select method • Result: Tunable PCI Injection Library (TPIL) • [Diagram: CPU, cache, and main memory behind the memory controller; the PCI bus leads to a peripheral device with its own PCI DMA engine and memory]
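A minimal sketch of the idea behind TPIL, with hypothetical names and a made-up threshold: small injections go by programmed I/O (the CPU writes words across PCI, low startup cost), large ones by the NI's DMA engine (higher bandwidth, overlapped with the CPU). The real library presumably tunes the crossover per host/NI pairing rather than hard-coding it.

```c
#include <stddef.h>
#include <stdint.h>

#define PIO_DMA_CROSSOVER 1024            /* bytes; illustrative only */

/* Assumed primitives: CPU-driven copy into the NI's PCI window, and
 * posting a descriptor to the NI's DMA engine. */
extern void pio_copy(volatile void *ni_window, const void *src, size_t n);
extern void dma_post(uint64_t src_phys, uint64_t ni_addr, size_t n);

static void tpil_inject(volatile void *ni_window, uint64_t ni_addr,
                        const void *buf, uint64_t buf_phys, size_t len)
{
    if (len <= PIO_DMA_CROSSOVER)
        pio_copy(ni_window, buf, len);     /* low startup cost */
    else
        dma_post(buf_phys, ni_addr, len);  /* high bandwidth */
}
```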

  31. TPIL Performance • [Chart: bandwidth (MBytes/s) versus injection size (bytes) for a LANai 9 NI with a 550 MHz Pentium III host]

  32. Overall Communication Pipeline • [Diagram: timelines of messages 1-3 in the sending host-NI, NI-NI, and receiving NI-host phases; fragmentation and cut-through shorten the overall transmission time] • Three phases of transmission • Optimization: Use fragmentation to increase utilization • Optimization: Allow cut-through transmissions
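A sketch of the fragmentation optimization, with hypothetical names: splitting a message into fragments lets the host-NI, NI-NI, and NI-host phases overlap, so all three pipeline stages stay busy instead of handling one whole message at a time.

```c
#include <stddef.h>

#define FRAG_SIZE 4096                     /* illustrative fragment size */

/* Assumed primitive that starts a fragment down the pipeline. */
extern void inject_fragment(const char *frag, size_t len, int is_last);

static void send_fragmented(const char *msg, size_t len)
{
    size_t off = 0;
    while (off < len) {
        size_t n = (len - off < FRAG_SIZE) ? (len - off) : FRAG_SIZE;
        /* Each fragment enters the pipeline immediately; with
         * cut-through, downstream stages can forward it before later
         * fragments have even left the host. */
        inject_fragment(msg + off, n, off + n == len);
        off += n;
    }
}
```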

  33. Overall Host-to-Host Performance • [Chart: bandwidth (MBytes/s) versus message size (bytes)]

  34. Comparison to Existing Message Layers • [Charts: bandwidth (MB/s) and latency (μs) for GRIM versus existing message layers]

  35. Concluding Remarks

  36. Key Contributions • Framework for communication in resource-rich clusters • Reliable delivery mechanisms, virtualized network interface, and flexible programming interfaces • Comparable performance to state-of-the-art message layers • Extensible for peripheral devices • Suitable for intelligent and legacy peripherals • Methods for managing card resources • Extensible for higher-level programming abstractions • Endpoint-level: Streaming computations and sockets emulation • NI-level: Multicast support

  37. Future Directions • Continued work with GRIM • Video card vendors opening cards to developers • Myrinet-connected embedded devices • Adaptation to other network substrates • Gigabit Ethernet appealing because of cost • Modification to transmission protocols • InfiniBand technology promising • Active system area networks • FPGA chips beginning to feature gigabit transceivers • Use FPGA chips as networked processing devices

  38. Additional Research Projects

  39. Wireless Sensor Networks • NASA JPL Research • In-situ WSNs • Exploration of Mars • Communication • Self organization • Routing • SensorSim • Java simulator • Evaluate protocols

  40. PeZ: Pole-Zero Editor for MATLAB

  41. Related Publications • A Tunable Communications Library for Data Injection, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2002. • Active SANs: Hardware Support for Integrating Computation and Communication, C. Ulmer, C. Wood, and S. Yalamanchili, Proceedings of the Workshop on Novel Uses of System Area Networks at HPCA, 2002. • A Messaging Layer for Heterogeneous Endpoints in Resource Rich Clusters, C. Ulmer and S. Yalamanchili, Proceedings of the First Myrinet User Group Conference, 2000. • An Extensible Message Layer for High-Performance Clusters, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2000. Papers and Software Available at http://www.CraigUlmer.com/research

  42. Backup Slides

  43. Performance: FPGA Computations • Clock speed: 20 MHz • Operation latency: 55 μs (4 KB => 73 MB/s) • Clocks per stage: acquire SRAM (8), detect new message (4), fetch header (7), lookup forwarding (5), fetch payload (1024), computation (1024), store results (1024), store header (16), update queues (3), release SRAM (1)

  44. [Diagram: FPGA endpoint datapath: control/status port; SRAM 0 (incoming queues), SRAM 1 (user page 0), SRAM 2 (user page 1), SRAM 3 (outgoing queues); fetch/decode unit; scratchpad controllers; message generator with results cache; ports A, B, and C; built-in ALU ops]

  45. Expansion: Sharing On-Card Memory • Limited on-card memory for storing application data • Construct a virtual memory system for on-card memory • Swap space is host memory • [Diagram: user circuits reference user pages; a page fault makes the host CPU swap page frames between host memory and SRAM 1 / SRAM 2 on the FPGA]
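A minimal sketch of paging for on-card SRAM, with hypothetical names: user circuits address "user pages"; a small table maps pages to SRAM page frames, and on a page fault the host evicts a frame to host memory (the swap space) and loads the requested page in its place.

```c
#include <stddef.h>

#define NFRAMES 2               /* e.g. SRAM 1 and SRAM 2 on the RC-1000 */

typedef struct { int page; int valid; } frame_t;
static frame_t frames[NFRAMES];

/* Assumed helpers for moving frames over PCI. */
extern void copy_frame_to_host(int frame);        /* evict to swap space */
extern void copy_page_from_host(int page, int frame);

/* Return the frame holding `page`, faulting it in if necessary. */
static int touch_page(int page)
{
    for (int f = 0; f < NFRAMES; f++)
        if (frames[f].valid && frames[f].page == page)
            return f;                      /* hit */
    int victim = 0;                        /* trivial replacement policy */
    if (frames[victim].valid)
        copy_frame_to_host(victim);        /* page fault: evict */
    copy_page_from_host(page, victim);     /* then load */
    frames[victim] = (frame_t){ .page = page, .valid = 1 };
    return victim;
}
```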

  46. RC-1000 Challenges • Hardware implementation • Queue state machines • Memory locking • SRAM is single-ported • Must arbitrate for use • CPU / NI contention • NI manages the FPGA lock • [Diagram: CPU and NI contending for the memory lock guarding the FPGA's user circuits and SRAM]

  47. Example: Autonomous Spaceborne Clusters • NASA Remote Exploration and Experimentation • Spaceborne vehicle processes data locally • Clusters in the sky • Number of peripheral devices • Data sensors • FPGA & DSPs • Adaptive hardware • Modify functionality after deployment

  48. Performance: Card Interactions • Acquire FPGA SRAM • CPU-NI: 20 μs • NI: 8 μs • Inject 4 KB message to FPGA • CPU: 58 μs (70 MB/s) • NI: 32 μs (128 MB/s) • Release FPGA SRAM • CPU-NI: 8 μs • NI: 5 μs • [Diagram: CPU and NI arbitrating for the memory lock over the FPGA's user circuits and SRAM]

  49. Example: Digital Libraries • Enormous amount of data and users • Intelligent LAN and storage cards to manage requests • [Diagram: clients connect through intelligent LAN adaptors; storage adaptors partition files (A-H, I-R, S-Z) across the SAN backbone]

  50. Cyclone Systems I2O Server Adaptor Card • Networked host on a PCI card • Integration with GRIM • Interact directly with the NI • Ported host-level endpoint software • Utilized as a LAN-SAN bridge • [Diagram: i960 Rx processor with ROM, DRAM, DMA engines, SCSI, and dual 10/100 Ethernet on a daughter card; primary and secondary PCI interfaces to the host system]
