
Climate Machine Update

David Donofrio, RAMP Retreat, 8/20/2008


Presentation Transcript


  1. Climate Machine Update. David Donofrio, RAMP Retreat, 8/20/2008

  2. Agenda
     • Project Overview
     • Tensilica Architecture and Design Flow
     • Tensilica Tools Demo
     • Why we need RAMP
     • Current Progress
     • Next Steps

  3. A New Approach to HPC
     • Current HPC design approach:
       • Leverage commodity processors from Intel, AMD, etc.
       • Once the machine is built, optimize problems to run on it
     • The power wall prevents scaling to exaflop performance
     • Power is the new design point
     Figure (Olukotun and Sutter): Moore's Law is still in effect, but the number of processors doubles every 18 months rather than the clock rate.

  4. A New Approach to HPC
     • Our approach:
       • Identify the application, then tailor the machine using semi-custom design
       • Optimize the CPU architecture and further extend it with a semi-custom ISA
       • Leverage auto-tuning to access architecture-specific optimizations
     • Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient
     • Learn from the embedded market, where Flops/Watt and rapid design cycles are crucial
     • Start with building blocks from embedded designs rather than a full-custom ASIC
     • Preserve the ability to run general-purpose C code
     • Application target: 1 km scale climate model
     Tailor machine architecture to application to reduce waste.
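The 1/4 and 100x figures on slide 4 imply a per-core power budget, which a few lines of arithmetic make explicit. This is only a back-of-the-envelope reading of those two numbers: the function name and the normalization (a complex core has performance 1 and power 1) are assumptions for illustration, and perfect parallel scaling across the simple cores is assumed.

```python
from fractions import Fraction

def implied_simple_core(perf_ratio, efficiency_gain):
    """Return (power per simple core, simple cores per complex-core power budget).

    perf_ratio: simple-core performance relative to a complex core (slide: 1/4).
    efficiency_gain: performance-per-watt advantage of the simple core (slide: 100x).
    Both are normalized to a complex core with performance 1 and power 1 (assumption).
    """
    # efficiency = performance / power, so power = performance / efficiency
    power_ratio = Fraction(perf_ratio) / Fraction(efficiency_gain)
    cores_in_same_power = 1 / power_ratio
    return power_ratio, cores_in_same_power

power_ratio, cores = implied_simple_core(Fraction(1, 4), 100)
# Aggregate throughput of those cores relative to one complex core,
# assuming the workload scales perfectly across them.
aggregate_perf = cores * Fraction(1, 4)
print(power_ratio, cores, aggregate_perf)
```

Under this reading, each simple core draws 1/400 of the complex core's power, so roughly 400 of them fit in the same power budget, which is consistent with "hundreds on a single die".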

  5. Climate Model Resource Requirements
     • DOE has identified high-resolution climate modeling as a leading justification for exascale computing
     • Must express 20M-way parallelism
     • Requires performance of 200 Pflops peak
     • Simulation must run 1000x faster than real time
     • Amenable to massively concurrent architectures composed of power-efficient embedded cores
     • Actively working with the climate science community to enable a new icosahedral model
     Image credits: NASA; Randall / CSU
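The requirements on slide 5 can be combined into two derived numbers: the peak rate each parallel thread of work would need, and the wall-clock time per simulated year. This is a minimal sketch using only the slide's figures; the constant names and the 365-day year are assumptions made for the example.

```python
PEAK_FLOPS = 200 * 10**15      # 200 Pflops peak (from the slide)
PARALLELISM = 20 * 10**6       # 20M-way parallelism (from the slide)
SPEEDUP_VS_REAL_TIME = 1000    # simulation runs 1000x faster than real time

# Peak rate each of the 20M parallel threads of work would have to reach, in flop/s.
flops_per_thread = PEAK_FLOPS // PARALLELISM

def wall_clock_hours(simulated_days):
    """Wall-clock hours needed to simulate the given number of days."""
    return simulated_days * 24 / SPEEDUP_VS_REAL_TIME

print(f"{flops_per_thread / 1e9:.0f} Gflop/s peak per thread")
print(f"{wall_clock_hours(365):.2f} wall-clock hours per simulated year")
```

The per-thread figure is a peak rate obtained by even division; it says nothing about sustained performance or how the work is actually distributed.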

  6. Tensilica Processor Design Flow
     • Complete solution: hardware, software and verification
     • Fully customizable
     • Required base ISA ensures general-purpose applications still run
     • Processor configuration is submitted to Tensilica's servers, where synthesis is performed
     • Returned design can be spun for ASIC or FPGA
     • Bit file available for Avnet boards
     • Building-block approach drastically reduces design cycle time compared to full-custom design
     Image credit: Tensilica Inc.

  7. Tensilica Architecture Features
     • Verilog-like TIE language allows for custom ISA extensions
       • Functional and performance verification built in
       • Auto-generated compiler intrinsics
       • 64-bit IEEE double-precision floating point coded up in TIE and available
       • Custom VLIW support
     • Inter-processor communication easily enabled through:
       • TIE Ports
       • TIE Queues
         • Access to direct HW support for inter-processor communication
       • TIE Lookups
         • Allow interfacing to external ROMs or other RTL blocks

  8. Tensilica Architecture Overview (figure; image credit: Tensilica Inc.)

  9. Tensilica Performance Debug
     • Processor viewed as a black box
       • State can be compressed (via HW) and pushed out the JTAG port
       • Intended for program replay
     • Xtensa trace port gives real-time visibility into internal pipeline state in great detail:
       • Cache hit / miss with virtual address
       • Branch taken / not taken
       • Call / return
       • Resource dependency
       • Etc.
     • Opportunity for hundreds of performance counters to be made available
     Image credit: Tensilica Inc.

  10. Tensilica Tools Demo

  11. Why we need RAMP
     • Fast, accurate emulation enables:
       • Dual nested loop of HW / SW co-design
       • Preliminary work using Stanford SM sim shows significant improvement in power efficiency using automated HW/SW co-tuning
     • RAMP is critical to accelerate:
       • Rapid prototyping and analysis of Tensilica architectural options
       • Inter-processor communication architecture exploration
       • Running the FULL climate code, providing a more complete performance picture
     • Cycle-accurate simulator currently runs at ~100 kHz vs. 50 MHz on V5 (Virtex 5)
     • Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed
     Tensilica-provided emulation environment kick-starts this effort.
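The "dual nested loop of HW / SW co-design" on slide 11 can be pictured as an outer search over hardware configurations with an inner auto-tuning search over software parameters. The sketch below is purely illustrative: the design space, the tile sizes, and both cost models are made up for the example and do not describe Tensilica options or the Stanford simulator results.

```python
from itertools import product

# Hypothetical hardware design space: (data cache size in KB, SIMD width).
HW_CONFIGS = list(product([16, 32, 64], [1, 2, 4]))
# Hypothetical software tuning space: square tile sizes an auto-tuner might try.
TILE_SIZES = [8, 16, 32, 64]

def toy_runtime(cache_kb, simd_width, tile):
    """Made-up runtime model: tiles that overflow the cache pay a 4x penalty."""
    working_set_kb = tile * tile * 8 / 1024   # one tile of 8-byte doubles
    miss_penalty = 4.0 if working_set_kb > cache_kb else 1.0
    loop_overhead = 1.0 + 8.0 / tile          # small tiles spend more time on loop control
    return miss_penalty * loop_overhead / simd_width

def toy_power(cache_kb, simd_width):
    """Made-up power model: larger caches and wider SIMD units draw more power."""
    return 0.05 + 0.002 * cache_kb + 0.03 * simd_width

def co_tune():
    """Outer loop over hardware; inner loop auto-tunes software; minimize energy."""
    best = None
    for cache_kb, simd_width in HW_CONFIGS:
        # Inner loop: pick the fastest tile size for this hardware point.
        tile = min(TILE_SIZES, key=lambda t: toy_runtime(cache_kb, simd_width, t))
        runtime = toy_runtime(cache_kb, simd_width, tile)
        energy = runtime * toy_power(cache_kb, simd_width)
        if best is None or energy < best["energy"]:
            best = {"cache_kb": cache_kb, "simd_width": simd_width,
                    "tile": tile, "energy": energy}
    return best

best = co_tune()
print(best)
```

In a real flow each evaluation of the inner loop is a full run of the tuned code on a candidate design, which is why the speed of the emulation platform determines how much of the design space can be explored.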

  12. Current Status
     • ML505 used for initial design exploration
       • Basic Xtensa processor plus JTAG and memory controller is ~50% of a Virtex 5 50T
       • Runs at 50 MHz
       • ASIC in 65G process runs at 650 MHz
     • On-chip debug working
       • Can load / run programs using main memory synthesized from BRAM
     • DRAM interface coded, currently being debugged
     • RTL license recently obtained; full simulation environment (in ModelSim) being brought up
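The clock rates quoted on slides 11 and 12 (~100 kHz for the cycle-accurate simulator, 50 MHz on the FPGA, 650 MHz for the ASIC) translate directly into relative run times. This is a small sketch of that arithmetic; the constant and function names are assumptions, and the simulator rate is only approximate in the source.

```python
SIM_HZ = 100_000          # cycle-accurate simulator, ~100 kHz (approximate)
FPGA_HZ = 50_000_000      # Xtensa on the Virtex 5 FPGA, 50 MHz
ASIC_HZ = 650_000_000     # ASIC in the 65G process, 650 MHz

def seconds_to_run(cycles, clock_hz):
    """Time to execute a fixed number of processor cycles at a given clock rate."""
    return cycles / clock_hz

fpga_vs_sim = FPGA_HZ // SIM_HZ      # how much faster the FPGA is than the simulator
asic_vs_fpga = ASIC_HZ // FPGA_HZ    # how much faster the ASIC is than the FPGA

# Cycles the ASIC would execute in one second of real time.
one_asic_second = ASIC_HZ
print(fpga_vs_sim, asic_vs_fpga)
print(seconds_to_run(one_asic_second, FPGA_HZ), "s on the FPGA")
print(seconds_to_run(one_asic_second, SIM_HZ), "s in the simulator")
```

This compares raw clock rates only and assumes each platform executes the same number of cycles for the same work.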

  13. Next Steps
     • Transition to BEE3 from ML505
     • Bring up XTOS environment on a single Xtensa processor on BEE3
     • Run a single column of climate code on a single processor
       • Demo at SC'08 in November
     • Continue HW / SW co-tuning optimization
     • Begin multi-processor emulation
       • Emulation of a single socket, 32 cores, using networked BEE3s
       • Running the full 2-million-line climate model

  14. Backup

  15. The Need for Exascale Computing
     • DOE has identified high-resolution climate modeling as a leading justification for exascale computing
     • 1 km resolution targeted for an accurate cloud-resolving model
     • Difficult to scale existing systems
       • HPC design using commodity processors estimated to draw 179 MW
       • BlueGene design estimated to draw 20 MW
     • By leveraging embedded cores and a more application-specific design, a power envelope of 3-5 MW is projected
     Figure label: Icosahedral (credit: Randall / CSU)
     LBNL will seek an external vendor to build the machine if our approach is proven valid. LBNL is not entering the commercial HPC market.
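The three power estimates on slide 15 are easiest to compare as ratios. The sketch below computes, for each pair of designs, the range of factors by which one estimate exceeds the other; the dictionary keys and function name are assumptions, and the numbers are the slide's estimates, not measurements.

```python
# Estimated power draw in MW as (low, high); single-valued estimates repeat the value.
ESTIMATES_MW = {
    "commodity": (179, 179),
    "BlueGene": (20, 20),
    "embedded": (3, 5),
}

def ratio_range(baseline, target):
    """Return (min, max) factor by which the baseline estimate exceeds the target."""
    b_lo, b_hi = ESTIMATES_MW[baseline]
    t_lo, t_hi = ESTIMATES_MW[target]
    return b_lo / t_hi, b_hi / t_lo

for name in ("commodity", "BlueGene"):
    lo, hi = ratio_range(name, "embedded")
    print(f"{name}: {lo:.1f}x to {hi:.1f}x the projected embedded-core power")
```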
