High-Performance Computing from Smart Phone, Multi-core CPU to Graphics Processing Unit

Jih-Kwon Peir, 裴季鯤, University of Florida, June 11, 2014. Outline: Organization of a PC system • Multi-core Processor Organization • Intel cores • Processors in Smart Phones • iPhone, Galaxy, HTC


Presentation Transcript

  1. High-Performance Computing from Smart Phone, Multi-core CPU to Graphics Processing Unit • Jih-Kwon Peir, 裴季鯤, University of Florida • June 11, 2014

  2. Outline • Organization of a PC system • Multi-core Processor Organization • Intel cores • Processors in Smart Phones • iPhone, Galaxy, HTC • Processors in Game Consoles • PS4, Xbox One • Graphics Processing Unit (GPU) • Nvidia, AMD 06/11/2014 SWUFE

  3. Basic PC Organization • Processor

  4. Today’s Typical PC with GPU • CPU (host) • GPU with local DRAM (device)

  5. Processor Is Everywhere • Desktop, Laptop: Intel, AMD, Others • Graphics Processing Unit (GPU): Nvidia, AMD, etc. • Smart Phone: iPhone, Galaxy, HTC, etc. • Tablet: iPad, Android Tablets • Game Console: PS4, Xbox One • Clusters, Warehouse-scale Server: Google search engine, Cloud Computing, Data Center Server • Embedded Systems: ARM, MIPS, Others

  6. What is Computer Architecture (Organization)? • Functional operation of the individual HW units within a computer system, and the flow of information and control among them • [Diagram: Computer Architecture (interface design (ISA), hardware organization) at the center, linked to technology, parallelism, programming languages, OS, applications, and measurement & evaluation]

  7. Moving to Multicore (CMP) • Old conventional wisdom (CW): uniprocessor performance 2X / 1.5 yrs • New CW: Power Wall + ILP Wall + Memory Wall = New Brick Wall → uniprocessor performance now 2X / 5(?) yrs → sea change in chip design: multiple “cores” (2X processors per chip / ~2 years) • Simpler processors, more power efficient • Exploit thread-level parallelism (TLP) and data-level parallelism (DLP), not instruction-level parallelism (ILP) • How to use it: programmer / compiler involvement
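To put the two doubling rates on this slide side by side, here is a small Python sketch. The helper names (`annual_growth`, `speedup_after`) are made up for illustration, and the 5-year figure is the slide's own rough estimate, so the results are only approximate.

```python
def annual_growth(doubling_years: float) -> float:
    """Per-year improvement factor if performance doubles every `doubling_years`."""
    return 2.0 ** (1.0 / doubling_years)


def speedup_after(years: float, doubling_years: float) -> float:
    """Cumulative speedup after `years` at the given doubling period."""
    return 2.0 ** (years / doubling_years)


if __name__ == "__main__":
    # Doubling periods taken from the slide: 1.5 years (old CW), ~5 years (new CW).
    for label, period in [("old CW (2X / 1.5 yrs)", 1.5), ("new CW (2X / 5 yrs)", 5.0)]:
        pct = (annual_growth(period) - 1.0) * 100.0
        print(f"{label}: about {pct:.0f}% per year, "
              f"{speedup_after(10, period):.0f}X over a decade")
```

Under these assumptions the old trend works out to roughly 59% per year against about 15% per year for the new one, or on the order of 100X versus 4X over ten years.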

  8. Uniprocessor Performance • Constrained by power, instruction-level parallelism, memory latency

  9. Intel Processor Architecture • Intel processor architecture and technology map (2012) • Nehalem – 2008, 45nm, quad cores, 3 memory channels, early i7 • Westmere – 2009, 32nm, upgraded process technology • Sandy Bridge-E – launched Nov. 2011, 32nm, 6 cores, X79 platform, LGA2011 socket, 4 memory channels, 51.2 GB/sec • Ivy Bridge – 22nm, Mar/April 2012 • Ivy Bridge-E – 22nm, targeted for 4Q 2012; compatible with today's Intel X79 platform and LGA2011 socket • Intel has used the "tick-tock" method of processor design for several generations • The "tock" is a new microarchitecture • The "tick" is an upgraded process technology • [Roadmap figure labels: Ivy Bridge-E 22nm, Ivy Bridge 22nm (tick), Haswell (tock), Broadwell]

  10. Intel Processor Architecture – Nehalem • Dramatic architecture change: removal of the front-side bus • Introduced in 2008 with a new processor, new CPU socket, new memory architecture, new chipset, new motherboards, and new overclocking methods • On-die QPI links, three DDR channels, and a large (8MB) L3 cache; used in Intel Core i7, configured with 1-8 cores • New SSE 4.2, better branch prediction and prefetch, SMT (2 threads/core) • QPI has a bandwidth of 12.8GB/s in each direction simultaneously, for a combined bi-directional bandwidth of 25.6GB/s; it handles multiple PCI-E links through the 5520 IOH and allows flexible configurations • Three DDR channels give higher memory bandwidth
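The QPI figures on this slide can be reproduced with a short calculation. This is a sketch under two assumptions that are not on the slide: a link speed of 6.4 GT/s and 2 bytes (16 bits) of data payload per transfer. The function name `qpi_bandwidth_gb_s` is made up for illustration.

```python
QPI_TRANSFER_RATE_GT_S = 6.4  # assumption: 6.4 giga-transfers per second
QPI_BYTES_PER_TRANSFER = 2    # assumption: 16 bits of data payload per transfer


def qpi_bandwidth_gb_s(rate_gt_s: float = QPI_TRANSFER_RATE_GT_S,
                       bytes_per_transfer: int = QPI_BYTES_PER_TRANSFER):
    """Return (per-direction, combined bi-directional) bandwidth in GB/s."""
    per_direction = rate_gt_s * bytes_per_transfer
    # Both directions can transfer at the same time, so the combined figure doubles.
    return per_direction, 2 * per_direction


if __name__ == "__main__":
    one_way, both_ways = qpi_bandwidth_gb_s()
    print(f"per direction: {one_way:.1f} GB/s, combined: {both_ways:.1f} GB/s")
```

With those inputs the result matches the 12.8GB/s per direction and 25.6GB/s combined quoted on the slide.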

  11. Intel Core i7 – Sandy Bridge-E • 6 cores • Large L3: 15MB • 4 memory channels: 51GB/sec
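The 51GB/sec figure is consistent with a simple peak-bandwidth estimate. The sketch below assumes DDR3-1600 memory (1600 mega-transfers per second) and 64-bit (8-byte) channels; neither value appears on the slide, and `peak_dram_bandwidth_gb_s` is a hypothetical helper name.

```python
def peak_dram_bandwidth_gb_s(channels: int,
                             mega_transfers_per_s: float,
                             bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels x transfers/s x bytes/transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


if __name__ == "__main__":
    # Assumption: four channels of DDR3-1600, each 64 bits (8 bytes) wide.
    print(f"{peak_dram_bandwidth_gb_s(4, 1600):.1f} GB/s")
```

This is a theoretical peak; sustained bandwidth in real workloads is lower.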

  12. Intel Devil’s Canyon (Haswell), first 4GHz CPU, plus the 20th-anniversary Pentium processor at Computex (6/3/14) • 4th generation cores

  13. Intel Devil’s Canyon Family

  14. Early iPhone • iPhone 3G: Samsung ARM11 processor running at 412MHz • iPhone 3GS: Samsung ARM Cortex-A8, 600MHz • iPhone 4: Apple A4 (S5L8930), 750-800MHz (ARM-based ISA) • iPhone 4S (and iPad 2): Apple A5 (S5L8940), 1GHz, dual-core, SoC • iPad 3: Apple A5X (S5L8945), 1GHz, dual-core, SoC • iPhone 5: Apple A6 (S5L8950), 1GB RAM, SoC (with SGX543 GPU variant), speed and core count unknown (announced 9/12/2012)

  15. iPhone 5, iPhone 5C, iPhone 5S

  16. iPhone 6 • Left to right: iPhone 3G, iPhone 4, iPhone 5, iPhone 6 mockup (4.7”, with a 5.5” model to follow later), Retina iPad mini • iPhone 6: 64-bit, 20-nanometer A8 chip from TSMC (a departure from Samsung) • The A8 is rumored to have both a quad-core 64-bit processor and quad-core graphics; it may have 2GB of RAM; 16, 32, 64GB, and a whopping 128GB of flash storage; runs iOS 8 • Series 6XT PowerVR GPUs offer a 50% benchmark performance increase over previous chips, good for gaming • Camera: debatable, but expected to have a higher pixel count than current iPhones or iPads; primary > 8 megapixels, secondary > 1.2 megapixels • LOOK OUT: iPhone 7 is coming with an A9 processor!!

  17. Galaxy S5 (May 2014) vs. iPhone 5S (Sep. 2013) • Processor: Galaxy S5 – quad-core 2.5GHz 32-bit Krait 400 processor, Qualcomm MSM8974AC Snapdragon 801 chipset, 2GB RAM, 16-32GB flash; iPhone 5S – dual-core Apple A7 64-bit 1.4GHz, 1GB RAM, 16-64GB flash • OS: Android 4.4.2 vs. iOS 7 (more efficient) • Camera: Galaxy S5 – 16-megapixel ISOCELL sensor, 2MP front camera; iPhone 5S – 8-megapixel, 1.2MP front camera • Dimensions: Galaxy S5 – 142 x 72.5 x 8.1mm, 145g; iPhone 5S – 123.8 x 58.6 x 7.6mm, 112g • Screen: Galaxy S5 – Super AMOLED, 5.1-inch, 1080p resolution; iPhone 5S – IPS LCD, 4-inch, 1,136 x 640 resolution

  18. Qualcomm Snapdragon 805 • Snapdragon 801 in Galaxy S5, HTC One M8, Sony Xperia Z2

  19. Near Field Communication (NFC) • NFC is a standard for smartphones and similar devices to establish radio communication by touching them together or bringing them within a few inches; used in Android and Windows Phone devices (not in iPhone yet) • NFC standards cover communication protocols and data exchange formats, and are based on existing radio-frequency identification (RFID) standards; low power, shorter range than Bluetooth • Commerce: contactless payment systems, e.g. Google Wallet, similar to credit cards and other smartcards • Intel Core processor has a built-in NXP PN544PC NFC RFID reader chip that captures the credit card's ID number and transmits it encrypted to the merchant and to MasterCard's MasterPass e-wallet • Debate: 'mobile wallet' by NFC vs. 'digital wallet' by PayPal. PayPal promotes a 'digital wallet' in the cloud, available not only on mobile phones but on a variety of devices: laptop, iPad, ultrabook, or Xbox • Communication: Android Beam, Jelly Bean, and S-Beam use NFC to set up a connection, and can also connect over Bluetooth and Wi-Fi • Social networking: sharing contacts, photos, videos, and files, and entering multiplayer mobile games

  20. Gaming Console: Xbox One vs. PS4 • CPU: AMD 8-core Jaguar CPU; Xbox One runs at 1.75GHz, PS4 at 1.6GHz • GPU and RAM: Xbox One – comparable to AMD Radeon HD 7000-series, 8GB DDR3 RAM and 32MB eSRAM; PS4 – comparable to Radeon HD 7000-series, 8GB GDDR5 RAM • Memory bandwidth: PS4 has 176GB/second, compared to 68GB/second for the Xbox One. GDDR5 is currently only available in 512MB chips, so the console needs a whopping 16 of them • Performance: PS4 performs 50% better due to its GDDR5 RAM (CPU does not matter much). Microsoft increased the GPU speed from 800MHz to 853MHz, but the effect is limited
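The numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes a 256-bit memory bus on both consoles, 5500 MT/s for the PS4's GDDR5, and 2133 MT/s for the Xbox One's DDR3; these bus widths and data rates are not on the slide, and the helper names are hypothetical.

```python
def memory_bandwidth_gb_s(bus_bits: int, mega_transfers_per_s: float) -> float:
    """Peak bandwidth in GB/s: bus width in bytes x transfers per second."""
    return (bus_bits / 8) * mega_transfers_per_s / 1000.0


def chips_needed(total_mb: int, chip_mb: int) -> int:
    """Number of memory chips needed to reach the total capacity (rounded up)."""
    return -(-total_mb // chip_mb)


# Assumed bus widths and data rates (not stated on the slide).
ps4_bw = memory_bandwidth_gb_s(256, 5500)    # GDDR5
xbox_bw = memory_bandwidth_gb_s(256, 2133)   # DDR3

# Values taken from the slide: 8GB of RAM in 512MB chips, 800MHz -> 853MHz GPU clock.
gddr5_chips = chips_needed(8 * 1024, 512)
gpu_clock_gain = 853 / 800 - 1.0

if __name__ == "__main__":
    print(f"PS4: {ps4_bw:.0f} GB/s, Xbox One: {xbox_bw:.0f} GB/s")
    print(f"GDDR5 chips: {gddr5_chips}, GPU clock gain: {gpu_clock_gain:.1%}")
```

The clock bump works out to under 7%, which fits the slide's remark that its effect is limited.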

  21. AMD Radeon HD 7000-series

  22. Comparison: GPU vs Multicore CPU • Difference in utilizing on-chip transistors: • CPU has significant cache space and control logic for general-purpose applications • GPU builds a large number of replicated cores for data-parallel, thread-parallel computations • New APU chip combines both! • AMD Radeon HD 7000-series

  23. Nvidia Fermi Graphics Processors - GTX580 • 16 SMs, 32 cores / SM • 512 cores • 3 billion transistors • 768KB shared L2 cache for SMs (new) • 6 DRAM channels • Host interface • GigaThread scheduler
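The core count follows directly from the SM organization, and a rough peak-throughput estimate can be derived from it. In the sketch below the 1.544GHz shader clock and the two floating-point operations per core per cycle (one fused multiply-add) are assumptions that are not on the slide; the function names are made up for illustration.

```python
def total_cores(num_sms: int, cores_per_sm: int) -> int:
    """Total CUDA cores on the chip."""
    return num_sms * cores_per_sm


def peak_sp_gflops(cores: int, shader_clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Peak single-precision GFLOPS, counting one fused multiply-add as 2 FLOPs."""
    return cores * flops_per_cycle * shader_clock_ghz


if __name__ == "__main__":
    cores = total_cores(16, 32)  # 16 SMs x 32 cores/SM, from the slide
    # Assumption: 1.544 GHz shader clock for the GTX 580.
    print(f"{cores} cores, about {peak_sp_gflops(cores, 1.544):.0f} GFLOPS peak")
```

This is a theoretical peak that assumes every core issues a fused multiply-add on every cycle.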

  24. Streaming Multiprocessor in Fermi • Two warp schedulers and dispatch units • 128KB register file (32K 32-bit registers) • Shared instruction cache • 64KB split between L1 data cache and shared (local) memory • Two sets of ALUs, 16 cores each (2-cycle initiation latency) • One set of 16 LD/ST units (2-cycle latency) • Four SFUs (8-cycle latency) • Separate INT and FP units in each core
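The cycle counts on this slide fall out of dividing a warp across the available units. The sketch below assumes a warp of 32 threads (the usual CUDA warp size, which the slide does not state) and one thread per unit per cycle; `issue_cycles` is a hypothetical helper name.

```python
import math

WARP_SIZE = 32  # assumption: 32 threads per warp


def issue_cycles(units: int, warp_size: int = WARP_SIZE) -> int:
    """Cycles needed to push one warp through `units` parallel lanes."""
    return math.ceil(warp_size / units)


if __name__ == "__main__":
    # Unit counts taken from the slide.
    for name, units in [("ALU set (16 cores)", 16), ("LD/ST (16 units)", 16), ("SFU (4 units)", 4)]:
        print(f"{name}: {issue_cycles(units)} cycles per warp")
```

Under these assumptions the 16-wide ALU and LD/ST groups take 2 cycles per warp and the four SFUs take 8, matching the latencies listed on the slide.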

  25. Comparison: Kepler vs Fermi

  26. CUDA Programming Model • Host invokes kernels (grids) to execute on the GPU, then control returns to the host • Three-level parallelism: Grid, Block, Thread • [Figure: application timeline alternating host execution with kernel 0 (Block 0, Block 1, Block 2, Block 3, ...) and kernel 1 (Block 0 ... Block N), each block made up of threads]
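The grid/block/thread hierarchy can be illustrated without a GPU. The following is a plain-Python emulation, not CUDA code and not Nvidia's API: `launch` and `vec_add` are hypothetical names, and the loops run sequentially where a real GPU would run blocks and threads in parallel. It only shows how each thread derives its global index from its block and thread IDs.

```python
def launch(kernel, grid_dim: int, block_dim: int, *args) -> None:
    """Emulate a 1-D kernel launch: call the kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):          # grid level: all blocks
        for thread_idx in range(block_dim):    # block level: all threads in a block
            kernel(block_idx, block_dim, thread_idx, *args)


def vec_add(block_idx: int, block_dim: int, thread_idx: int, a, b, out) -> None:
    """Per-thread work: each thread adds one pair of elements."""
    i = block_idx * block_dim + thread_idx     # global thread index
    if i < len(out):                           # guard threads past the end of the data
        out[i] = a[i] + b[i]


if __name__ == "__main__":
    n, block_dim = 10, 4
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
    a = list(range(n))
    b = [10 * x for x in a]
    out = [0] * n
    launch(vec_add, grid_dim, block_dim, a, b, out)  # "host" invokes the kernel
    print(out)
```

The bounds check matters because the grid is rounded up to whole blocks, so the last block can contain threads with no element to process.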

  27. Intel Clusters – Xeon Phi (Knights Corner) 60

  28. Summary • High-performance, low-power, multi-core processors are everywhere • Smartphone, Tablet, PC, Game Console, Graphics, Communication, Security, Data-Center Cluster, Supercomputer, etc. • Personal electronic devices, home appliances, cars and transportation vehicles, medical equipment, e-commerce, e-banking, security, cloud computing, etc. • Important to have a basic understanding of processors!!

  29. Thank You! Questions?
