FLASH MEMORY FOR FULL-THROTTLE GPU ACCELERATION
Vincent Brisebois
• 12 years at Autodesk Media & Entertainment: Tech Support / Product Specialist / Product Designer
• Member of the Visual Effects Society Technology Council
• 2 years at Fusion-io: Entertainment Business Development, Performance Computing Industry Manager
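THE DATA SUPPLY PROBLEM LEADS TO IDLE CAPACITY
[Figure: Growing performance gap. Relative performance of CPUs, memory, and storage, 1985 to 2010]
Processing performance doubles roughly every 18 months (the popular reading of Moore's Law), while memory and storage fall further behind, leaving processors idle while they wait for data.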
TECHNOLOGY ENABLERS
• PCIe ecosystem
• Software-enabled reprogrammable controllers
• Flash memory
ARCHITECTURAL HIGHLIGHTS
[Figure: ioMemory board showing NAND flash chips, heat sink/FPGA, and parity chip]
Reliability
• N+1 redundancy: like having a RAID between chips, without the capacity sacrifice
• Over-provisioning: reserve space for handling individual flash cells dying; the reserve is adjustable if higher write performance is needed
• High ECC strength: 72-bit error correction
NETWORKED STORAGE DATA SUPPLY CHAIN FROM APPLICATION TO FLASH
[Diagram: Application → server processor → network adapter → network switch → network adapter → storage appliance processor → disk RAID controller → SAS/SATA bus and protocol → SSD (SSD RAM, battery/supercapacitors, embedded CPU, NAND flash)]
9 intermediary components required, all adding access delay, cost, and complexity, and lowering reliability (especially the supercapacitors). Every request makes a round trip, touching each component TWICE…
May 24, 2012 · Fusion-io Confidential
SSD DATA SUPPLY CHAIN FROM APPLICATION TO FLASH
[Diagram: Application → server processor → disk RAID controller → SAS/SATA bus and protocol → SSD (SSD RAM, battery/supercapacitors, embedded CPU, NAND flash)]
5 intermediary components required, all adding access delay, cost, and complexity, and lowering reliability (especially the supercapacitors).
FUSION-IO DATA SUPPLY CHAIN FROM APPLICATION TO FLASH
[Diagram: Application → server processor → NAND flash]
0 intermediary components required. No need for supercapacitors, because data is not "buffered" in DRAM.
FUSION-IO FIRST MOVER MILESTONES
Timeline, 2006 to 2012 (in slide order):
• Mission to consolidate memory and storage
• ioMemory technology unveiled
• First products launched
• HP OEMs products
• Dell OEMs products
• IPO on NYSE
• 1 Billion IOPS
• ioTurbine acquired
• 2,500 customers
• 1 million IOPS
• IBM Quicksilver
• IBM OEMs products
• VSL introduced
• ioDrive2 announced
• >120 channel and alliance partners
• Dell strategic investment
• Samsung strategic investment
• 50+ Petabytes shipped
• ioFX
• ioMemory SDK
COMPREHENSIVE CUSTOMER SUCCESS
Sectors: Manufacturing/Government, Financials, Web, Technology, Retail
• 5x faster database replication
• 30x faster data warehouse queries
• 40x query processing throughput
• 15x faster data analysis
• 15x faster queries
30+ case studies at http://fusionio.com/casetudies
FUSION-IO ACCELERATES
Big Data Analytics · Collaboration · Databases · Virtualization · Search · KVM · Lotus · INFORMIX · ORACLE · Text · HPC · Messaging · Workstation · Development · Caching · Web · Security/Logging · MQ · LAMP · GPFS
IOMEMORY PLATFORM
FUSION IOFX MEMORY TIER
▸ Tuned for sustained performance in multithreaded applications
▸ Work on 2K, 4K and 5K digital content interactively, in full resolution
▸ Manipulate stereoscopic content in real time
▸ Accelerate video and image editing and compositing
▸ Speed video playback
▸ Powerful throughput to maximize GPU processing
▸ Simplify and accelerate encoding and transcoding
▸ Accelerate compiling code for software programmers
420GB · 1.4 GB/s read · 700 MB/s write · 42µs · QDP MLC
IOMEMORY PERFORMANCE
                  365GB         Duo 2.4TB     400GB         Duo 1.2TB     ioFX
NAND type         MLC           MLC           SLC           SLC           MLC
Read bandwidth    910 MB/s      3.0 GB/s      1.4 GB/s      3.0 GB/s      1.5 GB/s
Write bandwidth   590 MB/s      2.5 GB/s      1.3 GB/s      2.6 GB/s      700 MB/s
Read latency      68 µs         68 µs         47 µs         47 µs         68 µs
Write latency     15 µs         15 µs         15 µs         15 µs         15 µs
Bus interface     PCIe 2.0 x4   PCIe 2.0 x8   PCIe 2.0 x4   PCIe 2.0 x8   PCIe 2.0 x4
IOPS values as listed on the slide, in order (not every card has a value in every row):
Read IOPS (seq): 415,000 / 892,000 / 351,000 / 702,000
Write IOPS (seq): 535,000 / 935,000 / 511,000 / 937,000
Read IOPS (rand): 137,000 / 285,000
Write IOPS (rand): 535,000 / 725,000
FLASH MEMORY EVOLUTION
[Diagram: I/O stacks compared across legacy SSDs, ioMemory as block device, ioMemory as transparent cache, and native access (ioMemory with direct-access I/O; ioMemory with memory semantics). Layers shown: application, open source extensions, OS block I/O, direct-access I/O API family, memory semantics API family, file system, directFS (native file system service), host block layer, directCache, Virtual Storage Layer (VSL), SAS/SATA, RAID controller, network/remote storage, and flash layer. Every path ends in read/write except memory semantics, which uses load/store.]
IOMEMORY AS BLOCK DEVICE
Demo
SYSTEM DIAGRAM
QUADRO DUAL COPY ENGINE
OPENGL PIXEL BUFFER OBJECTS (PBO)
File system direct I/O (FILE_FLAG_NO_BUFFERING bypasses the OS file cache, so reads must use sector-aligned buffers and sector-multiple sizes):
    file_handle = CreateFile(LPCSTR(video_file), GENERIC_READ, FILE_SHARE_READ, NULL,
                             OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
GPU DMA-able system buffer:
    glGenBuffers(1, &buffer_handle);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, buffer_handle);
    // Allocate storage only; NULL means no initial data is copied
    glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, size, NULL, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
READ FROM IOMEMORY
Map PBO for write (the buffer is used as an unpack source, so it is bound to the unpack target throughout):
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, buffer_handle);
    void *pbomem = glMapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
Read from ioMemory straight into the mapped PBO:
    DWORD num_bytes_read = 0;
    BOOL ret = ReadFile(file_handle, pbomem, size, &num_bytes_read, NULL);
Unmap PBO for DMA:
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, buffer_handle);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
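TRANSFER TO GPU
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, buffer_handle);
    glBindTexture(GL_TEXTURE_2D, texture_handle);
    // With a PBO bound, the last argument is a byte offset into the PBO, not a client pointer
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA, GL_UNSIGNED_BYTE, 0);
    glBindTexture(GL_TEXTURE_2D, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
Barrier sync DMA (a zero timeout only polls the fence, so keep waiting until it is signaled):
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    GLenum status;
    do {
        status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000); // timeout in ns (1 ms)
    } while (status == GL_TIMEOUT_EXPIRED);
    glDeleteSync(fence);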
PIPELINE
Read from ioMemory → DMA to GPU → Draw from GPU
[Diagram: pipeline stages running on ioFX]
CUDA GPUDIRECT
• Copy data directly to/from CUDA pinned host memory: avoid one copy
• Peer-to-peer transfers between GPUs: utilizes PCIe DMA
• Peer-to-peer memory access between GPUs: NUMA from within CUDA kernels
• Pipeline transfers for GP-GPU: read from ioMemory, write to ioMemory
• Unified Virtual Address Space!
CUDA
    // OS-pinned CUDA buffer: page-locked host memory the GPU can DMA from directly
    cudaHostAlloc((void**)&h_odata, memSize, (wc) ? cudaHostAllocWriteCombined : 0);
    // Read from ioMemory; O_DIRECT bypasses the page cache
    int fd = open("/mnt/cudaMemory", O_RDWR | O_DIRECT);
    if (fd >= 0) {                  // open() returns -1 on failure, not 0
        ssize_t rc = read(fd, h_odata, memSize);
        close(fd);
        // Copy (DMA) to GPU, only after a complete read
        if (rc == (ssize_t)memSize)
            cudaMemcpyAsync(d_idata, h_odata, memSize, cudaMemcpyHostToDevice, stream);
    }
PROGRAMMING PATTERNS
• Pipelines
• CPU threads
• CUDA streams
• Ring buffers
• Parallel DMA
• Direct I/O
But ioMemory is much more than a block device. It's non-volatile memory with native access semantics…
EXPLOITING NATIVE CHARACTERISTICS OF IOMEMORY
1. Native log-append writes incorporate copy-on-write basics
2. Native block mapping and allocation incorporate file system basics
3. Native large virtual address space incorporates sparse semantics
4. Native storage methods incorporate key-value store basics
SDK INTRO
Fusion-io Software Development Kit enables native flash memory access:
• directPrimitives API, including Atomic Writes and the MySQL InnoDB extension
• directKey-Value Store API *
• directFS, native file-access layer *
• Auto-Commit Memory API
• Extended Memory API
FLASH MEMORY EVOLUTION: NATIVE API ACCESS
[Diagram: I/O stacks compared across legacy SSDs, ioMemory as block device, ioMemory as transparent cache, and native access (ioMemory with direct-access I/O). Layers shown: application, open source extensions, directCache, OS block I/O, direct I/O primitives, Key-Value Store API, direct API, file system, directFS (native file system service), host block layer, SAS/SATA, RAID controller, network/remote storage, Virtual Storage Layer (VSL), and flash layer. All paths use read/write.]
KEY-VALUE STORE API LIBRARY
[Diagram: Application → Key-Value API and library (atomic write(); lookup: exists(); atomic delete (PTRIM); coordinated garbage collection) → VSL (dynamic provisioning, block allocation, logging, etc.)]
Citrusleaf NoSQL demo, April 2012: 400,000 transactions/second on a single server
CUDA & KEY-VALUE STORE
    // OS-pinned CUDA buffer
    cudaHostAlloc((void**)&h_odata, memSize, (wc) ? cudaHostAllocWriteCombined : 0);
    // KeyGet from ioMemory straight into the pinned buffer
    rc = directKeyGet("key", h_odata, &memSize);
    // Copy (DMA) to GPU
    cudaMemcpyAsync(d_idata, h_odata, memSize, cudaMemcpyHostToDevice, stream);
DIRECTFS – NATIVE FILE SERVICES LAYER
[Diagram: Application → directFS (namespace; file/offset → sparse address; atomic write(); lookup: exists(); atomic delete (PTRIM)) → VSL (dynamic provisioning, block allocation, logging, etc.)]
CUDA & DIRECTFS
    // OS-pinned CUDA buffer: page-locked host memory the GPU can DMA from directly
    cudaHostAlloc((void**)&h_odata, memSize, (wc) ? cudaHostAllocWriteCombined : 0);
    // Read from ioMemory; O_DIRECT bypasses the page cache
    int fd = open("/mnt/cudaMemory", O_RDWR | O_DIRECT);
    if (fd >= 0) {                  // open() returns -1 on failure, not 0
        ssize_t rc = read(fd, h_odata, memSize);
        close(fd);
        // Copy (DMA) to GPU, only after a complete read
        if (rc == (ssize_t)memSize)
            cudaMemcpyAsync(d_idata, h_odata, memSize, cudaMemcpyHostToDevice, stream);
    }
FLASH MEMORY EVOLUTION: NATIVE API ACCESS
[Diagram: I/O stacks compared across legacy SSDs, ioMemory as block device, ioMemory as transparent cache, and native access (ioMemory with direct-access I/O; ioMemory with memory semantics). Layers shown: application, open source extensions, directCache, OS block I/O, direct I/O primitives, Key-Value Store API, Auto-Commit Memory, Extended Memory, checkpointed memory, direct API, file system, directFS (native file system service), host block layer, SAS/SATA, RAID controller, network/remote storage, Virtual Storage Layer (VSL), and flash layer. Every path ends in read/write except memory semantics, which uses load/store.]
CONCLUSION
Early access to ioMemory SDK libraries and technical documentation: http://developer.fusionio.com
ioMemory SDK web seminars (Wednesdays):
• May 2: directPrimitives API, including Atomic Writes and the MySQL InnoDB extension
• May 9: directKey-Value Store API
• May 23: directFS, native file-access layer
• May 30: Auto-Commit Memory API
• June 6: Extended Memory API
WHAT WE WANT TO SEE
We encourage you to "Go Native" and engage us in discussion about where you want to see the technology grow. We would love your input.