
Visual Object Recognition Accelerator Based on Approximate In-Memory Processing

Yeseong Kim, Mohsen Imani, Tajana Rosing. University of California, San Diego, Department of Computer Science and Engineering. seelab.ucsd.edu


Presentation Transcript


  1. Visual Object Recognition Accelerator Based on Approximate In-Memory Processing Yeseong Kim, Mohsen Imani, Tajana Rosing University of California, San Diego Department of Computer Science and Engineering seelab.ucsd.edu

  2. Internet of Things and Big Data • Internet of Things: billions to trillions of interconnected devices • 1.8 zettabytes of data generated in 2015, increased by 50% in 2020! • Diverse applications handle Big Data in a system-efficient way http://www.iottechworld.com/

  3. Cost of Operations • A DRAM access consumes 170x more energy than an FPU multiplication Ref: Dally, Tutorial, NIPS’15

  4. Processing In Memory • Processing In Memory (PIM): performing a part of the computation tasks inside the memory [Figure: a general-purpose multi-core processor next to a large memory for Big Data, with computational logic placed in the memory]

  5. Supporting In-Memory Operations • Supported operations and examples: • Bitwise: OR, AND, XOR • Search operation: multiple-row search, nearest search • Addition/multiplication: matrix multiplication • Applications: classification, clustering, database, deep learning, security, multimedia, HD computing, graph processing, query processing

  6. Machine Learning Acceleration • Machine learning is a popular choice to handle and assimilate Big Data, but usually requires lots of computation and tuning • e.g., deep neural networks • AdaBoost: one of the best off-the-shelf learning algorithms • Exploits an ensemble of weak learning models (e.g., decision trees) => robust and general purpose [Figure: computation acceleration of a machine learning (ML) model on a dataset, answering "Is this a face? What’s the probability?"]

  7. DNN vs AdaBoost • Deep neural networks: • Show superior quality in recognition tasks • Are accelerated on FPGA, ASIC-based, and PIM-based designs • Have significant energy and performance issues due to the high computation complexity and large memory footprints of the models • In contrast, our design targets AdaBoost, which: • Is a viable solution for diverse object recognition tasks without losing generality • Has been widely used in the computer vision field • Has a relatively light-weight learning method and shows better accuracy than DNN in some image recognition cases, e.g., face detection • Requires less effort to tune parameters, and the trained models are easy to interpret

  8. Example of AdaBoost Functionality Training set: 10 points (represented by plus or minus). Original status: equal weights for all training samples. www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt

  9. Example of AdaBoost Functionality (cont’d) Round 1: Three “plus” points are not correctly classified; they are given higher weights. www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt

  10. Example of AdaBoost Functionality (cont’d) Round 2: Three “minus” points are not correctly classified; they are given higher weights. www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt

  11. Example of AdaBoost Functionality (cont’d) Round 3: One “minus” and two “plus” points are not correctly classified; they are given higher weights. www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt

  12. Example of AdaBoost Functionality (cont’d) Final Classifier: integrate the three “weak” classifiers and obtain a final strong classifier. www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt
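
The rounds walked through in slides 8 to 12 can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the 1-D toy dataset, the threshold stumps and the names `train_adaboost` and `predict` are assumptions made for the example (the slides use 2-D points).

```python
import math

def train_adaboost(points, labels, rounds):
    """Train AdaBoost with 1-D threshold stumps; labels are +1 or -1."""
    n = len(points)
    weights = [1.0 / n] * n            # original status: equal weights
    ensemble = []                      # list of (alpha, threshold, polarity)
    candidates = sorted(set(points)) + [max(points) + 1]
    for _ in range(rounds):
        best = None
        # A stump predicts `polarity` when x < threshold, otherwise -polarity.
        for threshold in candidates:
            for polarity in (+1, -1):
                preds = [polarity if x < threshold else -polarity for x in points]
                err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity, preds)
        err, threshold, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples get higher weights; then renormalize.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Final strong classifier: sign of the weighted vote of all stumps."""
    score = sum(alpha * (pol if x < thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single stump separates it, three rounds do.
points = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = train_adaboost(points, labels, rounds=3)
print([predict(ensemble, x) for x in points])
```

Each round picks the stump with the lowest weighted error, so the samples misclassified in one round steer the choice of the next stump, as in the slides.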

  13. Object Recognition Acceleration • We conducted a case study for a popular classification problem: • Image object recognition • The classification decision is made by an ensemble of DT-MEM blocks • Additional memory blocks are designed for in-memory feature extraction

  14. Histogram of Oriented Gradients (HoG) • Describes the shape and appearance of target objects • Computes the gradient value of every pixel from its adjacent pixels • A gradient can be represented by a vector, which has an orientation (direction) and a magnitude • Each pixel of an image color channel has 256 possible values and each cell includes 9 pixels → a prohibitively large memory would be needed to cover all pixel combinations
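
As a rough illustration of the gradient step and of why a lookup table over raw pixels is impractical, the sketch below computes the orientation bin and magnitude for the center pixel of a 3x3 cell. The centered-difference gradient, the 9 unsigned orientation bins and the name `cell_gradient` are common HoG conventions assumed here; the slides do not specify them.

```python
import math

def cell_gradient(cell, bins=9):
    """Return (bin_index, magnitude) for the center pixel of a 3x3 cell."""
    gx = cell[1][2] - cell[1][0]       # right neighbor minus left neighbor
    gy = cell[2][1] - cell[0][1]       # bottom neighbor minus top neighbor
    magnitude = math.hypot(gx, gy)
    angle = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned orientation
    bin_index = int(angle // (180.0 / bins)) % bins
    return bin_index, magnitude

# Hypothetical 3x3 cell of 8-bit pixel values.
cell = [[0,   0,   0],
        [0, 128, 255],
        [0, 255, 255]]
print(cell_gradient(cell))

# Number of distinct 9-pixel cells with 256 values per pixel:
print(256 ** 9)
```

A table with one precomputed entry per possible cell would need 256^9 (about 4.7 x 10^21) rows, which is the memory problem the slide points out.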

  15. Approximate HoG • Optimize this memory size by storing only approximate and representative values • For example, the input pixels of the MNIST handwritten digits can be represented using two values, e.g., black and white • MNIST: 87% of the pixels in the MNIST images are either 0 or 255 • WebFaces: many pixels have similar values in the middle range

  16. Feature Extraction Acceleration • The address decoder quantizes each pixel into Q levels • E.g., for Q = 4, the 256 pixel values are quantized to 4 values: 00, 01, 10, and 11 • The quantized values are concatenated to form a memory address, which indicates a row of the crossbar memory block • Each row of the recipe memory includes: • di: bin index of the vector direction • mi: the magnitude [Figure: original computation vs. in-memory computation]
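
The address formation described on this slide can be mimicked in software as follows. This is only a sketch under stated assumptions: uniform quantization, the interval midpoint as the representative value, the centered-difference gradient, and all function names (`quantize`, `cell_address`, `recipe_row`) are hypothetical and not taken from the ORCHARD design.

```python
import math

def quantize(pixel, q_levels):
    """Map an 8-bit pixel value (0..255) to one of q_levels levels."""
    return pixel * q_levels // 256

def cell_address(pixels, q_levels):
    """Concatenate the quantized values of a 9-pixel cell into a row address."""
    bits = (q_levels - 1).bit_length()     # bits per quantized pixel
    address = 0
    for p in pixels:
        address = (address << bits) | quantize(p, q_levels)
    return address

def recipe_row(address, q_levels, bins=9):
    """Precomputable (d, m) entry of one row: bin index and magnitude."""
    bits = (q_levels - 1).bit_length()
    mask = (1 << bits) - 1
    levels = [(address >> (bits * (8 - i))) & mask for i in range(9)]
    step = 256 // q_levels
    cell = [lvl * step + step // 2 for lvl in levels]   # assumed midpoints
    gx = cell[5] - cell[3]                 # right minus left of the center
    gy = cell[7] - cell[1]                 # bottom minus top of the center
    angle = math.degrees(math.atan2(gy, gx)) % 180.0
    return int(angle // (180.0 / bins)) % bins, math.hypot(gx, gy)

# Hypothetical cell in row-major order, quantized with Q = 4.
pixels = [0, 64, 128, 255, 10, 70, 130, 200, 255]
addr = cell_address(pixels, q_levels=4)
print(format(addr, "018b"))   # 9 pixels x 2 bits
print(recipe_row(addr, q_levels=4))
print(4 ** 9)                 # number of rows needed for Q = 4
```

With Q = 4 the table shrinks to 4^9 = 262,144 rows, and every row's (d, m) pair can be computed once ahead of time, so that feature extraction at run time becomes a memory lookup.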

  17. Haar-like Feature Extraction • Feature’s value is calculated as the difference between the sum of the pixels within white and black rectangle regions

  18. Integral Image • How to add all pixel values in a rectangular region? Should we add all pixel values? Can we do better? • The integral image stores, at each point, the pixel sum of the rectangle from the top-left corner to that point • Example: a 3x3 original image of all 1s has the integral image rows (1 2 3), (2 4 6), (3 6 9) • The pixel sum of region D needs only 4 corner values: D = P4 - P2 - P3 + P1 • Haar-feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B) • Used for face detection in a sub-window with facial Haar features Ref: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration
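
The four-corner rule can be checked with a short Python sketch. The function names and the two-rectangle feature layout are illustrative assumptions; the 3x3 image of ones matches the example values on the slide.

```python
def integral_image(image):
    """ii[r][c] = sum of image[0..r][0..c], inclusive."""
    rows, cols = len(image), len(image[0])
    ii = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += image[r][c]
            ii[r][c] = row_sum + (ii[r - 1][c] if r > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Pixel sum of rows top..bottom, cols left..right: D = P4 - P2 - P3 + P1."""
    p4 = ii[bottom][right]
    p2 = ii[top - 1][right] if top > 0 else 0
    p3 = ii[bottom][left - 1] if left > 0 else 0
    p1 = ii[top - 1][left - 1] if top > 0 and left > 0 else 0
    return p4 - p2 - p3 + p1

def haar_two_rect(ii, white, black):
    """Haar-like feature: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B)."""
    return rect_sum(ii, *white) - rect_sum(ii, *black)

image = [[1, 1, 1],
         [1, 1, 1],
         [1, 1, 1]]
ii = integral_image(image)
print(ii)
print(rect_sum(ii, 1, 1, 2, 2))                       # bottom-right 2x2 block
print(haar_two_rect(ii, (0, 0, 2, 0), (0, 2, 2, 2)))  # left vs. right column
```

Once the integral image is built, every rectangle sum costs the same four lookups regardless of the rectangle's size.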

  19. In-Memory Haar-Like Feature Extraction • The memory block has to be initialized with the integral image • Computes a Haar-like feature from two in-memory additions • Subsequent subtraction and weighting are processed by a small CMOS-based weighted subtractor block • The weighting logic is implemented using shift operations • Memory is optimized for write latency

  20. In-Memory Decision Tree • A DT-MEM implements a decision tree based on the concept of auto-associative memory • 1) Activates the decision stump of the root node • 2) Auto-associative memory: performs the similarity search over the two enabled rows • 3) Repeatedly searches for a similar row with the given buffer data until the node type flag is 0 • 4) Tree-based adder: combines different features based on their weights
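
A software analogy of steps 1 to 4 is sketched below. It is not a model of the DT-MEM hardware: the associative similarity search is replaced by an ordinary threshold comparison, and the `Row` layout, field names and example trees are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Row:
    flag: int               # node type flag: 1 = internal node, 0 = leaf
    feature: int = 0        # index of the feature tested at an internal node
    threshold: float = 0.0
    low: int = 0            # next row when the feature value < threshold
    high: int = 0           # next row otherwise
    vote: int = 0           # class vote (+1 or -1) stored in a leaf

def evaluate_tree(rows, features):
    """Walk one tree stored as table rows until a leaf row is reached."""
    row = rows[0]                         # 1) start at the root node
    while row.flag != 0:                  # 3) repeat until the flag is 0
        # 2) choose between the two candidate rows (comparison stands in
        #    for the associative search)
        nxt = row.low if features[row.feature] < row.threshold else row.high
        row = rows[nxt]
    return row.vote

def ensemble_decision(trees, weights, features):
    """4) Combine the votes of all trees according to their weights."""
    score = sum(w * evaluate_tree(t, features) for t, w in zip(trees, weights))
    return 1 if score >= 0 else -1

# Two hypothetical decision stumps acting as weak learners.
tree_a = [Row(1, feature=0, threshold=0.5, low=1, high=2),
          Row(0, vote=-1), Row(0, vote=+1)]
tree_b = [Row(1, feature=1, threshold=2.0, low=1, high=2),
          Row(0, vote=+1), Row(0, vote=-1)]
features = [0.9, 3.0]
print(ensemble_decision([tree_a, tree_b], [0.7, 0.3], features))
```

Storing each tree as flat rows with a leaf flag is what makes the traversal expressible as repeated row lookups, which is the property the in-memory design relies on.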

  21. Experimental Setup • C++ cycle-accurate simulator to model the ORCHARD functionality • Circuit-level simulation to estimate the performance and energy consumption of the proposed hardware • Cadence Virtuoso with 45nm CMOS technology • VTEAM memristor model [*] for our memory design simulation: • RON and ROFF of 10kΩ and 10MΩ, respectively • Evaluated the energy and performance efficiency against existing processor-based implementations • Measured the power consumption of an Intel Xeon E5440 processor and an ARM Cortex A53 processor [*] S. Kvatinsky et al., “VTEAM: A general model for voltage-controlled memristors,” TCAS II, vol. 62, no. 8, pp. 786–790, 2015.

  22. Object Recognition Accuracy • The proposed design successfully recognizes different objects using the same acceleration strategy • Benchmarks: • MNIST • 10000 WebFace • INRIA Pedestrian • UIUC Vehicle

  23. Tradeoff: Accuracy vs. Approximation • The approximate in-memory feature extraction incurs only minimal accuracy loss, e.g., • MNIST: only 0.4% (97.5%) for Q = 2 and L = 1024 • WebFace: only 0.3% error (96.7%) for Q = 6 and L = 2048

  24. Energy & Performance Comparison • ORCHARD executing all the tasks inside memory: • 1,896X energy efficiency and 376X speedup as compared to Intel Xeon E5440 • 552X energy efficiency and 2,654X speedup as compared to ARM Cortex A53

  25. In-Memory Computing Accelerator [Figure: workloads targeted by in-memory computing accelerators, supporting both training and testing: classification, clustering, hyperdimensional classification, hyperdimensional clustering, k-means, AdaBoost, DNN, CNN, decision tree, kNN, database, query processing, graph processing]

  26. Conclusion • We propose ORCHARD, which: • Accelerates two well-known feature extractors fully in memory • Accelerates the decision tree as a base learner of AdaBoost using CAM and crossbar memory • Supports approximate in-memory computing • Our evaluation: • Tested on four practical image recognition tasks • ORCHARD achieves up to 1,896x energy efficiency improvement and 376x speedup • The accuracy loss due to the approximation is minimal: only 0.3%

  27. Energy/Execution Breakdown • The feature extractors dominate: they require many memory operations • Haar-like feature extractor: • Consumes 93% of the power to write the integral image • Write operations account for 63% of the latency (write latency optimized) • DT-MEM: only 4% of the energy, and parallelizable across different weak kernels

  28. Area Overhead • The crossbar memories for the two feature extractors take 63.7% of the total area • DT-MEM blocks take 31.9% of the total area (86% CAM, 7.5% latch) • The tree-based adder takes 3.5% of the total area
