
Hao Zhou∗, Xinyu Niu†, Junqi Yuan∗, Lingli Wang∗, Wayne Luk†

Connect On the Fly: Enhancing and Prototyping of Cycle-Reconfigurable Modules. Hao Zhou∗, Xinyu Niu†, Junqi Yuan∗, Lingli Wang∗, Wayne Luk†. †Dept. of Computing, Imperial College London, UK


Presentation Transcript


  1. Connect On the Fly: Enhancing and Prototyping of Cycle-Reconfigurable Modules. Hao Zhou∗, Xinyu Niu†, Junqi Yuan∗, Lingli Wang∗, Wayne Luk†. †Dept. of Computing, Imperial College London, UK; ∗School of Microelectronics, Fudan University, China

  2. Outline • Motivation • EURECA architecture • Architecture optimisation • Results • Summary

  3. Summary of Contributions 1. Reconfiguration strategies for data access patterns • Three categories for dynamic data accesses • Reconfiguration modes: maximise configuration sharing 2. Architecture design space exploration • Multiplexers or permutation network • Optimal EURECA module size 3. Prototype chip layout • Manually developed with SMIC 130-nm technology • Results measured using Cadence tools

  4. Static vs dynamic data access • conditional arithmetic operators • dynamic data access patterns • Static (easy): for (i=0; i<n; i+=N) #parallel unroll N for (j=0; j<N; j++) { k=i+j; d[k]=a[k+1] * c[k]; } • Dynamic (hard): for (i=0; i<n; i+=N) #parallel unroll N for (j=0; j<N; j++) { k=i+j; d[k]=a[b[k+1]] * c[k]; }
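
The difference between the two loops on this slide can be made concrete with a small C sketch. It assumes the block index is `k = i + j` (the outer loop already steps by `N`), that `a` and `b` hold `n + 1` elements, and that `b` stores valid indices into `a`; the function names are illustrative and not from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Static access: a[k + 1] depends only on the loop counters, so each of the
   N unrolled data-paths always reads from the same relative position. */
void static_access(size_t n, size_t N, const int *a, const int *c, int *d) {
    assert(N > 0);
    for (size_t i = 0; i < n; i += N) {
        for (size_t j = 0; j < N && i + j < n; j++) {
            size_t k = i + j;
            d[k] = a[k + 1] * c[k];
        }
    }
}

/* Dynamic access: the index into a is read from b at run time, so any
   data-path may need any element of a in any cycle. */
void dynamic_access(size_t n, size_t N, const int *a, const size_t *b,
                    const int *c, int *d) {
    assert(N > 0);
    for (size_t i = 0; i < n; i += N) {
        for (size_t j = 0; j < N && i + j < n; j++) {
            size_t k = i + j;
            d[k] = a[b[k + 1]] * c[k];
        }
    }
}
```

In software both loops cost the same; the slide's point is that in hardware the second loop needs a connection from every data-path to every memory location it might address.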

  5. Example: dynamic data access • for (i=0; i<n; i++) { d[i]=a[b[i+1]] * c[i]; } • Connections for one data-path: easy

  6. Example: dynamic data access • for (i=0; i<n; i+=N) #parallel unroll N=32 for (j=0; j<32; j++) { k=i+j; d[k]=a[b[k+1]] * c[k]; } • Connections for 32 data-paths: hard

  7. Implementation 1: multiplexers • congested routing • 1024-to-1024 bit connections • unroutable in XC6V-SX475T • expensive user multiplexers • 32 output ports, each 32 bits wide
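
The 1024-bit figure follows from 32 ports of 32 bits each. A behavioural sketch of such a full crossbar is given below, assuming one select value per output port; `crossbar` and `total_data_bits` are hypothetical names, and the code models function only, not the routing cost that makes the design unroutable.

```c
#include <stdint.h>

#define PORTS 32u       /* data-paths, as on the slide */
#define PORT_WIDTH 32u  /* bits per port, as on the slide */

/* Each output port j picks one of the 32 input ports via sel[j].
   In hardware this is 32 separate 32-to-1 multiplexers, each 32 bits wide. */
void crossbar(const uint32_t in[PORTS], const uint8_t sel[PORTS],
              uint32_t out[PORTS]) {
    for (unsigned j = 0; j < PORTS; j++) {
        out[j] = in[sel[j] % PORTS];  /* wrap out-of-range selects */
    }
}

/* Total data bits on each side of the network: 32 * 32 = 1024. */
unsigned total_data_bits(void) {
    return PORTS * PORT_WIDTH;
}
```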

  8. EURECA module: execution flow • Initial configuration • CDN: Configuration Distribution Network

  9. EURECA module: execution flow • CG: Configuration Generator • Data-paths • Initial connections

  10. Run-time reconfiguration flow • CG • Runtime configuration • Initial connections

  11. Run-time reconfiguration flow • CG • Initial configuration

  12. Run-time reconfiguration flow • Data-path data • CG • Initial configuration • Memory data

  13. 1. Optimising Data Access Patterns • Static (a): accesses with fixed strides • Dynamic size (b): linear accesses with variable vector size • Dynamic offset (c): vector access with dynamic offsets • Random (d): each access with a dynamic offset
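
One way to read the four categories is as four address generators for a group of parallel lanes. The sketch below is an interpretation of the bullet text, not code from the talk: the lane count of 4 and all function names are assumptions. Each function fills `addr` and returns the number of active lanes.

```c
#include <stddef.h>

#define LANES 4  /* illustrative lane count, not from the slides */

/* (a) Static: base and stride are compile-time constants in hardware. */
size_t addr_static(size_t base, size_t stride, size_t addr[LANES]) {
    for (size_t l = 0; l < LANES; l++) addr[l] = base + l * stride;
    return LANES;
}

/* (b) Dynamic size: linear access, but only `size` lanes are active. */
size_t addr_dynamic_size(size_t base, size_t size, size_t addr[LANES]) {
    size_t active = size < LANES ? size : LANES;
    for (size_t l = 0; l < active; l++) addr[l] = base + l;
    return active;
}

/* (c) Dynamic offset: one run-time offset shared by the whole vector. */
size_t addr_dynamic_offset(size_t offset, size_t addr[LANES]) {
    for (size_t l = 0; l < LANES; l++) addr[l] = offset + l;
    return LANES;
}

/* (d) Random: every lane carries its own run-time offset. */
size_t addr_random(const size_t offsets[LANES], size_t addr[LANES]) {
    for (size_t l = 0; l < LANES; l++) addr[l] = offsets[l];
    return LANES;
}
```

Under this reading, (b) and (c) need only one run-time value per group of lanes, while (d) needs one per lane and is the only case where two lanes can collide on the same address.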

  14. Dynamic FIFOs • Dynamic size: dynamic FIFOs (BRAMs organised as FIFOs) • A single configuration is shared by all runtime-reconfigurable connections in a EURECA module • Enable signals to FIFOs are dynamically connected to the corresponding FIFO ports
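
As a rough illustration of why a single shared configuration can suffice here, the sketch below derives all FIFO enable signals from just two run-time values: a starting bank and a count. The bank count of 8, the wrap-around ordering, and the names `fifo_enables` and `next_head` are assumptions made for this example; the slides do not specify the enable logic.

```c
#include <assert.h>
#include <stdint.h>

#define BANKS 8u  /* illustrative number of BRAM FIFOs in one group */

/* Bit b of the result is the enable signal for FIFO bank b.
   `count` consecutive banks are enabled, starting at `head` and wrapping. */
uint32_t fifo_enables(unsigned head, unsigned count) {
    assert(head < BANKS);
    if (count > BANKS) count = BANKS;
    uint32_t mask = 0;
    for (unsigned t = 0; t < count; t++) {
        mask |= UINT32_C(1) << ((head + t) % BANKS);
    }
    return mask;
}

/* The starting bank for the next cycle moves past the consumed values. */
unsigned next_head(unsigned head, unsigned count) {
    if (count > BANKS) count = BANKS;
    return (head + count) % BANKS;
}
```

For example, starting at bank 6 and consuming 3 values enables banks 6, 7 and 0, and the next cycle starts at bank 1.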

  15. Dynamic cache • Dynamic offset: dynamic cache (BRAMs as shared cache) • A single configuration is shared by all runtime-reconfigurable connections • Input addresses to BRAMs are dynamically connected

  16. Dynamic shared cache • Random: dynamic shared cache, subject to access conflicts • Access conflicts happen when two or more data-paths try to access the same port at the same time (the lower two data-paths in the figure) • Each reconfigurable connection has a separate configuration • Conflicting ports (the lowest data-path) are disabled by the scheduler
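
The conflict rule on this slide can be sketched as a simple arbitration pass. This is a minimal model that assumes the first data-path requesting a port wins and later requesters are disabled for that cycle; the actual priority policy of the EURECA scheduler is not stated in the slides, and `schedule`, `PATHS` and `NPORTS` are hypothetical names.

```c
#include <stdbool.h>

#define PATHS 4   /* illustrative number of data-paths */
#define NPORTS 4  /* illustrative number of memory ports */

/* req[p] is the memory port requested by data-path p in this cycle.
   enable[p] is set to true if the request is granted.
   Returns the number of data-paths disabled because of a conflict
   (or because the requested port index is out of range). */
int schedule(const int req[PATHS], bool enable[PATHS]) {
    bool taken[NPORTS] = {false};
    int disabled = 0;
    for (int p = 0; p < PATHS; p++) {
        int port = req[p];
        if (port < 0 || port >= NPORTS || taken[port]) {
            enable[p] = false;  /* port already claimed: stall this path */
            disabled++;
        } else {
            taken[port] = true;
            enable[p] = true;
        }
    }
    return disabled;
}
```

With requests `{0, 2, 3, 3}`, the last two data-paths conflict on port 3 and only the last one is disabled, which matches the situation described for the figure.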

  17. 2. Optimising Connection Network • Multiplexers or permutation network? • Three parameters to evaluate: • Number of pins to support runtime reconfiguration • Silicon area to implement the network • Logic required to generate runtime configurations for the network

  18. Connection Network • Multiplexers or permutation network? • Multiplexers are selected, as their configurations can be shared • Reconfiguring a permutation network within a single cycle takes an unacceptable number of pins and amount of configuration generation logic

  19. Architecture Efficiency • Architecture efficiency is evaluated as the area saved by runtime reconfiguration, multiplied by the overall area overhead • Terms in the figure: original chip area; original application resource usage; runtime-reconfigurable application resource usage; EURECA chip area
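
The figure on this slide is not recoverable from the transcript. One plausible reading of the four terms, offered here only as an assumption, is a product of two ratios: the resource saving of the application and the chip-area penalty of adding EURECA modules. The symbols below are introduced for illustration and do not appear in the talk.

```latex
% Assumed form; symbols are illustrative, not from the slides.
% R_orig : original application resource usage
% R_rt   : runtime-reconfigurable application resource usage
% A_orig : original chip area
% A_eur  : EURECA chip area
\[
  \text{Efficiency}
  = \underbrace{\frac{R_{\mathrm{orig}}}{R_{\mathrm{rt}}}}_{\text{area saved}}
    \times
    \underbrace{\frac{A_{\mathrm{orig}}}{A_{\mathrm{eur}}}}_{\text{area overhead}}
\]
```

Under this reading, a value above 1 means the application-level saving outweighs the extra silicon spent on the EURECA modules.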

  20. EURECA Optimisation • Connection network: multiplexers or permutation network • Memory group size: the number of BRAM blocks connected to a single EURECA module • Circuit models for application area reduction and module area • Small memory group size: small reductions in design area • Large memory group size: large module area overhead • Efficiency reaches its maximum at 32 BRAMs, supported by circuit models

  21. 3. EURECA Prototype • Prototype layout developed with SMIC 130-nm technology • Classic island-style FPGA architecture adapted • Small FPGA size due to tape-out budget • A single EURECA module added to one column of BRAMs (8) • A EURECA module occupies the same area as 2.72 CLB columns • A EURECA module adds 1.15 ns of delay (typical circuit critical-path delay on this architecture: 20-40 ns)

  22. EURECA Prototype

  23. Application 1: Large-scale Sorting • Large-scale sorting: dynamic FIFOs • At each cycle, an unknown number of values is committed from each FIFO • The starting address changes from cycle to cycle

  24. Application 2: Memcached • Memcached: dynamic cache • A vector of data pointed to by a dynamic pointer is read each cycle • The address changes from cycle to cycle

  25. Application 3: SpMV • SpMV: dynamic shared cache • Accessed locations in the vector depend on the positions of non-zeros in the sparse matrix • Each vector access is random • A scheduler module enables only one runtime reconfiguration per memory port
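
The random access pattern in SpMV is easiest to see in a standard kernel. The sketch below uses the common CSR (compressed sparse row) layout, which is an assumption: the slides do not say which sparse format the implementation uses. The read `x[col_idx[k]]` is the dynamic access, since its address is only known once `col_idx[k]` has been fetched.

```c
#include <stddef.h>

/* y = A * x for a sparse matrix A stored in CSR form.
   row_ptr has rows + 1 entries; col_idx and val have row_ptr[rows] entries. */
void spmv_csr(size_t rows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y) {
    for (size_t r = 0; r < rows; r++) {
        double acc = 0.0;
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; k++) {
            /* Data-dependent index: which element of x is read depends on
               where the non-zeros sit in row r. */
            acc += val[k] * x[col_idx[k]];
        }
        y[r] = acc;
    }
}
```

When several such multiply-accumulate data-paths run in parallel, two of them may request the same element of `x` in the same cycle, which is the conflict the scheduler has to resolve.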

  26. Experimental Setup • Target chip • Prototype EURECA chip • Area and delay measured from Cadence simulation results • Synthesis environment • Design Compiler (DC) for circuit synthesis • ABC for mapping • A graph matching algorithm for packing • Simulated annealing algorithm for placement • Path-finder for routing

  27. Application Performance • Three implementations developed for each application • Static: the design uses LUTs to support all dynamic data accesses • EURECA: supports dynamic accesses with the original architecture • Dynamic: uses the optimised architecture, supporting maximum configuration sharing

  28. Application Performance • Three implementations developed for each application • Up to 11.2 times area-delay reduction compared with the static design • Up to 1.39 times area-delay reduction compared with the original EURECA architecture

  29. Application Performance • Area-delay reduction grows linearly as the number of BRAMs connected to a EURECA module grows • Area reduction increases to 16 for 32 BRAMs

  30. Current and future work • EURECA full-stack compiler • EURECA programming models • operation mapping • data access mapping • many-region communication • EURECA simulator for architecture optimisation • Automatically explore the design space with generated architecture files and Verilog modules

  31. Summary: EURECA • Routing challenge: dynamic data access applications • Three data access patterns • Optimised reconfiguration strategies • Design space exploration and prototype layout • Experimental results: prototype layout + synthesis tool • small area and delay overhead • Application performance • up to 1/11.2x area-delay product compared with static architecture • up to 1/1.39x area-delay product compared with initial architecture
