Learn techniques such as memory-absolute and memory-relative tips, instruction selection, and task partitioning to reduce overhead and latency in programming. Discover hardware instructions and coding skills to optimize code performance.
IXP Training Part 3: Programming Tips 2011.04.12
Outline • Memory Absolute • Instruction Selection • Task Partition • Memory Relative • Reducing Overhead • Reduce the number of memory accesses • Reduce average access latency • Hiding Overhead NCKU CSIE CIAL Lab
Memory Absolute Tips • Instruction Selection • General Coding Skill • Use Hardware Instruction • Task Partition • Multi-Processing • Context-Pipelining
Coding Skill • Loop Unrolling • Shift Operation • Inline Function • __inline & __forceinline • Branch Prediction • Branch Prediction Penalty
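Two of the coding skills listed above, loop unrolling and shift operations, can be sketched in portable C. This is a host-side illustration only, not Microengine C; the function names and the restriction that the length is a multiple of four are assumptions made to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: one add, one compare and one branch per element. */
uint32_t sum_simple(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Unrolled by 4: one compare and one branch per four elements.
 * Assumption for this sketch: n is a multiple of 4. */
uint32_t sum_unrolled4(const uint32_t *v, size_t n)
{
    assert(n % 4 == 0);
    uint32_t s = 0;
    for (size_t i = 0; i < n; i += 4)
        s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    return s;
}

/* Shifts replace multiplication and division by powers of two
 * (valid for unsigned values). */
static inline uint32_t times8(uint32_t x) { return x << 3; }
static inline uint32_t div16(uint32_t x)  { return x >> 4; }
```

The unrolled loop trades code size for fewer branches, which matters most where each taken branch carries a penalty.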
Hardware Instruction • POP_COUNT • FFS • Multiply • CRC • Hashing • CAM
POP_COUNT--Brief • Population Count • Reports the number of bits set in a 32-bit register • Example: • pop_count( 0x3121 ) = ? • 0x3121 = 0011 0001 0010 0001 • Result = 5
POP_COUNT--Naïve Implementation
    unsigned int pop_count_for(unsigned int x)
    {
        unsigned int y = 0;
        unsigned int i;
        for (i = 0; i < 32; i++) {
            if ((x & 1) == 1)
                y++;
            x = x >> 1;
        }
        return y;
    }
POP_COUNT--Faster Implementation
    unsigned int pop_count_agg(unsigned int x)
    {
        x -= ((x >> 1) & 0x55555555);                     /* 2-bit partial sums */
        x = (((x >> 2) & 0x33333333) + (x & 0x33333333)); /* 4-bit partial sums */
        x = (((x >> 4) + x) & 0x0f0f0f0f);                /* 8-bit partial sums */
        x += (x >> 8);
        x += (x >> 16);
        return (x & 0x0000003f);
    }
Reference: http://aggregate.org/MAGIC/
POP_COUNT--Hardware Instruction
    unsigned int pop_count_hardware(unsigned int x)
    {
        return pop_count(x);
    }
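Because the pop_count intrinsic is only available on the microengine, the two software versions can be checked against each other on a host machine first. The sketch below is plain C with fixed-width types; the helper pop_count_versions_agree and its sample values are hypothetical additions, not part of the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-by-bit version: 32 iterations regardless of the input. */
uint32_t pop_count_for(uint32_t x)
{
    uint32_t y = 0;
    for (unsigned int i = 0; i < 32; i++) {
        if ((x & 1u) == 1u)
            y++;
        x >>= 1;
    }
    return y;
}

/* Parallel version: sums neighbouring bit fields of growing width. */
uint32_t pop_count_agg(uint32_t x)
{
    x -= ((x >> 1) & 0x55555555u);
    x = ((x >> 2) & 0x33333333u) + (x & 0x33333333u);
    x = ((x >> 4) + x) & 0x0f0f0f0fu;
    x += (x >> 8);
    x += (x >> 16);
    return x & 0x0000003fu;
}

/* Hypothetical helper: returns 1 if both versions agree on a few samples. */
int pop_count_versions_agree(void)
{
    static const uint32_t samples[] = {0u, 1u, 0x3121u, 0x80000000u, 0xFFFFFFFFu};
    for (unsigned int i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (pop_count_for(samples[i]) != pop_count_agg(samples[i]))
            return 0;
    assert(pop_count_agg(0x3121u) == 5); /* example from the slides */
    return 1;
}
```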
POP_COUNT--Additional Information • Bitmap-RFC (Liu, TECS 2008)
FFS • Finds the first (least significant) bit set in the data and returns its position, counting from 0 • Example: • ffs( 0x3121 ) = 0 • 0011 0001 0010 0001 • ffs( 0x3120 ) = 5 • 0011 0001 0010 0000 • ffs( 0x3100 ) = 8 • 0011 0001 0000 0000
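For comparison with the hardware instruction, a software version of the same search might look like the sketch below. It is portable C, not the ME intrinsic; the name ffs_soft and the return value of 32 for a zero input are choices made for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the 0-based position of the least significant set bit.
 * Returning 32 when x == 0 is an assumption of this sketch. */
unsigned int ffs_soft(uint32_t x)
{
    unsigned int pos = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        pos++;
    }
    return pos;
}

/* Re-checks the three examples given on the slide. */
void ffs_soft_selfcheck(void)
{
    assert(ffs_soft(0x3121u) == 0);
    assert(ffs_soft(0x3120u) == 5);
    assert(ffs_soft(0x3100u) == 8);
}
```

The loop runs once per trailing zero bit, which is the cost the single hardware instruction avoids.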
Multiply • Specific Multiply Instruction • Multiply_24x8() • Multiply_16x16() • Multiply_32x32_hi() • Multiply_32x32_lo()
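The names of the multiply instructions suggest their operand widths and which half of the product they return. The host-side models below are a sketch under that reading; they are hypothetical stand-ins written in portable C, not the intrinsics themselves, and the exact ME semantics should be confirmed against the compiler manual.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed meaning: low 32 bits of the 64-bit product. */
uint32_t mul_32x32_lo(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Assumed meaning: high 32 bits of the 64-bit product. */
uint32_t mul_32x32_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Assumed meaning: 24-bit by 8-bit operands, so the product fits in 32 bits. */
uint32_t mul_24x8(uint32_t a, uint32_t b)
{
    assert(a <= 0xFFFFFFu && b <= 0xFFu);
    return a * b;
}
```

Choosing the narrowest variant that fits the operands is the point of the slide: narrower multiplies are expected to cost fewer cycles.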
CRC • Example of CRC operation
    crc_write( 0x42424242 );
    crc_32_be( source_address, bytes_0_3 );
    crc_32_be( dest_address, bytes_0_3 );
    …
    Cache_index = crc_read();
Hash • Hash_48() • Hash_64() • Hash_128() • Example:
    SIGNAL sig_hash;
    Hash_48(data_out, data_in, count, sig_done, &sig_hash);
    __wait_for_all(&sig_hash);
CAM--Brief • Each ME has 16 32-bit CAM entries • The CAM is private to each ME and is not shared with other MEs • On a lookup, all entries are searched in parallel • On a successful lookup, the index of the matching entry is returned • Otherwise, the index of the entry to be replaced is returned
Content Addressable Memory--Structure • cam_lookup_t
CAM--Usage
    cam_lookup_t cam_result;
    cam_result = cam_lookup( data );
    if( cam_result.hit == 1 ) {
        /* Hit: access the entry at cam_result.entry_num */
        …
    } else {
        /* Miss: cam_result.entry_num is the entry to replace */
        ……
        cam_write( cam_result.entry_num, data, 15 );
    }
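The lookup-then-write pattern above can be tried out on a host with a small software model of the CAM. The sketch below is an illustration in portable C: the cam_model_* names are hypothetical, the search is sequential rather than parallel, and choosing the least recently used entry as the one to replace is an assumption of this model.

```c
#include <assert.h>
#include <stdint.h>

#define CAM_ENTRIES 16

typedef struct {
    unsigned int hit;        /* 1 on a match, 0 on a miss */
    unsigned int entry_num;  /* matching entry, or the entry to replace */
} cam_model_result_t;

typedef struct {
    uint32_t tag[CAM_ENTRIES];
    uint8_t  valid[CAM_ENTRIES];
    uint32_t last_use[CAM_ENTRIES]; /* logical timestamp, 0 = never used */
    uint32_t clock;
} cam_model_t;

void cam_model_clear(cam_model_t *c)
{
    for (unsigned int i = 0; i < CAM_ENTRIES; i++) {
        c->tag[i] = 0;
        c->valid[i] = 0;
        c->last_use[i] = 0;
    }
    c->clock = 0;
}

/* On a hit, returns the matching index; on a miss, returns the entry
 * with the oldest timestamp (LRU is an assumption of this model). */
cam_model_result_t cam_model_lookup(cam_model_t *c, uint32_t data)
{
    cam_model_result_t r = {0, 0};
    unsigned int victim = 0;
    for (unsigned int i = 0; i < CAM_ENTRIES; i++) {
        if (c->valid[i] && c->tag[i] == data) {
            c->last_use[i] = ++c->clock;
            r.hit = 1;
            r.entry_num = i;
            return r;
        }
        if (c->last_use[i] < c->last_use[victim])
            victim = i;
    }
    r.entry_num = victim;
    return r;
}

void cam_model_write(cam_model_t *c, unsigned int entry_num, uint32_t data)
{
    assert(entry_num < CAM_ENTRIES);
    c->tag[entry_num] = data;
    c->valid[entry_num] = 1;
    c->last_use[entry_num] = ++c->clock;
}
```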
Task Partition • Multi-Processing • More computing power • Easy to implement • Context-Pipelining • More usable resources • Hard to balance
Memory Relative Tips--Reducing Overhead • Reduce the number of memory accesses • Wide-word Accesses • Result Caches • Reduce average access latency • Multi-level Memory Hierarchy • Data Cache
Wide-Word Accesses--Brief • Fetch the needed data in batches • Reduces the number of accesses required • Useful when the data form a linked-list-like structure
Wide-Word Accesses--Example
Wide-Word Accesses--Usage (One Node per Access)
    __declspec(sram_read_reg) UINT32 A;
    SIGNAL sig_read;
    sram_read( &A, MEM_ADDR+(i*4), 1, sig_done, &sig_read );
    __wait_for_all( &sig_read );
    /* Access A ...... */
Result: 8 accesses are needed
Wide-Word Accesses--Usage (Two Nodes per Access)
    __declspec(sram_read_reg) UINT32 A[2];
    SIGNAL sig_read;
    sram_read( &A, MEM_ADDR+(i*8), 2, sig_done, &sig_read );
    __wait_for_all( &sig_read );
    /* Access A ...... */
Result: 4 accesses are needed
Wide-Word Accesses--Usage (Four Nodes per Access)
    __declspec(sram_read_reg) UINT32 A[4];
    SIGNAL sig_read;
    sram_read( &A, MEM_ADDR+(i*16), 4, sig_done, &sig_read );
    __wait_for_all( &sig_read );
    /* Access A ...... */
Result: 2 accesses are needed
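The access counts on the three usage slides (8, 4 and 2) follow directly from how many longwords each read transfers. The host-side sketch below reproduces that arithmetic; fake_sram and read_all are hypothetical stand-ins, with memcpy playing the role of one sram_read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOTAL_LW 8u

/* Hypothetical stand-in for the SRAM contents. */
static const uint32_t fake_sram[TOTAL_LW] = {10, 11, 12, 13, 14, 15, 16, 17};

/* Copies lw_per_access longwords per "access" into dst and returns
 * how many accesses were issued to fetch all TOTAL_LW longwords. */
unsigned int read_all(uint32_t *dst, unsigned int lw_per_access)
{
    assert(lw_per_access > 0 && TOTAL_LW % lw_per_access == 0);
    unsigned int accesses = 0;
    for (unsigned int i = 0; i < TOTAL_LW; i += lw_per_access) {
        memcpy(&dst[i], &fake_sram[i], lw_per_access * sizeof(uint32_t));
        accesses++;
    }
    return accesses;
}
```

Calling read_all with 1, 2 and 4 longwords per access yields 8, 4 and 2 accesses for the same 8 longwords of data.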
Wide-Word Accesses--Experiment • Platform: IXP2800 • Total Accesses: 8 LW (8 × 4 bytes)
Wide-Word Accesses--Limitation • Data must be contiguous • Suitable for linear search • Does not support random accesses • The number of transfer registers is fixed • Each thread has 16 read / write registers • The transfer registers may already be reserved for other uses
Result Cache--Brief • Caches the result of the application • If the same fields appear again, the cached result is returned • Memory accesses are reduced on a cache hit • Depends on the temporal locality of the traffic
Result Cache--IXP2400 • No hardware cache is supported in the IXP2400 ME • A set-associative cache is not easy to implement • The replacement policy also adds overhead
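One way to sidestep both the set-associative lookup and the replacement policy is a direct-mapped software cache, where each key maps to exactly one slot. The sketch below is a host-side illustration in portable C, not the design used in the slides; the names, the 64-entry size and the multiplicative hash are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define RC_SIZE 64u  /* hypothetical size: 2^6 entries */

typedef struct {
    uint32_t key[RC_SIZE];
    uint32_t result[RC_SIZE];
    uint8_t  valid[RC_SIZE];
} result_cache_t;

/* Maps a key to a slot using the top 6 bits of a multiplicative hash.
 * Stands in for whatever hash the application would really use. */
static unsigned int rc_index(uint32_t key)
{
    return (uint32_t)(key * 2654435761u) >> 26;
}

/* Returns 1 and stores the cached result on a hit, 0 on a miss. */
int rc_lookup(const result_cache_t *c, uint32_t key, uint32_t *result)
{
    unsigned int i = rc_index(key);
    if (c->valid[i] && c->key[i] == key) {
        *result = c->result[i];
        return 1;
    }
    return 0;
}

/* Overwrites whatever occupied the slot: no replacement policy needed. */
void rc_insert(result_cache_t *c, uint32_t key, uint32_t result)
{
    unsigned int i = rc_index(key);
    assert(i < RC_SIZE);
    c->key[i] = key;
    c->result[i] = result;
    c->valid[i] = 1;
}
```

The price of this simplicity is conflict misses: two hot keys that hash to the same slot keep evicting each other.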
Result Cache--Design Consideration • Shared or Private Cache ? • Size of Cache ? • Works with specific Hardware ? • Miss penalty handling ?
Result Cache--Experiment
Multi-Level Memory Hierarchy--Brief • Reduces the average access latency • The number of accesses remains unchanged • If the data fit in a faster memory, place them there
Multi-Level Memory Hierarchy--Data Placement • Smallest, read-only data: hard-code it • Small data that needs updating: Local Memory • Larger data: Scratchpad • Largest data: SRAM
Multi-Level Memory Hierarchy--Packet Data Type • Packet-related data: temporary data, valid only for a specific packet; keep in Local Memory • Flow-related data: tied to a specific flow; has spatial locality; use wide-word access • Application-related data: valid for a specific application; has temporal locality; use a result cache
Split-Cache (Z. Liu, IET-COM 2007) • Two separate hardware caches, one for application data and one for flow data
Data Cache--Brief • A hardware cache mechanism that caches the data used for packet processing • App-Cache • Flow-Cache • However, it is not supported by the IXP2400
Data Cache--CAM + Local Memory • The CAM combined with Local Memory acts like a hardware cache • However, the number of CAM entries is small • Each CAM entry may work together with several Local Memory cache entries
Memory Relative Tips--Hiding Overhead • Does not actually reduce the overhead, but overlaps it with other work • Hardware Multi-Threading • Asynchronous Memory
Hardware Multi-Threading • A thread swaps itself out and lets another thread execute while it accesses memory • Each thread keeps its own set of registers, so no stack is needed for thread swapping • Round-robin scheduling • Threads are not preempted
Asynchronous Memory--Brief • A thread is not blocked when it issues a memory request • Thus, a thread can issue multiple memory requests at a time
Asynchronous Memory--Example (1 Issue)
    Read X
    __wait_for_all( &sig_x )
    Read Y
    __wait_for_all( &sig_y )
    // Use X and Y
    …
Asynchronous Memory--Example (2 Issue)
    Read X
    Read Y
    __wait_for_all( &sig_x, &sig_y )
    // Use X and Y
    …
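The benefit of issuing both reads before waiting can be estimated with a very simple cost model. The sketch below assumes that each request costs a fixed number of cycles to issue and that the memory latencies of outstanding requests overlap completely; both the model and the function names are assumptions for illustration, not measurements from the slides.

```c
#include <assert.h>

/* One request at a time: each read pays its issue cost and full latency. */
unsigned int cycles_one_issue(unsigned int n, unsigned int issue, unsigned int latency)
{
    return n * (issue + latency);
}

/* All n requests issued back to back, then one wait: the latencies are
 * assumed to overlap fully, so only the last one is exposed. */
unsigned int cycles_multi_issue(unsigned int n, unsigned int issue, unsigned int latency)
{
    assert(n > 0);
    return n * issue + latency;
}
```

With illustrative values of 1 cycle to issue and 100 cycles of latency, two reads cost 202 cycles when issued one at a time and 102 cycles when issued together under this model.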
Wide-Word Access + Multiple Issues
Wide-Word Access + Multiple Issues (1 LW, 2 Issue)
Wide-Word Access + Multiple Issues (2 LW, 2 Issue)
Wide-Word Access + Multiple Issues (4 LW, 2 Issue)
Wide-Word Access + Multiple Issues (Experiment)
Reference (1) • Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar, “Overcoming the memory wall in packet processing: hammers or ladders?”, ANCS 2005 • Duo Liu, Zheng Chen, Bei Hua, Nenghai Yu, Xinan Tang, “High-Performance Packet Classification Algorithm for Multithreaded IXP Network Processor”, ACM TECS 2008
Reference (2) • Z. Liu, K. Zheng, B. Liu, “Hybrid cache architecture for high-speed packet processing”, IET-COM 2007