
Performance Modeling for Fast IP Lookups

Girija Narlikar, joint work with Francis Zane. Bell Laboratories, Lucent Technologies. Appeared in Proc. SIGMETRICS '01.


Presentation Transcript


  1. Performance Modeling for Fast IP Lookups Girija Narlikar Joint work with Francis Zane Bell Laboratories, Lucent Technologies Appeared in Proc. SIGMETRICS ’01

  2. What is IP Lookup? • Input: Table of IP address prefixes (networks), stream of packets • Output: Longest matching prefix for each packet • Applications: routing, accounting, clustering [Figure: route table mapping the prefixes 11*, 1100*, 010*, 0110*, 10* to actions a1 through a5; example lookups: lookup(10011001) = 10*, lookup(11001011) = 1100*, lookup(11011010) = 11*]
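As a minimal sketch of the lookup semantics, using the slide's example prefixes as binary strings (the function returns the matched prefix, as in the slide's example lookups; real routers use trie structures rather than this linear scan):

```python
# Longest-prefix match over binary-string prefixes (illustrative sketch,
# using the prefixes from the slide's example table).
PREFIXES = ["11", "1100", "010", "0110", "10"]

def lookup(addr):
    """Return the longest prefix in PREFIXES that matches the address."""
    best = None
    for p in PREFIXES:
        if addr.startswith(p) and (best is None or len(p) > len(best)):
            best = p
    return best

print(lookup("10011001"))  # -> 10
print(lookup("11001011"))  # -> 1100
print(lookup("11011010"))  # -> 11
```

The linear scan states the problem; the data structures in the following slides exist to answer the same query in a handful of memory accesses.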

  3. Hardware vs. Software • Core routers: ASICs perform IP lookup; worst-case performance • Edge routers: software IP lookup, e.g. PCs, network processors (IXP, C-Port, Xstream, ...); average-case performance; memory hierarchy matters [Figure: CPU → L1 cache (2 cycles) → L2 cache (10 cycles) → memory (100 cycles)]

  4. Memory gets bigger and slower: cache performance must be considered

  5. Goal • Optimize IP lookup data structures based on characteristics of route table, input traffic and hardware platform (memory hierarchy and processor) • Optimal hardware design of lookup engine for characteristic traffic and tables • Approach • Build accurate performance model to predict performance of data structures

  6. Results • Optimizing data structures for input traffic and hardware yields higher performance • Impact of hardware improvements can be predicted

  7. Simple lookup solution: binary trie [Figure: binary trie over the prefixes 10*, 11*, 010*, 0110*, 1100*, one bit per level] • ~100K entries, 32-bit addresses • too many memory accesses

  8. Optimization to binary trie: multi-level trie with larger strides [Figure: 2-level trie with stride 2 at each level over the same prefixes; shorter prefixes are replicated across the slots they cover] Trade-off between number of accesses and space
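As a sketch of the idea (not the paper's implementation), a fixed-stride multibit trie can be built by expanding each prefix into every stride-aligned slot it covers, so a lookup consumes a whole stride of bits per level:

```python
# Hypothetical fixed-stride multibit trie sketch: each node indexes
# `stride` bits at once; shorter prefixes are replicated ("expanded")
# into every slot they cover, trading space for fewer memory accesses.
def new_node(stride):
    return {"slots": [None] * (2 ** stride), "children": [None] * (2 ** stride)}

def build(prefixes, strides):
    root = new_node(strides[0])
    # Insert shorter prefixes first so longer ones overwrite their replicas.
    for pfx in sorted(prefixes, key=len):
        node, level, used = root, 0, 0
        while len(pfx) - used > strides[level]:   # descend full strides
            i = int(pfx[used:used + strides[level]], 2)
            if node["children"][i] is None:
                node["children"][i] = new_node(strides[level + 1])
            node = node["children"][i]
            used += strides[level]
            level += 1
        rest = pfx[used:]
        pad = strides[level] - len(rest)
        base = int(rest, 2) << pad
        for fill in range(2 ** pad):              # prefix expansion
            node["slots"][base + fill] = pfx
    return root

def lookup(root, addr, strides):
    best, node, used = None, root, 0
    for s in strides:
        if node is None:
            break
        i = int(addr[used:used + s], 2)
        if node["slots"][i] is not None:
            best = node["slots"][i]               # longest match so far
        node = node["children"][i]
        used += s
    return best

root = build(["11", "1100", "010", "0110", "10"], (2, 2))
print(lookup(root, "11001011", (2, 2)))  # -> 1100
```

With strides (2, 2), every lookup touches at most two trie nodes instead of up to four bit-by-bit steps, at the cost of the replicated slots.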

  9. Large Strides = Good Performance? [Figure: 1-level trie with stride 4 over the prefixes 010*, 0110*, 10*, 11*, 1100*; each prefix is replicated into all 16 slots it covers] More space → more replication → more cache misses → poor performance?

  10. Non-uniform distribution of prefixes

  11. Non-uniform accesses to prefixes

  12. Optimizing for cache performance [Figure: lookup time vs. stride; small strides incur too many memory accesses, large strides too many cache misses, and the optimum lies in between]

  13. Performance Model • Inputs: hardware parameters; distribution of packets to prefixes in the route table; lookup data structure • Output: average lookup time

  14. Space of data structures: multi-level tries with splay trees at the trie leaves [Figure: trie levels with splay trees (binary, self-adjusting) hanging off the leaves] Goal: find the appropriate number of levels and stride values

  15. Performance model Average lookup time = M1 x tL1 + M2 x tL2 + H x ttrie + T x ttree, where M1 = # L1 cache misses, tL1 = L1 miss latency; M2 = # L2 cache misses, tL2 = L2 miss latency; H = # trie nodes visited, ttrie = time to visit a trie node; T = # tree nodes visited, ttree = time to visit a tree node [Figure: CPU → L1 cache → L2 cache → memory, with latencies tL1 and tL2]
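The model is a straight linear combination of per-event counts and costs; a small sketch, where every parameter value in the example is illustrative rather than one of the paper's measurements:

```python
# The slide's average-lookup-time model as a function; the example
# values below are hypothetical, not the paper's measured numbers.
def avg_lookup_time(M1, M2, H, T, t_L1, t_L2, t_trie, t_tree):
    """Average lookup time = M1*tL1 + M2*tL2 + H*ttrie + T*ttree."""
    return M1 * t_L1 + M2 * t_L2 + H * t_trie + T * t_tree

# e.g. 0.5 L1 misses and 0.1 L2 misses per lookup, 2 trie nodes and
# 1 splay-tree node visited, with hypothetical per-event costs in ns:
print(avg_lookup_time(M1=0.5, M2=0.1, H=2, T=1,
                      t_L1=38, t_L2=100, t_trie=5, t_tree=8))
```

The counts M1, M2, H, and T depend on the chosen data structure and the traffic; the next slides show how they are predicted.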

  16. Predicting cache misses [Figure: N memory blocks with access probabilities p1, p2, ..., pN mapped onto C cache blocks, each with probability 1/C] P(miss for memory block i) = pi x (1 - C pi) for a direct-mapped cache, pi x (1 - pi)^C for a fully associative cache, and pi x (1 - C pi / n)^n for an n-way set-associative cache

  17. Obtaining access counts Input: number of hits to each prefix • Trie nodes: count(v) = sum of count(u) over children u of v • Splay tree nodes: assume E(S) accesses per lookup in splay tree S (3 E(S) in theory); T = weighted average of E(S) • Accumulate access counts in 32n global counters to search for an n-level trie
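The trie-node rule, an internal node's count being the sum of its children's counts, can be sketched as follows (the node layout here is hypothetical, not the paper's):

```python
# Propagate measured per-prefix hit counts up a trie: each internal
# node's access count is the sum of its children's counts.
# Hypothetical node layout: [count, list_of_children].
def fill_counts(node):
    count, children = node
    if not children:                # leaf: carries a measured hit count
        return count
    node[0] = sum(fill_counts(c) for c in children)
    return node[0]

leaf = lambda hits: [hits, []]
root = [0, [[0, [leaf(5), leaf(3)]], leaf(2)]]
print(fill_counts(root))  # -> 10
```

After this pass, every node's count gives the access probability of its memory block (after normalizing by total hits), which feeds the cache-miss model above.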

  18. Hardware platforms

  Processor             L1 cache             L1 miss (tL1)   L2 cache               L2 miss (tL2)
  400 MHz Pentium-II    16KB on-chip, 4-way  38 ns           512KB off-chip, 4-way  100 ns
  700 MHz Pentium-III   16KB on-chip, 4-way  10 ns           256KB on-chip, 8-way   100 ns

  ttree and ttrie were obtained from VTune and confirmed via data fitting.

  19. Packet Traces Distribution of packets to prefixes in the 52K-entry Mae-East BGP table • Synthetic traces: Rand-Net, Rand-IP • Real traces: ISP, SDC [Figure: packet-to-prefix distributions for the four traces]

  20. Model Validation Using measured (not predicted) M1, M2, T, H: Average lookup time = M1 x tL1 + M2 x tL2 + T x ttree + H x ttrie [Figure: measured vs. model lookup times for a 1-level trie on Rand-Net, ISP, Rand-IP, SDC]

  21. Model Validation (cont'd) [Figure: measured vs. model lookup times for a 1-level trie on Rand-Net, ISP, Rand-IP, SDC]

  22. Model Validation (cont'd) [Figure: measured vs. model lookup times for a 2-level trie on Rand-Net, ISP, SDC]

  23. “Best” lookup data structures (Meas. = measured lookup time, Model = predicted lookup time, Struct. = best structure found)

  Trace      Pentium II: Meas.  Model  Struct.         Pentium III: Meas.  Model  Struct.
  Rand-Net               242    235    T3(16,24,28)                 168    164    T2(16,24)
  ISP                    197    202    T2(21,24)                    131    149    T3(21,24,27)
  Rand-IP                140    142    T2(16,24)                    89     108    T3(16,24,28)
  SDC                    89     104    T1(21)                       50     62     T1(21)

  24. Using suboptimal structures: % loss in performance (Pentium II) when an input trace runs on the structure that is optimal for another trace

                Structure optimized for:
  Input trace   Rand-Net   ISP    Rand-IP   SDC
  Rand-Net      --         15.7   0.0       58.3
  ISP           29.9       --     29.9      11.3
  Rand-IP       0.0        33.1   --        35.0
  SDC           31.5       20.4   31.5      --

  25. Impact of hardware improvements. Example: L2 cache size [Figure: predicted lookup time vs. L2 cache size, Pentium III]

  26. Processor and L2 speeds [Figure: predicted lookup time vs. processor and L2 speeds; ISP trace, Pentium II architecture model]

  27. Conclusions • Possible to predict (within ~10% accuracy) average-case performance for IP lookup: the memory hierarchy cannot be ignored • “Best” data structure depends on input trace and lookup hardware • Performance model could be used to design future lookup architectures • can search space of hardware configurations under cost constraints

  28. Total data structure space Pentium III

  29. L1 cache size Pentium III
