540 likes | 840 Vues
Longest Prefix Matching Trie-based Techniques. CS 685 Network Algorithmics Spring 2006. The Problem. Given: Database of prefixes with associated next hops, say: 1000101*  128.44.2.3 01101100*  4.33.2.1 10001*  124.33.55.12 10*  151.63.10.111 01*  4.33.2.1
 
                
                E N D
Longest Prefix MatchingTrie-based Techniques CS 685 Network Algorithmics Spring 2006 CS 685 Network Algorithmics
The Problem • Given: • Database of prefixes with associated next hops, say: 1000101* 128.44.2.3 01101100*  4.33.2.1 10001*  124.33.55.12 10*  151.63.10.111 01*  4.33.2.1 1000100101*  128.44.2.3 • Destination IP address, e.g. 120.16.8.211 • Find: the longest matching prefix and its next hop CS 685 Network Algorithmics
Constraints • Handle 150,000 prefixes in database • Complete lookup in minimum-sized (40-byte) packet transmission time • OC-768 (40 Gbps): 8 nsec • High degree of multiplexing—packets from 250,000 flows interleaved • Database updated every few milliseconds  performance  number of memory accesses CS 685 Network Algorithmics
Basic ("Unibit") Trie Approach • Recursive data structure (a tree) • Nodes represent prefixes in the database • Root corresponds to prefix of length zero • Node for prefix x has three fields: • 0 branch: pointer to node for prefix x0 (if present) • 1 branch: pointer to node for prefix x1 (if present) • Next hop info for x (if present) Example Database: a: 0*  x b: 01000*  y c: 011*  z d: 1*  w e: 100*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x CS 685 Network Algorithmics
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ax hz fz dw cz gu by ix ew a: 0*  x b: 01000*  y c: 011*  z d: 1*  w e: 100*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x 0 1 CS 685 Network Algorithmics
Trie Search Algorithm typedef struct foo { struct foo *trie_0, *trie_1; NEXTHOPINFO trie_info; } *TRIENODE; NEXTHOPINFO best = NULL; TRIENODE np = root; unsigned int bit = 0x80000000; while (np != NULL) { if (np->trie_info) best = np->trie_info; // check next bit if (addr&bit) np = np->trie_1; else np = np->trie_0; bit >>= 1; } return best; CS 685 Network Algorithmics
Conserving Space Sparse database  wasted space • Long chains of trie nodes with only one non-NULL pointer • Solution: handle "one-way" branches with special nodes • encode the bits corresponding to the missing nodes using text strings CS 685 Network Algorithmics
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ax hz fz dw cz gu by ix eu a: 0*  x b: 01000*  y c: 011*  z d: 1*  w e: 100*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x 0 1 CS 685 Network Algorithmics
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ax gu dw ix fz cz eu by hz a: 0*  x b: 01000*  y c: 011*  z d: 1*  w e: 100*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x 0 1 00 CS 685 Network Algorithmics
Bigger Issue: Slow! • Computing one bit at a time is too slow • Worst-case: one memory access per bit (32 accesses!) • Solution: compute n bits at a time • n = stride length • Use n-bit chunks of addresses as index into array in each trie node • How to handle prefixes which are not a multiple of n in length? • Extend them, replicate entries as needed • E.g. n=3, 1* becomes 100, 101, 110, 111 CS 685 Network Algorithmics
Extending Prefixes Original Database a: 0*  x b: 01000*  y c: 011*  z d: 1*  w e: 100*  w f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x Expanded Database a0: 00*  x a1: 01*  x b0: 010000*  y b1: 010001*  y c0: 0110*  z c1: 0111* z d0: 10*  w d1: 11*  w e0: 1000*  u e1: 1001*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x Example: stride length=2 CS 685 Network Algorithmics
z u y 00 x 00 00 00 00 y u u 01 01 01 01 01 x z z 10 w 10 10 10 10 z x 11 w 11 11 11 11 Expanded Database a0: 00*  x a1: 01*  x b0: 010000*  y b1: 010001*  y c0: 0110*  z c1: 0111* z d0: 10*  w d1: 11*  w e0: 1000*  w e1: 1001*  w f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x Total cost: 40 pointers (22 null) Max #memory accesses: 3 CS 685 Network Algorithmics
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x z x by u w z u z a: 0*  x b: 01000*  y c: 011*  z d: 1*  w e: 100*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x 0 1 00 Total cost: 46 pointers (21 null) Max #memory accesses: 5 CS 685 Network Algorithmics
Choosing Fixed Stride Lengths • We are trading space for time: • Larger stride length  fewer memory accesses • Larger stride length  more wasted space • Use the largest stride length that will fit in memory and complete required accesses within the time budget CS 685 Network Algorithmics
Updating Insertion • Keep a unibit version of the trie, with each node labeled with longest matching prefix and its length • To insert P, search for P, remembering last node, until • Null pointer (not present), or • Reach the last stride in P • Expand P as needed to match stride length • Overwrite any existing entries with length less than P's Deletion is similar • Find entry for prefix to be deleted • Remove its entry (from unibit copy also!) • Expand any entries that were "covered" by the deleted prefix CS 685 Network Algorithmics
Variable Stride Lengths • It is not necessary that every node have the same stride length • Reduce waste by allowing stride length to vary per node • Actual stride length encoded in pointer to the trie node • Nodes with fewer used pointers can have smaller stride lengths CS 685 Network Algorithmics
z 00 x 00 00 y 0 u 01 01 01 x 1 z z 10 w 10 10 z x 11 w 11 11 Expanded Database a0: 00*  x a1: 01*  x b: 01000*  y c0: 0110*  z c1: 0111* z d0: 10*  w d1: 11*  w e: 100*  w f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x 1 bit 2 bits 2 bits 1 bit u 0 1 Total waste: 16 pointers Max #memory accesses: 3 Note: encoding stride length costs 2 bits/pointer CS 685 Network Algorithmics
Calculating Stride Lengths • How to pick stride lengths? • We have two variables to play with: height and stride length • Trie height determines lookup speed  set max height first • Call ith • Then choose strides to minimize storage • Define cost of trie T, C(T): • If T is a single node, then number of array locations in the node • Else number of array locations in root + i C(Ti), where Ti's are children of T • Straightforward recursive solution: • Root stride s results in y=2s subtries T1, ... Ty • For each possible s, recursively compute optimal strides for C(Ti)'s using height limit h-1 • Choose root stride s to minimize total cost = (2s + i C(Ti)) CS 685 Network Algorithmics
Calculating Stride Lengths • Problem: Expensive, repeated subproblems • Solution (Srinivasan & Varghese): Dynamic programming • Observe that each subtree of a variable-stride trie contains the set of prefixes as some subtree of the original unibit trie • For each node of the unibit trie, compute optimal stride and cost for that stride • Start at bottom (height = 1), work up • Determine optimal grouping of leaves in subtree • Given subtree optimal costs, compute parent optimal cost • This results in optimal stride length selections for the given set of prefixes CS 685 Network Algorithmics
Stride = 1 Cost = 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Stride = 2 Cost = 4 Stride = 0 Cost = 1 x x u u z w by z z Stride = 1 Cost = 2 Stride = 0 Cost = 1 Stride = 0 Cost = 1 0 1 00 CS 685 Network Algorithmics
Alternative Method: Level Compression • LC-trie (Nilsson & Karlsson '98) is a variable-stride trie with no empty entries in trie nodes • Procedure: • Select largest root stride that allows no empty entries • Do this recursively down through the tree • Disadvantage: cannot control height precisely CS 685 Network Algorithmics
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x z w z u by z u x Stride = 1 Stride = 1 0 1 00 Stride = 1 Stride = 2 CS 685 Network Algorithmics
Performance Comparisons • MAE-East database (1997 snapshot) • ~ 40K prefixes • "Unoptimized" multibit trie: 2003 KB • Optimal fixed-stride: 737 KB, computed in 1 msec • Height limit = 4 ( 1 Gbps wire speed @ 80 nsec/access) • Optimized (S&V) variable-stride: 423 KB, computed in 1.6 sec, Height limit = 4 • LC-compressed • Height = 7 • 700 KB CS 685 Network Algorithmics
Lulea Compressed Tries • Goals: • Minimize number of memory accesses • Aggressively compress trie • Goal: so it can fit in SRAM (or even cache) • Three-level trie with strides of 16, 8, 8 • 8 mem accesses typical • Main Techniques • Leaf-pushing • Eliminate duplicate pointers from trie node arrays • Efficient bit-counting using precomputation for large bitmaps • Use of indices instead of full pointers for next-hop info CS 685 Network Algorithmics
1. Leaf-Pushing • In general, a trie node entry has associated • A pointer to a next trie node • A prefix (i.e. pointer to next-hop info) • Or both, or neither • Observation: we don't need to know about a prefix pointer along the way until we reach a leaf • So: "push" prefix pointers down to leaves • Keep only one set of pointers per node CS 685 Network Algorithmics
Leaf-Pushing: the Concept Prefixes CS 685 Network Algorithmics
u y z 00 x 00 00 00 00 u u y 01 01 01 01 01 x z z 10 w 10 10 10 10 z x 11 w 11 11 11 11 00 x 00 y 00 x 01 01 y 01 z 10 x 10 10 z 11 11 x 11 After Leaf-Pushing Cost: 20 pointers z u 00 00 u u 01 01 z w 10 10 x w 11 11 Expanded Database a0: 00*  x a1: 01*  x b0: 010000*  y b1: 010001*  y c0: 0110*  z c1: 0111* z d0: 10*  w d1: 11*  w e0: 1000*  u e1: 1001*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x Before Cost: 40 pointers (22 wasted) CS 685 Network Algorithmics
u 00 u 01 w 10 w 11 u w 2. Removing Duplicate Pointers • Leaf-pushing results in many consecutive duplicate pointers • Would like to remove redundancy and store only one copy in each node • Problem: now we can't directly index into array using address bits • Example: k=2, bits 01 = index 1 needs to be converted to index 0 somehow CS 685 Network Algorithmics
1 0 00 1 01 u 0 10 w 11 2. Removing Duplicate Pointers Solution: Add a bitmap: one bit per original entry • 1 indicates new value • 0 indicates duplicate of previous value • To convert index i, count 1's up to position i in the bitmap, and subtract 1 Example: old index 1  new index 0 old index 2  new index 1 u 00 u 01 w 10 w 11 CS 685 Network Algorithmics
Bitmap for Duplicate Elimination Prefixes 100000000000100010000100000000000000000010000000000100000000000010001000000100000011000000000000000000 CS 685 Network Algorithmics
3. Efficient Bit-Counting • Lulea first-level 16-bit stride  64K entries • Impractical to count bits up to, say, entry 34578 on the fly! • Solution: Precompute (P2a) • Divide bitmap into chunks (say, 64 bits each) • Store the number of 1 bits in each chunk in an array B • Compute # 1 bits up to bit k by: chunkNum = k >> 6; posInChunk = k & 0x3f; // k mod 64 numOnes = B[chunkNum] + count1sInChunk(chunkNum,posInChunk) – 1; CS 685 Network Algorithmics
Bit-Counting Precomputation Example index = 35 Chunk Size = 8 bits 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 1 0 3 3 6 7 9 Converted index = 7 + 2 – 1 = 8 Cost: 2 memory accesses (maybe less) CS 685 Network Algorithmics
4. Efficient Pointer Representation • Observation:the number of different next-hop pointers is limited • Each corresponds to an immediate neighbor of the router • Most routers have at most a few dozen neighbors • In some cases a router might have a few hundred distinct next hops, even a thousand • Apply P7: avoid unnecessary generality • Only a few bits (say 8-12) needed to distinguish between actual next-hop possibilities • Store indices into table of next-hops info • E.g., to support up to 1024 next hops: 10 bits • 40K prefixes  40K pointers  160KB @ 32 bits, vs 50KB @ 10 bits CS 685 Network Algorithmics
Other Lulea Tricks • First level of trie uses two levels of bit-counting array • First counts bits before the 64-bit chunk • Second counts bits in the 16-bit word within chunk • Second- and third-level trie nodes are laid out differently depending on number of pointers in them • Each node has 256 entries • Categorized by number of pointers • 1-8: "sparse" — store 8-bit indices + 8 16-bit pointers (24B) • 9-64: "dense" — like first level, but only one bit-counting array (only six bits of count needed) • 65-256: "very dense" — like first level, with two bit-counting arrays: 4 64-bit chunks, 16 16-bit words CS 685 Network Algorithmics
Lulea Performance Results 1997 MAE-East database • 32K entries, 58K leaves, 56 different next hops • Resulting Trie size: 160KB • Build time: 99 msec • Almost all lookups took < 100 clock cycles (333MHz Pentium) CS 685 Network Algorithmics
Trie Bitmap(Eatherton, Dittia & Varghese) • Goal: storage, speed comparable to Lulea plus fast insertion • Main culprit in slow insertion is leaf-pushing • So get rid of leaf-pushing • Go back to storing node and prefix pointers explicitly • Use the same compression bitmap trick on both lists • Store next-hop information separately, only retrieve at the end • Like leaf-pushing, only in the control plane! • Use smaller strides to limit memory accesses to one per trie node (Lulea requires at least two) CS 685 Network Algorithmics
Storing Prefixes Explicitly • To avoid expansion/leaf pushing, we have to store prefixes in the node explicitly • There are 2k+1 – 1 possible prefixes of length k • Store list of (unique) next hop pointers for each prefix covered by this node • Use same bitmap/bit counting technique as Lulea to find pointer index • Keep trie nodes small (stride 4 or less), exploit hardware (P5) to do prefix matching, bit counting CS 685 Network Algorithmics
Example: Root node, stride = 3 * 0 a: 0*  x b: 01000*  y c: 011*  z d: 1*  w e: 100*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x 0* 1 000 0 x 1* 1 001 0 w 00* 0 010 1 z 01* 0 011 0 u 10* 0 100 0 11* 0 101 0 000* 0 110 1 001* 0 111 0 010* 0 011* 1 100* 1 101* 0 to child nodes 110* 0 111* 0 CS 685 Network Algorithmics
Tree Bitmap Results • Insertions are as in simple multibit tries • May cause complete revamp of trie node, but that requires only one memory allocation • Performance comparable to Lulea, but insertion much faster CS 685 Network Algorithmics
A Different Lookup Paradigm • Can we use binary search to do longest-prefix lookups? • Observe that each prefix corresponds to a range of addresses • E.g. 204.198.76.0/24 covers the range 204.198.76.0 – 204.198.76.255 • Each prefix has two range endpoints • N disjoint prefixes divide the entire space into 2N+1 disjoint segments • By sorting range endpoints, and comparing to address, can determine exact prefix match CS 685 Network Algorithmics
Prefixes as Ranges CS 685 Network Algorithmics
Binary Search on Ranges • Store 2N endpoints in sorted order • Including the full address range for * • Store two pointers for each entry • ">" entry: next-hop info for addresses strictly greater than that value • "=" entry: next-hop info for addresses equal to that value CS 685 Network Algorithmics
> = 000000 x x 010000 y y 010001 x y 011000 x z 011111 x x 100000 u u 100111 w u 110000 z z 110011 x z 110100 u u 110111 x u 111000 z z 111011 x z 111100 x x 111111 - x Example: 6-bit addresses Example Database: a: 0*  x b: 01000*  y c: 011*  z d: 1*  w e: 100*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x a: 000000-011111  x b: 010000-010001  y c: 011000-011111  z d: 100000-111111  w e: 100000-100111 u f: 110000-110011  z g: 110100-110111  u h: 111000-111011  z i: 111100-111111  x CS 685 Network Algorithmics
Range Binary Search Results • N prefixes can be searched in log2 N + 1 steps • Slow compared to multibit tries • Insertion can also be expensive • Memory-expensive • Requires 2 full-size entries per prefix • 40K prefixes, 32-bit addresses: 320KB, not counting next-hop info • Advantage: no patent restrictions! CS 685 Network Algorithmics
Binary Search on Prefix LengthsWaldvogel, et al • For same-length prefixes, a hash table gives fast comparisons • But linear search on prefix lengths is too expensive • Can we do a faster (binary) search on prefix lengths? • Challenge: how do we know whether to move "up" or "down" in length on failure? • Solution: include extra information to indicate presence of a longer prefix that might match • These are called marker entries • Each marker entry also contains best-matching prefix for that node • When searching, remember best-matching prefix when going "up" because of a marker, in case of later failure CS 685 Network Algorithmics
Example: Binary Search on Prefix Length Prefix Lengths: 1, 3, 4, 7 Example Database: a: 0*  x b: 01000*  y c: 011*  z d: 1*  w e: 100*  u f: 1100*  z g: 1101*  u h: 1110*  z i: 1111*  x length 1 0* 1* BMP a,x d,w length 3 011* 100* 110M 111M 010M BMP c,z e,u d,w d,w a,x length 4 1100* 1101* 1110* 1111* 0100M BMP f,z g,u h,z i,x a,x length 5 01000* BMP b,y Example: Search for address 011000 and 101000 CS 685 Network Algorithmics
Binary Search on Prefix Length Performance • Worst-case number of hash-table accesses: 5 • However, most prefixes are 16 or 24 bits • Arrange hash tables so these are handled in one or two accesses • This technique is very scalable for larger address lengths (e.g. 128 bits for IPv6) • Unibit Trie for IPv6: 128 accesses! CS 685 Network Algorithmics
Memory Allocation for Compressed Schemes • Problem: when using a compressed scheme (like Lulea), trie nodes are kept at minimal size • If a node grows (changes size), it must be reallocated and copied over • As we have discussed, memory allocators can perform very badly • Assume M is the size of the largest possible request • Cannot guarantee more than 1/log2 M of memory will be used! • E.g. if M=32, 20% is max guaranteed utilization • Router vendors cannot claim to support large databases CS 685 Network Algorithmics
Memory Allocation for Compressed Schemes • Solution: Compaction • Copy memory from one location to another • General-purpose OS's avoid compaction! • Reason: very hard to find and update all pointers to objects in the moved region • The good news: • Pointer usage is very constrained in IP lookup algorithms • Most lookup structures are trees  at most one pointer to any node • By storing a "parent" pointer, can easily update pointers as needed CS 685 Network Algorithmics