
Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms






Presentation Transcript


  1. Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms Sailesh Kumar Advisors: Jon Turner, Patrick Crowley Committee: Roger Chamberlain, John Lockwood, Bob Morley

  2. Focus on 3 Network Features • In this proposal, we focus on 3 network features: • Packet payload inspection • Network security • Packet header processing • Packet forwarding, classification, etc. • Packet buffering and queuing • QoS

  3. Overview of the Presentation • Packet payload inspection • Previous work • D2FA and CD2FA • New ideas to implement regular expressions • Initial results • IP Lookup • Tries and pipelined tries • Previous work: CAMP • New direction: HEXA • Hashing used for packet header processing • Why do we need better hashing? • Previous work: Segmented Hash • New direction: Peacock Hashing • Packet buffering and queuing • Previous work: multichannel packet buffer, aggregated buffer • New direction: DRAM based buffer, NP based queuing assist

  4. Delayed Input DFA (D2FA), SIGCOMM’06 [Figure: example DFA with states 1–5 and transitions on a, b, c, d] • Many transitions in a DFA • 256 transitions per state • 50+ distinct transitions per state (real-world datasets) • Need 50+ words per state • Can we reduce the number of transitions in a DFA? • Three rules: a+, b+c, c*d+ • 4 transitions per state • Look at state pairs: there are many common transitions. How can we remove them?

  5. Delayed Input DFA (D2FA), SIGCOMM’06 [Figure: the same DFA next to an alternative representation with default transitions] • Many transitions in a DFA • 256 transitions per state • 50+ distinct transitions per state (real-world datasets) • Need 50+ words per state • Can we reduce the number of transitions in a DFA? • Alternative representation: three rules a+, b+c, c*d+ • 4 transitions per state • Fewer transitions, less memory

  6. D2FA Operation [Figure: the DFA and the corresponding D2FA side by side] • Heavy edges are called default transitions • Take a default transition whenever a labeled transition is missing

  7. D2FA versus DFA • D2FAs are compact but require multiple memory accesses • Up to 20x more memory accesses • Not desirable in an off-chip architecture • Can D2FAs match the performance of DFAs? • YES!!!! • Content Addressed D2FAs (CD2FA) • CD2FAs require only one memory access per byte • They match the performance of a DFA in a cacheless system • In systems with a data cache, CD2FAs are 2-3x faster • CD2FAs are 10x more compact than DFAs

  8. Introduction to CD2FA, ANCS’06 [Figure: default-path chain V → U → R, with content labels ab,cd,R on V and cd,R on U] • How do we avoid the multiple memory accesses of D2FAs? • Avoid the lookup that decides whether the default path needs to be taken • Avoid default path traversal • Solution: assign a label to each state; labels contain: • the characters for which it has labeled transitions • information about all of its default states • the characters for which its default states have labeled transitions • Content labels: find node R at location R; find node U at hash(c,d,R); find node V at hash(a,b,hash(c,d,R))

  9. Introduction to CD2FA [Figure: two default-path chains, V → U → R and X → Y → Z, with content labels] • Current state: V (label = ab,cd,R), located at hash(a,b,hash(c,d,R)) • On the input char, the machine moves to X (label = pq,lm,Z), located at hash(p,q,hash(l,m,Z)) • Everything needed to compute the next address is contained in the current state’s label and the input
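To make the content-addressing concrete, here is a minimal sketch. It is an illustration, not the paper's implementation: Python's built-in hash, the table size, and the fixed root location are all stand-in assumptions.

```python
# Minimal sketch of content-based addressing (assumptions: Python's
# hash() stands in for the real hash function; TABLE_SIZE and the root
# location are arbitrary). A state's memory location is computed from
# its content label, so a parent never stores an explicit pointer.
TABLE_SIZE = 1 << 20

def content_address(labeled_chars, default_location):
    """Location of a state = hash of its labeled characters plus the
    location of its default state (its 'content label')."""
    return hash((tuple(sorted(labeled_chars)), default_location)) % TABLE_SIZE

R = 12345                               # root state R at a fixed location
U = content_address(["c", "d"], R)      # find node U at hash(c,d,R)
V = content_address(["a", "b"], U)      # find node V at hash(a,b,hash(c,d,R))
print(U, V)
```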

  10. Construction of CD2FA • We seek to keep the content labels small • Twin objectives: • Ensure that states have few labeled transitions • Ensure that default paths are as short as possible • We propose a new heuristic called CRO to construct CD2FAs • Details in the ANCS’06 paper • With a default path bound of 2 edges, the CRO algorithm constructs up to 10x more space-efficient CD2FAs

  11. Memory Mapping in CD2FA [Figure: states mapped to memory via hash(c,d,R), hash(a,b,hash(c,d,R)) and hash(p,q,hash(l,m,Z)); two labels collide] • WE HAVE ASSUMED THAT HASHING IS COLLISION FREE • In practice, two content labels can hash to the same memory location: COLLISION

  12. Collision-free Memory Mapping [Figure: four states with labels (abc, …), (pqr, …), (lmn, …), (def, …) and 4 memory locations; each state hashes to multiple candidate locations, e.g. hash(lmn, …) and hash(mln, …)] • Add edges for all possible choices

  13. Bipartite Graph Matching • Bipartite graph: • Left nodes are state content labels • Right nodes are memory locations • An edge for every choice of content label • Map state labels to unique memory locations • This is a perfect matching problem • With n left and n right nodes, we need O(log n) random edges • n = 1M implies we need ~20 edges per node • If we provide slight memory over-provisioning, we can uniquely map state labels with far fewer edges • In our experiments, we found perfect matchings without memory over-provisioning (see the sketch below)
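The matching step can be illustrated with the standard augmenting-path algorithm. The sketch below is hedged: the random candidate locations, the 20 choices per label, and the label names are assumptions for illustration, not the construction used in the paper.

```python
import random

def find_perfect_matching(labels, num_slots, choices_per_label=20, seed=1):
    """Map each label to a unique memory slot via augmenting paths."""
    rng = random.Random(seed)
    # Each label gets a few random candidate slots (its "edges").
    candidates = {lab: rng.sample(range(num_slots), choices_per_label)
                  for lab in labels}
    slot_owner = {}   # slot -> label currently assigned to it

    def try_assign(lab, visited):
        for slot in candidates[lab]:
            if slot in visited:
                continue
            visited.add(slot)
            # Take a free slot, or evict the owner if it can move elsewhere.
            if slot not in slot_owner or try_assign(slot_owner[slot], visited):
                slot_owner[slot] = lab
                return True
        return False

    for lab in labels:
        if not try_assign(lab, set()):
            return None   # no perfect matching with these random edges
    return {lab: slot for slot, lab in slot_owner.items()}

print(find_perfect_matching([f"label{i}" for i in range(100)], 100) is not None)
```

If the function returns None, one would retry with fresh random edges or slightly more slots, mirroring the over-provisioning point above.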

  14. Reg-ex – New Directions • Three key problems with traditional DFA-based reg-ex matching • 1. They employ the complete signature to parse input data • Even if normal data matches only a small prefix portion • Full signature => large DFA • 2. Only one active state of execution and no memory about previous matches • Combinations of partial matches require new DFA states • 3. Inability to count certain sub-expressions • E.g. a{1024} will require 1024 DFA states • We aim to address each of these problems in the proposed research

  15. Addressing the First Problem • Divide the processing into fast and slow paths • Split the signature into a prefix and a suffix • Employ signature prefixes in the fast path • Upon a match in the fast path, trigger the slow path • Appropriate splitting can maintain a low triggering rate • Benefits: • Fast path can employ a composite DFA for all prefixes • Since the prefixes are small, the composite DFA will remain small • Higher parsing rate • Slow path uses a separate DFA for each signature • No state explosion in the slow path • Due to the low triggering rate, the slow path will not become a bottleneck • Reduces per-flow state • Fast path uses a composite DFA, one active state per flow

  16. Fast and Slow Path Processing • Here we assume that an ε fraction of the flows is diverted to the slow path • Fast path stores a per-flow DFA state • Slow path may store multiple active states

  17. Splitting Reg-exes • Splitting can be performed based upon data traces • Assign probabilities to NFA states and make the cut so that the slow path's cumulative probability is low • r1 = .*[gh]d[^g]*ge • r2 = .*fag[^i]*i[^j]*j • r3 = .*a[gh]i[^l]*[ae]c • Cumulative probability of slow path = 0.05

  18. Splitting Reg-exes • Fast path will contain a composite DFA over the prefixes (14 states): p1 = .*[gh]d[^g]*g, p2 = .*fa, p3 = .*a[gh]i • Full rules: r1 = .*[gh]d[^g]*ge, r2 = .*fag[^i]*i[^j]*j, r3 = .*a[gh]i[^l]*[ae]c • Notice the start state • Slow path will comprise three separate DFAs, one for each signature (see the sketch below)
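A rough sketch of how such split signatures could drive a fast and a slow path, using Python's re module over a whole buffer. This simplifies the per-byte streaming DFA execution described above, and the leading .* is dropped because re.search already scans the buffer.

```python
import re

# Prefixes (fast path) and full rules (slow path) from the slide.
prefixes   = [re.compile(p) for p in (r"[gh]d[^g]*g", r"fa", r"a[gh]i")]
full_rules = [re.compile(r) for r in
              (r"[gh]d[^g]*ge", r"fag[^i]*i[^j]*j", r"a[gh]i[^l]*[ae]c")]

def inspect(data: str):
    # Fast path: one composite scan over the short prefixes.
    if not any(p.search(data) for p in prefixes):
        return []                     # common case: slow path never runs
    # Slow path: run each full signature separately (no state explosion;
    # acceptable because only a small fraction of flows get here).
    return [i for i, r in enumerate(full_rules) if r.search(data)]

print(inspect("xxfag___i___j"))       # matches r2 -> [1]
```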

  19. Protection against DoS Attacks • An attacker can attack such a system by sending data that match the prefixes more often than provisioned • The slow path will become the bottleneck • Solution: look at the history and determine whether a flow is an attack flow • Compute an anomaly index: a weighted moving average of the number of times a flow has triggered the slow path • If a flow has a high anomaly index, send it to a low-rate queue (sketched below)
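A minimal sketch of the anomaly index as an exponentially weighted moving average; the weight alpha and the demotion threshold are illustrative assumptions, not values from the proposal.

```python
class FlowState:
    """Anomaly index = EWMA of slow-path triggers for one flow."""
    def __init__(self, alpha=0.1, threshold=0.5):
        self.alpha, self.threshold = alpha, threshold
        self.anomaly_index = 0.0

    def observe(self, triggered_slow_path):
        """Update the index; True means demote to the low-rate queue."""
        x = 1.0 if triggered_slow_path else 0.0
        self.anomaly_index = (1 - self.alpha) * self.anomaly_index + self.alpha * x
        return self.anomaly_index > self.threshold

f = FlowState()
for _ in range(20):
    f.observe(True)                # a flow that keeps triggering the slow path
print(round(f.anomaly_index, 3))   # index climbs toward 1.0
```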

  20. Initial Simulation Results

  21. Addressing the Second Problem • NFA: compact, but O(n) active states • DFA: 1 active state, but state explosion • How do we avoid state explosion while also keeping the per-flow active state information small? • We propose a novel machine called a History-based Finite Automaton, or H-FA • Augment a DFA with a history buffer • Transitions are taken after looking at the history buffer contents • During certain transitions, items are inserted into/removed from the history buffer • Claim: a small history buffer is sufficient to avoid state explosion while keeping a single active state

  22. Example of H-FA Construction [Figure: DFA] • NFA state 2 is present in 4 DFA states • If we remove NFA state 2 from these DFA states, we will have just 6 states

  23. [Figures: DFA and H-FA] • NFA state 2 is present in 4 DFA states • If we remove NFA state 2 from these DFA states, we will have just 6 states • This new machine uses a history flag, in addition to its transitions, to make moves

  24. H-FA • Input data = c d a b c • Execution trace: (0) -c→ (0) -d→ (0,4) -a→ (0,1) -b→ (0) -c→ (0,3) • The flag is set and reset along the marked transitions: some moves are taken because the flag is set, others because the flag is reset • This new machine uses a history flag, in addition to its transitions, to make moves

  25. H-FA • In general, if we maintain a flag for each NFA state that represents a Kleene closure, we can avoid any state explosion • k closures will require at most k bits in the history buffer • There are challenges associated with the efficient implementation of conditional transitions (see the execution sketch below) • We plan to work on these in the proposed research
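A minimal execution sketch of an H-FA under stated assumptions: a single history flag rather than a general history buffer, and a hypothetical toy transition table. Conditional transitions are encoded as (state, char, required flag value), with None meaning unconditional.

```python
# Minimal H-FA execution sketch (assumptions: one history flag and an
# illustrative transition table). delta maps
# (state, char, required_flag_or_None) -> (next_state, flag_action).
class HFA:
    def __init__(self, delta, start):
        self.delta, self.start = delta, start

    def run(self, data):
        state, flag = self.start, False
        for ch in data:
            # A transition conditioned on the current flag value wins;
            # otherwise fall back to an unconditional transition.
            key = (state, ch, flag)
            if key not in self.delta:
                key = (state, ch, None)
            state, action = self.delta.get(key, (state, None))
            if action == "set":
                flag = True               # e.g. a Kleene closure entered
            elif action == "reset":
                flag = False              # e.g. the closure broke off
        return state, flag

# Toy machine: the flag records that 'a' was seen; 'c' accepts only if set.
delta = {(0, "a", None): (0, "set"),
         (0, "c", True): (1, "reset"),    # taken because flag is set
         (0, "c", False): (0, None)}      # taken because flag is reset
print(HFA(delta, 0).run("bac"))           # -> (1, False)
```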

  26. Addressing the Third Problem [Figure: automaton for the signatures ab[^a]{1024}c and def] • Replace the flag by a counter • Replace the flag=1 condition with ctr=1024 • Replace the flag=0 condition with ctr=0 • Increment ctr if ctr>0; reset when ctr reaches 1024 • One of the primary goals of this research is to enable efficient implementation of counter conditions (a counter sketch follows)
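A hedged counter-based sketch for the signature ab[^a]{1024}c. It tracks a single counting run at a time, which is a simplification of the general scheme; the constants and control flow are illustrative.

```python
# Match ab[^a]{1024}c with one counter instead of ~1024 DFA states.
# ctr == -1 means inactive; 0..LIMIT counts non-'a' chars since 'ab'.
LIMIT = 1024

def matches(data: str) -> bool:
    ctr, prev = -1, ""
    for ch in data:
        if ctr == LIMIT:
            if ch == "c":
                return True          # counter condition ctr == LIMIT met
            ctr = -1                 # wrong char after the run: give up
        elif ctr >= 0:
            ctr = -1 if ch == "a" else ctr + 1   # 'a' breaks [^a]{1024}
        if prev == "a" and ch == "b":
            ctr = 0                  # 'ab' seen: start counting
        prev = ch
    return False

print(matches("ab" + "x" * 1024 + "c"))   # True
```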

  27. Early Results

  28. Overview of the Presentation • Packet payload inspection • Previous work • D2FA and CD2FA • New ideas to implement regular expressions • Initial results • IP Lookup • Tries and pipelined tries • Previous work: CAMP • New direction: HEXA • Hashing used for packet header processing • Why do we need better hashing? • Previous work: Segmented Hash • New direction: Peacock Hashing • Packet buffering and queuing • Previous work: multichannel packet buffer, aggregated buffer • New direction: DRAM based buffer, NP based queuing assist

  29. IP Address Lookup • Routing tables at router input ports contain (prefix, next hop) pairs: 0* → 7, 1* → 5, 00* → 3, 01* → 5, 001* → 2, 011* → 3, 1011* → 4 • The address in the packet is compared to the stored prefixes, starting at the left • The prefix that matches the largest number of address bits is the desired match • The packet is forwarded to the specified next hop • Example address: 0110 0100 1000

  30. Address Lookup Using Tries [Figure: binary trie for the routing table above; the lookup for address 0110 0100 1000 ends at prefix 011*, next hop 3] • Prefixes are stored in “alphabetical order” in the tree • Prefixes are “spelled out” by following a path from the top • Green dots mark prefix ends • To find the best prefix, spell out the address in the tree • The last green dot marks the longest matching prefix
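A small sketch of trie-based longest-prefix matching using the routing table from slide 29; the dict-of-dicts trie is an illustrative software structure, not a hardware layout.

```python
# Longest-prefix match with a binary trie. Bits are consumed left to
# right; the last next-hop seen on the path is the best match.
ROUTES = {"0": 7, "1": 5, "00": 3, "01": 5, "001": 2, "011": 3, "1011": 4}

def build_trie(routes):
    root = {}
    for prefix, nexthop in routes.items():
        node = root
        for bit in prefix:
            node = node.setdefault(bit, {})
        node["nexthop"] = nexthop     # green dot: a prefix ends here
    return root

def lookup(root, address_bits):
    node, best = root, None
    for bit in address_bits:
        if bit not in node:
            break                     # can't spell the address any further
        node = node[bit]
        best = node.get("nexthop", best)
    return best

trie = build_trie(ROUTES)
print(lookup(trie, "011001001000"))   # address 0110 0100 1000 -> next hop 3
```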

  31. Pipelined Trie-based IP-lookup [Figure: leaf-pushed trie with prefixes P1–P7 (e.g. P4 = 10010*) mapped to pipeline stages] • Tree data structure, prefixes in leaves (leaf pushing) • Process the IP address level-by-level to find the longest match • Each level in a different stage → overlap multiple packets • Stages of different size: requires more memory, and the largest stage becomes the bottleneck

  32. Circular Pipeline, ANCS’06 • Use a circular pipeline and allow requests to enter/exit at any stage • Mapping: • Divide the trie into multiple sub-tries • Map each sub-trie with its root starting at a different stage

  33. Mapping in Circular Pipeline

  34. Circular Pipeline • Benefits: • Uniform stage sizes • Less memory: no over-provisioning is needed in the face of arbitrary trie shapes • Higher throughput

  35. New Direction: HEXA • HEXA (History-based Encoding, eXecution and Addressing) • Challenges the assumption that graph structures must store log2(n)-bit pointers to identify successor nodes • If the labels of the path leading to every node are unique, then these labels can be used to identify the node • In tries, every node has a unique path starting at the root node • Thus, the labels along the path become the identifier of the node • Note that these labels need not be explicitly stored

  36. Traditional Implementation • There are nine nodes; we will need 4-bit node identifiers • Total memory = 9 × 9 bits

  37. HEXA based Implementation • Define the HEXA identifier of a node as the path which leads to it from the root: 1. “-” (root), 2. “0”, 3. “1”, 4. “00”, 5. “01”, 6. “11”, 7. “010”, 8. “011”, 9. “0100” • Notice that these identifiers are unique • Thus, they can potentially be mapped to unique memory addresses

  38. HEXA based Implementation • Use hashing to map the HEXA identifier to a memory address • If we have a minimal perfect hash function f (a function that maps elements to unique locations), then we can store the trie as shown below • f(-) = 4, f(0) = 7, f(1) = 9, f(00) = 2, f(01) = 8, f(11) = 1, f(010) = 5, f(011) = 3, f(0100) = 6 • Here we use only 3 bits per node in the fast path
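A minimal sketch of the HEXA addressing idea; the minimal perfect hash f is faked here with a precomputed dict, whereas the real scheme derives it from hashing plus the discriminator bits introduced on the next slide.

```python
# A node's address is f(path label), so no child pointers are stored.
paths = ["-", "0", "1", "00", "01", "11", "010", "011", "0100"]
f = {p: i + 1 for i, p in enumerate(paths)}   # stand-in minimal perfect hash

def child_address(path, bit):
    """Address of the child reached on `bit`: derived, never stored."""
    child = bit if path == "-" else path + bit
    return f.get(child)           # None if the trie has no such node

print(child_address("-", "0"))    # node "0" lives at f("0") = 2
print(child_address("01", "0"))   # node "010" lives at f("010") = 7
```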

  39. Devising One-to-one Mapping • Finding a minimal perfect hash function is difficult • One-to-one mapping is essential for HEXA to work • Use discriminator bits: append c bits, which we are free to modify, to every HEXA identifier • Thus a node has 2^c choices of identifiers • Notice that we need to store these c bits, so more than just 3 bits per node are needed • With multiple choices of HEXA identifiers per node, we can reduce the problem to a bipartite graph matching problem • We need to find a perfect matching in the graph to map nodes to unique memory locations

  40. Devising One-to-one Mapping

  41. Initial Results • Our initial evaluation suggests that 2-bit discriminators are enough to find a perfect matching • Thus 2 bits per node are enough, instead of log2(n) bits

  42. Initial Results • Memory comparison to Eatherton’s trie • In the future: • Complete evaluation of HEXA based IP lookup: throughput, die size and power estimates • Extend HEXA to strings and finite automata

  43. Overview of the Presentation • Packet payload inspection • Previous work • D2FA and CD2FA • New ideas to implement regular expressions • Initial results • IP Lookup • Tries and pipelined tries • Previous work: CAMP • New direction: HEXA • Hashing used for packet header processing • Why do we need better hashing? • Previous work: Segmented Hash • New direction: Peacock Hashing • Packet buffering and queuing • Previous work: multichannel packet buffer, aggregated buffer • New direction: DRAM based buffer, NP based queuing assist

  44. Hash Tables [Figure: 10-bucket table (0–9) holding kiwi, banana, watermelon, apple, mango, cantaloupe, grapes and strawberry] • Suppose our hash function gave us the following values: hash("apple") = 5, hash("watermelon") = 3, hash("grapes") = 8, hash("cantaloupe") = 7, hash("kiwi") = 0, hash("strawberry") = 9, hash("mango") = 6, hash("banana") = 2 • hash("honeydew") = 6 • This is called a collision • Now what?

  45. Collision Resolution Policies • Linear Probing • Successively search for the first empty subsequent table entry (sketched below) • Linear Chaining • Link all collided entries at a bucket as a linked list • Double Hashing • Use a second hash function to successively index the table
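As a concrete illustration of the first policy, a hedged linear-probing sketch; the table size and Python's built-in hash are stand-ins.

```python
class LinearProbingTable:
    """Open addressing: on a collision, scan forward (wrapping around)
    for the first empty bucket."""
    def __init__(self, size=10):
        self.slots = [None] * size

    def insert(self, key):
        start = hash(key) % len(self.slots)
        for probe in range(len(self.slots)):
            j = (start + probe) % len(self.slots)
            if self.slots[j] is None or self.slots[j] == key:
                self.slots[j] = key
                return j, probe        # bucket used, probe distance
        raise RuntimeError("table full")

t = LinearProbingTable()
print(t.insert("apple"), t.insert("banana"))
```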

  46. Performance Analysis • Average performance is O(1) • However, worst-case performance is O(n) • In fact, the likelihood that a key is at a distance > 1 is pretty high • Keys at distance 2 take twice the time to be probed; keys at distance 3 take thrice the time • So there is a pretty high probability that throughput is two or three times lower than the peak throughput

  47. Segmented Hashing, ANCS’05 [Figure: a 4-way segmented hash table; key ki is hashed to one candidate bucket in each segment] • Uses the power of multiple choices • This has been proposed earlier by Azar et al. • An N-way segmented hash • Logically divides the hash table array into N equal segments • Maps each incoming key onto one bucket in each segment • Picks the bucket which is either empty or has the minimum number of keys (see the sketch below)
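A software sketch of an N-way segmented hash insert and lookup; the segment count, bucket counts, and hash mixing are illustrative assumptions.

```python
class SegmentedHash:
    """N-way segmented hash: one candidate bucket per segment, insert
    into the least-loaded one (power of multiple choices)."""
    def __init__(self, num_segments=4, buckets_per_segment=64):
        self.segments = [[[] for _ in range(buckets_per_segment)]
                         for _ in range(num_segments)]

    def _bucket(self, seg, key):
        return hash((seg, key)) % len(self.segments[seg])

    def insert(self, key):
        # One candidate bucket per segment; pick the emptiest.
        choices = [(len(self.segments[s][self._bucket(s, key)]), s)
                   for s in range(len(self.segments))]
        _, seg = min(choices)
        self.segments[seg][self._bucket(seg, key)].append(key)

    def lookup(self, key):
        # Probe the key's candidate bucket in every segment.
        return any(key in self.segments[s][self._bucket(s, key)]
                   for s in range(len(self.segments)))

t = SegmentedHash()
t.insert("apple")
print(t.lookup("apple"), t.lookup("kiwi"))   # True False
```

Lookups probe one candidate bucket per segment, so the worst case is bounded by the deepest bucket rather than a long chain.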

  48. Segmented Hash Performance • More segments improve the probabilistic performance • With 64 segments, the probability that a key is inserted at a distance > 2 is nearly zero, even at 100% load • The improvement in average-case performance is still modest

  49. Adding per Segment Filters [Figure: key ki hashes to one candidate bucket in each of 3 segments; a small b-bit filter sits in front of each segment] • We can select any of the candidate segments and insert the key into the corresponding filter

  50. Selective Filter Insertion Algorithm [Figure: key ki can go to any of 3 candidate buckets] • Insert the key into segment 4, since fewer bits are set there • Fewer bits set => lower false positive rate • With more segments (or more choices), our algorithm sets far fewer bits in the Bloom filter (sketched below)
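A hedged sketch of selective filter insertion: among the candidate segments, pick the one whose Bloom filter would gain the fewest newly set bits. The filter size and k = 2 hash functions are illustrative.

```python
FILTER_BITS, K = 64, 2            # per-segment filter size, hash count

def filter_positions(key):
    return [hash((i, key)) % FILTER_BITS for i in range(K)]

def choose_segment(filters, key):
    """Pick the segment whose filter would gain the fewest new bits
    (fewer bits set => lower false-positive rate)."""
    def new_bits(f):
        return sum(1 for pos in filter_positions(key) if not f[pos])
    return min(range(len(filters)), key=lambda s: new_bits(filters[s]))

def insert(filters, key):
    s = choose_segment(filters, key)
    for pos in filter_positions(key):
        filters[s][pos] = True    # mark the key's bits in the chosen filter
    return s

filters = [[False] * FILTER_BITS for _ in range(4)]
print(insert(filters, "apple"))   # index of the chosen segment
```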
