
Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms






Presentation Transcript


  1. Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms Sailesh Kumar Advisors: Jon Turner, Patrick Crowley Committee: Roger Chamberlain, John Lockwood, Bob Morley

  2. Focus on 3 Network Features • In this proposal, we focus on 3 network features: • Packet payload inspection • Network security • Packet header processing • Packet forwarding, classification, etc. • Packet buffering and queuing • QoS

  3. Overview of the Presentation • Packet payload inspection • Previous work • D2FA and CD2FA • New ideas to implement regular expressions • Initial results • IP Lookup • Tries and pipelined tries • Previous work: CAMP • New direction: HEXA • Hashing used for packet header processing • Why do we need better hashing? • Previous work: Segmented Hash • New direction: Peacock Hashing • Packet buffering and queuing • Previous work: multichannel packet buffer, aggregated buffer • New direction: DRAM based buffer, NP based queuing assist

  4. Delayed Input DFA (D2FA), SIGCOMM’06 [Figure: example DFA with states 1–5 and transitions on a, b, c, d] • Many transitions in a DFA • 256 transitions per state • 50+ distinct transitions per state (real-world datasets) • Need 50+ words per state • Can we reduce the number of transitions in a DFA? • Three rules: a+, b+c, c*d+ • 4 transitions per state • Look at state pairs: there are many common transitions. How can we remove them?

  5. Delayed Input DFA (D2FA), SIGCOMM’06 [Figure: the same DFA next to an alternative representation with default transitions] • Many transitions in a DFA • 256 transitions per state • 50+ distinct transitions per state (real-world datasets) • Need 50+ words per state • Can we reduce the number of transitions in a DFA? • Alternative representation: three rules a+, b+c, c*d+ • 4 transitions per state • Fewer transitions, less memory

  6. D2FA Operation [Figure: the DFA and the corresponding D2FA side by side] • Heavy edges are called default transitions • Take a default transition whenever a labeled transition is missing

  7. D2FA versus DFA • D2FAs are compact but require multiple memory accesses • Up to 20x more memory accesses • Not desirable in an off-chip architecture • Can D2FAs match the performance of DFAs? • YES!!!! • Content Addressed D2FAs (CD2FA) • CD2FAs require only one memory access per byte • They match the performance of a DFA in a cacheless system • In systems with a data cache, CD2FAs are 2-3x faster • CD2FAs are 10x more compact than DFAs

  8. Introduction to CD2FA, ANCS’06 [Figure: default-path chain V → U → R, with content labels ab,cd,R on V and cd,R on U] • How do we avoid the multiple memory accesses of D2FAs? • Avoid the lookup that decides whether the default path needs to be taken • Avoid default path traversal • Solution: assign a label to each state; labels contain: • the characters for which it has labeled transitions • information about all of its default states • the characters for which its default states have labeled transitions • Content labels: find node R at location R; find node U at hash(c,d,R); find node V at hash(a,b,hash(c,d,R))

  9. Introduction to CD2FA [Figure: two default-path chains, V → U → R and X → Y → Z, with content labels] • Current state: V (label = ab,cd,R), located at hash(a,b,hash(c,d,R)) • On the input char, the machine moves to X (label = pq,lm,Z), located at hash(p,q,hash(l,m,Z)) • Everything needed to compute the next address is contained in the current state’s label and the input
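To make the content-addressing concrete, here is a minimal sketch. It is an illustration, not the paper's implementation: Python's built-in hash, the table size, and the fixed root location are all stand-in assumptions.

```python
# Minimal sketch of content-based addressing (assumptions: Python's
# hash() stands in for the real hash function; TABLE_SIZE and the root
# location are arbitrary). A state's memory location is computed from
# its content label, so a parent never stores an explicit pointer.
TABLE_SIZE = 1 << 20

def content_address(labeled_chars, default_location):
    """Location of a state = hash of its labeled characters plus the
    location of its default state (its 'content label')."""
    return hash((tuple(sorted(labeled_chars)), default_location)) % TABLE_SIZE

R = 12345                               # root state R at a fixed location
U = content_address(["c", "d"], R)      # find node U at hash(c,d,R)
V = content_address(["a", "b"], U)      # find node V at hash(a,b,hash(c,d,R))
print(U, V)
```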

  10. Construction of CD2FA • We seek to keep the content labels small • Twin objectives: • Ensure that states have few labeled transitions • Ensure that default paths are as short as possible • We propose a new heuristic called CRO to construct CD2FAs • Details in the ANCS’06 paper • With a default path bound of 2 edges, the CRO algorithm constructs up to 10x more space-efficient CD2FAs

  11. Memory Mapping in CD2FA [Figure: states mapped to memory via hash(c,d,R), hash(a,b,hash(c,d,R)) and hash(p,q,hash(l,m,Z)); two labels collide] • WE HAVE ASSUMED THAT HASHING IS COLLISION FREE • In practice, two content labels can hash to the same memory location: COLLISION

  12. Collision-free Memory Mapping [Figure: four states with labels (abc, …), (pqr, …), (lmn, …), (def, …) and 4 memory locations; each state hashes to multiple candidate locations, e.g. hash(lmn, …) and hash(mln, …)] • Add edges for all possible choices

  13. Bipartite Graph Matching • Bipartite graph: • Left nodes are state content labels • Right nodes are memory locations • An edge for every choice of content label • Map state labels to unique memory locations • This is a perfect matching problem • With n left and n right nodes, we need O(log n) random edges • n = 1M implies we need ~20 edges per node • If we provide slight memory over-provisioning, we can uniquely map state labels with far fewer edges • In our experiments, we found perfect matchings without memory over-provisioning (see the sketch below)
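The matching step can be illustrated with the standard augmenting-path algorithm. The sketch below is hedged: the random candidate locations, the 20 choices per label, and the label names are assumptions for illustration, not the construction used in the paper.

```python
import random

def find_perfect_matching(labels, num_slots, choices_per_label=20, seed=1):
    """Map each label to a unique memory slot via augmenting paths."""
    rng = random.Random(seed)
    # Each label gets a few random candidate slots (its "edges").
    candidates = {lab: rng.sample(range(num_slots), choices_per_label)
                  for lab in labels}
    slot_owner = {}   # slot -> label currently assigned to it

    def try_assign(lab, visited):
        for slot in candidates[lab]:
            if slot in visited:
                continue
            visited.add(slot)
            # Take a free slot, or evict the owner if it can move elsewhere.
            if slot not in slot_owner or try_assign(slot_owner[slot], visited):
                slot_owner[slot] = lab
                return True
        return False

    for lab in labels:
        if not try_assign(lab, set()):
            return None   # no perfect matching with these random edges
    return {lab: slot for slot, lab in slot_owner.items()}

print(find_perfect_matching([f"label{i}" for i in range(100)], 100) is not None)
```

If the function returns None, one would retry with fresh random edges or slightly more slots, mirroring the over-provisioning point above.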

  14. Reg-ex – New Directions • Three key problems with traditional DFA-based reg-ex matching • 1. They employ the complete signature to parse input data • Even if normal data matches only a small prefix portion • Full signature => large DFA • 2. Only one active state of execution and no memory about previous matches • Combinations of partial matches require new DFA states • 3. Inability to count certain sub-expressions • E.g. a{1024} will require 1024 DFA states • We aim to address each of these problems in the proposed research

  15. Addressing the First Problem • Divide the processing into fast and slow paths • Split the signature into a prefix and a suffix • Employ signature prefixes in the fast path • Upon a match in the fast path, trigger the slow path • Appropriate splitting can maintain a low triggering rate • Benefits: • Fast path can employ a composite DFA for all prefixes • Since the prefixes are small, the composite DFA will remain small • Higher parsing rate • Slow path uses a separate DFA for each signature • No state explosion in the slow path • Due to the low triggering rate, the slow path will not become a bottleneck • Reduces per-flow state • Fast path uses a composite DFA, one active state per flow

  16. Fast and Slow Path Processing • Here we assume that an ε fraction of the flows is diverted to the slow path • Fast path stores a per-flow DFA state • Slow path may store multiple active states

  17. Splitting Reg-exes • Splitting can be performed based upon data traces • Assign probabilities to NFA states and make the cut so that the slow path's cumulative probability is low • r1 = .*[gh]d[^g]*ge • r2 = .*fag[^i]*i[^j]*j • r3 = .*a[gh]i[^l]*[ae]c • Cumulative probability of slow path = 0.05

  18. Splitting Reg-exes • Fast path will contain a composite DFA over the prefixes (14 states): p1 = .*[gh]d[^g]*g, p2 = .*fa, p3 = .*a[gh]i • Full rules: r1 = .*[gh]d[^g]*ge, r2 = .*fag[^i]*i[^j]*j, r3 = .*a[gh]i[^l]*[ae]c • Notice the start state • Slow path will comprise three separate DFAs, one for each signature (see the sketch below)
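A rough sketch of how such split signatures could drive a fast and a slow path, using Python's re module over a whole buffer. This simplifies the per-byte streaming DFA execution described above, and the leading .* is dropped because re.search already scans the buffer.

```python
import re

# Prefixes (fast path) and full rules (slow path) from the slide.
prefixes   = [re.compile(p) for p in (r"[gh]d[^g]*g", r"fa", r"a[gh]i")]
full_rules = [re.compile(r) for r in
              (r"[gh]d[^g]*ge", r"fag[^i]*i[^j]*j", r"a[gh]i[^l]*[ae]c")]

def inspect(data: str):
    # Fast path: one composite scan over the short prefixes.
    if not any(p.search(data) for p in prefixes):
        return []                     # common case: slow path never runs
    # Slow path: run each full signature separately (no state explosion;
    # acceptable because only a small fraction of flows get here).
    return [i for i, r in enumerate(full_rules) if r.search(data)]

print(inspect("xxfag___i___j"))       # matches r2 -> [1]
```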

  19. Protection against DoS Attacks • An attacker can attack such a system by sending data that match the prefixes more often than provisioned • The slow path will become the bottleneck • Solution: look at the history and determine whether a flow is an attack flow • Compute an anomaly index: a weighted moving average of the number of times a flow has triggered the slow path • If a flow has a high anomaly index, send it to a low-rate queue (sketched below)
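A minimal sketch of the anomaly index as an exponentially weighted moving average; the weight alpha and the demotion threshold are illustrative assumptions, not values from the proposal.

```python
class FlowState:
    """Anomaly index = EWMA of slow-path triggers for one flow."""
    def __init__(self, alpha=0.1, threshold=0.5):
        self.alpha, self.threshold = alpha, threshold
        self.anomaly_index = 0.0

    def observe(self, triggered_slow_path):
        """Update the index; True means demote to the low-rate queue."""
        x = 1.0 if triggered_slow_path else 0.0
        self.anomaly_index = (1 - self.alpha) * self.anomaly_index + self.alpha * x
        return self.anomaly_index > self.threshold

f = FlowState()
for _ in range(20):
    f.observe(True)                # a flow that keeps triggering the slow path
print(round(f.anomaly_index, 3))   # index climbs toward 1.0
```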

  20. Initial Simulation Results

  21. Addressing the Second Problem • NFA: compact, but O(n) active states • DFA: 1 active state, but state explosion • How do we avoid state explosion while also keeping the per-flow active state information small? • We propose a novel machine called a History-based Finite Automaton, or H-FA • Augment a DFA with a history buffer • Transitions are taken after looking at the history buffer contents • During certain transitions, items are inserted into/removed from the history buffer • Claim: a small history buffer is sufficient to avoid state explosion while keeping a single active state

  22. Example of H-FA Construction [Figure: DFA] • NFA state 2 is present in 4 DFA states • If we remove NFA state 2 from these DFA states, we will have just 6 states

  23. [Figures: DFA and H-FA] • NFA state 2 is present in 4 DFA states • If we remove NFA state 2 from these DFA states, we will have just 6 states • This new machine uses a history flag, in addition to its transitions, to make moves

  24. H-FA • Input data = c d a b c • Execution trace: (0) -c→ (0) -d→ (0,4) -a→ (0,1) -b→ (0) -c→ (0,3) • The flag is set and reset along the marked transitions: some moves are taken because the flag is set, others because the flag is reset • This new machine uses a history flag, in addition to its transitions, to make moves

  25. H-FA • In general, if we maintain a flag for each NFA state that represents a Kleene closure, we can avoid any state explosion • k closures will require at most k bits in the history buffer • There are challenges associated with the efficient implementation of conditional transitions (see the execution sketch below) • We plan to work on these in the proposed research
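A minimal execution sketch of an H-FA under stated assumptions: a single history flag rather than a general history buffer, and a hypothetical toy transition table. Conditional transitions are encoded as (state, char, required flag value), with None meaning unconditional.

```python
# Minimal H-FA execution sketch (assumptions: one history flag and an
# illustrative transition table). delta maps
# (state, char, required_flag_or_None) -> (next_state, flag_action).
class HFA:
    def __init__(self, delta, start):
        self.delta, self.start = delta, start

    def run(self, data):
        state, flag = self.start, False
        for ch in data:
            # A transition conditioned on the current flag value wins;
            # otherwise fall back to an unconditional transition.
            key = (state, ch, flag)
            if key not in self.delta:
                key = (state, ch, None)
            state, action = self.delta.get(key, (state, None))
            if action == "set":
                flag = True               # e.g. a Kleene closure entered
            elif action == "reset":
                flag = False              # e.g. the closure broke off
        return state, flag

# Toy machine: the flag records that 'a' was seen; 'c' accepts only if set.
delta = {(0, "a", None): (0, "set"),
         (0, "c", True): (1, "reset"),    # taken because flag is set
         (0, "c", False): (0, None)}      # taken because flag is reset
print(HFA(delta, 0).run("bac"))           # -> (1, False)
```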

  26. Addressing the Third Problem [Figure: automaton for the signatures ab[^a]{1024}c and def] • Replace the flag by a counter • Replace the flag=1 condition with ctr=1024 • Replace the flag=0 condition with ctr=0 • Increment ctr if ctr>0; reset when ctr reaches 1024 • One of the primary goals of this research is to enable efficient implementation of counter conditions (a counter sketch follows)
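A hedged counter-based sketch for the signature ab[^a]{1024}c. It tracks a single counting run at a time, which is a simplification of the general scheme; the constants and control flow are illustrative.

```python
# Match ab[^a]{1024}c with one counter instead of ~1024 DFA states.
# ctr == -1 means inactive; 0..LIMIT counts non-'a' chars since 'ab'.
LIMIT = 1024

def matches(data: str) -> bool:
    ctr, prev = -1, ""
    for ch in data:
        if ctr == LIMIT:
            if ch == "c":
                return True          # counter condition ctr == LIMIT met
            ctr = -1                 # wrong char after the run: give up
        elif ctr >= 0:
            ctr = -1 if ch == "a" else ctr + 1   # 'a' breaks [^a]{1024}
        if prev == "a" and ch == "b":
            ctr = 0                  # 'ab' seen: start counting
        prev = ch
    return False

print(matches("ab" + "x" * 1024 + "c"))   # True
```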

  27. Early Results

  28. Overview of the Presentation • Packet payload inspection • Previous work • D2FA and CD2FA • New ideas to implement regular expressions • Initial results • IP Lookup • Tries and pipelined tries • Previous work: CAMP • New direction: HEXA • Hashing used for packet header processing • Why do we need better hashing? • Previous work: Segmented Hash • New direction: Peacock Hashing • Packet buffering and queuing • Previous work: multichannel packet buffer, aggregated buffer • New direction: DRAM based buffer, NP based queuing assist

  29. IP Address Lookup • Routing tables at router input ports contain (prefix, next hop) pairs: 0* → 7, 1* → 5, 00* → 3, 01* → 5, 001* → 2, 011* → 3, 1011* → 4 • The address in the packet is compared to the stored prefixes, starting at the left • The prefix that matches the largest number of address bits is the desired match • The packet is forwarded to the specified next hop • Example address: 0110 0100 1000

  30. Address Lookup Using Tries [Figure: binary trie for the routing table above; the lookup for address 0110 0100 1000 ends at prefix 011*, next hop 3] • Prefixes are stored in “alphabetical order” in the tree • Prefixes are “spelled out” by following a path from the top • Green dots mark prefix ends • To find the best prefix, spell out the address in the tree • The last green dot marks the longest matching prefix
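A small sketch of trie-based longest-prefix matching using the routing table from slide 29; the dict-of-dicts trie is an illustrative software structure, not a hardware layout.

```python
# Longest-prefix match with a binary trie. Bits are consumed left to
# right; the last next-hop seen on the path is the best match.
ROUTES = {"0": 7, "1": 5, "00": 3, "01": 5, "001": 2, "011": 3, "1011": 4}

def build_trie(routes):
    root = {}
    for prefix, nexthop in routes.items():
        node = root
        for bit in prefix:
            node = node.setdefault(bit, {})
        node["nexthop"] = nexthop     # green dot: a prefix ends here
    return root

def lookup(root, address_bits):
    node, best = root, None
    for bit in address_bits:
        if bit not in node:
            break                     # can't spell the address any further
        node = node[bit]
        best = node.get("nexthop", best)
    return best

trie = build_trie(ROUTES)
print(lookup(trie, "011001001000"))   # address 0110 0100 1000 -> next hop 3
```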

  31. Pipelined Trie-based IP-lookup [Figure: leaf-pushed trie with prefixes P1–P7 (e.g. P4 = 10010*) mapped to pipeline stages] • Tree data structure, prefixes in leaves (leaf pushing) • Process the IP address level-by-level to find the longest match • Each level in a different stage → overlap multiple packets • Stages of different size: requires more memory, and the largest stage becomes the bottleneck

  32. Circular Pipeline, ANCS’06 • Use a circular pipeline and allow requests to enter/exit at any stage • Mapping: • Divide the trie into multiple sub-tries • Map each sub-trie with its root starting at a different stage

  33. Mapping in Circular Pipeline

  34. Circular Pipeline • Benefits: • Uniform stage sizes • Less memory: no over-provisioning is needed in the face of arbitrary trie shapes • Higher throughput

  35. New Direction: HEXA • HEXA (History-based Encoding, eXecution and Addressing) • Challenges the assumption that graph structures must store log2(n)-bit pointers to identify successor nodes • If the labels of the path leading to every node are unique, then these labels can be used to identify the node • In tries, every node has a unique path starting at the root node • Thus, the labels along the path become the identifier of the node • Note that these labels need not be explicitly stored

  36. Traditional Implementation • There are nine nodes; we will need 4-bit node identifiers • Total memory = 9 × 9 bits

  37. HEXA based Implementation • Define the HEXA identifier of a node as the path which leads to it from the root: 1. “-” (root), 2. “0”, 3. “1”, 4. “00”, 5. “01”, 6. “11”, 7. “010”, 8. “011”, 9. “0100” • Notice that these identifiers are unique • Thus, they can potentially be mapped to unique memory addresses

  38. HEXA based Implementation • Use hashing to map the HEXA identifier to a memory address • If we have a minimal perfect hash function f (a function that maps elements to unique locations), then we can store the trie as shown below • f(-) = 4, f(0) = 7, f(1) = 9, f(00) = 2, f(01) = 8, f(11) = 1, f(010) = 5, f(011) = 3, f(0100) = 6 • Here we use only 3 bits per node in the fast path
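A minimal sketch of the HEXA addressing idea; the minimal perfect hash f is faked here with a precomputed dict, whereas the real scheme derives it from hashing plus the discriminator bits introduced on the next slide.

```python
# A node's address is f(path label), so no child pointers are stored.
paths = ["-", "0", "1", "00", "01", "11", "010", "011", "0100"]
f = {p: i + 1 for i, p in enumerate(paths)}   # stand-in minimal perfect hash

def child_address(path, bit):
    """Address of the child reached on `bit`: derived, never stored."""
    child = bit if path == "-" else path + bit
    return f.get(child)           # None if the trie has no such node

print(child_address("-", "0"))    # node "0" lives at f("0") = 2
print(child_address("01", "0"))   # node "010" lives at f("010") = 7
```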

  39. Devising One-to-one Mapping • Finding a minimal perfect hash function is difficult • One-to-one mapping is essential for HEXA to work • Use discriminator bits: append c bits, which we are free to modify, to every HEXA identifier • Thus a node has 2^c choices of identifiers • Notice that we need to store these c bits, so more than just 3 bits per node are needed • With multiple choices of HEXA identifiers per node, we can reduce the problem to a bipartite graph matching problem • We need to find a perfect matching in the graph to map nodes to unique memory locations

  40. Devising One-to-one Mapping

  41. Initial Results • Our initial evaluation suggests that 2-bit discriminators are enough to find a perfect matching • Thus 2 bits per node are enough, instead of log2(n) bits

  42. Initial Results • Memory comparison to Eatherton’s trie • In the future: • Complete evaluation of HEXA based IP lookup: throughput, die size and power estimates • Extend HEXA to strings and finite automata

  43. Overview of the Presentation • Packet payload inspection • Previous work • D2FA and CD2FA • New ideas to implement regular expressions • Initial results • IP Lookup • Tries and pipelined tries • Previous work: CAMP • New direction: HEXA • Hashing used for packet header processing • Why do we need better hashing? • Previous work: Segmented Hash • New direction: Peacock Hashing • Packet buffering and queuing • Previous work: multichannel packet buffer, aggregated buffer • New direction: DRAM based buffer, NP based queuing assist

  44. Hash Tables [Figure: 10-bucket table (0–9) holding kiwi, banana, watermelon, apple, mango, cantaloupe, grapes and strawberry] • Suppose our hash function gave us the following values: hash("apple") = 5, hash("watermelon") = 3, hash("grapes") = 8, hash("cantaloupe") = 7, hash("kiwi") = 0, hash("strawberry") = 9, hash("mango") = 6, hash("banana") = 2 • hash("honeydew") = 6 • This is called a collision • Now what?

  45. Collision Resolution Policies • Linear Probing • Successively search for the first empty subsequent table entry (sketched below) • Linear Chaining • Link all collided entries at a bucket as a linked list • Double Hashing • Use a second hash function to successively index the table
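As a concrete illustration of the first policy, a hedged linear-probing sketch; the table size and Python's built-in hash are stand-ins.

```python
class LinearProbingTable:
    """Open addressing: on a collision, scan forward (wrapping around)
    for the first empty bucket."""
    def __init__(self, size=10):
        self.slots = [None] * size

    def insert(self, key):
        start = hash(key) % len(self.slots)
        for probe in range(len(self.slots)):
            j = (start + probe) % len(self.slots)
            if self.slots[j] is None or self.slots[j] == key:
                self.slots[j] = key
                return j, probe        # bucket used, probe distance
        raise RuntimeError("table full")

t = LinearProbingTable()
print(t.insert("apple"), t.insert("banana"))
```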

  46. Performance Analysis • Average performance is O(1) • However, worst-case performance is O(n) • In fact, the likelihood that a key is at a distance > 1 is pretty high • Keys at distance 2 take twice the time to be probed; keys at distance 3 take thrice the time • So there is a pretty high probability that throughput is two or three times lower than the peak throughput

  47. Segmented Hashing, ANCS’05 [Figure: a 4-way segmented hash table; key ki is hashed to one candidate bucket in each segment] • Uses the power of multiple choices • This has been proposed earlier by Azar et al. • An N-way segmented hash • Logically divides the hash table array into N equal segments • Maps each incoming key onto one bucket in each segment • Picks the bucket which is either empty or has the minimum number of keys (see the sketch below)
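A software sketch of an N-way segmented hash insert and lookup; the segment count, bucket counts, and hash mixing are illustrative assumptions.

```python
class SegmentedHash:
    """N-way segmented hash: one candidate bucket per segment, insert
    into the least-loaded one (power of multiple choices)."""
    def __init__(self, num_segments=4, buckets_per_segment=64):
        self.segments = [[[] for _ in range(buckets_per_segment)]
                         for _ in range(num_segments)]

    def _bucket(self, seg, key):
        return hash((seg, key)) % len(self.segments[seg])

    def insert(self, key):
        # One candidate bucket per segment; pick the emptiest.
        choices = [(len(self.segments[s][self._bucket(s, key)]), s)
                   for s in range(len(self.segments))]
        _, seg = min(choices)
        self.segments[seg][self._bucket(seg, key)].append(key)

    def lookup(self, key):
        # Probe the key's candidate bucket in every segment.
        return any(key in self.segments[s][self._bucket(s, key)]
                   for s in range(len(self.segments)))

t = SegmentedHash()
t.insert("apple")
print(t.lookup("apple"), t.lookup("kiwi"))   # True False
```

Lookups probe one candidate bucket per segment, so the worst case is bounded by the deepest bucket rather than a long chain.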

  48. Segmented Hash Performance • More segments improve the probabilistic performance • With 64 segments, the probability that a key is inserted at a distance > 2 is nearly zero, even at 100% load • The improvement in average-case performance is still modest

  49. Adding per Segment Filters [Figure: key ki hashes to one candidate bucket in each of 3 segments; a small b-bit filter sits in front of each segment] • We can select any of the candidate segments and insert the key into the corresponding filter

  50. Selective Filter Insertion Algorithm [Figure: key ki can go to any of 3 candidate buckets] • Insert the key into segment 4, since fewer bits are set there • Fewer bits set => lower false positive rate • With more segments (or more choices), our algorithm sets far fewer bits in the Bloom filter (sketched below)
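A hedged sketch of selective filter insertion: among the candidate segments, pick the one whose Bloom filter would gain the fewest newly set bits. The filter size and k = 2 hash functions are illustrative.

```python
FILTER_BITS, K = 64, 2            # per-segment filter size, hash count

def filter_positions(key):
    return [hash((i, key)) % FILTER_BITS for i in range(K)]

def choose_segment(filters, key):
    """Pick the segment whose filter would gain the fewest new bits
    (fewer bits set => lower false-positive rate)."""
    def new_bits(f):
        return sum(1 for pos in filter_positions(key) if not f[pos])
    return min(range(len(filters)), key=lambda s: new_bits(filters[s]))

def insert(filters, key):
    s = choose_segment(filters, key)
    for pos in filter_positions(key):
        filters[s][pos] = True    # mark the key's bits in the chosen filter
    return s

filters = [[False] * FILTER_BITS for _ in range(4)]
print(insert(filters, "apple"))   # index of the chosen segment
```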
