
Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection






Presentation Transcript


  1. Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection Sailesh Kumar Sarang Dharmapurikar Fang Yu Patrick Crowley Jonathan Turner Presented by: Sailesh Kumar

  2. Overview • Why is regular expression acceleration important? • Introduction to our approach • Delayed Input DFA (D2FA) • D2FA construction • Simulation results • Memory mapping algorithm • Conclusion

  3. Why Regular Expression Acceleration? • RegExes are now widely used • Network intrusion detection systems (NIDS) • Layer 7 switches, load balancing • Firewalls, filtering, authentication and monitoring • Content-based traffic management and routing • RegEx matching is expensive • Space: large amount of memory • Bandwidth: requires 1+ state traversals per byte • RegEx matching is a performance bottleneck • In enterprise switches from Cisco and others • Cisco security appliances use DFAs with 1+ GB of memory, yet still achieve sub-gigabit throughput • We need to accelerate RegEx matching!

  4. Can We Do Better? • Well studied in the compiler literature • What's different in networking? Can we do better? • Construction time versus execution time (grep) • Traditionally, (construction + execution) time is the metric • In the networking context, execution time is critical • Also, there may be thousands of patterns • DFAs are fast • But they can have an exponentially large number of states • Algorithms exist to minimize the number of states • Still: 1) low performance and 2) gigabytes of memory • How to achieve high performance? • Use an ASIC/FPGA • On-chip memories provide ample bandwidth • Volume and the need for speed justify a custom solution • Limited memory, so we need a space-efficient representation!

  5. Introduction to Our Approach [Figure: example five-state DFA over {a, b, c, d} for the three rules a+, b+c, c*d+] • How can we represent DFAs more compactly? • We can't reduce the number of states • How about reducing the number of transitions? • 256 transitions per state • 50+ distinct transitions per state in real-world datasets • So at least 50+ words per state are needed • Look at pairs of states: there are many common transitions • How can we remove them?

  6. Introduction to Our Approach [Figure: the example DFA for a+, b+c, c*d+ alongside an alternative representation with default transitions, leaving 4 transitions per state] • How can we represent DFAs more compactly? • We can't reduce the number of states • How about reducing the number of transitions? • 256 transitions per state • 50+ distinct transitions per state in real-world datasets • The alternative representation needs only 4 transitions per state • Fewer transitions, less memory

  7. D2FA Operation [Figure: the DFA and the corresponding D2FA side by side; heavy edges in the D2FA are default transitions] • Heavy edges are called default transitions • Take the default transition whenever a labeled transition is missing • On the input stream "a b d", the DFA and the D2FA visit the same accepting state after consuming each character
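The lookup rule on this slide can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: the encoding (a `labeled` dict of explicitly stored transitions per state and a `default` dict giving each state's default transition, `None` at a tree root) is an assumption made here for illustration.

```python
# Sketch of D2FA matching. Encoding (assumed for this example):
#   labeled[s]: dict mapping a state's explicitly stored characters to next states
#   default[s]: the state's default transition target, or None at a tree root

def d2fa_step(labeled, default, state, ch):
    """Consume one character, following default transitions as needed."""
    while ch not in labeled[state]:
        state = default[state]      # unlabeled edge: no input is consumed
    return labeled[state][ch]

def d2fa_match(labeled, default, start, text):
    state = start
    for ch in text:
        state = d2fa_step(labeled, default, state, ch)
    return state

# Toy example: state 1 stores only 'a' and defaults to state 0,
# so reading 'b' in state 1 resolves through state 0's labeled edge.
labeled = {0: {'a': 1, 'b': 2}, 1: {'a': 1}, 2: {}}
default = {0: None, 1: 0, 2: 0}
print(d2fa_match(labeled, default, 0, "aab"))  # -> 2
```

Note how the `while` loop is exactly the source of the space-time trade-off discussed on the next slides: each default transition followed is an extra memory access for the same input character.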

  8. D2FA Operation [Figure: two alternative sets of default transition trees for the same example DFA] • The two alternative sets of default transition trees above are also correct • However, we may traverse 2 default transitions to consume a single character • Thus we do more work => lower performance • Any set of default transitions suffices as long as there are no cycles of default transitions • Thus, we construct trees of default transitions • So, how do we construct space-efficient D2FAs while keeping default paths bounded?

  9. D2FA Construction [Figure: the example DFA and its space reduction graph, with edge weights 2 and 3] • We present a systematic approach to construct a D2FA • Begin with a state-minimized DFA • Construct the space reduction graph • An undirected graph whose vertices are the states of the DFA • Edges exist between vertices that share common transitions • Weight of an edge = (# of common transitions) - 1
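The space reduction graph defined above can be sketched directly from the DFA's transition tables. A minimal sketch, assuming the DFA is given as a dict `dfa[s]` mapping each character to the next state (a representation chosen here for illustration, not taken from the paper):

```python
from itertools import combinations

def space_reduction_graph(dfa):
    """Return edges (weight, u, v) of the space reduction graph.

    Two states get an edge if they agree on the next state for some
    characters; the weight is (# of common transitions) - 1, since one
    default transition replaces all common transitions but itself costs
    one stored transition.
    """
    edges = []
    for u, v in combinations(dfa, 2):
        common = sum(1 for c in dfa[u] if dfa[u][c] == dfa[v].get(c))
        if common >= 2:              # weight must be positive to help
            edges.append((common - 1, u, v))
    return edges

# Two states agreeing on 'a' and 'b' yield an edge of weight 1.
print(space_reduction_graph(
    {1: {'a': 2, 'b': 3}, 2: {'a': 2, 'b': 3}, 3: {'a': 2}}))  # -> [(1, 1, 2)]
```

The `common >= 2` guard encodes the "- 1" in the weight definition: replacing a single shared transition with a default transition saves nothing.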

  10. D2FA Construction [Figure: a maximum weight spanning tree rooted in the space reduction graph; # of transitions removed = 2 + 3 + 3 + 3 = 11] • Convert certain edges into default transitions • A default transition removes w transitions (w = weight of the edge) • If we pick high-weight edges => more space reduction • Find a maximum weight spanning forest • Tree edges become the default transitions • Problem: the spanning tree may have a very large diameter • Longer default paths => lower performance
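The unbounded construction on this slide is classical Kruskal run for maximum rather than minimum weight. A minimal sketch (the edge-list format `(weight, u, v)` is an assumption carried over for illustration):

```python
def max_weight_spanning_forest(states, edges):
    """Kruskal's algorithm on edges (weight, u, v), heaviest first.

    The union-find check rejects any edge that would close a cycle of
    default transitions, so the result is a forest; its total weight is
    the number of transitions removed from the DFA.
    """
    parent = {s: s for s in states}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

# Three states, three edges: the lightest edge would close a cycle.
print(max_weight_spanning_forest(
    [1, 2, 3], [(3, 1, 2), (2, 2, 3), (1, 1, 3)]))  # -> [(3, 1, 2), (2, 2, 3)]
```

Nothing here bounds the tree diameter, which is exactly the problem the next two slides address.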

  11. D2FA Construction • We need to construct bounded-diameter trees • This problem is NP-hard • A small diameter bound leads to low tree weight • Less space-efficient D2FA • A time-space trade-off • We propose a heuristic algorithm based upon Kruskal's algorithm to create compact, bounded-diameter D2FAs

  12. D2FA Construction • Our heuristic incrementally builds the spanning tree • Whenever there is an opportunity, it keeps the diameter small • Based upon Kruskal's algorithm • Details in the paper
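One simple way to realize the idea above can be sketched as a diameter-checked variant of Kruskal. This is an assumption-level approximation of the heuristic, not the paper's refined algorithm (which the slide defers to the paper): here we simply recompute the merged tree's diameter with a double BFS and reject edges that would exceed the bound.

```python
from collections import deque, defaultdict

def bounded_kruskal(states, edges, bound):
    """Grow a spanning forest from edges (weight, u, v), heaviest first,
    rejecting any edge whose addition would create a tree of diameter
    greater than `bound` (sketch: diameter via two BFS passes)."""
    adj = defaultdict(list)
    parent = {s: s for s in states}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def farthest(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            n = q.popleft()
            for m in adj[n]:
                if m not in dist:
                    dist[m] = dist[n] + 1
                    q.append(m)
        node = max(dist, key=dist.get)
        return node, dist[node]

    forest = []
    for w, u, v in sorted(edges, reverse=True):
        if find(u) != find(v):
            adj[u].append(v); adj[v].append(u)   # tentatively add the edge
            a, _ = farthest(u)                   # double BFS = tree diameter
            _, diam = farthest(a)
            if diam <= bound:
                parent[find(u)] = find(v)
                forest.append((w, u, v))
            else:
                adj[u].remove(v); adj[v].remove(u)
    return forest

# A 5-state path with bound 2: the middle edge is rejected, splitting
# the path into two short default transition trees.
print(bounded_kruskal([1, 2, 3, 4, 5],
                      [(1, 1, 2), (1, 2, 3), (1, 3, 4), (1, 4, 5)], 2))
```

The recomputation makes this O(V) per candidate edge; the paper's refinement keeps the necessary distance information incrementally instead.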

  13. Results • We ran experiments on • Cisco RegEx rules • Linux application protocol classifier rules • Bro rules • Snort rules (a subset of the rules) [Figure: size of DFA versus D2FA, with no default path length bound applied]

  14. Space-Time Tradeoff [Figure: memory size versus default path length bound] • Longer default paths => more work but less space • In the space-efficient region, default paths have length 4+ • This requires 4+ memory accesses per character • We propose a memory architecture that enables us to consume one character per clock cycle

  15. Summary of Memory Architecture • We propose an on-chip ASIC architecture • It uses multiple embedded memories to store the D2FA • Flexibility: frequent changes to rules are supported • A D2FA requires multiple memory accesses • How can we execute a D2FA at memory clock rates? • We have proposed a deterministic, contention-free memory mapping algorithm • Uniform access to memories • Enables the D2FA to consume a character per memory access • Nearly zero memory fragmentation • All memories are uniformly used • Details and results in the paper • At 300 MHz we achieve 5 Gbps worst-case throughput

  16. Conclusion • Deep packet inspection has become challenging • RegExes are used to specify rules • Wire-speed inspection is required • We presented an ASIC-based architecture to perform RegEx matching at tens of gigabits per second • As suggested in the public review, this paper is not the final answer to RegEx matching • But it is a good start • We are presently developing techniques to perform fast RegEx matching using commodity memories • Collaborators are welcome!

  17. Thank you and Questions?

  18. Backup Slides

  19. D2FA Construction • Our heuristic incrementally builds the spanning tree • Whenever there is an opportunity, it keeps the diameter small • Details in the paper [Figure: a graph with 31 states and its maximum weight default transition tree] • Our heuristic creates shorter default paths • Our refined Kruskal's algorithm: avg. default path = 5 edges • Plain Kruskal's algorithm: max. default path = 8 edges

  20. Multiple Memories • To achieve high performance, use multiple memories and multiple D2FA engines • Multiple memories provide high aggregate bandwidth • Multiple engines use that bandwidth effectively • However, worst-case performance may be low • No better than a single memory • Complex circuitry may be needed to handle contention • We propose a deterministic, contention-free memory mapping and compare it to a random mapping

  21. 1 3 1 3 2 2 1 1 1 4 2 3 3 3 3 3 3 3 2 4 3 3 4 2 2 2 4 4 4 1 1 4 Memory Mapping • The memory mapping algorithm can be modeled as a graph coloring • Graph is the set of default transition trees • Colors represent the memory modules • Color nodes of the trees such that • Nodes along a default path are colored with different colors • All colors are uniformly used • We propose two methods, naïve and adaptive Adaptive coloring Naïve coloring

  22. Results • Adaptive mapping leads to much more uniform color usage • Memories are uniformly used, with little fragmentation • Up to 20% space saving with adaptive coloring [Figure: throughput results with 300 MHz dual-port eSRAM]
