
Efficient Memory Utilization on Network Processors for Deep Packet Inspection






Presentation Transcript


  1. Efficient Memory Utilization on Network Processors for Deep Packet Inspection Piti Piyachon Yan Luo Electrical and Computer Engineering Department University of Massachusetts Lowell

  2. Our Contributions • Study the parallelism of a pattern matching algorithm • Propose the Bit-Byte Aho-Corasick Deterministic Finite Automaton (DFA) • Construct a memory model that finds the optimal settings to minimize the memory usage of the DFA U Mass Lowell

  3. DPI and Pattern Matching • Deep Packet Inspection • Inspects: packet header & payload • Detects: computer viruses, worms, spam, etc. • Network intrusion detection applications: Bro, Snort, etc. • Pattern Matching requirements • Matching multiple predefined patterns (keywords, or strings) at the same time • Keywords can be of any length. • Keywords can appear anywhere in the payload of a packet. • Matching at line speed • Flexibility to accommodate new rule sets

  4. Classical Aho-Corasick (AC) DFA: example 1 • A set of keywords: {he, her, him, his} • [Figure: the goto graph, with a start state and four accept states.] Failure edges back to state 1 are shown as dashed lines; failure edges back to state 0 are not shown.
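The goto/failure construction behind this slide can be sketched in a few lines (a textbook-style illustration, not the authors' implementation; state numbering may differ from the figure):

```python
from collections import deque

def build_ac(keywords):
    """Build an Aho-Corasick automaton: a goto trie, failure links,
    and the set of keywords accepted at each state."""
    goto = [{}]          # state -> {char: next state}
    output = [set()]     # state -> keywords that end here
    for kw in keywords:
        s = 0
        for ch in kw:
            if ch not in goto[s]:
                goto.append({})
                output.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].add(kw)
    fail = [0] * len(goto)
    q = deque(goto[0].values())      # depth-1 states fail to the root
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]          # follow failure links upward
            fail[t] = goto[f][ch] if ch in goto[f] else 0
            output[t] |= output[fail[t]]
    return goto, fail, output

def search(text, goto, fail, output):
    """Report (start offset, keyword) for every occurrence in text."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for kw in output[s]:
            hits.append((i - len(kw) + 1, kw))
    return hits

goto, fail, output = build_ac(["he", "her", "him", "his"])
print(len(goto))                              # 7 states for this keyword set
print(search("ushers", goto, fail, output))   # [(2, 'he'), (2, 'her')]
```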

  5. Memory Matrix Model of the AC DFA • Snort (Dec '05): 2733 keywords • 256 next-state pointers, each 15 bits wide • > 27,000 states • keyword-ID width = 2733 bits • 27538 states × (2733 + 256 × 15) bits ≈ 22 MB • 22 MB is too big for on-chip RAM
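The 22 MB figure follows directly from the numbers on this slide; a quick back-of-the-envelope check (the exact value is about 21.6 MB, rounded up on the slide):

```python
states   = 27538   # AC DFA states for the Dec '05 Snort keyword set
ptrs     = 256     # one next-state pointer per possible input byte
ptr_bits = 15      # enough bits to address 27538 states (2**15 = 32768)
kid_bits = 2733    # keyword-ID: one match bit per keyword

total_bits = states * (kid_bits + ptrs * ptr_bits)
print(total_bits / 8 / 2**20)   # ≈ 21.6 MB
```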

  6. Bit-AC DFA (Tan-Sherwood's Bit-Split) • Needs 8 bit-DFAs

  7. Memory Matrix of the Bit-AC DFA • Snort (Dec '05): 2733 keywords • 2 next-state pointers, each 9 bits wide • 361 states • keyword-ID width = 16 bits • 1368 DFAs • 1368 × 361 × (16 + 2 × 9) bits ≈ 2 MB

  8. Bit-AC DFA Techniques • Shrinking the keyword-ID width from 2733 to 16 bits, by dividing the 2733 keywords into 171 subsets of 16 keywords each • Reducing the next-state pointers from 256 to 2, by splitting each input byte into single bits, which requires 8 bit-DFAs • Extra benefits • The number of states (per DFA) drops from ~27,000 to ~300. • The next-state pointer width drops from 15 to 9 bits. • Memory is reduced from 22 MB to 2 MB • How many DFAs? With 171 subsets of 8 DFAs each, the total is 171 × 8 = 1,368 DFAs • What can we do better to reduce the memory usage?
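The bit-split principle can be demonstrated without the DFA machinery: match each keyword's bit-j stream against the input's bit-j stream (the job of one bit-DFA), then keep only the offsets on which all 8 bit positions agree. A brute-force sketch of the idea, not Tan-Sherwood's actual construction:

```python
def bit_stream(data, j):
    """Extract bit j of every byte."""
    return [(byte >> j) & 1 for byte in data]

def partial_matches(text, keyword, j):
    """Offsets where the keyword's bit-j stream occurs in the text's
    bit-j stream -- what a single bit-DFA would report."""
    t, p = bit_stream(text, j), bit_stream(keyword, j)
    return {i for i in range(len(t) - len(p) + 1) if t[i:i + len(p)] == p}

text = b"ushers"
# A single bit position yields false positives ...
print(partial_matches(text, b"he", 0))    # {2, 4}: offset 4 is spurious
# ... but an offset surviving all 8 bit positions is a true match:
full = set.intersection(*(partial_matches(text, b"he", j) for j in range(8)))
print(full)                               # {2}
```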

  9. Classical AC DFA: example 2 • 28 states • Failure edges are not shown.

  10. Byte-AC DFA • Considers 4 bytes at a time • 4 DFAs • < 9 states / DFA • 256 next-state pointers! • Similar to Dharmapurikar-Lockwood's JACK DFA (ANCS '05)

  11. Bit-Byte-AC DFA • 4 bytes at a time • Each byte is divided into bits. • 32 DFAs (= 4 × 8) • < 6 states/DFA • 2 next-state pointers

  12. Memory Matrix of the Bit-Byte-AC DFA • Snort (Dec '05): 2733 keywords • 4 bytes at a time • < 36 states/DFA • 2 next-state pointers, each 6 bits wide • keyword-ID width = 3 bits • 29152 DFAs (= 911 × 32) • 29152 × 36 × (3 + 2 × 6) bits ≈ 1.9 MB • 1.9 MB is only a little better than 2 MB, because • this is not an optimal setting, and • each DFA has a different number of states, so there is no need to provide the same memory-matrix size for every DFA.
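The estimates on slides 5, 7, and 12 all instantiate one pattern, so they can be checked with a single hedged model (a uniform worst-case matrix per DFA, as these slides assume; the function and variable names are mine):

```python
import math

def total_memory_bits(K, k, b, B, states_per_dfa):
    """Uniform memory-matrix model from the slides: K keywords are split
    into ceil(K/k) subsets; each subset uses B * (8 // b) bit-byte DFAs;
    each DFA stores, per state, a k-bit keyword-ID plus 2**b next-state
    pointers of ceil(log2(states_per_dfa)) bits each."""
    subsets = math.ceil(K / k)
    dfas = subsets * B * (8 // b)
    ptr_bits = math.ceil(math.log2(states_per_dfa))
    return dfas * states_per_dfa * (k + 2 ** b * ptr_bits)

MB = 8 * 2 ** 20
print(total_memory_bits(2733, 2733, 8, 1, 27538) / MB)  # slide 5:  ~21.6 -> "22 MB"
print(total_memory_bits(2733, 16, 1, 1, 361) / MB)      # slide 7:  ~2.0 MB
print(total_memory_bits(2733, 3, 1, 4, 36) / MB)        # slide 12: ~1.9 MB
```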

  13. Bit-Byte-AC DFA Techniques • Still keeps the keyword-ID width as low as the Bit-DFA's. • Still keeps the number of next-state pointers as small as the Bit-DFA's. • Reduces the states per DFA by • skipping bytes • exploiting more shared states than the Bit-DFA • Results of reducing the states per DFA • from ~27,000 to 36 states • The next-state pointer width drops from 15 to 6 bits.

  14. Construction of the Bit-Byte AC DFA • [Figure: bit 3 of byte 0; 4 bytes considered at a time.]

  15.–22. Construction of the Bit-Byte AC DFA • [Animation frames of the same construction; 4 bytes considered at a time.]

  23. Construction of the Bit-Byte AC DFA • Failure edges are not shown.

  24. Construction of the Bit-Byte AC DFA

  25. Construction of the Bit-Byte AC DFA • In total, 32 bit-byte DFAs need to be constructed.

  26.–29. Bit-Byte-DFA: Searching • [Animation frames; failure edges are shown where necessary.]

  30. Bit-Byte-DFA: Searching • Match ⇒ keyword 'memory' • The match is declared only when all 32 bit-DFAs find the keyword on their own!
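How that final verdict is assembled can be pictured with keyword-ID bit vectors: each bit-DFA reports one bit per keyword in the subset, and the verdict is the bitwise AND of all reports (the vectors below are hypothetical values for illustration):

```python
# Per-DFA partial-match reports at the current offset: bit i is set if
# keyword i of the subset is still a candidate match in that bit-DFA.
reports = [0b1101, 0b1111, 0b0111, 0b0101]   # 4 of the 32 bit-DFAs shown

match = reports[0]
for r in reports[1:]:
    match &= r   # a keyword survives only if every DFA reports it
print(bin(match))   # 0b101 -> only keywords 0 and 2 match everywhere
```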

  31. Find the optimal settings to minimize memory • k = keywords per subset • keyword-ID width = k bits • k = 1, 2, 3, …, K, where K = the number of keywords in the whole set • Snort (Dec 2005): K = 2733 keywords • b = bit(s) extracted from each byte • b = 1, 2, 4, 8 • number of next-state pointers = 2^b • example 2 uses b = 1 • beyond b = 8, there would be > 256 next-state pointers • B = bytes considered at a time • B = 1, 2, 3, … • example 2 uses B = 4 • The total memory T is a function of k, b, and B: T = f(k, b, B)

  32. T's Formula • Total memory of all bit-DFAs in all subsets: T = ⌈K/k⌉ × (8B/b) × S × (k + 2^b × p) bits, where S is the number of states per DFA and p = ⌈log2 S⌉ is the next-state pointer width in bits.

  33. Find the optimal k • Each pair of (b, B) has one optimal k that minimizes T. • [Chart: T versus keywords per subset k; T_min at k = 12.]

  34. Find the optimal b • Each setting of k, b, and B has a different optimal point. • Comparing only the optimal settings, b = 2 is the best. • [Chart: T versus keywords per subset.]

  35. Find the optimal B • With b = 2, T decreases non-linearly as B increases. • Beyond B = 16, T begins to increase. • B = 16 is the best for Snort (Dec '05). • [Chart: T versus keywords per subset.]

  36. Comparing with Existing Works • Tan-Sherwood's, Brodie-Cytron-Taylor's, and ours • Our Bit-Byte DFA at B = 16, with the optimal point b = 2 and k = 12: 272 KB • 14% of 2001 KB (Tan's) • 4% of 6064 KB (Brodie's)

  37. Comparing with Existing Works • Tan-Sherwood's and ours, at B = 1 • Tan's (on an ASIC): 2001 KB • k = 16 is not the optimal setting for B = 1 • each bit-DFA uses the same storage capacity, sized to fit the largest one (the worst case) • Ours (on an NP): 396 KB < 2001 KB • k = 3 is the optimal setting for B = 1 • each bit-DFA uses exactly the memory space needed to hold it

  38. Results with an NP Simulator • NePSim2: an open-source IXP24xx/28xx simulator • NP architecture based on the IXP2855: 16 MicroEngines (MEs), 512 KB of memory, 1.4 GHz • Bit-Byte AC DFA with b = 2, B = 16, k = 12: T = 272 KB, 5 Gbps throughput

  39. Conclusion • The Bit-Byte DFA model can reduce memory usage by up to 86%. • Implementing on an NP uses on-chip memory more efficiently, without wasting space, compared to an ASIC. • An NP has the flexibility to accommodate • the optimal setting of k, b, and B • different sizes of Bit-Byte DFA • new rule sets in the future (the optimal setting may change) • The performance (measured with an NP simulator) satisfies line speed up to 5 Gbps throughput.

  40. Thank you • Questions? • Piti_Piyachon@student.uml.edu • Yan_Luo@uml.edu
