Automating Signature Generation for Polymorphic Worms: The Polygraph Approach

Polygraph: Automatically Generating Signatures for Polymorphic Worms James Newsome*, Brad Karp*†, and Dawn Song* *Carnegie Mellon University †Intel Research Pittsburgh

Internet Worms • Definition: Malicious code that propagates by exploiting software • No human interaction needed • Able to spread very quickly • Slammer scanned 90% of Internet in 10 minutes

Proposed Defense Strategy [Figure: Worm Detected!] • Honeycomb [Kreibich2003] • Autograph [Kim2004] • Earlybird [Singh2004]

Challenge: Polymorphic Worms • Polymorphic worms minimize invariant content • Encrypted payload • Obfuscated decryption routine • Polymorphic tools are already available • Clet, ADMmutate • Do good signatures for polymorphic worms exist? • Can we generate them automatically?

Good News: Still some invariant content [Figure: exploit request layout. Request labels: GET URL HTTP/1.1; Random Headers; Host: Payload Part 1; Random Headers; Host: Payload Part 2; Random Headers. Payload labels: NOP slide; Decryption Routine; Decryption Key; Encrypted Payload; \xff\xbf] • Protocol framing • Needed to make server go down vulnerable code path • Overwritten Return Address • Needed to redirect execution to worm code • Decryption routine • Needed to decrypt main payload • BUT, code obfuscation can eliminate patterns here

Bad News: Previous Approaches Insufficient [Figure: exploit request layout. Request labels: GET URL HTTP/1.1; Random Headers; Host: Payload Part 1; Random Headers; Host: Payload Part 2; Random Headers. Payload labels: NOP slide; Decryption Routine; Decryption Key; Encrypted Payload; \xff\xbf] • Previous approaches use a common substring • Longest substring • “HTTP/1.1” • 93% false positive rate • Most specific substring • “\xff\xbf” • .008% false positive rate (10 / 125,301)

What to do? • No one substring is specific enough • BUT, there are multiple substrings • Protocol framing • Value used to overwrite return address • (Parts of poorly obfuscated code) • Our approach: combine the substrings

Outline • Substring-based signatures insufficient • Generating signatures • Perfect (noiseless) classifier case • Signature classes & algorithms • Evaluation • Imperfect classifier case • Clustering extensions • Evaluation • Attacking the system • Conclusion

Goals • Identify classes of signatures that can: • Accurately describe polymorphic worms • Be used to filter a high speed network line • Be generated automatically and efficiently • Design and implement a system to automatically generate signatures of these classes

Polygraph Architecture Suspicious Flow Pool Network Tap Signature Generator Flow Classifier Worm Signatures Innocuous Flow Pool

Signature Class (I): Conjunction [Figure: exploit request layout. Request labels: GET URL HTTP/1.1; Random Headers; Host: Payload Part 1; Random Headers; Host: Payload Part 2; Random Headers. Payload labels: NOP slide; Decryption Routine; Decryption Key; Encrypted Payload; \xff\xbf] • Signature is a set of strings (tokens) • Flow matches signature iff it contains all tokens in the signature • O(n) time to match (n is flow length) • Generated signature: • “GET” and “HTTP/1.1” and “\r\nHost:” and “\r\nHost:” and “\xff\xbf” • .0024% false positive rate (3 / 125,301)
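The matching rule for a conjunction signature can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `matches_conjunction` is hypothetical, a token listed twice (as “\r\nHost:” is on the slide) is assumed to require two non-overlapping occurrences, and the loop scans the flow once per token rather than using the linear-time matcher the slide refers to.

```python
from collections import Counter

# Tokens taken from the generated signature on the slide.
SIGNATURE = [b"GET", b"HTTP/1.1", b"\r\nHost:", b"\r\nHost:", b"\xff\xbf"]


def matches_conjunction(flow: bytes, tokens: list[bytes]) -> bool:
    """Return True iff the flow contains every token, in any order.

    Assumption: a token that appears m times in the signature must occur
    at least m times (non-overlapping) in the flow.
    """
    required = Counter(tokens)
    return all(flow.count(tok) >= n for tok, n in required.items())


worm_like = b"GET /x HTTP/1.1\r\nHost: a\r\nX: y\r\nHost: b\xff\xbf\r\n\r\n"
benign = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
print(matches_conjunction(worm_like, SIGNATURE))
print(matches_conjunction(benign, SIGNATURE))
```

The benign request fails on two counts: it has only one Host header and no `\xff\xbf` bytes.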

Generating Conjunction Signatures • Use suffix tree to find set of tokens that: • Occur in every sample of suspicious pool • Are at least 2 bytes long • Generation time is linear in total byte size of suspicious pool • Based on a well-known string processing algorithm [Hui1992]

Signature Class (II): Token Subsequence [Figure: exploit request layout. Request labels: GET URL HTTP/1.1; Random Headers; Host: Payload Part 1; Random Headers; Host: Payload Part 2; Random Headers. Payload labels: NOP slide; Decryption Routine; Decryption Key; Encrypted Payload; \xff\xbf] • Signature is an ordered set of tokens • Flow matches iff it contains all the tokens in signature, in the given order • O(n) time to match (n is flow length) • Generated signature: • GET.*HTTP/1.1.*\r\nHost:.*\r\nHost:.*\xff\xbf • .0008% false positive rate (1 / 125,301)
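Ordered matching can be sketched as a single left-to-right scan. The function name `matches_subsequence` is hypothetical; the sketch assumes tokens must appear in order and without overlapping, which is what the `.*`-separated form of the signature expresses.

```python
# Ordered tokens taken from the generated signature on the slide.
TOKENS = [b"GET", b"HTTP/1.1", b"\r\nHost:", b"\r\nHost:", b"\xff\xbf"]


def matches_subsequence(flow: bytes, tokens: list[bytes]) -> bool:
    """Return True iff the tokens occur in the flow in the given order."""
    pos = 0
    for tok in tokens:
        idx = flow.find(tok, pos)  # leftmost occurrence at or after pos
        if idx < 0:
            return False
        pos = idx + len(tok)  # the next token must start after this one
    return True


in_order = b"GET /x HTTP/1.1\r\nHost: a\r\nHost: b\xff\xbf\r\n"
out_of_order = b"GET /\xff\xbf HTTP/1.1\r\nHost: a\r\nHost: b\r\n"
print(matches_subsequence(in_order, TOKENS))
print(matches_subsequence(out_of_order, TOKENS))
```

Taking the leftmost occurrence of each token is safe here: matching a token as early as possible never rules out a later match for the remaining tokens.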

Generating Token Subsequence Signatures • Use dynamic programming to find longest common token subsequence (lcseq) between 2 samples in O(n^2) time • [SmithWaterman1981] • Find lcseq of first two samples • Iteratively find lcseq of intermediate result and next sample
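The iterative scheme can be illustrated with a plain longest-common-subsequence dynamic program folded over the samples. This is a simplified stand-in, not the method cited on the slide: it works on individual bytes, has no gap penalties, and does not recover token boundaries. The names `lcseq` and `common_subsequence` are hypothetical.

```python
from functools import reduce


def lcseq(a: bytes, b: bytes) -> bytes:
    """Longest common subsequence of two byte strings, O(len(a) * len(b))."""
    m, n = len(a), len(b)
    # table[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to rebuild one optimal answer.
    out = bytearray()
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return bytes(out)


def common_subsequence(samples: list[bytes]) -> bytes:
    """Fold lcseq over the samples: lcseq(lcseq(s0, s1), s2), and so on."""
    return reduce(lcseq, samples)


print(common_subsequence([b"GET /a HTTP", b"GET /b HTTP", b"GET /c HTTP"]))
```

Folding pairwise is a heuristic: the result is always a common subsequence of all samples, but it is not guaranteed to be the longest one.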

Experiment: Signature Generation • How many worm samples do we need? • Too few samples → signature is too specific → false negatives • Experimental setup • Using a 25 day port 80 trace from lab perimeter • Innocuous pool: First 5 days (45,111 streams) • Suspicious Pool: • Using Apache exploit described earlier • Non-invariant portions filled with random bytes • Signature evaluation: • False positives: Last 10 days (125,301 streams) • False negatives: 1000 generated worm samples

Signature Generation Results • GET .* HTTP/1.1\r\n.*\r\nHost: .*\xee\xb7.*\xb2\x1e.*\r\nHost: .*\xef\xa3.*\x8b\xf4.*\x89\x8b.*E\xeb.*\xff\xbf • GET .* HTTP/1.1\r\n.*\r\nHost: .*\r\nHost:.*\xff\xbf

Also Works for Binary Protocols • Created polymorphic version of BIND TSIG exploit used by Li0n Worm • Single substring signatures: • 2 bytes of Ret Address: .001% false positives • 3 byte TSIG marker: .067% false positives • Conjunction: 0% false positives • Subsequence: 0% false positives • Evaluated using a 1 million request trace from a DNS server that serves a major university and several ccTLDs

Noise in Suspicious Flow Pool • What if classifier has false positives? • 3 worm samples: • GET .* HTTP/1.1\r\n.*\r\nHost: .*\r\nHost:.*\xff\xbf • 3 worm samples + 1 legit GET request: • GET .* HTTP/1.1\r\n.*\r\nHost: • 3 worm samples + a non-HTTP request: • .*

Our Approach: Hierarchical Clustering • Used for multiple sequence alignment in Bioinformatics [Gusfield1997] • Initialization: • Each sample is a cluster • Each cluster has a signature matching all samples in that cluster • Greedily merge clusters • Minimize false positive rate, using innocuous pool • Stop when any further merging results in significant false positives • Output the signature of each final cluster of sufficient size
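The greedy merge loop above can be sketched as follows. This is a toy illustration under several assumptions: cluster signatures are conjunctions of maximal common substrings found by brute force, `max_fp` and `min_size` are hypothetical parameters standing in for "significant false positives" and "sufficient size", and every helper name is invented for the example.

```python
from itertools import combinations


def common_tokens(samples: list[bytes], min_len: int = 2) -> set[bytes]:
    """Maximal substrings of samples[0] that occur in every sample (brute force)."""
    first, found = samples[0], set()
    for i in range(len(first)):
        j, best = i + min_len, None
        while j <= len(first) and all(first[i:j] in s for s in samples[1:]):
            best = first[i:j]
            j += 1
        if best is not None:
            found.add(best)
    # Keep only tokens that are not contained in a longer token.
    return {t for t in found if not any(t != u and t in u for u in found)}


def fp_rate(tokens: set[bytes], innocuous: list[bytes]) -> float:
    """Fraction of innocuous flows that contain every token."""
    if not tokens:
        return 1.0  # an empty signature matches everything
    return sum(all(t in f for t in tokens) for f in innocuous) / len(innocuous)


def cluster(suspicious, innocuous, max_fp=0.0, min_size=2):
    clusters = [[s] for s in suspicious]  # each sample starts as its own cluster
    while len(clusters) > 1:
        # Find the pair whose merged signature has the lowest false positive rate.
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            merged = clusters[i] + clusters[j]
            rate = fp_rate(common_tokens(merged), innocuous)
            if best is None or rate < best[0]:
                best = (rate, i, j, merged)
        if best[0] > max_fp:
            break  # any further merge would cause too many false positives
        _, i, j, merged = best
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [common_tokens(c) for c in clusters if len(c) >= min_size]


worms = [b"GET /a HTTP/1.1\r\nHost: q\xff\xbfzz",
         b"GET /bcd HTTP/1.1\r\nHost: rr\xff\xbf",
         b"GET /e HTTP/1.1\r\nHost: s!\xff\xbf"]
noise = [b"GET /index.html HTTP/1.1\r\nHost: example\r\n"]
innocuous_pool = [b"GET / HTTP/1.1\r\nHost: a\r\n",
                  b"GET /img.png HTTP/1.1\r\nHost: b\r\n"]
signatures = cluster(worms + noise, innocuous_pool)
print(signatures)
```

In this toy run the three worm samples merge into one cluster, while the noise sample stays alone because merging it would leave only protocol framing tokens, which match the whole innocuous pool.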

Hierarchical Clustering [Figure: Worm Sample 1, Innoc Sample 1, Worm Sample 2, Innoc Sample 2, Worm Sample 3, with a merge candidate highlighted] • Merge candidate common substrings: HTTP/1.1, GET, … • High false positive rate!

Hierarchical Clustering [Figure: Worm Sample 1, Innoc Sample 1, Worm Sample 2, Innoc Sample 2, Worm Sample 3, with a merge candidate highlighted] • Merge candidate common substrings: HTTP/1.1, GET, \xff\xbf, \xde\xad • Low false positive rate (but high false negative rate)

Hierarchical Clustering [Figure: Worm Sample 1, Innoc Sample 1, Worm Sample 2, Innoc Sample 2, Worm Sample 3, with two clusters marked] • Cluster: HTTP/1.1, GET, \xff\xbf, \xde\xad • Cluster: HTTP/1.1, GET, \xff\xbf

Clustering Evaluation (with noise) • Suspicious pool consists of: • 5 polymorphic worm samples • Varying number of noise samples • Noise samples chosen uniformly at random from evaluation trace • Clustering uses innocuous pool to estimate false positive rate

Clustering Results

Overtraining Attacks • Conjunction and Subsequence can be tricked into overtraining • Red herring attack • Include extra fixed tokens • Remove them over time • Result: Have to keep generating new signatures • Coincidental pattern attack • Create ‘coincidental’ patterns given a small set of worm samples • Result: more samples needed to generate a low-false-negative signature (50+)

Solution: Threshold matching • Signature classifies as worm if enough tokens are present • Implementation: Bayes Signatures • Assign each token a score based on Bayes' law • Choose highest acceptable false positive rate • Choose threshold that gets at most that rate in innocuous training pool • Properties: • Signatures generated and matched in linear time • Not susceptible to overtraining attacks • Don’t need clustering • You get the false positive rate you specify • Currently does not use ordering

Remaining False Positives • Conjunction signature has 3 false positives • 1 of these also matched by subsequence signature • What is causing these? • Would it be so bad if 3 legitimate requests were filtered out every 10 days?

The Offending Request GET /Download/GetPaper.php?paperId=XXX HTTP/1.1 … Host: nsdi05.cs.washington.edu\r\n … POST /Author/UploadPaper.php HTTP/1.1\r\n … Host: nsdi05.cs.washington.edu\r\n … <binary data containing \xff\xbf>

Possible Fixes • Use protocol knowledge • Match on request level instead of TCP flow level • Require \xff\xbf be part of Host header • Disadvantage: need protocol knowledge • Use distance between tokens • Makes signatures more specific • Disadvantage: risks more overtraining attacks

Future Work • Defending against overtraining • Further reducing false positives • Could be reduced by learning more features (such as offsets) • But this increases risk of overtraining • Promising solution: semantic analysis • Automatically analyze how worm exploit works • Only use features that must be present • First steps in Newsome05 (NDSS) • Currently extending this work (Brumley-Newsome-Song)

Conclusions • Key observation: Content variability is limited by nature of the software vulnerability • Have shown that: • Accurate signatures can be automatically generated for polymorphic worms • Demonstrated low false positives with real exploits, on real traffic traces

Thanks! • Questions? • Contact: jnewsome@ece.cmu.edu

Coincidental Pattern Attack • Conjunction & Subsequence may overtrain • Coincidental pattern attack: • For non-invariant bytes, choose ‘a’ or ‘b’ • Result: • Suspicious pool has many substrings in common of form: ‘aabba’, ‘babba’… • Unseen worm samples will have many of these substrings, but not every one

Results with “Coincidental Pattern Attack” [Figure: false negatives vs. suspicious pool size]

Results: Multiple Worms + Noise

The Innocuous Pool • Used to determine: • How often tokens appear in legit traffic • Estimated signature false positive rates • Goals: • Representative of current traffic • Does not contain worm flows • Can be generated by: • Taking a relatively old trace • Filtering out known worms and exploits

Key Algorithm: Token Extraction • Need to identify useful tokens • Substrings that occur in worm samples • Problem: Find all substrings that: • Occur in at least k out of n samples • Are at least x bytes long • Can be solved in time linear in total length of samples using a suffix tree
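The problem statement above can be made concrete with a brute-force sketch. It is far slower than the linear-time suffix tree approach named on the slide and is meant only to show what is being computed. The name `extract_tokens` is hypothetical, and so is the rule of dropping a token when a longer token with the same sample count contains it.

```python
def extract_tokens(samples: list[bytes], k: int, min_len: int = 2) -> dict[bytes, int]:
    """Map each token to the number of samples containing it.

    A token is a substring of at least min_len bytes that occurs in at
    least k samples and cannot be extended to the right without dropping
    below k.
    """
    def support(sub: bytes) -> int:
        return sum(sub in s for s in samples)

    found: dict[bytes, int] = {}
    for sample in samples:
        for i in range(len(sample) - min_len + 1):
            j, best = i + min_len, None
            # Extending a substring can only lower its support, so stop at
            # the first extension that falls below k.
            while j <= len(sample) and support(sample[i:j]) >= k:
                best = sample[i:j]
                j += 1
            if best is not None:
                found[best] = support(best)
    # Drop tokens that add nothing beyond a longer token with equal support.
    return {t: c for t, c in found.items()
            if not any(t != u and t in u and found[u] == c for u in found)}


pool = [b"GET /a HTTP/1.1\r\nHost: q\xff\xbfzz",
        b"GET /bcd HTTP/1.1\r\nHost: rr\xff\xbf",
        b"GET /e HTTP/1.1\r\nHost: s!\xff\xbf",
        b"GET /index.html HTTP/1.1\r\nHost: example\r\n"]
tokens = extract_tokens(pool, k=3)
print(tokens)
```

With `k` equal to the number of samples, this reduces to finding tokens present in every sample, as needed for conjunction signatures.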

Signature Class (III): Bayes • Use a Bayes classifier • Presence of a token is a feature • Hence, each token has a score: • Generated signature: • (‘GET’: .0035, ‘Host:’: .0022, ‘HTTP/1.1’: .11, ‘\xff\xbf’: 3.15) Threshold=1.99 • .008% false positive rate (10 / 125,301)

Generating Bayes Signatures • Use suffix tree to find tokens that occur in a significant number of samples • Determine probabilities: • Pr(worm) = Pr(~worm) = .5 • Pr(substring|worm): use suspicious pool • Pr(substring|~worm): use innocuous pool • Set a “certainty threshold” c • Signature matches a flow if the Bayes formula identifies it as more than c% likely to be a worm • Choose c that results in few (< 5) false positives in innocuous pool
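One way to picture the scoring and threshold steps is a naive Bayes log-likelihood ratio. This is a sketch under assumptions, not the slide's exact formula: with equal priors the prior terms cancel, each token's score is taken as log(Pr(token|worm) / Pr(token|~worm)), add-one smoothing avoids division by zero, and a score threshold plays the role of the certainty threshold c. All function names are hypothetical.

```python
import math


def token_scores(tokens, suspicious, innocuous):
    """Score each token by its smoothed log-likelihood ratio."""
    scores = {}
    for tok in tokens:
        p_worm = (sum(tok in f for f in suspicious) + 1) / (len(suspicious) + 2)
        p_legit = (sum(tok in f for f in innocuous) + 1) / (len(innocuous) + 2)
        scores[tok] = math.log(p_worm / p_legit)
    return scores


def flow_score(flow, scores):
    """Sum the scores of the tokens present in the flow."""
    return sum(s for tok, s in scores.items() if tok in flow)


def choose_threshold(scores, innocuous, max_false_positives=0):
    """Pick a threshold that at most max_false_positives innocuous flows exceed."""
    ranked = sorted((flow_score(f, scores) for f in innocuous), reverse=True)
    if max_false_positives >= len(ranked):
        return float("-inf")
    return ranked[max_false_positives]


def is_worm(flow, scores, threshold):
    return flow_score(flow, scores) > threshold


suspicious_pool = [b"GET /a HTTP/1.1\r\nHost: q\xff\xbf",
                   b"GET /b HTTP/1.1\r\nHost: r\xff\xbf",
                   b"GET /c HTTP/1.1\r\nHost: s\xff\xbf"]
innocuous_pool = [b"GET / HTTP/1.1\r\nHost: a\r\n",
                  b"GET /img.png HTTP/1.1\r\nHost: b\r\n"]
scores = token_scores([b"GET", b"HTTP/1.1", b"\xff\xbf"],
                      suspicious_pool, innocuous_pool)
threshold = choose_threshold(scores, innocuous_pool)
print(is_worm(suspicious_pool[0], scores, threshold))
print(is_worm(innocuous_pool[0], scores, threshold))
```

Tokens that are common in legitimate traffic, such as `GET`, end up with scores near zero, so the decision rests almost entirely on `\xff\xbf`.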

Innocuous Pool Poisoning • Before releasing worm: • Determine what signature of worm is • Flood Internet with innocuous requests that match • Eventually included in innocuous training pool • Release worm • Polygraph will: • Generate signature for worm • See that it causes many false positives in innocuous pool • Reject signature • Solution: • Use a relatively old trace for innocuous pool • Drawback: Hierarchical clustering generates more spurious signatures
